3.3 Data review and appraisal

The aim of this step is to ensure that the submitted data

  • meets the archive's data collection and acquisition criteria,
  • meets the archive's quality requirements.

The actual process and steps taken, as well as the extent of checks and the level of curation, may differ depending on the collection to which the data have been submitted, the data type, and the tools used in the Pre-ingest phase.

The data review and appraisal process should result in a decision to accept or reject the data, and in the delivery of an information package to Ingest, including the information that Ingest staff need.

3.3.1 Compliance with acquisition criteria

Criteria for including data in the collection should be set out in the Acquisition policy. Criteria might include data that

  • has a high potential for reuse in research and teaching,
  • is important for validation of research,
  • supports a scientific article,
  • has high quality (like good spatial and temporal coverage, comparability, reliable with good provenance),
  • meets strategic needs of a designated community (like new data types or sources, longitudinal data, time-series data, specific research topics like COVID-19 data, migration data),
  • has scientific and/or historical value,
  • is unique or at risk of loss.

If data falls outside the scope of the archive's collection, the material can still be reviewed to see if it might suit an institutional repository, a generalist repository or another domain repository.

How we do it: UK Data Service Collection Development Appraisal Grid (UK Data Service 2022)

3.3.2 Control of the submitted material

[Figure: Data review illustration]

In addition to the decisions based on the value of data, data deposit is controlled and reviewed for issues important for preservation and reusability.

Control of the depositor ID

The depositor's identity can be verified automatically through identity federation, by asking the data depositor to log in with academic institution credentials when submitting data.

Completeness control

This check ensures that the data and documentation listed in the data description are included in the delivery.

For example, if a study includes data from three surveys, there should be three data files (or one file with a matching number of cases), documentation of all three surveys, and three questionnaires. This may require some manual inspection of files.
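A completeness check like the one above can be partly automated. The sketch below assumes a hypothetical deposit manifest (`expected_files`) derived from the data description; the structure and file names are invented for illustration, not a fixed archive format:

```python
# Sketch of a completeness check against a hypothetical deposit manifest.
# The `description` structure and file names are illustrative assumptions.

def check_completeness(description, delivered_files):
    """Return items named in the data description but missing from the delivery."""
    missing = []
    for expected in description["expected_files"]:
        if expected not in delivered_files:
            missing.append(expected)
    return missing

description = {
    "expected_files": [
        "survey1.sav", "survey2.sav", "survey3.sav",  # three data files
        "questionnaire1.pdf", "questionnaire2.pdf", "questionnaire3.pdf",
        "technical_report.pdf",
    ]
}
delivered = ["survey1.sav", "survey2.sav", "questionnaire1.pdf",
             "questionnaire2.pdf", "questionnaire3.pdf", "technical_report.pdf"]

print(check_completeness(description, delivered))  # e.g. one data file missing
```

Anything the check reports as missing still needs manual follow-up with the depositor, since the description itself may be incomplete.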

Virus and readability control

This check verifies that the data (and any command files, if delivered) are free of viruses and readable (the files are not corrupt and are in accepted data formats). Virus scanning can be done automatically. Formats can be checked automatically as well, but readability control will most probably involve manual work and visual inspection of file contents.
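The automatic part of the format check can be sketched as a simple comparison against a list of accepted formats. The list below is an example only; each archive defines its own accepted formats in its preservation policy, and a real check would also inspect file signatures, not just extensions:

```python
import pathlib

# Example accepted-format list; substitute your archive's own policy.
ACCEPTED_FORMATS = {".sav", ".por", ".csv", ".tsv", ".txt", ".pdf"}

def check_formats(paths):
    """Split delivered files into accepted formats and files flagged for review."""
    accepted, flagged = [], []
    for p in map(pathlib.Path, paths):
        (accepted if p.suffix.lower() in ACCEPTED_FORMATS else flagged).append(p.name)
    return accepted, flagged

ok, review = check_formats(["study.sav", "notes.docx", "codebook.pdf"])
print(review)  # files needing conversion or manual inspection
```

Files that end up in the flagged list would then go through manual inspection or be returned to the depositor with a request for an accepted format.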

Data description control

This step should ensure that the metadata is complete according to the data archive's metadata profile, and that there is enough context information to understand the data.

For data to be included in the CESSDA Data Catalogue (CESSDA n.d. [Accessed July 30, 2022a]), the CESSDA Metadata Model (Akdeniz, Esra et al. 2021) identifies mandatory and optional metadata fields, so it is a good idea to check whether the CMM mandatory metadata elements are included in the data description.
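A check of the data description against a metadata profile can be sketched as below. The field names are an illustrative subset, not the authoritative CMM list; substitute the mandatory elements from your own archive's metadata profile:

```python
# Illustrative subset of mandatory fields -- not the authoritative CMM list.
MANDATORY_FIELDS = ["title", "creator", "persistent_identifier",
                    "abstract", "distributor", "country"]

def missing_mandatory(metadata, mandatory=MANDATORY_FIELDS):
    """Return mandatory fields that are absent or empty in a metadata record."""
    return [f for f in mandatory if not metadata.get(f)]

record = {"title": "Health Survey 2021", "creator": "Example University",
          "abstract": "", "country": "NO"}
print(missing_mandatory(record))  # empty fields count as missing
```

Listing the gaps explicitly makes it easier to send the depositor one consolidated request rather than several rounds of questions.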

Ethical and legal aspects

The pre-ingest review should ensure that the ethical and legal aspects that are binding to the data archive are followed.

If the data archive does not accept personal data, in this step of review:

  • data should be checked for direct personal identifiers (names, addresses, phone numbers, IP addresses, personal identification numbers, etc.);
  • researchers should be asked whether the data are completely anonymised or whether there is a ‘code key’ (a ‘key’ matching research participants’ identities with code numbers in the data file);
  • risks of disclosure created by detailed background information in the data should be evaluated: can research participants still be traced, even if the direct identifiers are removed?
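The check for direct identifiers can be partly automated for free-text material. A minimal sketch follows; the regular expressions are deliberately simple, will both miss identifiers and produce false positives, and should only be used to flag candidates for manual review:

```python
import re

# Simplified patterns for common direct identifiers -- illustration only.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ip_address": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "phone": re.compile(r"\b(?:\+?\d[\d\s-]{7,}\d)\b"),
}

def scan_for_identifiers(text):
    """Return a dict mapping pattern name to matched candidate strings."""
    hits = {}
    for name, pattern in PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits

answer = "Contact me at jane.doe@example.org or on 040 123 4567."
print(scan_for_identifiers(answer))
```

A scan like this cannot replace human review: indirect identifiers (a named workplace, a rare combination of background characteristics) require the kind of disclosure-risk evaluation described in the last bullet above.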

For quantitative data, common transformation techniques to minimise the risks include aggregating detailed response categories and geographical variables; creating intervals instead of precise values (for age, income); and coding free-text open-ended answers (for occupation etc.). For qualitative data, place, time and other specific references in the background data can be removed manually.
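One of the quantitative techniques above, replacing an exact value with an interval, can be sketched in a few lines. The 10-year interval width is an example choice; the appropriate width depends on the disclosure-risk assessment for the specific dataset:

```python
# Sketch of disclosure-risk reduction: replace an exact age with an interval.
# The interval width is an illustrative choice, not a recommendation.

def age_to_interval(age, width=10):
    """Map an exact age to a coarser interval label, e.g. 34 -> '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

ages = [18, 34, 67]
print([age_to_interval(a) for a in ages])
```

The same binning idea applies to income and other precise numeric values; top-coding the highest interval (e.g. "80+") is often added for sparsely populated extremes.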

Sensitive personal data (racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic data, biometric data, data concerning health, or data concerning a natural person's sex life or sexual orientation) require additional protection and special treatment in processing and access. If the data archive does not have the capacity to accept sensitive data, it might be reasonable to consider other options, for example, accepting a partial data set or collaborating with another repository.

3.3.3 Review of documentation

After the initial controls, Pre-ingest should ensure that there is enough metadata in the data file and/or accompanying documentation to understand the contents of the data files and to facilitate reuse of the data.

In this step, the review should go beyond a formal check of whether the information is present; the main focus should be on ensuring there is enough context information to reuse the data.

What information should be included and how it is usually structured (in the data file, in a separate file, in technical documentation) is specific to the data collection method and discipline. It might differ for a qualitative study based on video material and a quantitative survey resulting in an SPSS data file.

  • For a survey, a questionnaire and/or a codebook and a technical report on fieldwork should accompany the data files. For a survey data file, archive staff in Pre-ingest might check whether all variables named in the documentation are present in the file, and whether all variables in the file are properly documented with variable labels and response category labels.
  • For a relatively recent research project, it might be rather easy to obtain missing information, but for studies going further back in time and involving different researcher groups it might take time and effort to obtain the information needed to understand the data.
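The variable documentation check described for survey files can be sketched as below. It assumes the variable names and labels have already been extracted from the data file (for example with a statistical package, or a library such as pyreadstat for SPSS files); the variable names and labels are invented for illustration:

```python
# Sketch of a variable documentation check for a survey data file.
# Assumes variable names/labels were already extracted from the file;
# the example data below is invented for illustration.

def undocumented_variables(variables_in_file, variable_labels):
    """Return variables present in the data file but lacking a label."""
    return [v for v in variables_in_file if not variable_labels.get(v)]

variables = ["caseid", "q1_income", "q2_health", "weight"]
labels = {"caseid": "Case identifier",
          "q1_income": "Household income, gross, last year",
          "q2_health": ""}  # an empty label counts as undocumented
print(undocumented_variables(variables, labels))
```

The complementary check, variables mentioned in the questionnaire or codebook but absent from the file, is the same comparison run in the other direction.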

It is difficult to estimate reasonable timing for finishing the first review, as it varies depending on the scope and complexity of material, type of data as well as resources available.

3.3.4 Communication with researchers

It may be good to ensure that communication with researchers can be tracked, so that a colleague can take over in case of absence, or so that it is easy to resume communication after a longer break (for example, summer holidays or other commitments). It might also be of interest for data curators working in Ingest to be able to review previous communication from Pre-ingest.

If communication goes mainly through a curator's e-mail address, or an e-mail address accessible to all data curators working with Pre-ingest/Ingest, it is important at some point to tag the communication thread with the pre-ingest ID, or to copy the most important parts of the communication and decisions into the system or tool you use for Pre-ingest. Issue management systems like Jira can be used as well. If the data archive uses an in-house developed web-based system for data deposit, it may include a communication module as well.

If issues with the data or metadata are discovered, data producers should be asked to supply the missing information and/or be presented with suggestions for acceptable solutions.

The time for finishing the first review may vary. If it takes longer than 2-3 days to review the deposit, it is good to send a short message to the data depositor indicating the status of the deposit and describing the next steps (for example, ‘we will provide you feedback by [date]’).

Sometimes it takes time for data producers to answer questions, or to solve the more complex issues with data or documentation, which can be frustrating for researchers. It might help to divide the issues into smaller steps and actions. As a curator, it is good to be proactive, for example by suggesting a convenient time for a (web) call to address the issues that can be resolved right away. Some issues can perhaps be fixed by data curators directly, after approval from the data depositors, reducing the effort required from researchers without adding much work for the curators.

There might, however, be a critical upper limit on how long data deposits with unresolved issues should stay in Pre-ingest. When the time limit is reached, you could ask the researchers whether they still wish to proceed with sharing the data. If the issues with the data cannot be resolved, or the researchers are no longer interested in depositing data in your archive, the deposit should be removed. Removal should take place a month or two after informing the researchers of this decision, allowing them enough time to reconsider and address the issues, if appropriate.