3.4 Data review and appraisal
The aim of this step is to ensure that the submitted data:
- meets collection and acquisition criteria, and
- meets the quality requirements.
The actual process and steps taken, as well as the extent of checks and the level of curation, might differ depending on the collection the data has been submitted to, the data type, and the tools used in the pre-ingest phase.
The data review and appraisal process should lead to a decision on acceptance or rejection of the data, and to the delivery of an information package to the ingest phase.
3.4.1 Compliance with acquisition criteria
Criteria for data to be included in a data collection should be set out in the acquisition policy. Such criteria may include data that:
- has a high potential for reuse in research and teaching,
- is important for the validation of research,
- supports a journal article,
- is of high quality (good spatial and temporal coverage, comparability, reliable with good provenance),
- meets strategic needs of a designated community (like new data types or sources, longitudinal data, time-series data, specific research topics like COVID-19 data, migration data),
- has scientific and/or historical value,
- is unique or at risk of loss.
If data falls outside the scope of the archive's collection, the material can still be referred to an institutional repository, a generalist repository or another domain repository, as appropriate.
How we do it: UK Data Service Collection Development Appraisal Grid (UK Data Service 2022)
3.4.2 Control of the submitted material
In addition to the decisions based on the value of data, data deposit is controlled and reviewed for issues important for preservation and reusability.
Control of the depositor ID
The depositor's identity can be verified automatically through identity federation, by asking the data depositor to log in with their academic institution credentials when submitting data.
Completeness control
This control ensures that the data and documentation mentioned in the data description are included in the delivery.
For example, if a study includes data from three surveys, there should be three data files or one file with a matching number of cases, and documentation of all three surveys as well as three questionnaires. This might require some manual inspection of files.
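The comparison between described and delivered files can be partly automated. The following is a minimal sketch, assuming the data description has already been turned into a list of expected file names; the function name and the file names are hypothetical, and manual inspection of file contents is still needed.

```python
def completeness_report(expected_files, delivered_files):
    """Compare the files named in the data description with the files delivered."""
    expected = set(expected_files)
    delivered = set(delivered_files)
    return {
        "missing": sorted(expected - delivered),     # described but not delivered
        "unexpected": sorted(delivered - expected),  # delivered but not described
    }


# Hypothetical deposit: three surveys, each with a data file and a questionnaire.
expected = [
    "survey1.csv", "survey2.csv", "survey3.csv",
    "questionnaire1.pdf", "questionnaire2.pdf", "questionnaire3.pdf",
]
delivered = [
    "survey1.csv", "survey2.csv",
    "questionnaire1.pdf", "questionnaire2.pdf", "questionnaire3.pdf",
    "notes.txt",
]

report = completeness_report(expected, delivered)
print(report)
```

Anything listed under "missing" is a reason to contact the depositor; anything under "unexpected" may simply need to be added to the data description.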
Virus and readability control
This control checks that the data (and any command files, if delivered) are free of viruses and readable, meaning that files are not corrupt and are in an accepted format. Virus control can be performed automatically. Formats can also be checked automatically, but readability control will most probably involve manual work and visual inspection of file contents.
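As an illustration of the automated part of this control, the sketch below checks a file name against a list of accepted formats and runs a basic readability test on CSV content. The list of accepted extensions is an assumption made for the example, not a recommendation, and virus scanning is left to dedicated tools. Passing these checks does not replace visual inspection.

```python
import csv
import io
from pathlib import PurePath

# Assumption: an example list only; each archive defines its own accepted formats.
ACCEPTED_EXTENSIONS = {".csv", ".txt", ".pdf", ".sav"}


def has_accepted_format(filename, accepted=ACCEPTED_EXTENSIONS):
    """Return True if the file extension is on the accepted list (case-insensitive)."""
    return PurePath(filename).suffix.lower() in accepted


def csv_is_readable(raw_bytes, encoding="utf-8"):
    """Return True if the bytes decode and every non-empty row matches the header width."""
    try:
        text = raw_bytes.decode(encoding)
        rows = [row for row in csv.reader(io.StringIO(text)) if row]
    except (UnicodeDecodeError, csv.Error):
        return False
    if not rows:
        return False
    width = len(rows[0])
    return all(len(row) == width for row in rows)
```

A file that fails either check would be flagged for a closer manual look rather than rejected outright.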
Data description control
This step should ensure that the metadata is complete according to the data archive's metadata profile, and that there is enough context information to understand the data.
How we do it: For data to be included in the CESSDA Data Catalogue (CESSDA n.d. [Accessed July 30, 2022a]), the CESSDA Metadata Model (Akdeniz, Esra et al. 2021) identifies mandatory and optional metadata fields. It is therefore worth checking whether the CMM mandatory metadata elements are included in the data description provided by your repository.
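A check for mandatory metadata elements could look like the sketch below. The field names in `MANDATORY_FIELDS` are placeholders chosen for illustration and are not the actual CMM element list; the list should be replaced with the mandatory elements of the metadata profile your archive uses.

```python
# Placeholder field names for illustration only; NOT the actual CMM mandatory elements.
MANDATORY_FIELDS = ["title", "creator", "abstract", "publication_year"]


def is_empty(value):
    """Treat None, blank strings and empty lists as missing values."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple, dict)):
        return len(value) == 0
    return False


def missing_mandatory_fields(record, mandatory=MANDATORY_FIELDS):
    """Return the mandatory fields that are absent or empty in a metadata record."""
    return [field for field in mandatory if is_empty(record.get(field))]


# Hypothetical metadata record with a blank abstract and no publication year.
record = {"title": "Example survey 2020", "creator": ["A. Researcher"], "abstract": "  "}
print(missing_mandatory_fields(record))
```

Such a check only confirms that a field is filled in. Whether the content gives enough context to understand the data still has to be judged by a reviewer.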
Ethical and legal aspects
The pre-ingest review should ensure that the ethical and legal requirements that are binding on the data archive are met.
If the data archive does not accept personal data, the reviewer should:
- check the data for direct personal identifiers (addresses, phone numbers, IP addresses, personal numbers etc.),
- ask the researchers whether the data are completely anonymised or whether there is a ‘code key’ (a ‘key’ matching research participants’ identities with a code number in the data file),
- assess the risks of disclosure created by detailed background data becoming available, that is, whether research participants can still be traced even if the direct identifiers are removed.
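A first pass over text content for direct identifiers can be supported by simple pattern matching. The sketch below is an assumption-laden illustration: the three regular expressions are deliberately rough, cover only e-mail addresses, IPv4 addresses and phone-like numbers, and will produce both false positives and false negatives. It flags candidates for manual review and does not demonstrate that a file is free of personal data.

```python
import re

# Rough illustrative patterns; they are not exhaustive and need manual follow-up.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "phone": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def scan_for_identifiers(text):
    """Return candidate direct identifiers found in the text, grouped by kind."""
    return {kind: pattern.findall(text) for kind, pattern in PATTERNS.items()}


# Hypothetical free-text answer containing identifiers.
sample = "Contact: jane.doe@example.org, tel +44 20 7946 0958, logged from 192.168.0.1"
hits = scan_for_identifiers(sample)
print(hits)
```

Personal numbers and postal addresses follow national conventions, so patterns for them would have to be added per country.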
For quantitative data, some common data transformation techniques can be used to minimise the risks of disclosure. These could include aggregating more detailed response categories and geographical variables; creating intervals instead of precise values (for age, income); or coding free-text open-ended answers (for occupation etc.). For qualitative data, references to places, times and other specifics in the background data can be removed manually.
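Two of these transformations, replacing precise values with intervals and aggregating detailed categories, can be sketched as follows. The interval width, the place names and the region mapping are hypothetical; suitable values depend on the data and on the disclosure risk assessment.

```python
def age_to_band(age, width=10):
    """Replace an exact age with an interval label such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def aggregate_category(value, mapping, other="Other"):
    """Map a detailed category to a broader one; unmapped values fall into 'other'."""
    return mapping.get(value, other)


# Hypothetical mapping from detailed places to broader regions.
REGION_OF = {"Town A": "Region 1", "Town B": "Region 1", "Town C": "Region 2"}

# Hypothetical respondents with exact ages and detailed places.
respondents = [{"age": 34, "place": "Town A"}, {"age": 40, "place": "Village X"}]
protected = [
    {"age_band": age_to_band(r["age"]), "region": aggregate_category(r["place"], REGION_OF)}
    for r in respondents
]
print(protected)
```

The exact age and the detailed place are dropped from the output, so only the coarser values would be passed on.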
Sensitive personal data (such as racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, genetic data, biometric data, data concerning health or data concerning a natural person's sex life or sexual orientation) require additional protection and special treatment in consent to publish, processing and access. If the data archive does not have the capacity to accept sensitive data, it might be reasonable to consider other options, for example, accepting a partial data set or collaborating with another repository.
More details and resources related to sensitive personal data can be found in Chapter 1, related to GDPR.
3.4.3 Review of documentation
After initial controls, the pre-ingest phase should ensure that there is enough metadata in the data file, and/or added documentation to understand the contents of data files, and to facilitate reuse of the data.
In this step, the review should go beyond a formal check of whether the information is present; the main focus should be on ensuring that there is enough context information to reuse the data.
What information should be included and how it is usually structured (in the data file, in a separate file, in technical documentation) is specific to the data collection method and the discipline. It might differ for a qualitative study based on video material and a quantitative survey resulting in an SPSS database.
- For a survey, a questionnaire and/or a codebook and a technical report on fieldwork should accompany the data files. For a survey data file, data archive staff in the pre-ingest phase might check whether all the variables named in the documentation are in the file, and whether all variables in the file are properly documented with variable names and response category names.
- For a relatively recent research project, it might be rather easy to get the missing information, but for studies going further back in time and involving different researcher groups it might take time and effort to get the information needed to understand the data.
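The variable check described for survey data files can be sketched as a comparison between the variable names in the file and the entries in the codebook. The sketch assumes the codebook has been read into a dictionary keyed by variable name; that structure, the function name and the variable names are hypothetical.

```python
def documentation_gaps(file_variables, codebook):
    """Compare variables in the data file with the variables described in the codebook."""
    file_vars = set(file_variables)
    documented = set(codebook)
    return {
        "in_file_not_documented": sorted(file_vars - documented),
        "documented_not_in_file": sorted(documented - file_vars),
        # Present in both, but the codebook entry has no usable label.
        "missing_label": sorted(
            name for name in file_vars & documented
            if not codebook[name].get("label", "").strip()
        ),
    }


# Hypothetical survey file and codebook.
file_variables = ["id", "age", "q1", "q2"]
codebook = {
    "id": {"label": "Respondent ID"},
    "age": {"label": "Age in years"},
    "q1": {"label": ""},
    "q3": {"label": "Trust in parliament"},
}

gaps = documentation_gaps(file_variables, codebook)
print(gaps)
```

Each non-empty list is a concrete question to put to the data producers.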
It is difficult to estimate reasonable timing for finishing the first review, as it varies depending on the scope and complexity of material, type of data as well as resources available.
3.4.4 Communication with researchers
It may be valuable to store all communication with researchers, as this allows others to pick up the task and is especially useful when there are long intervals between communications (for example, because of summer holidays or other commitments). Data curators working in ingest might also want to review earlier communication from the pre-ingest phase.
Regardless of whether communication goes mainly through an individual curator's e-mail address or through an e-mail address that is accessible to all data curators working with pre-ingest/ingest, it is important at some point to tag the communication thread with the pre-ingest ID, or to copy the most important parts of the communication and the decisions into the system/tool you use for pre-ingest. Issue management systems like Jira can be used as well. If the data archive uses an in-house developed web-based system for data deposit, it may include a communication module as well.
If issues with data or metadata are discovered, data producers should be asked to supply the missing information.
The time needed to finish the first review may vary. If reviewing the deposit takes longer than 2-3 days, it is good to send short feedback to the data depositor indicating the status of the deposit and describing the next steps (for example, ‘we will provide you feedback by [date]’).
Data producers sometimes need time to answer questions, and more complex issues with data or documentation take time to resolve, which can be frustrating for researchers. It may help to divide the issues into smaller steps and actions. As a curator, it is good to be proactive, for example by proposing a (web) call to deal with the issues that can be addressed right away. Some issues can perhaps be fixed directly by data curators once the data depositors have approved, which shows depositors that little work is required of them without adding much work for data curators.
There might, however, be an upper limit on how long data deposits with unresolved issues should stay in the pre-ingest phase. When that limit is reached, you could ask the researchers whether they still want to proceed with sharing the data. If the issues with the data cannot be resolved, or the researchers are no longer interested in depositing data in your archive, the deposit should be removed. Removal should take place about a month or two after the researchers have been informed of this decision, which gives them enough time to consider and address the issues.
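The time-related rules in this subsection can be expressed as a small helper that suggests the next step for a deposit. This is a sketch under stated assumptions: the status update threshold and the removal grace period follow the figures mentioned in the text, while the 180-day pre-ingest limit is invented for illustration because no figure is given. The function and parameter names are hypothetical.

```python
from datetime import date, timedelta

STATUS_UPDATE_AFTER = timedelta(days=3)  # from the text: longer than 2-3 days
PRE_INGEST_LIMIT = timedelta(days=180)   # assumption: the text gives no figure
REMOVAL_GRACE = timedelta(days=60)       # from the text: about a month or two


def next_action(received, today, first_review_done, removal_notice_sent=None):
    """Suggest the next step for a deposit based on the dates involved."""
    if removal_notice_sent is not None:
        # Researchers were informed of the removal decision on this date.
        if today - removal_notice_sent >= REMOVAL_GRACE:
            return "remove deposit"
        return "wait for depositor"
    if today - received >= PRE_INGEST_LIMIT:
        return "ask depositor whether to proceed"
    if not first_review_done and today - received > STATUS_UPDATE_AFTER:
        return "send status update"
    return "continue review"


print(next_action(date(2024, 1, 1), date(2024, 1, 5), first_review_done=False))
```

In practice such a rule would only produce reminders for curators; the decision to remove a deposit remains a human one.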