4.4 Quality assurance of data and documentation material

Image: a data archivist checking the quality of the data and documentation.

Depending on the workflow, ingest and preservation of data and documentation material might be connected. Best practice is for your archive to have an established data curation policy that clearly covers how the data are maintained and how value is added to enhance data re-use and preservation.

According to the OAIS, the preservation planning function "supports all tasks to keep the archived data accessible and understandable in the long term, even if the original computing system becomes obsolete, e.g., development of detailed preservation/migration plans, technology watch, evaluation and risk analysis of content and recommendation of updates and migration." See Chapter 1.8 for more information.

The process of data curation can be intertwined with preservation steps, for example if the data curator receives data in an unusual format. In that case, a preservation staff member could transform the data into the desired format and hand it back to the data curator. It is also possible that the roles of preservation and ingest staff are not clearly separated, so the workflow does not appear to be interrupted by data transformation.

Steps may differ depending on the agreed service the repository offers: re-use vs. preservation only vs. self-deposit/self-archiving. The checks undertaken and the changes made to the data will likewise vary between archives. It is possible, for example, that self-deposited data will not be reviewed by the ingest team at all.

For traceability reasons, it is important that the data curator keeps track of agreements reached with the depositor and, in the absence of the responsible data curator, saves them in such a way that other data curators can continue to work with the data. Different options can be implemented for this purpose: project management software, simple Excel files (stored in a shared location), ticket systems, or e-mail correspondence with copied recipients. For the most part, the steps taken by the various data archives to prepare and archive data for the long term are similar. The CESSDA Training Team (n.d.) has collected information on how the various archives assure the quality of the data deposited in their repositories.

4.4.1 Risks for integrity (checks for compliance with General Data Protection Regulation - GDPR)

The archiving process is regulated by both national laws and international regulations. Intellectual property rights and personal data are the most important legal aspects that data curators need to consider. Make sure that the data (and documentation material, e.g., the codebook) are sufficiently anonymised (FSD n.d. [Accessed July 30, 2022a]). Most often, this means that no respondent can be identified and all personal information has been removed:

  • Removal of all direct identifiers and strong indirect identifiers (direct identifiers: e.g., full name, social security number; strong indirect identifiers: e.g., postal address, phone number, vehicle registration number, IP address of a computer)
  • Indirect identifiers (often standard demographic variables such as age, gender, education, status in employment, economic activity and occupational status, socio-economic status, household composition, income, marital status, mother tongue, ethnic background, place of work or study, and regional variables) need to be modified (e.g., recoded into broader categories) or deleted in order to anonymise the data, because cross-tabulating these variables could reveal a respondent's identity
  • Check or delete answers to open questions (in quantitative and qualitative data). String variables can contain unnecessary detailed information (e.g., “Thanks for the interview. Here is my email address for further communication.”)
  • Check the data while keeping the population group (especially sensitive groups such as children, minorities or victims of crime) and the sample size in mind, because for small samples and sensitive groups the re-identification risk can be higher.
  • Are any ID variables left that stem from a data collector database or survey institute and trace back to the respondent?
  • A spellcheck of variable names, labels, value labels and string variables is recommended (e.g., by exporting all labels to Excel and conducting a spell check).

Keep in mind that indirect identifiers removed from the published data set (such as birth date, in the form MM-YYYY) might still be made available through a restricted, controlled-access procedure.
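
The checks above can be sketched in code. This is a minimal illustration on a toy record, not an archive's actual procedure; the variable names and the category width are assumptions.

```python
# Illustrative anonymisation helpers; identifier names are hypothetical.
DIRECT_IDENTIFIERS = {"full_name", "ssn", "postal_address", "phone", "ip_address"}

def drop_direct_identifiers(record: dict) -> dict:
    """Remove direct and strong indirect identifiers from one record."""
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

def coarsen_age(age: int, width: int = 10) -> str:
    """Recode exact age into a broader category (e.g. 34 -> '30-39')."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"

record = {"full_name": "Jane Doe", "phone": "555-0199", "age": 34, "gender": "f"}
safe = drop_direct_identifiers(record)
safe["age"] = coarsen_age(safe["age"])
# safe now contains only the coarsened age and gender
```

In practice such rules are applied per variable and documented, since the right category width depends on sample size and population group.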

4.4.2 Checks for compatibility with other formats

Data:

  • Technical settings (e.g., missing values): In Stata, missing values can be coded as extended missing values such as “.a” or “.b”. After conversion of the dataset to SPSS, these missing values are not displayed correctly. It is therefore recommended to use numerical codes (negative values are also possible, e.g., -9 = no answer; -8 = do not know).
  • Length of variable labels: In Stata, variable labels are truncated after 79 characters and remain incomplete. In SPSS, longer variable labels are allowed. A conversion from SPSS to Stata might therefore truncate variable labels in the Stata file.
  • Scanning for unlabelled values: Can the variable be understood in case the values are not completely labelled?
  • Is any information attached to the dataset that should not be archived?
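
The label-length and unlabelled-value checks above can be automated. The sketch below assumes the metadata has been exported into simple dictionaries, which is a simplification; the 79-character limit mirrors the Stata truncation behaviour described in the text.

```python
# Hypothetical pre-conversion checks on exported metadata.
STATA_LABEL_LIMIT = 79

variable_labels = {
    "q1": "Overall life satisfaction",
    "q2": "How satisfied are you with the current state of the national economy, "
          "taking into account developments over the last twelve months?",
}
value_labels = {"q1": {1: "low", 2: "high"}}       # labels defined per variable
observed_values = {"q1": {1, 2}, "q2": {1, 2, 3}}  # values actually in the data

# Flag variable labels that Stata would truncate.
too_long = [name for name, label in variable_labels.items()
            if len(label) > STATA_LABEL_LIMIT]

# Flag values occurring in the data without a value label.
unlabelled = {name: vals - set(value_labels.get(name, {}))
              for name, vals in observed_values.items()
              if vals - set(value_labels.get(name, {}))}
```

Both results would be reported back to the depositor rather than fixed silently.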

Documentation:

  • Documentation material accompanying the data should also be stored in a long-term archival format. For PDF this is PDF/A, an archival PDF profile that embeds all fonts used in the document. The PDF/A conformance level applied should be at least 1a (check in the settings), because then the whole text can be represented in Unicode. The recommended standard is ISO 19005-2 (2011), which is based on ISO 32000-1 (2008). If there are problems saving the document as PDF/A-2u, the order of precedence should be: 2u > 2a > 2b > 1a (> 1b).
  • Programming and syntax code should be stored as plain (ASCII) text files in the format of the software used, to guarantee long-term preservation.
  • The structure of the respective organisation affects the ordering of ingest and preservation steps and which tasks are undertaken by which staff member in the unit.

4.4.3 Plausibility

Data:

  • Includes checks on logical errors. Example: if a respondent indicates being single, the observation for the spouse’s occupation should be either missing or not applicable. Another example: are there 13-year-old pensioners in the data set?
  • Conduct a series of tests to check for out-of-range or improbable values, e.g., making sure the value for age is plausible, cross-tabulating age and employment, checking the gender variable, etc.
  • Are there implausible outliers in the data? Can the depositor trace the origin of the error and correct it? Example: a body weight of 800 kg is recorded for one respondent.
  • Are there any duplication errors?
  • Check that the filters described in the documentation are implemented in the same way, and correctly, in the data set.
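
Such plausibility rules are easy to express programmatically. The sketch below runs three of the checks named above on toy records; the thresholds (pension age, weight range) are illustrative assumptions, not archive policy.

```python
# Toy records containing the example errors from the text.
records = [
    {"id": 1, "age": 13, "status": "pensioner", "weight_kg": 70},
    {"id": 2, "age": 45, "status": "employed", "weight_kg": 800},
    {"id": 2, "age": 45, "status": "employed", "weight_kg": 800},  # duplicate row
]

# Logical error: implausibly young pensioners.
young_pensioners = [r["id"] for r in records
                    if r["status"] == "pensioner" and r["age"] < 15]

# Out-of-range / improbable values (e.g. a body weight of 800 kg).
improbable_weight = [r["id"] for r in records
                     if not 2 <= r["weight_kg"] <= 300]

# Duplication errors: identical observations appearing more than once.
seen, duplicates = set(), []
for r in records:
    key = tuple(sorted(r.items()))
    if key in seen:
        duplicates.append(r["id"])
    seen.add(key)
```

Flagged cases go back to the depositor, who is usually best placed to decide whether a value is a genuine outlier or an error.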

4.4.4 Comparison of data and documentation material

  • Compare data and documentation material with regard to variable names and labels, value labels, correct numbering and spelling. Example: Variable q1_a1 has the option “do not know” in the dataset coded as -8, but in the codebook, it says “do not know” is coded as -9.
  • Include suggested citation and information about license of the document in each document (e.g., CCBY).
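
The value-label comparison in the first bullet can be automated once both sources are exported to a machine-readable form. The dictionaries below are a hypothetical stand-in for metadata exported from the dataset and the codebook; the -8/-9 mismatch reproduces the example above.

```python
# Cross-check value labels between dataset and codebook (toy metadata).
dataset_labels = {"q1_a1": {-8: "do not know", 1: "yes", 2: "no"}}
codebook_labels = {"q1_a1": {-9: "do not know", 1: "yes", 2: "no"}}

mismatches = {}
for var in dataset_labels.keys() & codebook_labels.keys():
    all_codes = dataset_labels[var].keys() | codebook_labels[var].keys()
    diff = {code: (dataset_labels[var].get(code), codebook_labels[var].get(code))
            for code in all_codes
            if dataset_labels[var].get(code) != codebook_labels[var].get(code)}
    if diff:
        mismatches[var] = diff
# mismatches now flags that "do not know" is -8 in the data but -9 in the codebook
```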

4.4.5 Give feedback to depositors

Some archives do not make any changes themselves, except for adding two variables to the dataset (version, DOI). Whether the comments and suggested changes are implemented by the data curator or by the depositor depends on the service level of the archive and the agreement between data curator and data depositor. It also depends on the respective archive's policy whether small changes, such as corrections of spelling mistakes, have to be approved by the depositor.

4.4.6 Adding DOI and version as variables to the dataset

Once the data curator and depositor agree that the archive material is in its final state, the data curator adds the variables DOI (“digital object identifier”, or any other persistent identifier (PID); with content such as “doi:10.11000/ABC123”) and version (e.g., “AUSSDA archive version”, with content such as “v1.0 (YYYY-MM-DD)”) to the dataset. It is recommended that DOI and version be ordered as the first variables of the dataset so that they are easily found after opening it.

Finally, all data (and documentation materials) are edited and saved in Unicode encoding.
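
A minimal sketch of this step, assuming the dataset is represented as a list of records; the DOI and version strings are placeholders following the patterns above.

```python
# Prepend DOI and version variables so they appear first in the dataset.
def add_doi_and_version(records, doi, version):
    """Return records with doi and version as the first two variables."""
    return [{"doi": doi, "version": version, **r} for r in records]

data = [{"q1": 1, "q2": 2}]
data = add_doi_and_version(data, "doi:10.11000/ABC123", "v1.0 (2022-07-30)")
# every observation now starts with the doi and version variables
```

In statistical packages the same effect is achieved with an `order`-type command after generating the two constant variables.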

4.4.7 File naming and managing files

The files are named according to an agreed-upon file naming scheme. Commonly, the user can already derive various information from the file name, e.g., the archival number, whether the file is a dataset (or, e.g., a codebook), which language is used, and the version number. This allows users to quickly understand the content of the files and keeps file names consistent across different data deposits in the repository. The individual components of the file naming scheme can be represented by two-letter abbreviations. If there is more than one file per category (e.g., two codebooks), numbers are added after the abbreviation.

An example of what a file name can look like:

suffix of archival number/pid_description_language_version.fileextension

  • 12345_co01_en_v1_0.pdf

In the example above, 12345 represents the archive number; it could also be the DOI suffix or another PID.

The description spans abbreviations for data (da), questionnaire (qu), interviewer manual (im), field report (fr), codebook (co), research report (rr), method report (mr), code/syntax (sy), tabulation report (ta), other material (om), and variable identifiers and descriptions (vi). Of course, these abbreviations and listed files might differ from archive to archive.

The language should follow the ISO 639-1 language code list. For English this would be “en”.

The version can optionally be preceded by the letter “v”. In our example, we distinguish between major and minor versions, separated by an underscore. Minor version changes reflect minor changes to the file.
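
A file naming scheme like this can be validated mechanically at ingest. The sketch below parses names following the example scheme above; the exact pattern (two-letter type code plus two-digit counter, “v” prefix, underscore-separated version) is one possible interpretation, and other archives' schemes will differ.

```python
import re

# Pattern for: archivalnumber_description_language_vMAJOR_MINOR.extension
PATTERN = re.compile(
    r"^(?P<number>\d+)_"                 # archival number / PID suffix
    r"(?P<description>[a-z]{2}\d{2})_"   # type code + counter, e.g. co01
    r"(?P<language>[a-z]{2})_"           # ISO 639-1 language code
    r"v(?P<major>\d+)_(?P<minor>\d+)"    # major and minor version
    r"\.(?P<extension>\w+)$"
)

def parse_filename(name: str) -> dict:
    """Split a scheme-conformant file name into its components."""
    match = PATTERN.match(name)
    if not match:
        raise ValueError(f"file name does not follow the scheme: {name}")
    return match.groupdict()

parts = parse_filename("12345_co01_en_v1_0.pdf")
# parts["description"] is "co01": the first codebook of this deposit
```

Running such a check on every deposited file catches naming inconsistencies before publication.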

4.4.8 Conversion to other formats and provenance

The conversion of archive material can be undertaken by staff from the preservation entity, but it can also be done by ingest staff. It must be clear when, and which, changes and conversions have been applied to the data and documents. For documentation and traceability purposes, it is recommended to record all steps applied to the data and documentation files in a log sheet or a dedicated tool, including how each conversion was done and which processing files were used.
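
Whatever tool is used, each log entry should capture the same minimal fields. The sketch below shows one hypothetical structure for such a provenance record; the field names are illustrative, not a metadata standard.

```python
import datetime

provenance_log = []

def record_step(filename, action, tool, note=""):
    """Append one provenance entry describing a change to an archive file."""
    provenance_log.append({
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "file": filename,
        "action": action,   # e.g. "conversion", "anonymisation", "label fix"
        "tool": tool,       # software (and version) used for the step
        "note": note,       # e.g. which processing syntax file was applied
    })

record_step("12345_da01_en_v1_0.dta", "conversion",
            "Stata 17 -> SPSS 28", note="labels re-checked after conversion")
```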

4.4.9 Dissemination

Repositories deploy diverse channels to provide access to data. In most archives, users can download the data free of charge, but registration is often required. Sometimes open-access datasets are available without registration. Online data repositories offer their users ways to share, preserve and access different kinds of data using different software systems.

Software systems used for data publishing (and online analysis) are:

  • Dataverse
  • Nesstar
  • Customized solutions (like at FSD (n.d. [Accessed July 30, 2022b]) or SND (n.d. [Accessed July 30, 2022a]))