1.8 What is the process of archiving from beginning to end?
The Open Archival Information System (OAIS) can help archives define their working process
CESSDA archives differ in size and capacity, but they share a common goal: making research data more accessible and findable, both in the short run and in the long run. To fulfil this role, many data archives follow the Reference Model for an Open Archival Information System, OAIS for short (The CCSDS 2012).
Open Archival Information System - OAIS
The OAIS model is composed of three complex information objects:
- Submission Information Package (SIP)
- Archival Information Package (AIP)
- Dissemination Information Package (DIP)
The Submission Information Package (SIP) is the package that is sent to an archive by a data producer. Its form and detailed content are typically negotiated between the producer and the archive. Most SIPs will have some content information and some so-called Preservation Description Information (PDI). In practice, SIPs are related to the phase of data ingest, when the archives receive all the data and affiliated documentation from the data producer (who is called a depositor in the context of the ingest of the data). See Chapter 4 for more information on the ingest phase.
Within the archive, one or more SIPs are transformed into one or more Archival Information Packages (AIPs) for preservation. The AIP has a complete set of PDI for the associated content information. In this phase the data and documentation are converted into long-term preservation formats and placed in archival storage. In practice, AIPs are related to the phase of treating the data and documentation for long-term preservation. This involves, for example, converting files into long-term preservation formats, creating various versions of documents, and anonymisation.
The archive provides all or a part of an AIP to a consumer (i.e. user) in the form of a Dissemination Information Package (DIP). The DIP may also include collections of AIPs, and it may or may not have complete PDI. DIPs in the OAIS are related to the final step in preparing the data and documentation for dissemination and making it available for users.
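The relationship between the three packages can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of the OAIS standard: the class layout, the function names `sips_to_aip` and `aip_to_dip`, and the short `REQUIRED_PDI` list are all hypothetical, and a real archive would store far richer content and PDI.

```python
from dataclasses import dataclass, field


@dataclass
class SIP:
    """Submission Information Package: what the depositor sends."""
    content: dict[str, bytes]                          # file name -> file bytes
    pdi: dict[str, str] = field(default_factory=dict)  # PDI may be partial


@dataclass
class AIP:
    """Archival Information Package: preservation copy with complete PDI."""
    content: dict[str, bytes]
    pdi: dict[str, str]


@dataclass
class DIP:
    """Dissemination Information Package: what a user receives."""
    content: dict[str, bytes]
    pdi: dict[str, str]


# Hypothetical: the PDI fields this toy archive treats as mandatory.
REQUIRED_PDI = ("provenance", "reference", "fixity")


def sips_to_aip(sips: list[SIP], archive_pdi: dict[str, str]) -> AIP:
    """Merge one or more SIPs into a single AIP and complete the PDI."""
    content: dict[str, bytes] = {}
    pdi: dict[str, str] = {}
    for sip in sips:
        content.update(sip.content)
        pdi.update(sip.pdi)
    # The archive fills gaps but does not overwrite what the depositor gave.
    for key, value in archive_pdi.items():
        pdi.setdefault(key, value)
    missing = [key for key in REQUIRED_PDI if key not in pdi]
    if missing:
        raise ValueError(f"AIP needs complete PDI; missing: {missing}")
    return AIP(content=content, pdi=pdi)


def aip_to_dip(aip: AIP, wanted_files: list[str], include_pdi: bool = True) -> DIP:
    """Give a user all or part of an AIP, with or without the PDI."""
    content = {name: aip.content[name] for name in wanted_files}
    pdi = dict(aip.pdi) if include_pdi else {}
    return DIP(content=content, pdi=pdi)
```

For example, a SIP holding a data file and a codebook can become one AIP, from which a user later receives only the data file. The sketch deliberately leaves out format conversion and anonymisation, which take place when the AIP is created.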
As mentioned in the section What is data acquisition?, data archives also include a pre-SIP or pre-ingest phase in their workflow, but this phase is not recognised as an official part of the OAIS. It happens prior to the point of ingestion and covers the first communication with possible data producers, including an evaluation of the suitability of the study (The CCSDS 2012). See Chapter 3 for more information on the pre-ingest phase.
The OAIS covers six crucial functions that data archivists deal with. These functions follow the three main information objects mentioned above (SIP, AIP, DIP) and give us a better insight into how the work in data archives proceeds:
- Ingest function: receives data and metadata from a producer and packages it for storage in the archive. A data archivist checks the data, its format, documentation, metadata and its licence and creates the additional documentation needed for the archiving purposes.
- Archival storage function: stores, maintains and retrieves data documentation and the data, assigns them to long-term storage and provides them to the Access function.
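- Data management function: maintains and updates the databases that hold the descriptive metadata about the archive's holdings and the administrative data used to run the archive.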
- Administration function: manages the daily operations of the archive. It processes submissions and agreements with data producers, manages the software systems, develops policies and standards or compliance to such, and handles customer service.
- Preservation planning function: supports all tasks to keep the archived data accessible and understandable in the long term, even if the original computing system becomes obsolete, e.g. development of detailed preservation/migration plans, technology watch, evaluation and risk analysis of content and recommendation of updates and migration.
- Access function: includes the user interface that allows users to retrieve information from the archive, and provides data and data documentation to users.
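As an illustration of the checks described under the Ingest function, the following minimal Python sketch lists the problems found in a submission. It assumes a hypothetical archive policy: the accepted file formats, the required metadata fields and the function name `ingest_problems` are invented for this example and will differ from archive to archive.

```python
from pathlib import PurePath
from typing import Optional

# Hypothetical policy values; every archive defines its own.
ACCEPTED_FORMATS = {".csv", ".txt", ".pdf", ".xml"}
REQUIRED_METADATA = ("title", "creator")


def ingest_problems(files: list[str],
                    metadata: dict[str, str],
                    licence: Optional[str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems: list[str] = []
    if not files:
        problems.append("no files submitted")
    # Check the format of each file by its extension.
    for name in files:
        suffix = PurePath(name).suffix.lower()
        if suffix not in ACCEPTED_FORMATS:
            problems.append(f"{name}: format {suffix or '(none)'} not accepted")
    # Check that the required metadata fields are present and non-empty.
    for key in REQUIRED_METADATA:
        if not metadata.get(key):
            problems.append(f"metadata field '{key}' is missing")
    # Check that a licence was supplied.
    if not licence:
        problems.append("no licence given")
    return problems
```

A check like this only flags formal gaps. Judging the quality of the data and the documentation remains the task of the data archivist.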
Persistent identifiers (PIDs) in the archiving process
A general definition of persistent identifiers is given in the section What does an archive look like and what does it do? Here we will briefly explain their role in the archiving process.
PIDs play a crucial role in uniquely identifying the datasets available in a data archive. In the dissemination process (DIP - access phase), PIDs are important for data citation: they provide a way for datasets to be referred to consistently and persistently, allowing attribution to the data creators. There are various services (e.g. DaRa (n.d.), DataCite (n.d.)) through which data archives can register PIDs for their datasets in the curation process; see Citing your data (CESSDA Training Team 2017-2022).
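To show how a PID supports citation, here is a minimal Python sketch for the DOI, one common type of PID. It turns a DOI into a link through the public resolver at https://doi.org/ and places that link in a citation. The DOI `10.1234/example-dataset`, the helper names and the citation layout are made up for illustration; your archive or PID service may prescribe a different citation format.

```python
from urllib.parse import quote

DOI_RESOLVER = "https://doi.org/"


def doi_to_url(doi: str) -> str:
    """Turn a DOI, with or without a common prefix, into a resolver URL."""
    doi = doi.strip()
    lowered = doi.lower()
    # Strip a leading resolver address or "doi:" label if one is present.
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if lowered.startswith(prefix):
            doi = doi[len(prefix):]
            break
    # A DOI starts with "10." and has a prefix and a suffix separated by "/".
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    return DOI_RESOLVER + quote(doi, safe="/")


def cite_dataset(creator: str, year: int, title: str, archive: str, doi: str) -> str:
    """Build a simple data citation (illustrative layout only)."""
    return f"{creator} ({year}). {title}. {archive}. {doi_to_url(doi)}"
```

Because the citation contains the resolver link rather than the current web address of the dataset, it can keep working when the archive reorganises its website, as long as the archive updates the registered target of the PID.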
Find out more about your archive
Here are some questions you can ask yourself to learn more about your own archive:
- Does your archive implement a pre-ingest phase?
- Which OAIS function does your archive find the most challenging and why?
- How are different functions implemented in your archive?
- Which PID(s) does your archive use?
Expert Tip: Watch this video 'What Are Persistent Identifiers and Why to Use Them?' by FAIRsFAIR EU (2022).
Expert Tip: This video by Research Data Netherlands (2014) explains the workings of PIDs, and of DOIs in particular, in more detail.