Digital Information Preservation Perspectives
Don Sawyer/National Space Science Data Center/NASA/GSFC
Lou Reich/Computer Sciences Corporation
Ben Kobler/Earth Science Data and Information Systems/NASA/GSFC
John Garrett/Raytheon ITSS

The problems associated with the preservation of digital information cut across many of the issues raised by the EU/US Workshop organizing committee, but they have not been called out explicitly. The purpose of this short paper is to raise the visibility of preservation perspectives, because these have near-term as well as long-term impacts on large collections of scientific data.

The authors of this short paper, along with many others, have recently been involved in developing a draft standard entitled "Reference Model for an Open Archival Information System (OAIS)". This draft is currently under formal review by the Consultative Committee for Space Data Systems and ISO TC20/SC13. It is available from http://ssdoo.gsfc.nasa.gov/nost/isoas/ref_model.html where there are also instructions on how to provide comments in the review process.

For a number of reasons, including the rapid pace of technology evolution, the problems of digital information preservation can become apparent in the space of only a few years. However, there has been no generally accepted terminology and framework in which to discuss these preservation issues, to contrast and compare archives, to identify areas where additional standards may be needed, and to encourage vendor support for preservation requirements. These factors led to the work on the OAIS reference model.

The reference model addresses both functional and information model concepts at a high level and in detail. It highlights the need for Preservation Description Information to be associated with the digital objects being preserved so that

the chain of custody and processing history are available,
relationships to other digital objects are recognized,
the digital objects can be unambiguously identified, and
their information content is not altered in an undocumented manner.
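
As an illustration only, the four needs listed above could be carried in a small record kept alongside each digital object. The sketch below is not part of the reference model: the field names, the helper functions, and the choice of a SHA-256 digest to detect undocumented alteration are all assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class PreservationDescription:
    """Hypothetical record holding Preservation Description Information."""
    identifier: str                                  # unambiguous identification
    provenance: list = field(default_factory=list)   # custody and processing history
    related: list = field(default_factory=list)      # identifiers of related objects
    sha256: str = ""                                 # digest used to detect alteration


def describe(identifier, content, provenance, related=()):
    """Build a description for the given bytes (assumed helper, not an OAIS API)."""
    digest = hashlib.sha256(content).hexdigest()
    return PreservationDescription(identifier, list(provenance), list(related), digest)


def is_unaltered(description, content):
    """Return True when the bytes still match the recorded digest."""
    return hashlib.sha256(content).hexdigest() == description.sha256


def record_event(description, event):
    """Append a documented processing step to the provenance history."""
    description.provenance.append(event)
```

A changed digest only shows that the bits differ; whether the change was documented has to be judged against the recorded provenance events.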

It highlights the need for an adequate network of representation information to be associated with each digital object so that current and future users will be able to fully understand the intended meanings. Closely associated with this is the need to explicitly identify the Designated Community that is expected to understand this information so that mechanisms can be established to verify this understanding.

The reference model addresses digital migration issues and the evolution of access services to the digital objects. It also addresses interactions among cooperating archives and identifies levels of federation. It establishes a minimum set of responsibilities for an operation to be called an OAIS archive, and it requires the consistent use of terms and concepts as part of the standard.

While the OAIS reference model sets a framework, and makes various suggestions to improve the likelihood of information preservation, it does not specify an implementation. Real archives, including all large scientific databases, must make implementation choices. The remainder of this paper addresses some issues associated with digital migration, representation information, and certification of archives.

Representation Information

The OAIS reference model defines Representation Information as information that maps Data into more meaningful concepts. An example is the ASCII definition which describes how bits (i.e., Data) are mapped into symbols. Another example is a description of the numbers (i.e., Data) in a table as being the coordinates of a location on the Earth measured in East longitude and latitude. The prescription for associating Representation Information with a Digital Object begins with a clear definition of the boundaries of the digital object, and then continues with the need to identify the boundaries of the associated Representation Information. As a practical matter, the boundaries of a digital object might be identified as the content of a digital file, not including the implementation of the file system itself nor including the file name or other attributes that might be supported by the file system. The Representation Information for understanding this file content may itself be composed of one or more digital objects, and these in turn need Representation Information. The situation can become more complex when the digital object is part of a database and its implementation is hidden, or when the focus is one or more Web pages with embedded links.
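
The two ideas in this paragraph, that Representation Information maps bits into symbols and that Representation Information may itself need further Representation Information, can be sketched as follows. The object names and the dictionary used to record dependencies are hypothetical; a real archive would have to decide for itself how such a network is recorded.

```python
# The ASCII definition maps bits (Data) into symbols.
bits = bytes([0x4C, 0x61, 0x74])
symbols = bits.decode("ascii")  # "Lat"


def representation_network(object_id, needs):
    """Return every Representation Information object reachable from object_id.

    `needs` maps an object identifier to the identifiers of the
    Representation Information required to understand it.
    """
    found = set()
    pending = list(needs.get(object_id, []))
    while pending:
        current = pending.pop()
        if current in found:
            continue
        found.add(current)
        pending.extend(needs.get(current, []))
    return found


def missing_representation(object_id, needs, held):
    """Return the required objects that the archive does not hold."""
    return representation_network(object_id, needs) - set(held)


# Hypothetical example: a table file described by a format description,
# which in turn relies on the ASCII definition.
needs = {
    "table.dat": ["table-format.txt"],
    "table-format.txt": ["ascii-definition"],
    "ascii-definition": [],
}
```

In this example, an archive holding only "table-format.txt" would find that "ascii-definition" is still missing from the network for "table.dat".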

Often the lack of adequate associated Representation Information is hidden by the use of currently available access software. However, the OAIS Reference Model takes the position that software is not an adequate substitute for Representation Information because it cannot be relied upon to continue working and because it further obscures the underlying information on which it depends. At the heart of these issues is the lack of generally accepted, and supported, notions of what constitutes 'digital information' as opposed to 'digital bits'. Each scientific database or archive currently needs to define such things for itself.

Associated with the complexity of Representation Information for scientific data is the issue of embedded pointers. It is convenient for data producers to embed local file names and even directory paths into their data and Representation Information. When such data are moved to new locations, such as to archives, or even to new locations within an archive, these links will often break. There is the potential to use universal identifiers, supported by updateable mapping systems, in place of these embedded location identifiers. The Universal Resource Name (URN) is one such approach but it is not yet widely used. In the domain of the Space agencies, there is an infrastructure called Control Authorities which assigns globally unique identifiers (Control Authority and Description Identifier) to descriptions of data and from which these descriptions are retrievable upon presentation of the unique identifier. Such techniques are not yet known to be in wide use but appear promising.
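
The idea of replacing embedded locations with identifiers resolved through an updateable mapping can be sketched in a few lines. This is a toy illustration, not the URN resolution mechanism or the Control Authority interface; the class, the identifier, and the paths are all invented for the example.

```python
class IdentifierRegistry:
    """Hypothetical updateable mapping from identifiers to current locations."""

    def __init__(self):
        self._locations = {}

    def register(self, identifier, location):
        """Record or update where the identified object currently lives."""
        self._locations[identifier] = location

    def resolve(self, identifier):
        """Return the current location, or raise KeyError if unknown."""
        return self._locations[identifier]


registry = IdentifierRegistry()

# The data embed the identifier, not the producer's directory path.
embedded_reference = "urn:example:calibration-table"
registry.register(embedded_reference, "/producer/data/cal.tab")

# After the move to an archive only the mapping changes; the embedded
# reference inside the data stays valid.
registry.register(embedded_reference, "/archive/volume7/cal.tab")
current_location = registry.resolve(embedded_reference)
```

The design choice is that a move costs one update to the mapping rather than an edit to every object that embeds the reference.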

Digital Migration

The OAIS reference model defines digital migration as the transfer of digital information, while intending to preserve it, within the OAIS. It is distinguished from transfers in general by three attributes:

a focus on the preservation of the full information content;
a perspective that the new archival implementation of the information is a replacement for the old; and
an understanding that full control and responsibility over all aspects of the transfer resides with the OAIS.

Digital migrations are identified in the reference model by four main categories:

Refreshment - the replacement of a media instance with one of the same type
Replication - copying the full information content and the packaging information used to delimit this information content, to a new media instance of the same or different type
Repackaging - a copy where there is no change to the full information content, but some change to packaging information takes place
Transformation - a copy where there is some change to the full information content
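
The four categories above can be read as a decision sequence, sketched below. This is a simplification made for illustration: it treats any unchanged copy onto the same media type as Refreshment, although by the definition above Replication may also target media of the same type. The dictionary keys are hypothetical.

```python
def classify_migration(old, new):
    """Classify a migration from simplified before/after descriptions.

    Each argument is a dict with the assumed keys 'content' (bytes),
    'packaging' (bytes), and 'media_type' (str).
    """
    if old["content"] != new["content"]:
        return "Transformation"   # the information content itself changed
    if old["packaging"] != new["packaging"]:
        return "Repackaging"      # only the packaging information changed
    if old["media_type"] != new["media_type"]:
        return "Replication"      # unchanged copy on a different media type
    return "Refreshment"          # unchanged copy on the same media type


original = {"content": b"1 2 3", "packaging": b"tar", "media_type": "tape-A"}
new_tape = dict(original)
new_disk = dict(original, media_type="disk-B")
rewrapped = dict(original, packaging=b"zip")
reformatted = dict(original, content=b"1,2,3")
```

The ordering of the tests matters: a migration that changes both content and packaging is reported as a Transformation, the category with the greatest risk of information loss.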

A significant implementation issue is the need for a clear designation of what constitutes the boundaries of a given information object, which is directly related to the digital object and Representation Information boundaries discussed above. Without this distinction the migration may be an unrecognized Transformation migration with the result that information is lost. The lack of generally accepted approaches to this issue makes many migrations of scientific data more error-prone. Also, the embedded pointer issue described above can lead to information loss or, at a minimum, require significant effort to correct.

Some migrations will need to change the format of the scientific information objects. These Transformation migrations may be reversible using an algorithm. However, if they are not, it may be difficult to determine if information has been lost (or noise has been added).
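
One way to probe whether a Transformation is reversible is a round-trip test: apply the transformation, apply its proposed inverse, and compare the result with the original bits. The sketch below uses two example transformations chosen for illustration, a byte-order change for 32-bit integers and a narrowing of 64-bit floating point values to 32 bits; neither is taken from the reference model.

```python
import struct


def reorder_int32(data, src, dst):
    """Convert packed 32-bit integers between byte orders ('>' or '<')."""
    count = len(data) // 4
    return struct.pack(f"{dst}{count}i", *struct.unpack(f"{src}{count}i", data))


def double_to_single(data):
    """Narrow big-endian 64-bit floats to 32-bit floats (may lose precision)."""
    count = len(data) // 8
    return struct.pack(f">{count}f", *struct.unpack(f">{count}d", data))


def single_to_double(data):
    """Widen big-endian 32-bit floats back to 64-bit floats."""
    count = len(data) // 4
    return struct.pack(f">{count}d", *struct.unpack(f">{count}f", data))


def round_trips(original, forward, inverse):
    """Return True when inverse(forward(original)) reproduces the original bits."""
    return inverse(forward(original)) == original


ints = struct.pack(">3i", 1, -2, 300)
ints_ok = round_trips(
    ints,
    lambda d: reorder_int32(d, ">", "<"),
    lambda d: reorder_int32(d, "<", ">"),
)

doubles = struct.pack(">2d", 0.1, 2.5)
doubles_ok = round_trips(doubles, double_to_single, single_to_double)
```

Here the byte-order change round-trips, while the narrowing does not because 0.1 cannot be recovered exactly from its 32-bit form. A successful round trip on sample data shows only that those samples survived; it does not prove the transformation is reversible for every object in the archive.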

Certification

The OAIS reference model does not address the issue of certification of archives directly. Nevertheless, there is considerable interest, particularly when one archive needs to rely on another for a range of services, in being able to 'trust' the other archive. One way to approach this in general is to identify approaches by which an archive may establish some level of certification, whether by self-assessment or by an auditor. A starting point might be the minimum responsibilities identified for an OAIS in the reference model. The work of ISO TC 171/SC 3 on document TR 15801 includes 'methods by which an Archive's customers may gain confidence in the authenticity, quality, and usefulness of digitally archived materials'. It may be useful in furthering certification approaches.