
 



Scientific Data Collections
Reagan W. Moore, San Diego Supercomputer Center

A rapid convergence of information management technology is occurring for the support of scientific data collections.  Information management technology is being assembled across multiple communities that are developing archival storage systems, digital library services, parallel compute platforms, distributed computing environments, and persistent archives.  The combination of these systems is resulting in the ability to describe, manage, access, and build very large scientific data collections.  Several key factors are driving the technology convergence:

·        Development of an appropriate information model for describing scientific data.  The eXtensible Markup Language (XML) provides a common information model for describing data structure and data set context.  Because XML represents semi-structured information, it is well suited to scientific data, and it provides a way to define infrastructure-independent representations for information.

·        Differentiation between the information context to associate with individual data sets, the collection, and the user interface to the collection. XML Document Type Definitions (DTDs) define the tags that can be used to represent structure and semantic context.  It is possible to use a DTD to define the structure of an individual data set, or of a collection.  Note that multiple DTDs can be applied to a given data set, making it possible to map a data set into multiple collections.  The differentiation between the properties of a collection and the properties of a data set makes it possible to dynamically organize collections.  Use of XSL style sheets makes it possible to create interfaces to collections that can be tuned to the requirements of separate user communities.

·        Development of interoperability systems that support the federation of data collections across heterogeneous software and hardware systems.  Interoperability systems decouple the implementation of the data collection from the data access mechanisms.  It then becomes possible to store data sets in storage resources as digital objects with an associated DTD, while the information catalog used to identify the data sets resides on an independent platform.  Collections can then be organized as logical groupings of data sets that are distributed across arbitrary physical storage resources.  The advances in information description models make it possible to separate the storage of data from the software systems used both to organize collections and to support information discovery. 
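The separation between data set descriptions and collections can be illustrated with a minimal sketch.  It assumes a hypothetical XML description of a single data set (the element and attribute names are invented for illustration and do not come from any published DTD) and treats a collection as a logical grouping selected by attributes, independent of where the data set is stored.  Note that the Python standard library parser used here does not validate against a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML description of one data set: structure, context, location.
DATASET_XML = """
<dataset id="ds-001">
  <structure format="binary" dimensions="3" />
  <context discipline="neuroscience" instrument="microscope-A" />
  <location resource="archive-1" path="/store/ds-001.dat" />
</dataset>
"""


def describe(xml_text):
    """Flatten the XML description into a dict of attribute name -> value."""
    root = ET.fromstring(xml_text)
    attrs = {"id": root.get("id")}
    for child in root:
        for name, value in child.attrib.items():
            attrs[f"{child.tag}.{name}"] = value
    return attrs


def build_collection(descriptions, predicate):
    """A collection is a logical grouping: ids of data sets matching a predicate."""
    return [d["id"] for d in descriptions if predicate(d)]


# The information catalog holds descriptions only; the data stay on "archive-1".
catalog = [describe(DATASET_XML)]

# The same data set is mapped into two different collections.
neuro = build_collection(catalog, lambda d: d.get("context.discipline") == "neuroscience")
binary = build_collection(catalog, lambda d: d.get("structure.format") == "binary")
```

Because the collections hold only identifiers, the physical location recorded in the description can change without redefining either collection.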

The ability to manipulate information has multiple consequences:

·        Federation of data collections.  It is possible to publish the schema used to organize a collection as an XML DTD.  Information discovery can then be done through queries based upon the semi-structured representation of the collection attributes provided by the XML DTD.  Distributed queries across multiple collections can be accomplished by mapping between the multiple DTDs. 

·        Persistent archives.  The context to associate with a data set can be specified by defining a collection that includes the attributes needed to understand the data set structure, physical interpretation, and associated origination information.  Since the organization of the attributes can be defined through an XML DTD, it is possible to archive the information needed to assemble a collection independently of the data sets that comprise the collection.  This makes it possible to migrate a collection forward in time onto new technology.  Either the collection description is instantiated on the new technology while the data sets remain on the physical storage resource, or the data sets are moved to a new physical storage system while the collection description remains on the original system.

·        Dynamic creation of data objects.  Many software systems rely upon object oriented programming to define the procedures that can be applied to a particular class of objects.  It is possible to represent the requirements needed for a data set to qualify as a member of an object class through an XML DTD.  Note that a data set has its own associated DTD to describe its structure and context.  To turn a data set into an object belonging to a particular class then requires the ability to map between the DTD of the data set and the DTD of the object class.  The result is the ability to manage the application of procedures to arbitrary data sets through a common information model that describes both data sets and procedures.
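The mapping between DTDs that underlies both federated queries and object construction can be sketched as a renaming of attributes into a shared vocabulary.  In the sketch below the collection names, attribute names, and the required attributes of the object class are all hypothetical; a real system would derive the mappings from the published DTDs rather than from hand-written dictionaries.

```python
# Hypothetical mappings from each collection's own attribute names
# to a shared query vocabulary.
MAPPINGS = {
    "collection_a": {"species": "organism", "res_um": "resolution"},
    "collection_b": {"organism_name": "organism", "resolution": "resolution"},
}

# Hypothetical catalog contents, expressed in each collection's own schema.
COLLECTIONS = {
    "collection_a": [{"species": "mouse", "res_um": 5}],
    "collection_b": [
        {"organism_name": "mouse", "resolution": 10},
        {"organism_name": "rat", "resolution": 5},
    ],
}

# Attributes a data set must supply to qualify as a member of an object class.
OBJECT_CLASS_REQUIRED = {"organism", "resolution"}


def to_common(record, mapping):
    """Rename the attributes of one record into the shared vocabulary."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}


def federated_query(attribute, value):
    """Run one query across every collection by way of its mapping."""
    hits = []
    for name, records in COLLECTIONS.items():
        for record in records:
            common = to_common(record, MAPPINGS[name])
            if common.get(attribute) == value:
                hits.append((name, common))
    return hits


def qualifies(record, mapping):
    """True if the mapped record supplies every attribute the class requires."""
    return OBJECT_CLASS_REQUIRED <= set(to_common(record, mapping))


mouse_hits = federated_query("organism", "mouse")
```

The same mapping step serves both purposes: a query is translated into each collection's terms, and a data set is accepted as an object only if its mapped attributes cover what the class requires.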

The access and management of scientific data collections can then be reduced to the ability to manipulate the associated information model.  This is an active area of research. Examples include:

·        XML Matching And Structuring language.  The XMAS system under development at UCSD provides the ability to issue queries by specifying attributes relative to the XML DTD associated with the collection.  The open research issue is which types of operations on DTDs are appropriate for scientific data sets.

·        Mediation of Information using XML.  The MIX system, also under development at UCSD, provides mediators that transform the information that can be returned from an information resource into an XML DTD.  Research issues include the ability to wrap arbitrarily complex information resources such as Web pages and Geographic Information Systems.

·        Object constructors.  The manipulation of a data set requires the ability to transform the data set into the structure specified by an object class DTD.  Generalizations of scientific data set DTDs are needed to minimize the number of constructors needed to manipulate scientific data.  This effort is currently driven separately within each scientific discipline.  A cross-discipline effort is needed to develop a generic infrastructure.

The implementation of information management technology needs to build upon the information models and manipulation abilities that are coming from the Digital Library community, and the remote data access and procedure execution support that is coming from the distributed computing community.  The Data Access Working Group of the Grid Forum is promoting the development of standard implementation practices for the construction of data grids.  Data grids are inherently distributed systems that tie together data, compute, and visualization resources.  Researchers rely on the data grid to support all aspects of information management and data manipulation.  An end-to-end system provides support for:

·        Information discovery – ability to query across multiple information repositories to identify data sets of interest

·        Data handling – ability to read data from a remote site for use within an application

·        Remote processing – ability to filter or subset a data set before transmission over the network

·        Publication – ability to add data sets to collections for use by other researchers

·        Analysis – ability to use data in scientific simulations, or for data mining, or for creation of new data collections

These services are implemented as middleware that hides the complexity of the diverse, distributed, heterogeneous resources that comprise data and compute grids.  The services provide four key functionalities, or transparencies, that simplify access to distributed heterogeneous systems.

·        Name transparency – Unique names for data sets are needed to guarantee a specific data set can be found and retrieved.  However, it is not possible to know the unique name of every data set that can be accessed within a data grid (possibly billions of objects).  Attribute-based access is used so that any data set can be identified by UNIX system compatible attributes, by Dublin Core provenance attributes, or by user-specified attributes.  Information discovery systems support queries against attributes maintained in information discovery catalogs.  Information catalogs are as diverse as LDAP directories, object-relational databases, and even UNIX flat files.

·        Location transparency – Given the identification of a desired data set, a data handling system manages interactions with the possibly remote data set.  The actual location of the data set can be maintained as part of the UNIX system level attributes.  This makes it possible to automate remote data access.  Some data handling systems use URLs to define the data set location.  Other data handling systems use a combination of IP address and naming conventions appropriate to the storage resource.  When data sets are replicated across multiple sites, attribute-based access is essential to allow the data handling system to retrieve the “closest” copy.

·        Protocol transparency – Data grids provide access to heterogeneous data resources, including file systems, databases, and archives.  The user of the data grid is provided a common interface through the Data Model management system.  The data handling system can use attributes stored in the information discovery catalog to determine the particular access protocol required to retrieve the desired data set.  For heterogeneous systems, servers can be installed on each storage resource to automate the protocol conversion.  Then an application can access objects stored in a database or in an archive through the same user interface.

·        Time transparency – Five mechanisms are typically used to optimize retrieval time: data caching, data replication, data aggregation, parallel I/O, and remote data filtering.  Data caching can be automated by having the data handling system pull data from the remote archive to a local data cache.  Data replication across multiple storage resources can be used to minimize wide area network traffic.  Data aggregation through the use of containers can be used to minimize the number of times data must be remotely accessed.  Use of parallel I/O can minimize the time needed to transfer a large data set.  Remote data filtering minimizes the amount of data that must be moved.  This latter capability requires the ability to support remote procedure execution at the storage resource.

Data grids are implemented through the integration of data resources (archives, databases, file systems) by data handling systems.  A data access architecture is shown in Figure 1.  The major components are the data model management software, which supports application access to a data set; the data handling system, which retrieves the data set from a remote storage system; the information discovery system, which identifies the data set; and the remote procedure execution environment, which filters or otherwise processes the data set before it is transmitted to the application.

An extension to the above “transparency” components is the addition of uniform interfaces for accessing the data handling system and the storage resources.  By providing an API “glue” to the underlying data handling systems (authentication of users, authorization for access, I/O and cache scheduling, quality of service guarantees, data format conversion, location control, and information discovery), it is possible to concatenate data handling systems.  Through the common API, any of the data handling systems will be able to access data stored within another data grid.  Similarly, by providing a common API “glue” for accessing storage resources, any data handling system will be able to use any of the multiple storage resources that are now available.  APIs are also needed for a standard interface to the information discovery environment, and to the remote procedure execution environment.
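The idea of a common API "glue" for storage resources can be sketched with an abstract interface that each driver implements.  The class and method names below are hypothetical, the drivers keep data in memory rather than talking to real storage, and authorization is reduced to a simple user list; the sketch is meant only to show that a data handling system written against the interface can use any resource that implements it.

```python
from abc import ABC, abstractmethod


class StorageResource(ABC):
    """Common interface that every storage resource driver implements."""

    @abstractmethod
    def read(self, path: str) -> bytes: ...

    @abstractmethod
    def write(self, path: str, data: bytes) -> None: ...


class FileSystemDriver(StorageResource):
    """Stand-in for a file system; keeps data in a dict for illustration."""

    def __init__(self):
        self._files = {}

    def read(self, path):
        return self._files[path]

    def write(self, path, data):
        self._files[path] = data


class ArchiveDriver(StorageResource):
    """Stand-in for an archive; records a staging request before each read."""

    def __init__(self):
        self._tapes = {}
        self.staged = []

    def read(self, path):
        self.staged.append(path)  # simulate staging from tape to disk
        return self._tapes[path]

    def write(self, path, data):
        self._tapes[path] = data


class DataHandlingSystem:
    """Uses only the StorageResource interface, never a concrete driver."""

    def __init__(self, resources, authorized_users):
        self.resources = resources
        self.authorized_users = set(authorized_users)

    def _check(self, user):
        if user not in self.authorized_users:
            raise PermissionError(f"{user} is not authorized")

    def put(self, user, resource, path, data):
        self._check(user)
        self.resources[resource].write(path, data)

    def get(self, user, resource, path):
        self._check(user)
        return self.resources[resource].read(path)


dhs = DataHandlingSystem(
    {"disk": FileSystemDriver(), "archive": ArchiveDriver()},
    authorized_users=["alice"],
)
dhs.put("alice", "disk", "/a.dat", b"abc")
dhs.put("alice", "archive", "/b.dat", b"xyz")
```

Adding a new kind of storage resource then requires only a new driver; the data handling system and its callers are unchanged.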

  
Figure 1.  Data Access Architecture for Open Grid Architecture

For a given discipline, multiple data collections can be federated into a data grid, promoting the interchange of information between researchers with common interests.  Data grids are inherently hierarchical.  As shown in Figure 2, individual data collections can be federated to form an overarching digital library.  Multiple digital libraries in turn can be federated through finding aids for use by the library community, such as the University of California, California Digital Library.

In this example from the National Partnership for Advanced Computational Infrastructure, four separate data collections are federated to form a Neuroscience Digital Library.  Each of the collections is supported by a separate site, where the expertise resides for the formation of the particular collection.  The sites are linked by wide-area networks.  The Neuroscience (NS), Earth Systems Science (ESS), and Molecular Structures (MS) digital libraries are in turn federated for access from the California Digital Library.  Each of the discipline-specific digital libraries can be thought of as a data grid, with services designed to support the analysis requirements of the corresponding community.  The data grids are in turn federated through the use of finding aids that provide an information discovery system able to search across all collections simultaneously.
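The hierarchy described above can be sketched as nested federations that forward a search to their members.  The collection names and titles below are placeholders rather than the actual NPACI collections, and a simple substring match stands in for a real information discovery query.

```python
class Collection:
    """Leaf of the hierarchy: one data collection held at one site."""

    def __init__(self, name, titles):
        self.name = name
        self.titles = titles

    def search(self, term):
        term = term.lower()
        return [(self.name, t) for t in self.titles if term in t.lower()]


class Federation:
    """A digital library or finding aid: forwards searches to its members."""

    def __init__(self, name, members):
        self.name = name
        self.members = members

    def search(self, term):
        hits = []
        for member in self.members:
            for source, title in member.search(term):
                # Prefix the source so each hit records its path in the hierarchy.
                hits.append((f"{self.name}/{source}", title))
        return hits


# Placeholder collections federated into discipline-level digital libraries.
ns = Federation("NS", [
    Collection("collection-1", ["Mouse brain atlas", "Neuron morphology"]),
    Collection("collection-2", ["Brain imaging series"]),
])
ess = Federation("ESS", [Collection("collection-3", ["Ocean temperature grid"])])

# The finding aid federates the digital libraries in turn.
finding_aid = Federation("CDL", [ns, ess])
brain_hits = finding_aid.search("brain")
```

Because a federation exposes the same `search` call as a collection, the hierarchy can be extended by another level without changing the members beneath it.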

This is an excellent opportunity for integration of technology between the digital library community and the data grid community. An emerging protocol to support information discovery is the Stanford Simple Digital Library Interoperability Protocol (SDLIP).  This provides a common interface for information discovery across heterogeneous information catalogs.

   

 

Figure 2.  Hierarchical Data Grids