Interoperability Standards for Data Discovery and Information Exchange
Raymond Plante, NCSA

Introduction

Scientific digital libraries and information services are now commonly used for scientific research. A few short years ago, a scientist could bookmark a few favorite URLs and have access to most of the data relevant to her work that was available in electronic form. Today, it is increasingly hard to answer the question, "What data and information exist about...?" The difficulty lies not just in the sheer volume of data now available, but also in the variety of data and in the number of repositories where they might be found. Thus, more attention is now being paid to issues of interoperability and federating resources.

One can imagine varying degrees of database interoperability and/or federation that bring up a variety of technological issues. Certain federating projects, driven by specific applications (e.g. the Digital Sky, EOSDIS), are working towards a high level of interoperability, which brings up such issues as data formats and high-throughput data transfer. This position paper focuses specifically on data discovery and light-weight exchange. I consider this the first order of interoperability. It is most applicable to an aggregate of data repositories without strong motivation (or support) to interoperate with each other. That is, many repositories do not have the resources to put into supporting interoperability and instead must mainly worry about serving their own narrow community of users. However, these data providers might be willing to support such first-order interoperability in order to enable cross-database searching, scalable cross-referencing, and enhanced browsing tools, provided that some basic standards (accompanied by software) existed which could be deployed easily, piecemeal, and at low cost.

This paper summarizes what I believe are the basic components required to achieve this first-order interoperability. Clearly, the common thread through all the components is the issue of standard metadata. I suggest a hierarchical, community-based approach to defining standard metadata that attempts to allow for wider interoperability across broader communities. Multi-data-center collaborations may be a good model for developing and deploying such standards; collaborations that span the Atlantic could improve the likelihood that the standards are useful and widely deployed. An example of such a collaboration is Project ISAIA (being funded via a NASA AISRP grant): it brings together about a dozen US and European data centers to develop such standards for the broad community of Space Science.

It's worth noting that realizing this first-order interoperability does not require much in the way of new technology. Nevertheless, it represents a necessary infrastructure for tackling advanced research topics, such as data mining, knowledge clustering, scalable and automated semantics, and techniques for data synthesis and network-based analysis. More importantly, interoperability standards are not widely deployed, even though the technology needed to do so already exists; given the limited resources available to most data providers, standards that can be deployed across discipline boundaries will not be developed without specific funding to do so.

Motivation: Provide a Little and Get a Lot

Any data provider will look at an interoperability standard and ask, "What do I get for supporting the standard?" Often what the provider desires is twofold: first, that it allow the data center to offer expanded services to its usual users, and second, that it bring new users to the data center. However, for a provider to support the standard, the ratio of expected benefit to cost must be fairly high (some talk of zero-buy-in).

A first-order set of interoperability standards for data discovery and exchange could enable the following capabilities:
a scalable policy for data exchange would allow the creation of data synthesis services...
cross-linking related data objects between data centers
sophisticated tools for browsing/visualizing/analyzing data from diverse sources (NASA ADC Viewer, CatsEye).
searching across multiple databases, via
    central gateways (Astrobrowse, NCSA Emerge)
    intelligent agents

cross-discipline research (??)
Cross-discipline research, in which a scientist takes a broad step out of her field of study to incorporate new sources of data and information, brings up important philosophical questions: how practical is it to enable, and how useful will it really be? Often this concept appeals more to the technologist than the down-in-the-trenches scientist.

Necessary Components

This section describes the components of a framework for interoperable data discovery and information exchange. A common thread through each component is the notion of standard metadata. While it would be highly desirable to use the same basic metadata standard throughout the system, this may not happen in practice if the components are developed independently.

structured record format:
a machine-readable format for exchanging data and metadata.

The eXtensible Markup Language (XML) provides the most promising mechanisms for defining such a format. Example DTDs:
Extensible Scientific Interchange Language (XSIL) (from CACR) -- flexible, general purpose, discipline-independent.
Astronomical Markup Language (AML) (D. Guillaume) -- describes a variety of astronomical data objects (images, catalogs, articles, people).
Astronomical catalog and query results -- a draft DTD resulting from a workshop held in Strasbourg, "From Information to Knowledge Using Astronomical Databases".

It's interesting to note that using XML makes it less important (in principle) that there be a single standard for a structured record syntax: XSL provides a simple mechanism for translating from one DTD to another.
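
To make the idea of a structured record and of translating between record vocabularies concrete, here is a minimal sketch in Python. It does not use any of the DTDs listed above, and it does not use XSL: the element names (`record`, `entry`, `ra`, `rightAscension`, and so on) are invented for illustration, and the `translate` function is only a stand-in for the kind of renaming an XSL stylesheet could express.

```python
import xml.etree.ElementTree as ET


def to_xml(record, root_tag="record"):
    """Serialize a flat dict as XML, one child element per metadata field."""
    root = ET.Element(root_tag)
    for name, value in record.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def translate(xml_text, tag_map):
    """Rename elements according to tag_map; unknown tags pass through."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():  # iter() visits the root element as well
        elem.tag = tag_map.get(elem.tag, elem.tag)
    return ET.tostring(root, encoding="unicode")


# Hypothetical record in one vocabulary...
source = to_xml({"title": "M31 optical image", "ra": "10.68", "dec": "41.27"})

# ...translated into a second, equally hypothetical vocabulary.
target = translate(
    source,
    {"record": "entry", "ra": "rightAscension", "dec": "declination"},
)
```

A real deployment would more likely keep the mapping in a stylesheet so that it can be maintained without touching code; the dictionary here plays that role only for the purposes of the sketch.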

search profiles:
the concepts that can be used in a search query that is to be distributed to multiple databases. This includes:
searchable metadata
operators for comparing metadata against test values
hints for interpreting the metadata

It should be noted that the definition of a search profile should be independent of the syntax used to express the query and the protocol used to deliver it.
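
As an illustration of a search profile that is independent of query syntax and protocol, the following Python sketch models a profile as a set of concepts, each with its allowed operators and an interpretation hint. The class names, the concept names, and the operator names are all assumptions made for this example; they are not taken from any existing profile.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """One searchable metadata concept in a profile."""
    name: str
    operators: tuple  # comparison operators allowed for this concept
    hint: str = ""    # free-text hint for interpreting test values


@dataclass
class SearchProfile:
    name: str
    concepts: dict = field(default_factory=dict)

    def add(self, concept):
        self.concepts[concept.name] = concept

    def check(self, constraints):
        """Return a list of problems found in (concept, operator, value) triples."""
        problems = []
        for concept, op, _value in constraints:
            known = self.concepts.get(concept)
            if known is None:
                problems.append(f"unknown concept: {concept}")
            elif op not in known.operators:
                problems.append(f"operator {op!r} not allowed for {concept}")
        return problems


# A hypothetical two-concept profile.
profile = SearchProfile("space-science")
profile.add(Concept("title", ("contains",), "free text"))
profile.add(Concept("position", ("within",), "RA and Dec in decimal degrees, plus a radius"))

query = [
    ("title", "contains", "galaxy"),
    ("position", "equals", "10.68 41.27"),
    ("author", "contains", "Plante"),
]
problems = profile.check(query)
```

Nothing in the profile says how the query is written down or delivered; the same triples could be expressed in a URL, in an XML document, or in a Z39.50 request.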

search protocols and syntax:
the implementation of a search profile via a means for expressing and delivering a query.

Protocols:
HTTP: very flexible, ubiquitous
Z39.50: inflexible, a digital library (DL) standard

Syntax:
Astronomical Server URL (ASU) -- astronomy specific; combines syntax, protocol, and profile
Astrobrowse: a gateway system with configurable URL syntax, for astronomy
NCSA Emerge: a gateway system with configurable syntax, supporting multiple protocols, discipline-independent. It also uses an XML-based query language
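
As a sketch of how a query might be given a concrete syntax and delivered over HTTP, the Python fragment below encodes (concept, operator, value) constraints into a URL query string. The `concept.operator=value` parameter convention and the `archive.example.org` address are invented for this illustration; this is not the ASU syntax, nor the syntax used by Astrobrowse or NCSA Emerge.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_query_url(base_url, constraints):
    """Encode (concept, operator, value) triples as URL query parameters."""
    params = [(f"{concept}.{op}", value) for concept, op, value in constraints]
    return f"{base_url}?{urlencode(params)}"


url = build_query_url(
    "http://archive.example.org/search",  # hypothetical repository endpoint
    [("title", "contains", "spiral galaxy"), ("coverage", "after", "1995")],
)

# A receiving service can recover the constraints from the query string.
received = parse_qs(urlsplit(url).query)
```

Because the constraints are plain data, the same list could be handed to a different encoder for another protocol without changing the profile it was written against.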

repository directory service:
a directory providing broad descriptions of data repositories and the data they serve. The directory must support non-interactive queries by automated systems. Examples:
LDAP -- widely used in the commercial sector, requires a metadata framework (e.g. Globus)
GLU (from CDS) -- used as a directory service by Astrobrowse, AstroGLU.

The directory service can be used by gateway systems to determine where and how cross-database queries should be sent.
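
The routing role of the directory can be sketched in a few lines of Python. The example below assumes an in-memory list rather than LDAP or GLU, and assumes that each entry records a query address and the names of the search profiles the repository supports; the repository names, URLs, and profile names are all hypothetical.

```python
# Hypothetical directory entries; a real service would be queried over the network.
DIRECTORY = [
    {
        "name": "Archive A",
        "query_url": "http://a.example.org/search",
        "profiles": {"space-science", "astronomy"},
    },
    {
        "name": "Archive B",
        "query_url": "http://b.example.org/query",
        "profiles": {"space-science", "planetary-science"},
    },
    {
        "name": "Library C",
        "query_url": "http://c.example.org/find",
        "profiles": {"bibliographic"},
    },
]


def find_targets(directory, profile):
    """Return (name, query_url) for every repository supporting the profile."""
    return [
        (entry["name"], entry["query_url"])
        for entry in directory
        if profile in entry["profiles"]
    ]


astronomy_targets = find_targets(DIRECTORY, "astronomy")
broad_targets = find_targets(DIRECTORY, "space-science")
```

A gateway would call something like `find_targets` before fanning a query out, so that repositories which cannot interpret the profile are never contacted.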

The Cost to Data Providers

This architecture allows a variety of different search services to be built using these standards, such as gateway search engines and intelligent agents. A data repository could participate in such services by:

  1. supporting one or more search profiles using a standard protocol and syntax. This usually involves mapping generalized concepts (e.g. title, position, coverage) to the specific metadata entities used by the database.
  2. supporting a structured record format for emitting database result records.
  3. submitting a description of the repository to a directory service.

Experience shows that for repositories without a strong mandate to support interoperable services, the cost and effort must be extremely low. This means that easy-to-use software tools that implement the standards and require a minimum of configuration must be readily available. A good example is Astrobrowse, which provides a Web form for registering with its directory: in a few minutes, an astronomy data provider can describe its collection and how to query it via a URL. The NCSA Emerge system allows a provider to easily connect a Z39.50 interface to a database via an XML-based configuration file.

Note that limited forms of interoperability can be enabled by doing any one of these steps. This could be important, in practice, as it allows a provider to participate piecemeal, enabling additional support as time and money allow and as they experience more of the benefits.
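
The first step in the list above, mapping generalized concepts onto a database's own metadata, can be sketched as a small lookup table. In this Python example the column names (`obs_title`, `target_coords`, `obs_date`) are invented, and the decision to report unsupported concepts rather than reject the whole query is an assumption of the sketch, not something the standards discussed here prescribe.

```python
# Hypothetical mapping from profile concepts to one repository's column names.
CONCEPT_MAP = {
    "title": "obs_title",
    "position": "target_coords",
    "coverage": "obs_date",
}


def to_local(constraints, concept_map):
    """Split constraints into locally supported ones and ignored concept names."""
    supported, ignored = [], []
    for concept, op, value in constraints:
        column = concept_map.get(concept)
        if column is None:
            ignored.append(concept)
        else:
            supported.append((column, op, value))
    return supported, ignored


supported, ignored = to_local(
    [("title", "contains", "galaxy"), ("author", "contains", "Plante")],
    CONCEPT_MAP,
)
```

A provider participating piecemeal could start with a map of two or three concepts and extend it later without changing anything else.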

Metadata for Interoperability

Defining metadata for interoperability between diverse resources brings up two issues:

  1. The role of metadata is to bring diverse resources together by capitalizing on their commonalities; by definition, details are glossed over. Some of the ramifications are:
       the broader the community the metadata is meant to serve, the less detailed it can be.
       how one repository applies a schema to its resources may be somewhat different from how another repository does.
       many details of a repository may not be visible at an interoperable level.
     Thus, it is important that a first-order framework allow the user to "drill down" to the repository-specific information and services.
  2. Bringing together a broad community can lead to large metadata schemas, and large schemas are difficult to maintain and intimidating to implement. Examples with Z39.50 search profiles:
       BIB-1: > 100 concepts
       GEO-1: > 300 concepts

One approach that can address these issues is to break the metadata standard into small pieces. Dividing it by discipline or community is perhaps the most natural way to do so; however, certain classes of metadata (e.g. bibliographic) may span many communities. Project ISAIA is examining this idea by recognizing that its broad community, space science, can be considered a hierarchy of sub-communities (e.g. astronomy, planetary science, and space physics); it is natural then that the division of metadata standards should be hierarchical in order to represent varying levels of detail. For example, a general profile could be used for searching across all of space science, while sub-profiles for the component communities would be defined for searches within a narrower disciplinary range.

This hierarchical, community-based approach has several advantages. First, each sub-schema can be kept small (fewer than 10 concepts recommended). Responsibility for maintaining and evolving the schema could be left to the community it serves. Individual data providers can then choose which schemas they will use, depending on the level of interoperability they can afford to support. For the architecture described above, the metadata used in search profiles is most relevant to data providers. Which profiles they support can be registered with the directory service; this information can be used by gateways and agents for intelligently routing search queries. Furthermore, there is nothing barring a repository from supporting a schema that might be considered outside its discipline if it can at least in part be applied to its resources; doing so would further aid cross-discipline research. Finally, we can imagine smaller collaborations of repositories defining more detailed sub-schemas to serve special needs of the collaboration; if such a sub-schema conforms to the general framework, then it could be used by clients outside the collaboration.
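
One way to picture the hierarchy is as a chain of small profiles, each adding a few concepts to those of its parent. The Python sketch below assumes that a sub-profile inherits every concept of the profile above it; that inheritance rule, like the concept names, is an assumption of this example rather than a decision taken by Project ISAIA.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Profile:
    name: str
    concepts: frozenset                 # concepts defined at this level only
    parent: Optional["Profile"] = None  # broader community profile, if any

    def all_concepts(self):
        """Concepts usable at this level: the profile's own plus inherited ones."""
        inherited = self.parent.all_concepts() if self.parent else frozenset()
        return inherited | self.concepts

    def lineage(self):
        """Profile names from this level up to the broadest community."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


# Hypothetical two-level hierarchy.
space_science = Profile("space-science", frozenset({"title", "position", "coverage"}))
astronomy = Profile("astronomy", frozenset({"waveband", "object_type"}), parent=space_science)

astronomy_concepts = astronomy.all_concepts()
astronomy_lineage = astronomy.lineage()
```

Each level stays small on its own, yet a client working within astronomy still sees the general space science concepts.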

We can imagine extending this concept to broader communities, such as to all of science or to the entire community of digital information users. A very good example of a top-level metadata schema is the Dublin Core. One of the nice features of its definition is that it is syntax- and protocol-independent. It provides a good starting point that communities can build onto "from underneath." The W3C's Resource Description Framework (RDF) is expected to provide an approach to metadata definition that encourages interoperability across diverse applications. Use of XML will also encourage broader interoperability: the use of namespaces will allow one to mix schemas together in a single query or response, and XSL allows for easy translation between schemas.
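
The point about namespaces can be illustrated with a short Python sketch that mixes Dublin Core elements with elements from a second, community-specific schema in one record. The Dublin Core namespace URI is the published one; the `astro` namespace, its `waveband` element, and the enclosing `record` element are hypothetical.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ASTRO = "http://example.org/astro#"      # hypothetical community schema

# Ask ElementTree to use readable prefixes when serializing.
ET.register_namespace("dc", DC)
ET.register_namespace("astro", ASTRO)

record = ET.Element("record")
ET.SubElement(record, f"{{{DC}}}title").text = "M31 optical image"
ET.SubElement(record, f"{{{DC}}}creator").text = "Example Observatory"
ET.SubElement(record, f"{{{ASTRO}}}waveband").text = "optical"
xml_text = ET.tostring(record, encoding="unicode")

# A client that only understands Dublin Core can pick out just those elements.
parsed = ET.fromstring(xml_text)
prefixes = {"dc": DC, "astro": ASTRO}
title = parsed.find("dc:title", prefixes).text
waveband = parsed.find("astro:waveband", prefixes).text
```

A client unaware of the community schema can ignore the `astro` elements entirely and still make use of the general ones.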

Conclusion: Call for Research and Development

Many very nice schemes and software tools for interoperability have come and gone without being widely adopted. The reasons are varied, but I want to highlight a few:

Some technologies have been developed without sufficient involvement of those who are expected to invest in the technology (e.g. data providers, users).
Data providers do not have the time, money, and person-power to experiment with technologies not directly related to their primary mission.
The "Not-developed-here" syndrome discourages some providers from investing in technologies developed outside their sphere of influence. This syndrome is fueled in large part by the competition for funding.

Multi-data-center collaborations for developing interoperability standards can address these stumbling blocks. First, the data centers have a vested interest in the standards and technologies they develop. Furthermore, deployment of the standards within the collaboration could provide the critical mass necessary for wider deployment. Since scientific research is inherently an international endeavor, international collaborations will further encourage wide adoption of standards and discourage the "not-developed-here" syndrome.

Much of the architecture described in this paper requires more development than research (and much of the development is sociological). However, first-order interoperability provides a foundation not only for advanced research in information technology but also for a second-order interoperability: high-performance, distributed computing.