Introduction

Scientific digital libraries and information services are now commonly used for scientific research. A few short years ago, a scientist could bookmark a few favorite URLs and have access to most of the data relevant to her work that was available in electronic form. Today, it is increasingly hard to answer the question, "what data and information exist about...?" The difficulty lies not just in the sheer volume of data now available, but in the variety of data and in the number of repositories where they might be found. Thus, more attention is now being paid to issues of interoperability and federating resources.

One can imagine varying degrees of database interoperability and/or federation, each raising its own technological issues. Certain federating projects, driven by specific applications (e.g. the Digital Sky, EOSDIS), are working towards a high level of interoperability, which raises such issues as data formats and high-throughput data transfer. This position paper focuses specifically on data discovery and light-weight exchange. I consider this the first order of interoperability. It is most applicable to an aggregate of data repositories without strong motivation (or support) to interoperate with each other. That is, many repositories do not have the resources to put into supporting interoperability and instead must mainly worry about serving their own narrow community of users. However, these data providers might be willing to support such first-order interoperability in order to enable cross-database searching, scalable cross-referencing, and enhanced browsing tools, provided that some basic standards (accompanied by software) existed which could be deployed easily, piecemeal, and at low cost.

This paper summarizes what I believe are the basic components required to achieve this first-order interoperability. Clearly, the common thread through all the components is the issue of standard metadata.
I suggest a hierarchical, community-based approach to defining standard metadata that attempts to allow for wider interoperability across broader communities. Multi-data-center collaborations may be a good model for developing and deploying such standards; collaborations that span the Atlantic could improve the likelihood that the standards are useful and widely deployed. An example of such a collaboration is Project ISAIA (funded via a NASA AISRP grant): it brings together about a dozen US and European data centers to develop such standards for the broad community of Space Science.

It is worth noting that realizing this first-order interoperability does not require much in the way of new technology. Nevertheless, it represents a necessary infrastructure for tackling advanced research topics, such as data mining, knowledge clustering, scalable and automated semantics, and techniques for data synthesis and network-based analysis. More importantly, interoperability standards are not widely deployed, even though the technology needed to do so exists; given the limited resources available to most data providers, standards that can be deployed across discipline boundaries will not be developed without specific funding to do so.

Motivation: Provide a Little and Get a Lot

Any data provider will look at an interoperability standard and ask, "what do I get for supporting the standard?" Often what the provider desires is twofold: first, the standard allows the provider to offer expanded services to the data center's usual users, and second, it brings new users to the data center. However, for a provider to support the standard, the ratio of expected benefit to cost must be fairly high (some talk of zero-buy-in). A first-order set of interoperability standards for data discovery and exchange could enable the following capabilities:
cross-discipline research (??)

Necessary Components

This section describes the components of a framework for interoperable data discovery and information exchange. A common thread through each component is the notion of standard metadata. While it would be highly desirable to use the same basic metadata standard throughout the system, this may not happen in practice if the components are developed independently.
It's interesting to note that using XML makes it less important (in principle) that there be a single standard for a structured record syntax: XSL provides a simple mechanism for translating from one DTD to another.
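To make the idea of schema-to-schema translation concrete, here is a minimal sketch in Python. It is a stand-in for what an XSL stylesheet would do and is not XSLT itself: it only renames elements according to a lookup table. Both vocabularies, the `ELEMENT_MAP` table, and the `translate` helper are hypothetical names invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping between two record vocabularies.
# The element names on both sides are invented for illustration.
ELEMENT_MAP = {
    "record": "resource",
    "name": "title",
    "author": "creator",
    "keywords": "subject",
}


def translate(source_xml: str, element_map: dict) -> str:
    """Rename elements according to element_map; unmapped elements keep their names."""
    root = ET.fromstring(source_xml)
    for element in root.iter():  # iter() visits the root and all descendants
        element.tag = element_map.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


source = (
    "<record>"
    "<name>Sky Survey Catalog</name>"
    "<author>Example Data Center</author>"
    "</record>"
)
translated = translate(source, ELEMENT_MAP)
print(translated)
```

A real translation between DTDs would usually also need to restructure content, not just rename elements, which is where a full transformation language such as XSL becomes useful.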
Note that the definition of a search profile should be independent of both the syntax used to express the query and the protocol used to deliver it.
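One way to picture this separation is to hold a query as plain data and serialize it only at the last moment. The sketch below, in Python, assumes a hypothetical `Query` structure, a hypothetical profile name and field names, and the placeholder address `http://example.org/search`; none of these come from an existing standard. The same query is rendered once as URL parameters and once as XML.

```python
from dataclasses import dataclass
from urllib.parse import urlencode
import xml.etree.ElementTree as ET


@dataclass(frozen=True)
class Query:
    """A query expressed only in terms of a search profile's fields."""
    profile: str          # hypothetical profile name
    constraints: tuple    # (field, value) pairs; field names come from the profile


def to_url(query: Query, base_url: str) -> str:
    """One possible syntax: URL parameters, e.g. for delivery over HTTP GET."""
    params = [("profile", query.profile), *query.constraints]
    return base_url + "?" + urlencode(params)


def to_xml(query: Query) -> str:
    """Another possible syntax: an XML document carrying the same query."""
    root = ET.Element("query", {"profile": query.profile})
    for field_name, value in query.constraints:
        ET.SubElement(root, "constraint", {"field": field_name}).text = value
    return ET.tostring(root, encoding="unicode")


query = Query(
    profile="space-science-general",
    constraints=(("subject", "quasars"), ("instrument", "VLA")),
)
print(to_url(query, "http://example.org/search"))
print(to_xml(query))
```

Because the profile fixes only the field names and their meaning, a new syntax or delivery protocol can be supported by adding another serializer without touching the profile definition.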
Protocols:
Syntax:
The directory service can be used by gateway systems to determine where and how cross-database queries should be sent.

The Cost to Data Providers

This architecture allows a variety of different search services to be built using these standards, such as gateway search engines and intelligent agents. A data repository could participate in such services by:
Experience shows that for repositories without a strong mandate to support interoperable services, the cost and effort must be extremely low. This means that easy-to-use software tools that implement the standards and require a minimum of configuration must be readily available. A good example is Astrobrowse, which provides a Web form for registering with its directory: in a few minutes, an astronomy data provider can describe its collection and how to query it via a URL. The NCSA Emerge system allows a provider to easily connect a Z39.50 interface to a database via an XML-based configuration file. Note that limited forms of interoperability can be enabled by doing any one of these steps. This could be important in practice, as it allows providers to participate piecemeal, enabling additional support as time and money allow and as they experience more of the benefits.

Metadata for Interoperability

Defining metadata for interoperability between diverse resources brings up two issues:
One approach that can address these issues is to break the metadata standard into small pieces. Dividing the standard by discipline or community is perhaps the most natural way to do so; however, certain classes of metadata (e.g. bibliographic) may span many communities. Project ISAIA is examining this idea by recognizing that its broad community, space science, can be considered a hierarchy of sub-communities (e.g. astronomy, planetary science, and space physics); it is natural then that the division of metadata standards should be hierarchical in order to represent varying levels of detail. For example, a general profile could be used for searching across all of space science, while sub-profiles for the component communities would be defined for searches within a narrower disciplinary range.

This hierarchical, community-based approach has several advantages. First, each sub-schema can be kept small (fewer than 10 recommended). Responsibility for maintaining and evolving a schema could be left to the community it serves. Individual data providers can then choose which schemas they will use, depending on the level of interoperability they can afford to support. For the architecture described above, the metadata used in search profiles is most relevant to data providers. Providers can register the profiles they support with the directory service; gateways and agents can use this information to route search queries intelligently. Furthermore, there is nothing barring a repository from supporting a schema that might be considered outside its discipline if it can at least in part be applied to its resources; doing so would further aid cross-discipline research. Finally, we can imagine smaller collaborations of repositories defining more detailed sub-schemas to serve special needs of the collaboration; if such a sub-schema conforms to the general framework, then it could be used by clients outside the collaboration.
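The interplay between a profile hierarchy and the directory service can be sketched as follows, in Python. The profile names, repository names, and the `Directory` class are hypothetical. The fallback rule is an assumption of this sketch, not something specified above: a repository that does not support a sub-profile is still sent the query at the most specific ancestor profile it does support.

```python
# Hypothetical profile hierarchy, stored as child -> parent.
PROFILE_PARENT = {
    "space-science": None,
    "astronomy": "space-science",
    "planetary-science": "space-science",
    "space-physics": "space-science",
}


def ancestors(profile: str) -> list:
    """Return the profile followed by each of its ancestors, most specific first."""
    chain = []
    while profile is not None:
        chain.append(profile)
        profile = PROFILE_PARENT[profile]
    return chain


class Directory:
    """Toy directory service: records which profiles each repository supports."""

    def __init__(self):
        self._supported = {}  # repository name -> set of profile names

    def register(self, repository: str, profiles: set) -> None:
        self._supported.setdefault(repository, set()).update(profiles)

    def route(self, profile: str) -> dict:
        """Map each reachable repository to the most specific profile it supports.

        Assumption of this sketch: fall back to an ancestor profile when the
        repository does not support the requested one.
        """
        plan = {}
        for repository, supported in self._supported.items():
            for candidate in ancestors(profile):
                if candidate in supported:
                    plan[repository] = candidate
                    break
        return plan


directory = Directory()
directory.register("archive-a", {"space-science", "astronomy"})
directory.register("archive-b", {"space-science"})
directory.register("archive-c", {"planetary-science"})

print(directory.route("astronomy"))
print(directory.route("space-science"))
```

In this sketch a gateway asking for the astronomy profile reaches the first archive at full detail and the second at the general level, and it skips the third, which registered only an unrelated sub-profile.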
We can imagine extending this concept to broader communities, such as to all of science or to the entire community of digital information users. A very good example of a top-level metadata schema is the Dublin Core. One of the nice features of its definition is that it is syntax- and protocol-independent. It provides a good starting point that communities can build onto "from underneath." The W3C's Resource Description Framework (RDF) is expected to provide an approach to metadata definition that encourages interoperability across diverse applications. Use of XML will also encourage broader interoperability: the use of namespaces will allow one to mix schemas together in a single query or response, and XSL allows for easy translation between schemas.

Conclusion: Call for Research and Development

Many very nice schemes and software tools for interoperability have come and gone without being widely adopted. The reasons are varied, but I want to highlight a few:
Multi-data-center collaborations for developing interoperability standards can address these stumbling blocks. First, the data centers have a vested interest in the standards and technologies they develop. Furthermore, deployment of the standards within the collaboration could provide the critical mass necessary for wider deployment. Since scientific research is inherently an international endeavor, international collaborations will further encourage wide adoption of standards and discourage the "not-developed-here" syndrome. Much of the architecture described in this paper requires more development than research (and much of the development is sociological). However, first-order interoperability provides a foundation not only for advanced research in information technology but also for a second-order interoperability: high-performance, distributed computing.