Scientific repositories can be viewed as the interplay of three main
components: data collection and management, computing infrastructures (both hardware and
software), and analysis and visualization tools. Thus, the scientific community
immediately benefits from basic research in each of these areas, which we will not address
in this paper. In fact, there are already operational scientific applications combining
remote sensing, databases, computer simulations, and advanced visualization to support a
given task, e.g. real-time decision support in case of air pollution, oil spills and
forest fires. However, even though they are more advanced than their predecessors, the majority of scientific applications are still being developed in isolation from each other. Thus they are essentially legacy systems running in stand-alone mode. With the development of distributed systems and the growth of the Internet, it is now possible to go well beyond this way of building systems. Specifically, we believe that one of the biggest challenges for the years to come will be to open up scientific repositories and connect them to larger, federated infrastructures, making their resources globally available. The THETIS [1] and Poseidon [2] projects, which aim to provide a single point of access through the WWW to a collection of networked repositories containing data, programs and tools for coastal zone management applications, are steps in this direction.

We view federated Scientific Information Systems as open systems that manage different types of objects, e.g. raw scientific data, multimedia documentary information, and programs. They also manage the metadata that is needed to locate and use these objects. Hence, they are strongly related to digital libraries, federated databases, recommender systems, and Problem Solving Environments.

Features of open, federated architectures

Based on our recent experience, we believe that open architectures for federated scientific information systems should promote:

1. Resource sharing with minimal administration cost -- organizations can make their resources (data and programs) available to the federation without going through complex installation procedures, independently of the local hardware and operating system.

2. Operational autonomy and security -- participation in the federation does not affect (in an uncontrollable way) the ability of the organization to provide services to its local clients; other members of the federation cannot access resources that have not been explicitly exported.

3. Efficient resource discovery and access -- users can locate resources (data and programs) of interest quickly and accurately, and they can access them in a straightforward way, via a network-transparent interface, almost as if they were available locally.

The importance of making the system open and flexible while keeping its administration cost low must be stressed, as it significantly affects its acceptance. This is especially true for research environments, where software development and installation is primarily done by scientists, not by a dedicated task force as is the case in a business environment.

More on resource discovery and access

Although the Internet opens the way towards a global information system, merely putting one's data and programs on a server is not sufficient if they are to be made (truly!) available to a wider user community. In order for resources to be used by others they must be (a) located and understood, and (b) efficiently accessed over the network. In the following, we give a few points that play a key role in making federated systems capable of effectively supporting these tasks.

1. Metadata design. The properties of resources, be they data or programs, must be captured as accurately as possible through respective metadata descriptions. The better the metadata of a resource, the bigger the chance that it will indeed be discovered and used. There are already widely recognized standards for data sets (e.g. Dublin Core for documents and the FGDC standard for geospatial metadata). Even though metadata sets developed by separate organizations may differ, they will most probably be compatible if they have been designed to describe similar types of objects. The Warwick framework defines a containerized structure in which different standards can be combined in one metadata object, allowing different metadata sets to describe a single data or software object [4].

2. Encourage scientists to document the data and programs they produce so that these can be used by others. Support user-friendly editing of resource metadata via electronic tools, e.g. via the web.
Notably, it is impractical to describe by hand data sets that are created automatically. Whenever possible, metadata should be generated by the system that produces the corresponding data sets.

3. Automate metadata submission and the updating of the metadata registries. Distribute metadata indices over multiple machines for scalability and fault tolerance [1]. Use spider programs to fetch metadata from repositories that maintain their own (private) metadata registries. Employ search engine and information retrieval technology to search and browse metadata registries via user-friendly tools.

4. Use ontologies to augment the standards in metadata creation for thematic information, thereby providing for a common and accurate search of information. Ontologies can be formulated and managed via tools to create ontology servers (libraries) for various disciplines [5].

5. Domain-specific tasks can be defined in terms of semantic nets of ontological terms, connecting resources with each other via their metadata descriptions. These relations can be transformed into a set of rules stored in a knowledge base, which can be exploited by appropriate tools to derive workflows for data production on demand [6]. Different rules, and thus different workflows, can be produced for different user groups.

Accessing resources over the network

1. Deal with different storage formats. Although the scientific community is increasingly using databases, many data are still being kept in files (not unreasonably) in several different formats. The system must provide some kind of translation mechanism that converts data to a format specified by the user. This problem is typically addressed by maintaining a library of wrappers for the heterogeneous repositories [3].

2. Deal with different access patterns. The traditional approach of assigning each data set a fixed wrapper is not always sufficient, because different users and applications may wish to access the same data in different ways. Moreover, different access patterns could call for different client-side or server-side caching, pre-processing, compression, and pre-fetching strategies that are unknown a priori, when the wrapper is implemented. Dynamic selection and installation mechanisms for wrappers can be implemented using agent technology.

3. Support remote execution of programs. For legacy code this requires the implementation of program wrappers that intercept incoming requests and convey them to the local program. Popular mechanisms and protocols used for this purpose are CORBA and DCOM. Major redevelopment is needed, however, to make existing programs interoperable and (interactively) controllable over the network. If we are to witness significant advances in this area, scientists will have to change their paradigm, shifting from monolithic batch processing software to reactive, network-capable, service-oriented scientific modules. Agent technology can be used to introduce more flexibility regarding the wrapping and control of program execution.

4. Support execution of complex workflows. Allow the user to formulate and submit workflows, i.e. the coordinated execution of more than one program.
This requires installing a workflow runtime and providing the respective wrappers for the programs that are to be invoked. It should be possible to run a given workflow in batch or interactive mode, depending on the application.

The above tasks must be supported via appropriate interfaces. The main role of the interface is to help the user formulate metadata queries, which are forwarded to the search engine, and to display the results.
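
As a rough illustration of the discovery and access path discussed above, the following Python sketch runs a keyword query against a small in-memory metadata registry and then hands a located data set to a wrapper that converts it to the format requested by the user. It is a minimal sketch under stated assumptions, not the THETIS or Poseidon implementation: the record fields, the function names `search` and `fetch`, the `WRAPPERS` table, and the CSV-to-TSV converter are all hypothetical, and the optional `region` argument merely stands in for a map-based restriction by coordinates.

```python
import csv
import io
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# (west, south, east, north) in decimal degrees; an assumed convention.
BBox = Tuple[float, float, float, float]


@dataclass(frozen=True)
class MetadataRecord:
    """Hypothetical metadata record; fields do not follow Dublin Core or FGDC."""
    resource_id: str
    title: str
    keywords: Tuple[str, ...]
    storage_format: str
    bbox: BBox


def bbox_overlaps(a: BBox, b: BBox) -> bool:
    """Return True if the two bounding boxes intersect."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def search(registry: List[MetadataRecord], keyword: str,
           region: Optional[BBox] = None) -> List[MetadataRecord]:
    """Keyword query over the registry, optionally restricted to a region."""
    kw = keyword.lower()
    hits = [r for r in registry if kw in (k.lower() for k in r.keywords)]
    if region is not None:
        hits = [r for r in hits if bbox_overlaps(r.bbox, region)]
    return hits


def csv_to_tsv(payload: str) -> str:
    """Toy format wrapper: re-serialize CSV text as tab-separated text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(csv.reader(io.StringIO(payload)))
    return out.getvalue()


# Wrapper library keyed by (stored format, requested format).
WRAPPERS: Dict[Tuple[str, str], Callable[[str], str]] = {
    ("csv", "tsv"): csv_to_tsv,
}


def fetch(record: MetadataRecord, raw_payload: str, requested_format: str) -> str:
    """Return the payload in the requested format, using a wrapper if needed."""
    if record.storage_format == requested_format:
        return raw_payload
    key = (record.storage_format, requested_format)
    if key not in WRAPPERS:
        raise LookupError(f"no wrapper from {key[0]} to {key[1]}")
    return WRAPPERS[key](raw_payload)


if __name__ == "__main__":
    # Made-up records for illustration only.
    registry = [
        MetadataRecord("sst-crete", "Sea surface temperature near Crete",
                       ("temperature", "ocean"), "csv", (23.0, 34.5, 26.5, 36.0)),
        MetadataRecord("sst-bay", "Sea surface temperature, Massachusetts Bay",
                       ("temperature",), "csv", (-71.2, 42.0, -70.0, 42.8)),
    ]
    for rec in search(registry, "temperature", region=(20.0, 34.0, 28.0, 38.0)):
        print(rec.resource_id, repr(fetch(rec, "depth,temp\n5,21.3\n", "tsv")))
```

Keying the wrapper table by a pair of formats keeps translation separate from discovery. A real federation would presumably replace the in-memory list with distributed metadata indices and select wrappers dynamically according to the access pattern.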
Geographical maps can be used to further restrict the search space using coordinates. We have found relatively simple interfaces to work well for open scientific information systems that are not dedicated to a single, specific application. What is of great importance is to offer a single point of access to the federated information system, which can be achieved in a relatively straightforward way using WWW technology.

A comment on visualization is also in order here. Unlike resource discovery, which can be supported through one generic interface, visualization is highly application specific, so that a different interface could be needed for each case. WWW technology, through plug-ins and applets that can be loaded from within a browser at runtime, allows the user interface to change according to the page the browser points to. However, only a few visualization/analysis tools are available as dynamically loadable modules, e.g. VRML viewers, so that in most cases the user accessing a resource must have the corresponding analysis/visualization software installed on her/his machine. Incidentally, widely used GIS are quite static in that sense; data sets must be prepared (in principle manually) before they can be loaded, visualized and queried through these tools. We certainly hope that in the future GIS will be able to automatically load and visualize data based on their metadata.

Acknowledgments

Funding for the THETIS project was obtained from the Research on Telematics Program of the EU, under project number F0069. Funding for the Poseidon project was obtained in part from the US Department of Commerce (NOAA, Sea Grant) under grant NA86RG0074, the US National Ocean Partnership Program via ONR grant N00014-97-1-1018, and the MIT Department of Ocean Engineering. NATO provided travel funds for exchanges between the Poseidon and THETIS groups under grant number CRG971523.

References

[1] THETIS: A Data Management and Data Visualization System for Coastal Zone Management of the Mediterranean Sea. Contact person C. Houstis. http://www.ics.forth.gr/pleiades/THETIS/thetis.html
[2] Poseidon: A Distributed Information System for Ocean Processes. Contact person N. M. Patrikalakis. http://czms.mit.edu/poseidon/
[3] N. M. Patrikalakis, C. Chryssostomidis, K.
Mihanetzis, Design and Manufacturing in a Distributed Computer Environment, invited paper, ICCAS 99, June 1999. http://deslab.mit.edu/DesignLab/dinos/iccas.pdf or http://deslab.mit.edu/DesignLab/dinos/iccas.ps
[4] P. C. H. Wariyapola, N. M. Patrikalakis, S. L. Abrams, P. Elisseeff, A. R. Robinson, H. Schmidt, K. Streitlien, Ontology and Metadata Creation for the Poseidon Distributed Coastal Zone Management System, Proceedings of IEEE Forum on Research and Technology Advances in Digital Libraries Conference, IEEE ADL 99, Baltimore, MD, pp. 180-189, May 1999, Los Alamitos, CA: IEEE, 1999.
[5] R. Fikes and A. Farquhar, Distributed Repositories of Highly Expressive Reusable Ontologies, Stanford University, March 1998. http://ontolingua.stanford.edu/
[6] V. Christophides, C. Houstis, S. Lalis, H. Tsalapata, Ontology-driven Integration of Scientific Repositories, NGITS99, New Generation Information Technologies, Lecture Notes in Computer Science, Elsevier, Habart Habaron, Israel, July 1999.