European-United States
joint workshop on
Large Scientific Databases
Annapolis, Maryland, USA, 1999 September 8-10

Supported in part by the National Science Foundation (Grant IIS-9910140) and the European Commission (European Union Information Society Technology Programme)

Organized by

This report was compiled by the organizing committee:

Roy Williams, California Institute of Technology, USA <roy@cacr.caltech.edu>
California Institute of Technology and US Department of Energy <messina@cacr.caltech.edu>
Fabrizio Gagliardi, CERN <Fabrizio.Gagliardi@cern.ch>
John Darlington
Giovanni Aloisio

This document and supporting material may be found at

The full report from the workshop is available

Executive Summary

Science is continuing to generate ever larger amounts of valuable data, but we are in danger of being unable to extract fully the latent knowledge within the data because of insufficient technology. To address this we propose the establishment of an Expedition Center, a virtual "center", hosted at multiple geographical sites, similar in scope and thrust to the US-NSF Science and Technology Centers. Features include:
- The center would be a network of excellence in specific research domains, emphasizing trans-Atlantic teams, supporting both basic and applied research, with large-scale testbeds and large-scale demonstrations.
- It would have a strong education and outreach component.
- There might be four sites in the EU and US with independent funding for visitors, travel, and workshops.
- These regional centers could share the resources of independently funded facilities to create large-scale demonstrations and prototypes; such sharing could be achieved by collaboration agreements or by rental.
- There would be liaison to other activities, for example Framework 5 in the EU, and the Grid Forums and Digital Library Initiatives in the US.

A crucial requirement for this kind of collaboration is trans-Atlantic data communication that provides high bandwidth, high availability, and low latency. We recommend a study to consider and cost the options in detail.

We recommend funding application-driven, multidisciplinary research, with the creation of prototypes, testbeds and full-scale implementations. Such research should always be close to the needs of a particular community, preferably directly connected to scientifically interesting research. Such a scientific community should be geographically distributed and international.

We encourage the creation and reconciliation of data object and metadata standards, but only in a strongly defined, discipline-specific environment, and with enough funding to produce relevant and useful software, not just a report. Further work could define metadata semantics, discipline-specific data dictionaries, information models for organizing metadata, and data models for describing data set structure.

We encourage projects that use established, extensible metadata standards. Where a standard already exists, new projects should use, subset, or extend it, or provide good reason for any decision to start afresh.
We encourage interoperability projects that begin with two or more existing scientific databases, preferably already catalogued and/or online, together with a good reason and mechanism for combining the data. We should then encourage the implementation of this federation.

We recommend specific research on the following aspects of distributed and/or large databases:

- data clustering and caching;
- data redundancy, dynamic summarization, and query formulation to allow machine optimization and brokering;
- splitting queries into separate, local queries and cost estimation of queries;
- parallel multi-dimensional access and search methods, approximate search methods, and data compression;
- load-balancing of computational work and data in distributed systems, replication of data among regional centers, protocols for high-speed, parallel dataflow, and protocols for real-time steering and control of running jobs.

We also recommend supporting work on security mechanisms, especially work that coexists with other security mechanisms.

We recommend that creators of scientific databases be encouraged to consider the preservation of the data in advance. Preservation description information should be associated with the digital objects being preserved.

We recommend investigating ways to standardize requirements for IT courses in the educational curricula of the domain sciences, with emphasis on data modeling and the use of databases.

We recommend exploring the profound impact of databases and networking on the process of science, including publishing, peer review, collaboration, and data ownership.

*This workshop was sponsored in part by the National Science Foundation, under the grant IIS-9910140 (PI: Roy Williams) awarded by the Information and Data Management Program of the Information and Intelligent Systems Division.
All opinions, findings, conclusions and recommendations in any material resulting from this workshop are those of the participants, and do not necessarily reflect the views of the National Science Foundation.