European-United States
joint workshop on
Large Scientific Databases


Annapolis, Maryland, USA, 1999 September 8-10

Supported in part by the
National Science Foundation
(Grant IIS-9910140)
and the
European Commission
(European Union Information Society Technology Programme)

Organized by
the Center for Advanced Computing Research at the California Institute of Technology,
and by the European Laboratory for Particle Physics (CERN)

This report was compiled by the organizing committee:

Roy Williams
California Institute of Technology, USA
<roy@cacr.caltech.edu>


Paul Messina
California Institute of Technology and US Department of Energy
<messina@cacr.caltech.edu>


Fabrizio Gagliardi
CERN
<Fabrizio.Gagliardi@cern.ch>

John Darlington
Imperial College, UK
<jd@doc.ic.ac.uk>

Giovanni Aloisio
University of Lecce, Italy
<giovanni.aloisio@unile.it>

This document and supporting material may be found at
http://www.cacr.caltech.edu/euus


The full report from the workshop is available
as PDF, plain text, or HTML.


Executive Summary

Science is continuing to generate ever larger amounts of valuable data, but we are in danger of being unable to extract fully the latent knowledge within the data because of insufficient technology. To address this we propose the establishment of an Expedition Center, a virtual "center", hosted at multiple geographical sites, similar in scope and thrust to the US-NSF Science and Technology Centers. Features include:

  • unification of information and knowledge management between the US and Europe

  • strong leadership and continuity of purpose

  • funding in the millions per year

  • longevity of 5 years or more

  • flexibility to seize new opportunities quickly and to shift the agenda rapidly

The center would be a network of excellence in specific research domains, emphasizing trans-Atlantic teams, supporting both basic and applied research, with large-scale testbeds and large-scale demonstrations. It would have a strong education and outreach component. There might be four sites in the EU and US with independent funding for visitors, travel, and workshops. These regional centers could share the resources of independently funded facilities to create large-scale demonstrations and prototypes; such sharing could be achieved by collaboration agreements or by rental. There would be liaison with other activities, for example Framework 5 in the EU, and the Grid Forums and Digital Library Initiatives in the US.

A crucial requirement for this kind of collaboration is trans-Atlantic data communication that provides high bandwidth, high availability, and low latency. We recommend a study to consider and cost the options in detail.

We recommend funding application-driven, multidisciplinary research, with the creation of prototypes, testbeds, and full-scale implementations. Such research should always be close to the needs of a particular community, preferably directly connected to scientifically interesting research. Such a scientific community should be geographically distributed and international.

We encourage the creation and reconciliation of data object and metadata standards, but only in a strongly defined, discipline-specific environment, and with enough funding to produce relevant and useful software, not just a report. Further work could define metadata semantics, discipline-specific data dictionaries, information models for organizing metadata, and data models for describing data set structure. We encourage projects that use established, extensible metadata standards. Where a standard already exists, new projects should use, subset, or extend it, or provide good reason for any decision to start afresh.

We encourage interoperability projects that begin with two or more existing scientific databases, preferably already catalogued and/or online, together with a good reason and mechanism for combining the data. We should then encourage the implementation of the resulting federation.

We recommend specific research on the following aspects of distributed and/or large databases:

  • data clustering and caching

  • data redundancy, dynamic summarization, and query formulation to allow machine optimization and brokering

  • splitting queries into separate, local queries, and cost estimation of queries

  • parallel multi-dimensional access and search methods, approximate search methods, and data compression

  • load-balancing of computational work and data in distributed systems, replication of data among regional centers, protocols for high-speed, parallel dataflow, and protocols for real-time steering and control of running jobs

We also recommend supporting work on security mechanisms, especially work that coexists with other security mechanisms. We recommend that creators of scientific databases be encouraged to consider the preservation of the data in advance. Preservation description information should be associated with the digital objects being preserved. We recommend investigating ways to standardize requirements for IT courses in the educational curricula of the domain sciences, with emphasis on data modeling and the use of databases. We recommend exploring the profound impact of databases and networking on the process of science, including publishing, peer review, collaboration, and data ownership.

*This workshop was sponsored in part by the National Science Foundation, under the grant IIS-9910140 (PI: Roy Williams) awarded by the Information and Data Management Program of the Information and Intelligent Systems Division. All opinions, findings, conclusions and recommendations in any material resulting from this workshop are those of the participants, and do not necessarily reflect the views of the National Science Foundation.
