1. Motivation

Scientific datasets generated by large-scale simulations and by sensors attached to devices such as satellites, airplanes and microscopes are usually multi-dimensional. The dimensions can be, for example, spatial coordinates, time, or varying experimental conditions such as temperature, velocity or magnetic field. In large-scale scientific simulations, these datasets must be transferred out of processor memory with special care in order to minimize the overall execution time. After the data are off-loaded, they are analyzed and post-processed to visualize the evolving state of the simulated physical phenomenon. The latest trend is either to feed the results of the analyses back immediately to the still-running simulation program in order to change simulation parameters (computational steering, or interactive, integrated visualization), or to pass the post-processed data on to other applications. Therefore, fast I/O during post-processing is as important as minimizing the computation-time I/O. Meeting both criteria poses a significant challenge in data management.

2. Our Research

In the past five years, the research activities of our Institute have focused on the development of parallel I/O support (language, compiler and runtime support) to minimize the computation time of HPF scientific simulations. During this period, we have cooperated with Joel Saltz (University of Maryland), Marianne Winslett (University of Illinois at Urbana-Champaign) and Alok Choudhary (Northwestern University, Evanston). Building on these results, our new project addresses the development of parallel I/O support for data analysis and visualization. More specifically, we will design, implement, and evaluate a software tool, the Parallel Scientific Data Repository (PASDR), that will serve the needs of a wide variety of scientific applications, including large-scale modeling, optimization, simulation and data analysis.
PASDR will manage, store, and access large-scale scientific datasets effectively and efficiently. PASDR assumes a shared-nothing (distributed-memory) architecture, in which datasets are distributed across the disks. PASDR will differ from the parallel file systems and libraries developed in the past years in several ways:

- PASDR will act as a special high-performance scientific data warehouse.
- Besides common I/O operations (e.g., reading and writing a multi-dimensional array, checkpoint/restart, snapshot), PASDR will be able to carry out multi-dimensional range queries directed at large datasets.
- PASDR will provide support for operations which are common in database systems, including index generation, data retrieval, memory management, and metadata handling.
- PASDR will provide support for Online Analytical Processing (OLAP) and a selected set of data mining operations.
- PASDR will include features for the construction and storage of the persistent hyper-cubes used by OLAP. We will investigate both in-core and out-of-core construction algorithms.
- The PASDR interface should allow a repository to be accessed from applications developed in different programming languages. The same objectives were pursued by the Object Database Management Group (ODMG). Therefore, our design of the repository interface is based on the current ODMG standard.
- Disclosing access patterns: The repository interface provides means for specifying application access patterns. These disclose knowledge of future repository accesses, in the form of hints, to the repository management system, which can use them to guide aggressive runtime optimizations.
- Support for parallelism: Repository data and metadata are stored in a distributed and parallel form across multiple I/O devices. Further, the user has the opportunity to influence the distribution of the repository data by means of hints.

3. Contribution to Joint Research Projects

I fundamentally agree with the workshop position papers proposing research into high-performance and reliable data management systems. Our current research effort is oriented towards these issues as well, and we hope that it can provide relevant support for the challenging applications outlined in the position papers. At present, we are preparing a detailed bibliography which will include annotated records describing publications relevant to the topics addressed by this workshop. The bibliography will be available on the web. The workshop should specify a set of pilot applications from different scientific and engineering areas, and define the partnerships and the structure of the research initiatives.
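As an illustration of the multi-dimensional range queries and the distribution hints listed in Section 2, the following minimal sketch shows how a range query over a block-partitioned array could be mapped to the disks holding the affected blocks. It is not the PASDR interface: the function names `blocks_for_range` and `disk_for_block`, the block sizes, and the round-robin placement are all assumptions chosen only for illustration.

```python
from itertools import product


def blocks_for_range(query, block_shape):
    """Return grid coordinates of every block overlapping a range query.

    query: list of (start, stop) index pairs, one per dimension (stop exclusive).
    block_shape: block extent in each dimension.
    """
    per_dim = []
    for (start, stop), size in zip(query, block_shape):
        if stop <= start:
            return []  # an empty range in any dimension selects nothing
        first = start // size
        last = (stop - 1) // size
        per_dim.append(range(first, last + 1))
    return list(product(*per_dim))


def disk_for_block(block, grid_shape, num_disks):
    """Hypothetical placement: row-major block number, assigned round-robin."""
    linear = 0
    for coord, extent in zip(block, grid_shape):
        linear = linear * extent + coord
    return linear % num_disks


# Assumed example: a 1000 x 1000 array stored as 250 x 250 blocks on 4 disks.
grid_shape = (4, 4)
query = [(200, 600), (0, 250)]
blocks = blocks_for_range(query, (250, 250))
disks = sorted({disk_for_block(b, grid_shape, 4) for b in blocks})
print(blocks)  # [(0, 0), (1, 0), (2, 0)]
print(disks)   # [0]
```

In this assumed layout, all three blocks touched by the query reside on the same disk, so the query cannot be served in parallel. This is the kind of situation in which a user-supplied distribution hint, or a disclosed access pattern, could lead the repository to choose a different placement.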