Managing Data Movement
David Maier
Database and Object Technology Lab
Computer Science and Engineering Department
Oregon Graduate Institute
P.O. Box 91000
Portland, Oregon 97291-1000

Imagine that we could fabricate small sensors for temperature, pressure and position for ten cents each. The applications for such technology in scientific investigation would be vast: spray them onto trees in a forest to study microclimates; dump a bucket of them into the ocean to map currents in detail; strew them behind a planetary explorer to track diurnal and seasonal changes. Now imagine the data stream produced by, say, a few million deployed sensors.

The average cost of acquiring a byte of scientific data is on a steep downward trend. It would be foolhardy to imagine that we can manage all scientific data in the future by the traditional approach of capturing it all on storage media and retrieving it later for analysis. We should think about ways to process, partition, fuse and disseminate such data to a widely distributed body of investigators without its first having to cross a disk platter or tape surface. In particular, we should consider alternative models in which such data is routed in near real time to interested clients, which can aggregate it, comb it for particular patterns or detect events that trigger capture of portions of it.

To move in this direction, we need to think about data architectures that are "net centric" rather than "disk centric," and where the emphasis is data movement rather than data storage. Such a shift raises numerous issues about how data management systems should be constructed, including:

Shifting from file-oriented to stream-oriented processing. Processing models that assume a data source has an "end" aren’t suitable for data streams that continue indefinitely. Incremental processing will be more the norm, and splitting and combining of streams will be important operations.
Constructing new kinds of data management components. While most of the basic system components in current database management systems will continue to be of use, other functions are needed that don’t appear in current systems. One such component is an alerter, which processes a data stream against thousands of stored conditions. Another is an accumulator, which is a high-turnover storage manager for maintaining a shifting subset of the data that has appeared in a stream.
Alternative structures for data. Normalized, tabular forms of data may not always be a suitable representation, as they can place related bits of information far apart in a data stream. Hierarchical, grouped structures, as typified by XML, may be more appropriate.
New roles for meta-data. Meta-data will need to be mixed intimately with the data it describes and annotates, for purposes of routing and assessing relevance of data in streams.
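
As a rough illustration of the stream-oriented style and of the alerter and accumulator components described above, the following Python sketch processes an unbounded stream of sensor readings one item at a time. Everything in it is an assumption made for illustration: the Reading record, the class and method names, the time-based window, and the requirement that readings arrive in non-decreasing timestamp order. It is not drawn from any existing system. The alerter here scans its stored conditions linearly, whereas a component meant to handle thousands of conditions would presumably index them.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, Iterable, Iterator, List, Tuple


@dataclass(frozen=True)
class Reading:
    """Hypothetical sensor reading; the field names are illustrative only."""
    sensor_id: str
    timestamp: float      # seconds; assumed non-decreasing along the stream
    temperature: float


class Alerter:
    """Checks each incoming reading against a set of stored conditions."""

    def __init__(self) -> None:
        self._conditions: Dict[str, Callable[[Reading], bool]] = {}

    def register(self, name: str, predicate: Callable[[Reading], bool]) -> None:
        self._conditions[name] = predicate

    def matches(self, reading: Reading) -> List[str]:
        # Linear scan for simplicity; a real alerter would index conditions.
        return [name for name, pred in self._conditions.items() if pred(reading)]


class Accumulator:
    """Keeps a shifting subset of the stream: readings in a recent time window."""

    def __init__(self, window_seconds: float) -> None:
        self._window = window_seconds
        self._buffer: Deque[Reading] = deque()

    def add(self, reading: Reading) -> None:
        self._buffer.append(reading)
        cutoff = reading.timestamp - self._window
        # Discard readings that have fallen out of the window.
        while self._buffer and self._buffer[0].timestamp < cutoff:
            self._buffer.popleft()

    def snapshot(self) -> List[Reading]:
        return list(self._buffer)


def process(
    stream: Iterable[Reading], alerter: Alerter, accumulator: Accumulator
) -> Iterator[Tuple[str, Reading, List[Reading]]]:
    """Incrementally consume a stream that may never end.

    Each matched condition triggers capture of the recent window, yielded
    together with the condition name and the reading that matched it.
    """
    for reading in stream:
        accumulator.add(reading)
        for name in alerter.matches(reading):
            yield name, reading, accumulator.snapshot()


if __name__ == "__main__":
    alerter = Alerter()
    alerter.register("hot", lambda r: r.temperature > 40.0)
    readings = [
        Reading("s1", 0.0, 10.0),
        Reading("s1", 5.0, 12.0),
        Reading("s2", 12.0, 41.0),
    ]
    for name, reading, window in process(readings, alerter, Accumulator(10.0)):
        print(name, reading.sensor_id, len(window))
```

Because process is a generator, it never assumes the source has an "end": it emits results as readings arrive and retains only what the accumulator's window holds, so nothing has to be written to disk before it can be examined.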

One project that is investigating net-centric data management, though not specifically focused on scientific data, is the NIAGARA project underway at the University of Wisconsin (David DeWitt, Jeffrey Naughton) and the Oregon Graduate Institute (David Maier).