|
Position Paper
Dan Reed, University of Illinois and NCSA
The "data explosion" is upon us -- both large-scale computational simulations and ever more sophisticated
scientific instruments are generating a torrent of data. Just as computational parallelism is
necessary to achieve multi-teraop computation rates, parallel secondary and tertiary storage techniques like disk and tape striping
will be required to sustain multi-terabyte or petabyte queries.
Massive I/O parallelism is a consequence of the increasing disparity between disk capacity and transfer rates, both driven by the commodity PC
market. Although areal densities are increasing by 60 percent annually, access times (i.e., seek and rotational delays) are
decreasing considerably more slowly. This has profound implications for how data is stored on and distributed across devices.
With fairly modest assumptions about the I/O rates needed to query reasonable subsets of a
multi-petabyte data archive, one can calculate the number of storage devices that must be accessed in parallel to achieve
reasonable I/O rates. Even with data striping across thousands of devices, access times can potentially be very large.
This suggests that we need to explore (and revisit) several issues:
1. Disks with embedded data analysis capability (processors in disks).
2. Redundant storage that exploits rising storage capacities to store multiple copies of key data subsets in
ways that match expected access patterns.
3. Learning techniques for data organization and reorganization that
exploit behaviors gleaned from earlier queries.
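The device-count calculation mentioned above can be sketched in a few lines. This is a minimal back-of-the-envelope model, not a measured result: the function names, the 10 TB query size, the one-hour target, the 20 MB/s per-device transfer rate, and the 10 ms access time are all hypothetical values chosen for illustration and do not come from this paper.

```python
import math


def devices_needed(query_bytes, target_seconds, device_mb_per_s):
    """Minimum number of devices that must stream in parallel.

    Assumes perfect striping: every device delivers its full sustained
    transfer rate and access delays are ignored.
    """
    required_rate = query_bytes / target_seconds   # bytes per second overall
    per_device_rate = device_mb_per_s * 1e6        # bytes per second per device
    return math.ceil(required_rate / per_device_rate)


def scan_seconds(query_bytes, n_devices, device_mb_per_s,
                 access_ms=0.0, requests_per_device=1):
    """Elapsed time when the query is striped evenly over n_devices.

    Each device pays one access delay (seek plus rotation) per request,
    then streams its share of the data.
    """
    per_device_bytes = query_bytes / n_devices
    transfer = per_device_bytes / (device_mb_per_s * 1e6)
    access = requests_per_device * access_ms / 1000.0
    return transfer + access


if __name__ == "__main__":
    # Hypothetical workload: a 10 TB subset, answered within one hour,
    # on devices that sustain 20 MB/s each.
    query = 10e12
    n = devices_needed(query, 3600, 20)
    print("devices needed:", n)
    print("streaming only (s):", scan_seconds(query, n, 20))
    # Same data fetched as 100,000 small requests per device, 10 ms each:
    # access delays now add to the streaming time.
    print("with access delays (s):",
          scan_seconds(query, n, 20, access_ms=10, requests_per_device=100_000))
```

Under these assumed numbers the access term grows with the number of separate requests rather than with the volume of data, which is why layout on the devices matters as much as the raw device count.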
In addition, it is increasingly common to correlate data across multiple data archives (e.g., correlating radio
astronomy observations with those in the optical domain) and disciplines (e.g., GIS and
population data). Cross-domain indexing and naming, sometimes called semantic interoperability, are critical if this is to be
successful. Hence, metadata representation and generation techniques are paramount.
|