Position Paper
Dan Reed, University of Illinois and NCSA

The "data explosion" is upon us -- both large-scale computational simulations and ever more sophisticated scientific instruments are generating a torrent of data. Just as computational parallelism is necessary to achieve multi-teraop computation rates, parallel secondary and tertiary storage techniques such as disk and tape striping will be required to sustain multi-terabyte or petabyte queries.

Massive I/O parallelism is a consequence of the growing disparity between disk capacity and transfer rates, both driven by the commodity PC market. Although areal densities are increasing by 60 percent annually, access times (i.e., seek and rotational delays) are decreasing considerably more slowly. This has profound implications for how data is stored on and distributed across devices.
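To make the disparity concrete, here is a minimal sketch of how the time to read an entire disk grows when capacity compounds faster than transfer rate. Only the 60 percent annual capacity growth comes from the text above; the 40 percent transfer-rate growth and the function name are hypothetical, chosen purely for illustration.

```python
def full_scan_time_factor(capacity_growth: float, rate_growth: float, years: int) -> float:
    """Factor by which the time to read a whole disk changes after `years`.

    Full-scan time is capacity / transfer rate, so it scales each year by
    (1 + capacity_growth) / (1 + rate_growth).
    """
    return ((1.0 + capacity_growth) / (1.0 + rate_growth)) ** years


# 0.60 is the areal-density growth quoted in the text.
# 0.40 is an ASSUMED transfer-rate growth, not a figure from the paper.
CAPACITY_GROWTH = 0.60
ASSUMED_RATE_GROWTH = 0.40

for years in (1, 5, 10):
    factor = full_scan_time_factor(CAPACITY_GROWTH, ASSUMED_RATE_GROWTH, years)
    print(f"after {years:2d} years a full-disk scan takes {factor:.2f}x as long")
```

Whatever the actual rates, any gap between the two growth figures compounds, which is why a single device falls further behind each year and parallel access becomes unavoidable.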

With fairly modest assumptions about the I/O rates needed to query reasonable subsets of a multi-petabyte data archive, one can calculate the number of storage devices that must be accessed in parallel to achieve reasonable I/O rates. Even with data striping across thousands of devices, access times can potentially be very large.
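The calculation mentioned above can be sketched as follows. All numbers (subset size, deadline, per-device transfer rate, access time) are illustrative assumptions rather than figures from this paper, and the model is deliberately simple: data is striped evenly and every device pays one access delay before streaming.

```python
import math

TB = 10**12  # bytes
MB = 10**6   # bytes


def devices_needed(target_bytes_per_s: float, per_device_bytes_per_s: float) -> int:
    """Minimum number of devices that must stream in parallel to reach the target rate."""
    return math.ceil(target_bytes_per_s / per_device_bytes_per_s)


def scan_time_s(subset_bytes: float, n_devices: int,
                per_device_bytes_per_s: float, access_time_s: float) -> float:
    """Time to read a subset striped evenly over n_devices.

    Each device pays one access delay (seek plus rotation), then streams its share.
    """
    return access_time_s + subset_bytes / (n_devices * per_device_bytes_per_s)


# Hypothetical query: read a 10 TB subset of the archive within 10 minutes.
subset = 10 * TB            # assumed subset size
deadline_s = 600            # assumed acceptable response time
per_device_rate = 20 * MB   # assumed sustained rate of one device, bytes/s
access_time = 0.010         # assumed seek + rotational delay, seconds

n = devices_needed(subset / deadline_s, per_device_rate)
t = scan_time_s(subset, n, per_device_rate, access_time)
print(f"{n} devices in parallel, estimated scan time {t:.1f} s")
```

If the subset is scattered so that each device must perform many separate accesses instead of one, the access-delay term is multiplied accordingly, which is one way access times can remain large even with wide striping.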

This suggests that we need to explore (and revisit) several issues:

1. Disks with embedded data analysis capability (processors in disks).

2. Redundant storage that exploits rising storage capacities to store multiple copies of key data subsets in ways that match expected access patterns.

3. Learning techniques for data organization and reorganization that exploit behaviors gleaned from earlier queries.

In addition, it is increasingly common to correlate data across multiple data archives (e.g., correlating radio astronomy observations with those in the optical domain) and across disciplines (e.g., GIS and population data). Cross-domain indexing and naming, sometimes called semantic interoperability, are critical if such correlation is to succeed. Hence, metadata representation and generation techniques are paramount.