From Data to Information to Knowledge: Supporting the Path to Scientific Investigation
Arie Shoshani, July 1999

1. The process of scientific investigation

The rapid decrease in the cost of computer hardware and the advances in parallel processing and network technology have opened up new opportunities for scientific investigation. The main benefit of these advanced systems is the ability to compute models more accurately and analyze results at a faster rate. However, the increased computational power brings with it a parallel increase in the amount of data generated, collected, and analyzed.

There are two sources of data for scientific investigations. One source is the collection of observed data by various devices in laboratories, experimental systems, observation stations, and satellites. The quantities collected are usually limited by the availability of storage devices and their cost. For example, in High Energy Physics experiments, only a small fraction of the particle collision data is collected (less than 1%; only high-energy collisions are kept) because of the high cost of storage.

The second source of data is simulation data. Most simulation models use some kind of mesh structure to model space and time. The smaller the mesh dimensions, the more accurate the model. Here the limitation is often the computational power of the simulation system, as well as the cost of storage media. For example, Global Circulation Climate Models (GCMs) today still use a spatial mesh dimension of about 1 kilometer. Climatologists desire much smaller mesh dimensions for accurate modeling. Similarly, it is beneficial to shorten the time step of the models.

The datasets generated by such powerful computational and observational systems are getting so large (terabytes to petabytes) that the handling of the data is becoming the main bottleneck to scientific investigation. To understand the reason for such bottlenecks, one has to follow typical scientific investigation patterns. How do scientists analyze the datasets after they are generated or collected? With very large datasets, it is not possible to analyze or visualize a dataset in its entirety. One cannot visualize a terabyte of data, nor is it possible to navigate this quantity of data.

The typical way to proceed with analyzing large datasets is to first generate some abstraction of the dataset in the form of features or summarizations. This is the step where information is generated from the data. This phase includes various techniques, such as systematically extracting some features of the data, summarizing the data to a coarser granularity, or applying some data mining technique to identify regions of high activity or of special interest. An example of systematic extraction of features is generating the total energy, momentum, and the number of various particles for each collision in a High Energy Physics experiment. An example of summarizing data to a coarser granularity is the generation of monthly means over simulated climate data. Examples of data mining techniques applied to the data are cluster analysis in a multi-dimensional property space, identifying regions of high activity, and finding outliers or anomalies in the dataset.
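As a minimal sketch of the summarization step, the following reduces daily samples to monthly means. The function name `monthly_means`, the record layout, and the sample values are assumptions made for illustration only; they do not come from any particular climate model or analysis package.

```python
from collections import defaultdict
from datetime import date


def monthly_means(records):
    """Reduce (date, value) samples to one mean per (year, month)."""
    # (year, month) -> [running sum, sample count]
    sums = defaultdict(lambda: [0.0, 0])
    for day, value in records:
        bucket = sums[(day.year, day.month)]
        bucket[0] += value
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sorted(sums.items())}


# Hypothetical daily temperatures; the values are made up for illustration.
samples = [
    (date(1999, 1, 1), 10.0),
    (date(1999, 1, 2), 12.0),
    (date(1999, 2, 1), 20.0),
]
means = monthly_means(samples)
```

The summary is far smaller than the input (one value per month instead of one per day), which is what makes it practical to browse before going back to the full dataset.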

Scientists then use this information (i.e. the features / summarizations) to guide them to regions of interest. In this phase, the exploration phase, they often wish to "drill down" in the dataset. This is done by specifying a subset of the dataset, either by selecting on summarization features (e.g. in High Energy and Nuclear Physics (HENP) data, one might specify "get me all the collision data for some energy range") or by selecting directly on the dataset parameters (e.g. in climate modeling, one might specify "get me the temperature and wind velocity data for the Indian Ocean region over the last 2 years"). This analysis phase is where scientists use the information to focus on regions of the data from which to deduce or extract knowledge. In this phase, scientists may use generic or specialized analysis programs, as well as visualization methods, to "discover" knowledge.
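The "drill down" on summarization features can be sketched as a two-step lookup: first select event identifiers from the small feature index, then retrieve only those events from the raw data. The names `select_event_ids`, `fetch_raw_events`, `feature_index`, and `raw_store`, as well as the sample values, are hypothetical and stand in for whatever index and storage system a real experiment would use.

```python
def select_event_ids(feature_index, e_min, e_max):
    """Return ids of events whose summarized total energy lies in [e_min, e_max]."""
    return sorted(
        event_id
        for event_id, features in feature_index.items()
        if e_min <= features["total_energy"] <= e_max
    )


def fetch_raw_events(raw_store, event_ids):
    """Retrieve only the selected events from the (much larger) raw dataset."""
    return {event_id: raw_store[event_id] for event_id in event_ids}


# Hypothetical feature index: one small summary record per collision.
feature_index = {
    101: {"total_energy": 45.0, "n_particles": 12},
    102: {"total_energy": 210.0, "n_particles": 57},
    103: {"total_energy": 180.0, "n_particles": 40},
}
# Stand-in for the raw data, which in practice might live on tertiary storage.
raw_store = {101: b"raw-101", 102: b"raw-102", 103: b"raw-103"}

selected = select_event_ids(feature_index, 150.0, 250.0)
subset = fetch_raw_events(raw_store, selected)
```

The point of the split is that the selection touches only the compact feature index, so the expensive access to the raw data is limited to the events that actually match.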

2. Functional units in support of scientific investigation

The above description of the scientific investigation process for large scale models and datasets suggests the need for four functional units in support of these steps:

  1. Computational and Storage Facility.
  2. Data Mining and Feature Extraction Center.
  3. Large Scale Data Management Center.
  4. Advanced Data Analysis and Visualization Center.

The Computational and Storage Facility is needed to efficiently compute scientific models and store the data generated by such models or collected by experiments and observations. Since many models can be parallelized, such a facility should support massively parallel computation. The function of the Data Mining and Feature Extraction Center is to specialize in techniques that extract information from data. Such techniques are usually very demanding of computational and storage systems since they are applied to entire datasets. In general, these techniques are amenable to parallelization, and therefore can also benefit from an efficient computational facility that supports parallel computation. The function of the Large Scale Data Management Center is to make the exploration phase as efficient as possible. If each iteration of extracting subsets of the data takes too long (hours), scientists forgo exploratory paths. This center would develop methods that optimize the organization of data according to their anticipated access patterns, and provide efficient methods for subset extraction from large datasets. Finally, the Advanced Data Analysis and Visualization Center is also needed in support of the exploration phase, developing and supporting a variety of data analysis and visualization techniques.

In terms of a research agenda, I believe that of the four areas mentioned above, the more critical are 2) and 3). Areas 1) and 4) are not as critical. Research on Computational and Storage Facilities is taking place anyway because of other forces, such as the need for large scale simulations. Research on Data Visualization is taking place because of various commercial interests. Data Analysis is often domain specific, so the chances that such techniques can be shared between disciplines are not high. On the other hand, Feature Extraction and Data Mining techniques can be shared between disciplines. Similarly, research should emphasize techniques for efficiently accessing subsets of very large datasets from tertiary storage, their partial replication on distributed disk cache systems, the organization of the data to match access patterns, and tools to organize the metadata that describe large data collections. Such activities should be promoted so as to develop a body of knowledge, and hopefully software libraries, that can be shared by the scientific community.