First of all, the word "large" for data bases might be misleading if it is taken to mean pure quantity. In fact, a terabyte of uniform, well-organized satellite data is definitely easier to handle than, for instance, a megabyte of results from a biological experiment. "Size" will become even more relative as electronic storage devices grow in capacity and fall in price. The real issue is data and information management.
"Byte" and "file" handling and the operation of terabyte archives can be left to engineers and computer scientists. Several developments and standards for handling large data archives have in fact appeared in recent years, such as "Hierarchical Storage Management" (HSM).
Of concern for scientists is the long-term storage of these data. Whereas more than a thousand years of experience exists in the preservation of written records, experience with digital information gives little confidence that a record written on a specific computer system and stored on a specific medium will still be readable after ten years. Concepts for the continuous migration of scientific information and data bases need to be developed, and they should make use of new automated media-handling technologies such as robotic libraries. Similar problems arise in banking and administration, and science should look for synergies with the approaches developed in these areas. There will also be the problem of "deleting" scientific data (because nobody can afford to store it any longer). Standards, criteria and basic research are required on the "long-term value" and "irreplaceability" of scientific information.
Another serious issue is the description, partitioning and identification of data and information. Descriptive information about data is called metadata. Whereas the description of literature (e.g. books) has a long history in libraries, the description of scientific data needs a new approach.
Until a few years ago, the only intended output of scientific work was the scientific publication, a piece of information with a well-defined metadata description (e.g. standard library citations, ISBN, Dublin Core). Few actual data, measurement tables and the like appeared in these publications, and scientific work based on measurements performed by others was close to impossible. Only the development and affordability of digital computing made it possible to store and transfer measurements easily and cost-effectively. What is still missing, however, is a commonly agreed understanding of the contents of these data files, i.e. a common metadata description.
As scientific data bases grew bigger, some science domains started to address this problem. In earth environmental science and earth observation, a domain I am familiar with, experts have realized that any data set can in principle be described by a triple of information:
- geographical position of the data set
- time of the measurement
- source/sensor of the measurement
At least traditional satellite data can be described with this basic meta-information. Extending it by a few more parameters and applying basic descriptive rules from librarianship, NASA started the "International Directory Network" (IDN), also called the Global Change Master Directory (GCMD), which is by far the largest collection of metadata for global earth environmental and satellite data. The metadata format of the GCMD is called the "Directory Interchange Format" (DIF), indicating that descriptive records should be exchangeable with other metadata collections. In fact, many earth science disciplines have adopted DIF for their metadata description. Learn more about the GCMD at http://gcmd.gsfc.nasa.gov
With the advent of the Internet, all these metadata resources have been turned into on-line data bases.
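To make the position/time/sensor triple concrete, here is a minimal sketch of how such a record and a search over it might look. It is not the DIF format: the class names, the field layout and the two catalog entries are invented for illustration, and the bounding-box test ignores data sets that cross the 180° meridian.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class BoundingBox:
    """Geographical extent in decimal degrees (hypothetical layout)."""
    west: float
    south: float
    east: float
    north: float

    def intersects(self, other: "BoundingBox") -> bool:
        # Two boxes overlap unless one lies entirely beside or above the other.
        return not (
            self.east < other.west or other.east < self.west
            or self.north < other.south or other.north < self.south
        )


@dataclass(frozen=True)
class DatasetRecord:
    """Minimal 'where / when / which sensor' description of one data set."""
    title: str
    extent: BoundingBox
    start: date
    end: date
    sensor: str


def search(records, area, first_day, last_day, sensor=None):
    """Return records overlapping the area and period, optionally for one sensor."""
    hits = []
    for rec in records:
        overlaps_time = rec.start <= last_day and first_day <= rec.end
        right_sensor = sensor is None or rec.sensor == sensor
        if rec.extent.intersects(area) and overlaps_time and right_sensor:
            hits.append(rec)
    return hits


# Invented example entries, not taken from any real directory.
catalog = [
    DatasetRecord("Alpine snow cover", BoundingBox(5.0, 44.0, 16.0, 48.5),
                  date(1998, 1, 1), date(1998, 3, 31), "AVHRR"),
    DatasetRecord("Sahel vegetation index", BoundingBox(-17.0, 10.0, 40.0, 20.0),
                  date(1997, 6, 1), date(1997, 10, 31), "AVHRR"),
]

for rec in search(catalog, BoundingBox(10.0, 45.0, 12.0, 47.0),
                  date(1998, 2, 1), date(1998, 2, 28)):
    print(rec.title)
```

A real directory entry carries many more fields (contact, access conditions, keywords), but the three shown here are already enough to answer the typical "what exists for this place and period" question.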
Interestingly enough, the major purpose of these on-line resources is not to explain the content of a data set once it is in the hands of a researcher (e.g. as a kind of user's manual), but to allow researchers to find data sets in the plethora of available information. Metadata currently supports the search for data rather than their use. This is one of the weaknesses of the current metadata concept: metadata is defined only with a certain intended use in mind. A researcher who has a different use for the data may find them hard to locate. As an example, metadata collections such as the GCMD support a set of specific keywords. If you are not familiar with the earth science keywords, you might not be able to find the data using your own nomenclature.
Multi-disciplinary on-line metadata retrieval is nowadays better supported by allowing "full text retrieval" in scientific abstracts (as offered by common Internet search engines). More advanced approaches try to support users by "translating" science terms, using tools such as thesauri and similarity checkers. Public and commercial search engines on the Web now face similar problems, and synergy with these partly commercial approaches is recommended.
A radically different approach is to derive the metadata from the original data, depending on the specific type of query. Several "data mining" approaches have been tried on earth observation data collections. Users can ask not only "give me a satellite image of this date at this place", but also "give me all satellite images with small rivers going southwards". In the latter case, the information about small rivers is not included in the metadata but might be available in other data bases (e.g. the "Digital Chart of the World"), and this information is used to retrieve the right data.
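The river query just described can be sketched as a join between the image catalog and an auxiliary feature data base. Everything in the sketch is an assumption made for illustration: the record layouts, the width threshold for "small" and the angular tolerance for "southwards" are invented, and no real schema (such as that of the Digital Chart of the World) is implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Footprint:
    """Rectangular extent in decimal degrees (west, south, east, north)."""
    west: float
    south: float
    east: float
    north: float

    def overlaps(self, other: "Footprint") -> bool:
        return (self.west <= other.east and other.west <= self.east
                and self.south <= other.north and other.south <= self.north)


@dataclass(frozen=True)
class River:
    """Entry of a hypothetical auxiliary feature data base."""
    name: str
    extent: Footprint
    width_m: float
    flow_bearing_deg: float  # compass direction of flow, 180 = due south


@dataclass(frozen=True)
class Image:
    """Entry of the image catalog; it carries no river information itself."""
    image_id: str
    extent: Footprint


def is_small_southward(river, max_width_m=50.0, tolerance_deg=45.0):
    """Assumed definition of 'small river going southwards'."""
    heads_south = abs(river.flow_bearing_deg - 180.0) <= tolerance_deg
    return river.width_m <= max_width_m and heads_south


def images_with_small_southward_rivers(images, rivers):
    """Select images whose footprint overlaps at least one matching river."""
    wanted = [r for r in rivers if is_small_southward(r)]
    return [img.image_id for img in images
            if any(img.extent.overlaps(r.extent) for r in wanted)]
```

The image catalog is never asked about rivers; the selection criterion comes entirely from the second data base, and the two are linked only through geographical position.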
In an even more advanced variant, the query invokes an image processing algorithm that identifies the object class "small river going southwards" directly in the terabytes of original on-line data. More research is needed in this domain.
Once metadata is available on-line on the Internet, the problem arises of how users can search several of these on-line data bases. The simplest approach is to log on to every resource (provided it is known) and type in the specific questions required by each data base. If the data bases support a standard metadata description (such as DIF), the query might be the same for each of them, and in that case it can be automated: the catalogued information can be retrieved through a specific query protocol. The user logs in to only one data base, and the query is automatically forwarded to all other data bases supporting this query format. In the same way, the answers, returned in a standard format, are transferred back to the initial host. Without logging in to dozens of data bases, the user obtains more information than resides on any single server.
Again, this concept of "Catalog Interoperability" is not new to the library community. Earth observation has picked up the idea in recent years and has created standards for the interoperable exchange of catalog information about satellite data. The standard, based on the library information retrieval standard Z39.50 (ANSI/NISO Z39.50, ISO 23950), is now being extended to other geoscientific information. The work on catalog interoperability was performed under the auspices of the Committee on Earth Observation Satellites (CEOS). More about CEOS activities at http://wgiss.ceos.org/wgiss/
Other CEOS activities are concerned with the homogenization of global data sets. A typical problem is that supposedly similar data come in different digital file formats.
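The catalog interoperability scheme described above, with one entry point, a query forwarded to several data bases and answers merged at the initial host, can be sketched in a few lines. This is a toy model under stated assumptions, not an implementation of Z39.50 or of the CEOS protocol: the `Catalog` class, its `query` method and the record fields are hypothetical, and the remote hosts are simulated in memory.

```python
class Catalog:
    """Stand-in for one remote metadata data base (hypothetical interface)."""

    def __init__(self, name, records, available=True):
        self.name = name
        self.records = records      # list of dicts in a shared record format
        self.available = available

    def query(self, keyword):
        if not self.available:
            raise ConnectionError(f"{self.name} is unreachable")
        return [r for r in self.records if keyword in r["keywords"]]


def federated_query(catalogs, keyword):
    """Send one query to every catalog and merge the standard-format answers."""
    merged, failed = [], []
    seen_ids = set()
    for catalog in catalogs:
        try:
            answers = catalog.query(keyword)
        except ConnectionError:
            failed.append(catalog.name)   # report the gap instead of aborting
            continue
        for record in answers:
            if record["id"] not in seen_ids:   # same record held by several hosts
                seen_ids.add(record["id"])
                merged.append({**record, "source": catalog.name})
    return merged, failed
```

The sketch only works because every catalog accepts the same query and returns records in the same layout, which is exactly what a shared metadata format and query protocol are meant to guarantee. Reporting unreachable hosts separately lets the user see that the merged answer may be incomplete.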
Among other file formats, the Hierarchical Data Format (HDF) is now used as a standard for geoscientific products. Map-like products such as digital satellite imagery need to be transformed to a map projection before they can be analyzed in a Geographical Information System (GIS). Map making and its associated formulas are an art of the 18th and 19th centuries, deliberately defined for national territories. Global and unified activities need to work with global and unified map projections. Better still, they could avoid area- and scale-distorting map projections altogether and use innovative 3-D mapping algorithms, powered by the performance of today's computer graphics. Computer graphics is an area with many innovative approaches that can be used in the handling and visualization of scientific data. More synergy between traditional research and computer graphics is recommended.
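One simple reading of "3-D mapping" is to keep positions as points on a globe instead of projecting them onto a plane. The sketch below converts latitude and longitude to 3-D Cartesian coordinates and measures distances on the sphere. For comparison, it also gives the east-west stretch factor of a plain equirectangular map. The spherical Earth with a mean radius of 6371 km is a simplifying assumption, and the function names are chosen here for illustration only.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean radius; a spherical Earth is an assumption


def to_cartesian(lat_deg, lon_deg, radius=EARTH_RADIUS_KM):
    """Convert geographic coordinates to (x, y, z) on a sphere."""
    lat = math.radians(lat_deg)
    lon = math.radians(lon_deg)
    x = radius * math.cos(lat) * math.cos(lon)
    y = radius * math.cos(lat) * math.sin(lon)
    z = radius * math.sin(lat)
    return (x, y, z)


def great_circle_km(p, q, radius=EARTH_RADIUS_KM):
    """Distance along the sphere between two (lat, lon) points in degrees."""
    a = to_cartesian(p[0], p[1], radius=1.0)
    b = to_cartesian(q[0], q[1], radius=1.0)
    dot = sum(ai * bi for ai, bi in zip(a, b))
    dot = max(-1.0, min(1.0, dot))   # guard against rounding outside [-1, 1]
    return radius * math.acos(dot)


def east_west_stretch(lat_deg):
    """Factor by which an equirectangular (plate carree) map stretches
    east-west distances at the given latitude; 1.0 at the equator."""
    return 1.0 / math.cos(math.radians(lat_deg))


print(round(great_circle_km((0.0, 0.0), (0.0, 90.0)), 1))  # quarter of the equator
print(round(east_west_stretch(60.0), 3))
```

On the globe, distances and areas need no latitude-dependent correction, whereas the flat map already doubles east-west lengths at 60° latitude.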