Thoughts on Possible EU-US Cooperation on Large Scientific Data Bases
Alberto Apostolico, Purdue University and Universita' di Padova, August 1999

A pervasive feature of the current and future information flood is that searching for information without regard to meaning will become impossible or useless. A search engine is likely to return thousands of Greek restaurants in response to a query about the "Parthenon", and it won't do much good to be able to choose among 1,000 movies on TV in the evening if browsing through the collection of their summaries takes longer than watching one.

With data being increasingly amassed, the prevailing problem has thus become one of how to limit and filter what a query shall return. Correspondingly, we face the issues of achieving effective synthetic descriptions, generating succinct characterizations, and highlighting prominent features of the available data. Metadata buildup is an essential component of the Internet of the future and of the activities it will carry. The environment within which most of these processes must take place is like Heraclitus' famous river: never twice the same.

In computer science jargon, we seem forced to move from paradigms of search "by value" and search "by contents" to a new one of search "by meaning", a paradigm hardly explored so far. To appreciate the difficulty this poses, it suffices to consider that even search by contents, which appears to be so easy when dealing with text (e.g., give me all documents containing the word "river"), becomes quite difficult with other media such as pictures and sounds (give me all pictures containing a river, all recordings with the sound of flowing water, etc.). In general, while it is clear that we care about symbols insofar as they carry meaning, harvesting the latter requires breaking the barrier of syntax and penetrating semantics. Performing this task automatically would call for intermediaries that are rather impervious and far from perfected, including, but not limited to, natural language understanding.
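The contrast between the first two paradigms can be made concrete with a minimal sketch, assuming a toy collection of text records. The record fields, the sample data and the function names below are hypothetical and serve only as illustration; search by meaning has no comparably simple counterpart.

```python
import re

# Hypothetical toy records; field names and contents are illustrative only.
DOCUMENTS = [
    {"id": 1, "title": "Heraclitus", "text": "No one steps in the same river twice."},
    {"id": 2, "title": "Parthenon", "text": "A temple on the Athenian Acropolis."},
    {"id": 3, "title": "Hydrology", "text": "River discharge varies with rainfall."},
]


def search_by_value(docs, field, value):
    """Search by value: return ids of records whose field equals value exactly."""
    return [d["id"] for d in docs if d.get(field) == value]


def search_by_contents(docs, word):
    """Search by contents: return ids of records whose text contains
    the given word as a whole word, ignoring case."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [d["id"] for d in docs if pattern.search(d["text"])]


if __name__ == "__main__":
    print(search_by_value(DOCUMENTS, "title", "Parthenon"))
    print(search_by_contents(DOCUMENTS, "river"))
```

Both functions operate purely on symbols. Neither could tell that a document about the Acropolis is relevant to a query about Greek temples unless those very words appear in it, which is the gap that search by meaning would have to close.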

While some of the tasks required in the organization and retrieval of knowledge can be automated, most still require considerable amounts of quality human intervention by way of meaning extraction, format standardization and so on. For once, the commercial sector seems to be ahead of the game on this, since it had to adjust rapidly under pressure to the emerging patterns and protocols of commerce and other activities. However, most of the commercial approaches are eminently heuristic. The scientific community still has a chance of making fundamental contributions of general scope while building its own future organization of data and knowledge.

Because syntax is still the gateway to automated meaning extraction, establishing the grammars of data and knowledge in the various walks of science seems a necessary ancillary step and one that only the scientific community involved in each individual discipline can carry out competently and effectively.

It seems to me that the bulk of EU-US (or, for that matter, planetary) cooperative effort on these issues should concentrate on understanding, defining and standardizing the structure and format of current and future scientific data within appropriate disciplinary contexts, in such a way as to make it most conducive to access, dissemination and maintenance. I would favor the emergence of a jointly funded transnational "authority" with the charter of dynamically understanding, defining and updating the standards of data and knowledge patterns. This might be initially limited to a few (sub)areas of, say, Physics and Aerospace Research, Earth Sciences, Digital Libraries and Molecular Bioinformatics, and later propagated to others.

I would expect such an authority to come up with guidelines or frames of reference within which more ad hoc technical joint programs may be conceived and carried out.

I would also like to see the establishment of one or more internationally distributed (col)laboratories for the study of technical issues that cut across scientific disciplines and pertain to the management of the massive information repositories of the future by means of compression, inference, searching and matching, mining, and related principles and techniques. In particular, I refer to a number of problems that arise in the organization and analysis of data and information and that may be modeled in terms of building, matching and searching with some elementary discrete structures such as strings, trees, arrays, regular expressions, some special classes of graphs, and compounds thereof. I believe that advances in automated association generation and other similarly desirable semantic capabilities of filtration and inference still rest considerably on progress on those syntactic issues.
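As one illustration of what building, matching and searching with an elementary discrete structure can look like, the sketch below indexes a string with a suffix array and locates all occurrences of a pattern by binary search. The choice of a suffix array, the naive construction by sorting, and the function names are assumptions made for this example only; the text above does not prescribe any particular structure or algorithm.

```python
def build_suffix_array(text):
    """Return the starting positions of all suffixes of text, sorted
    lexicographically. Naive construction by sorting; adequate for short
    inputs only, since it compares whole suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted starting positions of pattern in text, using two
    binary searches over the suffix array."""
    m = len(pattern)

    # Lower bound: first suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first suffix whose length-m prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        start = suffix_array[mid]
        if text[start:start + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in [first, lo) begins with pattern.
    return sorted(suffix_array[first:lo])


if __name__ == "__main__":
    sequence = "banana"
    index = build_suffix_array(sequence)
    print(find_occurrences(sequence, index, "ana"))
```

Once the index is built, each query costs a logarithmic number of prefix comparisons rather than a scan of the whole text, which is the kind of trade-off that matters when the repository is massive and queried repeatedly.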