Data routes in Astronomy
François Ochsenbein, Centre de Données astronomiques de Strasbourg

The existing pieces

Astronomy has a very long history of preserving observational results: only photons can currently be observed and analyzed, with no possibility of interacting with their source.

The information currently available in astronomy essentially consists of:

observational data and their derived data: observatory archives (e.g. the HST Archive), recent and old sky surveys (e.g. Palomar sky survey, 2MASS, SDSS...), images, catalogues, spectra, etc. The size of these data sets is increasing at an unprecedented rate (gigabytes of daily archives, catalogues of up to 5×10^8 objects...)
bibliography (published material) classically used to communicate the results and the interpretations of the observations among the scientists. All the leading astronomical journals are currently published electronically and are available on-line.
compilation databases and catalogues, gathering processed data from several archives, catalogues, databases, and publications, organized for specific purposes (e.g. extragalactic objects in NED, non-solar-system objects in SIMBAD, high-energy data in HEASARC, astronomical literature in ADS, etc...)
astronomically-oriented software, results of simulations
yellow pages (e.g. AstroWeb), personal pages, etc...

The bulk of the existing information is currently managed by a few key partners:

data producers (e.g. the STScI operating the HST) store the observations in observatory archives, close to the instrument collecting the data.
data centers provide access to the information in an organized way, e.g. by wavelength or by astronomical object. Data centers also provide high value-added services (organized information, homogenized and validated data), and interconnection tools, e.g. pointers to images in all available wavelengths, to spectra, or to studies of the galaxy NGC 4321.
the journals disseminate the scientific results, and also act as a moderator in asserting the validity of the published material.
In astronomy, these partners already work in close cooperation: for example, the on-line astronomical journals are interconnected through the ADS service, and large tables or results are published in electronic form by the data centers on behalf of some journals. Many links already exist between these partners, allowing easy navigation between the data, the publications and the archived data.

Interchange standards

The links between the distributed and heterogeneous facilities that exist in astronomy are possible because the partners have adopted some common standards. These standards were defined long ago to solve pragmatic problems in data exchange, and later became `de facto' standards:

FITS, which has been the image and data exchange format in astronomy for over 20 years. Most observatory archives are stored in this form, and many tools have been developed around this format to facilitate data processing and visualisation; see the FITS pages.
bibliographical references are currently summarized by the bibcode used by all partners to refer to published papers in astronomy.
catalogues in tabular form are described by a standard description (see http://vizier.u-strasbg.fr/doc/catstd.htx)
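
As an illustration of how such a compact reference code can be handled by a program, the sketch below splits a bibcode into its fields. It assumes the usual 19-character fixed-width layout (year, journal abbreviation, volume, qualifier, page, first letter of the first author's name, with dots as padding); the function name and the field names in the returned dictionary are hypothetical, and the sample string is only meant to show the format.

```python
def parse_bibcode(bibcode):
    """Split a 19-character bibcode into its fixed-width fields.

    Assumed layout: YYYY JJJJJ VVVV M PPPP A (dots are padding).
    """
    if len(bibcode) != 19:
        raise ValueError("a bibcode is expected to be 19 characters long")
    return {
        "year": int(bibcode[0:4]),
        "journal": bibcode[4:9].strip("."),    # left-justified, dot-padded
        "volume": bibcode[9:13].strip("."),    # right-justified, dot-padded
        "qualifier": bibcode[13].strip("."),   # e.g. "L" for a letters section
        "page": bibcode[14:18].strip("."),     # right-justified, dot-padded
        "author": bibcode[18].strip("."),      # initial of the first author
    }


# Illustrative string in bibcode format
fields = parse_bibcode("1992ApJ...400L...1W")
print(fields)
```

Because every field sits at a fixed position, services can derive links to a paper from the code alone, without any lookup table.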

The links between the different pieces of data were also improved by the GLU system, a tool developed at CDS to maintain the links between cooperating services. It is used, for instance, in the Astrobrowse system to convert queries based on celestial positions into the local dialects used by the various databases, which makes it possible to submit generic queries to heterogeneous remote databases.
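
The idea of translating one generic position query into several local dialects can be sketched as follows. This is not the GLU syntax: the two services, their URLs and their parameter names are hypothetical, chosen only to show one service expecting decimal degrees and another expecting a sexagesimal string with a radius in arcminutes.

```python
from urllib.parse import urlencode


def to_sexagesimal(ra_deg, dec_deg):
    """Format a position as 'hh mm ss.s +dd mm ss' (RA in hours)."""
    # Round once, in integer units, so that 59.96 s never prints as 60.0 s
    total = round(ra_deg / 15.0 * 3600 * 10) % (24 * 3600 * 10)
    h, rem = divmod(total, 36000)
    m, s10 = divmod(rem, 600)
    sign = "-" if dec_deg < 0 else "+"
    d, rem = divmod(round(abs(dec_deg) * 3600), 3600)
    dm, ds = divmod(rem, 60)
    return f"{h:02d} {m:02d} {s10 / 10:04.1f} {sign}{d:02d} {dm:02d} {ds:02d}"


def archive_one(ra, dec, radius):
    # Hypothetical service taking decimal degrees
    return ("http://archive-one.example.org/query",
            {"ra": f"{ra:.5f}", "dec": f"{dec:+.5f}", "sr": f"{radius:.4f}"})


def archive_two(ra, dec, radius):
    # Hypothetical service taking sexagesimal coordinates, radius in arcmin
    return ("http://archive-two.example.org/cgi-bin/find",
            {"coord": to_sexagesimal(ra, dec),
             "radius_arcmin": f"{radius * 60:.2f}"})


DIALECTS = {"archive_one": archive_one, "archive_two": archive_two}


def translate(ra_deg, dec_deg, radius_deg):
    """Turn one generic cone query into one URL per registered service."""
    urls = {}
    for name, build in DIALECTS.items():
        base, params = build(ra_deg, dec_deg, radius_deg)
        urls[name] = base + "?" + urlencode(params)
    return urls


urls = translate(180.0, -30.5, 0.1)
for name, url in urls.items():
    print(name, url)
```

Keeping the dialects in a single registry means that a change in a remote service only requires updating one entry, which is the maintenance problem such a link manager is meant to address.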

Improve the Data Exchanges

A system like Astrobrowse makes it possible to submit a generic query to various databases and presents the heterogeneous results to the end user. The next step is to let a program interpret the results for further processing, e.g. to combine the results coming from the various sources into a more synthetic presentation or a more intuitive visualisation, or to collect related data through newly generated queries.
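
One simple way a program could combine results from two sources is a positional cross-match. The sketch below is a brute-force illustration under stated assumptions: each source returns (name, RA, Dec) tuples in decimal degrees, the object names and positions are invented, and the one-arcsecond tolerance is arbitrary.

```python
import math


def separation_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine formula), inputs in degrees."""
    r1, d1, r2, d2 = (math.radians(v) for v in (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def cross_match(list_a, list_b, tolerance_arcsec=1.0):
    """Pair each object of list_a with its nearest neighbour in list_b,
    keeping only pairs closer than the tolerance."""
    tol = tolerance_arcsec / 3600.0
    pairs = []
    for name_a, ra_a, dec_a in list_a:
        best = None
        for name_b, ra_b, dec_b in list_b:
            sep = separation_deg(ra_a, dec_a, ra_b, dec_b)
            if sep <= tol and (best is None or sep < best[1]):
                best = (name_b, sep)
        if best is not None:
            # Report the separation in arcseconds
            pairs.append((name_a, best[0], best[1] * 3600.0))
    return pairs


# Invented result lists standing in for two remote services
service_a = [("A1", 10.0, 20.0), ("A2", 200.0, -5.0)]
service_b = [("B1", 10.0001, 20.0001), ("B2", 50.0, -10.0)]
matches = cross_match(service_a, service_b)
print(matches)
```

The double loop is only adequate for short result lists; large catalogues need an index on sky position.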

Aladin is an example of a tool which can display astronomical images and locate the objects catalogued in several databases within a single Java applet. Access to heterogeneous image archives is possible because FITS is used (FITS describes the image geometry and its location on the sky); ad-hoc interfaces to the most important catalogues and databases were developed in the Aladin context to mark the sky positions of the objects retrieved from these databases.
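
To make concrete how a FITS header locates an image on the sky, here is a minimal sketch that reads a few header cards and converts a pixel position to celestial coordinates. It assumes an unrotated image in gnomonic (TAN) projection described by the CRPIX, CRVAL and CDELT keywords; the header values are hypothetical, the cards are not padded to the 80 characters of a real file, and the card parser ignores cases such as a slash inside a quoted string. It does not describe how Aladin itself is implemented.

```python
import math

# Hypothetical header cards (keyword in columns 1-8, "= " in columns 9-10)
CARDS = [
    "CTYPE1  = 'RA---TAN'           / gnomonic projection, right ascension",
    "CTYPE2  = 'DEC--TAN'           / gnomonic projection, declination",
    "CRPIX1  =                256.0 / reference pixel along x (1-based)",
    "CRPIX2  =                256.0 / reference pixel along y (1-based)",
    "CRVAL1  =                185.0 / RA at the reference pixel (deg)",
    "CRVAL2  =                 15.0 / Dec at the reference pixel (deg)",
    "CDELT1  =              -0.0005 / deg per pixel along x",
    "CDELT2  =               0.0005 / deg per pixel along y",
]


def parse_cards(cards):
    """Read 'KEYWORD = value / comment' cards into a dictionary."""
    header = {}
    for card in cards:
        key = card[:8].strip()
        value = card[10:].split("/")[0].strip()
        # Quoted values are strings; everything else is read as a number
        header[key] = value.strip("' ") if value.startswith("'") else float(value)
    return header


def pixel_to_sky(header, px, py):
    """Invert the gnomonic projection: pixel (1-based) to (RA, Dec) in degrees."""
    xi = math.radians(header["CDELT1"] * (px - header["CRPIX1"]))
    eta = math.radians(header["CDELT2"] * (py - header["CRPIX2"]))
    ra0 = math.radians(header["CRVAL1"])
    dec0 = math.radians(header["CRVAL2"])
    denom = math.cos(dec0) - eta * math.sin(dec0)
    ra = ra0 + math.atan2(xi, denom)
    dec = math.atan2(math.sin(dec0) + eta * math.cos(dec0), math.hypot(xi, denom))
    return math.degrees(ra) % 360.0, math.degrees(dec)


header = parse_cards(CARDS)
print(pixel_to_sky(header, 256.0, 256.0))   # position of the reference pixel
print(pixel_to_sky(header, 256.0, 257.0))   # one pixel further along y
```

Because the same few keywords describe images from any archive, a catalogue position can be overlaid on any such image by applying the inverse of this transformation.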

XML is an obvious avenue to explore for marking up the key parameters required for a basic interpretation of the objects returned from remote databases. The move to XML in astronomy is currently quite active; for instance, the XML formatting of tabular data is being discussed.
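
As a rough illustration of what marking up the key parameters could bring, the sketch below parses a small XML table in which each column declares its name and unit, so that a program can convert every value to a common unit without human help. The element and attribute names (table, field, row, c, unit) are hypothetical and do not follow any of the formats under discussion; the values are invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: each field carries the metadata needed to interpret it
DOCUMENT = """<table name="demo">
  <field name="ra" unit="deg"/>
  <field name="dec" unit="deg"/>
  <field name="radius" unit="arcmin"/>
  <row><c>185.73</c><c>15.82</c><c>3.0</c></row>
  <row><c>10.68</c><c>41.27</c><c>90.0</c></row>
</table>"""

# Conversion factors from the declared unit to degrees
TO_DEGREES = {"deg": 1.0, "arcmin": 1.0 / 60.0, "arcsec": 1.0 / 3600.0}


def read_table(xml_text):
    """Return the rows as dictionaries, with every angle converted to degrees."""
    root = ET.fromstring(xml_text)
    fields = [(f.get("name"), f.get("unit")) for f in root.findall("field")]
    rows = []
    for row in root.findall("row"):
        values = [float(cell.text) for cell in row.findall("c")]
        rows.append({name: value * TO_DEGREES[unit]
                     for (name, unit), value in zip(fields, values)})
    return rows


rows = read_table(DOCUMENT)
for row in rows:
    print(row)
```

The point of the design is that the unit travels with the data: a client that has never seen this particular service can still interpret the radius column correctly.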

In this context of improving the interpretation of the data, I would like to address the following points:

the usability of large scientific databases by non-specialists: even within one discipline (astronomy) it can be difficult to interpret accurately the results of queries addressed to specialized databases; it is even more difficult to know how to use the results of requests addressed to databases in other disciplines. This raises the question of documentation, but also, for example, of expressing the results in commonly used units.
metadata sharing across disciplines: the exercise is not straightforward, see the ISAIA (Interoperable Systems for Archival Information Access) developments in Space Physics.
XML is currently being investigated as a way to provide metadata in the documents returned from queries, but how easily could cross-disciplinary results be merged?

Management of very large databases

Catalogues of many millions of objects are now common in astronomy (e.g. USNO-A2.0 with ~500×10^6 objects), and still larger catalogues are emerging from large surveys. Various methods for indexing such large catalogues (mainly by location on the sky) are available or in development, and methods developed in the context of object-oriented DBMS could become `de facto' standards in the near future. I suspect that quite similar problems are occurring in other disciplines (e.g. geosciences), and sharing experience would be quite useful.
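
One of the simplest ways to index a catalogue by sky position is to cut the sky into declination zones and keep each zone sorted by right ascension, so that a cone search only reads a small slice of a few zones. The sketch below is an in-memory illustration of that general idea, not the scheme of any particular catalogue or DBMS; the class name, the one-degree zone height and the sample objects are assumptions.

```python
import bisect
import math
from collections import defaultdict


def separation_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine formula), inputs in degrees."""
    r1, d1, r2, d2 = (math.radians(v) for v in (ra1, dec1, ra2, dec2))
    h = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


class ZoneIndex:
    """Declination zones, each holding (ra, dec, id) tuples sorted by RA."""

    def __init__(self, zone_height=1.0):
        self.zone_height = zone_height
        self.zones = defaultdict(list)

    def _zone(self, dec):
        return int((dec + 90.0) // self.zone_height)

    def add(self, ra, dec, ident):
        bisect.insort(self.zones[self._zone(dec)], (ra % 360.0, dec, ident))

    def cone(self, ra0, dec0, radius):
        """Identifiers of all objects within `radius` degrees of (ra0, dec0)."""
        ra0 %= 360.0
        if abs(dec0) + radius >= 90.0:
            windows = [(0.0, 360.0)]           # cone touches a pole: all RA
        else:
            # Largest RA offset reached by the cone at this declination
            half = math.degrees(math.asin(math.sin(math.radians(radius))
                                          / math.cos(math.radians(dec0))))
            lo, hi = (ra0 - half) % 360.0, (ra0 + half) % 360.0
            # Split the window in two when it straddles RA = 0
            windows = [(lo, hi)] if lo <= hi else [(lo, 360.0), (0.0, hi)]
        found = []
        z_lo = self._zone(max(-90.0, dec0 - radius))
        z_hi = self._zone(min(90.0, dec0 + radius))
        for z in range(z_lo, z_hi + 1):
            zone = self.zones.get(z, [])
            for lo, hi in windows:
                i = bisect.bisect_left(zone, (lo,))
                j = bisect.bisect_right(zone, (hi, math.inf))
                for ra, dec, ident in zone[i:j]:
                    # Exact test on the few candidates that survive
                    if separation_deg(ra0, dec0, ra, dec) <= radius:
                        found.append(ident)
        return found


index = ZoneIndex()
index.add(359.99, 0.0, "a")
index.add(10.0, 0.0, "b")
index.add(0.02, 0.01, "c")
print(sorted(index.cone(0.01, 0.0, 0.05)))
```

For a catalogue too large for memory, the same layout maps naturally onto one sorted file or table partition per zone, which is one reason zone-based schemes are attractive for very large surveys.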

Links

The Aladin sky atlas: http://aladin.u-strasbg.fr/aladin.gml
AstroBrowse: http://heasarc.gsfc.nasa.gov/ab/
AstroGLU: http://simbad.u-strasbg.fr/glu/cgi-bin/astroglu.pl
AstroWeb: http://cdsweb.u-strasbg.fr/astroweb.html
BibCode: http://cdsweb.u-strasbg.fr/abstract/simbad/refcode.html
FITS: http://fits.gsfc.nasa.gov/
GLU: http://simbad.u-strasbg.fr/glu/glu.html
ISAIA: http://heasarc.gsfc.nasa.gov/isaia/
USNO-A2.0 catalogue, see e.g. http://vizier.u-strasbg.fr/cgi-bin/VizieR?-source=USNO-A2.0
XML for Astronomy: http://pioneer.gsfc.nasa.gov/public/xml/