Data routes in Astronomy
François Ochsenbein, Centre de Données astronomiques de Strasbourg
The existing pieces
Astronomy has a very long history of preserving observational results: only photons can
currently be observed and analyzed in astronomy, with no possible interaction with their
source.
The information currently available in astronomy consists essentially of:
 | observational data and their derived data: observatory archives (e.g. the HST Archive),
recent and old sky surveys (e.g. Palomar sky survey, 2MASS, SDSS...), images, catalogues,
spectra, etc. The size of these data sets is increasing at an unprecedented rate
(gigabytes of daily archives, catalogues of up to 5×10^8 objects...) |
 | bibliography (published material), traditionally used to communicate the results and the
interpretations of the observations among scientists. All the leading astronomical
journals are currently published electronically and are available on-line. |
 | compilation databases and catalogues, gathering processed data from several archives,
catalogues, databases, and publications, organized for specific purposes (e.g.
extragalactic objects in NED, non-solar-system objects in SIMBAD, high-energy data in
HEASARC, astronomical literature in ADS, etc...) |
 | astronomically-oriented software, results of simulations |
 | yellow pages (e.g. AstroWeb), personal pages, etc... |
The bulk of the existing information is currently managed by a few key partners:
 | data producers (e.g. the STScI operating the HST) store the observations in
observatory archives, close to the instrument collecting the data. |
 | data centers provide access to the information in an organized way, e.g. by
wavelength or by astronomical object. Data centers also provide high value-added services
(organized information, homogenized and validated data), and interconnection tools, e.g.
pointers to images in all available wavelengths, to spectra, to studies of the galaxy NGC
4321, etc. |
 | the journals disseminate the scientific results, and also act as a moderator in
asserting the validity of the published material. |
 | In astronomy, these partners are already working in close cooperation: for example, the
on-line astronomical journals are interconnected through the ADS service, and large
tables of results are published in electronic form by the data centers on behalf
of some journals. Many links already exist between these partners, allowing easy
navigation between the data, the publications and the archived data. |
Interchange standards
The links between the distributed and heterogeneous facilities that exist in astronomy
are possible because the partners have adopted some standards. These standards were
defined a long time ago to solve pragmatic problems in data exchange, and later became
`de facto' standards:
 | FITS, the image and data exchange format that has been in use in astronomy
for over 20 years. Most observatory archives are stored in this form, and many tools
have been developed to facilitate data processing and visualisation around this format,
see the FITS pages |
 | bibliographical references are summarized by the bibcode, which all partners use
to refer to published papers in astronomy. |
 | catalogues in tabular form are described by a standard description (see http://vizier.u-strasbg.fr/doc/catstd.htx)
|
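As an illustration of how such a shared convention helps programs as well as people, here is a minimal sketch of a bibcode parser. It assumes the usual 19-character layout (year, journal, volume, qualifier, page, first author's initial, with dots as padding); the class and function names are hypothetical and do not belong to any existing service.

```python
from dataclasses import dataclass


@dataclass
class Bibcode:
    year: int
    journal: str
    volume: str
    qualifier: str
    page: str
    author_initial: str


def parse_bibcode(code: str) -> Bibcode:
    """Split a bibcode into its fields.

    Assumed layout (19 characters): YYYY JJJJJ VVVV M PPPP A,
    where '.' is used as a padding character.
    """
    if len(code) != 19:
        raise ValueError(f"expected 19 characters, got {len(code)}")
    return Bibcode(
        year=int(code[0:4]),
        journal=code[4:9].strip("."),   # journal code, padded on the right
        volume=code[9:13].strip("."),   # volume, padded on the left
        qualifier=code[13],             # e.g. 'L' for a letter, '.' if none
        page=code[14:18].strip("."),    # page, padded on the left
        author_initial=code[18],
    )


if __name__ == "__main__":
    print(parse_bibcode("1992ApJ...400L...1W"))
```

Because every partner uses the same fixed-width code, a link between a catalogue and the corresponding paper needs no more than this kind of string handling.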
The links between the different pieces of data were also improved by the GLU system, a
tool developed at CDS which makes it possible to maintain the links between cooperating
services. It is for instance used in the Astrobrowse system to convert queries based on
celestial positions into any of the local dialects used by the various databases, so that
generic queries can be submitted to heterogeneous remote databases.
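The idea of translating one generic position query into several local dialects can be sketched as follows. This is not the GLU dictionary syntax: the two services, their URLs and their parameter names are invented for the illustration, and only the principle (one query, several formatters) is taken from the text above.

```python
from urllib.parse import urlencode


def to_sexagesimal(ra_deg: float, dec_deg: float) -> str:
    """Format a position as 'hh mm ss.ss +dd mm ss.s'."""
    # Work in integer hundredths of a second to avoid printing '60.00'.
    ra_cs = round(ra_deg / 15.0 * 3600 * 100) % (24 * 3600 * 100)
    h, rem = divmod(ra_cs, 3600 * 100)
    m, cs = divmod(rem, 60 * 100)
    sign = "-" if dec_deg < 0 else "+"
    dec_ds = round(abs(dec_deg) * 3600 * 10)  # tenths of an arcsecond
    d, rem = divmod(dec_ds, 3600 * 10)
    am, ds = divmod(rem, 60 * 10)
    return (f"{h:02d} {m:02d} {cs / 100:05.2f} "
            f"{sign}{d:02d} {am:02d} {ds / 10:04.1f}")


def decimal_dialect(ra_deg, dec_deg, radius_arcmin):
    # Hypothetical service expecting decimal degrees everywhere.
    return {"ra": f"{ra_deg:.5f}", "dec": f"{dec_deg:.5f}",
            "sr": f"{radius_arcmin / 60.0:.5f}"}


def sexagesimal_dialect(ra_deg, dec_deg, radius_arcmin):
    # Hypothetical service expecting one sexagesimal string and arcminutes.
    return {"pos": to_sexagesimal(ra_deg, dec_deg),
            "radius": f"{radius_arcmin:.1f}"}


# Hypothetical registry: service name -> (base URL, dialect formatter).
SERVICES = {
    "decimal_service": ("http://example.org/decimal", decimal_dialect),
    "sexagesimal_service": ("http://example.org/sexa", sexagesimal_dialect),
}


def translate(ra_deg, dec_deg, radius_arcmin):
    """Return one query URL per registered service for a generic query."""
    return {
        name: base + "?" + urlencode(dialect(ra_deg, dec_deg, radius_arcmin))
        for name, (base, dialect) in SERVICES.items()
    }


if __name__ == "__main__":
    for name, url in translate(185.75, 15.5, 5.0).items():
        print(name, url)
```

Adding a new database then only means registering one more formatter; the generic query itself does not change.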
Improve the Data Exchanges
A system like Astrobrowse makes it possible to submit a generic query to various databases
and presents the heterogeneous results to the end user. The next step is to let a program
interpret the result for further processing, e.g. to combine the results coming from the
various sources for a more synthesized presentation or more intuitive visualisation, or to
collect the related data through newly generated queries.
Aladin is an example of a tool which can display astronomical images and locate the
objects catalogued in several databases in a single Java applet. The access to
heterogeneous image archives is possible because FITS is used (FITS describes the image
geometry and its location on the sky); ad-hoc interfaces to the most important catalogues
and databases were developed in the Aladin context to mark up the positions on the sky of
the objects retrieved from these databases.
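To make concrete what "describes the image geometry and its location on the sky" buys a program, here is a deliberately simplified sketch that turns a pixel position into sky coordinates from a few standard FITS header keywords (CRPIX, CRVAL, CDELT). It is a small-field linear approximation that ignores rotation and the actual projection, it is not how Aladin does it, and the header values are invented.

```python
import math


def pixel_to_sky(header: dict, x: float, y: float) -> tuple:
    """Approximate (ra, dec) in degrees for pixel (x, y).

    Assumptions of this sketch: no rotation, small field of view,
    reference pixel away from the celestial poles.
    """
    # Offsets from the reference pixel, converted to degrees.
    dx = (x - header["CRPIX1"]) * header["CDELT1"]
    dy = (y - header["CRPIX2"]) * header["CDELT2"]
    dec = header["CRVAL2"] + dy
    # An offset in right ascension is stretched by 1/cos(dec).
    ra = header["CRVAL1"] + dx / math.cos(math.radians(header["CRVAL2"]))
    return ra % 360.0, dec


# Invented header values for the illustration.
HEADER = {
    "CRPIX1": 512.0, "CRPIX2": 512.0,   # reference pixel
    "CRVAL1": 180.0, "CRVAL2": 0.0,     # sky position of that pixel (deg)
    "CDELT1": -0.001, "CDELT2": 0.001,  # degrees per pixel
}

if __name__ == "__main__":
    print(pixel_to_sky(HEADER, 512, 512))
    print(pixel_to_sky(HEADER, 612, 412))
```

Since every archive stores the same keywords, the same few lines apply to images from any of them, which is what allows catalogue positions to be overlaid on heterogeneous images.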
XML is an obvious avenue to explore for marking up the key parameters required for a basic
interpretation of the objects returned from remote databases. The move to XML in
astronomy is currently quite active; for instance, the XML formatting of tabular data
is being discussed.
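A minimal sketch of what such a marked-up table could look like, and of how a program could read it back, is given below. The element and attribute names (table, field, row, cell, unit) are invented for this illustration and are not a proposed or existing standard; the point is only that column names and units travel with the data.

```python
import xml.etree.ElementTree as ET


def table_to_xml(name, fields, rows):
    """Serialize a table; `fields` is a list of (column name, unit) pairs."""
    table = ET.Element("table", {"name": name})
    for column, unit in fields:
        ET.SubElement(table, "field", {"name": column, "unit": unit})
    for row in rows:
        row_element = ET.SubElement(table, "row")
        for value in row:
            ET.SubElement(row_element, "cell").text = str(value)
    return ET.tostring(table, encoding="unicode")


def xml_to_records(xml_text):
    """Read the table back as a list of {column: (text value, unit)}."""
    table = ET.fromstring(xml_text)
    fields = [(f.get("name"), f.get("unit")) for f in table.findall("field")]
    records = []
    for row_element in table.findall("row"):
        cells = [c.text for c in row_element.findall("cell")]
        records.append({column: (text, unit)
                        for (column, unit), text in zip(fields, cells)})
    return records


if __name__ == "__main__":
    # Invented sample rows.
    xml_text = table_to_xml(
        "sample",
        [("ra", "deg"), ("dec", "deg"), ("vmag", "mag")],
        [(185.75, 15.5, 10.1), (10.0, 20.0, 12.3)],
    )
    print(xml_text)
    print(xml_to_records(xml_text))
```

A client receiving this document does not need to know the database's local conventions to find out which column holds a position and in which unit it is expressed.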
In this context of improving the interpretation of the data, I would like to address
the following points:
 | the usability of large scientific databases by non-specialists: even within one
discipline (astronomy) it can be difficult to interpret accurately the results of queries
addressed to specialized databases; it is even more difficult to know how to use the
results of requests addressed to databases in other disciplines. This raises the
question of documentation, but also, for example, of expressing the results in commonly
used units.
|
 | metadata sharing across disciplines: the exercise is not straightforward, see the ISAIA
(Interoperable Systems for Archival Information Access) developments in Space Physics. |
 | XML is currently being investigated as a way to provide metadata in the documents
returned from queries, but how easily could the results of cross-disciplinary documents be
merged? |
Management of very large databases
Multi-million catalogues are now common in astronomy (e.g. USNO-A2.0 with ~500×10^6
objects), and larger catalogues derived from large surveys are coming up. Various methods
for indexing such large catalogues (mainly by position on the sky) are available or
in development, and methods developed in the context of object-oriented DBMS could become
`de facto' standards in the near future. I guess quite similar problems are occurring in
other disciplines (e.g. geosciences), and sharing the experiences would be quite useful.
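One of the simplest ways of indexing a catalogue by position can be sketched as follows: objects are grouped into declination zones, and a search around a position only inspects the zones that the search circle can reach. This is only an illustration of the principle, with an arbitrary zone height and an in-memory dictionary; it is not the scheme used by any particular catalogue server.

```python
import math
from collections import defaultdict

ZONE_HEIGHT_DEG = 1.0  # arbitrary choice for this sketch


def zone_of(dec_deg: float) -> int:
    """Index of the declination zone containing dec_deg."""
    return int(math.floor((dec_deg + 90.0) / ZONE_HEIGHT_DEG))


def build_index(objects):
    """Group (name, ra, dec) tuples by declination zone."""
    index = defaultdict(list)
    for name, ra, dec in objects:
        index[zone_of(dec)].append((name, ra, dec))
    return index


def separation_deg(ra1, dec1, ra2, dec2):
    """Angular separation in degrees (haversine formula)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a))))


def cone_search(index, ra, dec, radius_deg):
    """Names of the objects within radius_deg of (ra, dec), sorted."""
    low = zone_of(max(-90.0, dec - radius_deg))
    high = zone_of(min(90.0, dec + radius_deg))
    hits = []
    for zone in range(low, high + 1):
        # Only objects in the reachable zones are compared exactly.
        for name, obj_ra, obj_dec in index.get(zone, ()):
            if separation_deg(ra, dec, obj_ra, obj_dec) <= radius_deg:
                hits.append(name)
    return sorted(hits)


if __name__ == "__main__":
    # Invented positions, including two objects on either side of RA = 0.
    catalogue = [("a", 10.0, 20.0), ("b", 10.1, 20.9), ("c", 200.0, 20.0),
                 ("d", 359.9, -0.05), ("e", 0.1, 0.05)]
    index = build_index(catalogue)
    print(cone_search(index, 10.0, 20.5, 1.0))
    print(cone_search(index, 0.0, 0.0, 0.5))
```

For a catalogue of hundreds of millions of objects the zones would live on disk and be subdivided in right ascension as well, but the two-step pattern (cheap selection of candidates, then exact distance test) stays the same.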