The problems associated with the use of very large databases in High Energy Physics (HEP) have already been discussed in other position papers (F. Gagliardi, J. Bunn and H. Newman). The goal is to be able to use data sets in the petabyte range for the treatment and analysis of LHC data. Current experiments such as BaBar at SLAC already use a large database, in the range of 100 TByte per year. Some ideas can already be drawn from this experience concerning possible future developments and expansions.

Data Representation: The current model for data representation in HEP experiments is based on object orientation. This is quite new in the field but has proved efficient and convenient. It is reasonable to think that this model will remain valid in the coming years. The persistency systems used in the existing databases are functional. The question is rather how to access and use these persistent objects.

Data Access: How the data are accessed, and how they are represented, is an important issue. BaBar physicists, for instance, never use the persistent objects directly; these are always converted to their transient form. In this way the database and the analysis world are decoupled, which gives better flexibility and security. It seems clear that a future database development should make it possible to access the data from different sources and in different formats. Data access should be platform independent and language independent. The use of CORBA is very promising in this respect.

Data Distribution: One important characteristic of HEP experiments is the worldwide distribution of the data users. The model in which data are centrally recorded, processed and analyzed will not work for the next generation of experiments. Physicists will want to access the data from their home institutes with the same functionality and level of detail as those located near the experiment.
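
The separation between persistent and transient objects described under Data Access can be illustrated with a minimal Python sketch. It is not BaBar code: the class names `PersistentTrack` and `TransientTrack`, the JSON payload, and the two conversion functions are assumptions introduced only for illustration.

```python
import json
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PersistentTrack:
    """Hypothetical storage-side record, shaped for the database."""
    record_id: int
    payload: str  # JSON-encoded momentum components (assumed layout)


@dataclass
class TransientTrack:
    """Hypothetical analysis-side object; it knows nothing about storage."""
    px: float
    py: float
    pz: float

    def momentum(self) -> float:
        # Magnitude of the momentum vector.
        return math.sqrt(self.px ** 2 + self.py ** 2 + self.pz ** 2)


def to_transient(record: PersistentTrack) -> TransientTrack:
    """Convert a stored record into the object used by analysis code."""
    fields = json.loads(record.payload)
    return TransientTrack(px=fields["px"], py=fields["py"], pz=fields["pz"])


def to_persistent(record_id: int, track: TransientTrack) -> PersistentTrack:
    """Convert an analysis object back into its stored form."""
    payload = json.dumps({"px": track.px, "py": track.py, "pz": track.pz})
    return PersistentTrack(record_id=record_id, payload=payload)
```

In such a layout, analysis code would only ever see `TransientTrack`, so the storage format could change without touching it, provided the two conversion functions are updated.
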
At the moment, the important phases of data processing, which transform the data into a format suitable for analysis, are still done centrally. It would be very interesting to be able to distribute these tasks among the major computing centers participating in the experiment all over the world. The LHC experiments have foreseen a mechanism for automatic duplication and replication of the data, but this model is not working yet, and for now the data must be distributed through very heavy procedures. This area needs a lot of development.

Networking: The use of very large databases by a large user community implies excellent networking. The design of a database system should include wide area networking capabilities. It is necessary to access the data locally when they are available and to be able to fetch them from a distant site when they are not. This capability should be transparent to the users. These ideas are not new but, to my knowledge, no satisfactory solution exists. It should be noted that the recent evolution of the network itself, even between the US and Europe, would already allow this kind of model (at least for the current generation of experiments). The problems lie mostly on the database design side.

Mass Storage System: Speaking about large databases leads immediately to the problem of storage. It is impractical at the moment (and probably for the coming years) to store all the data on disk. It is not even useful to do so, as a large fraction of the data is rarely accessed. The design of a mass storage system adapted to the needs of the database should proceed in parallel with the database development. Some solutions already exist (such as HPSS in HEP), but they are not really adapted to the problem and are too disconnected from the database system itself.

Computing Infrastructure: Having good database software is not sufficient. It must rely on a computing infrastructure (computers, network, storage devices, servers, etc.) capable of sustaining high I/O rates, simultaneous access, large amounts of data, and so on. A database development project should include this aspect from the beginning.

Test Bench: The problem with current, real-life database implementations is that the real problems are discovered only when the database is put into production. It would be extremely useful to design a test bench system capable of evaluating a given database under realistic conditions and of trying different hardware configurations. This test bench should be able to simulate a very large amount of data (scalability tests) and a large number of users accessing the data at the same time, in order to identify possible bottlenecks. This kind of development would probably fit easily into a collaboration between the US and the EU, as it could also test the important aspects of data distribution and of network relations between databases.

Relations with Commercial Software: It remains an open question whether it is better to base a database development on existing commercial software or to design a new system from scratch, specific to a given application. If a commercial product is used, it is vital to establish a close collaboration with the company and to have full access to the source code.
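
As an illustration of the test bench idea discussed above, the following Python sketch simulates several users reading concurrently from a store and reports latency figures. Everything in it is an assumption made for illustration: `ToyStore` is a hypothetical in-memory stand-in for the database under test, the semaphore mimics a limited number of parallel I/O channels, and the 1 ms sleep stands in for real I/O cost.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class ToyStore:
    """Hypothetical stand-in for the database under test."""

    def __init__(self, n_objects: int, max_parallel_reads: int):
        self._data = {i: i * i for i in range(n_objects)}
        # Limits how many reads proceed at once (assumed bottleneck).
        self._slots = threading.Semaphore(max_parallel_reads)

    def read(self, key: int) -> int:
        with self._slots:
            time.sleep(0.001)  # pretend I/O cost per read
            return self._data[key]


def run_load(store: ToyStore, n_users: int, reads_per_user: int,
             n_objects: int) -> dict:
    """Run n_users concurrent readers and collect per-read latencies."""

    def user(uid: int) -> list:
        latencies = []
        for i in range(reads_per_user):
            key = (uid * reads_per_user + i) % n_objects
            start = time.perf_counter()
            store.read(key)
            latencies.append(time.perf_counter() - start)
        return latencies

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_users) as pool:
        per_user = list(pool.map(user, range(n_users)))
    elapsed = time.perf_counter() - start

    all_latencies = [x for lat in per_user for x in lat]
    return {
        "reads": len(all_latencies),
        "elapsed_s": elapsed,
        "mean_latency_s": sum(all_latencies) / len(all_latencies),
        "max_latency_s": max(all_latencies),
    }
```

Calling `run_load` repeatedly with an increasing `n_users` while keeping `max_parallel_reads` fixed would show the point at which latency starts to grow, which is the kind of bottleneck such a test bench is meant to expose. A realistic bench would replace `ToyStore` with the actual database client and hardware configuration under study.
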