Thrust Areas for EU/US collaboration
Roy Williams, August 1999

Federation of Data Collections

Science proceeds through the unification of experimental data into a coherent whole. In the same way, collaboration on large scientific databases should emphasize the federation of multiple databases. We should encourage projects with the following ingredients:

Multiple, independent databases, preferably curated by different groups of people, preferably in different countries, preferably culturally separated,
A really good scientific justification for federating the databases,
An architecture that encourages access by a set of people that is not just the architects of the system.

Authentication

In academia, authentication is treated as a sticky, difficult problem that is often left to the end of the project and then ignored; and yet a successful authentication framework is one that is built into the system from the beginning. Without authentication, we cannot rise above "toy data": we cannot ingest, process, or deliver the data that real scientists are interested in, because such data is generally not public. Clearly, in this age of hackers, security through obscurity is insufficient.

We should encourage projects that demonstrate easy-to-use yet strong authentication schemes, including ways to issue usage permissions to valid users, to log usage, and to provide different levels of authentication. A most important facet of this problem is authentication in distributed systems, so that a user needs to log in only once, yet multiple, heterogeneous services can be instantiated as a result. In other words, we need models by which one private service, accessed by a user of a given authentication level, can access another private service at the same level.
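The single log-in model described above can be sketched as follows. This is only an illustration under stated assumptions, not a recommended security design: the shared secret, the numeric levels, and the names issue_token, verify_token, archive_service, and analysis_service are all hypothetical. A token is signed once at log-in, and one private service forwards the same token when it calls another, so both check the same authentication level and both log the usage.

```python
import hashlib
import hmac
import json

# Hypothetical secret shared by the log-in authority and the services.
SECRET = b"demo-secret-not-for-production"

# Hypothetical in-memory usage log: (user, service, dataset) tuples.
USAGE_LOG = []


def issue_token(user, level):
    """Issue a signed token once, at log-in time."""
    payload = json.dumps({"user": user, "level": level}, sort_keys=True)
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return payload + "." + sig


def verify_token(token):
    """Check the signature and return the claims, or raise PermissionError."""
    payload, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise PermissionError("invalid token")
    return json.loads(payload)


def require_level(token, service, dataset, minimum):
    """Verify the token, enforce the level, and log the usage."""
    claims = verify_token(token)
    if claims["level"] < minimum:
        raise PermissionError(f"{service}: level {minimum} required")
    USAGE_LOG.append((claims["user"], service, dataset))
    return claims


def archive_service(token, dataset):
    """A private service holding non-public data (assumed level 2)."""
    require_level(token, "archive", dataset, minimum=2)
    return f"raw:{dataset}"


def analysis_service(token, dataset):
    """A second private service that calls the first on the user's behalf."""
    require_level(token, "analysis", dataset, minimum=2)
    raw = archive_service(token, dataset)  # forward the same token
    return f"summary({raw})"


token = issue_token("alice", 2)
print(analysis_service(token, "survey"))
```

The user logs in once; the analysis service never sees a password, only the signed token, which it passes on unchanged. A real system would more likely use public-key certificates and expiry times than a shared secret.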

Standard Objects and Services for Science

In the past there has been much work on building standard file formats for representing scientific data, for example HDF and netCDF, as well as proprietary formats such as those of Matlab, Excel, IDL, and others. We can think of these as serializations of data objects. There are also new ways to encapsulate such files and to extend them with the addition of metadata, using MIME and XML technologies. At the same time, distributed object systems such as CORBA, Java RMI, and Voyager are allowing trusted systems to exchange objects directly. We should encourage projects which define these objects for particular disciplines, and which provide the software to create, transform, and combine them.

Such hierarchical collections of objects include arrays, parameters, relational tables, human-readable documents, code fragments and agents, authentication certificates, and query objects. Once an object and its meaning are defined, it is always important

to be able to refine or extend the object to something more specific and precise,
to be able to express that the object is a member of a more general class,
to know when the object is syntactically and semantically valid, and when the validation takes place,
to know how exceptions and diagnostics get back to the human client.
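The four requirements above can be illustrated with a small class hierarchy. This is a sketch under assumptions: the class names DataObject, Array, and Spectrum, and the rule that a spectrum has non-negative intensities, are invented for the example and do not come from any existing standard.

```python
class DataObject:
    """Most general class: a named object that can validate itself."""

    def __init__(self, name):
        self.name = name

    def validate(self):
        """Return human-readable diagnostics; an empty list means valid."""
        problems = []
        if not self.name:
            problems.append("object has no name")
        return problems


class Array(DataObject):
    """Refinement of DataObject: a sequence of numbers."""

    def __init__(self, name, values):
        super().__init__(name)
        self.values = list(values)

    def validate(self):
        problems = super().validate()
        # Syntactic rule: every entry must be a number.
        if not all(isinstance(v, (int, float)) for v in self.values):
            problems.append(f"{self.name}: non-numeric entry")
        return problems


class Spectrum(Array):
    """Refinement of Array with a discipline-specific (semantic) rule."""

    def validate(self):
        problems = super().validate()
        # Semantic rule (assumed): intensities cannot be negative.
        if any(isinstance(v, (int, float)) and v < 0 for v in self.values):
            problems.append(f"{self.name}: negative intensity")
        return problems


spectrum = Spectrum("flux", [1.0, -2.0])
print(isinstance(spectrum, DataObject))  # membership in the general class
for message in spectrum.validate():      # diagnostics for the human client
    print(message)
```

Validation here happens only when validate() is called, so the caller decides when it takes place, and the diagnostics come back as plain sentences rather than as exceptions buried inside the system.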

In a distributed system, emphasis shifts from objects to services. A request object is sent to the service, and a response object is returned. Potential users of the service need to know that it exists, presumably through a discovery service, and they need to know how to use it: how to construct a request and what kinds of response objects are available. Services should be designed for use by either a human or a machine, meaning that the response may be cast as a structured document that the (machine) client can interpret.
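As a minimal sketch of this request and response pattern, the following runs entirely in one process and uses XML for both documents. The registry, the element names request, value, response, result, and error, and the "mean" service are all assumptions made for illustration; a real deployment would place the registry and the service on separate machines.

```python
import xml.etree.ElementTree as ET

# Hypothetical discovery registry: service name -> description and handler.
REGISTRY = {}


def register(name, description, handler):
    """Advertise a service so that clients can find it."""
    REGISTRY[name] = {"description": description, "handler": handler}


def discover():
    """Tell a client which services exist and how to use them."""
    return {name: entry["description"] for name, entry in REGISTRY.items()}


def mean_service(request):
    """Handle a <request> element and build a <response> element."""
    values = [float(v.text) for v in request.findall("value")]
    response = ET.Element("response", service="mean")
    if not values:
        # Diagnostics travel back inside the structured response.
        ET.SubElement(response, "error").text = "no values supplied"
    else:
        ET.SubElement(response, "result").text = repr(sum(values) / len(values))
    return response


def call(name, request_xml):
    """Send a request document to a named service; return the reply text."""
    request = ET.fromstring(request_xml)
    response = REGISTRY[name]["handler"](request)
    return ET.tostring(response, encoding="unicode")


register("mean", "Mean of the <value> elements in a <request>", mean_service)

print(discover())
reply = call("mean", "<request><value>1</value><value>2</value><value>3</value></request>")
print(reply)
# A machine client parses the same document that a human could read.
print(ET.fromstring(reply).findtext("result"))
```

Because the response is a structured document, the same reply can be shown to a person or parsed by a program, and an error is reported in the same form as a result.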

E-commerce tools

In the academic community, we must not insulate ourselves from the enormous Internet industry and what it can offer us. Business software, at its best, is cheap (compared to a graduate student, or a supercomputer), well-documented, and robust. Unlike home-made software, it gains new versions with new features regularly, with no coding effort on our part. All we need to do, for long-term projects, is to insulate ourselves from reliance on a single vendor by using open interfaces.

We should consider supporting academic projects that are closely partnered with industry, especially when they are on opposite sides of the Atlantic. The support is definitely not a subsidy to the business plan of the industrial partner, but rather should fund the insulation of the academic enterprise from the industrial partner! Specifically, we should fund the development of an open interface and the corresponding broker software, by which the two can effectively collaborate, and so that others can also join the enterprise.

Multilingual Interfaces

Obviously there are many languages in Europe, but this is also true in the US. We should support projects that allow multilingual interfaces, perhaps through translation, or even by simple mechanisms such as different words written on the GUI components. XML is a technology designed for flexible presentation of structured data: we can use this flexibility to provide a language-specific interface. We could also consider projects that can utilize automatic, perhaps private, translation services, thus allowing outsourcing of the translation. We could encourage projects that mark up text for translation or that define open interfaces for the exchange of the knowledge bases and ontologies that are used in machine translation.
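The simplest mechanism mentioned above, different words on the GUI components, might look like the following sketch. The XML layout (labels, label, and text elements with a lang attribute) and the function load_labels are assumptions for illustration, not an existing format.

```python
import xml.etree.ElementTree as ET

# Hypothetical label catalogue; the element and attribute names are invented.
LABELS_XML = """
<labels>
  <label id="search">
    <text lang="en">Search</text>
    <text lang="fr">Rechercher</text>
    <text lang="de">Suchen</text>
  </label>
  <label id="download">
    <text lang="en">Download</text>
    <text lang="de">Herunterladen</text>
  </label>
</labels>
"""


def load_labels(xml_text, lang, fallback="en"):
    """Map each GUI component id to its text in the requested language."""
    labels = {}
    for label in ET.fromstring(xml_text).findall("label"):
        texts = {t.get("lang"): t.text for t in label.findall("text")}
        # Use the fallback language when no translation exists yet.
        labels[label.get("id")] = texts.get(lang, texts.get(fallback))
    return labels


print(load_labels(LABELS_XML, "de"))
print(load_labels(LABELS_XML, "fr"))  # "download" falls back to English
```

Keeping the words in a separate structured file means that a translator, or an external translation service, can add a language without touching the interface code.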

Scalability

Computing infrastructure is like a food pyramid. PCs and workstations with business software are the base layer, like rice and pasta; installed, specialized software brings us to the next level (fruits and vegetables); remote machines and servers are at the third level (meat, eggs, milk); and at the tip are supercomputers and tape robots (chocolate). We should be interested in projects that address the concerns of all levels of the infrastructure, so that a user can learn at the lowest level, then move up if necessary.

Computation ranges in complexity from Java applets and Excel spreadsheets, to scientific software installed or compiled on a workstation, to simple services on remote machines, to broker services that schedule or farm out compute tasks, to advanced architecture supercomputers, finally adding real-time steering and diagnostics.
Communication begins with text-based files that can be printed out, then moves to larger binary files, to services that dynamically create data objects on request, and finally to cache maintenance and parallel, high-performance data links.
Visualization begins with a table of numbers, then moves to gnuplot-style line graphs, to images, to 3D representations, and finally to immersive environments with funny glasses.

In each case, we must be careful not to address only the high end, the chocolate of the food pyramid; the emphasis must be on balance.

References

Extensible Scientific Interchange Language (XSIL): http://www.cacr.caltech.edu/XSIL

Interfaces to Scientific Data Archives, an NSF Workshop: http://www.cacr.caltech.edu/isda