|
|
One advantage of writing one's position paper rather late is that many people have already made theirs available, so one can refer to them. The three main challenges I see in the near future are: 1) Large datasets: the hundreds of petabytes of data that Bunn and Neumann talk about are far from easy to handle. Even if one only wanted to select a small random sample for some type of analysis, current commercial DBMSs would take more time than the analyst can afford. Perhaps even more crucially, the database would need to employ data structures that scale to such sizes. 2) Federated data: as has been pointed out by others, not all scientific data sets are necessarily very large. But biologists routinely access many different databases to get the data they need; unfortunately, there are no standards for doing so. 3) Data mining: mining possibly very large and distributed databases via the web is far from standard. Most tools rely on single tables, and "large" in practice seems to mean in the gigabyte range; i.e., the databases often fit in main memory. This gives rise to an avalanche of research questions, ranging from database primitives geared towards the support of scientific data mining, via the scaling of algorithms (such as bump hunting) to these sizes, to the design of specialized algorithms (e.g., what is the best way to cluster genetic data?). The scientists who produce the data in the US and the EU collaborate. We, as computer scientists who support their efforts, should also collaborate. |