Project DataSpace: US/European Collaboration on Distributed Mining of Large Data Sets on the Next Generation Internet

Traditional computational techniques and computer architectures are now routinely overwhelmed by the sheer volume and complexity of information generated by data-gathering instruments, computational and experimental methodologies, and business operations. The fundamental problem of extracting knowledge and insight from massive databases and data sets is shared across a wide range of fields in business, academia and government. The new field of Data Mining and Knowledge Discovery in Databases (KDD) has arisen as an interdisciplinary response to this situation, merging ideas and techniques drawn from disciplines such as statistics, pattern recognition, machine learning, databases, visualisation and high performance computing.

One major challenge for the data mining community is to provide an environment for mining large-scale distributed data. Today's data mining tools can deal with only moderate amounts of data, in the range of several million data items. Furthermore, they tend to be offered as stand-alone or front-end applications. This raises two issues: how to increase the computational capacity of data mining systems, and how to deliver data mining solutions effectively and integrate them into the business or scientific process.

Moreover, large-scale data sets are almost always logically and physically distributed, and organisations that are geographically distributed need a de-centralised approach to decision support. The issues in enterprise data mining are therefore not just the size of the data to be mined but also its distributed nature. This is particularly important for scientific data analysis, where the data may be scattered around the world.

Project DataSpace is a global collaboration supported by the Terabyte Challenge Project, which is funded by the NSF. The aim of the project is to develop an infrastructure for global distributed data mining.
The data mining group at Imperial College is a key partner in the project. The research in the project includes:

1. Protocols: investigating network protocols and quality-of-service (QoS) requirements for moving data over WANs with differing communication parameters.
2. Servers: investigating the architecture for scalable data servers in the NGI environment.
3. Languages: designing XML-based languages to support a collaborative data mining process across the NGI environment. These include model representation languages and languages for expressing sufficient statistics, metadata and so on.
4. Distributed Data Mining Algorithms: investigating new data mining algorithms that allow distributed data to be mined in place. The results of local mining can then be combined to form the final result.

Information about Project DataSpace:
Project Director: Prof. Robert Grossman, grossman@lac.uic.edu
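The idea behind research items 3 and 4 above, mining data in place and exchanging only sufficient statistics, can be illustrated with a minimal Python sketch. It is not the project's algorithm or language: the `SufficientStats` class, the `combine` helper and the XML element and attribute names (`SufficientStats`, `n`, `sum`, `sumSq`) are all hypothetical, chosen only for illustration. Each site summarises its local data as a count, a sum and a sum of squares, ships that summary as a small XML message, and a coordinator merges the messages to recover the global mean and variance without moving the raw data.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class SufficientStats:
    """Count, sum and sum of squares: enough to recover mean and variance."""
    n: int = 0
    total: float = 0.0
    total_sq: float = 0.0

    @classmethod
    def from_values(cls, values):
        # Computed locally at each site; the raw values never leave the site.
        values = list(values)
        return cls(len(values), sum(values), sum(v * v for v in values))

    def merge(self, other):
        # Sufficient statistics of the union are the sums of the parts.
        return SufficientStats(
            self.n + other.n,
            self.total + other.total,
            self.total_sq + other.total_sq,
        )

    def mean(self):
        return self.total / self.n

    def variance(self):
        # Population variance, E[x^2] - (E[x])^2.
        m = self.mean()
        return self.total_sq / self.n - m * m

    def to_xml(self):
        # Hypothetical message format, invented for this sketch.
        elem = ET.Element(
            "SufficientStats",
            {"n": str(self.n), "sum": repr(self.total), "sumSq": repr(self.total_sq)},
        )
        return ET.tostring(elem, encoding="unicode")

    @classmethod
    def from_xml(cls, text):
        elem = ET.fromstring(text)
        return cls(int(elem.get("n")), float(elem.get("sum")), float(elem.get("sumSq")))


def combine(messages):
    """Merge the XML summaries received from all sites into one result."""
    combined = SufficientStats()
    for message in messages:
        combined = combined.merge(SufficientStats.from_xml(message))
    return combined


if __name__ == "__main__":
    site_a = [1.0, 2.0, 3.0]   # data held at one site
    site_b = [4.0, 5.0]        # data held at another site
    messages = [SufficientStats.from_values(s).to_xml() for s in (site_a, site_b)]
    result = combine(messages)
    print(result.mean(), result.variance())
```

Only three numbers per site cross the network, regardless of how many records each site holds. The sum-of-squares form of the variance is used here for brevity; it can lose precision on large or badly scaled data, so a real system would likely prefer a numerically stabler merge rule.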
Data Mining Research Group at Imperial College