Q. What is DataSpace?
A. DataSpace is an infrastructure for creating a data web. Just as the web today enables easy access to remote multimedia documents, a data web enables easy access to remote and distributed data. DataSpace is based upon open standards, such as XML and web services.
Q. What does the infrastructure consist of?
A. The DataSpace infrastructure consists of:
- XML languages for metadata, such as the Predictive Model Markup Language (PMML).
- The dataspace transfer protocol (DSTP), a protocol for moving data over the web.
- PSockets and SABUL, protocols for connecting clients and servers over high performance wide area networks.
- Open source DSTP servers for making data easily available to visitors to the data web.
- Open source DSTP clients.
Q. What is a data web?
A. Roughly speaking, data webs provide access to remote and distributed data, just as the web today provides access to remote documents. By comparison, data grids loosely couple computational resources over high performance networks to create virtual supercomputers, and the semantic web is designed to enable working with knowledge defined using W3C's RDF and related standards.
| | View | Mine/Discover | Compute |
|---|---|---|---|
| Knowledge | Digital Libraries | Knowledge Mining | Semantic Webs |
| Attributes/Columns | Basic Data Webs | Data Webs | Data Grids |
| Files | Persistent Archives | Distributed Data Mining | Grids |
Table 1. Data webs, data grids, and semantic webs can all be used to provide access to remote numerical data. Data webs provide direct access to distributed rows and columns of data. Data grids enable large scale resource sharing of computational and data resources. Semantic webs provide knowledge based access to data using ontologies, RDF and agent based architectures.
Q. What is the DataSpace Transfer Protocol (DSTP)? How are DSTP and SABUL related?
A. DSTP is a new protocol for moving data over the web, similar to HTTP. It runs over TCP and streams data from a source to clients. DSTP is specifically designed for working with data: it knows about rows and columns, the metadata associated with data, sampling, etc. DSTP also runs over next generation protocols such as PSockets and SABUL, which are designed for connecting clients and servers over high performance wide area networks.
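To make the idea of a row, column, and sampling aware stream concrete, here is a minimal sketch in Python. It is not the DSTP wire format or any DataSpace API: the function `stream_rows`, its parameters, and the sample records are all hypothetical and only illustrate selecting columns and sampling rows while streaming.

```python
from typing import Dict, Iterable, Iterator, Sequence

Row = Dict[str, object]


def stream_rows(rows: Iterable[Row], columns: Sequence[str],
                every_nth: int = 1) -> Iterator[Row]:
    """Yield only the requested columns from every n-th row (hypothetical helper)."""
    if every_nth < 1:
        raise ValueError("every_nth must be >= 1")
    for index, row in enumerate(rows):
        # Systematic sampling: keep rows 0, n, 2n, ...
        if index % every_nth == 0:
            yield {name: row[name] for name in columns}


# Made-up records standing in for data held by a remote server.
source = [
    {"station": "A", "year": 2000, "temp_c": 11.2},
    {"station": "B", "year": 2000, "temp_c": 9.8},
    {"station": "C", "year": 2000, "temp_c": 14.1},
    {"station": "D", "year": 2000, "temp_c": 7.5},
]

sampled = list(stream_rows(source, columns=["station", "temp_c"], every_nth=2))
print(sampled)
```

Because the helper is a generator, a client could start working on the first rows before the last ones arrive, which is the point of streaming.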
Q. What are the advantages of using DSTP?
A. DSTP provides a simple way to publish data on the web and allow others to access it, analyze it, and mine it easily. Working with remote and distributed data will become much easier as DSTP or similar protocols become accepted, just as HTTP made working with remote documents easier. DSTP is unique in that it supports a simple way, based upon universal correlation keys, to merge distributed data and to overlay remote data on local data.
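The effect of a shared correlation key can be sketched as an ordinary key-based merge. The example below is only an illustration in Python, not DataSpace code: the function `overlay`, the key name `county_id`, and all values are made up, and how DSTP actually represents universal correlation keys is not shown here.

```python
from typing import Dict, List

Row = Dict[str, object]


def overlay(local_rows: List[Row], remote_rows: List[Row], key: str) -> List[Row]:
    """Attach remote columns to local rows that share the same key value."""
    remote_index = {row[key]: row for row in remote_rows}
    merged = []
    for row in local_rows:
        combined = dict(row)
        match = remote_index.get(row[key])
        if match is not None:
            for name, value in match.items():
                # Local values win if both sides define the same column.
                combined.setdefault(name, value)
        merged.append(combined)
    return merged


# Made-up data: local sales figures and a remote population table.
local = [
    {"county_id": "001", "sales": 120},
    {"county_id": "002", "sales": 45},
]
remote = [
    {"county_id": "001", "population": 5000},
]

result = overlay(local, remote, key="county_id")
print(result)
```

Rows without a match in the remote data are kept unchanged, so the local data set is never shortened by the overlay.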
Q. How do data webs support data mining?
A. A DSTP client can easily access other people's data and metadata. Once data is retrieved from one or more sites, data mining algorithms and exploratory data analysis can be applied as usual. DataSpace is designed to interoperate with proprietary and open source data mining tools. In particular, the open source statistical package R has been integrated into Version 1.1 of DataSpace and is currently being integrated into Version 2.0. DataSpace also works with predictive models in PMML, the XML markup language for statistical and data mining models.
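Because PMML is plain XML, any XML parser can read it. The following sketch uses Python's standard `xml.etree.ElementTree` to list the fields declared in a PMML `DataDictionary`. The embedded document is a hypothetical fragment written for this example (it has no model element, and the PMML 4.4 namespace is an assumption, not the version DataSpace uses), and `data_fields` is an illustrative helper, not part of DataSpace.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML fragment: only a header and a data dictionary.
PMML_TEXT = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="hypothetical example"/>
  <DataDictionary numberOfFields="2">
    <DataField name="age" optype="continuous" dataType="double"/>
    <DataField name="risk" optype="categorical" dataType="string"/>
  </DataDictionary>
</PMML>"""


def data_fields(pmml_text: str):
    """Return (name, dataType) pairs for every DataField in the document."""
    root = ET.fromstring(pmml_text)
    # ElementTree writes namespaced tags as "{uri}Tag"; reuse the root's prefix
    # so the lookup works whatever PMML version the document declares.
    namespace = root.tag[: root.tag.index("}") + 1] if root.tag.startswith("{") else ""
    return [(field.get("name"), field.get("dataType"))
            for field in root.iter(f"{namespace}DataField")]


fields = data_fields(PMML_TEXT)
print(fields)
```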
Q. What is the Tera Wide Data Mining (TWDM) Testbed?
A. The TWDM is a testbed for data webs over optical networks. The TWDM is a lambda grid connecting TWDM clusters in StarLight, UIC, Amsterdam and Ottawa.
Q. What standards are being used?
A. DataSpace is built on open protocols and standards. The metadata is in XML. Data mining is done using the Data Mining Group's (DMG) Predictive Model Markup Language (PMML). DataSpace interoperates with W3C's XML, RDF, and ontologies, the building blocks for the semantic web. DataSpace will soon interoperate with W3C's SOAP. Shortly, we will start the process with the IETF to standardize the data transport protocols DSTP and SABUL.
Q. How is DataSpace being commercialized?
A. The commercialization model is an open source one. The standards are all open. The DSTP clients and servers are open source. Project DataSpace is encouraging companies to use its open source clients and servers. A startup company is being planned that will provide support for projects using the DataSpace infrastructure.
Q. What scientific applications are running in DataSpace?
A. There are several data sets on DataSpace today, including earth science data from NCAR, protein data from the Protein Data Bank, and health care data from the WHO. DSTP clients can view, retrieve, visualize, and explore this data.
Q. What business applications are running in DataSpace?
A. DSTP clients can be used to build virtual data warehouses. Virtual data warehouses leave the data in place and use high speed networks and the DSTP protocol to create data warehouses on the fly, view by view. In addition, SABUL and PSockets are currently being tested to provide business continuity and disaster recovery services for data centers using high performance SONET and optical networks. This allows businesses to replicate crucial data in real time in distant locations and switch over to alternate sites within seconds.
Q. Who is supporting the project?
A. The project is supported by the National Science Foundation.
Q. Who is the Project Director?
A. Robert Grossman is the Project Director. He holds two positions. He is a part-time faculty member at the University of Illinois at Chicago (UIC) and the Director of the Laboratory for Advanced Computing at UIC. He is also the President of Open Data Partners.
Q. How do I find out more?
A. The project web site is www.datspaceweb.net. You can also contact the Project Director at grossman@uic.edu.