Overview of the Mobius Project Architecture
      [GGF11 Proceedings contain a more detailed overview (pdf)]

Motivations

The smooth interchange of data across and between scientific and research institutions can accelerate findings, improve their quality, and help to realize the ideals of institutional research by enhancing collaboration. More than ever before, research is being enabled by technologies that couple resources from diverse data archives spread across disparate institutions. Web service architectures and the "Grid" have emerged to help fulfill this ideal of drawing together data resources. But while the Grid and Web services have greatly increased the sharing of information and computing power, such sharing is often hampered by the sheer volume of available data and by the fact that the data researchers want to compare or analyze in concert are stored using heterogeneous data types and housed in widely varying ways.

A biomedical researcher, for example, may develop a hypothesis and accumulate several types of patient and laboratory data. In the course of her work, she will need to create databases to maintain this data. She may also need to analyze data stored earlier, in various manners and in multiple archives, or need to mine research accumulated by other researchers in her field or related fields. Her research may involve the integration of proteomic, molecular, genomic, image, or a wide variety of other sorts of data in a multi-institutional context.

The researcher ought to be able to use a well-designed and straightforward system to create a data warehouse spanning multiple, distributed archives. She should also be able to use this data warehouse to run queries to find new connections and test hypotheses. Mobius facilitates this. It also addresses a number of other common contingencies. Any two data archives may define related data differently: the data may be semantically the same but have a widely divergent structure, which is a serious impediment to work. Likewise, the researcher's analysis of data may lead to collections of new datasets as well as new arrangements and types of data. How are these new data stored so that they can be easily used again, possibly by other researchers?

High-Level Project Overview

Mobius is a middleware framework designed for efficient management of data and metadata in such dynamic, heterogeneous, and distributed research environments. It provides a set of services and protocols for the distributed creation, versioning, and management of data instances and data descriptions, for the on-demand creation of databases, for the federation of existing databases, and for the querying of data in a distributed environment. Its design is motivated by the data sharing needs of biomedical research and by the methods, tools, and potentials of the Grid (in particular the activities of the Data Access and Integration Services (DAIS) group at the Global Grid Forum (GGF)). Mobius uses XML schemas to create and represent data definitions as metadata (data models) and uses XML documents to represent and exchange data instances. It also supports translation from one data model to another.

Mobius consists of three core services:

  1. GME (Global Model Exchange), a DNS-like, global, data definition registry and exchange service
  2. Mako, a data instance management service to create new databases, integrate existing databases, validate models against the global model, and run federated queries
  3. DTS, a data translation service to translate between existing data models that have similar semantic content but variable structures

Hypothetical Use Case

Using Mobius, a biomedical researcher first designs XML schemas that describe the data types she wants to maintain. She can search the GME for existing schemas that meet her needs, version an existing schema by adding parts to it or deleting from it, create an entirely new definition, or assemble one from parts by referencing existing schemas. She then registers the schema with the GME to be shared or discovered by other researchers. Mako services will use the schema to generate databases and validate data against her data model.

The researcher instantiates one or more Mako servers to maintain databases that conform to the schema, or registers the schema on an already running Mako server. The servers create databases in an ad hoc fashion according to the schemas and allow new data to be entered, queried, and maintained. When a new data set is submitted to a Mako server (as an XML document), the server "ingests" the document, indexes it, and stores the data it contains in the databases. Datasets can be distributed across a collection of Mako servers and virtually represented as a federated whole (VMako) as needed for a query or data retrieval. At the level of implementation, queries against Mako databases are expressed in XPath, the XML path language, but the user does not need to know XPath to perform queries.

Finally, if need be, the researcher can use the DTS to help migrate data between existing schemas that differ in structure but have similar semantic content.

GME (Global Model Exchange)

In order for services on the Grid to communicate with each other, their data must be described in a format that is understood by all of the components involved. Thus a data management system for the Grid must provide a method for defining metadata and data via a universal, consistent modeling pattern, distinct instances of which we call data models. The "data model" is the specification of the structure, format, syntax, and occurrences of the data instances it represents.

The models must be globally available to every authorized and authenticated service for the system to work. Data models from different areas and institutions may define a similar entity (a Patient entity, for example) without being equivalent. To handle this, entities within models are assigned to a namespace, which effectively makes a Patient entity from one institution distinct from a Patient entity at another. A further issue is the persistence and availability of models: data and the data models registered with the GME ought to persist and be obtainable from any node within the GME.

To these ends, the GME is responsible for storing and linking data models as defined within namespaces in a distributed environment. It enables other services to publish, retrieve, discover, remove, and version metadata definitions (data models). GME services are composed in a DNS-like architecture, in which parent-child namespaces organize the connection of nodes into a hierarchical tree structure (see Fig 1, below). To provide versioning and the potential for exchange between model versions, the GME manages model versions and model-to-model dependencies. Modified and republished models are versioned automatically, allowing multiple model versions to be used concurrently. Furthermore, if a suitable mapping between models can be established, the models can be interchanged seamlessly.

the DNS-like structure of the GME component
Fig 1: Clients publish and receive data models with the GME
(here referred to as GSE, Global Schema Exchange)
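
The DNS-like namespace hierarchy and automatic versioning described above can be illustrated with a minimal in-memory sketch. It is not the Mobius API: the class `GMENode`, its methods, and the `example.org` namespaces are all hypothetical names chosen for illustration, and a real GME would distribute its nodes across the network.

```python
from dataclasses import dataclass, field


@dataclass
class GMENode:
    """One node in a hypothetical GME tree, authoritative for one namespace."""
    namespace: str
    children: dict = field(default_factory=dict)  # child namespace -> GMENode
    models: dict = field(default_factory=dict)    # model name -> list of versions

    def add_child(self, namespace):
        # As with DNS subdomains, a child namespace must extend its parent's.
        if not namespace.endswith("." + self.namespace):
            raise ValueError(f"{namespace} is not under {self.namespace}")
        child = GMENode(namespace)
        self.children[namespace] = child
        return child

    def authority(self, namespace):
        """Walk down the tree to the node responsible for `namespace`."""
        if namespace == self.namespace:
            return self
        for child_ns, child in self.children.items():
            if namespace == child_ns or namespace.endswith("." + child_ns):
                return child.authority(namespace)
        raise KeyError(f"no GME node is authoritative for {namespace!r}")

    def publish(self, namespace, name, schema):
        """Store a model; republishing an existing name adds a new version."""
        versions = self.authority(namespace).models.setdefault(name, [])
        versions.append(schema)
        return len(versions)  # 1-based version number just assigned

    def retrieve(self, namespace, name, version=None):
        """Return a given version (1-based) or, by default, the latest one."""
        versions = self.authority(namespace).models[name]
        return versions[-1] if version is None else versions[version - 1]


# Two institutions publish their own, distinct Patient models.
gme = GMENode("example.org")
gme.add_child("a.example.org")
gme.add_child("b.example.org")
gme.publish("a.example.org", "Patient", "<schema>a, first draft</schema>")
gme.publish("a.example.org", "Patient", "<schema>a, revised</schema>")
gme.publish("b.example.org", "Patient", "<schema>b</schema>")
```

Because each model is stored under its namespace, the two Patient models never collide, and the earlier version of the first one remains retrievable alongside the revised one.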

Future extensions to GME may not only store the models but also adopt a semantic model definition language, such as RDF, and provide higher-level querying for those models. One could imagine being able to pose the question, "Are there any models published anywhere in the Grid that have something to do with cancer research?" With queryable semantic data models, such a question could return accurate and precise responses from across the Grid.

Mako (data instance management)

Mako provides services for the Grid for storing and querying data and metadata. It allows data to be stored across heterogeneous and loosely coupled machines. It also allows trusted users to update, query, and delete the data they store. Databases can be instantiated in an ad hoc fashion and can be of virtually any size and scale, growing dynamically according to need.

Mako exposes data resources as XML services through a set of well-defined interfaces based on the Mako protocol. Individual data resources can be of virtually any sort--a relational database, an XML database, a file system, binary file, proprietary machine interface, etc. The data resources referenced in a Mako are exposed using XPath. This provides a standardized way of interacting with resources, greatly simplifying applications' ability to communicate with heterogeneous data stores. As mentioned earlier, the interfaces themselves are motivated by the work of the DAIS working group of the GGF.

The Mako architecture (see Fig. 2, below) contains a set of listeners that allow clients to communicate with a Mako instance using any communications protocol, such as TCP, SSL, or the Globus Security Infrastructure (GSI). Packets are then passed to a packet router, which determines if the packet has a handler in the Mako and, if so, shuttles the packet to the handler for processing and sends a response to the client.

connections and workflow of the Mako architecture
Fig 2: Architecture of communication with a Mako instance
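
The routing step described above can be sketched as a small dispatch table. This is a minimal illustration, not the Mako implementation: the `PacketRouter` class, the dictionary packet layout with a `"type"` key, and the response format are assumptions made for the example.

```python
class PacketRouter:
    """Hypothetical router: maps a packet's request type to a handler."""

    def __init__(self):
        self._handlers = {}  # request type -> callable taking the packet

    def register(self, request_type, handler):
        """Install a handler for one request type."""
        self._handlers[request_type] = handler

    def request_types(self):
        """List the request types this instance can handle."""
        return sorted(self._handlers)

    def route(self, packet):
        """Pass the packet to its handler, or report that none is installed."""
        request_type = packet.get("type")
        handler = self._handlers.get(request_type)
        if handler is None:
            return {"status": "error",
                    "reason": f"no handler for {request_type!r}"}
        return {"status": "ok", "result": handler(packet)}


router = PacketRouter()
router.register("list_collections", lambda packet: ["patients", "images"])
response = router.route({"type": "list_collections"})
```

Keeping handlers in a table means a listener for any transport can hand its packets to the same router, and new request types can be added without touching the listeners.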

Protocol services

Aspects of the Mako protocol are responsible for:

  • providing "service data" on when the Mako was started, its underlying data resources, and a list of the request types it handles
  • creating, removing, and listing collections (roughly equivalent to "tables" in a relational database)
  • listing, removing, and requesting schemas
  • submitting, updating, retrieving, and removing XML documents
  • performing XPath queries against collections
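
As a rough illustration of the collection, document, and query operations listed above, the following sketch keeps XML documents in memory and queries them with the XPath subset supported by Python's standard `xml.etree.ElementTree` module. The `MakoStore` class and its method names are hypothetical and do not reflect the actual Mako protocol messages.

```python
import xml.etree.ElementTree as ET


class MakoStore:
    """Hypothetical in-memory stand-in for one Mako's collections."""

    def __init__(self):
        self._collections = {}  # collection name -> {document id: root element}
        self._next_id = 1

    def create_collection(self, name):
        self._collections.setdefault(name, {})

    def list_collections(self):
        return sorted(self._collections)

    def remove_collection(self, name):
        del self._collections[name]

    def submit(self, collection, xml_text):
        """Parse ("ingest") a document and store it; returns its new id."""
        root = ET.fromstring(xml_text)
        doc_id = self._next_id
        self._next_id += 1
        self._collections[collection][doc_id] = root
        return doc_id

    def retrieve(self, collection, doc_id):
        root = self._collections[collection][doc_id]
        return ET.tostring(root, encoding="unicode")

    def remove(self, collection, doc_id):
        del self._collections[collection][doc_id]

    def query(self, collection, xpath):
        """Run an XPath expression against every document in a collection."""
        hits = []
        for doc_id, root in self._collections[collection].items():
            for element in root.findall(xpath):
                hits.append((doc_id, element))
        return hits


store = MakoStore()
store.create_collection("patients")
store.submit("patients",
             '<patient id="p1"><sample type="tissue"/><sample type="blood"/></patient>')
tissue_samples = store.query("patients", ".//sample[@type='tissue']")
```

Each hit carries the id of the document it came from, so a client can follow up with a retrieve or update request for that document.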

Global Addressing and Virtual Inclusion

The federation of data across multiple Makos is facilitated by a global addressing scheme in which elements in a collection are uniquely identified and addressed using a simple three-tuple ID consisting of a Mako URI, the name of a collection, and an ID for the element. One example of how data may be federated across multiple Makos is the concept of "virtual inclusion." A virtual inclusion is a reference within an XML document to another XML document. Virtual inclusion thereby allows XML documents to be created that contain references to other existing documents or elements, both local and remote. Through virtual inclusion, an XML document can be distributed and stored across multiple Mako servers, with subsections of the document stored remotely and integrated through references. Specifically, this makes it possible for very large data documents to be partitioned across multiple resources (in a cluster, for example), while still appearing as, and having the semantics of, a single document.
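
A minimal sketch of the three-tuple address and of resolving virtual inclusions follows. Everything specific in it is an assumption for illustration: the `GlobalId` type, the `<include>` element with `mako`, `collection`, and `id` attributes, the `mako://` URI scheme, and the dictionary that stands in for a network of Mako servers.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class GlobalId(NamedTuple):
    """Three-tuple address: Mako URI, collection name, element id."""
    mako_uri: str
    collection: str
    element_id: str


def resolve(gid, network, _seen=frozenset()):
    """Fetch the element at `gid` and expand any inclusions it contains.

    `network` maps GlobalId -> XML text and stands in for remote Makos.
    """
    if gid in _seen:
        raise ValueError(f"circular inclusion at {gid}")
    root = ET.fromstring(network[gid])
    _expand(root, network, _seen | {gid})
    return root


def _expand(element, network, seen):
    # Replace each hypothetical <include .../> child with the element it names.
    for index, child in enumerate(list(element)):
        if child.tag == "include":
            target = GlobalId(child.get("mako"), child.get("collection"),
                              child.get("id"))
            resolved = resolve(target, network, seen)
            resolved.tail = child.tail
            element[index] = resolved
        else:
            _expand(child, network, seen)


network = {
    GlobalId("mako://a", "studies", "s1"):
        '<study><title>Trial</title>'
        '<include mako="mako://b" collection="images" id="i7"/></study>',
    GlobalId("mako://b", "images", "i7"): "<image>pixels</image>",
}
whole = resolve(GlobalId("mako://a", "studies", "s1"), network)
```

The caller sees a single `<study>` document even though its `<image>` part is stored under a different Mako URI, which is the effect virtual inclusion is meant to provide.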

VMako

Mako's component architecture allows alternate protocol handlers to be installed in a Mako server so that a single server can act as a front-end to multiple remote Mako instances. The single Virtual Mako (VMako) instance (see Fig. 3, below) maps a number of virtual collections onto a user-defined set of remote Mako collections. This reduces the complexity for the client application, presenting a single virtualized interface to a number of distinct, federated Makos. Among other applications, the VMako could be used to decluster large data. By utilizing virtual inclusion, a submitted XML document could be broken down into separate sub-documents and distributed across remote Makos. But the primary purpose of a Virtual Mako is to enable distributed query execution. In a Virtual Mako, requests are broken down into sub-queries and sent to the appropriate remote Makos. Responses are then aggregated at the VMako and returned to the client.

virtualization of multiple Makos into a Virtual Mako instance
Fig 3: A VMako interface to multiple remote Mako collections
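
The fan-out and aggregation pattern can be sketched as follows. This is a simplified illustration under stated assumptions: `RemoteMako` and `VMako` are hypothetical classes, remote servers are simulated in memory, and the same XPath expression is forwarded unchanged to every mapped collection rather than being decomposed into distinct sub-queries.

```python
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


class RemoteMako:
    """In-memory stand-in for a remote Mako holding named collections."""

    def __init__(self, uri, collections):
        self.uri = uri
        self.collections = {
            name: [ET.fromstring(text) for text in docs]
            for name, docs in collections.items()
        }

    def query(self, collection, xpath):
        docs = self.collections.get(collection, [])
        return [hit for doc in docs for hit in doc.findall(xpath)]


class VMako:
    """Hypothetical front-end mapping virtual collections onto remote ones."""

    def __init__(self):
        self._mapping = {}  # virtual name -> [(RemoteMako, remote collection)]

    def map_collection(self, virtual_name, remote, remote_collection):
        self._mapping.setdefault(virtual_name, []).append(
            (remote, remote_collection))

    def query(self, virtual_name, xpath):
        """Send the query to every mapped collection and merge the results."""
        targets = self._mapping[virtual_name]
        with ThreadPoolExecutor(max_workers=len(targets)) as pool:
            partials = pool.map(lambda t: t[0].query(t[1], xpath), targets)
            return [hit for partial in partials for hit in partial]


site_a = RemoteMako("mako://a", {
    "patients": ['<patient id="p1"><sample type="tissue"/></patient>']})
site_b = RemoteMako("mako://b", {
    "cases": ['<patient id="p2"><sample type="tissue"/></patient>']})
vmako = VMako()
vmako.map_collection("all_patients", site_a, "patients")
vmako.map_collection("all_patients", site_b, "cases")
tissue = vmako.query("all_patients", ".//sample[@type='tissue']")
```

The client queries one virtual collection, `all_patients`, and does not need to know that the results come from differently named collections on two servers.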

DTS (Data Translation Service)

DTS allows physically separate institutions using semantically similar data to translate between the varying structures of their data, thereby making it possible for the institutions to leverage each other's data resources. DTS maintains a registry of remote mapping services, which provide pairwise translation between elements from different namespaces (Fig 4, below).

DTS is motivated by two needs. First is the need to maintain the link between stored data that is typed against a particular version of a schema and that same data as the schema evolves into new versions. Second is the need to translate data types that are semantically similar but syntactically different. This ability to translate across common data types becomes more and more pressing as the number of distributed data sources and data types grows. DTS services will also provide mappings of idiosyncratic individual researcher or institutional data types to and from community-wide established standard data definitions. It is essential that Mobius "play well" within the Grid and research communities.

client discovering and using the DTS
Fig 4: Client finds a mapping service in the DTS registry
and performs translation between data types

The basic Data Translation Service Registry (DTSR) will contain information that describes the individual DTSs that are running on the Grid. Users can build and register DTS services as needed for their own use and to be used by other researchers. Ultimately, DTS will require an ontology registry to store semantic information and describe the relationships of registered namespace elements. It is expected that for many data pairs there will be numerous possible translations. This will require that there be means for users to specify their preferences for meaningful type translations.
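
A minimal sketch of such a registry is shown below, assuming that data types are identified by a namespace-qualified name and that records are plain dictionaries. The `DTSRegistry` class, its methods, the type names, and the `prefer` argument are hypothetical and serve only to illustrate pairwise mappings with user-selectable alternatives.

```python
class DTSRegistry:
    """Hypothetical registry of pairwise translation services."""

    def __init__(self):
        # (source type, target type) -> list of (label, translate function)
        self._services = {}

    def register(self, source, target, label, translate):
        """Register one translation service for a source/target pair."""
        self._services.setdefault((source, target), []).append(
            (label, translate))

    def find(self, source, target):
        """List the labels of all services registered for this pair."""
        return [label for label, _ in self._services.get((source, target), [])]

    def translate(self, source, target, record, prefer=None):
        """Translate with the preferred service, or the first one registered."""
        for label, function in self._services.get((source, target), []):
            if prefer is None or label == prefer:
                return function(record)
        raise LookupError(f"no translation from {source} to {target}"
                          + (f" labelled {prefer!r}" if prefer else ""))


SOURCE = "a.example.org/Patient"
TARGET = "b.example.org/Patient"

registry = DTSRegistry()
registry.register(SOURCE, TARGET, "rename-fields",
                  lambda rec: {"name": rec["full_name"]})
registry.register(SOURCE, TARGET, "rename-and-uppercase",
                  lambda rec: {"name": rec["full_name"].upper()})
translated = registry.translate(SOURCE, TARGET, {"full_name": "Ada"},
                                prefer="rename-and-uppercase")
```

Keeping several labelled services per pair reflects the expectation that many data pairs will have more than one possible translation, with the user choosing among them.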