Data Integration
The Biomedical Informatics Research Network employs a 'mediator architecture' to link multiple databases together, each with a unique schema, into an accessible format or data federation. The mediator architecture is flexible, scalable, and powerful. It allows individual researchers to manage their own data in databases tailored to meet their specific needs. Those searching the BIRN system, however, will see the collection of federated databases as if it were a single database.
When a researcher queries the BIRN database, the query is issued against the "mediator", a virtual database that combines the individual data sources in meaningful ways. This combination is achieved using "Integrated View Definitions", which describe how the mediator represents the source databases. More precisely, each query is routed to one of the integrated views that the mediator exposes. The mediator then parses the user's query and submits one or more queries of its own to the relevant data sources. Use cases from the BIRN test beds are available.
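The query flow described above can be illustrated with a minimal sketch. It is not the BIRN mediator itself: the class names, the source names (`site_a`, `site_b`), and the field names are all hypothetical, and the "integrated view" is reduced to a simple mapping from one mediator-level field to each source's local field name.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """A hypothetical data source that stores rows under its own local field names."""
    name: str
    rows: list

    def query(self, field_name, value):
        # Each source answers queries only in terms of its own schema.
        return [r for r in self.rows if r.get(field_name) == value]


@dataclass
class Mediator:
    """A virtual database: it stores no data, only sources and a view mapping."""
    sources: dict = field(default_factory=dict)
    # Integrated view (assumed form): mediator field -> {source name: local field}
    view: dict = field(default_factory=dict)

    def query(self, field_name, value):
        results = []
        # Rewrite the user's query into one sub-query per relevant source.
        for source_name, local_field in self.view[field_name].items():
            for row in self.sources[source_name].query(local_field, value):
                results.append({"source": source_name, **row})
        return results


site_a = Source("site_a", [{"id": 1, "species": "mouse"}, {"id": 2, "species": "rat"}])
site_b = Source("site_b", [{"id": 7, "organism": "mouse"}])
mediator = Mediator(
    sources={"site_a": site_a, "site_b": site_b},
    view={"species": {"site_a": "species", "site_b": "organism"}},
)
mouse_rows = mediator.query("species", "mouse")
```

The user asks one question in the mediator's vocabulary, and the mediator handles the schema differences between sources.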
Data Integration FAQ
The following information details how the data mediation infrastructure provided by the BIRN Coordinating Center can make distributed and heterogeneous databases accessible:
What is knowledge-based mediation?
“Knowledge-based mediation” refers to the mediator’s ability to augment data collections by encapsulating domain knowledge that is not captured in the source databases. The databases being developed by BIRN partners contain diverse types of data, which cross multiple levels of resolution, different imaging modalities, different experimental paradigms, and different species. In order to issue queries against multiple data sources, the mediator has to have information about how elements in each database are related to one another and to other databases. Some of these relationships may be simple. For example, each database may have a table called “subject” with a field called “species.” Once equivalence is established in the mediator between such comparable data, this data across all sources is viewed as if it were in a single database.
The problem becomes more complex when considering the diverse types of data in the BIRN. A source like the CCDB may have data on protein localization in Purkinje neurons, while a source like UCLA may have data on proteins found in cerebellum. A neuroscientist knows that proteins found in Purkinje neurons are found in cerebellum, because the neuroscientist has additional knowledge that places Purkinje neurons in the cerebellum. By making this domain knowledge explicit, the mediator can use these relationships to enhance queries and link data across sources. Accomplishing this requires two things: (i) neuroscience domain expertise beyond what a naive user or “normal db admin” knows, and (ii) a computational representation of that domain expertise. In the mediator system being developed for the BIRN, additional knowledge in the form of ontologies, rules, constraints and abstractions can be incorporated into the system to provide a method of bridging sources. The knowledge sources being employed by the BIRN include ontologies like Neuronames and UMLS and spatial references in the form of brain atlases. The ability to link data from BIRN data sources to these external knowledge sources relieves the database creators from having to catalog data under exhaustive lists of keywords. For example, a Purkinje neuron in the CCDB may contain the following anatomical characterization: Purkinje neuron, vermis, cerebellum. However, via the mediator, queries can be issued for proteins found in “cerebellar cortex”, even though that term never appears in the CCDB, by using inferences derived from UMLS.
Even more importantly, this explicit knowledge makes mediator systems much more powerful by allowing “mediation engineers” to define virtual databases using formal representations of it.
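The “cerebellar cortex” example can be sketched as a lookup over part-of relations. The small `PART_OF` table below is a hand-written stand-in for what would, in the system described here, come from an ontology such as UMLS. The function names are hypothetical, and the sketch assumes the relations contain no cycles.

```python
# Hypothetical part-of relations standing in for an ontology such as UMLS.
PART_OF = {
    "Purkinje neuron": "cerebellar cortex",
    "cerebellar cortex": "cerebellum",
    "vermis": "cerebellum",
}


def ancestors(term):
    """Return every structure that contains `term`, following part-of links upward."""
    found = []
    while term in PART_OF:
        term = PART_OF[term]
        found.append(term)
    return found


def matches(record_terms, query_term):
    """A record matches if any of its terms is, or lies within, the queried structure."""
    return any(query_term == t or query_term in ancestors(t) for t in record_terms)


# Anatomical characterization of a CCDB record, as given in the text.
ccdb_record = ["Purkinje neuron", "vermis", "cerebellum"]
found_in_cortex = matches(ccdb_record, "cerebellar cortex")
found_in_hippocampus = matches(ccdb_record, "hippocampus")
```

The record never mentions “cerebellar cortex”, yet the query succeeds because the knowledge source places Purkinje neurons inside that structure.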
What are Integrated View Definitions?
In relational databases, views are used to customize the way data is presented to the user. Views can establish additional levels of security, derive information from the data in tables, and simplify complex queries. They are often tailored for a particular type of user's perspective. For example, a database may contain confidential patient information. The patient’s doctor may have a view of the database that shows all patient data. Another researcher at the same institution may have a restricted view that does not reveal the confidential information.
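As a small illustration of an ordinary relational view, the sketch below uses Python's built-in `sqlite3` module. The table, its columns, and the sample row are invented for the example. SQLite has no per-user permissions, so the sketch only shows the view hiding a column. In a multi-user database, access would additionally be granted on the view rather than on the underlying table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT, diagnosis TEXT, age INTEGER)"
)
conn.execute("INSERT INTO patient VALUES (1, 'Jane Doe', 'ataxia', 54)")

# Restricted view for researchers: the identifying `name` column is left out.
conn.execute("CREATE VIEW patient_research AS SELECT id, diagnosis, age FROM patient")

cur = conn.execute("SELECT * FROM patient_research")
research_columns = [d[0] for d in cur.description]
research_rows = cur.fetchall()
```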
“Integrated View Definitions” (IVDs) are views written across tables and data contained in multiple data sources. For the BIRN, IVDs can include any data source registered with the mediator. By using IVDs, each user will be able to create a “virtual database” from a variety of distributed data sources. Applications may also take advantage of IVDs. For example, a “Smart Atlas” for the mouse brain will issue specialized queries to the mediator. An IVD for the “Smart Atlas” may encapsulate only the information that is relevant to this application.
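The idea of a view that spans several sources can be approximated in plain SQL. The sketch below is only an analogy and not how BIRN IVDs are written: it attaches two separate in-memory SQLite databases as stand-ins for two sites and defines a temporary view across them. The schema names, table names, and placeholder rows are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two independent databases stand in for two registered sources.
conn.execute("ATTACH DATABASE ':memory:' AS ccdb")
conn.execute("ATTACH DATABASE ':memory:' AS ucla")
conn.execute("CREATE TABLE ccdb.localization (protein TEXT, cell_type TEXT)")
conn.execute("CREATE TABLE ucla.expression (protein_name TEXT, region TEXT)")
conn.execute("INSERT INTO ccdb.localization VALUES ('protein_A', 'Purkinje neuron')")
conn.execute("INSERT INTO ucla.expression VALUES ('protein_B', 'cerebellum')")

# A TEMP view may reference tables in any attached database; it presents
# both sources under one common set of column names.
conn.execute(
    """
    CREATE TEMP VIEW protein_location AS
        SELECT 'ccdb' AS source, protein, cell_type AS location
        FROM ccdb.localization
        UNION ALL
        SELECT 'ucla' AS source, protein_name AS protein, region AS location
        FROM ucla.expression
    """
)
rows = conn.execute(
    "SELECT source, protein, location FROM protein_location ORDER BY source"
).fetchall()
```

A user querying `protein_location` sees one table, although the rows originate in two differently structured sources.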
In the early phases, IVDs will need to be written by someone who is knowledgeable in database queries. The BIRN-CC will provide some simple IVDs for general types of queries. The BIRN-CC will also define a set of procedures for advanced database users to create their own. As the project progresses the level of expertise required will diminish, and novice database users should be able to create their own IVDs.
How does the mediator talk to a source database?
Each database contained in the BIRN will have a wrapper layer that specifies how the mediator can talk to the database and what the database exports. A list of these capabilities for each data source will be maintained in the BIRN Federation Registry. To simplify the task of building the mediator for the BIRN, participating sites will be limited initially to the use of Oracle databases. Eventually, the mediator will be able to talk to other database systems, web sites, and flat files.
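A wrapper layer of this kind might look roughly like the following sketch. The `Wrapper` class, its methods, and the sample tables are hypothetical. The sketch only illustrates the two roles named above: declaring what a source exports, and answering requests only for exported data.

```python
class Wrapper:
    """Hypothetical wrapper: declares what a source exports and serves only that."""

    def __init__(self, name, tables, exported):
        self.name = name
        self._tables = tables      # all data held by the source
        self.exported = exported   # {table: [columns]} visible to the mediator

    def capabilities(self):
        """The kind of entry a federation registry might keep for this source."""
        return {"source": self.name, "exports": self.exported}

    def fetch(self, table):
        if table not in self.exported:
            raise PermissionError(f"{self.name} does not export {table!r}")
        columns = self.exported[table]
        # Project each row onto the exported columns only.
        return [{c: row[c] for c in columns} for row in self._tables[table]]


wrapper = Wrapper(
    name="site_a",
    tables={
        "subject": [{"id": 1, "species": "mouse", "contact": "lab-internal"}],
        "billing": [{"id": 1, "amount": 100}],
    },
    exported={"subject": ["id", "species"]},
)
exported_subjects = wrapper.fetch("subject")
```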
What should be included in a source database?
The mediator architecture was chosen for the BIRN because it allows maximum flexibility for each site to tailor their database to suit their needs. This ability will become increasingly important as new sites are added to the BIRN network. The knowledge-based system should be able to handle integration of new databases, even if they are quite different, as long as they contain conceptual links to existing databases or knowledge bases and adhere to the following requirements:
- Each data set needs to be stored with a complete description that includes information about the project, experiment, subject, and imaging technique. These descriptions should be complete enough for a user to interpret an image and know how it was obtained. BIRN project groups will likely define "BIRN Document Type Definitions (DTDs)", or XML schemas, for this information.
- Each dataset should be linked to one or more ontologies. The first ontology to be incorporated into the mediator will be UMLS, with the Neuronames extension linking anatomical concepts in the database to the relevant ontology IDs in UMLS. Additional knowledge bases will be added as the project progresses, and participating groups will be able to define their own ontologies and incorporate them into the BIRN. Plans for future development include the incorporation of a neurohomology knowledge source, implementation of taxonomies for behavioral tests and experimental protocols, and inclusion of a process map for disease pathology.
Each concept may have one ontology ID, or several if necessary. The use of an ontology ID will address the problem of standardization across databases and make the search less dependent on the exact terms entered. It will also allow for searching across related, but dissimilar, information. For example, querying multiple databases for all human subjects who are right handed is complicated when there are multiple measures used to determine handedness (e.g., all people who score > 3 out of a range of -10 to 10 on the Edinburgh Inventory are right handed). If the mediator knows that the Edinburgh Handedness Inventory, the Annett Handedness Inventory, and a simple subject response (e.g., "I am right handed") are comparable, then the mediator can find right handed subjects regardless of the exact assessment measure.
- Each dataset should reference a common spatial framework, such as a standard atlas coordinate system. The mediator will be able to handle spatial queries on two dimensional and three dimensional brain maps, but these queries will only be meaningful if the data sets can be mapped to a common coordinate system. The Smart Atlas tool, currently being developed by the BIRN-CC Data Mediation Team, can handle spatial registration for brain subregions to vector-based brain atlases, like the Franklin and Paxinos mouse brain atlas. It will also allow researchers to assign an ontology ID to the database record.
- Databases should not be restricted to purely descriptive information about the image. Instead, researchers should be able to place derived data, such as measurements, surface reconstruction and vector maps, directly in the database. The richer the model of the data in the database, the richer the queries that can be asked of the data. The BIRN-CC Data Mediation Team is developing database models and query capabilities for these types of derived data. One example is a database cartridge for surface geometries. By storing surface coordinates directly in the database, researchers will be able to issue queries which compute directly on these polygonal meshes and perform morphometrics like shape comparison.
It is important to remember that the mediator architecture was selected with an eye towards scalability. A simple description of the data may suffice for a few data sets, but as the number of data sets in the BIRN grows, simple descriptions will become inadequate. Researchers will want to select data based on more than simple attributes like species. A richer data model will also allow data mining, where researchers can search for relationships among data sets.
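The handedness example from the requirements above can be sketched as a set of per-measure normalizers that map each assessment onto one shared concept, "right handed". The Edinburgh threshold (score > 3 on a -10 to 10 scale) is the one quoted in the text. The record layout, the function names, and the accepted self-report phrasings are assumptions. Other inventories are left out because their scoring rules are not given here.

```python
def edinburgh_is_right(score):
    # Threshold quoted in the text: score > 3 on a scale of -10 to 10.
    return score > 3


def self_report_is_right(answer):
    # Assumed set of phrasings for a simple subject response.
    return answer.strip().lower() in {"right", "right handed", "right-handed"}


# Each measure maps onto the same concept through its own rule.
NORMALIZERS = {
    "edinburgh": edinburgh_is_right,
    "self_report": self_report_is_right,
}


def right_handed_subjects(records):
    """Return subject IDs judged right handed, whatever measure each source used."""
    return [r["subject"] for r in records if NORMALIZERS[r["measure"]](r["value"])]


records = [
    {"subject": "s1", "measure": "edinburgh", "value": 8},
    {"subject": "s2", "measure": "edinburgh", "value": -5},
    {"subject": "s3", "measure": "self_report", "value": "Right handed"},
]
right_handed = right_handed_subjects(records)
```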
How does a source database register with the mediator?
The BIRN will maintain a "Federation Registry" which contains a list of sources registered with the mediator. These sources will include both knowledge bases and databases present at the various sites. In some cases the registration process may be automated; in others it may require interaction between the BIRN-CC and the participating site. Automatic registration should be possible when the wrapper layer is simple, requiring only translation of the relational schema used by the source database into F-logic, the language used by the mediator. Interactive registration will be necessary when decisions must be made about which data will be available to the mediator. These decisions will affect the design of the view to be exported and will require a complex wrapper layer. For example, a site may need to restrict access to some data on the basis of privacy concerns. A site may also want to include additional information in the wrapper layer to improve the efficiency of query processing. If the wrapper layer is complex, registration may require input from a database administrator at the site, a mediation engineer from the BIRN-CC, and a biologist who understands the data. A complex wrapper layer may contain:
- Restructured tables: When creating a database, it is sometimes necessary to create several tables to handle related pieces of information (normalization). At the mediator level, the way a data source is normalized may not make sense. In these cases, it may be useful to restructure the source databases, reassembling the conceptual objects that often get lost as a result of relational normalization.
- Abstractions of the schema: A database may have a large number of tables, but these tables can often be collapsed into several logical classes which represent an abstraction of the entire schema or an abstraction of the objects contained within the database. For example, in the CCDB there are currently over 40 tables, but these tables can be organized into a few different kinds of information (e.g., experiment, subject, microscopy, tissue processing, reconstruction, and segmented objects). In this case, it may make more sense to present an abstraction of the schema to the mediator, rather than the entire series of tables. Abstraction is similar to the process of restructuring described above, but it allows for a more object-oriented view of the database schema, one in which only the key information is incorporated into the mediator.
- Constraints and rules: A participating site may choose to include various rules or constraints in the wrapper layer, such as integrity constraints, semantic constraints, or value constraints. For example, in the CCDB all reconstructions must have a reconstruction ID and a microscopy product ID (integrity constraints). Furthermore, the CCDB only contains data derived from light and electron microscopy (semantic constraints). It does not contain data on embryonic animals, i.e., age is greater than 0 days (value constraints). The same kinds of constraints may be present in databases containing human imaging data. All clinical assessment measures may be collected on a specific date for a particular subject (integrity constraint). Each MR scan may be associated with particular data and a particular protocol (semantic constraint). For functional imaging studies, the repetition time (TR) for the functional sequence may always be greater than 0 and less than 10 seconds (value constraint). These constraints, while not strictly necessary, will help restrict the number of data sources that have to be searched to answer a query.
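The way declared constraints can narrow the set of sources to search might be sketched as follows. The constraint values mirror the examples above (light and electron microscopy for the CCDB, a repetition time between 0 and 10 seconds for functional imaging). The source name `fmri_site`, the dictionary layout, and the function are hypothetical.

```python
# Hypothetical constraints that sources might declare in their wrapper layers.
SOURCE_CONSTRAINTS = {
    "ccdb": {"modalities": {"light microscopy", "electron microscopy"}},
    "fmri_site": {"modalities": {"fMRI"}, "tr_seconds": (0.0, 10.0)},
}


def candidate_sources(modality, tr_seconds=None):
    """Return only the sources whose declared constraints admit the query."""
    selected = []
    for name, constraints in SOURCE_CONSTRAINTS.items():
        # Semantic constraint: skip sources that never hold this modality.
        if modality not in constraints["modalities"]:
            continue
        # Value constraint: skip sources whose TR range excludes the request.
        if tr_seconds is not None and "tr_seconds" in constraints:
            low, high = constraints["tr_seconds"]
            if not (low < tr_seconds < high):
                continue
        selected.append(name)
    return selected


microscopy_sources = candidate_sources("electron microscopy")
fmri_sources = candidate_sources("fMRI", tr_seconds=2.0)
impossible = candidate_sources("fMRI", tr_seconds=12.0)
```

A query that no source's constraints admit is answered without contacting any source at all.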

