Merge

In the Merge phase, candidate datasets are harmonized semantically, computationally, and geographically to form one large, coherent dataset. The merge phase covers three steps: matching, mapping, and fusing. It can commence once a collection of datasets has been assembled. To facilitate the process, one must create a mediated schema to which all other datasets are matched. The ODINI project is creating a unique Data Integration Ontology to facilitate automated mapping of the datasets to the mediated schema. We are also developing a data integration software tool to enable oceanographic researchers to use this ontology to integrate large numbers of datasets. Stay tuned for updates.

Schema Matching

In the match step, researchers align the attributes/parameters in each dataset’s schema with the mediated schema/ontology. To do so, the researcher must often consult the description of each parameter, which is either listed with the dataset in the source repository or given in the methods section of the accompanying paper.
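As a rough illustration of the match step, the sketch below aligns source attribute names with mediated-schema concept names by plain string similarity. It is only a minimal sketch: the attribute names, concept names, and threshold are hypothetical, and it is not the matcher used by the ODINI tools. Matching on names alone is typically too weak in practice, which is why the parameter descriptions are consulted.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lower-case a name and treat underscores and hyphens as spaces."""
    return name.lower().replace("_", " ").replace("-", " ").strip()


def similarity(a: str, b: str) -> float:
    """Return a string similarity score in [0, 1] for two names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_schema(source_attributes, mediated_concepts, threshold=0.6):
    """Map each source attribute to its most similar mediated concept.

    Attributes whose best score falls below `threshold` (an assumed,
    illustrative cut-off) map to None and are left for manual review.
    """
    matches = {}
    for attribute in source_attributes:
        best = max(mediated_concepts, key=lambda c: similarity(attribute, c))
        score = similarity(attribute, best)
        matches[attribute] = best if score >= threshold else None
    return matches


if __name__ == "__main__":
    # Hypothetical attribute and concept names, for illustration only.
    source = ["Water_Temp", "salinity_psu", "xyz"]
    mediated = ["water temperature", "salinity", "depth"]
    for attribute, concept in match_schema(source, mediated).items():
        print(attribute, "->", concept)
```

An attribute mapped to None signals that no concept name was close enough, so the researcher would fall back on the data description in the repository or paper.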

Schema Mapping

In some cases, the semantics of the data in one source differ slightly from those of the mediated schema/ontology. In such cases, a mapping step is needed, in which conversion functions are generated according to the correspondences found in the matching step to facilitate data integration. More mundane, but equally crucial, is the need to map from the source format to that of the central repository used to collect the data from the different datasets.
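A conversion function of the kind described above can be sketched as a table that pairs each matched source field with a target field and a unit conversion. The field names and the choice of units here are assumptions made for illustration; they are not taken from the ODINI ontology.

```python
from typing import Callable, Dict, Tuple

# source field -> (mediated-schema field, conversion function)
Mapping = Dict[str, Tuple[str, Callable[[float], float]]]


def fahrenheit_to_celsius(value_f: float) -> float:
    return (value_f - 32.0) * 5.0 / 9.0


def feet_to_metres(value_ft: float) -> float:
    return value_ft * 0.3048


def identity(value: float) -> float:
    return value


# Hypothetical correspondences, as they might come out of the matching step.
EXAMPLE_MAPPING: Mapping = {
    "temp_f": ("water_temperature_c", fahrenheit_to_celsius),
    "depth_ft": ("depth_m", feet_to_metres),
    "salinity": ("salinity_psu", identity),
}


def apply_mapping(record: Dict[str, float], mapping: Mapping) -> Dict[str, float]:
    """Convert one source record into the mediated schema.

    Source fields without an entry in `mapping` are dropped, and mapped
    fields missing from the record are skipped.
    """
    return {
        target: convert(record[source])
        for source, (target, convert) in mapping.items()
        if source in record
    }


if __name__ == "__main__":
    print(apply_mapping({"temp_f": 50.0, "depth_ft": 100.0, "salinity": 35.0},
                        EXAMPLE_MAPPING))
```

Keeping the correspondences in a data structure rather than in code means the same `apply_mapping` routine can serve every source dataset, with one mapping table per source.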

Data Fusion

In this step, researchers need to mitigate problems that arise from differences in spatiotemporal resolution between the datasets. For example, one dataset may include measurements over a depth range of 50 m in increments of 1 m, while another uses increments of 10 cm. Decisions must be made on whether to aggregate upwards to the lower resolution, omit incompatible resolutions, interpolate the data to align the resolutions, or fill in missing data in some areas.
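Two of the options above, aggregating a fine profile into coarser depth bins and linearly interpolating a coarse profile onto a finer grid, can be sketched as follows. This is a minimal sketch under simple assumptions (depths in metres, sorted in ascending order, mean as the aggregate, no extrapolation); the function names are hypothetical and the right choice depends on the variable being fused.

```python
from bisect import bisect_left
from collections import defaultdict
from statistics import mean


def aggregate_to_bins(depths_m, values, bin_size_m=1.0):
    """Average values within depth bins of width `bin_size_m`.

    Returns a dict keyed by the start depth of each bin, in ascending order.
    """
    bins = defaultdict(list)
    for depth, value in zip(depths_m, values):
        bin_start = (depth // bin_size_m) * bin_size_m
        bins[bin_start].append(value)
    return {start: mean(vals) for start, vals in sorted(bins.items())}


def interpolate_linear(depths_m, values, target_depths_m):
    """Linearly interpolate a profile onto `target_depths_m`.

    Assumes `depths_m` is sorted ascending with no duplicates. Targets
    outside the measured range yield None rather than an extrapolation.
    """
    result = []
    for target in target_depths_m:
        if target < depths_m[0] or target > depths_m[-1]:
            result.append(None)
            continue
        upper = bisect_left(depths_m, target)
        if depths_m[upper] == target:
            result.append(values[upper])
            continue
        lower = upper - 1
        weight = (target - depths_m[lower]) / (depths_m[upper] - depths_m[lower])
        result.append(values[lower] + weight * (values[upper] - values[lower]))
    return result


if __name__ == "__main__":
    # Illustrative numbers only, not real measurements.
    print(aggregate_to_bins([0.0, 0.5, 1.0, 1.5], [10.0, 12.0, 14.0, 16.0]))
    print(interpolate_linear([0.0, 1.0, 2.0], [10.0, 12.0, 16.0], [0.5, 1.5, 3.0]))
```

Aggregation discards detail but introduces no invented values, whereas interpolation keeps the fine grid at the cost of adding estimated points, so the two are not interchangeable.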

Resources

Oceanographic Word Embedding

Word embeddings are used by natural language processing (NLP) algorithms to attach a meaning-based mathematical distance to the tokens of a domain. General-purpose word embeddings are trained on large amounts of text scraped from the World Wide Web or newspaper archives. Here, we present the first word embedding trained on a large collection of oceanographic research papers, thus capturing the meaning of oceanographic terminology. Find the corpus we trained on and the training code at the Oceanic Data Description Extraction Project.
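To make "meaning-based mathematical distance" concrete, the sketch below computes cosine similarity between term vectors and looks up the nearest neighbour of a term. The three-dimensional vectors are made-up toy values chosen for illustration; they are not taken from the oceanographic embedding, whose vectors would have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy vectors, invented for this example only.
TOY_VECTORS = {
    "salinity": [0.9, 0.1, 0.0],
    "conductivity": [0.8, 0.2, 0.1],
    "chlorophyll": [0.0, 0.2, 0.9],
}


def nearest_term(term, vectors):
    """Return the other term whose vector is most similar to `term`'s."""
    candidates = [other for other in vectors if other != term]
    return max(candidates,
               key=lambda other: cosine_similarity(vectors[term], vectors[other]))


if __name__ == "__main__":
    print(nearest_term("salinity", TOY_VECTORS))
```

With a domain-trained embedding, the same lookup could suggest candidate correspondences between differently named parameters.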

Data Description Extraction (DDE)

Extracting data descriptions from text is a foundational technology that enables data integration tools to utilize the free-form text accompanying datasets. A first benchmark task over real oceanographic datasets and their accompanying text, along with our latest code tackling it, is available at the Oceanic Data Description Extraction Project.

DDE Schema Matching System

A modified version of the Ontobuilder Research Environment (ORE) allows one to integrate data description entities extracted from the text accompanying a dataset and utilize them when matching a dataset’s schema to a domain ontology.

Ocean Data Integration Application

The application is in private alpha at tools.odini.net. It allows researchers to choose a set of concepts from our ontology and integrate a set of datasets that contain some or all of these concepts, so that they can be analyzed as one large dataset.

Ontology Evaluation Tool

Our ontology evaluation tool can evaluate any ontology with respect to the oceanographic domain. Read more about it in our latest published paper.