Schema Matching

[Figure 1: Schema mapping]

Data Integration

The data integration process takes two datasets and combines them into a unified dataset by performing five composable tasks.
Schema matching (1) aligns the schemas of the two datasets.
Schema mapping (2) performs any transformations required by the differing semantics of the aligned fields.
Entity resolution (3) identifies duplicate records.
Entity consolidation (4) merges them.
Data cleansing (5) can be applied at any point to detect and correct errors.
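
As a rough illustration of how the five tasks compose, the following sketch runs them over two tiny hand-made datasets. Everything in it is an assumption made for the example: the field names, the inch-to-centimeter conversion, the hand-written correspondences, and the rule that two records with the same cleansed name are duplicates.

```python
# Hypothetical inputs: two small datasets with different schemas.
left = [{"name": "Ann Lee", "height_cm": 170}]
right = [
    {"full_name": "ann lee", "height_in": 66.9},
    {"full_name": "Bob Ray", "height_in": 70.0},
]

# (1) Schema matching: attribute correspondences (written by hand here).
correspondences = {"full_name": "name", "height_in": "height_cm"}

# (2) Schema mapping: transformations required by differing semantics.
transforms = {"height_in": lambda inches: round(inches * 2.54)}

def map_record(rec):
    """Rename fields and convert values into the left-hand schema."""
    return {
        correspondences[k]: transforms.get(k, lambda v: v)(v)
        for k, v in rec.items()
    }

# (5) Data cleansing: normalize whitespace and capitalization of names.
def cleanse(rec):
    return {**rec, "name": rec["name"].strip().title()}

unified = [cleanse(r) for r in left] + [cleanse(map_record(r)) for r in right]

# (3) Entity resolution: assume equal cleansed names denote the same entity.
# (4) Entity consolidation: merge the fields of duplicate records.
consolidated = {}
for rec in unified:
    consolidated.setdefault(rec["name"], {}).update(rec)

result = list(consolidated.values())
print(result)
```

Real systems replace each hand-written step with a dedicated component, but the order of operations and the data passed between them follow the same shape.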

Schema Matching System

[Figure 2: Schema matching]

A schema matching process receives two or more datasets and outputs a set of correspondences between the attributes of the datasets' schemas. Schema matching and the related field of ontology alignment have been studied extensively, with research both on building matching systems and on adapting and combining different matching methods to the task at hand.

First-line matchers, also known as matching algorithms, similarity measures, or base learners, use information contained in the schemas being matched, or in the associated data instances when these are available, to propose correspondences between schema attributes.
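
A minimal sketch of one possible first-line matcher follows. It uses only attribute names and scores each pair with the standard library's `difflib.SequenceMatcher`. The attribute names and the choice of string similarity are assumptions for illustration, not a method prescribed by the text.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Return a similarity score in [0, 1] between two attribute names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def name_matcher(source_attrs, target_attrs):
    """Build a similarity matrix: rows are source attributes, columns are targets."""
    return [[name_similarity(s, t) for t in target_attrs] for s in source_attrs]

# Hypothetical schemas.
source = ["cust_name", "zip", "phone_no"]
target = ["customer_name", "postal_code", "phone"]

matrix = name_matcher(source, target)
for attr, row in zip(source, matrix):
    print(attr, [round(score, 2) for score in row])
```

Name-based scores find `cust_name` and `customer_name` easily but give little signal for `zip` versus `postal_code`, which is one reason several matchers are usually combined.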
Second-line matchers are so named because they operate on the output of one or more first-line matchers, namely a set of similarity matrices, and perform functions such as filtering, selection, and aggregation of results.
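
The sketch below shows two simple second-line operations under assumed inputs: a weighted average that aggregates several similarity matrices, and a selection rule that keeps each source attribute's best-scoring target if it clears a threshold. The scores, the weights, and the threshold of 0.5 are all made up for the example.

```python
def aggregate(matrices, weights=None):
    """Combine equally shaped similarity matrices by weighted average."""
    weights = weights or [1.0] * len(matrices)
    total = sum(weights)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [
        [sum(w * m[i][j] for w, m in zip(weights, matrices)) / total
         for j in range(cols)]
        for i in range(rows)
    ]

def select(matrix, threshold=0.5):
    """For each source attribute, keep its best target if the score is high enough."""
    pairs = []
    for i, row in enumerate(matrix):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs

# Hypothetical outputs of two first-line matchers over the same attribute pairs.
name_scores = [[0.9, 0.1], [0.2, 0.4]]
instance_scores = [[0.7, 0.3], [0.1, 0.8]]

combined = aggregate([name_scores, instance_scores])
correspondences = select(combined, threshold=0.5)
print(correspondences)
```

Here the second source attribute is matched only because the instance-based scores compensate for a weak name-based score.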

Entity Resolution

Entity resolution is the process of identifying different data instances that refer to the same entity. Entity resolution can occur at different levels of granularity and for different entities appearing in the dataset. For large datasets, the entity resolution task may be daunting, requiring on the order of n² comparisons, where n is the number of records over all datasets. Common approaches therefore perform blocking, in which records are grouped by one or more shared properties and only records within the same group are compared. Entity resolution is an obvious use case for AI-supported data integration (DI) tools.
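
A small sketch of blocking, assuming hypothetical records and using the postal code as the blocking key: candidate pairs are generated only within each block, so far fewer comparisons are needed than when comparing every record with every other.

```python
from collections import defaultdict
from itertools import combinations

def block_by(records, key):
    """Group records by a shared property (the blocking key)."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[key(rec)].append(rec)
    return blocks

def candidate_pairs(blocks):
    """Yield record pairs that share a block; other pairs are never compared."""
    for members in blocks.values():
        yield from combinations(members, 2)

# Hypothetical records.
records = [
    {"id": 1, "name": "Ann Lee", "zip": "10001"},
    {"id": 2, "name": "Anne Lee", "zip": "10001"},
    {"id": 3, "name": "Bob Ray", "zip": "94105"},
    {"id": 4, "name": "Robert Ray", "zip": "94105"},
    {"id": 5, "name": "Cy Fox", "zip": "60601"},
]

blocks = block_by(records, key=lambda r: r["zip"])
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(blocks)]
all_pairs = len(list(combinations(records, 2)))
print(f"{len(pairs)} candidate pairs instead of {all_pairs}")
```

The trade-off is recall: two duplicates that disagree on the blocking key (for example, because of a typo in the postal code) land in different blocks and are never compared.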

Ontologies and their Use in Data Integration

Ontologies

Ontologies provide a conceptualization of the domain (or domains) described by the knowledge graph, adding entailment mechanisms such as the ability to group entities into a class, create same-as links between entities, declare equivalence relationships between classes, and denote predicates as sub-properties.

All ontologies use some form of vocabulary to express terms and specify their meanings. Like taxonomies, they adopt a classification structure. However, ontologies add properties for each class and a set of axioms and rules that allow reasoning and full domain conceptualization.
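
To make the idea of reasoning over a classification structure concrete, here is a toy sketch of a single entailment rule: an instance of a class is also an instance of every superclass. The class names and individuals are hypothetical, and the sketch assumes each class has at most one direct superclass, which real ontology languages do not require.

```python
# Hypothetical ontology fragment: direct subclass and instance assertions.
SUBCLASS_OF = {"Enzyme": "Protein", "Protein": "Molecule"}
INSTANCE_OF = {"lysozyme": "Enzyme", "glucose": "Molecule"}

def superclasses(cls):
    """Return `cls` followed by every class it is transitively a subclass of."""
    result = [cls]
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        result.append(cls)
    return result

def instances_of(cls):
    """Return entities asserted or entailed to be members of `cls`."""
    return sorted(
        entity
        for entity, asserted in INSTANCE_OF.items()
        if cls in superclasses(asserted)
    )

print(instances_of("Protein"))
print(instances_of("Molecule"))
```

Nothing states directly that lysozyme is a molecule; the membership follows from the subclass axioms, which is the kind of inference a plain taxonomy lookup or keyword index would not provide.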

Ontology-Based Data Integration and Access

Taking advantage of the AI knowledge representation and inference mechanisms, Ontology-Based Data Integration (OBDI) uses ontologies to consolidate several heterogeneous sources into one source.

In many cases, existing data sources are not ontology-based, rendering OBDI impossible. Ontology-Based Data Access (OBDA) is an alternative model that provides access to the data layer through a declarative mapping between autonomous data layers and a domain-specific ontology. A typical development process for an OBDA system in a project that has a SQL database contains the following steps.

(a) Create an ontology of domain-specific user knowledge.

(b) Write a mapping that connects the ontology to the project’s database (usually through SQL queries).

(c) Write a query in a semantic query language, such as SPARQL, using the ontology’s vocabulary.

(d) Build an OBDA system framework that automatically rewrites the SPARQL query to a SQL query over the project’s database.
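
The steps above can be sketched in miniature as follows. This is a toy under stated assumptions, not how any particular OBDA framework works: the `staff` table, the ontology terms `:Employee` and `:worksIn`, and the `rewrite` helper are hypothetical, and the "query" is a single fixed pattern rather than parsed SPARQL.

```python
import sqlite3

# Step (b), hypothetical mapping: each ontology term is defined by a SQL query.
MAPPING = {
    ":Employee": "SELECT id AS subject FROM staff",
    ":worksIn": "SELECT id AS subject, dept AS object FROM staff",
}

def rewrite(class_term, property_term):
    """Step (d): rewrite the pattern `?s a <class> ; <property> ?o` into SQL.

    The pattern corresponds to a SPARQL query such as
    SELECT ?s ?o WHERE { ?s a :Employee ; :worksIn ?o }.
    """
    return (
        "SELECT c.subject, p.object "
        f"FROM ({MAPPING[class_term]}) AS c "
        f"JOIN ({MAPPING[property_term]}) AS p ON c.subject = p.subject "
        "ORDER BY c.subject"
    )

# The project's SQL database (in-memory for the example).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (id INTEGER PRIMARY KEY, dept TEXT)")
conn.executemany("INSERT INTO staff VALUES (?, ?)", [(1, "R&D"), (2, "Sales")])

# Step (c): the user asks in ontology vocabulary; the system answers via SQL.
sql = rewrite(":Employee", ":worksIn")
rows = conn.execute(sql).fetchall()
print(rows)
```

The user never sees the `staff` table: the query is phrased entirely in ontology terms, and the mapping alone determines which SQL runs against the database.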

When searching for relevant research, users rely on the search tools provided by the data sources. These can be classified into one of three types of interfaces. Keyword queries comprise a sequence of terms, at least one of which must be present in a dataset for it to be returned in the results. OBDA allows the use of ontological queries, which rely on well-defined ontological terms such as organism species or molecular compounds. The user specifies these terms together with logical constraints and entailment allowances to form a logical statement, and each candidate result must satisfy that statement to be returned.
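
The contrast between the two query styles can be illustrated with a short sketch. The dataset descriptions, the annotation fields `species` and `compound`, and both helper functions are hypothetical. Entailment is left out for brevity, so the logical statement is checked against asserted annotations only.

```python
def keyword_match(text, terms):
    """Keyword query: true if at least one term occurs as a word in the text."""
    words = set(text.lower().split())
    return any(term.lower() in words for term in terms)

def ontological_match(annotations, statement):
    """Ontological query: true only if the logical statement holds."""
    return statement(annotations)

# Hypothetical dataset descriptions with ontology-based annotations.
datasets = [
    {"title": "Mouse liver proteome", "species": "Mus musculus", "compound": "glucose"},
    {"title": "Human liver atlas", "species": "Homo sapiens", "compound": "glucose"},
    {"title": "Yeast growth curves", "species": "Saccharomyces cerevisiae", "compound": "ethanol"},
]

keyword_hits = [
    d["title"] for d in datasets if keyword_match(d["title"], ["liver", "kidney"])
]

def statement(a):
    """Logical statement: species is mouse AND compound is glucose."""
    return a["species"] == "Mus musculus" and a["compound"] == "glucose"

ontological_hits = [
    d["title"] for d in datasets if ontological_match(d, statement)
]

print(keyword_hits)
print(ontological_hits)
```

The keyword query returns both liver datasets because one matching term suffices, while the ontological query returns only the dataset whose annotations satisfy every constraint.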