Entries by danielli.Ital

Schema Matching

Data Integration

The data integration process takes two datasets and combines them into a unified dataset by performing five composable tasks.
Schema matching (1) aligns the schemas of the two datasets.
Schema mapping (2) performs any transformations required by the differing semantics of the aligned fields.
Entity resolution (3) identifies duplicate records.
Entity consolidation (4) merges the duplicates into a single record.
Data cleansing (5) can be applied at any point to detect and correct errors.
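
The five tasks can be sketched as small functions chained into one pipeline. This is only a toy illustration under simplifying assumptions: records are plain dictionaries, and every function name, field name, and value below is hypothetical rather than taken from any particular tool.

```python
def match_schemas(a_fields, b_fields):
    # (1) Schema matching: pair fields whose names agree once case and
    # underscores are ignored. Real matchers are far more sophisticated.
    def norm(field):
        return field.replace("_", "").lower()
    return {b: a for a in a_fields for b in b_fields if norm(a) == norm(b)}

def map_schema(record, correspondences, transforms):
    # (2) Schema mapping: rename aligned fields and convert their values.
    out = {}
    for field, value in record.items():
        target = correspondences.get(field, field)
        out[target] = transforms.get(field, lambda v: v)(value)
    return out

def cleanse(record):
    # (5) Data cleansing: here, only trim stray whitespace in strings.
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}

def resolve_entities(records, key):
    # (3) Entity resolution: group records sharing the same key value.
    groups = {}
    for r in records:
        groups.setdefault(r[key].lower(), []).append(r)
    return list(groups.values())

def consolidate(group):
    # (4) Entity consolidation: keep the first non-empty value per field.
    merged = {}
    for r in group:
        for k, v in r.items():
            if merged.get(k) in (None, ""):
                merged[k] = v
    return merged

def integrate(a, b, transforms, key):
    correspondences = match_schemas(a[0].keys(), b[0].keys())
    mapped_b = [map_schema(r, correspondences, transforms) for r in b]
    cleaned = [cleanse(r) for r in a + mapped_b]
    return [consolidate(g) for g in resolve_entities(cleaned, key)]

customers_a = [{"email": "ann@example.com", "name": "Ann Lee"}]
customers_b = [
    {"Email": " ANN@example.com ", "Name": "Lee, Ann", "Phone": "555-0100"},
    {"Email": "bob@example.com", "Name": "Ray, Bob", "Phone": ""},
]
# Dataset B stores names as "Last, First"; convert to "First Last".
transforms = {"Name": lambda v: " ".join(reversed(v.split(", ")))}

result = integrate(customers_a, customers_b, transforms, key="email")
print(result)
```

In this sketch cleansing runs just before entity resolution, so that the padded, upper-cased email address still groups with its duplicate.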

Schema Matching System

A schema matching process receives two or more datasets and outputs a set of correspondences between the attributes of the datasets' schemas. Schema matching and the related field of ontology alignment have been studied extensively, with research both on building matching systems and on adapting and combining different matching methods to the task at hand.

First line matchers, also known as matching algorithms, similarity measures, and base learners, utilize information contained in the schemas being matched or in the associated data instances, if these are available, to propose correspondences between schema attributes.
Second line matchers are thus named as they operate upon the result of one or more first line matchers, namely a set of similarity matrices, and perform functions such as filtering, selection and aggregation of results.
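
The division of labor between the two kinds of matchers can be sketched as follows. This is a minimal illustration, not a reference implementation: the two first line matchers (name similarity via Python's difflib and Jaccard overlap of instance values), the weights, the threshold, and the sample attributes are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher

def name_matcher(src, tgt):
    """First line matcher: string similarity of attribute names."""
    return [[SequenceMatcher(None, s.lower(), t.lower()).ratio() for t in tgt]
            for s in src]

def instance_matcher(src, tgt):
    """First line matcher: Jaccard overlap of the attributes' value sets."""
    def jaccard(a, b):
        a, b = set(a), set(b)
        return len(a & b) / len(a | b) if a | b else 0.0
    return [[jaccard(src[s], tgt[t]) for t in tgt] for s in src]

def aggregate(matrices, weights):
    """Second line matcher: weighted average of similarity matrices."""
    total = sum(weights)
    rows, cols = len(matrices[0]), len(matrices[0][0])
    return [[sum(w * m[i][j] for m, w in zip(matrices, weights)) / total
             for j in range(cols)] for i in range(rows)]

def select(matrix, src, tgt, threshold):
    """Second line matcher: keep each source attribute's best-scoring
    target attribute, provided the score reaches the threshold."""
    src, tgt = list(src), list(tgt)
    result = []
    for i, row in enumerate(matrix):
        j = max(range(len(row)), key=row.__getitem__)
        if row[j] >= threshold:
            result.append((src[i], tgt[j]))
    return result

# Hypothetical schemas: attribute name -> sample instance values.
source = {"cust_name": ["Ann", "Bob"], "zip": ["10001", "94105"]}
target = {"postal_code": ["94105", "10001", "60601"], "name": ["Bob", "Ann", "Eve"]}

combined = aggregate(
    [name_matcher(source, target), instance_matcher(source, target)],
    weights=[0.5, 0.5],
)
correspondences = select(combined, source, target, threshold=0.3)
print(correspondences)
```

Each first line matcher returns one similarity matrix (rows are source attributes, columns are target attributes); the second line matchers never look at the schemas or data themselves, only at those matrices. Here the instance-based matcher is what rescues the pair zip and postal_code, whose names have little in common.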

Entity Resolution

Entity resolution is the process of identifying different data instances that refer to the same entity. Entity resolution can occur at different levels of granularity and for different entities appearing in the dataset. For large datasets, the entity resolution task may be daunting, requiring on the order of n² pairwise comparisons, where n is the number of records over all datasets. Thus, common approaches perform a process of blocking, where records are grouped by one or more shared properties and only records within the same group are compared. Entity resolution is an obvious use of AI-supported data integration (DI) tools.
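
The effect of blocking on the number of comparisons can be illustrated with a short sketch. The records, the choice of zip code as the blocking property, and the function names are hypothetical; a real system would follow this step with an actual similarity comparison of each candidate pair.

```python
from collections import defaultdict
from itertools import combinations

def all_pairs(records):
    """Every pair of record indices: n * (n - 1) / 2 comparisons."""
    return set(combinations(range(len(records)), 2))

def blocked_pairs(records, blocking_key):
    """Only pairs of records that fall into the same block."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)
    pairs = set()
    for members in blocks.values():
        pairs.update(combinations(members, 2))
    return pairs

records = [
    {"name": "Ann Lee", "zip": "10001"},
    {"name": "Anne Lee", "zip": "10001"},
    {"name": "Bob Ray", "zip": "94105"},
    {"name": "Robert Ray", "zip": "94105"},
    {"name": "Eve Kim", "zip": "60601"},
    {"name": "A. Lee", "zip": "10001"},
]

naive = all_pairs(records)
candidates = blocked_pairs(records, blocking_key=lambda r: r["zip"])
print(len(naive), len(candidates))
```

The trade-off is that duplicates whose blocking property differs (for example, because of a typo in the zip code) land in different blocks and are never compared, so the choice of blocking property matters.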