Data di Pubblicazione:
2016
Citazione:
BLAST: a Loosely Schema-aware Meta-blocking Approach for Entity Resolution / Simonini, G., Bergamaschi, S., Jagadish, H.V.. - In: PROCEEDINGS OF THE VLDB ENDOWMENT. - ISSN 2150-8097. - STAMPA. - 9:12(2016), pp. 1173-1184. (42nd International Conference on Very Large Data Bases, VLDB 2016 2016) [10.14778/2994509.2994533].
Abstract:
Identifying records that refer to the same entity is a fundamental step for data integration. Since it is prohibitively expensive to compare every pair of records, blocking techniques are typically employed to reduce the complexity of this task. These techniques partition records into blocks and limit the comparison to records co-occurring in a block. Generally, to deal with highly heterogeneous and noisy data (e.g. semi-structured data of the Web), these techniques rely on redundancy to reduce the chance of missing matches.
Meta-blocking is the task of restructuring blocks generated by redundancy-based blocking techniques, removing superfluous comparisons. Existing meta-blocking approaches rely exclusively on schema-agnostic features.
In this paper, we demonstrate how “loose” schema information (i.e., statistics collected directly from the data) can be exploited to enhance the quality of the blocks in a holistic loosely schema-aware (meta-)blocking approach that can be used to speed up your favorite Entity Resolution algorithm. We call it Blast (Blocking with Loosely-Aware Schema Techniques). We show how Blast can automatically extract this loose information by adopting a LSH-based step for e ciently scaling to large datasets. We experimentally demonstrate, on real-world datasets, how Blast outperforms the state-of-the-art unsupervised meta-blocking approaches, and, in many cases, also the supervised one.
Meta-blocking is the task of restructuring blocks generated by redundancy-based blocking techniques, removing superfluous comparisons. Existing meta-blocking approaches rely exclusively on schema-agnostic features.
In this paper, we demonstrate how “loose” schema information (i.e., statistics collected directly from the data) can be exploited to enhance the quality of the blocks in a holistic loosely schema-aware (meta-)blocking approach that can be used to speed up your favorite Entity Resolution algorithm. We call it Blast (Blocking with Loosely-Aware Schema Techniques). We show how Blast can automatically extract this loose information by adopting a LSH-based step for e ciently scaling to large datasets. We experimentally demonstrate, on real-world datasets, how Blast outperforms the state-of-the-art unsupervised meta-blocking approaches, and, in many cases, also the supervised one.
Tipologia CRIS:
Articolo su rivista
Keywords:
Entity Resolution; Meta-blocking; Big Data; Data Cleaning
Elenco autori:
Simonini, Giovanni; Bergamaschi, Sonia; Jagadish, H. V.
Link alla scheda completa:
Pubblicato in: