BigDedup: a Big Data Integration toolkit for Duplicate Detection in Industrial Scenarios
Contribution in conference proceedings
Publication date:
2018
Citation:
BigDedup: a Big Data Integration toolkit for Duplicate Detection in Industrial Scenarios / Gagliardelli, Luca; Zhu, Song; Simonini, Giovanni; Bergamaschi, Sonia. - 7:(2018), pp. 1015-1023. (25th International Conference on Transdisciplinary Engineering (TE2018), Modena, July 3-6, 2018) [10.3233/978-1-61499-898-3-1015].
Abstract:
Duplicate detection aims to identify different records in data sources that refer to the same real-world entity. It is a fundamental task for item catalog fusion, customer database integration, fraud detection, and more. In this work we present BigDedup, a toolkit that efficiently detects duplicate records in Big Data sources. BigDedup makes state-of-the-art duplicate detection techniques available on Apache Spark, a modern framework for distributed computing in Big Data scenarios. It can be used in two ways: (i) through a simple graphical interface that lets the user process structured and unstructured data quickly and effectively; (ii) as a library whose components can be easily extended and customized. In the paper we show how to use BigDedup and demonstrate its usefulness through some industrial examples.
CRIS type:
Paper in conference proceedings
Keywords:
Duplicate detection, Entity Resolution, Data Integration, Record Linkage, Big Data
Authors:
Gagliardelli, Luca; Zhu, Song; Simonini, Giovanni; Bergamaschi, Sonia
Book title:
Transdisciplinary Engineering Methods for Social Innovation of Industry 4.0
Published in: