
SparkER: an Entity Resolution framework for Apache Spark

Software
Publication date:
2017
Citation:
SparkER: an Entity Resolution framework for Apache Spark / Gagliardelli, Luca; Simonini, Giovanni; Zhu, Song; Bergamaschi, Sonia. - (2017).
Abstract:
Entity Resolution is a crucial task for many applications, but its naive solution is inefficient because of its quadratic complexity. Usually, to reduce this complexity, blocking is employed to cluster similar entities and so reduce the overall number of comparisons. The Meta-Blocking (MB) approach restructures the block collection in order to reduce the number of comparisons, obtaining better results in terms of execution time. However, these techniques alone are not sufficient in the context of Big Data, where the records to be compared typically number in the hundreds of millions. Parallel implementations of MB have been proposed in the literature, but all of them are built on Hadoop MapReduce, which is known to be inefficient on modern cluster architectures. We implement a Meta-Blocking technique for Apache Spark. Unlike Hadoop, Apache Spark uses a different paradigm to manage tasks: it does not need to save partial results to disk but keeps them in memory, which guarantees a shorter execution time. We reimplemented the state-of-the-art MB techniques, creating a new algorithm that exploits the Spark architecture. We tested our algorithm on several established datasets, showing that our Spark implementation outperforms existing ones based on Hadoop.
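The blocking and Meta-Blocking steps described in the abstract can be illustrated with a minimal single-machine sketch in plain Python. This is not SparkER's API or the authors' algorithm: the function names `token_blocking` and `meta_blocking`, the sample profiles, and the choice of a common-blocks weight with mean-weight pruning are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def token_blocking(profiles):
    """Put two profiles in the same block when they share a token (hypothetical helper)."""
    blocks = defaultdict(set)
    for pid, text in profiles.items():
        for token in set(text.lower().split()):
            blocks[token].add(pid)
    # A block holding a single profile produces no comparison, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def meta_blocking(blocks):
    """Weight each candidate pair by its number of shared blocks,
    then keep only the pairs whose weight reaches the mean weight."""
    weights = defaultdict(int)
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            weights[(a, b)] += 1
    if not weights:
        return {}
    mean = sum(weights.values()) / len(weights)
    return {pair: w for pair, w in weights.items() if w >= mean}


# Illustrative data only, not taken from the paper's datasets.
profiles = {
    1: "John Smith New York",
    2: "Jon Smith New York",
    3: "Mary Jones Boston",
    4: "Mary Jones New Haven",
}

blocks = token_blocking(profiles)
candidates = meta_blocking(blocks)
naive = len(profiles) * (len(profiles) - 1) // 2  # quadratic all-pairs count

print("naive comparisons:", naive)
print("pairs after blocking:", len({p for ids in blocks.values()
                                    for p in combinations(sorted(ids), 2)}))
print("pairs after meta-blocking:", sorted(candidates))
```

On this toy input the all-pairs approach needs 6 comparisons, token blocking leaves 4 distinct candidate pairs, and the pruning step keeps 2: (1, 2) and (3, 4). A distributed version would spread the pair weighting across the cluster, which is where keeping intermediate results in memory matters.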
CRIS type:
Software
Keywords:
Entity resolution; Apache Spark; Record linkage; Meta-Blocking
Author list:
Gagliardelli, Luca; Simonini, Giovanni; Zhu, Song; Bergamaschi, Sonia
University authors:
BERGAMASCHI Sonia
GAGLIARDELLI LUCA
SIMONINI GIOVANNI
Link to the full record:
https://iris.unimore.it/handle/11380/1145776
General data

URL

https://github.com/Gaglia88/sparker