Generating Synthetic Data with Large Language Models for Low-Resource Sentence Retrieval

Conference Paper
Publication Date:
2026
Short description:
Generating Synthetic Data with Large Language Models for Low-Resource Sentence Retrieval / Caffagni, Davide; Cocchi, Federico; Mambelli, Anna; Tutrone, Fabio; Zanella, Marco; Cornia, Marcella; Cucchiara, Rita. - 16097:(2026), pp. 36-52. (29th International Conference on Theory and Practice of Digital Libraries, TPDL 2025, Tampere, Finland, September 23–26, 2025) [10.1007/978-3-032-05409-8_4].
Abstract:
Sentence similarity search is a fundamental task in information retrieval, enabling applications such as search engines, question answering, and textual analysis. However, retrieval systems often struggle when training data are scarce, as is the case for low-resource languages or specialized domains such as ancient texts. To address this challenge, we propose a novel paradigm for domain-specific sentence similarity search, where the embedding space is shaped by a combination of limited real data and a large amount of synthetic data generated by Large Language Models (LLMs). Specifically, we employ LLMs to generate domain-specific sentence pairs and fine-tune a sentence embedding model, effectively distilling knowledge from the LLM to the retrieval model. We validate our method through a case study on biblical intertextuality in Latin, demonstrating that synthetic data augmentation significantly improves retrieval effectiveness in a domain with scarce annotated resources. More broadly, our approach offers a scalable and adaptable framework for enhancing retrieval in domain-specific contexts. Source code and trained models are available at https://github.com/aimagelab/biblical-retrieval-synthesis.
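As a rough illustration of the retrieval step the abstract describes, the sketch below ranks candidate sentences by cosine similarity to a query in an embedding space. It is not the authors' implementation: the `embed` function is a hypothetical stand-in based on character trigram counts, used only so the example runs without external dependencies, whereas the paper fine-tunes a sentence embedding model on real and LLM-generated sentence pairs. The function names `embed`, `cosine`, and `search`, and the three-sentence Latin corpus, are assumptions made for this example.

```python
import math
from collections import Counter


def embed(sentence: str, n: int = 3) -> Counter:
    """Hypothetical stand-in for a sentence embedding model.

    Maps a sentence to a sparse vector of character n-gram counts.
    A real system would call a fine-tuned neural encoder here.
    """
    text = " ".join(sentence.lower().split())  # normalise case and whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query: str, corpus: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Return the top_k corpus sentences most similar to the query."""
    q = embed(query)
    scored = [(cosine(q, embed(s)), s) for s in corpus]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]


# Illustrative corpus of three Latin biblical sentences.
corpus = [
    "in principio creavit deus caelum et terram",
    "dominus regit me et nihil mihi deerit",
    "beati pauperes spiritu",
]

# A loosely reworded query that should retrieve the first sentence.
results = search("in principio deus creavit caelum", corpus)
for score, sentence in results:
    print(f"{score:.3f}  {sentence}")
```

In the setting the abstract describes, only `embed` would change: the ranking logic stays the same, while the encoder is adapted to the domain using the combined real and synthetic sentence pairs.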
Iris type:
Paper in Conference Proceedings
Keywords:
Digital Humanities; Large Language Models; Sentence Embeddings; Sentence Similarity Search; Intertextuality; Biblical Versions
List of contributors:
Caffagni, Davide; Cocchi, Federico; Mambelli, Anna; Tutrone, Fabio; Zanella, Marco; Cornia, Marcella; Cucchiara, Rita
Authors of the University:
CAFFAGNI Davide
CORNIA Marcella
CUCCHIARA Rita
MAMBELLI Anna
Handle:
https://iris.unimore.it/handle/11380/1389118
Book title:
Linking Theory and Practice of Digital Libraries: 29th International Conference on Theory and Practice of Digital Libraries, TPDL 2025 (Tampere, Finland, September 23–26, 2025), Proceedings
Published in:
LECTURE NOTES IN COMPUTER SCIENCE (Series)