
Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets

Article
Publication date:
2024
Citation:
Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets / Cornia, Marcella; Baraldi, Lorenzo; Fiameni, Giuseppe; Cucchiara, Rita. - In: INTERNATIONAL JOURNAL OF COMPUTER VISION. - ISSN 0920-5691. - 132:5(2024), pp. 1701-1720. [10.1007/s11263-023-01949-w]
Abstract:
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources, containing both human-annotated and web-collected captions. Large-scale datasets with noisy image-text pairs, indeed, provide a sub-optimal source of supervision because of their low-quality descriptive style, while human-annotated datasets are cleaner but smaller in scale. To get the best of both worlds, we propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component. The proposed model avoids the need for object detectors, is trained with a single objective of prompt language modeling, and can replicate the style of human-collected captions while training on sources with different input styles. Experimentally, the model shows a strong capability of recognizing real-world concepts and producing high-quality captions. Extensive experiments are performed on different image captioning datasets, including CC3M, nocaps, and the competitive COCO dataset, where our model consistently outperforms baselines and state-of-the-art approaches.
CRIS type:
Journal article
Keywords:
Image captioning; Multimodal learning; Vision and language
Author list:
Cornia, Marcella; Baraldi, Lorenzo; Fiameni, Giuseppe; Cucchiara, Rita
University authors:
BARALDI LORENZO
CORNIA MARCELLA
CUCCHIARA Rita
Link to the full record:
https://iris.unimore.it/handle/11380/1323870
Link to the full text:
https://iris.unimore.it//retrieve/handle/11380/1323870/614141/2023_IJCV_Universal_Captioner.pdf
Published in:
INTERNATIONAL JOURNAL OF COMPUTER VISION
Journal