Beyond Parliament: AI-Enhanced Multilingual Corpus Using Innovative Methodology for Non-Institutional Political Speeches in German, French, Spanish and Italian

Project

The research project aims to conduct an in-depth study of the scientific, technical, and copyright aspects of creating a multilingual corpus. The planned corpus, drawing on the latest advances in Artificial Intelligence, will provide transcripts of oral political discourse and serve as a basis for international studies in political linguistics. The project focuses on building a corpus of non-institutional political speeches in German, French, Spanish and Italian and on analysing them. It aims to implement a methodology that uses web tools, Automatic Speech Recognition (ASR), and AI transcription systems to orthographically transcribe, segment, annotate, and analyse the collected speeches. A prosodic analysis of the speeches will also be carried out, testing methods that make it possible to examine the relation between prosody and other linguistic aspects, such as lexical choices and metaphorical instances. The outcomes of the prosodic analysis will be available in the corpus.
By ‘non-institutional’ speeches, we refer to speeches delivered not in parliamentary settings but at political conventions and in public appearances during election campaigns. A focus on non-institutional speeches seems necessary for several reasons. First, parliamentary speeches are usually transcribed with the help of ASR and AI systems and then corrected by stenographers, a process that removes the typical idiosyncrasies of spoken language. Moreover, these speeches are often read aloud in sessions, which makes them more akin to written than to spoken language. Importantly, non-institutional speeches have not yet been studied in depth by scholars, possibly because they are difficult to transcribe: videos of these speeches are publicly available on platforms like YouTube, but unlike recordings made in parliamentary settings, they often suffer from lower audio quality due to background noise.
Even though the corpus consists solely of transcripts, crucial visual context can be recorded in the comment section of the transcription tools. Documenting non-verbal cues alongside the audio data is essential for analysing prosodic features such as intonation, stress, and rhythm. By considering both verbal and non-verbal aspects, researchers can arrive at more nuanced interpretations and a richer understanding of political speeches. The project aims to implement a methodology tailored to the analysis of non-institutional speeches. The outcome will be a corpus of speeches made available for politolinguistic studies and for discourse analysis more generally. The corpus will present a model that combines orthographic transcripts with further layers of analysis, including prosody. Additionally, a collaborative platform will be developed, where the transcripts of German, French, Spanish and Italian speeches delivered during election campaigns will be available alongside references to the corresponding video and audio sources.