
Creating a ground truth multilingual dataset of news and talk show transcriptions through crowdsourcing

Sprugnoli, Rachele; Moretti, Giovanni; Bentivogli, Luisa; Giuliani, Diego
2016

Abstract

This paper describes the development of a multilingual, multigenre, manually annotated speech dataset, freely available to the research community as ground truth for the evaluation of automatic transcription and spoken language translation systems. The dataset includes two video genres, television broadcast news and talk shows, and covers Flemish, English, German, and Italian, for a total of about 35 hours of television speech. Besides segmentation and orthographic transcription, we added rich annotation of the audio signal, both at the linguistic level (e.g. filled pauses, pronunciation errors, disfluencies, speech in a foreign language) and at the acoustic level (e.g. background noise and different types of non-speech events). Furthermore, a subset of the transcriptions was translated in four directions, namely Flemish to English, German to English, German to Italian, and English to Italian. The development of the dataset was organized in several phases, relying on expert transcribers as well as on non-expert contributors recruited through crowdsourcing. We first conducted a feasibility study to test and compare two methods for crowdsourcing speech transcription on broadcast news data. The two methods are based on different transcription processes (parallel vs. iterative) and incorporate different quality control mechanisms. With both methods we achieved near-expert transcription quality, measured in terms of word error rate, for the English, German, and Italian data. For the Flemish data, however, we were unable to obtain a sufficient response from the crowd to complete the transcription tasks offered. These results show that the viability of crowdsourced speech transcription depends strongly on the target language. The paper provides a detailed comparison of the results obtained with the two crowdsourcing methods, and describes the main characteristics of the final ground truth resource, the methodology adopted, and the guidelines prepared for its development.
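The word error rate mentioned above is the standard edit-distance-based measure, WER = (S + D + I) / N, i.e. substitutions, deletions, and insertions counted over the number of reference words. The following minimal Python sketch illustrates the computation for a single segment; it is an illustration only, not the scoring tool used in the evaluation described in the paper, and the function name is ours.

```python
# Minimal word error rate (WER) sketch: WER = (S + D + I) / N, obtained via
# Levenshtein alignment of a hypothesis transcription against a reference.
# Illustrative only; not the authors' evaluation code.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits needed to turn hyp[:j] into ref[:i]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                      # deletions only
    for j in range(len(hyp) + 1):
        dp[0][j] = j                      # insertions only
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j - 1] + sub,   # substitution / match
                           dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1)         # insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: a crowd transcription with one extra word against a 4-word reference
print(word_error_rate("the news at six", "the news at the six"))  # 0.25
```

In practice, crowd transcriptions are scored against expert references in this way, and the aggregated WER is what supports the comparison of the parallel and iterative crowdsourcing methods reported in the paper.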
Creating a ground truth multilingual dataset of news and talk show transcriptions through crowdsourcing / Sprugnoli, Rachele; Moretti, Giovanni; Bentivogli, Luisa; Giuliani, Diego. - In: LANGUAGE RESOURCES AND EVALUATION. - ISSN 1574-020X. - 2016:(2016), pp. 1-35. [10.1007/s10579-016-9372-5]
Files in this record:
LRandEval_2_0__Copy_.pdf
Description: post-print of the paper
Type: refereed author's manuscript (post-print)
License: all rights reserved
Size: 1.12 MB
Format: Adobe PDF
Access: archive administrators only

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/164471
Citations
  • PubMed Central: n/a
  • Scopus: 3
  • Web of Science: 3