Structural Self-Supervised Objectives for Transformers

Di Liello, Luca

doi:10.15168/11572_389313

In this Thesis, we leverage unsupervised raw data to develop more efficient pre-training objectives and self-supervised tasks that align well with downstream applications. In the first part, we present three alternative objectives to BERT’s Masked Language Modeling (MLM), namely Random Token Substitution (RTS), Cluster-based Random Token Substitution (C-RTS), and Swapped Language Modeling (SLM). Unlike MLM, all of these proposals involve token swapping rather than replacing tokens with BERT’s [MASK]. RTS and C-RTS involve predicting the originality of tokens, while SLM tasks the model with predicting the original token values. Each objective is applied to several models, which are trained using the same computational budget and corpora. Evaluation results reveal that RTS and C-RTS require up to 45% less pre-training time while achieving performance on par with MLM. Notably, SLM outperforms MLM on several Answer Sentence Selection and GLUE tasks, despite utilizing the same computational budget for pre-training. In the second part of the Thesis, we propose self-supervised pre-training tasks that exhibit structural alignment with downstream applications, leading to improved performance and reduced reliance on labeled data to achieve comparable results. We exploit the weak supervision provided by large corpora like Wikipedia and CC-News, challenging the model to recognize whether spans of text originate from the same paragraph or document. To this end, we design (i) a pre-training objective that targets multi-sentence inference models by performing predictions over multiple spans of text simultaneously, (ii) self-supervised objectives tailored to enhance performance in Answer Sentence Selection and its Contextual version, and (iii) a pre-training objective aimed at performance improvements in Summarization.
Through continuous pre-training, starting from renowned checkpoints such as RoBERTa, ELECTRA, DeBERTa, BART, and T5, we demonstrate that our models achieve higher performance on Fact Verification, Answer Sentence Selection, and Summarization. We extensively evaluate our proposals on different benchmarks, revealing significant accuracy gains, particularly when annotation in the target dataset is limited. Notably, we achieve state-of-the-art results on the development set of the FEVER dataset and, on the test set, results close to those of state-of-the-art models that use many more parameters. Furthermore, our objectives enable us to attain state-of-the-art results on the ASNQ, WikiQA, and TREC-QA test sets across all evaluation metrics (MAP, MRR, and P@1). For Summarization, our objective enhances summary quality, as measured by various metrics like ROUGE and BLEURT. We maintain that our proposals can be seamlessly combined with other techniques from recently proposed works, as they do not require alterations to the internal structure of Transformer models but only involve modifications to the training tasks.
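
As an illustration of the token-swapping objectives described above, the following Python sketch corrupts a sequence of token ids and builds the two kinds of targets: binary "was this token substituted?" labels in the spirit of RTS, and original-token labels at the swapped positions in the spirit of SLM. It is a minimal sketch under stated assumptions and not the thesis implementation: the function name corrupt_tokens, the 15% swap rate, uniform sampling of replacements, and the use of -100 as an "ignore this position" label are all assumptions introduced here. C-RTS, which draws replacements using token clusters, is not shown.

```python
import random


def corrupt_tokens(token_ids, vocab_size, swap_prob=0.15, seed=0):
    """Swap a random subset of tokens and build RTS-style and SLM-style targets.

    All names and default values are hypothetical; the abstract does not
    specify them.
    """
    rng = random.Random(seed)
    corrupted = list(token_ids)
    # RTS-style target: 1 if the token was substituted, 0 if it is original.
    rts_labels = [0] * len(token_ids)
    # SLM-style target: original id at swapped positions, -100 elsewhere
    # (assumed convention for positions that should not contribute to the loss).
    slm_labels = [-100] * len(token_ids)

    for i, tok in enumerate(token_ids):
        if rng.random() < swap_prob:
            # Draw a replacement uniformly from the vocabulary, excluding
            # the original token so that every swap is a real change.
            new_tok = rng.randrange(vocab_size - 1)
            if new_tok >= tok:
                new_tok += 1
            corrupted[i] = new_tok
            rts_labels[i] = 1
            slm_labels[i] = tok
    return corrupted, rts_labels, slm_labels


if __name__ == "__main__":
    ids = [5, 9, 2, 7, 7, 3, 1, 8]
    corrupted, rts, slm = corrupt_tokens(ids, vocab_size=10, swap_prob=0.5, seed=1)
    print("original :", ids)
    print("corrupted:", corrupted)
    print("RTS      :", rts)
    print("SLM      :", slm)
```

In this sketch the input never contains a [MASK] token: the model always sees a sequence of ordinary vocabulary items, and only the targets differ between the two objectives.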

Structural Self-Supervised Objectives for Transformers / Di Liello, Luca. - (2023 Sep 21), pp. 1-140. [10.15168/11572_389313]

2023-09-21

Defended on: 21 Sep 2023
Cycle: XXXV
Academic year: 2022-2023
Department: Università degli Studi di Trento
Doctoral programme: Informatica e telecomunicazioni (up to academic year 2020-21, 36th cycle)
Unitn internal supervisors: Moschitti, Alessandro; Uryupina, Olga
Bi-nationally supervised Doctoral Thesis: no
DOI: https://dx.doi.org/10.15168/11572_389313
Language: English
Reference SSD (valid until 24/06/2024): Sector INF/01 - Computer Science
Appears in types: 08.1 Doctoral Thesis

Files in this item:

File	Size	Format
PhD-Thesis_v2.pdf (open access; type: Doctoral Thesis; license: Creative Commons)	3.02 MB	Adobe PDF