Enhancing Interpretability Using Human Similarity Judgements to Prune Word Embeddings

Flechas Manrique, Natalia; Bao, Wanqian; Herbelot, Aurelie; Hasson, Uri

doi:10.18653/v1/2023.blackboxnlp-1.13

Interpretability methods in NLP aim to provide insights into the semantics underlying specific system architectures. Focusing on word embeddings, we present a supervised-learning method that, for a given domain (e.g., sports, professions), identifies a subset of model features that strongly improve prediction of human similarity judgments. We show this method keeps only 20-40% of the original embeddings, for 8 independent semantic domains, and that it retains different feature sets across domains. We then present two approaches for interpreting the semantics of the retained features. The first obtains the scores of the domain words (co-hyponyms) on the first principal component of the retained embeddings, and extracts terms whose co-occurrence with the co-hyponyms tracks these scores' profile. This analysis reveals that humans differentiate, e.g., sports based on how gender-inclusive and international they are. The second approach uses the retained sets as variables in a probing task that predicts values along 65 semantically annotated dimensions for a dataset of 535 words. The features retained for professions are best at predicting cognitive, emotional and social dimensions, whereas features retained for fruits or vegetables best predict the gustation (taste) dimension. We discuss implications for alignment between AI systems and human knowledge.
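The pruning idea in the abstract can be illustrated with a small sketch. The abstract does not specify the selection algorithm, so the code below assumes a simple greedy backward elimination: embedding dimensions are dropped one at a time for as long as dropping one raises the Pearson correlation between the model's pairwise cosine similarities and the human similarity judgments. The function names (`pairwise_cosine`, `fit_score`, `prune_features`), the synthetic data, and the in-sample scoring are all hypothetical choices for illustration, not the authors' procedure.

```python
import numpy as np


def pairwise_cosine(X):
    """Cosine similarity for every unordered pair of rows in X (upper triangle)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    unit = X / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    return sims[np.triu_indices(X.shape[0], k=1)]


def fit_score(X, human):
    """Pearson correlation between model similarities and human judgments."""
    return float(np.corrcoef(pairwise_cosine(X), human)[0, 1])


def prune_features(E, human, min_keep=2):
    """Greedy backward elimination (an assumed procedure, not the paper's).

    Repeatedly removes the single dimension whose removal most improves
    the fit to human judgments, and stops when no removal helps.
    """
    kept = list(range(E.shape[1]))
    best = fit_score(E[:, kept], human)
    while len(kept) > min_keep:
        trials = [
            (fit_score(E[:, [k for k in kept if k != j]], human), j)
            for j in kept
        ]
        score, drop = max(trials)
        if score <= best:
            break
        best = score
        kept.remove(drop)
    return kept, best


# Synthetic stand-in data: 12 "words" with 20-dimensional embeddings.
rng = np.random.default_rng(0)
E = rng.normal(size=(12, 20))

# Simulated human judgments: driven by the first 5 dimensions, plus noise.
n_pairs = 12 * 11 // 2
human = pairwise_cosine(E[:, :5]) + 0.05 * rng.normal(size=n_pairs)

baseline = fit_score(E, human)
kept, pruned = prune_features(E, human)
print(f"all {E.shape[1]} dims: r = {baseline:.3f}")
print(f"kept {len(kept)} dims: r = {pruned:.3f}")
```

Because this sketch scores the fit on the same pairs it selects on, it will overfit; a supervised setup such as the one the abstract describes would normally evaluate the retained features on held-out word pairs.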

Enhancing Interpretability Using Human Similarity Judgements to Prune Word Embeddings / Flechas Manrique, Natalia; Bao, Wanqian; Herbelot, Aurelie; Hasson, Uri. - (2023), pp. 169-179. (Paper presented at the BlackboxNLP conference held in Singapore on 7 December 2023) [10.18653/v1/2023.blackboxnlp-1.13].

2023-01-01



	Date of publication: 2023
	Proceedings title: Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
	Place of publication: UK (Online)
	Publisher: Association for Computational Linguistics
	All authors: Flechas Manrique, Natalia; Bao, Wanqian; Herbelot, Aurelie; Hasson, Uri
					
			
	Publication type: 04.1 Paper in Proceedings

Files in this record:

File	Size	Format
2023.blackboxnlp-1.13.pdf (open access; type: refereed author's manuscript, post-print; license: Creative Commons)	619.39 kB	Adobe PDF