Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning / Marconato, Emanuele; Passerini, Andrea; Teso, Stefano. - In: ENTROPY. - ISSN 1099-4300. - 25:12(2023), pp. 157401-157433. [10.3390/e25121574]

Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning

Marconato, Emanuele; Passerini, Andrea; Teso, Stefano
2023

Abstract

Research on Explainable Artificial Intelligence has recently started exploring the idea of producing explanations that, rather than being expressed in terms of low-level features, are encoded in terms of interpretable concepts learned from data. How to reliably acquire such concepts is, however, still fundamentally unclear. An agreed-upon notion of concept interpretability is missing, with the result that concepts used by both post hoc explainers and concept-based neural networks are acquired through a variety of mutually incompatible strategies. Critically, most of these neglect the human side of the problem: a representation is understandable only insofar as it can be understood by the human at the receiving end. The key challenge in human-interpretable representation learning (HRL) is how to model and operationalize this human element. In this work, we propose a mathematical framework for acquiring interpretable representations suitable for both post hoc explainers and concept-based neural networks. Our formalization of HRL builds on recent advances in causal representation learning and explicitly models a human stakeholder as an external observer. This allows us to derive a principled notion of alignment between the machine’s representation and the vocabulary of concepts understood by the human. In doing so, we link alignment and interpretability through a simple and intuitive name transfer game, and clarify the relationship between alignment and a well-known property of representations, namely disentanglement. We also show that alignment is linked to the issue of undesirable correlations among concepts, also known as concept leakage, and to content-style separation, all through a general information-theoretic reformulation of these properties. Our conceptualization aims to bridge the gap between the human and algorithmic sides of interpretability and establish a stepping stone for new research on human-interpretable representations.
Files in this record:

entropy-25-01574-v2.pdf
  Access: open access
  Type: Publisher's version (Publisher's layout)
  License: Creative Commons
  Size: 541.03 kB
  Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/400719
Citations
  • PMC: 0
  • Scopus: 0
  • Web of Science (ISI): 0
  • OpenAlex: not available