Today AI systems are trained by ultimately using a classifier to perform a downstream task and are mostly evaluated on the task success they reach. Not enough attention is given to how the classifier distributes the probabilities among the candidates, out of which the target with the highest probability is selected. We propose to take the probability distribution as a litmus test to inspect models’ grounding skills. We take a visually grounded referential guessing game as a test-bed and use the probability distribution to evaluate whether question-answer pairs are well grounded by the model. To this end, we propose a method to obtain such soft labels automatically and show that they correlate well with human uncertainty about the grounded interpretation of the QA pair. Our results show that higher task accuracy does not necessarily correspond to a more meaningful probability distribution; we do not consider models that fail our litmus test trustworthy.
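To make the distinction between task success and a meaningful probability distribution concrete, here is a minimal sketch. It is not the method from the paper: the soft labels, the logits, and the use of KL divergence as the comparison measure are all assumptions chosen for illustration. Two hypothetical models both select the correct target, yet only one spreads its probability mass over the candidates that remain plausible after the question-answer pair.

```python
import math


def softmax(logits):
    """Turn raw classifier scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same candidates."""
    return sum(
        pi * math.log(pi / max(qi, eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def guess(probs):
    """Index of the candidate with the highest probability."""
    return max(range(len(probs)), key=probs.__getitem__)


# Hypothetical soft labels over four candidates (an assumption): after the
# QA pair, candidates 0 and 1 stay equally plausible, 2 and 3 are ruled out.
soft_labels = [0.5, 0.5, 0.0, 0.0]
target = 0

# Hypothetical logits from two models (assumptions, not results from the paper).
probs_a = softmax([2.0, 1.9, -3.0, -3.0])  # mass on the plausible candidates
probs_b = softmax([2.0, -3.0, 1.9, -3.0])  # mass on a ruled-out candidate

guess_a, guess_b = guess(probs_a), guess(probs_b)
kl_a = kl_divergence(soft_labels, probs_a)
kl_b = kl_divergence(soft_labels, probs_b)

print("model A correct:", guess_a == target, "KL to soft labels:", round(kl_a, 3))
print("model B correct:", guess_b == target, "KL to soft labels:", round(kl_b, 3))
```

Both models would score the same on task accuracy for this example, while the divergence from the soft labels separates them. Any other measure of distance between distributions could be substituted for KL divergence here.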

Probability Distributions as a Litmus Test to Inspect NNs Grounding Skills / Lucassen, A. J.; Testoni, A.; Bernardi, Raffaella. - ELECTRONIC. - 3287:(2022), pp. 96-114. (Paper presented at the NL4AI workshop held in Udine on 30 November 2022).

Probability Distributions as a Litmus Test to Inspect NNs Grounding Skills

Testoni, A.; Bernardi, Raffaella
2022-01-01

2022
Sixth Workshop on Natural Language for Artificial Intelligence
Aachen
RWTH Aachen
Lucassen, A. J.; Testoni, A.; Bernardi, Raffaella
Files in this record:
paper11.pdf (open access)
Description: paper
Type: Published version (publisher’s layout)
License: Creative Commons
Size: 3.93 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/365191