Probability Distributions as a Litmus Test to Inspect NNs Grounding Skills / Lucassen, A. J.; Testoni, A.; Bernardi, Raffaella. - ELECTRONIC. - 3287:(2022), pp. 108-126. (Paper presented at the 6th Workshop on Natural Language for Artificial Intelligence, NL4AI 2022, held in Udine, 30 November 2022).
Probability Distributions as a Litmus Test to Inspect NNs Grounding Skills
Testoni, A.; Bernardi, Raffaella
2022-01-01
Abstract
Today's AI systems are typically trained with a classifier that performs a downstream task, and they are evaluated mostly on the task success they achieve. Little attention is paid to how the classifier distributes probability among the candidates from which the target with the highest probability is selected. We propose using the probability distribution as a litmus test to inspect models' grounding skills. We take a visually grounded referential guessing game as a test-bed and use the probability distribution to evaluate whether question-answer pairs are well grounded by the model. To this end, we propose a method to obtain such soft labels automatically and show that they correlate well with human uncertainty about the grounded interpretation of the QA pair. Our results show that higher task accuracy does not necessarily correspond to a more meaningful probability distribution; we do not consider trustworthy models that fail our litmus test.
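The core idea of the abstract — comparing a model's probability distribution over candidates against soft labels reflecting human uncertainty — can be sketched minimally. The snippet below is a hypothetical illustration, not the paper's actual method: the distributions are invented, and KL divergence is just one plausible way to quantify the mismatch between a model's softmax output and human-derived soft labels.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions over the same candidates.

    A small epsilon guards against log(0) for zero-probability candidates.
    """
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical example: a model's softmax over four candidate referents
# vs. soft labels derived from human annotations of the same QA pair.
model_dist = [0.70, 0.20, 0.05, 0.05]
human_soft_labels = [0.50, 0.40, 0.05, 0.05]

divergence = kl_divergence(model_dist, human_soft_labels)
print(f"KL(model || human) = {divergence:.4f}")
```

A model could pick the correct target (highest-probability candidate) in both distributions above and still diverge from human uncertainty — which is precisely why task accuracy alone is an incomplete signal.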
File | Description | Type | License | Size | Format
---|---|---|---|---|---
paper11.pdf (open access) | paper | Editorial version (Publisher's layout) | Creative Commons | 3.93 MB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.