CIVET: Systematic Evaluation of Understanding in VLMs / Rizzoli, Massimo; Alghisi, Simone; Khomyn, Olha; Roccabruna, Gabriel; Mousavi, Seyed Mahed; Riccardi, Giuseppe. - (2025), pp. 4462-4480. (30th Conference on Empirical Methods in Natural Language Processing, EMNLP 2025, Suzhou, China, 4th November-9th November 2025) [10.18653/v1/2025.findings-emnlp.239].

CIVET: Systematic Evaluation of Understanding in VLMs

Rizzoli Massimo (co-first author); Alghisi Simone (co-first author); Khomyn Olha; Roccabruna Gabriel; Mousavi Seyed Mahed; Riccardi Giuseppe
2025-01-01

Abstract

While Vision-Language Models (VLMs) have achieved competitive performance in various tasks, their comprehension of the underlying structure and semantics of a scene remains understudied. To investigate the understanding of VLMs, we study their capability regarding object properties and relations in a controlled and interpretable manner. To this end, we introduce CIVET, a novel and extensible framework for systematiC evaluatIon Via controllEd sTimuli. CIVET addresses the lack of standardized systematic evaluation for assessing VLMs’ understanding, enabling researchers to test hypotheses with statistical rigor. With CIVET, we evaluate five state-of-the-art VLMs on exhaustive sets of stimuli, free from annotation noise, dataset-specific biases, and uncontrolled scene complexity. Our findings reveal that 1) current VLMs can accurately recognize only a limited set of basic object properties; 2) their performance heavily depends on the position of the object in the scene; 3) they struggle to understand basic relations among objects. Furthermore, a comparative evaluation with human annotators reveals that VLMs still fall short of achieving human-level accuracy.
Year: 2025
Published in: Findings of the Association for Computational Linguistics: EMNLP 2025
Publisher address: 209 N. Eighth Street, Stroudsburg, PA, USA, 18360
Publisher: Association for Computational Linguistics (ACL)
ISBN: 979-8-89176-335-7

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/467573