ProfVLM: A lightweight video-language model for multi-view proficiency estimation

Most existing approaches formulate action quality assessment and skill proficiency estimation as discriminative prediction tasks, typically producing discrete labels or scores without explicitly modeling the reasoning process underlying the assessment. We instead reformulate the problem as generative vision-language modeling, introducing ProfVLM, a parameter-efficient vision-language model that jointly predicts proficiency levels and generates expert-like natural language feedback from multi-view videos. ProfVLM leverages conditional language generation to provide actionable insights along with quantitative evaluation scores. Central to our method is an AttentiveGatedProjector that dynamically fuses and projects multi-view egocentric and exocentric features from a frozen TimeSformer backbone into a language model fine-tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60% compared to existing classification-based methods. By providing natural language critiques aligned with performance levels, this work shows that generative vision-language modeling offers a powerful and efficient paradigm shift for interpretable action quality assessment.
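The abstract names an AttentiveGatedProjector but does not describe its internals, so the following is only a generic illustration of attention-weighted, gated fusion of per-view features followed by a linear projection. It is not the authors' implementation. Every name (`fuse_views`, `query`, `gate_weights`, `proj_weights`), every dimension, and the random weights are hypothetical, and plain Python lists stand in for the features a frozen video backbone would produce.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def fuse_views(view_features, query, gate_weights, proj_weights):
    """Fuse one feature vector per camera view and project the result.

    All parameters are illustrative assumptions, not taken from the paper:
      view_features: V vectors of size D (e.g. one ego view, several exo views)
      query:         size-D vector scoring how relevant each view is
      gate_weights:  D x D matrix producing a per-dimension gate
      proj_weights:  H x D matrix mapping into a language-model-sized space
    """
    dim = len(query)
    # Scaled dot-product attention: one weight per view.
    scores = [
        sum(q * f for q, f in zip(query, feat)) / math.sqrt(dim)
        for feat in view_features
    ]
    attn = softmax(scores)
    # Attention-weighted sum of the view features.
    fused = [
        sum(a * feat[i] for a, feat in zip(attn, view_features))
        for i in range(dim)
    ]
    # Sigmoid gate decides how much of each fused dimension passes through.
    gate = [sigmoid(g) for g in matvec(gate_weights, fused)]
    gated = [g * x for g, x in zip(gate, fused)]
    # Linear projection into the (hypothetical) language-model width.
    return attn, matvec(proj_weights, gated)


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
feature_dim, lm_dim, num_views = 4, 6, 3  # toy sizes chosen for readability
views = [[rng.gauss(0.0, 1.0) for _ in range(feature_dim)] for _ in range(num_views)]
query = [rng.gauss(0.0, 1.0) for _ in range(feature_dim)]
gate_w = random_matrix(feature_dim, feature_dim, rng)
proj_w = random_matrix(lm_dim, feature_dim, rng)

attn, token = fuse_views(views, query, gate_w, proj_w)
print("attention over views:", [round(a, 3) for a in attn])
print("projected vector length:", len(token))
```

In a trained model the query, gate, and projection weights would be learned, and the projected vector would be passed to the language model as a visual token. Here they are random, so only the shapes and the normalization of the attention weights are meaningful.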

ProfVLM: A lightweight video-language model for multi-view proficiency estimation / Bianchi, E., Staiano, J., Liotta, A.. - In: COMPUTER VISION AND IMAGE UNDERSTANDING. - ISSN 1077-3142. - 268:(2026), p. 104749. [10.1016/j.cviu.2026.104749]

Bianchi, Edoardo (first author); Staiano, Jacopo (second author); Liotta, Antonio (last author)

2026-01-01

Date of publication: 2026
Journal title: COMPUTER VISION AND IMAGE UNDERSTANDING
DOI: https://dx.doi.org/10.1016/j.cviu.2026.104749
Scopus identifier: 2-s2.0-105034458654
WOS identifier: WOS:001736043900001
All authors: Bianchi, Edoardo; Staiano, Jacopo; Liotta, Antonio
Publication type: 03.1 Journal article

Files in this item:

File	Size	Format
CVIU_2026.pdf (open access; type: publisher's version (Publisher’s layout); license: Creative Commons)	1.87 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/485070
