The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation / Zara, Giacomo; Conti, Alessandro; Roy, Subhankar; Lathuilière, Stéphane; Rota, Paolo; Ricci, Elisa. - (2023), pp. 10273-10283. (Paper presented at the 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023, held in Paris, France, 01-06 October 2023) [10.1109/ICCV51070.2023.00946].
The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation
Zara, Giacomo; Conti, Alessandro; Roy, Subhankar; Lathuilière, Stéphane; Rota, Paolo; Ricci, Elisa
2023-01-01
Abstract
The Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists in adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset, without accessing the actual source data. Previous approaches have attempted to address SFVUDA by leveraging self-supervision (e.g., enforcing temporal consistency) derived from the target data itself. In this work, we take an orthogonal approach by exploiting "web-supervision" from Large Language-Vision Models (LLVMs), driven by the rationale that LLVMs contain a rich world prior that is surprisingly robust to domain shift. We showcase the unreasonable effectiveness of integrating LLVMs for SFVUDA by devising an intuitive and parameter-efficient method, which we name Domain Adaptation with Large Language-Vision models (DALL-V), that distills the world prior and complementary source model information into a student network tailored for the target domain. Despite its simplicity, DALL-V achieves significant improvement over state-of-the-art SFVUDA methods.
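The abstract outlines the core mechanism: pseudo-labels from a language-vision model and a frozen source model are combined and distilled into a student network for the target domain. Below is a minimal PyTorch sketch of such an ensemble-distillation step; it is an illustration under assumptions, not DALL-V's actual recipe. The names `student`, `source_model`, `llvm_logits`, the 50/50 mixing weight, and the temperature `T` are all hypothetical choices made for the example.

```python
import torch
import torch.nn.functional as F

def distill_step(student, source_model, llvm_logits, clips, optimizer, T=2.0):
    """One ensemble-distillation step (illustrative sketch only).

    student:      trainable target network mapping clips -> logits
    source_model: frozen source-trained network mapping clips -> logits
    llvm_logits:  precomputed zero-shot logits from a language-vision model
    """
    with torch.no_grad():
        # Ensemble the two "teachers": web prior (LLVM) + source-model knowledge.
        teacher_probs = (0.5 * F.softmax(llvm_logits / T, dim=-1)
                         + 0.5 * F.softmax(source_model(clips) / T, dim=-1))
    student_log_probs = F.log_softmax(student(clips) / T, dim=-1)
    # KL distillation loss pulls the student toward the ensembled pseudo-labels.
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with dummy linear models over flattened clip features.
if __name__ == "__main__":
    n_classes, feat_dim = 12, 512
    student = torch.nn.Linear(feat_dim, n_classes)
    source_model = torch.nn.Linear(feat_dim, n_classes).eval()
    clips = torch.randn(8, feat_dim)         # stand-in for video clip features
    llvm_logits = torch.randn(8, n_classes)  # stand-in for zero-shot LLVM logits
    opt = torch.optim.SGD(student.parameters(), lr=1e-3)
    print(distill_step(student, source_model, llvm_logits, clips, opt))
```

The single distillation target built from two teachers is what makes the method parameter-efficient: only the student is updated, while the LLVM and the source model are used purely as frozen sources of supervision.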
File | Description | Type | License | Size | Format | Access
---|---|---|---|---|---|---
Zara_The_Unreasonable_Effectiveness_of_Large_Language-Vision_Models_for_Source-Free_Video_ICCV_2023_paper.pdf | ICCV paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version | Refereed author's manuscript (post-print) | All rights reserved | 753.02 kB | Adobe PDF | Open access (View/Open)
The_Unreasonable_Effectiveness_of_Large_Language-Vision_Models_for_Source-free_Video_Domain_Adaptation.pdf | | Publisher's layout | All rights reserved | 1.28 MB | Adobe PDF | Repository managers only (View/Open)
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.