The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation / Zara, Giacomo; Conti, Alessandro; Roy, Subhankar; Lathuilière, Stéphane; Rota, Paolo; Ricci, Elisa. - (2023), pp. 10273-10283. (Paper presented at the 2023 IEEE/CVF International Conference on Computer Vision, ICCV 2023, held in Paris, France, 01-06 October 2023) [10.1109/ICCV51070.2023.00946].
The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation
Zara, Giacomo; Conti, Alessandro; Roy, Subhankar; Lathuilière, Stéphane; Rota, Paolo; Ricci, Elisa
2023-01-01
Abstract
The Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists in adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset, without accessing the actual source data. Previous approaches have attempted to address SFVUDA by leveraging self-supervision (e.g., enforcing temporal consistency) derived from the target data itself. In this work, we take an orthogonal approach by exploiting "web-supervision" from Large Language-Vision Models (LLVMs), driven by the rationale that LLVMs contain a rich world prior that is surprisingly robust to domain shift. We showcase the unreasonable effectiveness of integrating LLVMs for SFVUDA by devising an intuitive and parameter-efficient method, which we name Domain Adaptation with Large Language-Vision models (DALL-V), that distills the world prior and complementary source model information into a student network tailored for the target domain. Despite its simplicity, DALL-V achieves significant improvement over state-of-the-art SFVUDA methods.
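The abstract outlines the core mechanism: pseudo-labels from a language-vision model and a frozen source model are combined and distilled into a student network for the target domain. Below is a minimal PyTorch sketch of such an ensemble-distillation step; it is an illustration under assumptions, not DALL-V's actual recipe. The names `student`, `source_model`, `llvm_logits`, the 50/50 mixing weight, and the temperature `T` are all hypothetical choices made for the example.

```python
import torch
import torch.nn.functional as F

def distill_step(student, source_model, llvm_logits, clips, optimizer, T=2.0):
    """One ensemble-distillation step (illustrative sketch only).

    student:      trainable target network mapping clips -> logits
    source_model: frozen source-trained network mapping clips -> logits
    llvm_logits:  precomputed zero-shot logits from a language-vision model
    """
    with torch.no_grad():
        # Ensemble the two "teachers": web prior (LLVM) + source-model knowledge.
        teacher_probs = (0.5 * F.softmax(llvm_logits / T, dim=-1)
                         + 0.5 * F.softmax(source_model(clips) / T, dim=-1))
    student_log_probs = F.log_softmax(student(clips) / T, dim=-1)
    # KL distillation loss pulls the student toward the ensembled pseudo-labels.
    loss = F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * T * T
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with dummy linear models over flattened clip features.
if __name__ == "__main__":
    n_classes, feat_dim = 12, 512
    student = torch.nn.Linear(feat_dim, n_classes)
    source_model = torch.nn.Linear(feat_dim, n_classes).eval()
    clips = torch.randn(8, feat_dim)         # stand-in for video clip features
    llvm_logits = torch.randn(8, n_classes)  # stand-in for zero-shot LLVM logits
    opt = torch.optim.SGD(student.parameters(), lr=1e-3)
    print(distill_step(student, source_model, llvm_logits, clips, opt))
```

The single distillation target built from two teachers is what makes the method parameter-efficient: only the student is updated, while the LLVM and the source model are used purely as frozen sources of supervision.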
File | Description | Type | License | Size | Format | Access
---|---|---|---|---|---|---
Zara_The_Unreasonable_Effectiveness_of_Large_Language-Vision_Models_for_Source-Free_Video_ICCV_2023_paper.pdf | ICCV paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version | Refereed author's manuscript (post-print) | All rights reserved | 753.02 kB | Adobe PDF | Open access (View/Open)
The_Unreasonable_Effectiveness_of_Large_Language-Vision_Models_for_Source-free_Video_Domain_Adaptation.pdf | | Publisher's layout | All rights reserved | 1.28 MB | Adobe PDF | Repository managers only (View/Open)
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.