Spatial entropy as an inductive bias for vision transformers

Peruzzo, Elia; Sangineto, Enver; Liu, Yahui; De Nadai, Marco; Bi, Wei; Lepri, Bruno; Sebe, Nicu

doi:10.1007/s10994-024-06570-7

Recent work on Vision Transformers (VTs) showed that introducing a local inductive bias in the VT architecture helps reduce the number of samples necessary for training. However, the architecture modifications lead to a loss of generality of the Transformer backbone, partially contradicting the push towards the development of uniform architectures, shared, e.g., by both the Computer Vision and the Natural Language Processing areas. In this work, we propose a different and complementary direction, in which a local bias is introduced using an auxiliary self-supervised task, performed jointly with standard supervised training. Specifically, we exploit the observation that the attention maps of VTs, when trained with self-supervision, can contain a semantic segmentation structure which does not spontaneously emerge when training is supervised. Thus, we explicitly encourage the emergence of this spatial clustering as a form of training regularization. In more detail, we exploit the assumption that, in a given image, objects usually correspond to a few connected regions, and we propose a spatial formulation of the information entropy to quantify this object-based inductive bias. By minimizing the proposed spatial entropy, we include an additional self-supervised signal during training. Using extensive experiments, we show that the proposed regularization leads to equivalent or better results than other VT proposals which include a local bias by changing the basic Transformer architecture, and it can drastically boost the VT final accuracy when using small to medium-sized training sets. The code is available at https://github.com/helia95/SAR.
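
The abstract does not state the formula for the spatial entropy, so the following is only an illustrative sketch of the general idea, not the authors' implementation (which is in the repository linked above). It assumes a non-negative 2D attention map, thresholds it at its mean, groups the surviving cells into 4-connected regions, and computes the Shannon entropy of the attention mass per region. The threshold, the connectivity, the normalization, and the names `connected_components` and `spatial_entropy` are all assumptions made for illustration. The sketch only evaluates the quantity; it does not show how to make it differentiable for use as a training loss.

```python
import math
from collections import deque


def connected_components(mask):
    """Return the 4-connected regions of True cells as lists of (row, col)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first flood fill starting from an unvisited True cell.
            queue = deque([(r, c)])
            seen[r][c] = True
            region = []
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            regions.append(region)
    return regions


def spatial_entropy(attn):
    """Entropy of attention mass over connected above-mean regions.

    Assumed, illustrative definition: fewer and more dominant regions
    give a lower value.
    """
    flat = [v for row in attn for v in row]
    mean = sum(flat) / len(flat)
    mask = [[v > mean for v in row] for row in attn]  # assumed threshold
    masses = [sum(attn[y][x] for y, x in region)
              for region in connected_components(mask)]
    total = sum(masses)
    if total == 0:
        return 0.0
    entropy = 0.0
    for m in masses:
        p = m / total
        if p > 0:
            entropy -= p * math.log(p)
    return entropy


# One compact 2x2 blob: a single region.
compact = [[0, 0, 0, 0],
           [0, 1, 1, 0],
           [0, 1, 1, 0],
           [0, 0, 0, 0]]
# Four isolated corner cells: four regions of equal mass.
scattered = [[1, 0, 0, 1],
             [0, 0, 0, 0],
             [0, 0, 0, 0],
             [1, 0, 0, 1]]

compact_entropy = spatial_entropy(compact)
scattered_entropy = spatial_entropy(scattered)
```

Under these assumptions the compact map scores lower than the scattered one, which matches the abstract's premise that objects usually occupy a few connected regions and that minimizing the entropy encourages such spatial clustering.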

Spatial entropy as an inductive bias for vision transformers / Peruzzo, Elia; Sangineto, Enver; Liu, Yahui; De Nadai, Marco; Bi, Wei; Lepri, Bruno; Sebe, Nicu. - In: MACHINE LEARNING. - ISSN 0885-6125. - 113:9(2024), pp. 6945-6975. [10.1007/s10994-024-06570-7]

2024-01-01

Date of publication: 2024
Journal title: MACHINE LEARNING
Issue number and part: 9
DOI: https://dx.doi.org/10.1007/s10994-024-06570-7
Scopus identifier: 2-s2.0-85198854483
WOS identifier: WOS:001270434200001
All authors: Peruzzo, Elia; Sangineto, Enver; Liu, Yahui; De Nadai, Marco; Bi, Wei; Lepri, Bruno; Sebe, Nicu
Citation: Spatial entropy as an inductive bias for vision transformers / Peruzzo, Elia; Sangineto, Enver; Liu, Yahui; De Nadai, Marco; Bi, Wei; Lepri, Bruno; Sebe, Nicu. - In: MACHINE LEARNING. - ISSN 0885-6125. - 113:9(2024), pp. 6945-6975. [10.1007/s10994-024-06570-7]
Appears in types: 03.1 Journal article

Files in this item:

File	Size	Format
s10994-024-06570-7 (1).pdf (open access; type: publisher's version, publisher's layout; license: Creative Commons)	2.16 MB	Adobe PDF