
Learning Priors of Human Motion With Vision Transformers / Falqueto, Placido; Sanfeliu, Alberto; Palopoli, Luigi; Fontanelli, Daniele. - (2024), pp. 382-389. (Paper presented at the COMPSAC conference held in Osaka, Japan, 02-04 July 2024) [10.1109/compsac61105.2024.00060].

Learning Priors of Human Motion With Vision Transformers

Falqueto, Placido (first author); Palopoli, Luigi (co-last author); Fontanelli, Daniele (co-last author)
2024-01-01

Abstract

A clear understanding of where humans move in a scenario, their usual paths and speeds, and where they stop is essential for many applications, such as mobility studies in urban areas or robot navigation in human-populated environments. In this article, we propose a neural architecture based on Vision Transformers (ViTs) to provide this information. This solution can arguably capture spatial correlations more effectively than Convolutional Neural Networks (CNNs). We describe the methodology and the proposed neural architecture, and we present experimental results on a standard dataset, showing that the proposed ViT architecture improves the metrics compared to a CNN-based method.

Index Terms: vision transformers, human motion prediction, semantic scene understanding, masked autoencoders, occupancy priors.
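The abstract describes the approach only at a high level. Purely as an illustration, the minimal PyTorch sketch below shows one plausible shape for a ViT that maps a semantic scene map to a per-cell occupancy prior: patch embedding, a small Transformer encoder, and a per-patch head whose outputs are folded back into a full-resolution heatmap. Every name and hyperparameter here (ViTOccupancyPrior, patch size 8, depth 4, and so on) is an assumption for illustration, not the authors' implementation; in particular, the masked-autoencoder pretraining mentioned in the index terms is omitted.

import torch
import torch.nn as nn

class ViTOccupancyPrior(nn.Module):
    """Illustrative sketch only: semantic scene map -> per-cell occupancy prior.
    Layer choices and hyperparameters are assumptions, not the paper's model."""
    def __init__(self, in_channels=3, img_size=64, patch_size=8,
                 dim=128, depth=4, heads=4):
        super().__init__()
        self.patch_size = patch_size
        n_patches = (img_size // patch_size) ** 2
        # Patch embedding: split the map into patches and project them linearly.
        self.embed = nn.Conv2d(in_channels, dim,
                               kernel_size=patch_size, stride=patch_size)
        # One learnable positional embedding per patch.
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        # Per-patch head: one occupancy value for every pixel in the patch.
        self.head = nn.Linear(dim, patch_size * patch_size)

    def forward(self, x):
        b, _, h, w = x.shape
        tokens = self.embed(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        tokens = self.encoder(tokens + self.pos)
        patches = self.head(tokens)                         # (B, N, p*p)
        # Fold per-patch predictions back into a full-resolution heatmap.
        g, p = h // self.patch_size, self.patch_size
        out = patches.view(b, g, g, p, p).permute(0, 1, 3, 2, 4)
        return torch.sigmoid(out.reshape(b, 1, h, w))       # prior in [0, 1]

model = ViTOccupancyPrior()
scene = torch.randn(1, 3, 64, 64)   # toy stand-in for a semantic scene map
prior = model(scene)                # (1, 1, 64, 64) occupancy prior
print(prior.shape)

The per-patch linear head keeps the sketch self-contained; the final sigmoid bounds each cell in [0, 1] so the output can be read as a normalized occupancy likelihood over the scene.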
Year: 2024
Published in: 2024 IEEE 48th Annual Computers, Software, and Applications Conference (COMPSAC)
Publisher: IEEE, Piscataway, New Jersey
ISBN: 979-8-3503-7696-8; 979-8-3503-7697-5
Authors: Falqueto, Placido; Sanfeliu, Alberto; Palopoli, Luigi; Fontanelli, Daniele
Files in this record:

main_1.pdf
  Access: open access
  Type: Publisher's version (publisher's layout)
  License: All rights reserved
  Size: 435.68 kB
  Format: Adobe PDF

Learning_Priors_of_Human_Motion_With_Vision_Transformers (1).pdf
  Access: archive administrators only
  Type: Publisher's version (publisher's layout)
  License: All rights reserved
  Size: 854.06 kB
  Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/434090
Citations
  • PMC: not available
  • Scopus: 0
  • Web of Science: not available
  • OpenAlex: not available