
Duta, Ionut Cosmin; Ionescu, Bogdan; Aizawa, Kiyoharu; Sebe, Nicu. Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos. In: 30th IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), Honolulu, Jul 21-26, 2017, pp. 3205-3214. [10.1109/CVPR.2017.341]

Spatio-Temporal Vector of Locally Max Pooled Features for Action Recognition in Videos

Duta, Ionut Cosmin; Sebe, Nicu
2017-01-01

Abstract

We introduce the Spatio-Temporal Vector of Locally Max Pooled Features (ST-VLMPF), a super vector-based encoding method specifically designed for encoding local deep features. The proposed method addresses an important problem of video understanding: how to build a video representation that incorporates the CNN features over the entire video. Feature assignment is carried out at two levels, using similarity and spatio-temporal information. For each assignment we build a specific encoding, tailored to the nature of deep features, with the goal of capturing the highest feature responses from the strongest neuron activations of the network. Our ST-VLMPF clearly provides a more reliable video representation than some of the most widely used and powerful encoding approaches (Improved Fisher Vectors and Vector of Locally Aggregated Descriptors), while maintaining a low computational complexity. We conduct experiments on three action recognition datasets: HMDB51, UCF50 and UCF101. Our pipeline o...
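The similarity-level step the abstract describes — assigning each local deep feature to a codeword and keeping only the maximum response per dimension — can be illustrated with a minimal NumPy sketch. This is a hedged illustration, not the paper's implementation: the function name, the hard nearest-centroid assignment, and the flat concatenation are assumptions, and the paper's full method additionally performs a second, spatio-temporal assignment level.

```python
import numpy as np

def vlmpf_encode(features, codebook):
    """Illustrative similarity-level max-pooled encoding (VLMPF-style sketch).

    features: (n, d) array of local deep features extracted from a video.
    codebook: (k, d) array of cluster centers (e.g. learned with k-means).
    Returns a (k * d,) super vector: per-cluster element-wise max pooling.
    """
    # Hard-assign each feature to its nearest codeword (similarity level).
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    assign = dists.argmin(axis=1)

    k, d = codebook.shape
    encoding = np.zeros((k, d))
    for c in range(k):
        members = features[assign == c]
        if len(members):
            # Keep only the strongest response per dimension (max pooling),
            # aiming to retain the highest neuron activations.
            encoding[c] = members.max(axis=0)
    return encoding.ravel()
```

A toy call with four 2-D features and two codewords yields one max-pooled sub-vector per cluster, concatenated into a single representation; in practice the features would be high-dimensional CNN activations pooled over the whole video.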
2017
30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017)
Piscataway, NJ
IEEE
978-1-5386-0457-1
Duta, Ionut Cosmin; Ionescu, Bogdan; Aizawa, Kiyoharu; Sebe, Nicu
Files in this record:

File: 08099824.pdf
Access: archive administrators only
Type: publisher's version (publisher's layout)
License: all rights reserved
Size: 560.48 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this record: https://hdl.handle.net/11572/193400
Citations
  • Scopus: 61
  • Web of Science: 34
  • OpenAlex: 71