Deep Neural Architectures for Video Representation Learning / Sudhakaran, Swathikiran. - (2019), pp. 1-121.

Deep Neural Architectures for Video Representation Learning

Sudhakaran, Swathikiran
2019-01-01

Abstract

Automated analysis of videos for content understanding is one of the most challenging and well-researched areas in computer vision and multimedia. This thesis addresses the problem of video content understanding in the context of action recognition. The major challenge in this problem is the variation of the spatio-temporal patterns that constitute each action category, and the difficulty of generating a succinct representation that encapsulates these patterns. The thesis builds on two important aspects of videos: (1) a video is a sequence of images with an inherent temporal dependency that defines the actual pattern to be recognized; (2) not all spatial regions of a video frame are equally important for discriminating one action category from another. The first aspect shows the importance of aggregating frame-level features sequentially, while the second signifies the importance of encoding frame-level features selectively. The first problem is addressed by analyzing popular Convolutional Neural Network (CNN)-Recurrent Neural Network (RNN) architectures for video representation generation; the analysis concludes that Convolutional Long Short-Term Memory (ConvLSTM), a variant of the popular Long Short-Term Memory (LSTM) RNN unit, is well suited to encoding the spatio-temporal patterns occurring in a video sequence. The second problem is tackled by developing a spatial attention mechanism that encodes spatial features selectively, weighting the regions of the feature tensor that are relevant for identifying the action category. Detailed experimental analysis on two video recognition tasks shows that spatially selective encoding is indeed beneficial. Building on these two findings, a new recurrent neural unit, called Long Short-Term Attention (LSTA), is developed by augmenting LSTM with built-in spatial attention and a revised output gating. The former enables LSTA to attend to the relevant spatial regions while smoothly tracking them over time; the latter allows the network to propagate a filtered version of the memory, localized on the most discriminative components of the video. LSTA surpasses the recognition accuracy of existing state-of-the-art techniques on popular egocentric activity recognition benchmarks, demonstrating its effectiveness for video representation generation.
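The two ingredients the abstract combines, convolutional gating and built-in spatial attention, lend themselves to a short illustration. The following PyTorch sketch shows a ConvLSTM-style cell whose input is re-weighted by a softmax spatial attention map before gating. It is a minimal sketch under assumed names and shapes, not the thesis implementation: the class AttentiveConvLSTMCell, the 1x1-convolution scoring, and the feature dimensions are illustrative choices, and LSTA's revised output gating and smooth attention tracking are not reproduced here.

# Minimal sketch (PyTorch), not the thesis code: a ConvLSTM-style cell
# with a built-in spatial attention map. All names and shapes are
# illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveConvLSTMCell(nn.Module):  # hypothetical name
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # Convolutional gates (i, f, o, g) keep the spatial layout of the
        # memory, unlike the dense products of a standard LSTM.
        self.gates = nn.Conv2d(in_channels + hidden_channels,
                               4 * hidden_channels, kernel_size,
                               padding=padding)
        # 1x1 convolution scoring each spatial location for attention.
        self.score = nn.Conv2d(in_channels + hidden_channels, 1, 1)

    def forward(self, x, state):
        h, c = state                        # each (B, hidden, H, W)
        # Softmax over all H*W locations yields the attention map.
        s = self.score(torch.cat([x, h], dim=1))           # (B, 1, H, W)
        b, _, hh, ww = s.shape
        a = F.softmax(s.flatten(2), dim=-1).view(b, 1, hh, ww)
        x = x * a * (hh * ww)               # re-weight input, keep scale

        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)),
                                 4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

# Frame-level CNN features are aggregated by unrolling the cell over time.
cell = AttentiveConvLSTMCell(512, 512)
h = c = torch.zeros(2, 512, 7, 7)           # batch of 2, 7x7 feature maps
for x_t in torch.randn(16, 2, 512, 7, 7):   # 16 frames of CNN features
    h, c = cell(x_t, (h, c))
video_feat = h.mean(dim=(2, 3))             # (2, 512) clip representation

Computing the gates with convolutions rather than dense products is what lets the memory retain its spatial layout, so the attention map has concrete locations to weight; this is the property the thesis exploits when adding selective encoding to the recurrent unit.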
Year: 2019
Cycle: XXXI
Academic year: 2019-2020
Department: Ingegneria e scienza dell'Informaz (29/10/12-)
Doctoral programme: Information and Communication Technology
Supervisor: Lanz, Oswald
no
Language: English
Scientific-disciplinary sector: Settore INF/01 - Informatica
Files in this product:

swathi_thesis_rev1.pdf
  Access: repository managers only
  Type: Tesi di dottorato (Doctoral Thesis)
  License: Tutti i diritti riservati (All rights reserved)
  Size: 6.14 MB
  Format: Adobe PDF

disclaimer-13062019171335.pdf
  Access: repository managers only
  Type: Tesi di dottorato (Doctoral Thesis)
  License: Tutti i diritti riservati (All rights reserved)
  Size: 922.07 kB
  Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/369191