Joint attributes and event analysis for multimedia event detection

Ma, Zhigang; Chang, Xiaojun; Zhongwen, Xu; Sebe, Nicu; Hauptmann, Alexander G.

doi:10.1109/TNNLS.2017.2709308

Semantic attributes have been increasingly used the past few years for multimedia event detection (MED) with promising results. The motivation is that multimedia events generally consist of lower level components such as objects, scenes, and actions. By characterizing multimedia event videos with semantic attributes, one could exploit more informative cues for improved detection results. Much existing work obtains semantic attributes from images, which may be suboptimal for video analysis since these image-inferred attributes do not carry dynamic information that is essential for videos. To address this issue, we propose to learn semantic attributes from external videos using their semantic labels. We name them video attributes in this paper. In contrast with multimedia event videos, these external videos depict lower level contents such as objects, scenes, and actions. To harness video attributes, we propose an algorithm established on a correlation vector that correlates them to a target event. Consequently, we could incorporate video attributes latently as extra information into the event detector learnt from multimedia event videos in a joint framework. To validate our method, we perform experiments on the real-world large-scale TRECVID MED 2013 and 2014 data sets and compare our method with several state-of-the-art algorithms. The experiments show that our method is advantageous for MED.

Joint attributes and event analysis for multimedia event detection / Ma, Zhigang; Chang, Xiaojun; Xu, Zhongwen; Sebe, Nicu; Hauptmann, Alexander G.. - In: IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS. - ISSN 2162-237X. - 29:7(2018), pp. 2921-2930. [10.1109/TNNLS.2017.2709308]