Recent advancements in technology have led to the easy acquisition and spread of multi-dimensional multimedia content, e.g. videos, which in many cases depict human faces. From such videos, valuable information describing the intrinsic characteristics of the recorded user can be retrieved: the features extracted from the facial patch are relevant descriptors that allow for measuring the subject's emotional status or identifying synthetic characters. One of the emerging challenges is the development of contactless approaches based on face analysis, aiming at measuring the emotional status of the subject without placing sensors that limit or bias their experience. This is of particular interest in the context of Quality of Experience (QoE) measurement, i.e. the measurement of the user's emotional status when exposed to multimedia content, since it allows for retrieving the overall acceptability of the content as perceived by the end user. Measuring the impact of a given content on the user has many implications from both the content-producer and end-user perspectives. For this reason, we pursue the QoE assessment of a user watching multimedia stimuli, i.e. 3D movies, through the analysis of their facial features acquired by contactless approaches. More specifically, the user's Heart Rate (HR) is retrieved by applying computer vision techniques to a facial recording of the subject and then analysed to compute the level of engagement. We show that the proposed framework is effective for long video sequences and robust to facial movements and illumination changes. We validate it on a dataset of 64 sequences in which users observe 3D movies selected to induce variations in their emotional status.
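As a rough illustration of the contactless HR retrieval step described above, a minimal frequency-domain estimate from a mean green-channel facial trace might look like the following sketch. This is not the thesis's actual pipeline, which also involves face tracking, region-of-interest selection, and temporal filtering; the function name `estimate_hr` and all parameter values are hypothetical:

```python
import numpy as np

def estimate_hr(green_trace, fps, lo=0.7, hi=4.0):
    """Estimate heart rate (BPM) from a mean green-channel facial trace.

    A minimal rPPG-style sketch: remove the DC component, take the FFT,
    and pick the dominant frequency within a plausible heart-rate band
    (lo..hi Hz, i.e. 42..240 BPM).
    """
    x = np.asarray(green_trace, dtype=float)
    x = x - x.mean()                      # remove DC component
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    band = (freqs >= lo) & (freqs <= hi)  # restrict to heart-rate band
    peak = freqs[band][np.argmax(spectrum[band])]
    return peak * 60.0                    # Hz -> beats per minute

# Synthetic check: a 72 BPM (1.2 Hz) pulse sampled at 30 fps for 10 s
fps = 30
t = np.arange(0, 10, 1.0 / fps)
trace = 0.5 * np.sin(2 * np.pi * 1.2 * t)
print(estimate_hr(trace, fps))  # close to 72.0
```

A real trace would of course be noisier than this synthetic signal, which is why robustness to facial movements and illumination changes matters in the framework above.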
On the one hand, understanding the interaction between the user's perception of the content and their cognitive-emotional state offers many opportunities to content producers, who may influence people's emotional status according to needs driven by political, social, or business interests. On the other hand, the end user must be aware of the authenticity of the content being watched: advancements in computer rendering have enabled the spread of fake subjects in videos. For this reason, as a second challenge we target the identification of computer-generated (CG) characters in videos by applying two different approaches. First, we exploit the observation that fake characters do not present any pulse-rate signal, whereas a human pulse rate is expressed by a sinusoidal signal. The application of computer vision techniques to a facial video allows for the contactless estimation of the subject's HR; signals that lack strong sinusoidality thus identify virtual humans. The proposed pipeline allows for a fully automated discrimination, validated on a dataset of 104 videos. Second, we make use of facial spatio-temporal texture dynamics that reveal the artefacts introduced by computer rendering techniques when creating a manipulation, e.g. face swapping, on videos depicting human faces. To do so, we consider multiple temporal video segments on which we estimate multi-dimensional (spatial and temporal) texture features, obtained through Local Derivative Patterns on Three Orthogonal Planes (LDP-TOP). A binary decision based on the joint analysis of such features is applied to strengthen the classification accuracy. Experimental analyses on state-of-the-art datasets of manipulated videos show the discriminative power of these descriptors in separating real from manipulated sequences and in identifying the creation method used.
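The sinusoidality idea above can be illustrated with a simple spectral-concentration score: a strongly periodic (pulse-like) trace puts most of its in-band energy into one frequency bin, while a noise-like trace, as expected from a rendered face, spreads it out. Both this metric and the name `sinusoidality_score` are assumptions for illustration, not the measure actually used in the thesis:

```python
import numpy as np

def sinusoidality_score(trace, fps, lo=0.7, hi=4.0):
    """Fraction of in-band spectral energy held by the dominant peak.

    Scores near 1 indicate a strongly sinusoidal (human-like) pulse
    trace; low scores indicate an aperiodic, CG-like trace.
    """
    x = np.asarray(trace, dtype=float)
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fps)
    band = power[(freqs >= lo) & (freqs <= hi)]
    return band.max() / band.sum()

np.random.seed(0)
fps = 30
t = np.arange(0, 10, 1.0 / fps)
pulse = np.sin(2 * np.pi * 1.2 * t)   # human-like periodic trace
noise = np.random.randn(len(t))       # CG-like aperiodic trace
print(sinusoidality_score(pulse, fps), sinusoidality_score(noise, fps))
```

Thresholding such a score would give the binary real/virtual decision; the thesis's pipeline additionally automates face detection and trace extraction before this stage.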
The main finding of this thesis is the relevance of facial features in describing intrinsic characteristics of humans. These can be used to retrieve significant information such as the physiological response to multimedia stimuli or the authenticity of the human being itself. The application of the proposed approaches to benchmark datasets also returned good results, demonstrating real advancements in this research field. In addition, these methods can be extended to different practical applications, from autonomous-driving safety checks to the identification of spoofing attacks, and from medical check-ups during sports to the measurement of users' engagement when watching advertising. We therefore encourage further investigations in this direction, in order to improve the robustness of the methods and allow their application to increasingly challenging scenarios.

Facial-based Analysis Tools: Engagement Measurements and Forensics Applications / Bonomi, Mattia. - (2020 Jul 27), pp. 1-90. [10.15168/11572_271342]

Facial-based Analysis Tools: Engagement Measurements and Forensics Applications

Bonomi, Mattia
2020-07-27

27-Jul-2020
XXXII
2018-2019
Ingegneria e scienza dell'Informaz (29/10/12-)
Information and Communication Technology
Boato, Giulia
no
English
Files in this item:

PhDThesisMattiaBonomi.pdf

Open Access from 28/07/2021

Description: PhD_Thesis_Mattia_Bonomi
Type: Doctoral thesis (Tesi di dottorato)
License: All rights reserved (Tutti i diritti riservati)
Size: 11.12 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this item: https://hdl.handle.net/11572/271342