Towards End-to-End Explainable Facial Action Unit Recognition via Vision-Language Joint Learning / Ge, Xuri; Fu, Junchen; Chen, Fuhai; An, Shan; Sebe, Nicu; Jose, J. M. - (2024), pp. 8189-8198. (32nd ACM International Conference on Multimedia, MM 2024, held in 2024) [10.1145/3664647.3681443].

Towards End-to-End Explainable Facial Action Unit Recognition via Vision-Language Joint Learning

Nicu Sebe
2024-01-01

Abstract

Facial action units (AUs), as defined in the Facial Action Coding System (FACS), have received significant research interest owing to their diverse range of applications in facial state analysis. Current mainstream FAU recognition models have a notable limitation: they focus only on the accuracy of AU recognition and overlook explanations of the corresponding AU states. In this paper, we propose an end-to-end Vision-Language joint learning network for explainable FAU recognition (termed VL-FAU), which aims to reinforce AU representation capability and language interpretability through the integration of joint multimodal tasks. Specifically, VL-FAU brings together language models to generate fine-grained local muscle descriptions and a distinguishable global face description while optimising FAU recognition. Through this, the global facial representation and its local AU representations achieve higher distinguishability among different AUs and different subjects. In addition, multi-level AU representation learning is employed to improve the attention-aware representation capability of each individual AU based on multi-scale combined facial stem features. Extensive experiments on the DISFA and BP4D AU datasets show that the proposed approach achieves superior performance over state-of-the-art methods on most metrics. In addition, compared with mainstream FAU recognition methods, VL-FAU can provide local- and global-level interpretable language descriptions alongside its AU predictions.
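The abstract outlines a joint objective: multi-label AU recognition optimised together with local (per-AU muscle) and global face description generation, on top of attention-aware AU features drawn from a multi-scale facial stem. As a minimal sketch of how such a joint vision-language objective can be wired up, the PyTorch-style code below combines a multi-label AU loss with a caption-generation loss; the module layout, per-AU attention queries, GRU decoder, and loss weighting are illustrative assumptions and do not reproduce the authors' VL-FAU architecture.

```python
# Minimal sketch of a joint vision-language objective for explainable FAU
# recognition. All module choices (small CNN stem, per-AU attention queries,
# GRU caption decoder) and the loss weighting are illustrative assumptions,
# not the authors' VL-FAU implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointFAUSketch(nn.Module):
    def __init__(self, num_aus=12, feat_dim=256, vocab_size=1000):
        super().__init__()
        # Stand-in for the multi-scale facial stem.
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8),
        )
        # One learnable query per AU yields attention-aware local AU features.
        self.au_queries = nn.Parameter(torch.randn(num_aus, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.au_classifier = nn.Linear(feat_dim, 1)
        # Lightweight decoder for the global face description (per-AU local
        # descriptions could reuse the same decoder conditioned on au_feats).
        self.text_embed = nn.Embedding(vocab_size, feat_dim)
        self.decoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
        self.lm_head = nn.Linear(feat_dim, vocab_size)

    def forward(self, images, caption_in):
        b = images.size(0)
        grid = self.stem(images).flatten(2).transpose(1, 2)        # (B, 64, D)
        queries = self.au_queries.unsqueeze(0).expand(b, -1, -1)   # (B, AU, D)
        au_feats, _ = self.attn(queries, grid, grid)                # local AU features
        au_logits = self.au_classifier(au_feats).squeeze(-1)        # (B, AU)
        global_feat = grid.mean(dim=1, keepdim=True)                # (B, 1, D)
        h0 = global_feat.transpose(0, 1).contiguous()               # condition decoder
        dec_out, _ = self.decoder(self.text_embed(caption_in), h0)  # teacher forcing
        caption_logits = self.lm_head(dec_out)                      # (B, T, V)
        return au_logits, caption_logits


def joint_loss(au_logits, au_labels, caption_logits, caption_targets, alpha=1.0):
    """Multi-label AU loss plus caption loss, optimised end-to-end."""
    au_loss = F.binary_cross_entropy_with_logits(au_logits, au_labels.float())
    cap_loss = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1),
    )
    return au_loss + alpha * cap_loss
```

In a setup like this, the captioning branch acts as auxiliary supervision, so the shared AU features are pushed to be both discriminative for recognition and informative enough to support the textual explanations.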
Year: 2024
Published in: MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
Place: New York
Publisher: Association for Computing Machinery, Inc.
ISBN: 9798400706868
Authors: Ge, Xuri; Fu, Junchen; Chen, Fuhai; An, Shan; Sebe, Nicu; Jose, J. M.
Files in this record:

3664647.3681443.pdf
Access: open access
Type: Publisher's version (publisher's layout)
Licence: Creative Commons
Size: 2.24 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11572/439451
Citations
  • PMC: not available
  • Scopus: 2
  • Web of Science (ISI): 0
  • OpenAlex: not available