Prior expectations guide multisensory integration during face-to-face communication / Mazzi, Giulia; Ferrari, Ambra; Mencaroni, Maria Laura; Valzolgher, Chiara; Tommasini, Mirko; Pavani, Francesco; Benetti, Stefania. - In: PLOS COMPUTATIONAL BIOLOGY. - ISSN 1553-7358. - 21:9(2025), pp. 1-27. [10.1371/journal.pcbi.1013468]
Prior expectations guide multisensory integration during face-to-face communication
Mazzi, Giulia (co-first); Ferrari, Ambra (co-first); Mencaroni, Maria Laura (second); Valzolgher, Chiara; Tommasini, Mirko; Pavani, Francesco (penultimate); Benetti, Stefania (last)
2025-01-01
Abstract
Face-to-face communication relies on the seamless integration of multisensory signals, including voice, gaze, and head movements, to convey meaning effectively. This poses a fundamental computational challenge: optimally binding signals sharing the same communicative intention (e.g., looking at the addressee while speaking) and segregating unrelated signals (e.g., looking away while coughing), all within the rapid turn-taking dynamics of conversation. Critically, the computational mechanisms underlying this extraordinary feat remain largely unknown. Here, we cast face-to-face communication as a Bayesian Causal Inference problem to formally test whether prior expectations arbitrate between the integration and segregation of vocal and bodily signals. Specifically, we asked whether there is a stronger prior tendency to integrate audiovisual signals that convey the same communicative intention, thus establishing a crossmodal pragmatic correspondence. Additionally, we evaluated whether observers solve causal inference by adopting optimal Bayesian decision strategies or non-optimal approximate heuristics. In a spatial localization task, participants watched audiovisual clips of a speaker where the audio (voice) and the video (bodily cues) were sampled either from congruent positions or at increasing spatial disparities. Crucially, we manipulated the pragmatic correspondence of the signals: in a communicative condition, the speaker addressed the participant with their head, gaze, and speech; in a non-communicative condition, the speaker kept the head down and produced a meaningless vocalization. We measured audiovisual integration through the ventriloquist effect, which quantifies how much the perceived audio position is shifted towards the video position. Combining psychophysics with computational modelling, we show that observers solved audiovisual causal inference using non-optimal heuristics that nevertheless approximate optimal Bayesian inference with high accuracy. Remarkably, participants showed a stronger tendency to integrate vocal and bodily information when signals conveyed congruent communicative intent, suggesting that pragmatic correspondences enhance multisensory integration. Collectively, our findings provide novel and compelling evidence that face-to-face communication is shaped by deeply ingrained expectations about how multisensory signals should be structured and interpreted.
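As context for the modelling approach described in the abstract: the ventriloquist paradigm is conventionally formalized with the Bayesian Causal Inference model (Körding et al., 2007), in which a common-cause prior arbitrates between integrating and segregating the audio and video signals. The sketch below is a minimal illustration of that standard model with a model-averaging readout, not the authors' fitted implementation; all parameter values (sensory noise, spatial prior, and the per-condition common-cause priors) are hypothetical.

```python
import numpy as np

def gauss(x, mu, var):
    """Gaussian probability density with mean mu and variance var."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def bci_auditory_estimate(x_a, x_v, sigma_a, sigma_v, sigma_p, mu_p, p_common):
    """Bayesian Causal Inference estimate of the auditory source position.

    x_a, x_v: noisy internal measurements of the audio and video positions.
    sigma_a, sigma_v: auditory and visual sensory noise (std. dev.).
    mu_p, sigma_p: Gaussian spatial prior over source positions.
    p_common: prior probability that both signals share one cause --
              the quantity the study manipulates via communicative context.
    """
    va, vv, vp = sigma_a ** 2, sigma_v ** 2, sigma_p ** 2

    # Likelihood of the measurements under a common cause (C = 1),
    # marginalizing over the unknown source (closed form for Gaussians).
    denom = va * vv + va * vp + vv * vp
    like_c1 = (np.exp(-0.5 * ((x_a - x_v) ** 2 * vp
                              + (x_a - mu_p) ** 2 * vv
                              + (x_v - mu_p) ** 2 * va) / denom)
               / (2.0 * np.pi * np.sqrt(denom)))

    # Likelihood under independent causes (C = 2): each measurement is
    # explained by its own draw from the spatial prior.
    like_c2 = gauss(x_a, mu_p, va + vp) * gauss(x_v, mu_p, vv + vp)

    # Posterior probability of a common cause (Bayes' rule).
    post_c1 = (like_c1 * p_common
               / (like_c1 * p_common + like_c2 * (1.0 - p_common)))

    # Reliability-weighted fused estimate (C = 1) and audio-only estimate (C = 2).
    s_c1 = (x_a / va + x_v / vv + mu_p / vp) / (1 / va + 1 / vv + 1 / vp)
    s_c2 = (x_a / va + mu_p / vp) / (1 / va + 1 / vp)

    # Model-averaging readout: weight each estimate by its causal posterior.
    return post_c1 * s_c1 + (1.0 - post_c1) * s_c2

# Simulate the ventriloquist effect at a 10 deg audiovisual disparity.
rng = np.random.default_rng(0)
s_a, s_v = 0.0, 10.0                       # true audio / video positions (deg)
x_a = rng.normal(s_a, 4.0, size=100_000)   # noisy auditory measurements
x_v = rng.normal(s_v, 1.5, size=100_000)   # noisy visual measurements

# Illustrative p_common values only: a larger common-cause prior in the
# communicative condition yields a stronger shift of the perceived voice
# towards the video, i.e., a larger ventriloquist effect.
for label, pc in [("communicative", 0.7), ("non-communicative", 0.4)]:
    est = bci_auditory_estimate(x_a, x_v, 4.0, 1.5, 15.0, 0.0, pc)
    bias = (est.mean() - s_a) / (s_v - s_a)
    print(f"{label:17s} ventriloquist bias: {bias:.2f}")
```

In this formulation, the study's key finding corresponds to a larger common-cause prior (p_common) when the speaker's cues signal communicative intent, with the causal posterior then computed by an approximate heuristic rather than fully optimal Bayesian decision rules.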
| File | Description | Type | License | Size | Format |
|---|---|---|---|---|---|
| journal.pcbi.1013468 (1).pdf (open access) | Online first | Publisher's version (publisher's layout) | Creative Commons | 2.33 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.