Response Generation in Longitudinal Dialogues / Mousavi, Seyed Mahed. - (2023 May 08), pp. 1-73. [10.15168/11572_376853]
Response Generation in Longitudinal Dialogues
Mousavi, Seyed Mahed
2023-05-08
Abstract
Longitudinal Dialogues (LDs) are the most challenging type of conversation for human-machine dialogue systems. LDs include recollections of events, personal thoughts, and emotions specific to each individual, shared over a sparse sequence of dialogue sessions. Dialogue systems designed for LDs should interact uniquely with each user over multiple sessions and long periods of time (e.g., weeks). Over such an extended period, the machine should learn about the user's life events and their participants from the responses shared during each dialogue session, and build a personal user model. The acquired user model must account for individuals' states, profiles, and experiences, which vary among users and across dialogue sessions.

The acquisition of a dialogue corpus is the first key step in training a dialogue model. There has been limited research on collecting personal conversations from users over a long period of time. Corpus acquisition efforts have instead been designed either for open-domain information retrieval or for slot-filling tasks, with stereotypical user models "averaged" across users. In contrast, the level of personalization in LDs goes beyond a set of personal preferences and cannot be learned from a limited set of persona statements. Advancement in human evaluation is another step required to make progress in dialogue system research. Current automatic evaluation measures are, at best, poor surrogates. There are no agreed-upon human evaluation protocols, and developing them is difficult. As a result, researchers either follow non-replicable, non-transparent, and inconsistent procedures or, worse, limit themselves to automated metrics.

In this thesis, we study the design and training of dialogue models for LDs. Our first contribution is a methodology for the collection and elicitation of multi-session personal dialogues. Using the proposed methodology, we collect a corpus of human-machine LDs, followed by a case study in the mental health domain. In our second contribution, we propose an unsupervised approach that automatically parses the user's responses at each interaction and constructs a graph of the user's personal space of events and participants. We extend this contribution by studying the Information Status of the events in a personal narrative and introducing a novel, challenging task of identifying new events. In our third contribution, we address the non-comparability and inconsistency of human evaluation tasks in the literature and propose to standardize the human evaluation of response generation models; we then present a detailed protocol for the human evaluation of generated responses. Last but not least, we investigate whether general-purpose Pre-trained Language Models (PLMs) are appropriate for grounded response generation in LDs. We experiment with different representations of the personal knowledge extracted from the user's previous dialogue sessions, including a novel graph representation of the mentioned events and participants. We present automatic and human evaluations of the models, the contribution of the knowledge to response generation, and the natural language generation errors made by each model.
File | Access | Type | License | Size | Format
---|---|---|---|---|---
Mahed_PhD_Thesis_Final.pdf | Open access | Doctoral Thesis | Creative Commons | 4.98 MB | Adobe PDF