Much research in Interactive Question Answering (IQA) has centered on artificially collected series of context questions. In contrast, the goal of this paper is to emphasize the importance of evaluating IQA systems against realistic user questions. We do this by comparing the widely used TREC QA context task data against two more realistic data sets: first, a corpus of real user interaction logs that we collected through a publicly accessible chatbot, and second, a corpus of QA dialogues collected in a Wizard-of-Oz study. We compare these data sets using basic quantitative measures as well as several measures of inter-utterance coherence. We conclude with proposals for choosing test data for a new evaluation campaign that is centered on realistic user-system interactions and that is well suited to empirical and Machine Learning approaches.