Proceedings of the 3rd Workshop on Natural Language Processing for Conversational AI

Alexandros Papangelis, Paweł Budzianowski, Bing Liu, Elnaz Nouri, Abhinav Rastogi, Yun-Nung Chen (Editors)

Association for Computational Linguistics

Not So Fast, Classifier Accuracy and Entropy Reduction in Incremental Intent Classification
Lianna Hrycyk | Alessandra Zarcone | Luzian Hahn

Incremental intent classification requires the assignment of intent labels to partial utterances. However, partial utterances do not necessarily contain enough information to be mapped to the intent class of their complete utterance (correctly and with a certain degree of confidence). Using the final interpretation as the ground truth to measure a classifier’s accuracy during intent classification of partial utterances is thus problematic. We release inCLINC, a dataset of partial and full utterances with human annotations of plausible intent labels for different portions of each utterance, as an upper (human) baseline for incremental intent classification. We analyse the incremental annotations and propose entropy reduction as a measure of human annotators’ convergence on an interpretation (i.e. intent label). We argue that, when the annotators do not converge to one or a few possible interpretations and yet the classifier already identifies the final intent class early on, it is a sign of overfitting that can be ascribed to artefacts in the dataset.
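As a rough illustration of the entropy-reduction idea, the sketch below computes the Shannon entropy of annotators' intent labels at each utterance prefix and the drop in entropy from one prefix to the next. The function names and the example labels are hypothetical, and the paper's exact formulation may differ.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the empirical distribution over intent labels."""
    counts = Counter(labels)
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def entropy_reductions(annotations_per_prefix):
    """Entropy drop between consecutive prefixes of the same utterance.

    annotations_per_prefix: list of label lists, one list per prefix,
    ordered from the shortest prefix to the full utterance.
    """
    entropies = [label_entropy(labels) for labels in annotations_per_prefix]
    return [prev - cur for prev, cur in zip(entropies, entropies[1:])]


# Hypothetical annotations from four annotators for three growing prefixes.
prefixes = [
    ["book_flight", "book_hotel", "weather", "transfer"],         # no agreement
    ["book_flight", "book_flight", "book_hotel", "book_hotel"],   # two candidates
    ["book_flight", "book_flight", "book_flight", "book_flight"], # full agreement
]

reductions = entropy_reductions(prefixes)
```

A large reduction at a given prefix suggests that the newly added words are what let annotators converge; a classifier that is already confident before that point may be relying on dataset artefacts.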

Amendable Generation for Dialogue State Tracking
Xin Tian | Liankai Huang | Yingzhan Lin | Siqi Bao | Huang He | Yunyi Yang | Hua Wu | Fan Wang | Shuqi Sun

In task-oriented dialogue systems, recent dialogue state tracking methods tend to perform one-pass generation of the dialogue state based on the previous dialogue state. The mistakes these models make at the current turn are prone to be carried over to the next turn, causing error propagation. In this paper, we propose a novel Amendable Generation for Dialogue State Tracking (AG-DST), which contains a two-pass generation process: (1) generating a primitive dialogue state based on the dialogue of the current turn and the previous dialogue state, and (2) amending the primitive dialogue state from the first pass. With the additional amending generation pass, our model is tasked to learn more robust dialogue state tracking by amending the errors that still exist in the primitive dialogue state, which plays the role of reviser in the double-checking process and alleviates unnecessary error propagation. Experimental results show that AG-DST significantly outperforms previous works on two active DST datasets (MultiWOZ 2.2 and WOZ 2.0), achieving new state-of-the-art performance.

What Went Wrong? Explaining Overall Dialogue Quality through Utterance-Level Impacts
James D. Finch | Sarah E. Finch | Jinho D. Choi

Improving user experience of a dialogue system often requires intensive developer effort to read conversation logs, run statistical analyses, and intuit the relative importance of system shortcomings. This paper presents a novel approach to automated analysis of conversation logs that learns the relationship between user-system interactions and overall dialogue quality. Unlike prior work on utterance-level quality prediction, our approach learns the impact of each interaction from the overall user rating without utterance-level annotation, allowing resultant model conclusions to be derived on the basis of empirical evidence and at low cost. Our model identifies interactions that have a strong correlation with the overall dialogue quality in a chatbot setting. Experiments show that the automated analysis from our model agrees with expert judgments, making this work the first to show that such weakly-supervised learning of utterance-level quality prediction is highly achievable.

Semi-supervised Intent Discovery with Contrastive Learning
Xiang Shen | Yinge Sun | Yao Zhang | Mani Najmabadi

User intent discovery is a key step in developing a Natural Language Understanding (NLU) module at the core of any modern Conversational AI system. Typically, human experts review a representative sample of user input data to discover new intents, which is subjective, costly, and error-prone. In this work, we aim to assist NLU developers by presenting a novel method for discovering new intents at scale given a corpus of utterances. Our method utilizes supervised contrastive learning to leverage information from a domain-relevant, already labeled dataset and identifies new intents in the corpus at hand using unsupervised K-means clustering. Our method outperforms the state of the art by a large margin, up to 2% and 13% on two benchmark datasets, measured by clustering accuracy. Furthermore, we apply our method to a large dataset from the travel domain to demonstrate its effectiveness in a real-world use case.
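The abstract does not define clustering accuracy. A common definition, assumed here, scores a clustering by the best one-to-one mapping from cluster IDs to gold intent labels. The sketch below brute-forces that mapping, which is only practical for a handful of clusters (larger problems typically use the Hungarian algorithm instead). The function name and example data are hypothetical.

```python
from itertools import permutations


def clustering_accuracy(true_labels, cluster_ids):
    """Accuracy under the best one-to-one mapping of clusters to gold labels."""
    if len(true_labels) != len(cluster_ids):
        raise ValueError("true_labels and cluster_ids must have equal length")
    if not true_labels:
        raise ValueError("inputs must be non-empty")

    clusters = sorted(set(cluster_ids))
    labels = sorted(set(true_labels))
    # If there are more clusters than labels, pad with placeholders so that
    # surplus clusters map to "no label" and never count as correct.
    labels += [None] * max(0, len(clusters) - len(labels))

    best_correct = 0
    for assignment in permutations(labels, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        correct = sum(mapping[c] == t for c, t in zip(cluster_ids, true_labels))
        best_correct = max(best_correct, correct)
    return best_correct / len(true_labels)


# Hypothetical example: five utterances, three gold intents, two predicted clusters.
gold = ["refund", "refund", "upgrade", "upgrade", "cancel"]
pred = [1, 1, 0, 0, 0]
accuracy = clustering_accuracy(gold, pred)
```

Here the best mapping assigns cluster 1 to "refund" and cluster 0 to "upgrade", so four of the five utterances count as correct.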

CS-BERT: a pretrained model for customer service dialogues
Peiyao Wang | Joyce Fang | Julia Reinspach

Large-scale pretrained transformer models have demonstrated state-of-the-art (SOTA) performance in a variety of NLP tasks. Nowadays, numerous pretrained models are available in different model flavors and different languages, and can be easily adapted to one’s downstream task. However, only a limited number of models are available for dialogue tasks, and in particular, goal-oriented dialogue tasks. In addition, the available pretrained models are trained on general domain language, creating a mismatch between the pretraining language and the downstream domain language. In this contribution, we present CS-BERT, a BERT model pretrained on millions of dialogues in the customer service domain. We evaluate CS-BERT on several downstream customer service dialogue tasks, and demonstrate that our in-domain pretraining is advantageous compared to other pretrained models in both zero-shot and finetuning experiments, especially in a low-resource data setting.

PLATO-KAG: Unsupervised Knowledge-Grounded Conversation via Joint Modeling
Xinxian Huang | Huang He | Siqi Bao | Fan Wang | Hua Wu | Haifeng Wang

Large-scale conversation models are turning to external knowledge to improve the factual accuracy of response generation. Given the infeasibility of annotating external knowledge for large-scale dialogue corpora, it is desirable to learn knowledge selection and response generation in an unsupervised manner. In this paper, we propose PLATO-KAG (Knowledge-Augmented Generation), an unsupervised learning approach for end-to-end knowledge-grounded conversation modeling. For each dialogue context, the top-k relevant knowledge elements are selected and then employed in knowledge-grounded response generation. The two components of knowledge selection and response generation are optimized jointly and effectively under a balanced objective. Experimental results on two publicly available datasets validate the superiority of PLATO-KAG.

Personalized Search-based Query Rewrite System for Conversational AI
Eunah Cho | Ziyan Jiang | Jie Hao | Zheng Chen | Saurabh Gupta | Xing Fan | Chenlei Guo

Query rewrite (QR) is an emerging component in conversational AI systems, reducing user defect. User defect is caused by various reasons, such as errors in the spoken dialogue system, users’ slips of the tongue or their abridged language. Many user defects stem from personalized factors, such as a user’s speech pattern, dialect, or preferences. In this work, we propose a personalized search-based QR framework, which focuses on automatic reduction of user defect. We build a personalized index for each user, which encompasses diverse affinity layers to reflect personal preferences for each user in the conversational AI. Our personalized QR system contains retrieval and ranking layers. Supported by user-feedback-based learning, training our models does not require hand-annotated data. Experiments on a personalized test set showed that our personalized QR system is able to correct systematic and user errors by utilizing phonetic and semantic inputs.

AuGPT: Auxiliary Tasks and Data Augmentation for End-To-End Dialogue with Pre-Trained Language Models
Jonáš Kulhánek | Vojtěch Hudeček | Tomáš Nekvinda | Ondřej Dušek

Attention-based pre-trained language models such as GPT-2 brought considerable progress to end-to-end dialogue modelling. However, they also present considerable risks for task-oriented dialogue, such as lack of knowledge grounding or diversity. To address these issues, we introduce modified training objectives for language model finetuning, and we employ massive data augmentation via back-translation to increase the diversity of the training data. We further examine the possibilities of combining data from multiple sources to improve performance on the target dataset. We carefully evaluate our contributions with both human and automatic methods. Our model substantially outperforms the baseline on the MultiWOZ data and shows competitive performance with the state of the art in both automatic and human evaluation.

Using Pause Information for More Accurate Entity Recognition
Sahas Dendukuri | Pooja Chitkara | Joel Ruben Antony Moniz | Xiao Yang | Manos Tsagkias | Stephen Pulman

Entity tags in human-machine dialog are integral to natural language understanding (NLU) tasks in conversational assistants. However, current systems struggle to accurately parse spoken queries with the typical use of text input alone, and often fail to understand the user intent. Previous work in linguistics has identified a cross-language tendency for longer speech pauses surrounding nouns as compared to verbs. We demonstrate that the linguistic observation on pauses can be used to improve accuracy in machine-learnt language understanding tasks. Analysis of pauses in French and English utterances from a commercial voice assistant shows a statistically significant difference in pause duration around multi-token entity span boundaries compared to within entity spans. Additionally, in contrast to text-based NLU, we apply pause duration to enrich contextual embeddings to improve shallow parsing of entities. Results show that our proposed novel embeddings improve the relative error rate by up to 8% consistently across three domains for French, without any added annotation or alignment costs to the parser.

Teach Me What to Say and I Will Learn What to Pick: Unsupervised Knowledge Selection Through Response Generation with Pretrained Generative Models
Ehsan Lotfi | Maxime De Bruyn | Jeska Buhmann | Walter Daelemans

Knowledge Grounded Conversation Models are usually based on a selection/retrieval module and a generation module, trained separately or simultaneously, with or without having access to a ‘gold’ knowledge option. With the introduction of large pre-trained generative models, the selection and generation parts have become more and more entangled, shifting the focus towards enhancing knowledge incorporation (from multiple sources) instead of trying to pick the best knowledge option. These approaches, however, depend on knowledge labels and/or a separate dense retriever for their best performance. In this work we study the unsupervised selection abilities of pre-trained generative models (e.g. BART) and show that by adding a score-and-aggregate module between encoder and decoder, they are capable of learning to pick the proper knowledge through minimising the language modelling loss (i.e. without having access to knowledge labels). Trained as such, our model, K-Mine, shows competitive selection and generation performance against models that benefit from knowledge labels and/or a separate dense retriever.

Influence of user personality on dialogue task performance: A case study using a rule-based dialogue system
Ao Guo | Atsumoto Ohashi | Ryu Hirai | Yuya Chiba | Yuiko Tsunomori | Ryuichiro Higashinaka

Endowing a task-oriented dialogue system with adaptiveness to user personality can greatly help improve the performance of a dialogue task. However, such a dialogue system can be practically challenging to implement, because it is unclear how user personality influences dialogue task performance. To explore the relationship between user personality and dialogue task performance, we enrolled participants via crowdsourcing to first answer specified personality questionnaires and then chat with a dialogue system to accomplish assigned tasks. A rule-based dialogue system on the prevalent Multi-Domain Wizard-of-Oz (MultiWOZ) task was used. A total of 211 participants’ personalities and their 633 dialogues were collected and analyzed. The results revealed that sociable and extroverted people tended to fail the task, whereas neurotic people were more likely to succeed. We extracted features related to user dialogue behaviors and performed further analysis to determine which kind of behavior influences task performance. As a result, we identified that average utterance length and slots per utterance are the key features of dialogue behavior that are highly correlated with both task performance and user personality.

Towards Zero and Few-shot Knowledge-seeking Turn Detection in Task-orientated Dialogue Systems
Di Jin | Shuyang Gao | Seokhwan Kim | Yang Liu | Dilek Hakkani-Tur

Most prior work on task-oriented dialogue systems is restricted to supporting domain APIs. However, users may have requests that are out of the scope of these APIs. This work focuses on identifying such user requests. Existing methods for this task mainly rely on fine-tuning pre-trained models on large annotated data. We propose a novel method, REDE, based on adaptive representation learning and density estimation. REDE can be applied to zero-shot cases, and quickly learns a high-performing detector with only a few shots by updating less than 3K parameters. We demonstrate REDE’s competitive performance on DSTC9 data and our newly collected test set.