Annual Meeting of the Association for Computational Linguistics (2021)



pdf (full)
bib (full)
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

pdf bib
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Chengqing Zong | Fei Xia | Wenjie Li | Roberto Navigli

pdf bib
How Did This Get Funded? ! Automatically Identifying Quirky Scientific AchievementsHow Did This Get Funded?! Automatically Identifying Quirky Scientific Achievements
Chen Shani | Nadav Borenstein | Dafna Shahaf

Humor is an important social phenomenon, serving complex social and psychological functions. However, despite being studied for millennia humor is computationally not well understood, often considered an AI-complete problem. In this work, we introduce a novel setting in humor mining : automatically detecting funny and unusual scientific papers. We are inspired by the Ig Nobel prize, a satirical prize awarded annually to celebrate funny scientific achievements (example past winner : Are cows more likely to lie down the longer they stand?). This challenging task has unique characteristics that make it particularly suitable for automatic learning. We construct a dataset containing thousands of funny papers and use it to learn classifiers, combining findings from psychology and linguistics with recent advances in NLP. We use our models to identify potentially funny papers in a large dataset of over 630,000 articles. The results demonstrate the potential of our methods, and more broadly the utility of integrating state-of-the-art NLP methods with insights from more traditional disciplines

pdf bib
Engage the Public : Poll Question Generation for Social Media Posts
Zexin Lu | Keyang Ding | Yuji Zhang | Jing Li | Baolin Peng | Lemao Liu

This paper presents a novel task to generate poll questions for social media posts. It offers an easy way to hear the voice from the public and learn from their feelings to important social topics. While most related work tackles formal languages (e.g., exam papers), we generate poll questions for short and colloquial social media messages exhibiting severe data sparsity. To deal with that, we propose to encode user comments and discover latent topics therein as contexts. They are then incorporated into a sequence-to-sequence (S2S) architecture for question generation and its extension with dual decoders to additionally yield poll choices (answers). For experiments, we collect a large-scale Chinese dataset from Sina Weibo containing over 20 K polls. The results show that our model outperforms the popular S2S models without exploiting topics from comments and the dual decoder design can further benefit the prediction of both questions and answers. Human evaluations further exhibit our superiority in yielding high-quality polls helpful to draw user engagements.

pdf bib
HateCheck : Functional Tests for Hate Speech Detection ModelsHateCheck: Functional Tests for Hate Speech Detection Models
Paul Röttger | Bertie Vidgen | Dong Nguyen | Zeerak Waseem | Helen Margetts | Janet Pierrehumbert

Detecting online hate is a difficult task that even state-of-the-art models struggle with. Typically, hate speech detection models are evaluated by measuring their performance on held-out test data using metrics such as accuracy and F1 score. However, this approach makes it difficult to identify specific model weak points. It also risks overestimating generalisable model performance due to increasingly well-evidenced systematic gaps and biases in hate speech datasets. To enable more targeted diagnostic insights, we introduce HateCheck, a suite of functional tests for hate speech detection models. We specify 29 model functionalities motivated by a review of previous research and a series of interviews with civil society stakeholders. We craft test cases for each functionality and validate their quality through a structured annotation process. To illustrate HateCheck’s utility, we test near-state-of-the-art transformer models as well as two popular commercial models, revealing critical model weaknesses.

pdf bib
Generalising Multilingual Concept-to-Text NLG with Language Agnostic DelexicalisationNLG with Language Agnostic Delexicalisation
Giulio Zhou | Gerasimos Lampouras

Concept-to-text Natural Language Generation is the task of expressing an input meaning representation in natural language. Previous approaches in this task have been able to generalise to rare or unseen instances by relying on a delexicalisation of the input. However, this often requires that the input appears verbatim in the output text. This poses challenges in multilingual settings, where the task expands to generate the output text in multiple languages given the same input. In this paper, we explore the application of multilingual models in concept-to-text and propose Language Agnostic Delexicalisation, a novel delexicalisation method that uses multilingual pretrained embeddings, and employs a character-level post-editing model to inflect words in their correct form during relexicalisation. Our experiments across five datasets and five languages show that multilingual models outperform monolingual models in concept-to-text and that our framework outperforms previous approaches, especially in low resource conditions.

pdf bib
Dual Slot Selector via Local Reliability Verification for Dialogue State Tracking
Jinyu Guo | Kai Shuang | Jijie Li | Zihan Wang

The goal of dialogue state tracking (DST) is to predict the current dialogue state given all previous dialogue contexts. Existing approaches generally predict the dialogue state at every turn from scratch. However, the overwhelming majority of the slots in each turn should simply inherit the slot values from the previous turn. Therefore, the mechanism of treating slots equally in each turn not only is inefficient but also may lead to additional errors because of the redundant slot value generation. To address this problem, we devise the two-stage DSS-DST which consists of the Dual Slot Selector based on the current turn dialogue, and the Slot Value Generator based on the dialogue history. The Dual Slot Selector determines each slot whether to update slot value or to inherit the slot value from the previous turn from two aspects : (1) if there is a strong relationship between it and the current turn dialogue utterances ; (2) if a slot value with high reliability can be obtained for it through the current turn dialogue. The slots selected to be updated are permitted to enter the Slot Value Generator to update values by a hybrid method, while the other slots directly inherit the values from the previous turn. Empirical results show that our method achieves 56.93 %, 60.73 %, and 58.04 % joint accuracy on MultiWOZ 2.0, MultiWOZ 2.1, and MultiWOZ 2.2 datasets respectively and achieves a new state-of-the-art performance with significant improvements.

pdf bib
BoB : BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized DataBoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data
Haoyu Song | Yan Wang | Kaiyan Zhang | Wei-Nan Zhang | Ting Liu

Maintaining a consistent persona is essential for dialogue agents. Although tremendous advancements have been brought, the limited-scale of annotated personalized dialogue datasets is still a barrier towards training robust and consistent persona-based dialogue models. This work shows how this challenge can be addressed by disentangling persona-based dialogue generation into two sub-tasks with a novel BERT-over-BERT (BoB) model. Specifically, the model consists of a BERT-based encoder and two BERT-based decoders, where one decoder is for response generation, and another is for consistency understanding. In particular, to learn the ability of consistency understanding from large-scale non-dialogue inference data, we train the second decoder in an unlikelihood manner. Under different limited data settings, both automatic and human evaluations demonstrate that the proposed model outperforms strong baselines in response quality and persona consistency.

pdf bib
GL-GIN : Fast and Accurate Non-Autoregressive Model for Joint Multiple Intent Detection and Slot FillingGL-GIN: Fast and Accurate Non-Autoregressive Model for Joint Multiple Intent Detection and Slot Filling
Libo Qin | Fuxuan Wei | Tianbao Xie | Xiao Xu | Wanxiang Che | Ting Liu

Multi-intent SLU can handle multiple intents in an utterance, which has attracted increasing attention. However, the state-of-the-art joint models heavily rely on autoregressive approaches, resulting in two issues : slow inference speed and information leakage. In this paper, we explore a non-autoregressive model for joint multiple intent detection and slot filling, achieving more fast and accurate. Specifically, we propose a Global-Locally Graph Interaction Network (GL-GIN) where a local slot-aware graph interaction layer is proposed to model slot dependency for alleviating uncoordinated slots problem while a global intent-slot graph interaction layer is introduced to model the interaction between multiple intents and all slots in the utterance. Experimental results on two public datasets show that our framework achieves state-of-the-art performance while being 11.5 times faster.

pdf bib
Modularized Interaction Network for Named Entity Recognition
Fei Li | Zheng Wang | Siu Cheung Hui | Lejian Liao | Dandan Song | Jing Xu | Guoxiu He | Meihuizi Jia

Although the existing Named Entity Recognition (NER) models have achieved promising performance, they suffer from certain drawbacks. The sequence labeling-based NER models do not perform well in recognizing long entities as they focus only on word-level information, while the segment-based NER models which focus on processing segment instead of single word are unable to capture the word-level dependencies within the segment. Moreover, as boundary detection and type prediction may cooperate with each other for the NER task, it is also important for the two sub-tasks to mutually reinforce each other by sharing their information. In this paper, we propose a novel Modularized Interaction Network (MIN) model which utilizes both segment-level information and word-level dependencies, and incorporates an interaction mechanism to support information sharing between boundary detection and type prediction to enhance the performance for the NER task. We have conducted extensive experiments based on three NER benchmark datasets. The performance results have shown that the proposed MIN model has outperformed the current state-of-the-art models.

pdf bib
UniRE : A Unified Label Space for Entity Relation ExtractionUniRE: A Unified Label Space for Entity Relation Extraction
Yijun Wang | Changzhi Sun | Yuanbin Wu | Hao Zhou | Lei Li | Junchi Yan

Many joint entity relation extraction models setup two separated label spaces for the two sub-tasks (i.e., entity detection and relation classification). We argue that this setting may hinder the information interaction between entities and relations. In this work, we propose to eliminate the different treatment on the two sub-tasks’ label spaces. The input of our model is a table containing all word pairs from a sentence. Entities and relations are represented by squares and rectangles in the table. We apply a unified classifier to predict each cell’s label, which unifies the learning of two sub-tasks. For testing, an effective (yet fast) approximate decoder is proposed for finding squares and rectangles from tables. Experiments on three benchmarks (ACE04, ACE05, SciERC) show that, using only half the number of parameters, our model achieves competitive accuracy with the best extractor, and is faster.

pdf bib
Understanding the Properties of Minimum Bayes Risk Decoding in Neural Machine TranslationBayes Risk Decoding in Neural Machine Translation
Mathias Müller | Rico Sennrich

Neural Machine Translation (NMT) currently exhibits biases such as producing translations that are too short and overgenerating frequent words, and shows poor robustness to copy noise in training data or domain shift. Recent work has tied these shortcomings to beam search the de facto standard inference algorithm in NMT and Eikema & Aziz (2020) propose to use Minimum Bayes Risk (MBR) decoding on unbiased samples instead. In this paper, we empirically investigate the properties of MBR decoding on a number of previously reported biases and failure cases of beam search. We find that MBR still exhibits a length and token frequency bias, owing to the MT metrics used as utility functions, but that MBR also increases robustness against copy noise in the training data and domain shift.

pdf bib
Multi-Head Highly Parallelized LSTM Decoder for Neural Machine TranslationLSTM Decoder for Neural Machine Translation
Hongfei Xu | Qiuhui Liu | Josef van Genabith | Deyi Xiong | Meng Zhang

One of the reasons Transformer translation models are popular is that self-attention networks for context modelling can be easily parallelized at sequence level. However, the computational complexity of a self-attention network is O(n^2), increasing quadratically with sequence length. By contrast, the complexity of LSTM-based approaches is only O(n). In practice, however, LSTMs are much slower to train than self-attention networks as they can not be parallelized at sequence level : to model context, the current LSTM state relies on the full LSTM computation of the preceding state. This has to be computed n times for a sequence of length n. The linear transformations involved in the LSTM gate and state computations are the major cost factors in this. To enable sequence-level parallelization of LSTMs, we approximate full LSTM context modelling by computing hidden states and gates with the current input and a simple bag-of-words representation of the preceding tokens context. This allows us to compute each input step efficiently in parallel, avoiding the formerly costly sequential linear transformations. We then connect the outputs of each parallel step with computationally cheap element-wise computations. We call this the Highly Parallelized LSTM. To further constrain the number of LSTM parameters, we compute several small HPLSTMs in parallel like multi-head attention in the Transformer. The experiments show that our MHPLSTM decoder achieves significant BLEU improvements, while being even slightly faster than the self-attention network in training, and much faster than the standard LSTM.O(n^2), increasing quadratically with sequence length. By contrast, the complexity of LSTM-based approaches is only O(n). In practice, however, LSTMs are much slower to train than self-attention networks as they cannot be parallelized at sequence level: to model context, the current LSTM state relies on the full LSTM computation of the preceding state. This has to be computed n times for a sequence of length n. The linear transformations involved in the LSTM gate and state computations are the major cost factors in this. To enable sequence-level parallelization of LSTMs, we approximate full LSTM context modelling by computing hidden states and gates with the current input and a simple bag-of-words representation of the preceding tokens context. This allows us to compute each input step efficiently in parallel, avoiding the formerly costly sequential linear transformations. We then connect the outputs of each parallel step with computationally cheap element-wise computations. We call this the Highly Parallelized LSTM. To further constrain the number of LSTM parameters, we compute several small HPLSTMs in parallel like multi-head attention in the Transformer. The experiments show that our MHPLSTM decoder achieves significant BLEU improvements, while being even slightly faster than the self-attention network in training, and much faster than the standard LSTM.

pdf bib
A Bidirectional Transformer Based Alignment Model for Unsupervised Word Alignment
Jingyi Zhang | Josef van Genabith

Word alignment and machine translation are two closely related tasks. Neural translation models, such as RNN-based and Transformer models, employ a target-to-source attention mechanism which can provide rough word alignments, but with a rather low accuracy. High-quality word alignment can help neural machine translation in many different ways, such as missing word detection, annotation transfer and lexicon injection. Existing methods for learning word alignment include statistical word aligners (e.g. GIZA++) and recently neural word alignment models. This paper presents a bidirectional Transformer based alignment (BTBA) model for unsupervised learning of the word alignment task. Our BTBA model predicts the current target word by attending the source context and both left-side and right-side target context to produce accurate target-to-source attention (alignment). We further fine-tune the target-to-source attention in the BTBA model to obtain better alignments using a full context based optimization method and self-supervised training. We test our method on three word alignment tasks and show that our method outperforms both previous neural word alignment approaches and the popular statistical word aligner GIZA++.

pdf bib
Learning Language Specific Sub-network for Multilingual Machine Translation
Zehui Lin | Liwei Wu | Mingxuan Wang | Lei Li

Multilingual neural machine translation aims at learning a single translation model for multiple languages. These jointly trained models often suffer from performance degradationon rich-resource language pairs. We attribute this degeneration to parameter interference. In this paper, we propose LaSS to jointly train a single unified multilingual MT model. LaSS learns Language Specific Sub-network (LaSS) for each language pair to counter parameter interference. Comprehensive experiments on IWSLT and WMT datasets with various Transformer architectures show that LaSS obtains gains on 36 language pairs by up to 1.2 BLEU. Besides, LaSS shows its strong generalization performance at easy adaptation to new language pairs and zero-shot translation. LaSS boosts zero-shot translation with an average of 8.3 BLEU on 30 language pairs. Codes and trained models are available at

pdf bib
Bridge-Based Active Domain Adaptation for Aspect Term Extraction
Zhuang Chen | Tieyun Qian

As a fine-grained task, the annotation cost of aspect term extraction is extremely high. Recent attempts alleviate this issue using domain adaptation that transfers common knowledge across domains. Since most aspect terms are domain-specific, they can not be transferred directly. Existing methods solve this problem by associating aspect terms with pivot words (we call this passive domain adaptation because the transfer of aspect terms relies on the links to pivots). However, all these methods need either manually labeled pivot words or expensive computing resources to build associations. In this paper, we propose a novel active domain adaptation method. Our goal is to transfer aspect terms by actively supplementing transferable knowledge. To this end, we construct syntactic bridges by recognizing syntactic roles as pivots instead of as links to pivots. We also build semantic bridges by retrieving transferable semantic prototypes. Extensive experiments show that our method significantly outperforms previous approaches.

pdf bib
Multimodal Sentiment Detection Based on Multi-channel Graph Neural Networks
Xiaocui Yang | Shi Feng | Yifei Zhang | Daling Wang

With the popularity of smartphones, we have witnessed the rapid proliferation of multimodal posts on various social media platforms. We observe that the multimodal sentiment expression has specific global characteristics, such as the interdependencies of objects or scenes within the image. However, most previous studies only considered the representation of a single image-text post and failed to capture the global co-occurrence characteristics of the dataset. In this paper, we propose Multi-channel Graph Neural Networks with Sentiment-awareness (MGNNS) for image-text sentiment detection. Specifically, we first encode different modalities to capture hidden representations. Then, we introduce multi-channel graph neural networks to learn multimodal representations based on the global characteristics of the dataset. Finally, we implement multimodal in-depth fusion with the multi-head attention mechanism to predict the sentiment of image-text pairs. Extensive experiments conducted on three publicly available datasets demonstrate the effectiveness of our approach for multimodal sentiment detection.

pdf bib
PASS : Perturb-and-Select Summarizer for Product ReviewsPASS: Perturb-and-Select Summarizer for Product Reviews
Nadav Oved | Ran Levy

The product reviews summarization task aims to automatically produce a short summary for a set of reviews of a given product. Such summaries are expected to aggregate a range of different opinions in a concise, coherent and informative manner. This challenging task gives rise to two shortcomings in existing work. First, summarizers tend to favor generic content that appears in reviews for many different products, resulting in template-like, less informative summaries. Second, as reviewers often disagree on the pros and cons of a given product, summarizers sometimes yield inconsistent, self-contradicting summaries. We propose the PASS system (Perturb-and-Select Summarizer) that employs a large pre-trained Transformer-based model (T5 in our case), which follows a few-shot fine-tuning scheme. A key component of the PASS system relies on applying systematic perturbations to the model’s input during inference, which allows it to generate multiple different summaries per product. We develop a method for ranking these summaries according to desired criteria, coherence in our case, enabling our system to almost entirely avoid the problem of self-contradiction. We compare our system against strong baselines on publicly available datasets, and show that it produces summaries which are more informative, diverse and coherent.

pdf bib
Deep Differential Amplifier for Extractive Summarization
Ruipeng Jia | Yanan Cao | Fang Fang | Yuchen Zhou | Zheng Fang | Yanbing Liu | Shi Wang

For sentence-level extractive summarization, there is a disproportionate ratio of selected and unselected sentences, leading to flatting the summary features when maximizing the accuracy. The imbalanced classification of summarization is inherent, which ca n’t be addressed by common algorithms easily. In this paper, we conceptualize the single-document extractive summarization as a rebalance problem and present a deep differential amplifier framework. Specifically, we first calculate and amplify the semantic difference between each sentence and all other sentences, and then apply the residual unit as the second item of the differential amplifier to deepen the architecture. Finally, to compensate for the imbalance, the corresponding objective loss of minority class is boosted by a weighted cross-entropy. In contrast to previous approaches, this model pays more attention to the pivotal information of one sentence, instead of all the informative context modeling by recurrent or Transformer architecture. We demonstrate experimentally on two benchmark datasets that our summarizer performs competitively against state-of-the-art methods. Our source code will be available on Github.

pdf bib
Self-Supervised Multimodal Opinion Summarization
Jinbae Im | Moonki Kim | Hoyeop Lee | Hyunsouk Cho | Sehee Chung

Recently, opinion summarization, which is the generation of a summary from multiple reviews, has been conducted in a self-supervised manner by considering a sampled review as a pseudo summary. However, non-text data such as image and metadata related to reviews have been considered less often. To use the abundant information contained in non-text data, we propose a self-supervised multimodal opinion summarization framework called MultimodalSum. Our framework obtains a representation of each modality using a separate encoder for each modality, and the text decoder generates a summary. To resolve the inherent heterogeneity of multimodal data, we propose a multimodal training pipeline. We first pretrain the text encoderdecoder based solely on text modality data. Subsequently, we pretrain the non-text modality encoders by considering the pretrained text decoder as a pivot for the homogeneous representation of multimodal data. Finally, to fuse multimodal representations, we train the entire framework in an end-to-end manner. We demonstrate the superiority of MultimodalSum by conducting experiments on Yelp and Amazon datasets.

pdf bib
Improving the Faithfulness of Attention-based Explanations with Task-specific Information for Text Classification
George Chrysostomou | Nikolaos Aletras

Neural network architectures in natural language processing often use attention mechanisms to produce probability distributions over input token representations. Attention has empirically been demonstrated to improve performance in various tasks, while its weights have been extensively used as explanations for model predictions. Recent studies (Jain and Wallace, 2019 ; Serrano and Smith, 2019 ; Wiegreffe and Pinter, 2019) have showed that it can not generally be considered as a faithful explanation (Jacovi and Goldberg, 2020) across encoders and tasks. In this paper, we seek to improve the faithfulness of attention-based explanations for text classification. We achieve this by proposing a new family of Task-Scaling (TaSc) mechanisms that learn task-specific non-contextualised information to scale the original attention weights. Evaluation tests for explanation faithfulness, show that the three proposed variants of TaSc improve attention-based explanations across two attention mechanisms, five encoders and five text classification datasets without sacrificing predictive performance. Finally, we demonstrate that TaSc consistently provides more faithful attention-based explanations compared to three widely-used interpretability techniques.

pdf bib
Learning Relation Alignment for Calibrated Cross-modal Retrieval
Shuhuai Ren | Junyang Lin | Guangxiang Zhao | Rui Men | An Yang | Jingren Zhou | Xu Sun | Hongxia Yang

Despite the achievements of large-scale multimodal pre-training approaches, cross-modal retrieval, e.g., image-text retrieval, remains a challenging task. To bridge the semantic gap between the two modalities, previous studies mainly focus on word-region alignment at the object level, lacking the matching between the linguistic relation among the words and the visual relation among the regions. The neglect of such relation consistency impairs the contextualized representation of image-text pairs and hinders the model performance and the interpretability. In this paper, we first propose a novel metric, Intra-modal Self-attention Distance (ISD), to quantify the relation consistency by measuring the semantic distance between linguistic and visual relations. In response, we present Inter-modal Alignment on Intra-modal Self-attentions (IAIS), a regularized training method to optimize the ISD and calibrate intra-modal self-attentions from the two modalities mutually via inter-modal alignment. The IAIS regularizer boosts the performance of prevailing models on Flickr30k and MS COCO datasets by a considerable margin, which demonstrates the superiority of our approach.

pdf bib
KM-BART : Knowledge Enhanced Multimodal BART for Visual Commonsense GenerationKM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation
Yiran Xing | Zai Shi | Zhao Meng | Gerhard Lakemeyer | Yunpu Ma | Roger Wattenhofer

We present Knowledge Enhanced Multimodal BART (KM-BART), which is a Transformer-based sequence-to-sequence model capable of reasoning about commonsense knowledge from multimodal inputs of images and texts. We adapt the generative BART architecture (Lewis et al., 2020) to a multimodal model with visual and textual inputs. We further develop novel pretraining tasks to improve the model performance on the Visual Commonsense Generation (VCG) task. In particular, our pretraining task of Knowledge-based Commonsense Generation (KCG) boosts model performance on the VCG task by leveraging commonsense knowledge from a large language model pretrained on external commonsense knowledge graphs. To the best of our knowledge, we are the first to propose a dedicated task for improving model performance on the VCG task. Experimental results show that our model reaches state-of-the-art performance on the VCG task (Park et al., 2020) by applying these novel pretraining tasks.

pdf bib
Cascaded Head-colliding Attention
Lin Zheng | Zhiyong Wu | Lingpeng Kong

Transformers have advanced the field of natural language processing (NLP) on a variety of important tasks. At the cornerstone of the Transformer architecture is the multi-head attention (MHA) mechanism which models pairwise interactions between the elements of the sequence. Despite its massive success, the current framework ignores interactions among different heads, leading to the problem that many of the heads are redundant in practice, which greatly wastes the capacity of the model. To improve parameter efficiency, we re-formulate the MHA as a latent variable model from a probabilistic perspective. We present cascaded head-colliding attention (CODA) which explicitly models the interactions between attention heads through a hierarchical variational distribution. We conduct extensive experiments and demonstrate that CODA outperforms the transformer baseline, by 0.6 perplexity on Wikitext-103 in language modeling, and by 0.6 BLEU on WMT14 EN-DE in machine translation, due to its improvements on the parameter efficiency.cascaded head-colliding attention (CODA) which explicitly models the interactions between attention heads through a hierarchical variational distribution. We conduct extensive experiments and demonstrate that CODA outperforms the transformer baseline, by 0.6 perplexity on Wikitext-103 in language modeling, and by 0.6 BLEU on WMT14 EN-DE in machine translation, due to its improvements on the parameter efficiency.

pdf bib
Structural Knowledge Distillation : Tractably Distilling Information for Structured Predictor
Xinyu Wang | Yong Jiang | Zhaohui Yan | Zixia Jia | Nguyen Bach | Tao Wang | Zhongqiang Huang | Fei Huang | Kewei Tu

Knowledge distillation is a critical technique to transfer knowledge between models, typically from a large model (the teacher) to a more fine-grained one (the student). The objective function of knowledge distillation is typically the cross-entropy between the teacher and the student’s output distributions. However, for structured prediction problems, the output space is exponential in size ; therefore, the cross-entropy objective becomes intractable to compute and optimize directly. In this paper, we derive a factorized form of the knowledge distillation objective for structured prediction, which is tractable for many typical choices of the teacher and student models. In particular, we show the tractability and empirical effectiveness of structural knowledge distillation between sequence labeling and dependency parsing models under four different scenarios : 1) the teacher and student share the same factorization form of the output structure scoring function ; 2) the student factorization produces more fine-grained substructures than the teacher factorization ; 3) the teacher factorization produces more fine-grained substructures than the student factorization ; 4) the factorization forms from the teacher and the student are incompatible.

pdf bib
Parameter-efficient Multi-task Fine-tuning for Transformers via Shared Hypernetworks
Rabeeh Karimi Mahabadi | Sebastian Ruder | Mostafa Dehghani | James Henderson

State-of-the-art parameter-efficient fine-tuning methods rely on introducing adapter modules between the layers of a pretrained language model. However, such modules are trained separately for each task and thus do not enable sharing information across tasks. In this paper, we show that we can learn adapter parameters for all layers and tasks by generating them using shared hypernetworks, which condition on task, adapter position, and layer i d in a transformer model. This parameter-efficient multi-task learning framework allows us to achieve the best of both worlds by sharing knowledge across tasks via hypernetworks while enabling the model to adapt to each individual task through task-specific adapters. Experiments on the well-known GLUE benchmark show improved performance in multi-task learning while adding only 0.29 % parameters per task. We additionally demonstrate substantial performance improvements in few-shot domain generalization across a variety of tasks. Our code is publicly available in

pdf bib
OoMMix : Out-of-manifold Regularization in Contextual Embedding Space for Text ClassificationOoMMix: Out-of-manifold Regularization in Contextual Embedding Space for Text Classification
Seonghyeon Lee | Dongha Lee | Hwanjo Yu

Recent studies on neural networks with pre-trained weights (i.e., BERT) have mainly focused on a low-dimensional subspace, where the embedding vectors computed from input words (or their contexts) are located. In this work, we propose a new approach, called OoMMix, to finding and regularizing the remainder of the space, referred to as out-of-manifold, which can not be accessed through the words. Specifically, we synthesize the out-of-manifold embeddings based on two embeddings obtained from actually-observed words, to utilize them for fine-tuning the network. A discriminator is trained to detect whether an input embedding is located inside the manifold or not, and simultaneously, a generator is optimized to produce new embeddings that can be easily identified as out-of-manifold by the discriminator. These two modules successfully collaborate in a unified and end-to-end manner for regularizing the out-of-manifold. Our extensive evaluation on various text classification benchmarks demonstrates the effectiveness of our approach, as well as its good compatibility with existing data augmentation techniques which aim to enhance the manifold.

pdf bib
Understanding and Countering Stereotypes : A Computational Approach to the Stereotype Content Model
Kathleen C. Fraser | Isar Nejadgholi | Svetlana Kiritchenko

Stereotypical language expresses widely-held beliefs about different social categories. Many stereotypes are overtly negative, while others may appear positive on the surface, but still lead to negative consequences. In this work, we present a computational approach to interpreting stereotypes in text through the Stereotype Content Model (SCM), a comprehensive causal theory from social psychology. The SCM proposes that stereotypes can be understood along two primary dimensions : warmth and competence. We present a method for defining warmth and competence axes in semantic embedding space, and show that the four quadrants defined by this subspace accurately represent the warmth and competence concepts, according to annotated lexicons. We then apply our computational SCM model to textual stereotype data and show that it compares favourably with survey-based studies in the psychological literature. Furthermore, we explore various strategies to counter stereotypical beliefs with anti-stereotypes. It is known that countering stereotypes with anti-stereotypical examples is one of the most effective ways to reduce biased thinking, yet the problem of generating anti-stereotypes has not been previously studied. Thus, a better understanding of how to generate realistic and effective anti-stereotypes can contribute to addressing pressing societal concerns of stereotyping, prejudice, and discrimination.

pdf bib
Structurizing Misinformation Stories via Rationalizing Fact-Checks
Shan Jiang | Christo Wilson

Misinformation has recently become a well-documented matter of public concern. Existing studies on this topic have hitherto adopted a coarse concept of misinformation, which incorporates a broad spectrum of story types ranging from political conspiracies to misinterpreted pranks. This paper aims to structurize these misinformation stories by leveraging fact-check articles. Our intuition is that key phrases in a fact-check article that identify the misinformation type(s) (e.g., doctored images, urban legends) also act as rationales that determine the verdict of the fact-check (e.g., false). We experiment on rationalized models with domain knowledge as weak supervision to extract these phrases as rationales, and then cluster semantically similar rationales to summarize prevalent misinformation types. Using archived fact-checks from, we identify ten types of misinformation stories. We discuss how these types have evolved over the last ten years and compare their prevalence between the 2016/2020 US presidential elections and the H1N1 / COVID-19 pandemics.

pdf bib
Learning from Perturbations : Diverse and Informative Dialogue Generation with Inverse Adversarial Training
Wangchunshu Zhou | Qifei Li | Chenle Li

In this paper, we propose Inverse Adversarial Training (IAT) algorithm for training neural dialogue systems to avoid generic responses and model dialogue history better. In contrast to standard adversarial training algorithms, IAT encourages the model to be sensitive to the perturbation in the dialogue history and therefore learning from perturbations. By giving higher rewards for responses whose output probability reduces more significantly when dialogue history is perturbed, the model is encouraged to generate more diverse and consistent responses. By penalizing the model when generating the same response given perturbed dialogue history, the model is forced to better capture dialogue history and generate more informative responses. Experimental results on two benchmark datasets show that our approach can better model dialogue history and generate more diverse and consistent responses. In addition, we point out a problem of the widely used maximum mutual information (MMI) based methods for improving the diversity of dialogue response generation models and demonstrate it empirically.

pdf bib
CitationIE : Leveraging the Citation Graph for Scientific Information ExtractionCitationIE: Leveraging the Citation Graph for Scientific Information Extraction
Vijay Viswanathan | Graham Neubig | Pengfei Liu

Automatically extracting key information from scientific documents has the potential to help scientists work more efficiently and accelerate the pace of scientific progress. Prior work has considered extracting document-level entity clusters and relations end-to-end from raw scientific text, which can improve literature search and help identify methods and materials for a given problem. Despite the importance of this task, most existing works on scientific information extraction (SciIE) consider extraction solely based on the content of an individual paper, without considering the paper’s place in the broader literature. In contrast to prior work, we augment our text representations by leveraging a complementary source of document context : the citation graph of referential links between citing and cited papers. On a test set of English-language scientific documents, we show that simple ways of utilizing the structure and content of the citation graph can each lead to significant gains in different scientific information extraction tasks. When these tasks are combined, we observe a sizable improvement in end-to-end information extraction over the state-of-the-art, suggesting the potential for future work along this direction. We release software tools to facilitate citation-aware SciIE development.

pdf bib
From Discourse to Narrative : Knowledge Projection for Event Relation Extraction
Jialong Tang | Hongyu Lin | Meng Liao | Yaojie Lu | Xianpei Han | Le Sun | Weijian Xie | Jin Xu

Current event-centric knowledge graphs highly rely on explicit connectives to mine relations between events. Unfortunately, due to the sparsity of connectives, these methods severely undermine the coverage of EventKGs. The lack of high-quality labelled corpora further exacerbates that problem. In this paper, we propose a knowledge projection paradigm for event relation extraction : projecting discourse knowledge to narratives by exploiting the commonalities between them. Specifically, we propose Multi-tier Knowledge Projection Network (MKPNet), which can leverage multi-tier discourse knowledge effectively for event relation extraction. In this way, the labelled data requirement is significantly reduced, and implicit event relations can be effectively extracted. Intrinsic experimental results show that MKPNet achieves the new state-of-the-art performance and extrinsic experimental results verify the value of the extracted event relations.

pdf bib
AdvPicker : Effectively Leveraging Unlabeled Data via Adversarial Discriminator for Cross-Lingual NERAdvPicker: Effectively Leveraging Unlabeled Data via Adversarial Discriminator for Cross-Lingual NER
Weile Chen | Huiqiang Jiang | Qianhui Wu | Börje Karlsson | Yi Guan

Neural methods have been shown to achieve high performance in Named Entity Recognition (NER), but rely on costly high-quality labeled data for training, which is not always available across languages. While previous works have shown that unlabeled data in a target language can be used to improve cross-lingual model performance, we propose a novel adversarial approach (AdvPicker) to better leverage such data and further improve results. We design an adversarial learning framework in which an encoder learns entity domain knowledge from labeled source-language data and better shared features are captured via adversarial training-where a discriminator selects less language-dependent target-language data via similarity to the source language. Experimental results on standard benchmark datasets well demonstrate that the proposed method benefits strongly from this data selection process and outperforms existing state-of-the-art methods ; without requiring any additional external resources (e.g., gazetteers or via machine translation).

pdf bib
Compare to The Knowledge : Graph Neural Fake News Detection with External Knowledge
Linmei Hu | Tianchi Yang | Luhao Zhang | Wanjun Zhong | Duyu Tang | Chuan Shi | Nan Duan | Ming Zhou

Nowadays, fake news detection, which aims to verify whether a news document is trusted or fake, has become urgent and important. Most existing methods rely heavily on linguistic and semantic features from the news content, and fail to effectively exploit external knowledge which could help determine whether the news document is trusted. In this paper, we propose a novel end-to-end graph neural model called CompareNet, which compares the news to the knowledge base (KB) through entities for fake news detection. Considering that fake news detection is correlated with topics, we also incorporate topics to enrich the news representation. Specifically, we first construct a directed heterogeneous document graph for each news incorporating topics and entities. Based on the graph, we develop a heterogeneous graph attention network for learning the topic-enriched news representation as well as the contextual entity representations that encode the semantics of the news content. The contextual entity representations are then compared to the corresponding KB-based entity representations through a carefully designed entity comparison network, to capture the consistency between the news content and KB. Finally, the topic-enriched news representation combining the entity comparison features is fed into a fake news classifier. Experimental results on two benchmark datasets demonstrate that CompareNet significantly outperforms state-of-the-art methods.directed heterogeneous document graph for each news incorporating topics and entities. Based on the graph, we develop a heterogeneous graph attention network for learning the topic-enriched news representation as well as the contextual entity representations that encode the semantics of the news content. The contextual entity representations are then compared to the corresponding KB-based entity representations through a carefully designed entity comparison network, to capture the consistency between the news content and KB. Finally, the topic-enriched news representation combining the entity comparison features is fed into a fake news classifier. Experimental results on two benchmark datasets demonstrate that CompareNet significantly outperforms state-of-the-art methods.

pdf bib
Discontinuous Named Entity Recognition as Maximal Clique Discovery
Yucheng Wang | Bowen Yu | Hongsong Zhu | Tingwen Liu | Nan Yu | Limin Sun

Named entity recognition (NER) remains challenging when entity mentions can be discontinuous. Existing methods break the recognition process into several sequential steps. In training, they predict conditioned on the golden intermediate results, while at inference relying on the model output of the previous steps, which introduces exposure bias. To solve this problem, we first construct a segment graph for each sentence, in which each node denotes a segment (a continuous entity on its own, or a part of discontinuous entities), and an edge links two nodes that belong to the same entity. The nodes and edges can be generated respectively in one stage with a grid tagging scheme and learned jointly using a novel architecture named Mac. Then discontinuous NER can be reformulated as a non-parametric process of discovering maximal cliques in the graph and concatenating the spans in each clique. Experiments on three benchmarks show that our method outperforms the state-of-the-art (SOTA) results, with up to 3.5 percentage points improvement on F1, and achieves 5x speedup over the SOTA model.

pdf bib
Do Context-Aware Translation Models Pay the Right Attention?
Kayo Yin | Patrick Fernandes | Danish Pruthi | Aditi Chaudhary | André F. T. Martins | Graham Neubig

Context-aware machine translation models are designed to leverage contextual information, but often fail to do so. As a result, they inaccurately disambiguate pronouns and polysemous words that require context for resolution. In this paper, we ask several questions : What contexts do human translators use to resolve ambiguous words? Are models paying large amounts of attention to the same context? What if we explicitly train them to do so? To answer these questions, we introduce SCAT (Supporting Context for Ambiguous Translations), a new English-French dataset comprising supporting context words for 14 K translations that professional translators found useful for pronoun disambiguation. Using SCAT, we perform an in-depth analysis of the context used to disambiguate, examining positional and lexical characteristics of the supporting words. Furthermore, we measure the degree of alignment between the model’s attention scores and the supporting context from SCAT, and apply a guided attention strategy to encourage agreement between the two.

pdf bib
Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel DataNMT Models to Translate Low-resource Related Languages without Parallel Data
Wei-Jen Ko | Ahmed El-Kishky | Adithya Renduchintala | Vishrav Chaudhary | Naman Goyal | Francisco Guzmán | Pascale Fung | Philipp Koehn | Mona Diab

The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages ; these related languages may share many lexical or syntactic structures. In this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language with only monolingual data, in addition to any parallel data in the related high-resource language. Our method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation. We experiment on 7 languages from three different language families and show that our technique significantly improves translation into low-resource language compared to other translation baselines.

pdf bib
Multilingual Speech Translation from Efficient Finetuning of Pretrained Models
Xian Li | Changhan Wang | Yun Tang | Chau Tran | Yuqing Tang | Juan Pino | Alexei Baevski | Alexis Conneau | Michael Auli

We present a simple yet effective approach to build multilingual speech-to-text (ST) translation through efficient transfer learning from a pretrained speech encoder and text decoder. Our key finding is that a minimalistic LNA (LayerNorm and Attention) finetuning can achieve zero-shot crosslingual and cross-modality transfer ability by only finetuning 10 50 % of the pretrained parameters. This effectively leverages large pretrained models at low training cost such as wav2vec 2.0 for acoustic modeling, and mBART for multilingual text generation. This sets a new state-of-the-art for 36 translation directions (and surpassing cascaded ST for 26 of them) on the large-scale multilingual ST benchmark CoVoST 2 (+6.4 BLEU on average for En-X directions and +6.7 BLEU for X-En directions). Our approach demonstrates strong zero-shot performance in a many-to-many multilingual model (+5.6 BLEU on average across 28 non-English directions), making it an appealing approach for attaining high-quality speech translation with improved parameter and data efficiency.

pdf bib
What Context Features Can Transformer Language Models Use?
Joe O’Connor | Jacob Andreas

Transformer-based language models benefit from conditioning on contexts of hundreds to thousands of previous tokens. What aspects of these contexts contribute to accurate model prediction? We describe a series of experiments that measure usable information by selectively ablating lexical and structural information in transformer language models trained on English Wikipedia. In both mid- and long-range contexts, we find that several extremely destructive context manipulationsincluding shuffling word order within sentences and deleting all words other than nounsremove less than 15 % of the usable information. Our results suggest that long contexts, but not their detailed syntactic and propositional content, are important for the low perplexity of current transformer language models.

pdf bib
A Targeted Assessment of Incremental Processing in Neural Language Models and Humans
Ethan Wilcox | Pranali Vani | Roger Levy

We present a targeted, scaled-up comparison of incremental processing in humans and neural language models by collecting by-word reaction time data for sixteen different syntactic test suites across a range of structural phenomena. Human reaction time data comes from a novel online experimental paradigm called the Interpolated Maze task. We compare human reaction times to by-word probabilities for four contemporary language models, with different architectures and trained on a range of data set sizes. We find that across many phenomena, both humans and language models show increased processing difficulty in ungrammatical sentence regions with human and model ‘accuracy’ scores a la Marvin and Linzen (2018) about equal. However, although language model outputs match humans in direction, we show that models systematically under-predict the difference in magnitude of incremental processing difficulty between grammatical and ungrammatical sentences. Specifically, when models encounter syntactic violations they fail to accurately predict the longer reading times observed in the human data. These results call into question whether contemporary language models are approaching human-like performance for sensitivity to syntactic violations.

pdf bib
The Possible, the Plausible, and the Desirable : Event-Based Modality Detection for Language Processing
Valentina Pyatkin | Shoval Sadde | Aynat Rubinstein | Paul Portner | Reut Tsarfaty

Modality is the linguistic ability to describe vents with added information such as how desirable, plausible, or feasible they are. Modality is important for many NLP downstream tasks such as the detection of hedging, uncertainty, speculation, and more. Previous studies that address modality detection in NLP often restrict modal expressions to a closed syntactic class, and the modal sense labels are vastly different across different studies, lacking an accepted standard. Furthermore, these senses are often analyzed independently of the events that they modify. This work builds on the theoretical foundations of the Georgetown Gradable Modal Expressions (GME) work by Rubinstein et al. (2013) to propose an event-based modality detection task where modal expressions can be words of any syntactic class and sense labels are drawn from a comprehensive taxonomy which harmonizes the modal concepts contributed by the different studies. We present experiments on the GME corpus aiming to detect and classify fine-grained modal concepts and associate them with their modified events. We show that detecting and classifying modal expressions is not only feasible, it also improves the detection of modal events in their own right.

Prosodic segmentation for parsing spoken dialogue
Elizabeth Nielsen | Mark Steedman | Sharon Goldwater

Parsing spoken dialogue poses unique difficulties, including disfluencies and unmarked boundaries between sentence-like units. Previous work has shown that prosody can help with parsing disfluent speech (Tran et al. 2018), but has assumed that the input to the parser is already segmented into sentence-like units (SUs), which is n’t true in existing speech applications. We investigate how prosody affects a parser that receives an entire dialogue turn as input (a turn-based model), instead of gold standard pre-segmented SUs (an SU-based model). In experiments on the English Switchboard corpus, we find that when using transcripts alone, the turn-based model has trouble segmenting SUs, leading to worse parse performance than the SU-based model. However, prosody can effectively replace gold standard SU boundaries : with prosody, the turn-based model performs as well as the SU-based model (91.38 vs. 91.06 F1 score, respectively), despite performing two tasks (SU segmentation and parsing) rather than one (parsing alone). Analysis shows that pitch and intensity features are the most important for this corpus, since they allow the model to correctly distinguish an SU boundary from a speech disfluency a distinction that the model otherwise struggles to make.

pdf bib
Robust Knowledge Graph Completion with Stacked Convolutions and a Student Re-Ranking Network
Justin Lovelace | Denis Newman-Griffis | Shikhar Vashishth | Jill Fain Lehman | Carolyn Rosé

Knowledge Graph (KG) completion research usually focuses on densely connected benchmark datasets that are not representative of real KGs. We curate two KG datasets that include biomedical and encyclopedic knowledge and use an existing commonsense KG dataset to explore KG completion in the more realistic setting where dense connectivity is not guaranteed. We develop a deep convolutional network that utilizes textual entity representations and demonstrate that our model outperforms recent KG completion methods in this challenging setting. We find that our model’s performance improvements stem primarily from its robustness to sparsity. We then distill the knowledge from the convolutional network into a student network that re-ranks promising candidate entities. This re-ranking stage leads to further improvements in performance and demonstrates the effectiveness of entity re-ranking for KG completion.

pdf bib
A DQN-based Approach to Finding Precise Evidences for Fact VerificationDQN-based Approach to Finding Precise Evidences for Fact Verification
Hai Wan | Haicheng Chen | Jianfeng Du | Weilin Luo | Rongzhen Ye

Computing precise evidences, namely minimal sets of sentences that support or refute a given claim, rather than larger evidences is crucial in fact verification (FV), since larger evidences may contain conflicting pieces some of which support the claim while the other refute, thereby misleading FV. Despite being important, precise evidences are rarely studied by existing methods for FV. It is challenging to find precise evidences due to a large search space with lots of local optimums. Inspired by the strong exploration ability of the deep Q-learning network (DQN), we propose a DQN-based approach to retrieval of precise evidences. In addition, to tackle the label bias on Q-values computed by DQN, we design a post-processing strategy which seeks best thresholds for determining the true labels of computed evidences. Experimental results confirm the effectiveness of DQN in computing precise evidences and demonstrate improvements in achieving accurate claim verification.

pdf bib
Unsupervised Out-of-Domain Detection via Pre-trained Transformers
Keyang Xu | Tongzheng Ren | Shikun Zhang | Yihao Feng | Caiming Xiong

Deployed real-world machine learning applications are often subject to uncontrolled and even potentially malicious inputs. Those out-of-domain inputs can lead to unpredictable outputs and sometimes catastrophic safety issues. Prior studies on out-of-domain detection require in-domain task labels and are limited to supervised classification scenarios. Our work tackles the problem of detecting out-of-domain samples with only unsupervised in-domain data. We utilize the latent representations of pre-trained transformers and propose a simple yet effective method to transform features across all layers to construct out-of-domain detectors efficiently. Two domain-specific fine-tuning approaches are further proposed to boost detection accuracy. Our empirical evaluations of related methods on two datasets validate that our method greatly improves out-of-domain detection ability in a more general scenario.

pdf bib
Selecting Informative Contexts Improves Language Model Fine-tuning
Richard Antonello | Nicole Beckage | Javier Turek | Alexander Huth

Language model fine-tuning is essential for modern natural language processing, but is computationally expensive and time-consuming. Further, the effectiveness of fine-tuning is limited by the inclusion of training examples that negatively affect performance. Here we present a general fine-tuning method that we call information gain filtration for improving the overall training efficiency and final performance of language model fine-tuning. We define the information gain of an example as the improvement on a validation metric after training on that example. A secondary learner is then trained to approximate this quantity. During fine-tuning, this learner selects informative examples and skips uninformative ones. We show that our method has consistent improvement across datasets, fine-tuning tasks, and language model architectures. For example, we achieve a median perplexity of 54.0 on a books dataset compared to 57.3 for standard fine-tuning. We present statistical evidence that offers insight into the improvements of our method over standard fine-tuning. The generality of our method leads us to propose a new paradigm for language model fine-tuning we encourage researchers to release pretrained secondary learners on common corpora to promote efficient and effective fine-tuning, thereby improving the performance and reducing the overall energy footprint of language model fine-tuning.

pdf bib
Explainable Prediction of Text Complexity : The Missing Preliminaries for Text Simplification
Cristina Garbacea | Mengtian Guo | Samuel Carton | Qiaozhu Mei

Text simplification reduces the language complexity of professional content for accessibility purposes. End-to-end neural network models have been widely adopted to directly generate the simplified version of input text, usually functioning as a blackbox. We show that text simplification can be decomposed into a compact pipeline of tasks to ensure the transparency and explainability of the process. The first two steps in this pipeline are often neglected : 1) to predict whether a given piece of text needs to be simplified, and 2) if yes, to identify complex parts of the text. The two tasks can be solved separately using either lexical or deep learning methods, or solved jointly. Simply applying explainable complexity prediction as a preliminary step, the out-of-sample text simplification performance of the state-of-the-art, black-box simplification models can be improved by a large margin.

pdf bib
Multi-Task Retrieval for Knowledge-Intensive Tasks
Jean Maillard | Vladimir Karpukhin | Fabio Petroni | Wen-tau Yih | Barlas Oguz | Veselin Stoyanov | Gargi Ghosh

Retrieving relevant contexts from a large corpus is a crucial step for tasks such as open-domain question answering and fact checking. Although neural retrieval outperforms traditional methods like tf-idf and BM25, its performance degrades considerably when applied to out-of-domain data. Driven by the question of whether a neural retrieval model can be _ universal _ and perform robustly on a wide variety of problems, we propose a multi-task trained model. Our approach not only outperforms previous methods in the few-shot setting, but also rivals specialised neural retrievers, even when in-domain training data is abundant. With the help of our retriever, we improve existing models for downstream tasks and closely match or improve the state of the art on multiple benchmarks.

pdf bib
Analyzing the Source and Target Contributions to Predictions in Neural Machine Translation
Elena Voita | Rico Sennrich | Ivan Titov

In Neural Machine Translation (and, more generally, conditional language modeling), the generation of a target token is influenced by two types of context : the source and the prefix of the target sequence. While many attempts to understand the internal workings of NMT models have been made, none of them explicitly evaluates relative source and target contributions to a generation decision. We argue that this relative contribution can be evaluated by adopting a variant of Layerwise Relevance Propagation (LRP). Its underlying ‘conservation principle’ makes relevance propagation unique : differently from other methods, it evaluates not an abstract quantity reflecting token importance, but the proportion of each token’s influence. We extend LRP to the Transformer and conduct an analysis of NMT models which explicitly evaluates the source and target relative contributions to the generation process. We analyze changes in these contributions when conditioning on different types of prefixes, when varying the training objective or the amount of training data, and during the training process. We find that models trained with more data tend to rely on source information more and to have more sharp token contributions ; the training process is non-monotonic with several stages of different nature.

pdf bib
Comparing Test Sets with Item Response Theory
Clara Vania | Phu Mon Htut | William Huang | Dhara Mungra | Richard Yuanzhe Pang | Jason Phang | Haokun Liu | Kyunghyun Cho | Samuel R. Bowman

Recent years have seen numerous NLP datasets introduced to evaluate the performance of fine-tuned models on natural language understanding tasks. Recent results from large pretrained models, though, show that many of these datasets are largely saturated and unlikely to be able to detect further progress. What kind of datasets are still effective at discriminating among strong models, and what kind of datasets should we expect to be able to detect future improvements? To measure this uniformly across datasets, we draw on Item Response Theory and evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples. We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models, while SNLI, MNLI, and CommitmentBank seem to be saturated for current strong models. We also observe span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.

pdf bib
More Identifiable yet Equally Performant Transformers for Text Classification
Rishabh Bhardwaj | Navonil Majumder | Soujanya Poria | Eduard Hovy

Interpretability is an important aspect of the trustworthiness of a model’s predictions. Transformer’s predictions are widely explained by the attention weights, i.e., a probability distribution generated at its self-attention unit (head). Current empirical studies provide shreds of evidence that attention weights are not explanations by proving that they are not unique. A recent study showed theoretical justifications to this observation by proving the non-identifiability of attention weights. For a given input to a head and its output, if the attention weights generated in it are unique, we call the weights identifiable. In this work, we provide deeper theoretical analysis and empirical observations on the identifiability of attention weights. Ignored in the previous works, we find the attention weights are more identifiable than we currently perceive by uncovering the hidden role of the key vector. However, the weights are still prone to be non-unique attentions that make them unfit for interpretation. To tackle this issue, we provide a variant of the encoder layer that decouples the relationship between key and value vector and provides identifiable weights up to the desired length of the input. We prove the applicability of such variations by providing empirical justifications on varied text classification tasks. The implementations are available at

pdf bib
AugNLG : Few-shot Natural Language Generation using Self-trained Data AugmentationAugNLG: Few-shot Natural Language Generation using Self-trained Data Augmentation
Xinnuo Xu | Guoyin Wang | Young-Bum Kim | Sungjin Lee

Natural Language Generation (NLG) is a key component in a task-oriented dialogue system, which converts the structured meaning representation (MR) to the natural language. For large-scale conversational systems, where it is common to have over hundreds of intents and thousands of slots, neither template-based approaches nor model-based approaches are scalable. Recently, neural NLGs started leveraging transfer learning and showed promising results in few-shot settings. This paper proposes AugNLG, a novel data augmentation approach that combines a self-trained neural retrieval model with a few-shot learned NLU model, to automatically create MR-to-Text data from open-domain texts. The proposed system mostly outperforms the state-of-the-art methods on the FewshotWOZ data in both BLEU and Slot Error Rate. We further confirm improved results on the FewshotSGD data and provide comprehensive analysis results on key components of our system. Our code and data are available at

pdf bib
A Dataset and Baselines for Multilingual Reply Suggestion
Mozhi Zhang | Wei Wang | Budhaditya Deb | Guoqing Zheng | Milad Shokouhi | Ahmed Hassan Awadallah

Reply suggestion models help users process emails and chats faster. Previous work only studies English reply suggestion. Instead, we present MRS, a multilingual reply suggestion dataset with ten languages. MRS can be used to compare two families of models : 1) retrieval models that select the reply from a fixed set and 2) generation models that produce the reply from scratch. Therefore, MRS complements existing cross-lingual generalization benchmarks that focus on classification and sequence labeling tasks. We build a generation model and a retrieval model as baselines for MRS. The two models have different strengths in the monolingual setting, and they require different strategies to generalize across languages. MRS is publicly available at

pdf bib
Align Voting Behavior with Public Statements for Legislator Representation Learning
Xinyi Mou | Zhongyu Wei | Lei Chen | Shangyi Ning | Yancheng He | Changjian Jiang | Xuanjing Huang

Ideology of legislators is typically estimated by ideal point models from historical records of votes. It represents legislators and legislation as points in a latent space and shows promising results for modeling voting behavior. However, it fails to capture more specific attitudes of legislators toward emerging issues and is unable to model newly-elected legislators without voting histories. In order to mitigate these two problems, we explore to incorporate both voting behavior and public statements on Twitter to jointly model legislators. In addition, we propose a novel task, namely hashtag usage prediction to model the ideology of legislators on Twitter. In practice, we construct a heterogeneous graph for the legislative context and use relational graph neural networks to learn the representation of legislators with the guidance of historical records of their voting and hashtag usage. Experiment results indicate that our model yields significant improvements for the task of roll call vote prediction. Further analysis further demonstrates that legislator representation we learned captures nuances in statements.

pdf bib
Attention Calibration for Transformer in Neural Machine Translation
Yu Lu | Jiali Zeng | Jiajun Zhang | Shuangzhi Wu | Mu Li

Attention mechanisms have achieved substantial improvements in neural machine translation by dynamically selecting relevant inputs for different predictions. However, recent studies have questioned the attention mechanisms’ capability for discovering decisive inputs. In this paper, we propose to calibrate the attention weights by introducing a mask perturbation model that automatically evaluates each input’s contribution to the model outputs. We increase the attention weights assigned to the indispensable tokens, whose removal leads to a dramatic performance decrease. The extensive experiments on the Transformer-based translation have demonstrated the effectiveness of our model. We further find that the calibrated attention weights are more uniform at lower layers to collect multiple information while more concentrated on the specific inputs at higher layers. Detailed analyses also show a great need for calibration in the attention weights with high entropy where the model is unconfident about its decision.

pdf bib
Towards Argument Mining for Social Good : A Survey
Eva Maria Vecchi | Neele Falk | Iman Jundi | Gabriella Lapesa

This survey builds an interdisciplinary picture of Argument Mining (AM), with a strong focus on its potential to address issues related to Social and Political Science. More specifically, we focus on AM challenges related to its applications to social media and in the multilingual domain, and then proceed to the widely debated notion of argument quality. We propose a novel definition of argument quality which is integrated with that of deliberative quality from the Social Science literature. Under our definition, the quality of a contribution needs to be assessed at multiple levels : the contribution itself, its preceding context, and the consequential effect on the development of the upcoming discourse. The latter has not received the deserved attention within the community. We finally define an application of AM for Social Good : (semi-)automatic moderation, a highly integrative application which (a) represents a challenging testbed for the integrated notion of quality we advocate, (b) allows the empirical quantification of argument / deliberative quality to benefit from the developments in other NLP fields (i.e. hate speech detection, fact checking, debiasing), and (c) has a clearly beneficial potential at the level of its societal thanks to its real-world application (even if extremely ambitious).

pdf bib
Factorising Meaning and Form for Intent-Preserving Paraphrasing
Tom Hosking | Mirella Lapata

We propose a method for generating paraphrases of English questions that retain the original intent but use a different surface form. Our model combines a careful choice of training objective with a principled information bottleneck, to induce a latent encoding space that disentangles meaning and form. We train an encoder-decoder model to reconstruct a question from a paraphrase with the same meaning and an exemplar with the same surface form, leading to separated encoding spaces. We use a Vector-Quantized Variational Autoencoder to represent the surface form as a set of discrete latent variables, allowing us to use a classifier to select a different surface form at test time. Crucially, our method does not require access to an external source of target exemplars. Extensive experiments and a human evaluation show that we are able to generate paraphrases with a better tradeoff between semantic preservation and syntactic novelty compared to previous methods.

pdf bib
AggGen : Ordering and Aggregating while GeneratingAggGen: Ordering and Aggregating while Generating
Xinnuo Xu | Ondřej Dušek | Verena Rieser | Ioannis Konstas

We present AggGen (pronounced ‘again’) a data-to-text model which re-introduces two explicit sentence planning stages into neural data-to-text systems : input ordering and input aggregation. In contrast to previous work using sentence planning, our model is still end-to-end : AggGen performs sentence planning at the same time as generating text by learning latent alignments (via semantic facts) between input representation and target text. Experiments on the WebNLG and E2E challenge data show that by using fact-based alignments our approach is more interpretable, expressive, robust to noise, and easier to control, while retaining the advantages of end-to-end systems in terms of fluency. Our code is available at

pdf bib
BACO : A Background Knowledge- and Content-Based Framework for Citing Sentence GenerationBACO: A Background Knowledge- and Content-Based Framework for Citing Sentence Generation
Yubin Ge | Ly Dinh | Xiaofeng Liu | Jinsong Su | Ziyao Lu | Ante Wang | Jana Diesner

In this paper, we focus on the problem of citing sentence generation, which entails generating a short text to capture the salient information in a cited paper and the connection between the citing and cited paper. We present BACO, a BAckground knowledge- and COntent-based framework for citing sentence generation, which considers two types of information : (1) background knowledge by leveraging structural information from a citation network ; and (2) content, which represents in-depth information about what to cite and why to cite. First, a citation network is encoded to provide background knowledge. Second, we apply salience estimation to identify what to cite by estimating the importance of sentences in the cited paper. During the decoding stage, both types of information are combined to facilitate the text generation, and then we conduct a joint training for the generator and citation function classification to make the model aware of why to cite. Our experimental results show that our framework outperforms comparative baselines.

pdf bib
Directed Acyclic Graph Network for Conversational Emotion Recognition
Weizhou Shen | Siyue Wu | Yunyi Yang | Xiaojun Quan

The modeling of conversational context plays a vital role in emotion recognition from conversation (ERC). In this paper, we put forward a novel idea of encoding the utterances with a directed acyclic graph (DAG) to better model the intrinsic structure within a conversation, and design a directed acyclic neural network, namely DAG-ERC, to implement this idea. In an attempt to combine the strengths of conventional graph-based neural models and recurrence-based neural models, DAG-ERC provides a more intuitive way to model the information flow between long-distance conversation background and nearby context. Extensive experiments are conducted on four ERC benchmarks with state-of-the-art models employed as baselines for comparison. The empirical results demonstrate the superiority of this new model and confirm the motivation of the directed acyclic graph architecture for ERC.

pdf bib
Topic-Driven and Knowledge-Aware Transformer for Dialogue Emotion Detection
Lixing Zhu | Gabriele Pergola | Lin Gui | Deyu Zhou | Yulan He

Emotion detection in dialogues is challenging as it often requires the identification of thematic topics underlying a conversation, the relevant commonsense knowledge, and the intricate transition patterns between the affective states. In this paper, we propose a Topic-Driven Knowledge-Aware Transformer to handle the challenges above. We firstly design a topic-augmented language model (LM) with an additional layer specialized for topic detection. The topic-augmented LM is then combined with commonsense statements derived from a knowledge base based on the dialogue contextual information. Finally, a transformer-based encoder-decoder architecture fuses the topical and commonsense information, and performs the emotion label sequence prediction. The model has been experimented on four datasets in dialogue emotion detection, demonstrating its superiority empirically over the existing state-of-the-art approaches. Quantitative and qualitative results show that the model can discover topics which help in distinguishing emotion categories.

pdf bib
Topic-Aware Evidence Reasoning and Stance-Aware Aggregation for Fact Verification
Jiasheng Si | Deyu Zhou | Tongzhe Li | Xingyu Shi | Yulan He

Fact verification is a challenging task that requires simultaneously reasoning and aggregating over multiple retrieved pieces of evidence to evaluate the truthfulness of a claim. Existing approaches typically (i) explore the semantic interaction between the claim and evidence at different granularity levels but fail to capture their topical consistency during the reasoning process, which we believe is crucial for verification ; (ii) aggregate multiple pieces of evidence equally without considering their implicit stances to the claim, thereby introducing spurious information. To alleviate the above issues, we propose a novel topic-aware evidence reasoning and stance-aware aggregation model for more accurate fact verification, with the following four key properties : 1) checking topical consistency between the claim and evidence ; 2) maintaining topical coherence among multiple pieces of evidence ; 3) ensuring semantic similarity between the global topic information and the semantic representation of evidence ; 4) aggregating evidence based on their implicit stances to the claim. Extensive experiments conducted on the two benchmark datasets demonstrate the superiority of the proposed model over several state-of-the-art approaches for fact verification. The source code can be obtained from

pdf bib
A Survey of Code-switching : Linguistic and Social Perspectives for Language Technologies
A. Seza Doğruöz | Sunayana Sitaram | Barbara E. Bullock | Almeida Jacqueline Toribio

The analysis of data in which multiple languages are represented has gained popularity among computational linguists in recent years. So far, much of this research focuses mainly on the improvement of computational methods and largely ignores linguistic and social aspects of C-S discussed across a wide range of languages within the long-established literature in linguistics. To fill this gap, we offer a survey of code-switching (C-S) covering the literature in linguistics with a reflection on the key issues in language technologies. From the linguistic perspective, we provide an overview of structural and functional patterns of C-S focusing on the literature from European and Indian contexts as highly multilingual areas. From the language technologies perspective, we discuss how massive language models fail to represent diverse C-S types due to lack of appropriate training data, lack of robust evaluation benchmarks for C-S (across multilingual situations and types of C-S) and lack of end-to- end systems that cover sociolinguistic aspects of C-S as well. Our survey will be a step to- wards an outcome of mutual benefit for computational scientists and linguists with a shared interest in multilingualism and C-S.

pdf bib
Dialogue Response Selection with Hierarchical Curriculum Learning
Yixuan Su | Deng Cai | Qingyu Zhou | Zibo Lin | Simon Baker | Yunbo Cao | Shuming Shi | Nigel Collier | Yan Wang

We study the learning of a matching model for dialogue response selection. Motivated by the recent finding that models trained with random negative samples are not ideal in real-world scenarios, we propose a hierarchical curriculum learning framework that trains the matching model in an easy-to-difficult scheme. Our learning framework consists of two complementary curricula : (1) corpus-level curriculum (CC) ; and (2) instance-level curriculum (IC). In CC, the model gradually increases its ability in finding the matching clues between the dialogue context and a response candidate. As for IC, it progressively strengthens the model’s ability in identifying the mismatching information between the dialogue context and a response candidate. Empirical studies on three benchmark datasets with three state-of-the-art matching models demonstrate that the proposed learning framework significantly improves the model performance across various evaluation metrics.

pdf bib
A Joint Model for Dropped Pronoun Recovery and Conversational Discourse Parsing in Chinese Conversational SpeechChinese Conversational Speech
Jingxuan Yang | Kerui Xu | Jun Xu | Si Li | Sheng Gao | Jun Guo | Nianwen Xue | Ji-Rong Wen

In this paper, we present a neural model for joint dropped pronoun recovery (DPR) and conversational discourse parsing (CDP) in Chinese conversational speech. We show that DPR and CDP are closely related, and a joint model benefits both tasks. We refer to our model as DiscProReco, and it first encodes the tokens in each utterance in a conversation with a directed Graph Convolutional Network (GCN). The token states for an utterance are then aggregated to produce a single state for each utterance. The utterance states are then fed into a biaffine classifier to construct a conversational discourse graph. A second (multi-relational) GCN is then applied to the utterance states to produce a discourse relation-augmented representation for the utterances, which are then fused together with token states in each utterance as input to a dropped pronoun recovery layer. The joint model is trained and evaluated on a new Structure Parsing-enhanced Dropped Pronoun Recovery (SPDPR) data set that we annotated with both two types of information. Experimental results on the SPDPR dataset and other benchmarks show that DiscProReco significantly outperforms the state-of-the-art baselines of both tasks.

pdf bib
Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data
Haoming Jiang | Danqing Zhang | Tianyu Cao | Bing Yin | Tuo Zhao

Weak supervision has shown promising results in many natural language processing tasks, such as Named Entity Recognition (NER). Existing work mainly focuses on learning deep NER models only with weak supervision, i.e., without any human annotation, and shows that by merely using weakly labeled data, one can achieve good performance, though still underperforms fully supervised NER with manually / strongly labeled data. In this paper, we consider a more practical scenario, where we have both a small amount of strongly labeled data and a large amount of weakly labeled data. Unfortunately, we observe that weakly labeled data does not necessarily improve, or even deteriorate the model performance (due to the extensive noise in the weak labels) when we train deep NER models over a simple or weighted combination of the strongly labeled and weakly labeled data. To address this issue, we propose a new multi-stage computational framework NEEDLE with three essential ingredients : (1) weak label completion, (2) noise-aware loss function, and (3) final fine-tuning over the strongly labeled data. Through experiments on E-commerce query NER and Biomedical NER, we demonstrate that NEEDLE can effectively suppress the noise of the weak labels and outperforms existing methods. In particular, we achieve new SOTA F1-scores on 3 Biomedical NER datasets : BC5CDR-chem 93.74, BC5CDR-disease 90.69, NCBI-disease 92.28.

pdf bib
Improving Named Entity Recognition by External Context Retrieving and Cooperative Learning
Xinyu Wang | Yong Jiang | Nguyen Bach | Tao Wang | Zhongqiang Huang | Fei Huang | Kewei Tu

Recent advances in Named Entity Recognition (NER) show that document-level contexts can significantly improve model performance. In many application scenarios, however, such contexts are not available. In this paper, we propose to find external contexts of a sentence by retrieving and selecting a set of semantically relevant texts through a search engine, with the original sentence as the query. We find empirically that the contextual representations computed on the retrieval-based input view, constructed through the concatenation of a sentence and its external contexts, can achieve significantly improved performance compared to the original input view based only on the sentence. Furthermore, we can improve the model performance of both input views by Cooperative Learning, a training method that encourages the two input views to produce similar contextual representations or output label distributions. Experiments show that our approach can achieve new state-of-the-art performance on 8 NER data sets across 5 domains.

pdf bib
Bird’s Eye : Probing for Linguistic Graph Structures with a Simple Information-Theoretic Approach
Yifan Hou | Mrinmaya Sachan

NLP has a rich history of representing our prior understanding of language in the form of graphs. Recent work on analyzing contextualized text representations has focused on hand-designed probe models to understand how and to what extent do these representations encode a particular linguistic phenomenon. However, due to the inter-dependence of various phenomena and randomness of training probe models, detecting how these representations encode the rich information in these linguistic graphs remains a challenging problem. In this paper, we propose a new information-theoretic probe, Bird’s Eye, which is a fairly simple probe method for detecting if and how these representations encode the information in these linguistic graphs. Instead of using model performance, our probe takes an information-theoretic view of probing and estimates the mutual information between the linguistic graph embedded in a continuous space and the contextualized word representations. Furthermore, we also propose an approach to use our probe to investigate localized linguistic information in the linguistic graphs using perturbation analysis. We call this probing setup Worm’s Eye. Using these probes, we analyze the BERT models on its ability to encode a syntactic and a semantic graph structure, and find that these models encode to some degree both syntactic as well as semantic information ; albeit syntactic information to a greater extent.

pdf bib
Bad Seeds : Evaluating Lexical Methods for Bias Measurement
Maria Antoniak | David Mimno

A common factor in bias measurement methods is the use of hand-curated seed lexicons, but there remains little guidance for their selection. We gather seeds used in prior work, documenting their common sources and rationales, and in case studies of three English-language corpora, we enumerate the different types of social biases and linguistic features that, once encoded in the seeds, can affect subsequent bias measurements. Seeds developed in one context are often re-used in other contexts, but documentation and evaluation remain necessary precursors to relying on seeds for sensitive measurements.

pdf bib
Intrinsic Bias Metrics Do Not Correlate with Application Bias
Seraphina Goldfarb-Tarrant | Rebecca Marchant | Ricardo Muñoz Sánchez | Mugdha Pandya | Adam Lopez

Natural Language Processing (NLP) systems learn harmful societal biases that cause them to amplify inequality as they are deployed in more and more situations. To guide efforts at debiasing these systems, the NLP community relies on a variety of metrics that quantify bias in models. Some of these metrics are intrinsic, measuring bias in word embedding spaces, and some are extrinsic, measuring bias in downstream tasks that the word embeddings enable. Do these intrinsic and extrinsic metrics correlate with each other? We compare intrinsic and extrinsic metrics across hundreds of trained models covering different tasks and experimental conditions. Our results show no reliable correlation between these metrics that holds in all scenarios across tasks and languages. We urge researchers working on debiasing to focus on extrinsic measures of bias, and to make using these measures more feasible via creation of new challenge sets and annotated test data. To aid this effort, we release code, a new intrinsic metric, and an annotated test set focused on gender bias in hate speech.

pdf bib
Crafting Adversarial Examples for Neural Machine Translation
Xinze Zhang | Junzhe Zhang | Zhenhua Chen | Kun He

Effective adversary generation for neural machine translation (NMT) is a crucial prerequisite for building robust machine translation systems. In this work, we investigate veritable evaluations of NMT adversarial attacks, and propose a novel method to craft NMT adversarial examples. We first show the current NMT adversarial attacks may be improperly estimated by the commonly used mono-directional translation, and we propose to leverage the round-trip translation technique to build valid metrics for evaluating NMT adversarial attacks. Our intuition is that an effective NMT adversarial example, which imposes minor shifting on the source and degrades the translation dramatically, would naturally lead to a semantic-destroyed round-trip translation result. We then propose a promising black-box attack method called Word Saliency speedup Local Search (WSLS) that could effectively attack the mainstream NMT architectures. Comprehensive experiments demonstrate that the proposed metrics could accurately evaluate the attack effectiveness, and the proposed WSLS could significantly break the state-of-art NMT models with small perturbation. Besides, WSLS exhibits strong transferability on attacking Baidu and Bing online translators.

pdf bib
Hierarchical Context-aware Network for Dense Video Event Captioning
Lei Ji | Xianglin Guo | Haoyang Huang | Xilin Chen

Dense video event captioning aims to generate a sequence of descriptive captions for each event in a long untrimmed video. Video-level context provides important information and facilities the model to generate consistent and less redundant captions between events. In this paper, we introduce a novel Hierarchical Context-aware Network for dense video event captioning (HCN) to capture context from various aspects. In detail, the model leverages local and global context with different mechanisms to jointly learn to generate coherent captions. The local context module performs full interaction between neighbor frames and the global context module selectively attends to previous or future events. According to our extensive experiment on both Youcook2 and Activitynet Captioning datasets, the video-level HCN model outperforms the event-level context-agnostic model by a large margin. The code is available at

pdf bib
Control Image Captioning Spatially and Temporally
Kun Yan | Lei Ji | Huaishao Luo | Ming Zhou | Nan Duan | Shuai Ma

Generating image captions with user intention is an emerging need. The recently published Localized Narratives dataset takes mouse traces as another input to the image captioning task, which is an intuitive and efficient way for a user to control what to describe in the image. However, how to effectively employ traces to improve generation quality and controllability is still under exploration. This paper aims to solve this problem by proposing a novel model called LoopCAG, which connects Contrastive constraints and Attention Guidance in a Loop manner, engaged explicit spatial and temporal constraints to the generating process. Precisely, each generated sentence is temporally aligned to the corresponding trace sequence through a contrastive learning strategy. Besides, each generated text token is supervised to attend to the correct visual objects under heuristic spatial attention guidance. Comprehensive experimental results demonstrate that our LoopCAG model learns better correspondence among the three modalities (vision, language, and traces) and achieves SOTA performance on trace-controlled image captioning task. Moreover, the controllability and explainability of LoopCAG are validated by analyzing spatial and temporal sensitivity during the generation process.

pdf bib
Edited Media Understanding Frames : Reasoning About the Intent and Implications of Visual Misinformation
Jeff Da | Maxwell Forbes | Rowan Zellers | Anthony Zheng | Jena D. Hwang | Antoine Bosselut | Yejin Choi

Understanding manipulated media, from automatically generated ‘deepfakes’ to manually edited ones, raises novel research challenges. Because the vast majority of edited or manipulated images are benign, such as photoshopped images for visual enhancements, the key challenge is to understand the complex layers of underlying intents of media edits and their implications with respect to disinformation. In this paper, we study Edited Media Frames, a new formalism to understand visual media manipulation as structured annotations with respect to the intents, emotional reactions, attacks on individuals, and the overall implications of disinformation. We introduce a dataset for our task, EMU, with 56k question-answer pairs written in rich natural language. We evaluate a wide variety of vision-and-language models for our task, and introduce a new model PELICAN, which builds upon recent progress in pretrained multimodal representations. Our model obtains promising results on our dataset, with humans rating its answers as accurate 48.2 % of the time. At the same time, there is still much work to be done and we provide analysis that highlights areas for further progress.

pdf bib
Modeling Fine-Grained Entity Types with Box Embeddings
Yasumasa Onoe | Michael Boratko | Andrew McCallum | Greg Durrett

Neural entity typing models typically represent fine-grained entity types as vectors in a high-dimensional space, but such spaces are not well-suited to modeling these types’ complex interdependencies. We study the ability of box embeddings, which embed concepts as d-dimensional hyperrectangles, to capture hierarchies of types even when these relationships are not defined explicitly in the ontology. Our model represents both types and entity mentions as boxes. Each mention and its context are fed into a BERT-based model to embed that mention in our box space ; essentially, this model leverages typological clues present in the surface text to hypothesize a type representation for the mention. Box containment can then be used to derive both the posterior probability of a mention exhibiting a given type and the conditional probability relations between types themselves. We compare our approach with a vector-based typing model and observe state-of-the-art performance on several entity typing benchmarks. In addition to competitive typing performance, our box-based model shows better performance in prediction consistency (predicting a supertype and a subtype together) and confidence (i.e., calibration), demonstrating that the box-based model captures the latent type hierarchies better than the vector-based model does.

pdf bib
Weight Distillation : Transferring the Knowledge in Neural Network Parameters
Ye Lin | Yanyang Li | Ziyang Wang | Bei Li | Quan Du | Tong Xiao | Jingbo Zhu

Knowledge distillation has been proven to be effective in model acceleration and compression. It transfers knowledge from a large neural network to a small one by using the large neural network predictions as targets of the small neural network. But this way ignores the knowledge inside the large neural networks, e.g., parameters. Our preliminary study as well as the recent success in pre-training suggests that transferring parameters are more effective in distilling knowledge. In this paper, we propose Weight Distillation to transfer the knowledge in parameters of a large neural network to a small neural network through a parameter generator. On the WMT16 En-Ro, NIST12 Zh-En, and WMT14 En-De machine translation tasks, our experiments show that weight distillation learns a small network that is 1.88 2.94x faster than the large network but with competitive BLEU performance. When fixing the size of small networks, weight distillation outperforms knowledge distillation by 0.51 1.82 BLEU points.

pdf bib
PlotCoder : Hierarchical Decoding for Synthesizing Visualization Code in Programmatic ContextPlotCoder: Hierarchical Decoding for Synthesizing Visualization Code in Programmatic Context
Xinyun Chen | Linyuan Gong | Alvin Cheung | Dawn Song

Creating effective visualization is an important part of data analytics. While there are many libraries for creating visualization, writing such code remains difficult given the myriad of parameters that users need to provide. In this paper, we propose the new task of synthesizing visualization programs from a combination of natural language utterances and code context. To tackle the learning problem, we introduce PlotCoder, a new hierarchical encoder-decoder architecture that models both the code context and the input utterance. We use PlotCoder to first determine the template of the visualization code, followed by predicting the data to be plotted. We use Jupyter notebooks containing visualization programs crawled from GitHub to train PlotCoder. On a comprehensive set of test samples from those notebooks, we show that PlotCoder correctly predicts the plot type of about 70 % samples, and synthesizes the correct programs for 35 % samples, performing 3-4.5 % better than the baselines.

pdf bib
Changing the World by Changing the Data
Anna Rogers

NLP community is currently investing a lot more research and resources into development of deep learning models than training data. While we have made a lot of progress, it is now clear that our models learn all kinds of spurious patterns, social biases, and annotation artifacts. Algorithmic solutions have so far had limited success. An alternative that is being actively discussed is more careful design of datasets so as to deliver specific signals. This position paper maps out the arguments for and against data curation, and argues that fundamentally the point is moot : curation already is and will be happening, and it is changing the world. The question is only how much thought we want to invest into that process.

pdf bib
On the Effectiveness of Adapter-based Tuning for Pretrained Language Model Adaptation
Ruidan He | Linlin Liu | Hai Ye | Qingyu Tan | Bosheng Ding | Liying Cheng | Jiawei Low | Lidong Bing | Luo Si

Adapter-based tuning has recently arisen as an alternative to fine-tuning. It works by adding light-weight adapter modules to a pretrained language model (PrLM) and only updating the parameters of adapter modules when learning on a downstream task. As such, it adds only a few trainable parameters per new task, allowing a high degree of parameter sharing. Prior studies have shown that adapter-based tuning often achieves comparable results to fine-tuning. However, existing work only focuses on the parameter-efficient aspect of adapter-based tuning while lacking further investigation on its effectiveness. In this paper, we study the latter. We first show that adapter-based tuning better mitigates forgetting issues than fine-tuning since it yields representations with less deviation from those generated by the initial PrLM. We then empirically compare the two tuning methods on several downstream NLP tasks and settings. We demonstrate that 1) adapter-based tuning outperforms fine-tuning on low-resource and cross-lingual tasks ; 2) it is more robust to overfitting and less sensitive to changes in learning rates.

pdf bib
Integrating Semantics and Neighborhood Information with Graph-Driven Generative Models for Document Retrieval
Zijing Ou | Qinliang Su | Jianxing Yu | Bang Liu | Jingwen Wang | Ruihui Zhao | Changyou Chen | Yefeng Zheng

With the need of fast retrieval speed and small memory footprint, document hashing has been playing a crucial role in large-scale information retrieval. To generate high-quality hashing code, both semantics and neighborhood information are crucial. However, most existing methods leverage only one of them or simply combine them via some intuitive criteria, lacking a theoretical principle to guide the integration process. In this paper, we encode the neighborhood information with a graph-induced Gaussian distribution, and propose to integrate the two types of information with a graph-driven generative model. To deal with the complicated correlations among documents, we further propose a tree-structured approximation method for learning. Under the approximation, we prove that the training objective can be decomposed into terms involving only singleton or pairwise documents, enabling the model to be trained as efficiently as uncorrelated ones. Extensive experimental results on three benchmark datasets show that our method achieves superior performance over state-of-the-art methods, demonstrating the effectiveness of the proposed model for simultaneously preserving semantic and neighborhood information.

pdf bib
KaggleDBQA : Realistic Evaluation of Text-to-SQL ParsersKaggleDBQA: Realistic Evaluation of Text-to-SQL Parsers
Chia-Hsuan Lee | Oleksandr Polozov | Matthew Richardson

The goal of database question answering is to enable natural language querying of real-life relational databases in diverse application domains. Recently, large-scale datasets such as Spider and WikiSQL facilitated novel modeling techniques for text-to-SQL parsing, improving zero-shot generalization to unseen databases. In this work, we examine the challenges that still prevent these techniques from practical deployment. First, we present KaggleDBQA, a new cross-domain evaluation dataset of real Web databases, with domain-specific data types, original formatting, and unrestricted questions. Second, we re-examine the choice of evaluation tasks for text-to-SQL parsers as applied in real-life settings. Finally, we augment our in-domain evaluation task with database documentation, a naturally occurring source of implicit domain knowledge. We show that KaggleDBQA presents a challenge to state-of-the-art zero-shot parsers but a more realistic evaluation setting and creative use of associated database documentation boosts their accuracy by over 13.2 %, doubling their performance.

pdf bib
Better than Average : Paired Evaluation of NLP systemsNLP systems
Maxime Peyrard | Wei Zhao | Steffen Eger | Robert West

Evaluation in NLP is usually done by comparing the scores of competing systems independently averaged over a common set of test instances. In this work, we question the use of averages for aggregating evaluation scores into a final number used to decide which system is best, since the average, as well as alternatives such as the median, ignores the pairing arising from the fact that systems are evaluated on the same test instances. We illustrate the importance of taking the instancelevel pairing of evaluation scores into account and demonstrate, both theoretically and empirically, the advantages of aggregation methods based on pairwise comparisons, such as the BradleyTerry (BT) model, a mechanism based on the estimated probability that a given system scores better than another on the test set. By re-evaluating 296 real NLP evaluation setups across four tasks and 18 evaluation metrics, we show that the choice of aggregation mechanism matters and yields different conclusions as to which systems are state of the art in about 30 % of the setups. To facilitate the adoption of pairwise evaluation, we release a practical tool for performing the full analysis of evaluation scores with the mean, median, BT, and two variants of BT (Elo and TrueSkill), alongside functionality for appropriate statistical testing.

pdf bib
Distributed Representations of Emotion Categories in Emotion Space
Xiangyu Wang | Chengqing Zong

Emotion category is usually divided into different ones by human beings, but it is indeed difficult to clearly distinguish and define the boundaries between different emotion categories. The existing studies working on emotion detection usually focus on how to improve the performance of model prediction, in which emotions are represented with one-hot vectors. However, emotion relations are ignored in one-hot representations. In this article, we first propose a general framework to learn the distributed representations for emotion categories in emotion space from a given emotion classification dataset. Furthermore, based on the soft labels predicted by the pre-trained neural network model, we derive a simple and effective algorithm. Experiments have validated that the proposed representations in emotion space can express emotion relations much better than word vectors in semantic space.

pdf bib
Style is NOT a single variable : Case Studies for Cross-Stylistic Language UnderstandingNOT a single variable: Case Studies for Cross-Stylistic Language Understanding
Dongyeop Kang | Eduard Hovy

Every natural text is written in some style. Style is formed by a complex combination of different stylistic factors, including formality markers, emotions, metaphors, etc. One can not form a complete understanding of a text without considering these factors. The factors combine and co-vary in complex ways to form styles. Studying the nature of the covarying combinations sheds light on stylistic language in general, sometimes called cross-style language understanding. This paper provides the benchmark corpus (XSLUE) that combines existing datasets and collects a new one for sentence-level cross-style language understanding and evaluation. The benchmark contains text in 15 different styles under the proposed four theoretical groupings : figurative, personal, affective, and interpersonal groups. For valid evaluation, we collect an additional diagnostic set by annotating all 15 styles on the same text. Using XSLUE, we propose three interesting cross-style applications in classification, correlation, and generation. First, our proposed cross-style classifier trained with multiple styles together helps improve overall classification performance against individually-trained style classifiers. Second, our study shows that some styles are highly dependent on each other in human-written text. Finally, we find that combinations of some contradictive styles likely generate stylistically less appropriate text. We believe our benchmark and case studies help explore interesting future directions for cross-style research. The preprocessed datasets and code are publicly available.

pdf bib
A Unified Generative Framework for Aspect-based Sentiment Analysis
Hang Yan | Junqi Dai | Tuo Ji | Xipeng Qiu | Zheng Zhang

Aspect-based Sentiment Analysis (ABSA) aims to identify the aspect terms, their corresponding sentiment polarities, and the opinion terms. There exist seven subtasks in ABSA. Most studies only focus on the subsets of these subtasks, which leads to various complicated ABSA models while hard to solve these subtasks in a unified framework. In this paper, we redefine every subtask target as a sequence mixed by pointer indexes and sentiment class indexes, which converts all ABSA subtasks into a unified generative formulation. Based on the unified formulation, we exploit the pre-training sequence-to-sequence model BART to solve all ABSA subtasks in an end-to-end framework. Extensive experiments on four ABSA datasets for seven subtasks demonstrate that our framework achieves substantial performance gain and provides a real unified end-to-end solution for the whole ABSA subtasks, which could benefit multiple tasks.

pdf bib
Discovering Dialogue Slots with Weak Supervision
Vojtěch Hudeček | Ondřej Dušek | Zhou Yu

Task-oriented dialogue systems typically require manual annotation of dialogue slots in training data, which is costly to obtain. We propose a method that eliminates this requirement : We use weak supervision from existing linguistic annotation models to identify potential slot candidates, then automatically identify domain-relevant slots by using clustering algorithms. Furthermore, we use the resulting slot annotation to train a neural-network-based tagger that is able to perform slot tagging with no human intervention. This tagger is trained solely on the outputs of our method and thus does not rely on any labeled data. Our model demonstrates state-of-the-art performance in slot tagging without labeled training data on four different dialogue domains. Moreover, we find that slot annotations discovered by our model significantly improve the performance of an end-to-end dialogue response generation model, compared to using no slot annotation at all.

pdf bib
PROTAUGMENT : Unsupervised diverse short-texts paraphrasing for intent detection meta-learningPROTAUGMENT: Unsupervised diverse short-texts paraphrasing for intent detection meta-learning
Thomas Dopierre | Christophe Gravier | Wilfried Logerais

Recent research considers few-shot intent detection as a meta-learning problem : the model is learning to learn from a consecutive set of small tasks named episodes. In this work, we propose ProtAugment, a meta-learning algorithm for short texts classification (the intent detection task). ProtAugment is a novel extension of Prototypical Networks, that limits overfitting on the bias introduced by the few-shots classification objective at each episode. It relies on diverse paraphrasing : a conditional language model is first fine-tuned for paraphrasing, and diversity is later introduced at the decoding stage at each meta-learning episode. The diverse paraphrasing is unsupervised as it is applied to unlabelled data, and then fueled to the Prototypical Network training objective as a consistency loss. ProtAugment is the state-of-the-art method for intent detection meta-learning, at no extra labeling efforts and without the need to fine-tune a conditional language model on a given application domain.

pdf bib
Towards Robustness of Text-to-SQL Models against Synonym SubstitutionSQL Models against Synonym Substitution
Yujian Gan | Xinyun Chen | Qiuping Huang | Matthew Purver | John R. Woodward | Jinxia Xie | Pengsheng Huang

Recently, there has been significant progress in studying neural networks to translate text descriptions into SQL queries. Despite achieving good performance on some public benchmarks, existing text-to-SQL models typically rely on the lexical matching between words in natural language (NL) questions and tokens in table schemas, which may render the models vulnerable to attacks that break the schema linking mechanism. In this work, we investigate the robustness of text-to-SQL models to synonym substitution. In particular, we introduce Spider-Syn, a human-curated dataset based on the Spider benchmark for text-to-SQL translation. NL questions in Spider-Syn are modified from Spider, by replacing their schema-related words with manually selected synonyms that reflect real-world question paraphrases. We observe that the accuracy dramatically drops by eliminating such explicit correspondence between NL questions and table schemas, even if the synonyms are not adversarially selected to conduct worst-case attacks. Finally, we present two categories of approaches to improve the model robustness. The first category of approaches utilizes additional synonym annotations for table schemas by modifying the model input, while the second category is based on adversarial training. We demonstrate that both categories of approaches significantly outperform their counterparts without the defense, and the first category of approaches are more effective.

pdf bib
LGESQL : Line Graph Enhanced Text-to-SQL Model with Mixed Local and Non-Local RelationsLGESQL: Line Graph Enhanced Text-to-SQL Model with Mixed Local and Non-Local Relations
Ruisheng Cao | Lu Chen | Zhi Chen | Yanbin Zhao | Su Zhu | Kai Yu

This work aims to tackle the challenging heterogeneous graph encoding problem in the text-to-SQL task. Previous methods are typically node-centric and merely utilize different weight matrices to parameterize edge types, which 1) ignore the rich semantics embedded in the topological structure of edges, and 2) fail to distinguish local and non-local relations for each node. To this end, we propose a Line Graph Enhanced Text-to-SQL (LGESQL) model to mine the underlying relational features without constructing meta-paths. By virtue of the line graph, messages propagate more efficiently through not only connections between nodes, but also the topology of directed edges. Furthermore, both local and non-local relations are integrated distinctively during the graph iteration. We also design an auxiliary task called graph pruning to improve the discriminative capability of the encoder. Our framework achieves state-of-the-art results (62.8 % with Glove, 72.0 % with Electra) on the cross-domain text-to-SQL benchmark Spider at the time of writing.

pdf bib
Multi-stage Pre-training over Simplified Multimodal Pre-training Models
Tongtong Liu | Fangxiang Feng | Xiaojie Wang

Multimodal pre-training models, such as LXMERT, have achieved excellent results in downstream tasks. However, current pre-trained models require large amounts of training data and have huge model sizes, which make them impossible to apply in low-resource situations. How to obtain similar or even better performance than a larger model under the premise of less pre-training data and smaller model size has become an important problem. In this paper, we propose a new Multi-stage Pre-training (MSP) method, which uses information at different granularities from word, phrase to sentence in both texts and images to pre-train a model in stages. We also design several different pre-training tasks suitable for the information granularity in different stage in order to efficiently capture the diverse knowledge from a limited corpus. We take a Simplified LXMERT (LXMERT-S) which is with 45.9 % parameters of the original LXMERT model and only 11.44 % of the original pre-training data as the testbed of our MSP method. Experimental results show that our method achieves comparable performance to the original LXMERT model in all downstream tasks, and even outperforms the original model in Image-Text Retrieval task.

pdf bib
Beyond Sentence-Level End-to-End Speech Translation : Context Helps
Biao Zhang | Ivan Titov | Barry Haddow | Rico Sennrich

Document-level contextual information has shown benefits to text-based machine translation, but whether and how context helps end-to-end (E2E) speech translation (ST) is still under-studied. We fill this gap through extensive experiments using a simple concatenation-based context-aware ST model, paired with adaptive feature selection on speech encodings for computational efficiency. We investigate several decoding approaches, and introduce in-model ensemble decoding which jointly performs document- and sentence-level translation using the same model. Our results on the MuST-C benchmark with Transformer demonstrate the effectiveness of context to E2E ST. Compared to sentence-level ST, context-aware ST obtains better translation quality (+0.18-2.61 BLEU), improves pronoun and homophone translation, shows better robustness to (artificial) audio segmentation errors, and reduces latency and flicker to deliver higher quality for simultaneous translation.

pdf bib
Missing Modality Imagination Network for Emotion Recognition with Uncertain Missing Modalities
Jinming Zhao | Ruichen Li | Qin Jin

Multimodal fusion has been proved to improve emotion recognition performance in previous works. However, in real-world applications, we often encounter the problem of missing modality, and which modalities will be missing is uncertain. It makes the fixed multimodal fusion fail in such cases. In this work, we propose a unified model, Missing Modality Imagination Network (MMIN), to deal with the uncertain missing modality problem. MMIN learns robust joint multimodal representations, which can predict the representation of any missing modality given available modalities under different missing modality conditions. Comprehensive experiments on two benchmark datasets demonstrate that the unified MMIN model significantly improves emotion recognition performance under both uncertain missing-modality testing conditions and full-modality ideal testing condition. The code will be available at

pdf bib
Stacked Acoustic-and-Textual Encoding : Integrating the Pre-trained Models into Speech Translation Encoders
Chen Xu | Bojie Hu | Yanyang Li | Yuhao Zhang | Shen Huang | Qi Ju | Tong Xiao | Jingbo Zhu

Encoder pre-training is promising in end-to-end Speech Translation (ST), given the fact that speech-to-translation data is scarce. But ST encoders are not simple instances of Automatic Speech Recognition (ASR) or Machine Translation (MT) encoders. For example, we find that ASR encoders lack the global context representation, which is necessary for translation, whereas MT encoders are not designed to deal with long but locally attentive acoustic sequences. In this work, we propose a Stacked Acoustic-and-Textual Encoding (SATE) method for speech translation. Our encoder begins with processing the acoustic sequence as usual, but later behaves more like an MT encoder for a global representation of the input sequence. In this way, it is straightforward to incorporate the pre-trained models into the system. Also, we develop an adaptor module to alleviate the representation inconsistency between the pre-trained ASR encoder and MT encoder, and develop a multi-teacher knowledge distillation method to preserve the pre-training knowledge. Experimental results on the LibriSpeech En-Fr and MuST-C En-De ST tasks show that our method achieves state-of-the-art BLEU scores of 18.3 and 25.2. To our knowledge, we are the first to develop an end-to-end ST system that achieves comparable or even better BLEU performance than the cascaded ST counterpart when large-scale ASR and MT data is available.

pdf bib
N-ary Constituent Tree Parsing with Recursive Semi-Markov ModelMarkov Model
Xin Xin | Jinlong Li | Zeqi Tan

In this paper, we study the task of graph-based constituent parsing in the setting that binarization is not conducted as a pre-processing step, where a constituent tree may consist of nodes with more than two children. Previous graph-based methods on this setting typically generate hidden nodes with the dummy label inside the n-ary nodes, in order to transform the tree into a binary tree for prediction. The limitation is that the hidden nodes break the sibling relations of the n-ary node’s children. Consequently, the dependencies of such sibling constituents might not be accurately modeled and is being ignored. To solve this limitation, we propose a novel graph-based framework, which is called recursive semi-Markov model. The main idea is to utilize 1-order semi-Markov model to predict the immediate children sequence of a constituent candidate, which then recursively serves as a child candidate of its parent. In this manner, the dependencies of sibling constituents can be described by 1-order transition features, which solves the above limitation. Through experiments, the proposed framework obtains the F1 of 95.92 % and 92.50 % on the datasets of PTB and CTB 5.1 respectively. Specially, the recursive semi-Markov model shows advantages in modeling nodes with more than two children, whose average F1 can be improved by 0.3-1.1 points in PTB and 2.3-6.8 points in CTB 5.1.

pdf bib
Automated Concatenation of Embeddings for Structured Prediction
Xinyu Wang | Yong Jiang | Nguyen Bach | Tao Wang | Zhongqiang Huang | Fei Huang | Kewei Tu

Pretrained contextualized embeddings are powerful word representations for structured prediction tasks. Recent work found that better word representations can be obtained by concatenating different types of embeddings. However, the selection of embeddings to form the best concatenated representation usually varies depending on the task and the collection of candidate embeddings, and the ever-increasing number of embedding types makes it a more difficult problem. In this paper, we propose Automated Concatenation of Embeddings (ACE) to automate the process of finding better concatenations of embeddings for structured prediction tasks, based on a formulation inspired by recent progress on neural architecture search. Specifically, a controller alternately samples a concatenation of embeddings, according to its current belief of the effectiveness of individual embedding types in consideration for a task, and updates the belief based on a reward. We follow strategies in reinforcement learning to optimize the parameters of the controller and compute the reward based on the accuracy of a task model, which is fed with the sampled concatenation as input and trained on a task dataset. Empirical results on 6 tasks and 21 datasets show that our approach outperforms strong baselines and achieves state-of-the-art performance with fine-tuned embeddings in all the evaluations.

pdf bib
Multi-View Cross-Lingual Structured Prediction with Minimum Supervision
Zechuan Hu | Yong Jiang | Nguyen Bach | Tao Wang | Zhongqiang Huang | Fei Huang | Kewei Tu

In structured prediction problems, cross-lingual transfer learning is an efficient way to train quality models for low-resource languages, and further improvement can be obtained by learning from multiple source languages. However, not all source models are created equal and some may hurt performance on the target language. Previous work has explored the similarity between source and target sentences as an approximate measure of strength for different source models. In this paper, we propose a multi-view framework, by leveraging a small number of labeled target sentences, to effectively combine multiple source models into an aggregated source view at different granularity levels (language, sentence, or sub-structure), and transfer it to a target view based on a task-specific model. By encouraging the two views to interact with each other, our framework can dynamically adjust the confidence level of each source model and improve the performance of both views during training. Experiments for three structured prediction tasks on sixteen data sets show that our framework achieves significant improvement over all existing approaches, including these with access to additional source language data.

pdf bib
Assessing the Representations of Idiomaticity in Vector Models with a Noun Compound Dataset Labeled at Type and Token Levels
Marcos Garcia | Tiago Kramer Vieira | Carolina Scarton | Marco Idiart | Aline Villavicencio

Accurate assessment of the ability of embedding models to capture idiomaticity may require evaluation at token rather than type level, to account for degrees of idiomaticity and possible ambiguity between literal and idiomatic usages. However, most existing resources with annotation of idiomaticity include ratings only at type level. This paper presents the Noun Compound Type and Token Idiomaticity (NCTTI) dataset, with human annotations for 280 noun compounds in English and 180 in Portuguese at both type and token level. We compiled 8,725 and 5,091 token level annotations for English and Portuguese, respectively, which are strongly correlated with the corresponding scores obtained at type level. The NCTTI dataset is used to explore how vector space models reflect the variability of idiomaticity across sentences. Several experiments using state-of-the-art contextualised models suggest that their representations are not capturing the noun compounds idiomaticity as human annotators. This new multilingual resource also contains suggestions for paraphrases of the noun compounds both at type and token levels, with uses for lexical substitution or disambiguation in context.

pdf bib
Locate and Label : A Two-stage Identifier for Nested Named Entity Recognition
Yongliang Shen | Xinyin Ma | Zeqi Tan | Shuai Zhang | Wen Wang | Weiming Lu

Named entity recognition (NER) is a well-studied task in natural language processing. Traditional NER research only deals with flat entities and ignores nested entities. The span-based methods treat entity recognition as a span classification task. Although these methods have the innate ability to handle nested NER, they suffer from high computational cost, ignorance of boundary information, under-utilization of the spans that partially match with entities, and difficulties in long entity recognition. To tackle these issues, we propose a two-stage entity identifier. First we generate span proposals by filtering and boundary regression on the seed spans to locate the entities, and then label the boundary-adjusted span proposals with the corresponding categories. Our method effectively utilizes the boundary information of entities and partially matched spans during training. Through boundary regression, entities of any length can be covered theoretically, which improves the ability to recognize long entities. In addition, many low-quality seed spans are filtered out in the first stage, which reduces the time complexity of inference. Experiments on nested NER datasets demonstrate that our proposed method outperforms previous state-of-the-art models.

pdf bib
Text2Event : Controllable Sequence-to-Structure Generation for End-to-end Event ExtractionText2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction
Yaojie Lu | Hongyu Lin | Jin Xu | Xianpei Han | Jialong Tang | Annan Li | Le Sun | Meng Liao | Shaoyi Chen

Event extraction is challenging due to the complex structure of event records and the semantic gap between text and event. Traditional methods usually extract event records by decomposing the complex structure prediction task into multiple subtasks. In this paper, we propose Text2Event, a sequence-to-structure generation paradigm that can directly extract events from the text in an end-to-end manner. Specifically, we design a sequence-to-structure network for unified event extraction, a constrained decoding algorithm for event knowledge injection during inference, and a curriculum learning algorithm for efficient model learning. Experimental results show that, by uniformly modeling all tasks in a single model and universally predicting different labels, our method can achieve competitive performance using only record-level annotations in both supervised learning and transfer learning settings.

pdf bib
A Neural Transition-based Joint Model for Disease Named Entity Recognition and Normalization
Zongcheng Ji | Tian Xia | Mei Han | Jing Xiao

Disease is one of the fundamental entities in biomedical research. Recognizing such entities from biomedical text and then normalizing them to a standardized disease vocabulary offer a tremendous opportunity for many downstream applications. Previous studies have demonstrated that joint modeling of the two sub-tasks has superior performance than the pipelined counterpart. Although the neural joint model based on multi-task learning framework has achieved state-of-the-art performance, it suffers from the boundary inconsistency problem due to the separate decoding procedures. Moreover, it ignores the rich information (e.g., the text surface form) of each candidate concept in the vocabulary, which is quite essential for entity normalization. In this work, we propose a neural transition-based joint model to alleviate these two issues. We transform the end-to-end disease recognition and normalization task as an action sequence prediction task, which not only jointly learns the model with shared representations of the input, but also jointly searches the output by state transitions in one search space. Moreover, we introduce attention mechanisms to take advantage of the text surface form of each candidate concept for better normalization performance. Experimental results conducted on two publicly available datasets show the effectiveness of the proposed method.

pdf bib
Cascade versus Direct Speech Translation : Do the Differences Still Make a Difference?
Luisa Bentivogli | Mauro Cettolo | Marco Gaido | Alina Karakanta | Alberto Martinelli | Matteo Negri | Marco Turchi

Five years after the first published proofs of concept, direct approaches to speech translation (ST) are now competing with traditional cascade solutions. In light of this steady progress, can we claim that the performance gap between the two is closed? Starting from this question, we present a systematic comparison between state-of-the-art systems representative of the two paradigms. Focusing on three language directions (English-German / Italian / Spanish), we conduct automatic and manual evaluations, exploiting high-quality professional post-edits and annotations. Our multi-faceted analysis on one of the few publicly available ST benchmarks attests for the first time that : i) the gap between the two paradigms is now closed, and ii) the subtle differences observed in their behavior are not sufficient for humans neither to distinguish them nor to prefer one over the other.

pdf bib
ERNIE-Doc : A Retrospective Long-Document Modeling TransformerERNIE-Doc: A Retrospective Long-Document Modeling Transformer
SiYu Ding | Junyuan Shang | Shuohuan Wang | Yu Sun | Hao Tian | Hua Wu | Haifeng Wang

Transformers are not suited for processing long documents, due to their quadratically increasing memory and time consumption. Simply truncating a long document or applying the sparse attention mechanism will incur the context fragmentation problem or lead to an inferior modeling capability against comparable model sizes. In this paper, we propose ERNIE-Doc, a document-level language pretraining model based on Recurrence Transformers. Two well-designed techniques, namely the retrospective feed mechanism and the enhanced recurrence mechanism, enable ERNIE-Doc, which has a much longer effective context length, to capture the contextual information of a complete document. We pretrain ERNIE-Doc to explicitly learn the relationships among segments with an additional document-aware segment-reordering objective. Various experiments were conducted on both English and Chinese document-level tasks. ERNIE-Doc improved the state-of-the-art language modeling result of perplexity to 16.8 on WikiText-103. Moreover, it outperformed competitive pretraining models by a large margin on most language understanding tasks, such as text classification and question answering.

pdf bib
Marginal Utility Diminishes : Exploring the Minimum Knowledge for BERT Knowledge DistillationBERT Knowledge Distillation
Yuanxin Liu | Fandong Meng | Zheng Lin | Weiping Wang | Jie Zhou

Recently, knowledge distillation (KD) has shown great success in BERT compression. Instead of only learning from the teacher’s soft label as in conventional KD, researchers find that the rich information contained in the hidden layers of BERT is conducive to the student’s performance. To better exploit the hidden knowledge, a common practice is to force the student to deeply mimic the teacher’s hidden states of all the tokens in a layer-wise manner. In this paper, however, we observe that although distilling the teacher’s hidden state knowledge (HSK) is helpful, the performance gain (marginal utility) diminishes quickly as more HSK is distilled. To understand this effect, we conduct a series of analysis. Specifically, we divide the HSK of BERT into three dimensions, namely depth, length and width. We first investigate a variety of strategies to extract crucial knowledge for each single dimension and then jointly compress the three dimensions. In this way, we show that 1) the student’s performance can be improved by extracting and distilling the crucial HSK, and 2) using a tiny fraction of HSK can achieve the same performance as extensive HSK distillation. Based on the second finding, we further propose an efficient KD paradigm to compress BERT, which does not require loading the teacher during the training of student. For two kinds of student models and computing devices, the proposed KD paradigm gives rise to training speedup of 2.7x 3.4x.

pdf bib
Rational LAMOL : A Rationale-based Lifelong Learning FrameworkLAMOL: A Rationale-based Lifelong Learning Framework
Kasidis Kanwatchara | Thanapapas Horsuwan | Piyawat Lertvittayakumjorn | Boonserm Kijsirikul | Peerapon Vateekul

Lifelong learning (LL) aims to train a neural network on a stream of tasks while retaining knowledge from previous tasks. However, many prior attempts in NLP still suffer from the catastrophic forgetting issue, where the model completely forgets what it just learned in the previous tasks. In this paper, we introduce Rational LAMOL, a novel end-to-end LL framework for language models. In order to alleviate catastrophic forgetting, Rational LAMOL enhances LAMOL, a recent LL model, by applying critical freezing guided by human rationales. When the human rationales are not available, we propose exploiting unsupervised generated rationales as substitutions. In the experiment, we tested Rational LAMOL on permutations of three datasets from the ERASER benchmark. The results show that our proposed framework outperformed vanilla LAMOL on most permutations. Furthermore, unsupervised rationale generation was able to consistently improve the overall LL performance from the baseline without relying on human-annotated rationales.

pdf bib
LeeBERT : Learned Early Exit for BERT with cross-level optimizationLeeBERT: Learned Early Exit for BERT with cross-level optimization
Wei Zhu

Pre-trained language models like BERT are performant in a wide range of natural language tasks. However, they are resource exhaustive and computationally expensive for industrial scenarios. Thus, early exits are adopted at each layer of BERT to perform adaptive computation by predicting easier samples with the first few layers to speed up the inference. In this work, to improve efficiency without performance drop, we propose a novel training scheme called Learned Early Exit for BERT (LeeBERT). First, we ask each exit to learn from each other, rather than learning only from the last layer. Second, the weights of different loss terms are learned, thus balancing off different objectives. We formulate the optimization of LeeBERT as a bi-level optimization problem, and we propose a novel cross-level optimization (CLO) algorithm to improve the optimization results. Experiments on the GLUE benchmark show that our proposed methods improve the performance of the state-of-the-art (SOTA) early exit methods for pre-trained models.

pdf bib
Unsupervised Extractive Summarization-Based Representations for Accurate and Explainable Collaborative Filtering
Reinald Adrian Pugoy | Hung-Yu Kao

We pioneer the first extractive summarization-based collaborative filtering model called ESCOFILT. Our proposed model specifically produces extractive summaries for each item and user. Unlike other types of explanations, summary-level explanations closely resemble real-life explanations. The strength of ESCOFILT lies in the fact that it unifies representation and explanation. In other words, extractive summaries both represent and explain the items and users. Our model uniquely integrates BERT, K-Means embedding clustering, and multilayer perceptron to learn sentence embeddings, representation-explanations, and user-item interactions, respectively. We argue that our approach enhances both rating prediction accuracy and user / item explainability. Our experiments illustrate that ESCOFILT’s prediction accuracy is better than the other state-of-the-art recommender models. Furthermore, we propose a comprehensive set of criteria that assesses the real-life explainability of explanations. Our explainability study demonstrates the superiority of and preference for summary-level explanations over other explanation types.

pdf bib
Competence-based Multimodal Curriculum Learning for Medical Report Generation
Fenglin Liu | Shen Ge | Xian Wu

Medical report generation task, which targets to produce long and coherent descriptions of medical images, has attracted growing research interests recently. Different from the general image captioning tasks, medical report generation is more challenging for data-driven neural models. This is mainly due to 1) the serious data bias and 2) the limited medical data. To alleviate the data bias and make best use of available data, we propose a Competence-based Multimodal Curriculum Learning framework (CMCL). Specifically, CMCL simulates the learning process of radiologists and optimizes the model in a step by step manner. Firstly, CMCL estimates the difficulty of each training instance and evaluates the competence of current model ; Secondly, CMCL selects the most suitable batch of training instances considering current model competence. By iterating above two steps, CMCL can gradually improve the model’s performance. The experiments on the public IU-Xray and MIMIC-CXR datasets show that CMCL can be incorporated into existing models to improve their performance.

pdf bib
Meta-KD : A Meta Knowledge Distillation Framework for Language Model Compression across DomainsKD: A Meta Knowledge Distillation Framework for Language Model Compression across Domains
Haojie Pan | Chengyu Wang | Minghui Qiu | Yichang Zhang | Yaliang Li | Jun Huang

Pre-trained language models have been applied to various NLP tasks with considerable performance gains. However, the large model sizes, together with the long inference time, limit the deployment of such models in real-time applications. One line of model compression approaches considers knowledge distillation to distill large teacher models into small student models. Most of these studies focus on single-domain only, which ignores the transferable knowledge from other domains. We notice that training a teacher with transferable knowledge digested across domains can achieve better generalization capability to help knowledge distillation. Hence we propose a Meta-Knowledge Distillation (Meta-KD) framework to build a meta-teacher model that captures transferable knowledge across domains and passes such knowledge to students. Specifically, we explicitly force the meta-teacher to capture transferable knowledge at both instance-level and feature-level from multiple domains, and then propose a meta-distillation algorithm to learn single-domain student models with guidance from the meta-teacher. Experiments on public multi-domain NLP tasks show the effectiveness and superiority of the proposed Meta-KD framework. Further, we also demonstrate the capability of Meta-KD in the settings where the training data is scarce.

pdf bib
A Semantic-based Method for Unsupervised Commonsense Question Answering
Yilin Niu | Fei Huang | Jiaming Liang | Wenkai Chen | Xiaoyan Zhu | Minlie Huang

Unsupervised commonsense question answering is appealing since it does not rely on any labeled task data. Among existing work, a popular solution is to use pre-trained language models to score candidate choices directly conditioned on the question or context. However, such scores from language models can be easily affected by irrelevant factors, such as word frequencies, sentence structures, etc. These distracting factors may not only mislead the model to choose a wrong answer but also make it oversensitive to lexical perturbations in candidate answers. In this paper, we present a novel SEmantic-based Question Answering method (SEQA) for unsupervised commonsense question answering. Instead of directly scoring each answer choice, our method first generates a set of plausible answers with generative models (e.g., GPT-2), and then uses these plausible answers to select the correct choice by considering the semantic similarity between each plausible answer and each choice. We devise a simple, yet sound formalism for this idea and verify its effectiveness and robustness with extensive experiments. We evaluate the proposed method on four benchmark datasets, and our method achieves the best results in unsupervised settings. Moreover, when attacked by TextFooler with synonym replacement, SEQA demonstrates much less performance drops than baselines, thereby indicating stronger robustness.

pdf bib
Online Learning Meets Machine Translation Evaluation : Finding the Best Systems with the Least Human EffortOnline Learning Meets Machine Translation Evaluation: Finding the Best Systems with the Least Human Effort
Vânia Mendonça | Ricardo Rei | Luisa Coheur | Alberto Sardinha | Ana Lúcia Santos

In Machine Translation, assessing the quality of a large amount of automatic translations can be challenging. Automatic metrics are not reliable when it comes to high performing systems. In addition, resorting to human evaluators can be expensive, especially when evaluating multiple systems. To overcome the latter challenge, we propose a novel application of online learning that, given an ensemble of Machine Translation systems, dynamically converges to the best systems, by taking advantage of the human feedback available. Our experiments on WMT’19 datasets show that our online approach quickly converges to the top-3 ranked systems for the language pairs considered, despite the lack of human feedback for many translations.

pdf bib
How Good is Your Tokenizer? On the Monolingual Performance of Multilingual Language Models
Phillip Rust | Jonas Pfeiffer | Ivan Vulić | Sebastian Ruder | Iryna Gurevych

In this work, we provide a systematic and comprehensive empirical comparison of pretrained multilingual language models versus their monolingual counterparts with regard to their monolingual task performance. We study a set of nine typologically diverse languages with readily available pretrained monolingual models on a set of five diverse monolingual downstream tasks. We first aim to establish, via fair and controlled comparisons, if a gap between the multilingual and the corresponding monolingual representation of that language exists, and subsequently investigate the reason for any performance difference. To disentangle conflating factors, we train new monolingual models on the same data, with monolingually and multilingually trained tokenizers. We find that while the pretraining data size is an important factor, a designated monolingual tokenizer plays an equally important role in the downstream performance. Our results show that languages that are adequately represented in the multilingual model’s vocabulary exhibit negligible performance decreases over their monolingual counterparts. We further find that replacing the original multilingual tokenizer with the specialized monolingual tokenizer improves the downstream performance of the multilingual model for almost every task and language.

pdf bib
Human-in-the-Loop for Data Collection : a Multi-Target Counter Narrative Dataset to Fight Online Hate Speech
Margherita Fanton | Helena Bonaldi | Serra Sinem Tekiroğlu | Marco Guerini

Undermining the impact of hateful content with informed and non-aggressive responses, called counter narratives, has emerged as a possible solution for having healthier online communities. Thus, some NLP studies have started addressing the task of counter narrative generation. Although such studies have made an effort to build hate speech / counter narrative (HS / CN) datasets for neural generation, they fall short in reaching either high-quality and/or high-quantity. In this paper, we propose a novel human-in-the-loop data collection methodology in which a generative language model is refined iteratively by using its own data from the previous loops to generate new training samples that experts review and/or post-edit. Our experiments comprised several loops including diverse dynamic variations. Results show that the methodology is scalable and facilitates diverse, novel, and cost-effective data collection. To our knowledge, the resulting dataset is the only expert-based multi-target HS / CN dataset available to the community.

pdf bib
Joint Models for Answer Verification in Question Answering Systems
Zeyu Zhang | Thuy Vu | Alessandro Moschitti

This paper studies joint models for selecting correct answer sentences among the top k provided by answer sentence selection (AS2) modules, which are core components of retrieval-based Question Answering (QA) systems. Our work shows that a critical step to effectively exploiting an answer set regards modeling the interrelated information between pair of answers. For this purpose, we build a three-way multi-classifier, which decides if an answer supports, refutes, or is neutral with respect to another one. More specifically, our neural architecture integrates a state-of-the-art AS2 module with the multi-classifier, and a joint layer connecting all components. We tested our models on WikiQA, TREC-QA, and a real-world dataset. The results show that our models obtain the new state of the art in AS2.

pdf bib
TAT-QA : A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in FinanceTAT-QA: A Question Answering Benchmark on a Hybrid of Tabular and Textual Content in Finance
Fengbin Zhu | Wenqiang Lei | Youcheng Huang | Chao Wang | Shuo Zhang | Jiancheng Lv | Fuli Feng | Tat-Seng Chua

Hybrid data combining both tabular and textual content (e.g., financial reports) are quite pervasive in the real world. However, Question Answering (QA) over such hybrid data is largely neglected in existing research. In this work, we extract samples from real financial reports to build a new large-scale QA dataset containing both Tabular And Textual data, named TAT-QA, where numerical reasoning is usually required to infer the answer, such as addition, subtraction, multiplication, division, counting, comparison / sorting, and the compositions. We further propose a novel QA model termed TAGOP, which is capable of reasoning over both tables and text. It adopts sequence tagging to extract relevant cells from the table along with relevant spans from the text to infer their semantics, and then applies symbolic reasoning over them with a set of aggregation operators to arrive at the final answer. TAGOP achieves 58.0 % inF1, which is an 11.1 % absolute increase over the previous best baseline model, according to our experiments on TAT-QA. But this result still lags far behind performance of expert human, i.e.90.8 % in F1. It is demonstrated that our TAT-QA is very challenging and can serve as a benchmark for training and testing powerful QA models that address hybrid form data.

pdf bib
Position Bias Mitigation : A Knowledge-Aware Graph Model for Emotion Cause Extraction
Hanqi Yan | Lin Gui | Gabriele Pergola | Yulan He

The Emotion Cause Extraction (ECE) task aims to identify clauses which contain emotion-evoking information for a particular emotion expressed in text. We observe that a widely-used ECE dataset exhibits a bias that the majority of annotated cause clauses are either directly before their associated emotion clauses or are the emotion clauses themselves. Existing models for ECE tend to explore such relative position information and suffer from the dataset bias. To investigate the degree of reliance of existing ECE models on clause relative positions, we propose a novel strategy to generate adversarial examples in which the relative position information is no longer the indicative feature of cause clauses. We test the performance of existing models on such adversarial examples and observe a significant performance drop. To address the dataset bias, we propose a novel graph-based method to explicitly model the emotion triggering paths by leveraging the commonsense knowledge to enhance the semantic dependencies between a candidate clause and an emotion clause. Experimental results show that our proposed approach performs on par with the existing state-of-the-art methods on the original ECE dataset, and is more robust against adversarial attacks compared to existing models.

pdf bib
Every Bite Is an Experience : Key Point Analysis of Business ReviewsKey Point Analysis of Business Reviews
Roy Bar-Haim | Lilach Eden | Yoav Kantor | Roni Friedman | Noam Slonim

Previous work on review summarization focused on measuring the sentiment toward the main aspects of the reviewed product or business, or on creating a textual summary. These approaches provide only a partial view of the data : aspect-based sentiment summaries lack sufficient explanation or justification for the aspect rating, while textual summaries do not quantify the significance of each element, and are not well-suited for representing conflicting views. Recently, Key Point Analysis (KPA) has been proposed as a summarization framework that provides both textual and quantitative summary of the main points in the data. We adapt KPA to review data by introducing Collective Key Point Mining for better key point extraction ; integrating sentiment analysis into KPA ; identifying good key point candidates for review summaries ; and leveraging the massive amount of available reviews and their metadata. We show empirically that these novel extensions of KPA substantially improve its performance. We demonstrate that promising results can be achieved without any domain-specific annotation, while human supervision can lead to further improvement.

pdf bib
Consistency Regularization for Cross-Lingual Fine-Tuning
Bo Zheng | Li Dong | Shaohan Huang | Wenhui Wang | Zewen Chi | Saksham Singhal | Wanxiang Che | Ting Liu | Xia Song | Furu Wei

Fine-tuning pre-trained cross-lingual language models can transfer task-specific supervision from one language to the others. In this work, we propose to improve cross-lingual fine-tuning with consistency regularization. Specifically, we use example consistency regularization to penalize the prediction sensitivity to four types of data augmentations, i.e., subword sampling, Gaussian noise, code-switch substitution, and machine translation. In addition, we employ model consistency to regularize the models trained with two augmented versions of the same training set. Experimental results on the XTREME benchmark show that our method significantly improves cross-lingual fine-tuning across various tasks, including text classification, question answering, and sequence labeling.

pdf bib
Rejuvenating Low-Frequency Words : Making the Most of Parallel Data in Non-Autoregressive Translation
Liang Ding | Longyue Wang | Xuebo Liu | Derek F. Wong | Dacheng Tao | Zhaopeng Tu

Knowledge distillation (KD) is commonly used to construct synthetic data for training non-autoregressive translation (NAT) models. However, there exists a discrepancy on low-frequency words between the distilled and the original data, leading to more errors on predicting low-frequency words. To alleviate the problem, we directly expose the raw data into NAT by leveraging pretraining. By analyzing directed alignments, we found that KD makes low-frequency source words aligned with targets more deterministically but fails to align sufficient low-frequency words from target to source. Accordingly, we propose reverse KD to rejuvenate more alignments for low-frequency target words. To make the most of authentic and synthetic data, we combine these complementary approaches as a new training strategy for further boosting NAT performance. We conduct experiments on five translation benchmarks over two advanced architectures. Results demonstrate that the proposed approach can significantly and universally improve translation quality by reducing translation errors on low-frequency words. Encouragingly, our approach achieves 28.2 and 33.9 BLEU points on the WMT14 English-German and WMT16 Romanian-English datasets, respectively. Our code, data, and trained models are available at.

pdf bib
G-Transformer for Document-Level Machine TranslationG-Transformer for Document-Level Machine Translation
Guangsheng Bao | Yue Zhang | Zhiyang Teng | Boxing Chen | Weihua Luo

Document-level MT models are still far from satisfactory. Existing work extend translation unit from single sentence to multiple sentences. However, study shows that when we further enlarge the translation unit to a whole document, supervised training of Transformer can fail. In this paper, we find such failure is not caused by overfitting, but by sticking around local minima during training. Our analysis shows that the increased complexity of target-to-source attention is a reason for the failure. As a solution, we propose G-Transformer, introducing locality assumption as an inductive bias into Transformer, reducing the hypothesis space of the attention from target to source. Experiments show that G-Transformer converges faster and more stably than Transformer, achieving new state-of-the-art BLEU scores for both nonpretraining and pre-training settings on three benchmark datasets.

pdf bib
Prevent the Language Model from being Overconfident in Neural Machine Translation
Mengqi Miao | Fandong Meng | Yijin Liu | Xiao-Hua Zhou | Jie Zhou

The Neural Machine Translation (NMT) model is essentially a joint language model conditioned on both the source sentence and partial translation. Therefore, the NMT model naturally involves the mechanism of the Language Model (LM) that predicts the next token only based on partial translation. Despite its success, NMT still suffers from the hallucination problem, generating fluent but inadequate translations. The main reason is that NMT pays excessive attention to the partial translation while neglecting the source sentence to some extent, namely overconfidence of the LM. Accordingly, we define the Margin between the NMT and the LM, calculated by subtracting the predicted probability of the LM from that of the NMT model for each token. The Margin is negatively correlated to the overconfidence degree of the LM. Based on the property, we propose a Margin-based Token-level Objective (MTO) and a Margin-based Sentence-level Objective (MSO) to maximize the Margin for preventing the LM from being overconfident. Experiments on WMT14 English-to-German, WMT19 Chinese-to-English, and WMT14 English-to-French translation tasks demonstrate the effectiveness of our approach, with 1.36, 1.50, and 0.63 BLEU improvements, respectively, compared to the Transformer baseline. The human evaluation further verifies that our approaches improve translation adequacy as well as fluency.

pdf bib
Diversifying Dialog Generation via Adaptive Label Smoothing
Yida Wang | Yinhe Zheng | Yong Jiang | Minlie Huang

Neural dialogue generation models trained with the one-hot target distribution suffer from the over-confidence issue, which leads to poor generation diversity as widely reported in the literature. Although existing approaches such as label smoothing can alleviate this issue, they fail to adapt to diverse dialog contexts. In this paper, we propose an Adaptive Label Smoothing (AdaLabel) approach that can adaptively estimate a target label distribution at each time step for different contexts. The maximum probability in the predicted distribution is used to modify the soft target distribution produced by a novel light-weight bi-directional decoder module. The resulting target distribution is aware of both previous and future contexts and is adjusted to avoid over-training the dialogue model. Our model can be trained in an endto-end manner. Extensive experiments on two benchmark datasets show that our approach outperforms various competitive baselines in producing diverse responses.

pdf bib
Out-of-Scope Intent Detection with Self-Supervision and Discriminative Training
Li-Ming Zhan | Haowen Liang | Bo Liu | Lu Fan | Xiao-Ming Wu | Albert Y.S. Lam

Out-of-scope intent detection is of practical importance in task-oriented dialogue systems. Since the distribution of outlier utterances is arbitrary and unknown in the training stage, existing methods commonly rely on strong assumptions on data distribution such as mixture of Gaussians to make inference, resulting in either complex multi-step training procedures or hand-crafted rules such as confidence threshold selection for outlier detection. In this paper, we propose a simple yet effective method to train an out-of-scope intent classifier in a fully end-to-end manner by simulating the test scenario in training, which requires no assumption on data distribution and no additional post-processing or threshold setting. Specifically, we construct a set of pseudo outliers in the training stage, by generating synthetic outliers using inliner features via self-supervision and sampling out-of-scope sentences from easily available open-domain datasets. The pseudo outliers are used to train a discriminative classifier that can be directly applied to and generalize well on the test task. We evaluate our method extensively on four benchmark dialogue datasets and observe significant improvements over state-of-the-art approaches. Our code has been released at

pdf bib
Nested Named Entity Recognition via Explicitly Excluding the Influence of the Best Path
Yiran Wang | Hiroyuki Shindo | Yuji Matsumoto | Taro Watanabe

This paper presents a novel method for nested named entity recognition. As a layered method, our method extends the prior second-best path recognition method by explicitly excluding the influence of the best path. Our method maintains a set of hidden states at each time step and selectively leverages them to build a different potential function for recognition at each level. In addition, we demonstrate that recognizing innermost entities first results in better performance than the conventional outermost entities first scheme. We provide extensive experimental results on ACE2004, ACE2005, and GENIA datasets to show the effectiveness and efficiency of our proposed method.

pdf bib
Revisiting the Negative Data of Distantly Supervised Relation Extraction
Chenhao Xie | Jiaqing Liang | Jingping Liu | Chengsong Huang | Wenhao Huang | Yanghua Xiao

Distantly supervision automatically generates plenty of training samples for relation extraction. However, it also incurs two major problems : noisy labels and imbalanced training data. Previous works focus more on reducing wrongly labeled relations (false positives) while few explore the missing relations that are caused by incompleteness of knowledge base (false negatives). Furthermore, the quantity of negative labels overwhelmingly surpasses the positive ones in previous problem formulations. In this paper, we first provide a thorough analysis of the above challenges caused by negative data. Next, we formulate the problem of relation extraction into as a positive unlabeled learning task to alleviate false negative problem. Thirdly, we propose a pipeline approach, dubbed ReRe, that first performs sentence classification with relational labels and then extracts the subjects / objects. Experimental results show that the proposed method consistently outperforms existing approaches and remains excellent performance even learned with a large quantity of false positive samples. Source code is available online at

pdf bib
Knowing the No-match : Entity Alignment with Dangling Cases
Zequn Sun | Muhao Chen | Wei Hu

This paper studies a new problem setting of entity alignment for knowledge graphs (KGs). Since KGs possess different sets of entities, there could be entities that can not find alignment across them, leading to the problem of dangling entities. As the first attempt to this problem, we construct a new dataset and design a multi-task learning framework for both entity alignment and dangling entity detection. The framework can opt to abstain from predicting alignment for the detected dangling entities. We propose three techniques for dangling entity detection that are based on the distribution of nearest-neighbor distances, i.e., nearest neighbor classification, marginal ranking and background ranking. After detecting and removing dangling entities, an incorporated entity alignment model in our framework can provide more robust alignment for remaining entities. Comprehensive experiments and analyses demonstrate the effectiveness of our framework. We further discover that the dangling entity detection module can, in turn, improve alignment learning and the final performance. The contributed resource is publicly available to foster further research.

pdf bib
BERT is to NLP what AlexNet is to CV : Can Pre-Trained Language Models Identify Analogies?BERT is to NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analogies?
Asahi Ushio | Luis Espinosa Anke | Steven Schockaert | Jose Camacho-Collados

Analogies play a central role in human commonsense reasoning. The ability to recognize analogies such as eye is to seeing what ear is to hearing, sometimes referred to as analogical proportions, shape how we structure knowledge and understand language. Surprisingly, however, the task of identifying such analogies has not yet received much attention in the language model era. In this paper, we analyze the capabilities of transformer-based language models on this unsupervised task, using benchmarks obtained from educational settings, as well as more commonly used datasets. We find that off-the-shelf language models can identify analogies to a certain extent, but struggle with abstract and complex relations, and results are highly sensitive to model architecture and hyperparameters. Overall the best results were obtained with GPT-2 and RoBERTa, while configurations using BERT were not able to outperform word embedding models. Our results raise important questions for future work about how, and to what extent, pre-trained language models capture knowledge about abstract semantic relations.

pdf bib
Exploring the Representation of Word Meanings in Context : A Case Study on Homonymy and SynonymyA Case Study on Homonymy and Synonymy
Marcos Garcia

This paper presents a multilingual study of word meaning representations in context. We assess the ability of both static and contextualized models to adequately represent different lexical-semantic relations, such as homonymy and synonymy. To do so, we created a new multilingual dataset that allows us to perform a controlled evaluation of several factors such as the impact of the surrounding context or the overlap between words, conveying the same or different senses. A systematic assessment on four scenarios shows that the best monolingual models based on Transformers can adequately disambiguate homonyms in context. However, as they rely heavily on context, these models fail at representing words with different senses when occurring in similar sentences. Experiments are performed in Galician, Portuguese, English, and Spanish, and both the dataset (with more than 3,000 evaluation items) and new models are freely released with this study.

pdf bib
HERALD : An Annotation Efficient Method to Detect User Disengagement in Social ConversationsHERALD: An Annotation Efficient Method to Detect User Disengagement in Social Conversations
Weixin Liang | Kai-Hui Liang | Zhou Yu

Open-domain dialog systems have a user-centric goal : to provide humans with an engaging conversation experience. User engagement is one of the most important metrics for evaluating open-domain dialog systems, and could also be used as real-time feedback to benefit dialog policy learning. Existing work on detecting user disengagement typically requires hand-labeling many dialog samples. We propose HERALD, an efficient annotation framework that reframes the training data annotation process as a denoising problem. Specifically, instead of manually labeling training samples, we first use a set of labeling heuristics to label training samples automatically. We then denoise the weakly labeled data using the Shapley algorithm. Finally, we use the denoised data to train a user engagement detector. Our experiments show that HERALD improves annotation efficiency significantly and achieves 86 % user disengagement detection accuracy in two dialog corpora.

pdf bib
Value-Agnostic Conversational Semantic Parsing
Emmanouil Antonios Platanios | Adam Pauls | Subhro Roy | Yuchen Zhang | Alexander Kyte | Alan Guo | Sam Thomson | Jayant Krishnamurthy | Jason Wolfe | Jacob Andreas | Dan Klein

Conversational semantic parsers map user utterances to executable programs given dialogue histories composed of previous utterances, programs, and system responses. Existing parsers typically condition on rich representations of history that include the complete set of values and computations previously discussed. We propose a model that abstracts over values to focus prediction on type- and function-level context. This approach provides a compact encoding of dialogue histories and predicted programs, improving generalization and computational efficiency. Our model incorporates several other components, including an atomic span copy operation and structural enforcement of well-formedness constraints on predicted programs, that are particularly advantageous in the low-data regime. Trained on the SMCalFlow and TreeDST datasets, our model outperforms prior work by 7.3 % and 10.6 % respectively in terms of absolute accuracy. Trained on only a thousand examples from each dataset, it outperforms strong baselines by 12.4 % and 6.4 %. These results indicate that simple representations are key to effective generalization in conversational semantic parsing.

pdf bib
MPC-BERT : A Pre-Trained Language Model for Multi-Party Conversation UnderstandingMPC-BERT: A Pre-Trained Language Model for Multi-Party Conversation Understanding
Jia-Chen Gu | Chongyang Tao | Zhenhua Ling | Can Xu | Xiubo Geng | Daxin Jiang

Recently, various neural models for multi-party conversation (MPC) have achieved impressive improvements on a variety of tasks such as addressee recognition, speaker identification and response prediction. However, these existing methods on MPC usually represent interlocutors and utterances individually and ignore the inherent complicated structure in MPC which may provide crucial interlocutor and utterance semantics and would enhance the conversation understanding process. To this end, we present MPC-BERT, a pre-trained model for MPC understanding that considers learning who says what to whom in a unified model with several elaborated self-supervised tasks. Particularly, these tasks can be generally categorized into (1) interlocutor structure modeling including reply-to utterance recognition, identical speaker searching and pointer consistency distinction, and (2) utterance semantics modeling including masked shared utterance restoration and shared node detection. We evaluate MPC-BERT on three downstream tasks including addressee recognition, speaker identification and response selection. Experimental results show that MPC-BERT outperforms previous methods by large margins and achieves new state-of-the-art performance on all three downstream tasks at two benchmarks.

pdf bib
NeuralWOZ : Learning to Collect Task-Oriented Dialogue via Model-Based SimulationNeuralWOZ: Learning to Collect Task-Oriented Dialogue via Model-Based Simulation
Sungdong Kim | Minsuk Chang | Sang-Woo Lee

We propose NeuralWOZ, a novel dialogue collection framework that uses model-based dialogue simulation. NeuralWOZ has two pipelined models, Collector and Labeler. Collector generates dialogues from (1) user’s goal instructions, which are the user context and task constraints in natural language, and (2) system’s API call results, which is a list of possible query responses for user requests from the given knowledge base. Labeler annotates the generated dialogue by formulating the annotation as a multiple-choice problem, in which the candidate labels are extracted from goal instructions and API call results. We demonstrate the effectiveness of the proposed method in the zero-shot domain transfer learning for dialogue state tracking. In the evaluation, the synthetic dialogue corpus generated from NeuralWOZ achieves a new state-of-the-art with improvements of 4.4 % point joint goal accuracy on average across domains, and improvements of 5.7 % point of zero-shot coverage against the MultiWOZ 2.1 dataset.

pdf bib
Structural Guidance for Transformer Language Models
Peng Qian | Tahira Naseem | Roger Levy | Ramón Fernandez Astudillo

Transformer-based language models pre-trained on large amounts of text data have proven remarkably successful in learning generic transferable linguistic representations. Here we study whether structural guidance leads to more human-like systematic linguistic generalization in Transformer language models without resorting to pre-training on very large amounts of data. We explore two general ideas. The Generative Parsing idea jointly models the incremental parse and word sequence as part of the same sequence modeling task. The Structural Scaffold idea guides the language model’s representation via additional structure loss that separately predicts the incremental constituency parse. We train the proposed models along with a vanilla Transformer language model baseline on a 14 million-token and a 46 million-token subset of the BLLIP dataset, and evaluate models’ syntactic generalization performances on SG Test Suites and sized BLiMP. Experiment results across two benchmarks suggest converging evidence that generative structural supervisions can induce more robust and humanlike linguistic generalization in Transformer language models without the need for data intensive pre-training.

pdf bib
Surprisal Estimators for Human Reading Times Need Character Models
Byung-Doh Oh | Christian Clark | William Schuler

While the use of character models has been popular in NLP applications, it has not been explored much in the context of psycholinguistic modeling. This paper presents a character model that can be applied to a structural parser-based processing model to calculate word generation probabilities. Experimental results show that surprisal estimates from a structural processing model using this character model deliver substantially better fits to self-paced reading, eye-tracking, and fMRI data than those from large-scale language models trained on much more data. This may suggest that the proposed processing model provides a more humanlike account of sentence processing, which assumes a larger role of morphology, phonotactics, and orthographic complexity than was previously thought.

pdf bib
H-Transformer-1D : Fast One-Dimensional Hierarchical Attention for SequencesH-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences
Zhenhai Zhu | Radu Soricut

We describe an efficient hierarchical method to compute attention in the Transformer architecture. The proposed attention mechanism exploits a matrix structure similar to the Hierarchical Matrix (H-Matrix) developed by the numerical analysis community, and has linear run time and memory complexity. We perform extensive experiments to show that the inductive bias embodied by our hierarchical attention is effective in capturing the hierarchical structure in the sequences typical for natural language and vision tasks. Our method is superior to alternative sub-quadratic proposals by over +6 points on average on the Long Range Arena benchmark. It also sets a new SOTA test perplexity on One-Billion Word dataset with 5x fewer model parameters than that of the previous-best Transformer-based models.

pdf bib
Towards Propagation Uncertainty : Edge-enhanced Bayesian Graph Convolutional Networks for Rumor DetectionBayesian Graph Convolutional Networks for Rumor Detection
Lingwei Wei | Dou Hu | Wei Zhou | Zhaojuan Yue | Songlin Hu

Detecting rumors on social media is a very critical task with significant implications to the economy, public health, etc. Previous works generally capture effective features from texts and the propagation structure. However, the uncertainty caused by unreliable relations in the propagation structure is common and inevitable due to wily rumor producers and the limited collection of spread data. Most approaches neglect it and may seriously limit the learning of features. Towards this issue, this paper makes the first attempt to explore propagation uncertainty for rumor detection. Specifically, we propose a novel Edge-enhanced Bayesian Graph Convolutional Network (EBGCN) to capture robust structural features. The model adaptively rethinks the reliability of latent relations by adopting a Bayesian approach. Besides, we design a new edge-wise consistency training framework to optimize the model by enforcing consistency on relations. Experiments on three public benchmark datasets demonstrate that the proposed model achieves better performance than baseline methods on both rumor detection and early rumor detection tasks.

pdf bib
A Neural Model for Joint Document and Snippet Ranking in Question Answering for Large Document Collections
Dimitris Pappas | Ion Androutsopoulos

Question answering (QA) systems for large document collections typically use pipelines that (i) retrieve possibly relevant documents, (ii) re-rank them, (iii) rank paragraphs or other snippets of the top-ranked documents, and (iv) select spans of the top-ranked snippets as exact answers. Pipelines are conceptually simple, but errors propagate from one component to the next, without later components being able to revise earlier decisions. We present an architecture for joint document and snippet ranking, the two middle stages, which leverages the intuition that relevant documents have good snippets and good snippets come from relevant documents. The architecture is general and can be used with any neural text relevance ranker. We experiment with two main instantiations of the architecture, based on POSIT-DRMM (PDRMM) and a BERT-based ranker. Experiments on biomedical data from BIOASQ show that our joint models vastly outperform the pipelines in snippet retrieval, the main goal for QA, with fewer trainable parameters, also remaining competitive in document retrieval. Furthermore, our joint PDRMM-based model is competitive with BERT-based models, despite using orders of magnitude fewer parameters. These claims are also supported by human evaluation on two test batches of BIOASQ. To test our key findings on another dataset, we modified the Natural Questions dataset so that it can also be used for document and snippet retrieval. Our joint PDRMM-based model again outperforms the corresponding pipeline in snippet retrieval on the modified Natural Questions dataset, even though it performs worse than the pipeline in document retrieval.

pdf bib
ABCD : A Graph Framework to Convert Complex Sentences to a Covering Set of Simple SentencesABCD: A Graph Framework to Convert Complex Sentences to a Covering Set of Simple Sentences
Yanjun Gao | Ting-Hao Huang | Rebecca J. Passonneau

Atomic clauses are fundamental text units for understanding complex sentences. Identifying the atomic sentences within complex sentences is important for applications such as summarization, argument mining, discourse analysis, discourse parsing, and question answering. Previous work mainly relies on rule-based methods dependent on parsing. We propose a new task to decompose each complex sentence into simple sentences derived from the tensed clauses in the source, and a novel problem formulation as a graph edit task. Our neural model learns to Accept, Break, Copy or Drop elements of a graph that combines word adjacency and grammatical dependencies. The full processing pipeline includes modules for graph construction, graph editing, and sentence generation from the output graph. We introduce DeSSE, a new dataset designed to train and evaluate complex sentence decomposition, and MinWiki, a subset of MinWikiSplit. ABCD achieves comparable performance as two parsing baselines on MinWiki. On DeSSE, which has a more even balance of complex sentence types, our model achieves higher accuracy on the number of atomic sentences than an encoder-decoder baseline. Results include a detailed error analysis.

pdf bib
VECO : Variable and Flexible Cross-lingual Pre-training for Language Understanding and GenerationVECO: Variable and Flexible Cross-lingual Pre-training for Language Understanding and Generation
Fuli Luo | Wei Wang | Jiahao Liu | Yijia Liu | Bin Bi | Songfang Huang | Fei Huang | Luo Si

Existing work in multilingual pretraining has demonstrated the potential of cross-lingual transferability by training a unified Transformer encoder for multiple languages. However, much of this work only relies on the shared vocabulary and bilingual contexts to encourage the correlation across languages, which is loose and implicit for aligning the contextual representations between languages. In this paper, we plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages. It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language. More importantly, when fine-tuning on downstream tasks, the cross-attention module can be plugged in or out on-demand, thus naturally benefiting a wider range of cross-lingual tasks, from language understanding to generation. As a result, the proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark, covering text classification, sequence labeling, question answering, and sentence retrieval. For cross-lingual generation tasks, it also outperforms all existing cross-lingual models and state-of-the-art Transformer variants on WMT14 English-to-German and English-to-French translation datasets, with gains of up to 1 2 BLEU.

pdf bib
Check It Again : Progressive Visual Question Answering via Visual Entailment
Qingyi Si | Zheng Lin | Ming yu Zheng | Peng Fu | Weiping Wang

While sophisticated neural-based models have achieved remarkable success in Visual Question Answering (VQA), these models tend to answer questions only according to superficial correlations between question and answer. Several recent approaches have been developed to address this language priors problem. However, most of them predict the correct answer according to one best output without checking the authenticity of answers. Besides, they only explore the interaction between image and question, ignoring the semantics of candidate answers. In this paper, we propose a select-and-rerank (SAR) progressive framework based on Visual Entailment. Specifically, we first select the candidate answers relevant to the question or the image, then we rerank the candidate answers by a visual entailment task, which verifies whether the image semantically entails the synthetic statement of the question and each candidate answer. Experimental results show the effectiveness of our proposed framework, which establishes a new state-of-the-art accuracy on VQA-CP v2 with a 7.55 % improvement.

pdf bib
How is BERT surprised? Layerwise detection of linguistic anomaliesBERT surprised? Layerwise detection of linguistic anomalies
Bai Li | Zining Zhu | Guillaume Thomas | Yang Xu | Frank Rudzicz

Transformer language models have shown remarkable ability in detecting when a word is anomalous in context, but likelihood scores offer no information about the cause of the anomaly. In this work, we use Gaussian models for density estimation at intermediate layers of three language models (BERT, RoBERTa, and XLNet), and evaluate our method on BLiMP, a grammaticality judgement benchmark. In lower layers, surprisal is highly correlated to low token frequency, but this correlation diminishes in upper layers. Next, we gather datasets of morphosyntactic, semantic, and commonsense anomalies from psycholinguistic studies ; we find that the best performing model RoBERTa exhibits surprisal in earlier layers when the anomaly is morphosyntactic than when it is semantic, while commonsense anomalies do not exhibit surprisal at any intermediate layer. These results suggest that language models employ separate mechanisms to detect different types of linguistic anomalies.

pdf bib
Psycholinguistic Tripartite Graph Network for Personality Detection
Tao Yang | Feifan Yang | Haolan Ouyang | Xiaojun Quan

Most of the recent work on personality detection from online posts adopts multifarious deep neural networks to represent the posts and builds predictive models in a data-driven manner, without the exploitation of psycholinguistic knowledge that may unveil the connections between one’s language use and his psychological traits. In this paper, we propose a psycholinguistic knowledge-based tripartite graph network, TrigNet, which consists of a tripartite graph network and a BERT-based graph initializer. The graph network injects structural psycholinguistic knowledge in LIWC, a computerized instrument for psycholinguistic analysis, by constructing a heterogeneous tripartite graph. The initializer is employed to provide initial embeddings for the graph nodes. To reduce the computational cost in graph learning, we further propose a novel flow graph attention network (GAT) that only transmits messages between neighboring parties in the tripartite graph. Benefiting from the tripartite graph, TrigNet can aggregate post information from a psychological perspective, which is a novel way of exploiting domain knowledge. Extensive experiments on two datasets show that TrigNet outperforms the existing state-of-art model by 3.47 and 2.10 points in average F1. Moreover, the flow GAT reduces the FLOPS and Memory measures by 38 % and 32 %, respectively, in comparison to the original GAT in our setting.

pdf bib
Probing Toxic Content in Large Pre-Trained Language Models
Nedjma Ousidhoum | Xinran Zhao | Tianqing Fang | Yangqiu Song | Dit-Yan Yeung

Large pre-trained language models (PTLMs) have been shown to carry biases towards different social groups which leads to the reproduction of stereotypical and toxic content by major NLP systems. We propose a method based on logistic regression classifiers to probe English, French, and Arabic PTLMs and quantify the potentially harmful content that they convey with respect to a set of templates. The templates are prompted by a name of a social group followed by a cause-effect relation. We use PTLMs to predict masked tokens at the end of a sentence in order to examine how likely they enable toxicity towards specific communities. We shed the light on how such negative content can be triggered within unrelated and benign contexts based on evidence from a large-scale study, then we explain how to take advantage of our methodology to assess and mitigate the toxicity transmitted by PTLMs.

pdf bib
HiddenCut : Simple Data Augmentation for Natural Language Understanding with Better GeneralizabilityHiddenCut: Simple Data Augmentation for Natural Language Understanding with Better Generalizability
Jiaao Chen | Dinghan Shen | Weizhu Chen | Diyi Yang

Fine-tuning large pre-trained models with task-specific data has achieved great success in NLP. However, it has been demonstrated that the majority of information within the self-attention networks is redundant and not utilized effectively during the fine-tuning stage. This leads to inferior results when generalizing the obtained models to out-of-domain distributions. To this end, we propose a simple yet effective data augmentation technique, HiddenCut, to better regularize the model and encourage it to learn more generalizable features. Specifically, contiguous spans within the hidden space are dynamically and strategically dropped during training. Experiments show that our HiddenCut method outperforms the state-of-the-art augmentation methods on the GLUE benchmark, and consistently exhibits superior generalization performances on out-of-distribution and challenging counterexamples. We have publicly released our code at

pdf bib
Neural Stylistic Response Generation with Disentangled Latent Variables
Qingfu Zhu | Wei-Nan Zhang | Ting Liu | William Yang Wang

Generating open-domain conversational responses in the desired style usually suffers from the lack of parallel data in the style. Meanwhile, using monolingual stylistic data to increase style intensity often leads to the expense of decreasing content relevance. In this paper, we propose to disentangle the content and style in latent space by diluting sentence-level information in style representations. Combining the desired style representation and a response content representation will then obtain a stylistic response. Our approach achieves a higher BERT-based style intensity score and comparable BLEU scores, compared with baselines. Human evaluation results show that our approach significantly improves style intensity and maintains content relevance.

pdf bib
RADDLE : An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog SystemsRADDLE: An Evaluation Benchmark and Analysis Platform for Robust Task-oriented Dialog Systems
Baolin Peng | Chunyuan Li | Zhu Zhang | Chenguang Zhu | Jinchao Li | Jianfeng Gao

For task-oriented dialog systems to be maximally useful, it must be able to process conversations in a way that is (1) generalizable with a small number of training examples for new task domains, and (2) robust to user input in various styles, modalities, or domains. In pursuit of these goals, we introduce the RADDLE benchmark, a collection of corpora and tools for evaluating the performance of models across a diverse set of domains. By including tasks with limited training data, RADDLE is designed to favor and encourage models with a strong generalization ability. RADDLE also includes a diagnostic checklist that facilitates detailed robustness analysis in aspects such as language variations, speech errors, unseen entities, and out-of-domain utterances. We evaluate recent state-of-the-art systems based on pre-training and fine-tuning, and find that grounded pre-training on heterogeneous dialog corpora performs better than training a separate model per domain. Adversarial training is also proposed to improve model robustness against noisy inputs. Overall, existing models are less than satisfactory in robustness evaluation, which suggests opportunities for future improvement.

pdf bib
A Pre-training Strategy for Zero-Resource Response Selection in Knowledge-Grounded Conversations
Chongyang Tao | Changyu Chen | Jiazhan Feng | Ji-Rong Wen | Rui Yan

Recently, many studies are emerging towards building a retrieval-based dialogue system that is able to effectively leverage background knowledge (e.g., documents) when conversing with humans. However, it is non-trivial to collect large-scale dialogues that are naturally grounded on the background documents, which hinders the effective and adequate training of knowledge selection and response matching. To overcome the challenge, we consider decomposing the training of the knowledge-grounded response selection into three tasks including : 1) query-passage matching task ; 2) query-dialogue history matching task ; 3) multi-turn response matching task, and joint learning all these tasks in a unified pre-trained language model. The former two tasks could help the model in knowledge selection and comprehension, while the last task is designed for matching the proper response with the given query and background knowledge (dialogue history). By this means, the model can be learned to select relevant knowledge and distinguish proper response, with the help of ad-hoc retrieval corpora and a large number of ungrounded multi-turn dialogues. Experimental results on two benchmarks of knowledge-grounded response selection indicate that our model can achieve comparable performance with several existing methods that rely on crowd-sourced data for training.

pdf bib
Dependency-driven Relation Extraction with Attentive Graph Convolutional Networks
Yuanhe Tian | Guimin Chen | Yan Song | Xiang Wan

Syntactic information, especially dependency trees, has been widely used by existing studies to improve relation extraction with better semantic guidance for analyzing the context information associated with the given entities. However, most existing studies suffer from the noise in the dependency trees, especially when they are automatically generated, so that intensively leveraging dependency information may introduce confusions to relation classification and necessary pruning is of great importance in this task. In this paper, we propose a dependency-driven approach for relation extraction with attentive graph convolutional networks (A-GCN). In this approach, an attention mechanism upon graph convolutional networks is applied to different contextual words in the dependency tree obtained from an off-the-shelf dependency parser, to distinguish the importance of different word dependencies. Consider that dependency types among words also contain important contextual guidance, which is potentially helpful for relation extraction, we also include the type information in A-GCN modeling. Experimental results on two English benchmark datasets demonstrate the effectiveness of our A-GCN, which outperforms previous studies and achieves state-of-the-art performance on both datasets.

pdf bib
Claim Matching Beyond English to Scale Global Fact-CheckingEnglish to Scale Global Fact-Checking
Ashkan Kazemi | Kiran Garimella | Devin Gaffney | Scott Hale

Manual fact-checking does not scale well to serve the needs of the internet. This issue is further compounded in non-English contexts. In this paper, we discuss claim matching as a possible solution to scale fact-checking. We define claim matching as the task of identifying pairs of textual messages containing claims that can be served with one fact-check. We construct a novel dataset of WhatsApp tipline and public group messages alongside fact-checked claims that are first annotated for containing claim-like statements and then matched with potentially similar items and annotated for claim matching. Our dataset contains content in high-resource (English, Hindi) and lower-resource (Bengali, Malayalam, Tamil) languages. We train our own embedding model using knowledge distillation and a high-quality teacher model in order to address the imbalance in embedding quality between the low- and high-resource languages in our dataset. We provide evaluations on the performance of our solution and compare with baselines and existing state-of-the-art multilingual embedding models, namely LASER and LaBSE. We demonstrate that our performance exceeds LASER and LaBSE in all settings. We release our annotated datasets, codebooks, and trained embedding model to allow for further research.

pdf bib
Energy-Based Reranking : Improving Neural Machine Translation Using Energy-Based Models
Sumanta Bhattacharyya | Amirmohammad Rooshenas | Subhajit Naskar | Simeng Sun | Mohit Iyyer | Andrew McCallum

The discrepancy between maximum likelihood estimation (MLE) and task measures such as BLEU score has been studied before for autoregressive neural machine translation (NMT) and resulted in alternative training algorithms (Ranzato et al., 2016 ; Norouzi et al., 2016 ; Shen et al., 2016 ; Wu et al., 2018). However, MLE training remains the de facto approach for autoregressive NMT because of its computational efficiency and stability. Despite this mismatch between the training objective and task measure, we notice that the samples drawn from an MLE-based trained NMT support the desired distribution there are samples with much higher BLEU score comparing to the beam decoding output. To benefit from this observation, we train an energy-based model to mimic the behavior of the task measure (i.e., the energy-based model assigns lower energy to samples with higher BLEU score), which is resulted in a re-ranking algorithm based on the samples drawn from NMT : energy-based re-ranking (EBR). We use both marginal energy models (over target sentence) and joint energy models (over both source and target sentences). Our EBR with the joint energy model consistently improves the performance of the Transformer-based NMT : +3.7 BLEU points on IWSLT’14 German-English, +3.37 BELU points on Sinhala-English, +1.4 BLEU points on WMT’16 English-German tasks.

pdf bib
ForecastQA : A Question Answering Challenge for Event Forecasting with Temporal Text DataForecastQA: A Question Answering Challenge for Event Forecasting with Temporal Text Data
Woojeong Jin | Rahul Khanna | Suji Kim | Dong-Ho Lee | Fred Morstatter | Aram Galstyan | Xiang Ren

Event forecasting is a challenging, yet important task, as humans seek to constantly plan for the future. Existing automated forecasting studies rely mostly on structured data, such as time-series or event-based knowledge graphs, to help predict future events. In this work, we aim to formulate a task, construct a dataset, and provide benchmarks for developing methods for event forecasting with large volumes of unstructured text data. To simulate the forecasting scenario on temporal news documents, we formulate the problem as a restricted-domain, multiple-choice, question-answering (QA) task. Unlike existing QA tasks, our task limits accessible information, and thus a model has to make a forecasting judgement. To showcase the usefulness of this task formulation, we introduce ForecastQA, a question-answering dataset consisting of 10,392 event forecasting questions, which have been collected and verified via crowdsourcing efforts. We present our experiments on ForecastQA using BERTbased models and find that our best model achieves 61.0 % accuracy on the dataset, which still lags behind human performance by about 19 %. We hope ForecastQA will support future research efforts in bridging this gap.

pdf bib
Trigger is Not Sufficient : Exploiting Frame-aware Knowledge for Implicit Event Argument Extraction
Kaiwen Wei | Xian Sun | Zequn Zhang | Jingyuan Zhang | Guo Zhi | Li Jin

Implicit Event Argument Extraction seeks to identify arguments that play direct or implicit roles in a given event. However, most prior works focus on capturing direct relations between arguments and the event trigger. The lack of reasoning ability brings many challenges to the extraction of implicit arguments. In this work, we present a Frame-aware Event Argument Extraction (FEAE) learning framework to tackle this issue through reasoning in event frame-level scope. The proposed method leverages related arguments of the expected one as clues to guide the reasoning process. To bridge the gap between oracle knowledge used in the training phase and the imperfect related arguments in the test stage, we further introduce a curriculum knowledge distillation strategy to drive a final model that could operate without extra inputs through mimicking the behavior of a well-informed teacher model. Experimental results demonstrate FEAE obtains new state-of-the-art performance on the RAMS dataset.

pdf bib
AdaTag : Multi-Attribute Value Extraction from Product Profiles with Adaptive DecodingAdaTag: Multi-Attribute Value Extraction from Product Profiles with Adaptive Decoding
Jun Yan | Nasser Zalmout | Yan Liang | Christan Grant | Xiang Ren | Xin Luna Dong

Automatic extraction of product attribute values is an important enabling technology in e-Commerce platforms. This task is usually modeled using sequence labeling architectures, with several extensions to handle multi-attribute extraction. One line of previous work constructs attribute-specific models, through separate decoders or entirely separate models. However, this approach constrains knowledge sharing across different attributes. Other contributions use a single multi-attribute model, with different techniques to embed attribute information. But sharing the entire network parameters across all attributes can limit the model’s capacity to capture attribute-specific characteristics. In this paper we present AdaTag, which uses adaptive decoding to handle extraction. We parameterize the decoder with pretrained attribute embeddings, through a hypernetwork and a Mixture-of-Experts (MoE) module. This allows for separate, but semantically correlated, decoders to be generated on the fly for different attributes. This approach facilitates knowledge sharing, while maintaining the specificity of each attribute. Our experiments on a real-world e-Commerce dataset show marked improvements over previous methods.

pdf bib
Benchmarking Scalable Methods for Streaming Cross Document Entity Coreference
Robert L Logan IV | Andrew McCallum | Sameer Singh | Dan Bikel

Streaming cross document entity coreference (CDC) systems disambiguate mentions of named entities in a scalable manner via incremental clustering. Unlike other approaches for named entity disambiguation (e.g., entity linking), streaming CDC allows for the disambiguation of entities that are unknown at inference time. Thus, it is well-suited for processing streams of data where new entities are frequently introduced. Despite these benefits, this task is currently difficult to study, as existing approaches are either evaluated on datasets that are no longer available, or omit other crucial details needed to ensure fair comparison. In this work, we address this issue by compiling a large benchmark adapted from existing free datasets, and performing a comprehensive evaluation of a number of novel and existing baseline models. We investigate : how to best encode mentions, which clustering algorithms are most effective for grouping mentions, how models transfer to different domains, and how bounding the number of mentions tracked during inference impacts performance. Our results show that the relative performance of neural and feature-based mention encoders varies across different domains, and in most cases the best performance is achieved using a combination of both approaches. We also find that performance is minimally impacted by limiting the number of tracked mentions.

pdf bib
Search from History and Reason for Future : Two-stage Reasoning on Temporal Knowledge Graphs
Zixuan Li | Xiaolong Jin | Saiping Guan | Wei Li | Jiafeng Guo | Yuanzhuo Wang | Xueqi Cheng

Temporal Knowledge Graphs (TKGs) have been developed and used in many different areas. Reasoning on TKGs that predicts potential facts (events) in the future brings great challenges to existing models. When facing a prediction task, human beings usually search useful historical information (i.e., clues) in their memories and then reason for future meticulously. Inspired by this mechanism, we propose CluSTeR to predict future facts in a two-stage manner, Clue Searching and Temporal Reasoning, accordingly. Specifically, at the clue searching stage, CluSTeR learns a beam search policy via reinforcement learning (RL) to induce multiple clues from historical facts. At the temporal reasoning stage, it adopts a graph convolution network based sequence method to deduce answers from clues. Experiments on four datasets demonstrate the substantial advantages of CluSTeR compared with the state-of-the-art methods. Moreover, the clues found by CluSTeR further provide interpretability for the results.

pdf bib
Employing Argumentation Knowledge Graphs for Neural Argument Generation
Khalid Al Khatib | Lukas Trautner | Henning Wachsmuth | Yufang Hou | Benno Stein

Generating high-quality arguments, while being challenging, may benefit a wide range of downstream applications, such as writing assistants and argument search engines. Motivated by the effectiveness of utilizing knowledge graphs for supporting general text generation tasks, this paper investigates the usage of argumentation-related knowledge graphs to control the generation of arguments. In particular, we construct and populate three knowledge graphs, employing several compositions of them to encode various knowledge into texts of debate portals and relevant paragraphs from Wikipedia. Then, the texts with the encoded knowledge are used to fine-tune a pre-trained text generation model, GPT-2. We evaluate the newly created arguments manually and automatically, based on several dimensions important in argumentative contexts, including argumentativeness and plausibility. The results demonstrate the positive impact of encoding the graphs’ knowledge into debate portal texts for generating arguments with superior quality than those generated without knowledge.

pdf bib
GWLAN : General Word-Level AutocompletioN for Computer-Aided TranslationGWLAN: General Word-Level AutocompletioN for Computer-Aided Translation
Huayang Li | Lemao Liu | Guoping Huang | Shuming Shi

Computer-aided translation (CAT), the use of software to assist a human translator in the translation process, has been proven to be useful in enhancing the productivity of human translators. Autocompletion, which suggests translation results according to the text pieces provided by human translators, is a core function of CAT. There are two limitations in previous research in this line. First, most research works on this topic focus on sentence-level autocompletion (i.e., generating the whole translation as a sentence based on human input), but word-level autocompletion is under-explored so far. Second, almost no public benchmarks are available for the autocompletion task of CAT. This might be among the reasons why research progress in CAT is much slower compared to automatic MT. In this paper, we propose the task of general word-level autocompletion (GWLAN) from a real-world CAT scenario, and construct the first public benchmark to facilitate research in this topic. In addition, we propose an effective method for GWLAN and compare it with several strong baselines. Experiments demonstrate that our proposed method can give significantly more accurate predictions than the baseline methods on our benchmark datasets.

pdf bib
MLBiNet : A Cross-Sentence Collective Event Detection NetworkMLBiNet: A Cross-Sentence Collective Event Detection Network
Dongfang Lou | Zhilin Liao | Shumin Deng | Ningyu Zhang | Huajun Chen

We consider the problem of collectively detecting multiple events, particularly in cross-sentence settings. The key to dealing with the problem is to encode semantic information and model event inter-dependency at a document-level. In this paper, we reformulate it as a Seq2Seq task and propose a Multi-Layer Bidirectional Network (MLBiNet) to capture the document-level association of events and semantic information simultaneously. Specifically, a bidirectional decoder is firstly devised to model event inter-dependency within a sentence when decoding the event tag vector sequence. Secondly, an information aggregation module is employed to aggregate sentence-level semantic and event tag information. Finally, we stack multiple bidirectional decoders and feed cross-sentence information, forming a multi-layer bidirectional tagging architecture to iteratively propagate information across sentences. We show that our approach provides significant improvement in performance compared to the current state-of-the-art results.Multi-Layer Bidirectional Network (MLBiNet) to capture the document-level association of events and semantic information simultaneously. Specifically, a bidirectional decoder is firstly devised to model event inter-dependency within a sentence when decoding the event tag vector sequence. Secondly, an information aggregation module is employed to aggregate sentence-level semantic and event tag information. Finally, we stack multiple bidirectional decoders and feed cross-sentence information, forming a multi-layer bidirectional tagging architecture to iteratively propagate information across sentences. We show that our approach provides significant improvement in performance compared to the current state-of-the-art results.

pdf bib
Knowledge-Enriched Event Causality Identification via Latent Structure Induction Networks
Pengfei Cao | Xinyu Zuo | Yubo Chen | Kang Liu | Jun Zhao | Yuguang Chen | Weihua Peng

Identifying causal relations of events is an important task in natural language processing area. However, the task is very challenging, because event causality is usually expressed in diverse forms that often lack explicit causal clues. Existing methods can not handle well the problem, especially in the condition of lacking training data. Nonetheless, humans can make a correct judgement based on their background knowledge, including descriptive knowledge and relational knowledge. Inspired by it, we propose a novel Latent Structure Induction Network (LSIN) to incorporate the external structural knowledge into this task. Specifically, to make use of the descriptive knowledge, we devise a Descriptive Graph Induction module to obtain and encode the graph-structured descriptive knowledge. To leverage the relational knowledge, we propose a Relational Graph Induction module which is able to automatically learn a reasoning structure for event causality reasoning. Experimental results on two widely used datasets indicate that our approach significantly outperforms previous state-of-the-art methods.

pdf bib
R2D2 : Recursive Transformer based on Differentiable Tree for Interpretable Hierarchical Language ModelingR2D2: Recursive Transformer based on Differentiable Tree for Interpretable Hierarchical Language Modeling
Xiang Hu | Haitao Mi | Zujie Wen | Yafang Wang | Yi Su | Jing Zheng | Gerard de Melo

Human language understanding operates at multiple levels of granularity (e.g., words, phrases, and sentences) with increasing levels of abstraction that can be hierarchically combined. However, existing deep models with stacked layers do not explicitly model any sort of hierarchical process. In this paper, we propose a recursive Transformer model based on differentiable CKY style binary trees to emulate this composition process, and we extend the bidirectional language model pre-training objective to this architecture, attempting to predict each word given its left and right abstraction nodes. To scale up our approach, we also introduce an efficient pruning and growing algorithm to reduce the time complexity and enable encoding in linear time. Experimental results on language modeling and unsupervised parsing show the effectiveness of our approach.

pdf bib
Personalized Transformer for Explainable Recommendation
Lei Li | Yongfeng Zhang | Li Chen

Personalization of natural language generation plays a vital role in a large spectrum of tasks, such as explainable recommendation, review summarization and dialog systems. In these tasks, user and item IDs are important identifiers for personalization. Transformer, which is demonstrated with strong language modeling capability, however, is not personalized and fails to make use of the user and item IDs since the ID tokens are not even in the same semantic space as the words. To address this problem, we present a PErsonalized Transformer for Explainable Recommendation (PETER), on which we design a simple and effective learning objective that utilizes the IDs to predict the words in the target explanation, so as to endow the IDs with linguistic meanings and to achieve personalized Transformer. Besides generating explanations, PETER can also make recommendations, which makes it a unified model for the whole recommendation-explanation pipeline. Extensive experiments show that our small unpretrained model outperforms fine-tuned BERT on the generation task, in terms of both effectiveness and efficiency, which highlights the importance and the nice utility of our design.

pdf bib
Generating SOAP Notes from Doctor-Patient Conversations Using Modular Summarization TechniquesSOAP Notes from Doctor-Patient Conversations Using Modular Summarization Techniques
Kundan Krishna | Sopan Khosla | Jeffrey Bigham | Zachary C. Lipton

Following each patient visit, physicians draft long semi-structured clinical summaries called SOAP notes. While invaluable to clinicians and researchers, creating digital SOAP notes is burdensome, contributing to physician burnout. In this paper, we introduce the first complete pipelines to leverage deep summarization models to generate these notes based on transcripts of conversations between physicians and patients. After exploring a spectrum of methods across the extractive-abstractive spectrum, we propose Cluster2Sent, an algorithm that (i) extracts important utterances relevant to each summary section ; (ii) clusters together related utterances ; and then (iii) generates one summary sentence per cluster. Cluster2Sent outperforms its purely abstractive counterpart by 8 ROUGE-1 points, and produces significantly more factual and coherent sentences as assessed by expert human evaluators. For reproducibility, we demonstrate similar benefits on the publicly available AMI dataset. Our results speak to the benefits of structuring summaries into sections and annotating supporting evidence when constructing summarization corpora.

pdf bib
ConSERT : A Contrastive Framework for Self-Supervised Sentence Representation TransferConSERT: A Contrastive Framework for Self-Supervised Sentence Representation Transfer
Yuanmeng Yan | Rumei Li | Sirui Wang | Fuzheng Zhang | Wei Wu | Weiran Xu

Learning high-quality sentence representations benefits a wide range of natural language processing tasks. Though BERT-based pre-trained language models achieve high performance on many downstream tasks, the native derived sentence representations are proved to be collapsed and thus produce a poor performance on the semantic textual similarity (STS) tasks. In this paper, we present ConSERT, a Contrastive Framework for Self-Supervised SEntence Representation Transfer, that adopts contrastive learning to fine-tune BERT in an unsupervised and effective way. By making use of unlabeled texts, ConSERT solves the collapse issue of BERT-derived sentence representations and make them more applicable for downstream tasks. Experiments on STS datasets demonstrate that ConSERT achieves an 8 % relative improvement over the previous state-of-the-art, even comparable to the supervised SBERT-NLI. And when further incorporating NLI supervision, we achieve new state-of-the-art performance on STS tasks. Moreover, ConSERT obtains comparable results with only 1000 samples available, showing its robustness in data scarcity scenarios.

pdf bib
COINS : Dynamically Generating COntextualized Inference Rules for Narrative Story CompletionCOINS: Dynamically Generating COntextualized Inference Rules for Narrative Story Completion
Debjit Paul | Anette Frank

Despite recent successes of large pre-trained language models in solving reasoning tasks, their inference capabilities remain opaque. We posit that such models can be made more interpretable by explicitly generating interim inference rules, and using them to guide the generation of task-specific textual outputs. In this paper we present Coins, a recursive inference framework that i) iteratively reads context sentences, ii) dynamically generates contextualized inference rules, encodes them, and iii) uses them to guide task-specific output generation. We apply to a Narrative Story Completion task that asks a model to complete a story with missing sentences, to produce a coherent story with plausible logical connections, causal relationships, and temporal dependencies. By modularizing inference and sentence generation steps in a recurrent model, we aim to make reasoning steps and their effects on next sentence generation transparent. Our automatic and manual evaluations show that the model generates better story sentences than SOTA baselines, especially in terms of coherence. We further demonstrate improved performance over strong pre-trained LMs in generating commonsense inference rules. The recursive nature of holds the potential for controlled generation of longer sequences.Narrative Story Completion task that asks a model to complete a story with missing sentences, to produce a coherent story with plausible logical connections, causal relationships, and temporal dependencies. By modularizing inference and sentence generation steps in a recurrent model, we aim to make reasoning steps and their effects on next sentence generation transparent. Our automatic and manual evaluations show that the model generates better story sentences than SOTA baselines, especially in terms of coherence. We further demonstrate improved performance over strong pre-trained LMs in generating commonsense inference rules. The recursive nature of holds the potential for controlled generation of longer sequences.

pdf bib
From Paraphrasing to Semantic Parsing : Unsupervised Semantic Parsing via Synchronous Semantic Decoding
Shan Wu | Bo Chen | Chunlei Xin | Xianpei Han | Le Sun | Weipeng Zhang | Jiansong Chen | Fan Yang | Xunliang Cai

Semantic parsing is challenging due to the structure gap and the semantic gap between utterances and logical forms. In this paper, we propose an unsupervised semantic parsing method-Synchronous Semantic Decoding (SSD), which can simultaneously resolve the semantic gap and the structure gap by jointly leveraging paraphrasing and grammar-constrained decoding. Specifically, we reformulate semantic parsing as a constrained paraphrasing problem : given an utterance, our model synchronously generates its canonical utterancel and meaning representation. During synchronously decoding : the utterance paraphrasing is constrained by the structure of the logical form, therefore the canonical utterance can be paraphrased controlledly ; the semantic decoding is guided by the semantics of the canonical utterance, therefore its logical form can be generated unsupervisedly. Experimental results show that SSD is a promising approach and can achieve state-of-the-art unsupervised semantic parsing performance on multiple datasets.

pdf bib
Pre-training Universal Language Representation
Yian Li | Hai Zhao

Despite the well-developed cut-edge representation learning for language, most language representation models usually focus on specific levels of linguistic units. This work introduces universal language representation learning, i.e., embeddings of different levels of linguistic units or text with quite diverse lengths in a uniform vector space. We propose the training objective MiSAD that utilizes meaningful n-grams extracted from large unlabeled corpus by a simple but effective algorithm for pre-trained language models. Then we empirically verify that well designed pre-training scheme may effectively yield universal language representation, which will bring great convenience when handling multiple layers of linguistic objects in a unified way. Especially, our model achieves the highest accuracy on analogy tasks in different language levels and significantly improves the performance on downstream tasks in the GLUE benchmark and a question answering dataset.

pdf bib
A Cognitive Regularizer for Language Modeling
Jason Wei | Clara Meister | Ryan Cotterell

The uniform information density (UID) hypothesis, which posits that speakers behaving optimally tend to distribute information uniformly across a linguistic signal, has gained traction in psycholinguistics as an explanation for certain syntactic, morphological, and prosodic choices. In this work, we explore whether the UID hypothesis can be operationalized as an inductive bias for statistical language modeling. Specifically, we augment the canonical MLE objective for training language models with a regularizer that encodes UID. In experiments on ten languages spanning five language families, we find that using UID regularization consistently improves perplexity in language models, having a larger effect when training data is limited. Moreover, via an analysis of generated sequences, we find that UID-regularized language models have other desirable properties, e.g., they generate text that is more lexically diverse. Our results not only suggest that UID is a reasonable inductive bias for language modeling, but also provide an alternative validation of the UID hypothesis using modern-day NLP tools.

pdf bib
Lower Perplexity is Not Always Human-Like
Tatsuki Kuribayashi | Yohei Oseki | Takumi Ito | Ryo Yoshida | Masayuki Asahara | Kentaro Inui

In computational psycholinguistics, various language models have been evaluated against human reading behavior (e.g., eye movement) to build human-like computational models. However, most previous efforts have focused almost exclusively on English, despite the recent trend towards linguistic universal within the general community. In order to fill the gap, this paper investigates whether the established results in computational psycholinguistics can be generalized across languages. Specifically, we re-examine an established generalization the lower perplexity a language model has, the more human-like the language model is in Japanese with typologically different structures from English. Our experiments demonstrate that this established generalization exhibits a surprising lack of universality ; namely, lower perplexity is not always human-like. Moreover, this discrepancy between English and Japanese is further explored from the perspective of (non-)uniform information density. Overall, our results suggest that a cross-lingual evaluation will be necessary to construct human-like computational models.the lower perplexity a language model has, the more human-like the language model is— in Japanese with typologically different structures from English. Our experiments demonstrate that this established generalization exhibits a surprising lack of universality; namely, lower perplexity is not always human-like. Moreover, this discrepancy between English and Japanese is further explored from the perspective of (non-)uniform information density. Overall, our results suggest that a cross-lingual evaluation will be necessary to construct human-like computational models.

pdf bib
Obtaining Better Static Word Embeddings Using Contextual Embedding Models
Prakhar Gupta | Martin Jaggi

The advent of contextual word embeddings representations of words which incorporate semantic and syntactic information from their contexthas led to tremendous improvements on a wide variety of NLP tasks. However, recent contextual models have prohibitively high computational cost in many use-cases and are often hard to interpret. In this work, we demonstrate that our proposed distillation method, which is a simple extension of CBOW-based training, allows to significantly improve computational efficiency of NLP applications, while outperforming the quality of existing static embeddings trained from scratch as well as those distilled from previously proposed methods. As a side-effect, our approach also allows a fair comparison of both contextual and static embeddings via standard lexical evaluation tasks.

pdf bib
Meta-Learning with Variational Semantic Memory for Word Sense Disambiguation
Yingjun Du | Nithin Holla | Xiantong Zhen | Cees Snoek | Ekaterina Shutova

A critical challenge faced by supervised word sense disambiguation (WSD) is the lack of large annotated datasets with sufficient coverage of words in their diversity of senses. This inspired recent research on few-shot WSD using meta-learning. While such work has successfully applied meta-learning to learn new word senses from very few examples, its performance still lags behind its fully-supervised counterpart. Aiming to further close this gap, we propose a model of semantic memory for WSD in a meta-learning setting. Semantic memory encapsulates prior experiences seen throughout the lifetime of the model, which aids better generalization in limited data settings. Our model is based on hierarchical variational inference and incorporates an adaptive memory update rule via a hypernetwork. We show our model advances the state of the art in few-shot WSD, supports effective learning in extremely data scarce (e.g. one-shot) scenarios and produces meaning prototypes that capture similar senses of distinct words.

pdf bib
Text-Free Image-to-Speech Synthesis Using Learned Segmental Units
Wei-Ning Hsu | David Harwath | Tyler Miller | Christopher Song | James Glass

In this paper we present the first model for directly synthesizing fluent, natural-sounding spoken audio captions for images that does not require natural language text as an intermediate representation or source of supervision. Instead, we connect the image captioning module and the speech synthesis module with a set of discrete, sub-word speech units that are discovered with a self-supervised visual grounding task. We conduct experiments on the Flickr8k spoken caption dataset in addition to a novel corpus of spoken audio captions collected for the popular MSCOCO dataset, demonstrating that our generated captions also capture diverse visual semantics of the images they describe. We investigate several different intermediate speech representations, and empirically find that the representation must satisfy several important properties to serve as drop-in replacements for text.

pdf bib
CTFN : Hierarchical Learning for Multimodal Sentiment Analysis Using Coupled-Translation Fusion NetworkCTFN: Hierarchical Learning for Multimodal Sentiment Analysis Using Coupled-Translation Fusion Network
Jiajia Tang | Kang Li | Xuanyu Jin | Andrzej Cichocki | Qibin Zhao | Wanzeng Kong

Multimodal sentiment analysis is the challenging research area that attends to the fusion of multiple heterogeneous modalities. The main challenge is the occurrence of some missing modalities during the multimodal fusion procedure. However, the existing techniques require all modalities as input, thus are sensitive to missing modalities at predicting time. In this work, the coupled-translation fusion network (CTFN) is firstly proposed to model bi-direction interplay via couple learning, ensuring the robustness in respect to missing modalities. Specifically, the cyclic consistency constraint is presented to improve the translation performance, allowing us directly to discard decoder and only embraces encoder of Transformer. This could contribute to a much lighter model. Due to the couple learning, CTFN is able to conduct bi-direction cross-modality intercorrelation parallelly. Based on CTFN, a hierarchical architecture is further established to exploit multiple bi-direction translations, leading to double multimodal fusing embeddings compared with traditional translation methods. Moreover, the convolution block is utilized to further highlight explicit interactions among those translations. For evaluation, CTFN was verified on two multimodal benchmarks with extensive ablation studies. The experiments demonstrate that the proposed framework achieves state-of-the-art or often competitive performance. Additionally, CTFN still maintains robustness when considering missing modality.

pdf bib
Language Model Evaluation Beyond Perplexity
Clara Meister | Ryan Cotterell

We propose an alternate approach to quantifying how well language models learn natural language : we ask how well they match the statistical tendencies of natural language. To answer this question, we analyze whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained. We provide a frameworkpaired with significance testsfor evaluating the fit of language models to these trends. We find that neural language models appear to learn only a subset of the tendencies considered, but align much more closely with empirical trends than proposed theoretical distributions (when present). Further, the fit to different distributions is highly-dependent on both model architecture and generation strategy. As concrete examples, text generated under the nucleus sampling scheme adheres more closely to the typetoken relationship of natural language than text produced using standard ancestral sampling ; text from LSTMs reflects the natural language distributions over length, stopwords, and symbols surprisingly well.

pdf bib
Alignment Rationale for Natural Language Inference
Zhongtao Jiang | Yuanzhe Zhang | Zhao Yang | Jun Zhao | Kang Liu

Deep learning models have achieved great success on the task of Natural Language Inference (NLI), though only a few attempts try to explain their behaviors. Existing explanation methods usually pick prominent features such as words or phrases from the input text. However, for NLI, alignments among words or phrases are more enlightening clues to explain the model. To this end, this paper presents AREC, a post-hoc approach to generate alignment rationale explanations for co-attention based models in NLI. The explanation is based on feature selection, which keeps few but sufficient alignments while maintaining the same prediction of the target model. Experimental results show that our method is more faithful and human-readable compared with many existing approaches. We further study and re-evaluate three typical models through our explanation beyond accuracy, and propose a simple method that greatly improves the model robustness.

pdf bib
On Sample Based Explanation Methods for NLP : Faithfulness, Efficiency and Semantic EvaluationNLP: Faithfulness, Efficiency and Semantic Evaluation
Wei Zhang | Ziming Huang | Yada Zhu | Guangnan Ye | Xiaodong Cui | Fan Zhang

In the recent advances of natural language processing, the scale of the state-of-the-art models and datasets is usually extensive, which challenges the application of sample-based explanation methods in many aspects, such as explanation interpretability, efficiency, and faithfulness. In this work, for the first time, we can improve the interpretability of explanations by allowing arbitrary text sequences as the explanation unit. On top of this, we implement a hessian-free method with a model faithfulness guarantee. Finally, to compare our method with the others, we propose a semantic-based evaluation metric that can better align with humans’ judgment of explanations than the widely adopted diagnostic or re-training measures. The empirical results on multiple real data sets demonstrate the proposed method’s superior performance to popular explanation techniques such as Influence Function or TracIn on semantic evaluation.

pdf bib
Counterfactual Inference for Text Classification Debiasing
Chen Qian | Fuli Feng | Lijie Wen | Chunping Ma | Pengjun Xie

Today’s text classifiers inevitably suffer from unintended dataset biases, especially the document-level label bias and word-level keyword bias, which may hurt models’ generalization. Many previous studies employed data-level manipulations or model-level balancing mechanisms to recover unbiased distributions and thus prevent models from capturing the two types of biases. Unfortunately, they either suffer from the extra cost of data collection / selection / annotation or need an elaborate design of balancing strategies. Different from traditional factual inference in which debiasing occurs before or during training, counterfactual inference mitigates the influence brought by unintended confounders after training, which can make unbiased decisions with biased observations. Inspired by this, we propose a model-agnostic text classification debiasing framework Corsair, which can effectively avoid employing data manipulations or designing balancing mechanisms. Concretely, Corsair first trains a base model on a training set directly, allowing the dataset biases ‘poison’ the trained model. In inference, given a factual input document, Corsair imagines its two counterfactual counterparts to distill and mitigate the two biases captured by the poisonous model. Extensive experiments demonstrate Corsair’s effectiveness, generalizability and fairness.

pdf bib
HieRec : Hierarchical User Interest Modeling for Personalized News RecommendationHieRec: Hierarchical User Interest Modeling for Personalized News Recommendation
Tao Qi | Fangzhao Wu | Chuhan Wu | Peiru Yang | Yang Yu | Xing Xie | Yongfeng Huang

User interest modeling is critical for personalized news recommendation. Existing news recommendation methods usually learn a single user embedding for each user from their previous behaviors to represent their overall interest. However, user interest is usually diverse and multi-grained, which is difficult to be accurately modeled by a single user embedding. In this paper, we propose a news recommendation method with hierarchical user interest modeling, named HieRec. Instead of a single user embedding, in our method each user is represented in a hierarchical interest tree to better capture their diverse and multi-grained interest in news. We use a three-level hierarchy to represent 1) overall user interest ; 2) user interest in coarse-grained topics like sports ; and 3) user interest in fine-grained topics like football. Moreover, we propose a hierarchical user interest matching framework to match candidate news with different levels of user interest for more accurate user interest targeting. Extensive experiments on two real-world datasets validate our method can effectively improve the performance of user modeling for personalized news recommendation.

pdf bib
PP-Rec : News Recommendation with Personalized User Interest and Time-aware News PopularityPP-Rec: News Recommendation with Personalized User Interest and Time-aware News Popularity
Tao Qi | Fangzhao Wu | Chuhan Wu | Yongfeng Huang

Personalized news recommendation methods are widely used in online news services. These methods usually recommend news based on the matching between news content and user interest inferred from historical behaviors. However, these methods usually have difficulties in making accurate recommendations to cold-start users, and tend to recommend similar news with those users have read. In general, popular news usually contain important information and can attract users with different interests. Besides, they are usually diverse in content and topic. Thus, in this paper we propose to incorporate news popularity information to alleviate the cold-start and diversity problems for personalized news recommendation. In our method, the ranking score for recommending a candidate news to a target user is the combination of a personalized matching score and a news popularity score. The former is used to capture the personalized user interest in news. The latter is used to measure time-aware popularity of candidate news, which is predicted based on news content, recency, and real-time CTR using a unified framework. Besides, we propose a popularity-aware user encoder to eliminate the popularity bias in user behaviors for accurate interest modeling. Experiments on two real-world datasets show our method can effectively improve the accuracy and diversity for news recommendation.

pdf bib
BanditMTL : Bandit-based Multi-task Learning for Text ClassificationBanditMTL: Bandit-based Multi-task Learning for Text Classification
Yuren Mao | Zekai Wang | Weiwei Liu | Xuemin Lin | Wenbin Hu

Task variance regularization, which can be used to improve the generalization of Multi-task Learning (MTL) models, remains unexplored in multi-task text classification. Accordingly, to fill this gap, this paper investigates how the task might be effectively regularized, and consequently proposes a multi-task learning method based on adversarial multi-armed bandit. The proposed method, named BanditMTL, regularizes the task variance by means of a mirror gradient ascent-descent algorithm. Adopting BanditMTL in the multi-task text classification context is found to achieve state-of-the-art performance. The results of extensive experiments back up our theoretical analysis and validate the superiority of our proposals.

pdf bib
Exploring Distantly-Labeled Rationales in Neural Network Models
Quzhe Huang | Shengqi Zhu | Yansong Feng | Dongyan Zhao

Recent studies strive to incorporate various human rationales into neural networks to improve model performance, but few pay attention to the quality of the rationales. Most existing methods distribute their models’ focus to distantly-labeled rationale words entirely and equally, while ignoring the potential important non-rationale words and not distinguishing the importance of different rationale words. In this paper, we propose two novel auxiliary loss functions to make better use of distantly-labeled rationales, which encourage models to maintain their focus on important words beyond labeled rationales (PINs) and alleviate redundant training on non-helpful rationales (NoIRs). Experiments on two representative classification tasks show that our proposed methods can push a classification model to effectively learn crucial clues from non-perfect rationales while maintaining the ability to spread its focus to other unlabeled important words, thus significantly outperform existing methods.

pdf bib
A Human-machine Collaborative Framework for Evaluating Malevolence in Dialogues
Yangjun Zhang | Pengjie Ren | Maarten de Rijke

Conversational dialogue systems (CDSs) are hard to evaluate due to the complexity of natural language. Automatic evaluation of dialogues often shows insufficient correlation with human judgements. Human evaluation is reliable but labor-intensive. We introduce a human-machine collaborative framework, HMCEval, that can guarantee reliability of the evaluation outcomes with reduced human effort. HMCEval casts dialogue evaluation as a sample assignment problem, where we need to decide to assign a sample to a human or a machine for evaluation. HMCEval includes a model confidence estimation module to estimate the confidence of the predicted sample assignment, and a human effort estimation module to estimate the human effort should the sample be assigned to human evaluation, as well as a sample assignment execution module that finds the optimum assignment solution based on the estimated confidence and effort. We assess the performance of HMCEval on the task of evaluating malevolence in dialogues. The experimental results show that HMCEval achieves around 99 % evaluation accuracy with half of the human effort spared, showing that HMCEval provides reliable evaluation outcomes while reducing human effort by a large amount.

pdf bib
Learning to Ask Conversational Questions by Optimizing Levenshtein DistanceLevenshtein Distance
Zhongkun Liu | Pengjie Ren | Zhumin Chen | Zhaochun Ren | Maarten de Rijke | Ming Zhou

Conversational Question Simplification (CQS) aims to simplify self-contained questions into conversational ones by incorporating some conversational characteristics, e.g., anaphora and ellipsis. Existing maximum likelihood estimation based methods often get trapped in easily learned tokens as all tokens are treated equally during training. In this work, we introduce a Reinforcement Iterative Sequence Editing (RISE) framework that optimizes the minimum Levenshtein distance through explicit editing actions. RISE is able to pay attention to tokens that are related to conversational characteristics. To train RISE, we devise an Iterative Reinforce Training (IRT) algorithm with a Dynamic Programming based Sampling (DPS) process to improve exploration. Experimental results on two benchmark datasets show that RISE significantly outperforms state-of-the-art methods and generalizes well on unseen data.

pdf bib
DVD : A Diagnostic Dataset for Multi-step Reasoning in Video Grounded DialogueDVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue
Hung Le | Chinnadhurai Sankar | Seungwhan Moon | Ahmad Beirami | Alborz Geramifard | Satwik Kottur

A video-grounded dialogue system is required to understand both dialogue, which contains semantic dependencies from turn to turn, and video, which contains visual cues of spatial and temporal scene variations. Building such dialogue systems is a challenging problem, involving various reasoning types on both visual and language inputs. Existing benchmarks do not have enough annotations to thoroughly analyze dialogue systems and understand their capabilities and limitations in isolation. These benchmarks are also not explicitly designed to minimise biases that models can exploit without actual reasoning. To address these limitations, in this paper, we present DVD, a Diagnostic Dataset for Video-grounded Dialogue. The dataset is designed to contain minimal biases and has detailed annotations for the different types of reasoning over the spatio-temporal space of video. Dialogues are synthesized over multiple question turns, each of which is injected with a set of cross-turn semantic relationships. We use DVD to analyze existing approaches, providing interesting insights into their abilities and limitations. In total, DVD is built from 11k CATER synthetic videos and contains 10 instances of 10-round dialogues for each video, resulting in more than 100k dialogues and 1 M question-answer pairs. Our code and dataset are publicly available.

pdf bib
CoSQA : 20,000 + Web Queries for Code Search and Question AnsweringCoSQA: 20,000+ Web Queries for Code Search and Question Answering
Junjie Huang | Duyu Tang | Linjun Shou | Ming Gong | Ke Xu | Daxin Jiang | Ming Zhou | Nan Duan

Finding codes given natural language query is beneficial to the productivity of software developers. Future progress towards better semantic matching between query and code requires richer supervised training resources. To remedy this, we introduce CoSQA dataset. It includes 20,604 labels for pairs of natural language queries and codes, each annotated by at least 3 human annotators. We further introduce a contrastive learning method dubbed CoCLR to enhance text-code matching, which works as a data augmenter to bring more artificially generated training instances. We show that, evaluated on CodeXGLUE with the same CodeBERT model, training on CoSQA improves the accuracy of code question answering by 5.1 % and incorporating CoCLR brings a further improvement of 10.5 %.

pdf bib
Rewriter-Evaluator Architecture for Neural Machine Translation
Yangming Li | Kaisheng Yao

A few approaches have been developed to improve neural machine translation (NMT) models with multiple passes of decoding. However, their performance gains are limited because of lacking proper policies to terminate the multi-pass process. To address this issue, we introduce a novel architecture of Rewriter-Evaluator. Translating a source sentence involves multiple rewriting passes. In every pass, a rewriter generates a new translation to improve the past translation. Termination of this multi-pass process is determined by a score of translation quality estimated by an evaluator. We also propose prioritized gradient descent (PGD) to jointly and efficiently train the rewriter and the evaluator. Extensive experiments on three machine translation tasks show that our architecture notably improves the performances of NMT models and significantly outperforms prior methods. An oracle experiment reveals that it can largely reduce performance gaps to the oracle policy. Experiments confirm that the evaluator trained with PGD is more accurate than prior methods in determining proper numbers of rewriting.

pdf bib
Importance-based Neuron Allocation for Multilingual Neural Machine Translation
Wanying Xie | Yang Feng | Shuhao Gu | Dong Yu

Multilingual neural machine translation with a single model has drawn much attention due to its capability to deal with multiple languages. However, the current multilingual translation paradigm often makes the model tend to preserve the general knowledge, but ignore the language-specific knowledge. Some previous works try to solve this problem by adding various kinds of language-specific modules to the model, but they suffer from the parameter explosion problem and require specialized manual design. To solve these problems, we propose to divide the model neurons into general and language-specific parts based on their importance across languages. The general part is responsible for preserving the general knowledge and participating in the translation of all the languages, while the language-specific part is responsible for preserving the language-specific knowledge and participating in the translation of some specific languages. Experimental results on several language pairs, covering IWSLT and Europarl corpus datasets, demonstrate the effectiveness and universality of the proposed method.

pdf bib
Coreference Reasoning in Machine Reading Comprehension
Mingzhu Wu | Nafise Sadat Moosavi | Dan Roth | Iryna Gurevych

Coreference resolution is essential for natural language understanding and has been long studied in NLP. In recent years, as the format of Question Answering (QA) became a standard for machine reading comprehension (MRC), there have been data collection efforts, e.g., Dasigi et al. (2019), that attempt to evaluate the ability of MRC models to reason about coreference. However, as we show, coreference reasoning in MRC is a greater challenge than earlier thought ; MRC datasets do not reflect the natural distribution and, consequently, the challenges of coreference reasoning. Specifically, success on these datasets does not reflect a model’s proficiency in coreference reasoning. We propose a methodology for creating MRC datasets that better reflect the challenges of coreference reasoning and use it to create a sample evaluation set. The results on our dataset show that state-of-the-art models still struggle with these phenomena. Furthermore, we develop an effective way to use naturally occurring coreference phenomena from existing coreference resolution datasets when training MRC models. This allows us to show an improvement in the coreference reasoning abilities of state-of-the-art models.

pdf bib
A Unified Generative Framework for Various NER SubtasksNER Subtasks
Hang Yan | Tao Gui | Junqi Dai | Qipeng Guo | Zheng Zhang | Xipeng Qiu

Named Entity Recognition (NER) is the task of identifying spans that represent entities in sentences. Whether the entity spans are nested or discontinuous, the NER task can be categorized into the flat NER, nested NER, and discontinuous NER subtasks. These subtasks have been mainly solved by the token-level sequence labelling or span-level classification. However, these solutions can hardly tackle the three kinds of NER subtasks concurrently. To that end, we propose to formulate the NER subtasks as an entity span sequence generation task, which can be solved by a unified sequence-to-sequence (Seq2Seq) framework. Based on our unified framework, we can leverage the pre-trained Seq2Seq model to solve all three kinds of NER subtasks without the special design of the tagging schema or ways to enumerate spans. We exploit three types of entity representations to linearize entities into a sequence. Our proposed framework is easy-to-implement and achieves state-of-the-art (SoTA) or near SoTA performance on eight English NER datasets, including two flat NER datasets, three nested NER datasets, and three discontinuous NER datasets.

pdf bib
MulDA : A Multilingual Data Augmentation Framework for Low-Resource Cross-Lingual NERMulDA: A Multilingual Data Augmentation Framework for Low-Resource Cross-Lingual NER
Linlin Liu | Bosheng Ding | Lidong Bing | Shafiq Joty | Luo Si | Chunyan Miao

Named Entity Recognition (NER) for low-resource languages is a both practical and challenging research problem. This paper addresses zero-shot transfer for cross-lingual NER, especially when the amount of source-language training data is also limited. The paper first proposes a simple but effective labeled sequence translation method to translate source-language training data to target languages and avoids problems such as word order change and entity span determination. With the source-language data as well as the translated data, a generation-based multilingual data augmentation method is introduced to further increase diversity by generating synthetic labeled data in multiple languages. These augmented data enable the language model based NER models to generalize better with both the language-specific features from the target-language synthetic data and the language-independent features from multilingual synthetic data. An extensive set of experiments were conducted to demonstrate encouraging cross-lingual transfer performance of the new research on a wide variety of target languages.

pdf bib
Math Word Problem Solving with Explicit Numerical Values
Qinzhuo Wu | Qi Zhang | Zhongyu Wei | Xuanjing Huang

In recent years, math word problem solving has received considerable attention and achieved promising results, but previous methods rarely take numerical values into consideration. Most methods treat the numerical values in the problems as number symbols, and ignore the prominent role of the numerical values in solving the problem. In this paper, we propose a novel approach called NumS2 T, which enhances math word problem solving performance by explicitly incorporating numerical values into a sequence-to-tree network. In addition, a numerical properties prediction mechanism is used to capture the category and comparison information of numerals and measure their importance in global expressions. Experimental results on the Math23 K and APE datasets demonstrate that our model achieves better performance than existing state-of-the-art models.

pdf bib
SMedBERT : A Knowledge-Enhanced Pre-trained Language Model with Structured Semantics for Medical Text MiningSMedBERT: A Knowledge-Enhanced Pre-trained Language Model with Structured Semantics for Medical Text Mining
Taolin Zhang | Zerui Cai | Chengyu Wang | Minghui Qiu | Bite Yang | Xiaofeng He

Recently, the performance of Pre-trained Language Models (PLMs) has been significantly improved by injecting knowledge facts to enhance their abilities of language understanding. For medical domains, the background knowledge sources are especially useful, due to the massive medical terms and their complicated relations are difficult to understand in text. In this work, we introduce SMedBERT, a medical PLM trained on large-scale medical corpora, incorporating deep structured semantic knowledge from neighbours of linked-entity. In SMedBERT, the mention-neighbour hybrid attention is proposed to learn heterogeneous-entity information, which infuses the semantic representations of entity types into the homogeneous neighbouring entity structure. Apart from knowledge integration as external features, we propose to employ the neighbors of linked-entities in the knowledge graph as additional global contexts of text mentions, allowing them to communicate via shared neighbors, thus enrich their semantic representations. Experiments demonstrate that SMedBERT significantly outperforms strong baselines in various knowledge-intensive Chinese medical tasks. It also improves the performance of other tasks such as question answering, question matching and natural language inference.

pdf bib
Controversy and Conformity : from Generalized to Personalized Aggressiveness Detection
Kamil Kanclerz | Alicja Figas | Marcin Gruza | Tomasz Kajdanowicz | Jan Kocon | Daria Puchalska | Przemyslaw Kazienko

There is content such as hate speech, offensive, toxic or aggressive documents, which are perceived differently by their consumers. They are commonly identified using classifiers solely based on textual content that generalize pre-agreed meanings of difficult problems. Such models provide the same results for each user, which leads to high misclassification rate observable especially for contentious, aggressive documents. Both document controversy and user nonconformity require new solutions. Therefore, we propose novel personalized approaches that respect individual beliefs expressed by either user conformity-based measures or various embeddings of their previous text annotations. We found that only a few annotations of most controversial documents are enough for all our personalization methods to significantly outperform classic, generalized solutions. The more controversial the content, the greater the gain. The personalized solutions may be used to efficiently filter unwanted aggressive content in the way adjusted to a given person.

pdf bib
Instantaneous Grammatical Error Correction with Shallow Aggressive Decoding
Xin Sun | Tao Ge | Furu Wei | Houfeng Wang

In this paper, we propose Shallow Aggressive Decoding (SAD) to improve the online inference efficiency of the Transformer for instantaneous Grammatical Error Correction (GEC). SAD optimizes the online inference efficiency for GEC by two innovations : 1) it aggressively decodes as many tokens as possible in parallel instead of always decoding only one token in each step to improve computational parallelism ; 2) it uses a shallow decoder instead of the conventional Transformer architecture with balanced encoder-decoder depth to reduce the computational cost during inference. Experiments in both English and Chinese GEC benchmarks show that aggressive decoding could yield identical predictions to greedy decoding but with significant speedup for online inference. Its combination with the shallow decoder could offer an even higher online inference speedup over the powerful Transformer baseline without quality loss. Not only does our approach allow a single model to achieve the state-of-the-art results in English GEC benchmarks : 66.4 F0.5 in the CoNLL-14 and 72.9 F0.5 in the BEA-19 test set with an almost 10x online inference speedup over the Transformer-big model, but also it is easily adapted to other languages. Our code is available at

pdf bib
PHMOSpell : Phonological and Morphological Knowledge Guided Chinese Spelling CheckPHMOSpell: Phonological and Morphological Knowledge Guided Chinese Spelling Check
Li Huang | Junjie Li | Weiwei Jiang | Zhiyu Zhang | Minchuan Chen | Shaojun Wang | Jing Xiao

Chinese Spelling Check (CSC) is a challenging task due to the complex characteristics of Chinese characters. Statistics reveal that most Chinese spelling errors belong to phonological or visual errors. However, previous methods rarely utilize phonological and morphological knowledge of Chinese characters or heavily rely on external resources to model their similarities. To address the above issues, we propose a novel end-to-end trainable model called PHMOSpell, which promotes the performance of CSC with multi-modal information. Specifically, we derive pinyin and glyph representations for Chinese characters from audio and visual modalities respectively, which are integrated into a pre-trained language model by a well-designed adaptive gating mechanism. To verify its effectiveness, we conduct comprehensive experiments and ablation tests. Experimental results on three shared benchmarks demonstrate that our model consistently outperforms previous state-of-the-art models.

pdf bib
Improving Encoder by Auxiliary Supervision Tasks for Table-to-Text Generation
Liang Li | Can Ma | Yinliang Yue | Dayong Hu

Table-to-text generation aims at automatically generating natural text to help people conveniently obtain salient information in tables. Although neural models for table-to-text have achieved remarkable progress, some problems are still overlooked. Previous methods can not deduce the factual results from the entity’s (player or team) performance and the relations between entities. To solve this issue, we first build an entity graph from the input tables and introduce a reasoning module to perform reasoning on the graph. Moreover, there are different relations (e.g., the numeric size relation and the importance relation) between records in different dimensions. And these relations may contribute to the data-to-text generation. However, it is hard for a vanilla encoder to capture these. Consequently, we propose to utilize two auxiliary tasks, Number Ranking (NR) and Importance Ranking (IR), to supervise the encoder to capture the different relations. Experimental results on ROTOWIRE and RW-FG show that our method not only has a good generalization but also outperforms previous methods on several metrics : BLEU, Content Selection, Content Ordering.

pdf bib
Bridging Subword Gaps in Pretrain-Finetune Paradigm for Natural Language Generation
Xin Liu | Baosong Yang | Dayiheng Liu | Haibo Zhang | Weihua Luo | Min Zhang | Haiying Zhang | Jinsong Su

A well-known limitation in pretrain-finetune paradigm lies in its inflexibility caused by the one-size-fits-all vocabulary. This potentially weakens the effect when applying pretrained models into natural language generation (NLG) tasks, especially for the subword distributions between upstream and downstream tasks with significant discrepancy. Towards approaching this problem, we extend the vanilla pretrain-finetune pipeline with an extra embedding transfer step. Specifically, a plug-and-play embedding generator is introduced to produce the representation of any input token, according to pre-trained embeddings of its morphologically similar ones. Thus, embeddings of mismatch tokens in downstream tasks can also be efficiently initialized. We conduct experiments on a variety of NLG tasks under the pretrain-finetune fashion. Experimental results and extensive analyses show that the proposed strategy offers us opportunities to feel free to transfer the vocabulary, leading to more efficient and better performed downstream NLG models.

pdf bib
Capturing Relations between Scientific Papers : An Abstractive Model for Related Work Section Generation
Xiuying Chen | Hind Alamro | Mingzhe Li | Shen Gao | Xiangliang Zhang | Dongyan Zhao | Rui Yan

Given a set of related publications, related work section generation aims to provide researchers with an overview of the specific research area by summarizing these works and introducing them in a logical order. Most of existing related work generation models follow the inflexible extractive style, which directly extract sentences from multiple original papers to form a related work discussion. Hence, in this paper, we propose a Relation-aware Related work Generator (RRG), which generates an abstractive related work from the given multiple scientific papers in the same research area. Concretely, we propose a relation-aware multi-document encoder that relates one document to another according to their content dependency in a relation graph. The relation graph and the document representation are interacted and polished iteratively, complementing each other in the training process. We also contribute two public datasets composed of related work sections and their corresponding papers. Extensive experiments on the two datasets show that the proposed model brings substantial improvements over several strong baselines. We hope that this work will promote advances in related work generation task.

pdf bib
PhotoChat : A Human-Human Dialogue Dataset With Photo Sharing Behavior For Joint Image-Text ModelingPhotoChat: A Human-Human Dialogue Dataset With Photo Sharing Behavior For Joint Image-Text Modeling
Xiaoxue Zang | Lijuan Liu | Maria Wang | Yang Song | Hao Zhang | Jindong Chen

We present a new human-human dialogue dataset-PhotoChat, the first dataset that casts light on the photo sharing behavior in online messaging. PhotoChat contains 12k dialogues, each of which is paired with a user photo that is shared during the conversation. Based on this dataset, we propose two tasks to facilitate research on image-text modeling : a photo-sharing intent prediction task that predicts whether one intends to share a photo in the next conversation turn, and a photo retrieval task that retrieves the most relevant photo according to the dialogue context. In addition, for both tasks, we provide baseline models using the state-of-the-art models and report their benchmark performances. The best image retrieval model achieves 10.4 % recall@1 (out of 1000 candidates) and the best photo intent prediction model achieves 58.1 % F1 score, indicating that the dataset presents interesting yet challenging real-world problems. We are releasing PhotoChat to facilitate future research work among the community.

pdf bib
Good for Misconceived Reasons : An Empirical Revisiting on the Need for Visual Context in Multimodal Machine Translation
Zhiyong Wu | Lingpeng Kong | Wei Bi | Xiang Li | Ben Kao

A neural multimodal machine translation (MMT) system is one that aims to perform better translation by extending conventional text-only translation models with multimodal information. Many recent studies report improvements when equipping their models with the multimodal module, despite the controversy of whether such improvements indeed come from the multimodal part. We revisit the contribution of multimodal information in MMT by devising two interpretable MMT models. To our surprise, although our models replicate similar gains as recently developed multimodal-integrated systems achieved, our models learn to ignore the multimodal information. Upon further investigation, we discover that the improvements achieved by the multimodal models over text-only counterparts are in fact results of the regularization effect. We report empirical findings that highlight the importance of MMT models’ interpretability, and discuss how our findings will benefit future research.

pdf bib
BERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity RecognitionBERTifying the Hidden Markov Model for Multi-Source Weakly Supervised Named Entity Recognition
Yinghao Li | Pranav Shetty | Lucas Liu | Chao Zhang | Le Song

We study the problem of learning a named entity recognition (NER) tagger using noisy labels from multiple weak supervision sources. Though cheap to obtain, the labels from weak supervision sources are often incomplete, inaccurate, and contradictory, making it difficult to learn an accurate NER model. To address this challenge, we propose a conditional hidden Markov model (CHMM), which can effectively infer true labels from multi-source noisy labels in an unsupervised way. CHMM enhances the classic hidden Markov model with the contextual representation power of pre-trained language models. Specifically, CHMM learns token-wise transition and emission probabilities from the BERT embeddings of the input tokens to infer the latent true labels from noisy observations. We further refine CHMM with an alternate-training approach (CHMM-ALT). It fine-tunes a BERT-NER model with the labels inferred by CHMM, and this BERT-NER’s output is regarded as an additional weak source to train the CHMM in return. Experiments on four NER benchmarks from various domains show that our method outperforms state-of-the-art weakly supervised NER models by wide margins.

pdf bib
CIL : Contrastive Instance Learning Framework for Distantly Supervised Relation ExtractionCIL: Contrastive Instance Learning Framework for Distantly Supervised Relation Extraction
Tao Chen | Haizhou Shi | Siliang Tang | Zhigang Chen | Fei Wu | Yueting Zhuang

The journey of reducing noise from distant supervision (DS) generated training data has been started since the DS was first introduced into the relation extraction (RE) task. For the past decade, researchers apply the multi-instance learning (MIL) framework to find the most reliable feature from a bag of sentences. Although the pattern of MIL bags can greatly reduce DS noise, it fails to represent many other useful sentence features in the datasets. In many cases, these sentence features can only be acquired by extra sentence-level human annotation with heavy costs. Therefore, the performance of distantly supervised RE models is bounded. In this paper, we go beyond typical MIL framework and propose a novel contrastive instance learning (CIL) framework. Specifically, we regard the initial MIL as the relational triple encoder and constraint positive pairs against negative pairs for each instance. Experiments demonstrate the effectiveness of our proposed framework, with significant improvements over the previous methods on NYT10, GDS and KBP.

pdf bib
An End-to-End Progressive Multi-Task Learning Framework for Medical Named Entity Recognition and Normalization
Baohang Zhou | Xiangrui Cai | Ying Zhang | Xiaojie Yuan

Medical named entity recognition (NER) and normalization (NEN) are fundamental for constructing knowledge graphs and building QA systems. Existing implementations for medical NER and NEN are suffered from the error propagation between the two tasks. The mispredicted mentions from NER will directly influence the results of NEN. Therefore, the NER module is the bottleneck of the whole system. Besides, the learnable features for both tasks are beneficial to improving the model performance. To avoid the disadvantages of existing models and exploit the generalized representation across the two tasks, we design an end-to-end progressive multi-task learning model for jointly modeling medical NER and NEN in an effective way. There are three level tasks with progressive difficulty in the framework. The progressive tasks can reduce the error propagation with the incremental task settings which implies the lower level tasks gain the supervised signals other than errors from the higher level tasks to improve their performances. Besides, the context features are exploited to enrich the semantic information of entity mentions extracted by NER. The performance of NEN profits from the enhanced entity mention features. The standard entities from knowledge bases are introduced into the NER module for extracting corresponding entity mentions correctly. The empirical results on two publicly available medical literature datasets demonstrate the superiority of our method over nine typical methods.

pdf bib
Learning from Miscellaneous Other-Class Words for Few-shot Named Entity Recognition
Meihan Tong | Shuai Wang | Bin Xu | Yixin Cao | Minghui Liu | Lei Hou | Juanzi Li

Few-shot Named Entity Recognition (NER) exploits only a handful of annotations to iden- tify and classify named entity mentions. Pro- totypical network shows superior performance on few-shot NER. However, existing prototyp- ical methods fail to differentiate rich seman- tics in other-class words, which will aggravate overfitting under few shot scenario. To address the issue, we propose a novel model, Mining Undefined Classes from Other-class (MUCO), that can automatically induce different unde- fined classes from the other class to improve few-shot NER. With these extra-labeled unde- fined classes, our method will improve the dis- criminative ability of NER classifier and en- hance the understanding of predefined classes with stand-by semantic knowledge. Experi- mental results demonstrate that our model out- performs five state-of-the-art models in both 1- shot and 5-shots settings on four NER bench- marks. We will release the code upon accep- tance. The source code is released on https : // / shuaiwa16 / OtherClassNER.git.

pdf bib
Multi-Label Few-Shot Learning for Aspect Category Detection
Mengting Hu | Shiwan Zhao | Honglei Guo | Chao Xue | Hang Gao | Tiegang Gao | Renhong Cheng | Zhong Su

Aspect category detection (ACD) in sentiment analysis aims to identify the aspect categories mentioned in a sentence. In this paper, we formulate ACD in the few-shot learning scenario. However, existing few-shot learning approaches mainly focus on single-label predictions. These methods can not work well for the ACD task since a sentence may contain multiple aspect categories. Therefore, we propose a multi-label few-shot learning method based on the prototypical network. To alleviate the noise, we design two effective attention mechanisms. The support-set attention aims to extract better prototypes by removing irrelevant aspects. The query-set attention computes multiple prototype-specific representations for each query instance, which are then used to compute accurate distances with the corresponding prototypes. To achieve multi-label inference, we further learn a dynamic threshold per instance by a policy network. Extensive experimental results on three datasets demonstrate that the proposed method significantly outperforms strong baselines.

pdf bib
Argument Pair Extraction via Attention-guided Multi-Layer Multi-Cross Encoding
Liying Cheng | Tianyu Wu | Lidong Bing | Luo Si

Argument pair extraction (APE) is a research task for extracting arguments from two passages and identifying potential argument pairs. Prior research work treats this task as a sequence labeling problem and a binary classification problem on two passages that are directly concatenated together, which has a limitation of not fully utilizing the unique characteristics and inherent relations of two different passages. This paper proposes a novel attention-guided multi-layer multi-cross encoding scheme to address the challenges. The new model processes two passages with two individual sequence encoders and updates their representations using each other’s representations through attention. In addition, the pair prediction part is formulated as a table-filling problem by updating the representations of two sequences’ Cartesian product. Furthermore, an auxiliary attention loss is introduced to guide each argument to align to its paired argument. An extensive set of experiments show that the new model significantly improves the APE performance over several alternatives.

pdf bib
OpenMEVA : A Benchmark for Evaluating Open-ended Story Generation MetricsOpenMEVA: A Benchmark for Evaluating Open-ended Story Generation Metrics
Jian Guan | Zhexin Zhang | Zhuoer Feng | Zitao Liu | Wenbiao Ding | Xiaoxi Mao | Changjie Fan | Minlie Huang

Automatic metrics are essential for developing natural language generation (NLG) models, particularly for open-ended language generation tasks such as story generation. However, existing automatic metrics are observed to correlate poorly with human evaluation. The lack of standardized benchmark datasets makes it difficult to fully evaluate the capabilities of a metric and fairly compare different metrics. Therefore, we propose OpenMEVA, a benchmark for evaluating open-ended story generation metrics. OpenMEVA provides a comprehensive test suite to assess the capabilities of metrics, including (a) the correlation with human judgments, (b) the generalization to different model outputs and datasets, (c) the ability to judge story coherence, and (d) the robustness to perturbations. To this end, OpenMEVA includes both manually annotated stories and auto-constructed test examples. We evaluate existing metrics on OpenMEVA and observe that they have poor correlation with human judgments, fail to recognize discourse-level incoherence, and lack inferential knowledge (e.g., causal order between events), the generalization ability and robustness. Our study presents insights for developing NLG models and metrics in further research.

pdf bib
Selective Knowledge Distillation for Neural Machine Translation
Fusheng Wang | Jianhao Yan | Fandong Meng | Jie Zhou

Neural Machine Translation (NMT) models achieve state-of-the-art performance on many translation benchmarks. As an active research field in NMT, knowledge distillation is widely applied to enhance the model’s performance by transferring teacher model’s knowledge on each training sample. However, previous work rarely discusses the different impacts and connections among these samples, which serve as the medium for transferring teacher knowledge. In this paper, we design a novel protocol that can effectively analyze the different impacts of samples by comparing various samples’ partitions. Based on above protocol, we conduct extensive experiments and find that the teacher’s knowledge is not the more, the better. Knowledge over specific samples may even hurt the whole performance of knowledge distillation. Finally, to address these issues, we propose two simple yet effective strategies, i.e., batch-level and global-level selections, to pick suitable samples for distillation. We evaluate our approaches on two large-scale machine translation tasks, WMT’14 English-German and WMT’19 Chinese-English. Experimental results show that our approaches yield up to +1.28 and +0.89 BLEU points improvements over the Transformer baseline, respectively.

pdf bib
Measuring and Increasing Context Usage in Context-Aware Machine Translation
Patrick Fernandes | Kayo Yin | Graham Neubig | André F. T. Martins

Recent work in neural machine translation has demonstrated both the necessity and feasibility of using inter-sentential context, context from sentences other than those currently being translated. However, while many current methods present model architectures that theoretically can use this extra context, it is often not clear how much they do actually utilize it at translation time. In this paper, we introduce a new metric, conditional cross-mutual information, to quantify usage of context by these models. Using this metric, we measure how much document-level machine translation systems use particular varieties of context. We find that target context is referenced more than source context, and that including more context has a diminishing affect on results. We then introduce a new, simple training method, context-aware word dropout, to increase the usage of context by context-aware models. Experiments show that our method not only increases context usage, but also improves the translation quality according to metrics such as BLEU and COMET, as well as performance on anaphoric pronoun resolution and lexical cohesion contrastive datasets.

pdf bib
Length-Adaptive Transformer : Train Once with Length Drop, Use Anytime with Search
Gyuwan Kim | Kyunghyun Cho

Despite transformers’ impressive accuracy, their computational cost is often prohibitive to use with limited computational resources. Most previous approaches to improve inference efficiency require a separate model for each possible computational budget. In this paper, we extend PoWER-BERT (Goyal et al., 2020) and propose Length-Adaptive Transformer that can be used for various inference scenarios after one-shot training. We train a transformer with LengthDrop, a structural variant of dropout, which stochastically determines a sequence length at each layer. We then conduct a multi-objective evolutionary search to find a length configuration that maximizes the accuracy and minimizes the efficiency metric under any given computational budget. Additionally, we significantly extend the applicability of PoWER-BERT beyond sequence-level classification into token-level classification with Drop-and-Restore process that drops word-vectors temporarily in intermediate layers and restores at the last layer if necessary. We empirically verify the utility of the proposed approach by demonstrating the superior accuracy-efficiency trade-off under various setups, including span-based question answering and text classification. Code is available at

pdf bib
Regression Bugs Are In Your Model ! Measuring, Reducing and Analyzing Regressions In NLP Model UpdatesNLP Model Updates
Yuqing Xie | Yi-An Lai | Yuanjun Xiong | Yi Zhang | Stefano Soatto

Behavior of deep neural networks can be inconsistent between different versions. Regressions during model update are a common cause of concern that often over-weigh the benefits in accuracy or efficiency gain. This work focuses on quantifying, reducing and analyzing regression errors in the NLP model updates. Using negative flip rate as regression measure, we show that regression has a prevalent presence across tasks in the GLUE benchmark. We formulate the regression-free model updates into a constrained optimization problem, and further reduce it into a relaxed form which can be approximately optimized through knowledge distillation training method. We empirically analyze how model ensemble reduces regression. Finally, we conduct CheckList behavioral testing to understand the distribution of regressions across linguistic phenomena, and the efficacy of ensemble and distillation methods.

pdf bib
On the Efficacy of Adversarial Data Collection for Question Answering : Results from a Large-Scale Randomized Study
Divyansh Kaushik | Douwe Kiela | Zachary C. Lipton | Wen-tau Yih

In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions. Researchers hope that models trained on these more challenging datasets will rely less on superficial patterns, and thus be less brittle. However, despite ADC’s intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models. In this paper, we conduct a large-scale controlled study focused on question answering, assigning workers at random to compose questions either (i) adversarially (with a model in the loop) ; or (ii) in the standard fashion (without a model). Across a variety of models and datasets, we find that models trained on adversarial data usually perform better on other adversarial datasets but worse on a diverse collection of out-of-domain evaluation sets. Finally, we provide a qualitative analysis of adversarial (vs standard) data, identifying key differences and offering guidance for future research.

pdf bib
Question Answering Over Temporal Knowledge Graphs
Apoorv Saxena | Soumen Chakrabarti | Partha Talukdar

Temporal Knowledge Graphs (Temporal KGs) extend regular Knowledge Graphs by providing temporal scopes (start and end times) on each edge in the KG. While Question Answering over KG (KGQA) has received some attention from the research community, QA over Temporal KGs (Temporal KGQA) is a relatively unexplored area. Lack of broad coverage datasets has been another factor limiting progress in this area. We address this challenge by presenting CRONQUESTIONS, the largest known Temporal KGQA dataset, clearly stratified into buckets of structural complexity. CRONQUESTIONS expands the only known previous dataset by a factor of 340x. We find that various state-of-the-art KGQA methods fall far short of the desired performance on this new dataset. In response, we also propose CRONKGQA, a transformer-based solution that exploits recent advances in Temporal KG embeddings, and achieves performance superior to all baselines, with an increase of 120 % in accuracy over the next best performing method. Through extensive experiments, we give detailed insights into the workings of CRONKGQA, as well as situations where significant further improvements appear possible. In addition to the dataset, we have released our code as well.

pdf bib
Language Model Augmented Relevance Score
Ruibo Liu | Jason Wei | Soroush Vosoughi

Although automated metrics are commonly used to evaluate NLG systems, they often correlate poorly with human judgements. Newer metrics such as BERTScore have addressed many weaknesses in prior metrics such as BLEU and ROUGE, which rely on n-gram matching. These newer methods, however, are still limited in that they do not consider the generation context, so they can not properly reward generated text that is correct but deviates from the given reference. In this paper, we propose Language Model Augmented Relevance Score (MARS), a new context-aware metric for NLG evaluation. MARS leverages off-the-shelf language models, guided by reinforcement learning, to create augmented references that consider both the generation context and available human references, which are then used as additional references to score generated text. Compared with seven existing metrics in three common NLG tasks, MARS not only achieves higher correlation with human reference judgements, but also differentiates well-formed candidates from adversarial samples to a larger degree.

pdf bib
DExperts : Decoding-Time Controlled Text Generation with Experts and Anti-ExpertsDExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts
Alisa Liu | Maarten Sap | Ximing Lu | Swabha Swayamdipta | Chandra Bhagavatula | Noah A. Smith | Yejin Choi

Despite recent advances in natural language generation, it remains challenging to control attributes of generated text. We propose DExperts : Decoding-time Experts, a decoding-time method for controlled text generation that combines a pretrained language model with expert LMs and/or anti-expert LMs in a product of experts. Intuitively, under the ensemble, tokens only get high probability if they are considered likely by the experts, and unlikely by the anti-experts. We apply DExperts to language detoxification and sentiment-controlled generation, where we outperform existing controllable generation methods on both automatic and human evaluations. Moreover, because DExperts operates only on the output of the pretrained LM, it is effective with (anti-)experts of smaller size, including when operating on GPT-3. Our work highlights the promise of tuning small LMs on text with (un)desirable attributes for efficient decoding-time steering.

pdf bib
Polyjuice : Generating Counterfactuals for Explaining, Evaluating, and Improving Models
Tongshuang Wu | Marco Tulio Ribeiro | Jeffrey Heer | Daniel Weld

While counterfactual examples are useful for analysis and training of NLP models, current generation methods either rely on manual labor to create very few counterfactuals, or only instantiate limited types of perturbations such as paraphrases or word substitutions. We present Polyjuice, a general-purpose counterfactual generator that allows for control over perturbation types and locations, trained by finetuning GPT-2 on multiple datasets of paired sentences. We show that Polyjuice produces diverse sets of realistic counterfactuals, which in turn are useful in various distinct applications : improving training and evaluation on three different tasks (with around 70 % less annotation effort than manual generation), augmenting state-of-the-art explanation techniques, and supporting systematic counterfactual error analysis by revealing behaviors easily missed by human experts.

pdf bib
Mid-Air Hand Gestures for Post-Editing of Machine Translation
Rashad Albo Jamara | Nico Herbig | Antonio Krüger | Josef van Genabith

To translate large volumes of text in a globally connected world, more and more translators are integrating machine translation (MT) and post-editing (PE) into their translation workflows to generate publishable quality translations. While this process has been shown to save time and reduce errors, the task of translation is changing from mostly text production from scratch to fixing errors within useful but partly incorrect MT output. This is affecting the interface design of translation tools, where better support for text editing tasks is required. Here, we present the first study that investigates the usefulness of mid-air hand gestures in combination with the keyboard (GK) for text editing in PE of MT. Guided by a gesture elicitation study with 14 freelance translators, we develop a prototype supporting mid-air hand gestures for cursor placement, text selection, deletion, and reordering. These gestures combined with the keyboard facilitate all editing types required for PE. An evaluation of the prototype shows that the average editing duration of GK is only slightly slower than the standard mouse and keyboard (MK), even though participants are very familiar with the latter, and relative novices to the former. Furthermore, the qualitative analysis shows positive attitudes towards hand gestures for PE, especially when manipulating single words.

pdf bib
Joint Verification and Reranking for Open Fact Checking Over Tables
Michael Sejr Schlichtkrull | Vladimir Karpukhin | Barlas Oguz | Mike Lewis | Wen-tau Yih | Sebastian Riedel

Structured information is an important knowledge source for automatic verification of factual claims. Nevertheless, the majority of existing research into this task has focused on textual data, and the few recent inquiries into structured data have been for the closed-domain setting where appropriate evidence for each claim is assumed to have already been retrieved. In this paper, we investigate verification over structured data in the open-domain setting, introducing a joint reranking-and-verification model which fuses evidence documents in the verification component. Our open-domain model achieves performance comparable to the closed-domain state-of-the-art on the TabFact dataset, and demonstrates performance gains from the inclusion of multiple tables as well as a significant improvement over a heuristic retrieval baseline.

pdf bib
Verb Knowledge Injection for Multilingual Event Processing
Olga Majewska | Ivan Vulić | Goran Glavaš | Edoardo Maria Ponti | Anna Korhonen

Linguistic probing of pretrained Transformer-based language models (LMs) revealed that they encode a range of syntactic and semantic properties of a language. However, they are still prone to fall back on superficial cues and simple heuristics to solve downstream tasks, rather than leverage deeper linguistic information. In this paper, we target a specific facet of linguistic knowledge, the interplay between verb meaning and argument structure. We investigate whether injecting explicit information on verbs’ semantic-syntactic behaviour improves the performance of pretrained LMs in event extraction tasks, where accurate verb processing is paramount. Concretely, we impart the verb knowledge from curated lexical resources into dedicated adapter modules (verb adapters), allowing it to complement, in downstream tasks, the language knowledge obtained during LM-pretraining. We first demonstrate that injecting verb knowledge leads to performance gains in English event extraction. We then explore the utility of verb adapters for event extraction in other languages : we investigate 1) zero-shot language transfer with multilingual Transformers and 2) transfer via (noisy automatic) translation of English verb-based lexical knowledge. Our results show that the benefits of verb knowledge injection indeed extend to other languages, even when relying on noisily translated lexical knowledge.

pdf bib
Lexical Semantic Change Discovery
Sinan Kurtyigit | Maike Park | Dominik Schlechtweg | Jonas Kuhn | Sabine Schulte im Walde

While there is a large amount of research in the field of Lexical Semantic Change Detection, only few approaches go beyond a standard benchmark evaluation of existing models. In this paper, we propose a shift of focus from change detection to change discovery, i.e., discovering novel word senses over time from the full corpus vocabulary. By heavily fine-tuning a type-based and a token-based approach on recently published German data, we demonstrate that both models can successfully be applied to discover new words undergoing meaning change. Furthermore, we provide an almost fully automated framework for both evaluation and discovery.

pdf bib
The R-U-A-Robot Dataset : Helping Avoid Chatbot Deception by Detecting User Questions About Human or Non-Human IdentityR-U-A-Robot Dataset: Helping Avoid Chatbot Deception by Detecting User Questions About Human or Non-Human Identity
David Gros | Yu Li | Zhou Yu

Humans are increasingly interacting with machines through language, sometimes in contexts where the user may not know they are talking to a machine (like over the phone or a text chatbot). We aim to understand how system designers and researchers might allow their systems to confirm its non-human identity. We collect over 2,500 phrasings related to the intent of Are you a robot?. This is paired with over 2,500 adversarially selected utterances where only confirming the system is non-human would be insufficient or disfluent. We compare classifiers to recognize the intent and discuss the precision / recall and model complexity tradeoffs. Such classifiers could be integrated into dialog systems to avoid undesired deception. We then explore how both a generative research model (Blender) as well as two deployed systems (Amazon Alexa, Google Assistant) handle this intent, finding that systems often fail to confirm their non-human identity. Finally, we try to understand what a good response to the intent would be, and conduct a user study to compare the important aspects when responding to this intent.

pdf bib
Using Meta-Knowledge Mined from Identifiers to Improve Intent Recognition in Conversational Systems
Claudio Pinhanez | Paulo Cavalin | Victor Henrique Alves Ribeiro | Ana Appel | Heloisa Candello | Julio Nogima | Mauro Pichiliani | Melina Guerra | Maira de Bayser | Gabriel Malfatti | Henrique Ferreira

In this paper we explore the improvement of intent recognition in conversational systems by the use of meta-knowledge embedded in intent identifiers. Developers often include such knowledge, structure as taxonomies, in the documentation of chatbots. By using neuro-symbolic algorithms to incorporate those taxonomies into embeddings of the output space, we were able to improve accuracy in intent recognition. In datasets with intents and example utterances from 200 professional chatbots, we saw decreases in the equal error rate (EER) in more than 40 % of the chatbots in comparison to the baseline of the same algorithm without the meta-knowledge. The meta-knowledge proved also to be effective in detecting out-of-scope utterances, improving the false acceptance rate (FAR) in two thirds of the chatbots, with decreases of 0.05 or more in FAR in almost 40 % of the chatbots. When considering only the well-developed workspaces with a high level use of taxonomies, FAR decreased more than 0.05 in 77 % of them, and more than 0.1 in 39 % of the chatbots.

pdf bib
ARBERT & MARBERT : Deep Bidirectional Transformers for ArabicARBERT & MARBERT: Deep Bidirectional Transformers for Arabic
Muhammad Abdul-Mageed | AbdelRahim Elmadany | El Moatez Billah Nagoudi

Pre-trained language models (LMs) are currently integral to many natural language processing systems. Although multilingual LMs were also introduced to serve many languages, these have limitations such as being costly at inference time and the size and diversity of non-English data involved in their pre-training. We remedy these issues for a collection of diverse Arabic varieties by introducing two powerful deep bidirectional transformer-based models, ARBERT and MARBERT. To evaluate our models, we also introduce ARLUE, a new benchmark for multi-dialectal Arabic language understanding evaluation. ARLUE is built using 42 datasets targeting six different task clusters, allowing us to offer a series of standardized experiments under rich conditions. When fine-tuned on ARLUE, our models collectively achieve new state-of-the-art results across the majority of tasks (37 out of 48 classification tasks, on the 42 datasets). Our best model acquires the highest ARLUE score (77.40) across all six task clusters, outperforming all other models including XLM-R Large (3.4x larger size). Our models are publicly available at and ARLUE will be released through the same repository.

pdf bib
Improving Paraphrase Detection with the Adversarial Paraphrasing Task
Animesh Nighojkar | John Licato

If two sentences have the same meaning, it should follow that they are equivalent in their inferential properties, i.e., each sentence should textually entail the other. However, many paraphrase datasets currently in widespread use rely on a sense of paraphrase based on word overlap and syntax. Can we teach them instead to identify paraphrases in a way that draws on the inferential properties of the sentences, and is not over-reliant on lexical and syntactic similarities of a sentence pair? We apply the adversarial paradigm to this question, and introduce a new adversarial method of dataset creation for paraphrase identification : the Adversarial Paraphrasing Task (APT), which asks participants to generate semantically equivalent (in the sense of mutually implicative) but lexically and syntactically disparate paraphrases. These sentence pairs can then be used both to test paraphrase identification models (which get barely random accuracy) and then improve their performance. To accelerate dataset generation, we explore automation of APT using T5, and show that the resulting dataset also improves accuracy. We discuss implications for paraphrase detection and release our dataset in the hope of making paraphrase detection models better able to detect sentence-level meaning equivalence.

pdf bib
ADEPT : An Adjective-Dependent Plausibility TaskADEPT: An Adjective-Dependent Plausibility Task
Ali Emami | Ian Porada | Alexandra Olteanu | Kaheer Suleman | Adam Trischler | Jackie Chi Kit Cheung

A false contract is more likely to be rejected than a contract is, yet a false key is less likely than a key to open doors. While correctly interpreting and assessing the effects of such adjective-noun pairs (e.g., false key) on the plausibility of given events (e.g., opening doors) underpins many natural language understanding tasks, doing so often requires a significant degree of world knowledge and common-sense reasoning. We introduce ADEPT a large-scale semantic plausibility task consisting of over 16 thousand sentences that are paired with slightly modified versions obtained by adding an adjective to a noun. Overall, we find that while the task appears easier for human judges (85 % accuracy), it proves more difficult for transformer-based models like RoBERTa (71 % accuracy). Our experiments also show that neither the adjective itself nor its taxonomic class suffice in determining the correct plausibility judgement, emphasizing the importance of endowing automatic natural language understanding systems with more context sensitivity and common-sense reasoning.

pdf bib
Conditional Generation of Temporally-ordered Event Sequences
Shih-Ting Lin | Nathanael Chambers | Greg Durrett

Models of narrative schema knowledge have proven useful for a range of event-related tasks, but they typically do not capture the temporal relationships between events. We propose a single model that addresses both temporal ordering, sorting given events into the order they occurred, and event infilling, predicting new events which fit into an existing temporally-ordered sequence. We use a BART-based conditional generation model that can capture both temporality and common event co-occurrence, meaning it can be flexibly applied to different tasks in this space. Our model is trained as a denoising autoencoder : we take temporally-ordered event sequences, shuffle them, delete some events, and then attempt to recover the original event sequence. This task teaches the model to make inferences given incomplete knowledge about the events in an underlying scenario. On the temporal ordering task, we show that our model is able to unscramble event sequences from existing datasets without access to explicitly labeled temporal training data, outperforming both a BERT-based pairwise model and a BERT-based pointer network. On event infilling, human evaluation shows that our model is able to generate events that fit better temporally into the input events when compared to GPT-2 story completion models.

pdf bib
SpanNER : Named Entity Re-/Recognition as Span PredictionSpanNER: Named Entity Re-/Recognition as Span Prediction
Jinlan Fu | Xuanjing Huang | Pengfei Liu

Recent years have seen the paradigm shift of Named Entity Recognition (NER) systems from sequence labeling to span prediction. Despite its preliminary effectiveness, the span prediction model’s architectural bias has not been fully understood. In this paper, we first investigate the strengths and weaknesses when the span prediction model is used for named entity recognition compared with the sequence labeling framework and how to further improve it, which motivates us to make complementary advantages of systems based on different paradigms. We then reveal that span prediction, simultaneously, can serve as a system combiner to re-recognize named entities from different systems’ outputs. We experimentally implement 154 systems on 11 datasets, covering three languages, comprehensive results show the effectiveness of span prediction models that both serve as base NER systems and system combiners. We make all codes and datasets available :, as well as an online system demo :. Our model also has been deployed into the ExplainaBoard platform, which allows users to flexibly perform a system combination of top-scoring systems in an interactive way :., as well as an online system demo: Our model also has been deployed into the ExplainaBoard platform, which allows users to flexibly perform a system combination of top-scoring systems in an interactive way:

pdf bib
Beyond Noise : Mitigating the Impact of Fine-grained Semantic Divergences on Neural Machine Translation
Eleftheria Briakou | Marine Carpuat

While it has been shown that Neural Machine Translation (NMT) is highly sensitive to noisy parallel training samples, prior work treats all types of mismatches between source and target as noise. As a result, it remains unclear how samples that are mostly equivalent but contain a small number of semantically divergent tokens impact NMT training. To close this gap, we analyze the impact of different types of fine-grained semantic divergences on Transformer models. We show that models trained on synthetic divergences output degenerated text more frequently and are less confident in their predictions. Based on these findings, we introduce a divergent-aware NMT framework that uses factors to help NMT recover from the degradation caused by naturally occurring divergences, improving both translation quality and model calibration on EN-FR tasks.

pdf bib
Scientific Credibility of Machine Translation Research : A Meta-Evaluation of 769 Papers
Benjamin Marie | Atsushi Fujita | Raphael Rubino

This paper presents the first large-scale meta-evaluation of machine translation (MT). We annotated MT evaluations conducted in 769 research papers published from 2010 to 2020. Our study shows that practices for automatic MT evaluation have dramatically changed during the past decade and follow concerning trends. An increasing number of MT evaluations exclusively rely on differences between BLEU scores to draw conclusions, without performing any kind of statistical significance testing nor human evaluation, while at least 108 metrics claiming to be better than BLEU have been proposed. MT evaluations in recent papers tend to copy and compare automatic metric scores from previous work to claim the superiority of a method or an algorithm without confirming neither exactly the same training, validating, and testing data have been used nor the metric scores are comparable. Furthermore, tools for reporting standardized metric scores are still far from being widely adopted by the MT community. After showing how the accumulation of these pitfalls leads to dubious evaluation, we propose a guideline to encourage better automatic MT evaluation along with a simple meta-evaluation scoring method to assess its credibility.

pdf bib
Neural Machine Translation with Monolingual Translation Memory
Deng Cai | Yan Wang | Huayang Li | Wai Lam | Lemao Liu

Prior work has proved that Translation Memory (TM) can boost the performance of Neural Machine Translation (NMT). In contrast to existing work that uses bilingual corpus as TM and employs source-side similarity search for memory retrieval, we propose a new framework that uses monolingual memory and performs learnable memory retrieval in a cross-lingual manner. Our framework has unique advantages. First, the cross-lingual memory retriever allows abundant monolingual data to be TM. Second, the memory retriever and NMT model can be jointly optimized for the ultimate translation goal. Experiments show that the proposed method obtains substantial improvements. Remarkably, it even outperforms strong TM-augmented NMT baselines using bilingual TM. Owning to the ability to leverage monolingual data, our model also demonstrates effectiveness in low-resource and domain adaptation scenarios.

pdf bib
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
Armen Aghajanyan | Sonal Gupta | Luke Zettlemoyer

Although pretrained language models can be fine-tuned to produce state-of-the-art results for a very wide range of language understanding tasks, the dynamics of this process are not well understood, especially in the low data regime. Why can we use relatively vanilla gradient descent algorithms (e.g., without strong regularization) to tune a model with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples? In this paper, we argue that analyzing fine-tuning through the lens of intrinsic dimension provides us with empirical and theoretical intuitions to explain this remarkable phenomenon. We empirically show that common pre-trained models have a very low intrinsic dimension ; in other words, there exists a low dimension reparameterization that is as effective for fine-tuning as the full parameter space. For example, by optimizing only 200 trainable parameters randomly projected back into the full space, we can tune a RoBERTa model to achieve 90 % of the full parameter performance levels on MRPC. Furthermore, we empirically show that pre-training implicitly minimizes intrinsic dimension and, perhaps surprisingly, larger models tend to have lower intrinsic dimension after a fixed number of pre-training updates, at least in part explaining their extreme effectiveness. Lastly, we connect intrinsic dimensionality with low dimensional task representations and compression based generalization bounds to provide intrinsic-dimension-based generalization bounds that are independent of the full parameter count.

pdf bib
UnNatural Language InferenceUnNatural Language Inference
Koustuv Sinha | Prasanna Parthasarathi | Joelle Pineau | Adina Williams

Recent investigations into the inner-workings of state-of-the-art large-scale pre-trained Transformer-based Natural Language Understanding (NLU) models indicate that they appear to understand human-like syntax, at least to some extent. We provide novel evidence that complicates this claim : we find that state-of-the-art Natural Language Inference (NLI) models assign the same labels to permuted examples as they do to the original, i.e. they are invariant to random word-order permutations. This behavior notably differs from that of humans ; we struggle to understand the meaning of ungrammatical sentences. To measure the severity of this issue, we propose a suite of metrics and investigate which properties of particular permutations lead models to be word order invariant. For example, in MNLI dataset we find almost all (98.7 %) examples contain at least one permutation which elicits the gold label. Models are even able to assign gold labels to permutations that they originally failed to predict correctly. We provide a comprehensive empirical evaluation of this phenomenon, and further show that this issue exists in pre-Transformer RNN / ConvNet based encoders, as well as across multiple languages (English and Chinese). Our code and data are available at

pdf bib
Vocabulary Learning via Optimal Transport for Neural Machine Translation
Jingjing Xu | Hao Zhou | Chun Gan | Zaixiang Zheng | Lei Li

The choice of token vocabulary affects the performance of machine translation. This paper aims to figure out what is a good vocabulary and whether we can find the optimal vocabulary without trial training. To answer these questions, we first provide an alternative understanding of vocabulary from the perspective of information theory. It motivates us to formulate the quest of vocabularization finding the best token dictionary with a proper size as an optimal transport (OT) problem. We propose VOLT, a simple and efficient solution without trial training. Empirical results show that VOLT beats widely-used vocabularies in diverse scenarios, including WMT-14 English-German translation, TED bilingual translation, and TED multilingual translation. For example, VOLT achieves 70 % vocabulary size reduction and 0.5 BLEU gain on English-German translation. Also, compared to BPE-search, VOLT reduces the search time from 384 GPU hours to 30 GPU hours on English-German translation. Codes are available at


pdf (full)
bib (full)
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

pdf bib
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)
Chengqing Zong | Fei Xia | Wenjie Li | Roberto Navigli

pdf bib
Catchphrase : Automatic Detection of Cultural References
Nir Sweed | Dafna Shahaf

A snowclone is a customizable phrasal template that can be realized in multiple, instantly recognized variants. For example, * is the new * (Orange is the new black, 40 is the new 30). Snowclones are extensively used in social media. In this paper, we study snowclones originating from pop-culture quotes ; our goal is to automatically detect cultural references in text. We introduce a new, publicly available data set of pop-culture quotes and their corresponding snowclone usages and train models on them. We publish code for Catchphrase, an internet browser plugin to automatically detect and mark references in real-time, and examine its performance via a user study. Aside from assisting people to better comprehend cultural references, we hope that detecting snowclones can complement work on paraphrasing and help tackling long-standing questions in social science about the dynamics of information propagation.

pdf bib
Difficulty-Aware Machine Translation Evaluation
Runzhe Zhan | Xuebo Liu | Derek F. Wong | Lidia S. Chao

The high-quality translation results produced by machine translation (MT) systems still pose a huge challenge for automatic evaluation. Current MT evaluation pays the same attention to each sentence component, while the questions of real-world examinations (e.g., university examinations) have different difficulties and weightings. In this paper, we propose a novel difficulty-aware MT evaluation metric, expanding the evaluation dimension by taking translation difficulty into consideration. A translation that fails to be predicted by most MT systems will be treated as a difficult one and assigned a large weight in the final score function, and conversely. Experimental results on the WMT19 English-German Metrics shared tasks show that our proposed method outperforms commonly used MT metrics in terms of human correlation. In particular, our proposed method performs well even when all the MT systems are very competitive, which is when most existing metrics fail to distinguish between them. The source code is freely available at

pdf bib
Are VQA Systems RAD? Measuring Robustness to Augmented Data with Focused InterventionsVQA Systems RAD? Measuring Robustness to Augmented Data with Focused Interventions
Daniel Rosenberg | Itai Gat | Amir Feder | Roi Reichart

Deep learning algorithms have shown promising results in visual question answering (VQA) tasks, but a more careful look reveals that they often do not understand the rich signal they are being fed with. To understand and better measure the generalization capabilities of VQA systems, we look at their robustness to counterfactually augmented data. Our proposed augmentations are designed to make a focused intervention on a specific property of the question such that the answer changes. Using these augmentations, we propose a new robustness measure, Robustness to Augmented Data (RAD), which measures the consistency of model predictions between original and augmented examples. Through extensive experimentation, we show that RAD, unlike classical accuracy measures, can quantify when state-of-the-art systems are not robust to counterfactuals. We find substantial failure cases which reveal that current VQA systems are still brittle. Finally, we connect between robustness and generalization, demonstrating the predictive power of RAD for performance on unseen augmentations.

pdf bib
N-Best ASR Transformer : Enhancing SLU Performance using Multiple ASR HypothesesASR Transformer: Enhancing SLU Performance using Multiple ASR Hypotheses
Karthik Ganesan | Pakhi Bamdev | Jaivarsan B | Amresh Venugopal | Abhinav Tushar

Spoken Language Understanding (SLU) systems parse speech into semantic structures like dialog acts and slots. This involves the use of an Automatic Speech Recognizer (ASR) to transcribe speech into multiple text alternatives (hypotheses). Transcription errors, ordinary in ASRs, impact downstream SLU performance negatively. Common approaches to mitigate such errors involve using richer information from the ASR, either in form of N-best hypotheses or word-lattices. We hypothesize that transformer models will learn better with a simpler utterance representation using the concatenation of the N-best ASR alternatives, where each alternative is separated by a special delimiter [ SEP ]. In our work, we test our hypothesis by using the concatenated N-best ASR alternatives as the input to the transformer encoder models, namely BERT and XLM-RoBERTa, and achieve equivalent performance to the prior state-of-the-art model on DSTC2 dataset. We also show that our approach significantly outperforms the prior state-of-the-art when subjected to the low data regime. Additionally, this methodology is accessible to users of third-party ASR APIs which do not provide word-lattice information.

pdf bib
AND does not mean OR : Using Formal Languages to Study Language Models’ RepresentationsAND does not mean OR: Using Formal Languages to Study Language Models’ Representations
Aaron Traylor | Roman Feiman | Ellie Pavlick

A current open question in natural language processing is to what extent language models, which are trained with access only to the form of language, are able to capture the meaning of language. This question is challenging to answer in general, as there is no clear line between meaning and form, but rather meaning constrains form in consistent ways. The goal of this study is to offer insights into a narrower but critical subquestion : Under what conditions should we expect that meaning and form covary sufficiently, such that a language model with access only to form might nonetheless succeed in emulating meaning? Focusing on several formal languages (propositional logic and a set of programming languages), we generate training corpora using a variety of motivated constraints, and measure a distributional language model’s ability to differentiate logical symbols (AND, OR, and NOT). Our findings are largely negative : none of our simulated training corpora result in models which definitively differentiate meaningfully different symbols (e.g., AND vs. OR), suggesting a limitation to the types of semantic signals that current models are able to exploit.

pdf bib
Continual Quality Estimation with Online Bayesian Meta-LearningBayesian Meta-Learning
Abiola Obamuyide | Marina Fomicheva | Lucia Specia

Most current quality estimation (QE) models for machine translation are trained and evaluated in a static setting where training and test data are assumed to be from a fixed distribution. However, in real-life settings, the test data that a deployed QE model would be exposed to may differ from its training data. In particular, training samples are often labelled by one or a small set of annotators, whose perceptions of translation quality and needs may differ substantially from those of end-users, who will employ predictions in practice. To address this challenge, we propose an online Bayesian meta-learning framework for the continuous training of QE models that is able to adapt them to the needs of different users, while being robust to distributional shifts in training and test data. Experiments on data with varying number of users and language characteristics validate the effectiveness of the proposed approach.

pdf bib
Multilingual Agreement for Multilingual Neural Machine Translation
Jian Yang | Yuwei Yin | Shuming Ma | Haoyang Huang | Dongdong Zhang | Zhoujun Li | Furu Wei

Although multilingual neural machine translation (MNMT) enables multiple language translations, the training process is based on independent multilingual objectives. Most multilingual models can not explicitly exploit different language pairs to assist each other, ignoring the relationships among them. In this work, we propose a novel agreement-based method to encourage multilingual agreement among different translation directions, which minimizes the differences among them. We combine the multilingual training objectives with the agreement term by randomly substituting some fragments of the source language with their counterpart translations of auxiliary languages. To examine the effectiveness of our method, we conduct experiments on the multilingual translation task of 10 language pairs. Experimental results show that our method achieves significant improvements over the previous multilingual baselines.

pdf bib
Higher-order Derivatives of Weighted Finite-state Machines
Ran Zmigrod | Tim Vieira | Ryan Cotterell

Weighted finite-state machines are a fundamental building block of NLP systems. They have withstood the test of timefrom their early use in noisy channel models in the 1990s up to modern-day neurally parameterized conditional random fields. This work examines the computation of higher-order derivatives with respect to the normalization constant for weighted finite-state machines. We provide a general algorithm for evaluating derivatives of all orders, which has not been previously described in the literature. In the case of second-order derivatives, our scheme runs in the optimal O(A2 N4) time where A is the alphabet size and N is the number of states. Our algorithm is significantly faster than prior algorithms. Additionally, our approach leads to a significantly faster algorithm for computing second-order expectations, such as covariance matrices and gradients of first-order expectations.

pdf bib
Reinforcement Learning for Abstractive Question Summarization with Question-aware Semantic Rewards
Shweta Yadav | Deepak Gupta | Asma Ben Abacha | Dina Demner-Fushman

The growth of online consumer health questions has led to the necessity for reliable and accurate question answering systems. A recent study showed that manual summarization of consumer health questions brings significant improvement in retrieving relevant answers. However, the automatic summarization of long questions is a challenging task due to the lack of training data and the complexity of the related subtasks, such as the question focus and type recognition. In this paper, we introduce a reinforcement learning-based framework for abstractive question summarization. We propose two novel rewards obtained from the downstream tasks of (i) question-type identification and (ii) question-focus recognition to regularize the question generation model. These rewards ensure the generation of semantically valid questions and encourage the inclusion of key medical entities / foci in the question summary. We evaluated our proposed method on two benchmark datasets and achieved higher performance over state-of-the-art models. The manual evaluation of the summaries reveals that the generated questions are more diverse and have fewer factual inconsistencies than the baseline summaries. The source code is available here :

pdf bib
A Semantics-aware Transformer Model of Relation Linking for Knowledge Base Question Answering
Tahira Naseem | Srinivas Ravishankar | Nandana Mihindukulasooriya | Ibrahim Abdelaziz | Young-Suk Lee | Pavan Kapanipathi | Salim Roukos | Alfio Gliozzo | Alexander Gray

Relation linking is a crucial component of Knowledge Base Question Answering systems. Existing systems use a wide variety of heuristics, or ensembles of multiple systems, heavily relying on the surface question text. However, the explicit semantic parse of the question is a rich source of relation information that is not taken advantage of. We propose a simple transformer-based neural model for relation linking that leverages the AMR semantic parse of a sentence. Our system significantly outperforms the state-of-the-art on 4 popular benchmark datasets. These are based on either DBpedia or Wikidata, demonstrating that our approach is effective across KGs.

pdf bib
Neural Retrieval for Question Answering with Cross-Attention Supervised Data Augmentation
Yinfei Yang | Ning Jin | Kuo Lin | Mandy Guo | Daniel Cer

Early fusion models with cross-attention have shown better-than-human performance on some question answer benchmarks, while it is a poor fit for retrieval since it prevents pre-computation of the answer representations. We present a supervised data mining method using an accurate early fusion model to improve the training of an efficient late fusion retrieval model. We first train an accurate classification model with cross-attention between questions and answers. The cross-attention model is then used to annotate additional passages in order to generate weighted training examples for a neural retrieval model. The resulting retrieval model with additional data significantly outperforms retrieval models directly trained with gold annotations on Precision at N (P@N) and Mean Reciprocal Rank (MRR).

pdf bib
Enhancing Descriptive Image Captioning with Natural Language Inference
Zhan Shi | Hui Liu | Xiaodan Zhu

Generating descriptive sentences that convey non-trivial, detailed, and salient information about images is an important goal of image captioning. In this paper we propose a novel approach to encourage captioning models to produce more detailed captions using natural language inference, based on the motivation that, among different captions of an image, descriptive captions are more likely to entail less descriptive captions. Specifically, we construct directed inference graphs for reference captions based on natural language inference. A PageRank algorithm is then employed to estimate the descriptiveness score of each node. Built on that, we use reference sampling and weighted designated rewards to guide captioning to generate descriptive captions. The results on MSCOCO show that the proposed method outperforms the baselines significantly on a wide range of conventional and descriptiveness-related evaluation metrics.descriptive sentences that convey non-trivial, detailed, and salient information about images is an important goal of image captioning. In this paper we propose a novel approach to encourage captioning models to produce more detailed captions using natural language inference, based on the motivation that, among different captions of an image, descriptive captions are more likely to entail less descriptive captions. Specifically, we construct directed inference graphs for reference captions based on natural language inference. A PageRank algorithm is then employed to estimate the descriptiveness score of each node. Built on that, we use reference sampling and weighted designated rewards to guide captioning to generate descriptive captions. The results on MSCOCO show that the proposed method outperforms the baselines significantly on a wide range of conventional and descriptiveness-related evaluation metrics.

pdf bib
On Positivity Bias in Negative Reviews
Madhusudhan Aithal | Chenhao Tan

Prior work has revealed that positive words occur more frequently than negative words in human expressions, which is typically attributed to positivity bias, a tendency for people to report positive views of reality. But what about the language used in negative reviews? Consistent with prior work, we show that English negative reviews tend to contain more positive words than negative words, using a variety of datasets. We reconcile this observation with prior findings on the pragmatics of negation, and show that negations are commonly associated with positive words in negative reviews. Furthermore, in negative reviews, the majority of sentences with positive words express negative opinions based on sentiment classifiers, indicating some form of negation.

pdf bib
PRAL : A Tailored Pre-Training Model for Task-Oriented Dialog GenerationPRAL: A Tailored Pre-Training Model for Task-Oriented Dialog Generation
Jing Gu | Qingyang Wu | Chongruo Wu | Weiyan Shi | Zhou Yu

Large pre-trained language generation models such as GPT-2 have demonstrated their effectiveness as language priors by reaching state-of-the-art results in various language generation tasks. However, the performance of pre-trained models on task-oriented dialog tasks is still under-explored. We propose a Pre-trainedRole Alternating Language model (PRAL), explicitly designed for task-oriented conversational systems. We design several techniques : start position randomization, knowledge distillation, and history discount to improve pre-training performance. In addition, we introduce a high-quality large-scale task-oriented dialog pre-training dataset by post-prossessing13 dialog datasets. We effectively adapt PRALon three downstream tasks. The results show that PRAL outperforms or is on par with state-of-the-art models.

pdf bib
ROPE : Reading Order Equivariant Positional Encoding for Graph-based Document Information ExtractionROPE: Reading Order Equivariant Positional Encoding for Graph-based Document Information Extraction
Chen-Yu Lee | Chun-Liang Li | Chu Wang | Renshen Wang | Yasuhisa Fujii | Siyang Qin | Ashok Popat | Tomas Pfister

Natural reading orders of words are crucial for information extraction from form-like documents. Despite recent advances in Graph Convolutional Networks (GCNs) on modeling spatial layout patterns of documents, they have limited ability to capture reading orders of given word-level node representations in a graph. We propose Reading Order Equivariant Positional Encoding (ROPE), a new positional encoding technique designed to apprehend the sequential presentation of words in documents. ROPE generates unique reading order codes for neighboring words relative to the target word given a word-level graph connectivity. We study two fundamental document entity extraction tasks including word labeling and word grouping on the public FUNSD dataset and a large-scale payment dataset. We show that ROPE consistently improves existing GCNs with a margin up to 8.4 % F1-score.

pdf bib
Quantifying and Avoiding Unfair Qualification Labour in Crowdsourcing
Jonathan K. Kummerfeld

Extensive work has argued in favour of paying crowd workers a wage that is at least equivalent to the U.S. federal minimum wage. Meanwhile, research on collecting high quality annotations suggests using a qualification that requires workers to have previously completed a certain number of tasks. If most requesters who pay fairly require workers to have completed a large number of tasks already then workers need to complete a substantial amount of poorly paid work before they can earn a fair wage. Through analysis of worker discussions and guidance for researchers, we estimate that workers spend approximately 2.25 months of full time effort on poorly paid tasks in order to get the qualifications needed for better paid tasks. We discuss alternatives to this qualification and conduct a study of the correlation between qualifications and work quality on two NLP tasks. We find that it is possible to reduce the burden on workers while still collecting high quality data.

pdf bib
Men Are Elected, Women Are Married : Events Gender Bias on WikipediaWikipedia
Jiao Sun | Nanyun Peng

Human activities can be seen as sequences of events, which are crucial to understanding societies. Disproportional event distribution for different demographic groups can manifest and amplify social stereotypes, and potentially jeopardize the ability of members in some groups to pursue certain goals. In this paper, we present the first event-centric study of gender biases in a Wikipedia corpus. To facilitate the study, we curate a corpus of career and personal life descriptions with demographic information consisting of 7,854 fragments from 10,412 celebrities. Then we detect events with a state-of-the-art event detection model, calibrate the results using strategically generated templates, and extract events that have asymmetric associations with genders. Our study discovers that the Wikipedia pages tend to intermingle personal life events with professional events for females but not for males, which calls for the awareness of the Wikipedia community to formalize guidelines and train the editors to mind the implicit biases that contributors carry. Our work also lays the foundation for future works on quantifying and discovering event biases at the corpus level.

pdf bib
Modeling Task-Aware MIMO Cardinality for Efficient Multilingual Neural Machine TranslationMIMO Cardinality for Efficient Multilingual Neural Machine Translation
Hongfei Xu | Qiuhui Liu | Josef van Genabith | Deyi Xiong

Neural machine translation has achieved great success in bilingual settings, as well as in multilingual settings. With the increase of the number of languages, multilingual systems tend to underperform their bilingual counterparts. Model capacity has been found crucial for massively multilingual NMT to support language pairs with varying typological characteristics. Previous work increases the modeling capacity by deepening or widening the Transformer. However, modeling cardinality based on aggregating a set of transformations with the same topology has been proven more effective than going deeper or wider when increasing capacity. In this paper, we propose to efficiently increase the capacity for multilingual NMT by increasing the cardinality. Unlike previous work which feeds the same input to several transformations and merges their outputs into one, we present a Multi-Input-Multi-Output (MIMO) architecture that allows each transformation of the block to have its own input. We also present a task-aware attention mechanism to learn to selectively utilize individual transformations from a set of transformations for different translation directions. Our model surpasses previous work and establishes a new state-of-the-art on the large scale OPUS-100 corpus while being 1.31 times as fast.

pdf bib
Exploration and Exploitation : Two Ways to Improve Chinese Spelling Correction ModelsChinese Spelling Correction Models
Chong Li | Cenyuan Zhang | Xiaoqing Zheng | Xuanjing Huang

A sequence-to-sequence learning with neural networks has empirically proven to be an effective framework for Chinese Spelling Correction (CSC), which takes a sentence with some spelling errors as input and outputs the corrected one. However, CSC models may fail to correct spelling errors covered by the confusion sets, and also will encounter unseen ones. We propose a method, which continually identifies the weak spots of a model to generate more valuable training instances, and apply a task-specific pre-training strategy to enhance the model. The generated adversarial examples are gradually added to the training set. Experimental results show that such an adversarial training method combined with the pre-training strategy can improve both the generalization and robustness of multiple CSC models across three different datasets, achieving state-of-the-art performance for CSC task.

pdf bib
An Empirical Study on Adversarial Attack on NMT : Languages and Positions MatterNMT: Languages and Positions Matter
Zhiyuan Zeng | Deyi Xiong

In this paper, we empirically investigate adversarial attack on NMT from two aspects : languages (the source vs. the target language) and positions (front vs. rear). For autoregressive NMT models that generate target words from left to right, we observe that adversarial attack on the source language is more effective than on the target language, and that attacking front positions of target sentences or positions of source sentences aligned to the front positions of corresponding target sentences is more effective than attacking other positions. We further exploit the attention distribution of the victim model to attack source sentences at positions that have a strong association with front target words. Experiment results demonstrate that our attention-based adversarial attack is more effective than adversarial attacks by sampling positions randomly or according to gradients.

pdf bib
OntoGUM : Evaluating Contextualized SOTA Coreference Resolution on 12 More GenresOntoGUM: Evaluating Contextualized SOTA Coreference Resolution on 12 More Genres
Yilun Zhu | Sameer Pradhan | Amir Zeldes

SOTA coreference resolution produces increasingly impressive scores on the OntoNotes benchmark. However lack of comparable data following the same scheme for more genres makes it difficult to evaluate generalizability to open domain data. This paper provides a dataset and comprehensive evaluation showing that the latest neural LM based end-to-end systems degrade very substantially out of domain. We make an OntoNotes-like coreference dataset called OntoGUM publicly available, converted from GUM, an English corpus covering 12 genres, using deterministic rules, which we evaluate. Thanks to the rich syntactic and discourse annotations in GUM, we are able to create the largest human-annotated coreference corpus following the OntoNotes guidelines, and the first to be evaluated for consistency with the OntoNotes scheme. Out-of-domain evaluation across 12 genres shows nearly 15-20 % degradation for both deterministic and deep learning systems, indicating a lack of generalizability or covert overfitting in existing coreference resolution models.

pdf bib
Zero-shot Fact Verification by Claim Generation
Liangming Pan | Wenhu Chen | Wenhan Xiong | Min-Yen Kan | William Yang Wang

Neural models for automated fact verification have achieved promising results thanks to the availability of large, human-annotated datasets. However, for each new domain that requires fact verification, creating a dataset by manually writing claims and linking them to their supporting evidence is expensive. We develop QACG, a framework for training a robust fact verification model by using automatically generated claims that can be supported, refuted, or unverifiable from evidence from Wikipedia. QACG generates question-answer pairs from the evidence and then converts them into different types of claims. Experiments on the FEVER dataset show that our QACG framework significantly reduces the demand for human-annotated training data. In a zero-shot scenario, QACG improves a RoBERTa model’s F1 from 50 % to 77 %, equivalent in performance to 2K+ manually-curated examples. Our QACG code is publicly available.

pdf bib
Thank you BART ! Rewarding Pre-Trained Models Improves Formality Style TransferBART! Rewarding Pre-Trained Models Improves Formality Style Transfer
Huiyuan Lai | Antonio Toral | Malvina Nissim

Scarcity of parallel data causes formality style transfer models to have scarce success in preserving content. We show that fine-tuning pre-trained language (GPT-2) and sequence-to-sequence (BART) models boosts content preservation, and that this is possible even with limited amounts of parallel data. Augmenting these models with rewards that target style and content the two core aspects of the task we achieve a new state-of-the-art.

pdf bib
Deep Context- and Relation-Aware Learning for Aspect-based Sentiment Analysis
Shinhyeok Oh | Dongyub Lee | Taesun Whang | IlNam Park | Seo Gaeun | EungGyun Kim | Harksoo Kim

Existing works for aspect-based sentiment analysis (ABSA) have adopted a unified approach, which allows the interactive relations among subtasks. However, we observe that these methods tend to predict polarities based on the literal meaning of aspect and opinion terms and mainly consider relations implicitly among subtasks at the word level. In addition, identifying multiple aspectopinion pairs with their polarities is much more challenging. Therefore, a comprehensive understanding of contextual information w.r.t. the aspect and opinion are further required in ABSA. In this paper, we propose Deep Contextualized Relation-Aware Network (DCRAN), which allows interactive relations among subtasks with deep contextual information based on two modules (i.e., Aspect and Opinion Propagation and Explicit Self-Supervised Strategies). Especially, we design novel self-supervised strategies for ABSA, which have strengths in dealing with multiple aspects. Experimental results show that DCRAN significantly outperforms previous state-of-the-art methods by large margins on three widely used benchmarks.

pdf bib
TIMERS : Document-level Temporal Relation ExtractionTIMERS: Document-level Temporal Relation Extraction
Puneet Mathur | Rajiv Jain | Franck Dernoncourt | Vlad Morariu | Quan Hung Tran | Dinesh Manocha

We present TIMERS-a TIME, Rhetorical and Syntactic-aware model for document-level temporal relation classification in the English language. Our proposed method leverages rhetorical discourse features and temporal arguments from semantic role labels, in addition to traditional local syntactic features, trained through a Gated Relational-GCN. Extensive experiments show that the proposed model outperforms previous methods by 5-18 % on the TDDiscourse, TimeBank-Dense, and MATRES datasets due to our discourse-level modeling.

pdf bib
More than Text : Multi-modal Chinese Word SegmentationChinese Word Segmentation
Dong Zhang | Zheng Hu | Shoushan Li | Hanqian Wu | Qiaoming Zhu | Guodong Zhou

Chinese word segmentation (CWS) is undoubtedly an important basic task in natural language processing. Previous works only focus on the textual modality, but there are often audio and video utterances (such as news broadcast and face-to-face dialogues), where textual, acoustic and visual modalities normally exist. To this end, we attempt to combine the multi-modality (mainly the converted text and actual voice information) to perform CWS. In this paper, we annotate a new dataset for CWS containing text and audio. Moreover, we propose a time-dependent multi-modal interactive model based on Transformer framework to integrate multi-modal information for word sequence labeling. The experimental results on three different training sets show the effectiveness of our approach with fusing text and audio.

pdf bib
A Mixture-of-Experts Model for Antonym-Synonym Discrimination
Zhipeng Xie | Nan Zeng

Discrimination between antonyms and synonyms is an important and challenging NLP task. Antonyms and synonyms often share the same or similar contexts and thus are hard to make a distinction. This paper proposes two underlying hypotheses and employs the mixture-of-experts framework as a solution. It works on the basis of a divide-and-conquer strategy, where a number of localized experts focus on their own domains (or subspaces) to learn their specialties, and a gating mechanism determines the space partitioning and the expert mixture. Experimental results have shown that our method achieves the state-of-the-art performance on the task.

pdf bib
Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking
Fangyu Liu | Ivan Vulić | Anna Korhonen | Nigel Collier

Injecting external domain-specific knowledge (e.g., UMLS) into pretrained language models (LMs) advances their capability to handle specialised in-domain tasks such as biomedical entity linking (BEL). However, such abundant expert knowledge is available only for a handful of languages (e.g., English). In this work, by proposing a novel cross-lingual biomedical entity linking task (XL-BEL) and establishing a new XL-BEL benchmark spanning 10 typologically diverse languages, we first investigate the ability of standard knowledge-agnostic as well as knowledge-enhanced monolingual and multilingual LMs beyond the standard monolingual English BEL task. The scores indicate large gaps to English performance. We then address the challenge of transferring domain-specific knowledge in resource-rich languages to resource-poor ones. To this end, we propose and evaluate a series of cross-lingual transfer methods for the XL-BEL task, and demonstrate that general-domain bitext helps propagate the available English knowledge to languages with little to no in-domain data. Remarkably, we show that our proposed domain-specific transfer methods yield consistent gains across all target languages, sometimes up to 20 Precision@1 points, without any in-domain knowledge in the target language, and without any in-domain parallel data.

pdf bib
A Cluster-based Approach for Improving Isotropy in Contextual Embedding Space
Sara Rajaee | Mohammad Taher Pilehvar

The representation degeneration problem in Contextual Word Representations (CWRs) hurts the expressiveness of the embedding space by forming an anisotropic cone where even unrelated words have excessively positive correlations. Existing techniques for tackling this issue require a learning process to re-train models with additional objectives and mostly employ a global assessment to study isotropy. Our quantitative analysis over isotropy shows that a local assessment could be more accurate due to the clustered structure of CWRs. Based on this observation, we propose a local cluster-based method to address the degeneration issue in contextual embedding spaces. We show that in clusters including punctuations and stop words, local dominant directions encode structural information, removing which can improve CWRs performance on semantic tasks. Moreover, we find that tense information in verb representations dominates sense semantics. We show that removing dominant directions of verb representations can transform the space to better suit semantic applications. Our experiments demonstrate that the proposed cluster-based method can mitigate the degeneration problem on multiple tasks.

pdf bib
Do n’t Let Discourse Confine Your Model : Sequence Perturbations for Improved Event Language Models
Mahnaz Koupaee | Greg Durrett | Nathanael Chambers | Niranjan Balasubramanian

Event language models represent plausible sequences of events. Most existing approaches train autoregressive models on text, which successfully capture event co-occurrence but unfortunately constrain the model to follow the discourse order in which events are presented. Other domains may employ different discourse orders, and for many applications, we may care about different notions of ordering (e.g., temporal) or not care about ordering at all (e.g., when predicting related events in a schema). We propose a simple yet surprisingly effective strategy for improving event language models by perturbing event sequences so we can relax model dependence on text order. Despite generating completely synthetic event orderings, we show that this technique improves the performance of the event language models on both applications and out-of-domain events data.

pdf bib
The Curse of Dense Low-Dimensional Information Retrieval for Large Index Sizes
Nils Reimers | Iryna Gurevych

Information Retrieval using dense low-dimensional representations recently became popular and showed out-performance to traditional sparse-representations like BM25. However, no previous work investigated how dense representations perform with large index sizes. We show theoretically and empirically that the performance for dense representations decreases quicker than sparse representations for increasing index sizes. In extreme cases, this can even lead to a tipping point where at a certain index size sparse representations outperform dense representations. We show that this behavior is tightly connected to the number of dimensions of the representations : The lower the dimension, the higher the chance for false positives, i.e. returning irrelevant documents

pdf bib
Cross-lingual Text Classification with Heterogeneous Graph Neural Network
Ziyun Wang | Xuan Liu | Peiji Yang | Shixing Liu | Zhisheng Wang

Cross-lingual text classification aims at training a classifier on the source language and transferring the knowledge to target languages, which is very useful for low-resource languages. Recent multilingual pretrained language models (mPLM) achieve impressive results in cross-lingual classification tasks, but rarely consider factors beyond semantic similarity, causing performance degradation between some language pairs. In this paper we propose a simple yet effective method to incorporate heterogeneous information within and across languages for cross-lingual text classification using graph convolutional networks (GCN). In particular, we construct a heterogeneous graph by treating documents and words as nodes, and linking nodes with different relations, which include part-of-speech roles, semantic similarity, and document translations. Extensive experiments show that our graph-based method significantly outperforms state-of-the-art models on all tasks, and also achieves consistent performance gain over baselines in low-resource settings where external tools like translators are unavailable.

pdf bib
Embedding Time Differences in Context-sensitive Neural Networks for Learning Time to Event
Nazanin Dehghani | Hassan Hajipoor | Hadi Amiri

We propose an effective context-sensitive neural model for time to event (TTE) prediction task, which aims to predict the amount of time to / from the occurrence of given events in streaming content. We investigate this problem in the context of a multi-task learning framework, which we enrich with time difference embeddings. In addition, we develop a multi-genre dataset of English events about soccer competitions and academy awards ceremonies, and their relevant tweets obtained from Twitter. Our model is 1.4 and 3.3 hours more accurate than the current state-of-the-art model in estimating TTE on English and Dutch tweets respectively. We examine different aspects of our model to illustrate its source of improvement.

pdf bib
Domain-Adaptive Pretraining Methods for Dialogue Understanding
Han Wu | Kun Xu | Linfeng Song | Lifeng Jin | Haisong Zhang | Linqi Song

Language models like BERT and SpanBERT pretrained on open-domain data have obtained impressive gains on various NLP tasks. In this paper, we probe the effectiveness of domain-adaptive pretraining objectives on downstream tasks. In particular, three objectives, including a novel objective focusing on modeling predicate-argument relations, are evaluated on two challenging dialogue understanding tasks. Experimental results demonstrate that domain-adaptive pretraining with proper objectives can significantly improve the performance of a strong baseline on these tasks, achieving the new state-of-the-art performances.

pdf bib
Targeting the Benchmark : On Methodology in Current Natural Language Processing Research
David Schlangen

It has become a common pattern in our field : One group introduces a language task, exemplified by a dataset, which they argue is challenging enough to serve as a benchmark. They also provide a baseline model for it, which then soon is improved upon by other groups. Often, research efforts then move on, and the pattern repeats itself. What is typically left implicit is the argumentation for why this constitutes progress, and progress towards what. In this paper, we try to step back for a moment from this pattern and work out possible argumentations and their parts.

pdf bib
nmT5-Is parallel data still relevant for pre-training massively multilingual language models?T5 - Is parallel data still relevant for pre-training massively multilingual language models?
Mihir Kale | Aditya Siddhant | Rami Al-Rfou | Linting Xue | Noah Constant | Melvin Johnson

Recently, mT5-a massively multilingual version of T5-leveraged a unified text-to-text format to attain state-of-the-art results on a wide variety of multilingual NLP tasks. In this paper, we investigate the impact of incorporating parallel data into mT5 pre-training. We find that multi-tasking language modeling with objectives such as machine translation during pre-training is a straightforward way to improve performance on downstream multilingual and cross-lingual tasks. However, the gains start to diminish as the model capacity increases, suggesting that parallel data might not be as essential for larger models. At the same time, even at larger model sizes, we find that pre-training with parallel data still provides benefits in the limited labelled data regime

pdf bib
Explicitly Capturing Relations between Entity Mentions via Graph Neural Networks for Domain-specific Named Entity Recognition
Pei Chen | Haibo Ding | Jun Araki | Ruihong Huang

Named entity recognition (NER) is well studied for the general domain, and recent systems have achieved human-level performance for identifying common entity types. However, the NER performance is still moderate for specialized domains that tend to feature complicated contexts and jargonistic entity types. To address these challenges, we propose explicitly connecting entity mentions based on both global coreference relations and local dependency relations for building better entity mention representations. In our experiments, we incorporate entity mention relations by Graph Neural Networks and show that our system noticeably improves the NER performance on two datasets from different domains. We further show that the proposed lightweight system can effectively elevate the NER performance to a higher level even when only a tiny amount of labeled data is available, which is desirable for domain-specific NER.

pdf bib
Improving Lexically Constrained Neural Machine Translation with Source-Conditioned Masked Span Prediction
Gyubok Lee | Seongjun Yang | Edward Choi

Accurate terminology translation is crucial for ensuring the practicality and reliability of neural machine translation (NMT) systems. To address this, lexically constrained NMT explores various methods to ensure pre-specified words and phrases appear in the translation output. However, in many cases, those methods are studied on general domain corpora, where the terms are mostly uni- and bi-grams (98 %). In this paper, we instead tackle a more challenging setup consisting of domain-specific corpora with much longer n-gram and highly specialized terms. Inspired by the recent success of masked span prediction models, we propose a simple and effective training strategy that achieves consistent improvements on both terminology and sentence-level translation for three domain-specific corpora in two language pairs.

pdf bib
Quotation Recommendation and Interpretation Based on Transformation from Queries to Quotations
Lingzhi Wang | Xingshan Zeng | Kam-Fai Wong

To help individuals express themselves better, quotation recommendation is receiving growing attention. Nevertheless, most prior efforts focus on modeling quotations and queries separately and ignore the relationship between the quotations and the queries. In this work, we introduce a transformation matrix that directly maps the query representations to quotation representations. To better learn the mapping relationship, we employ a mapping loss that minimizes the distance of two semantic spaces (one for quotation and another for mapped-query). Furthermore, we explore using the words in history queries to interpret the figurative language of quotations, where quotation-aware attention is applied on top of history queries to highlight the indicator words. Experiments on two datasets in English and Chinese show that our model outperforms previous state-of-the-art models.

pdf bib
Neural-Symbolic Commonsense Reasoner with Relation Predictors
Farhad Moghimifar | Lizhen Qu | Terry Yue Zhuo | Gholamreza Haffari | Mahsa Baktashmotlagh

Commonsense reasoning aims to incorporate sets of commonsense facts, retrieved from Commonsense Knowledge Graphs (CKG), to draw conclusion about ordinary situations. The dynamic nature of commonsense knowledge postulates models capable of performing multi-hop reasoning over new situations. This feature also results in having large-scale sparse Knowledge Graphs, where such reasoning process is needed to predict relations between new events. However, existing approaches in this area are limited by considering CKGs as a limited set of facts, thus rendering them unfit for reasoning over new unseen situations and events. In this paper, we present a neural-symbolic reasoner, which is capable of reasoning over large-scale dynamic CKGs. The logic rules for reasoning over CKGs are learned during training by our model. In addition to providing interpretable explanation, the learned logic rules help to generalise prediction to newly introduced events. Experimental results on the task of link prediction on CKGs prove the effectiveness of our model by outperforming the state-of-the-art models.

pdf bib
Parameter Selection : Why We Should Pay More Attention to It
Jie-Jyun Liu | Tsung-Han Yang | Si-An Chen | Chih-Jen Lin

The importance of parameter selection in supervised learning is well known. However, due to the many parameter combinations, an incomplete or an insufficient procedure is often applied. This situation may cause misleading or confusing conclusions. In this opinion paper, through an intriguing example we point out that the seriousness goes beyond what is generally recognized. In the topic of multilabel classification for medical code prediction, one influential paper conducted a proper parameter selection on a set, but when moving to a subset of frequently occurring labels, the authors used the same parameters without a separate tuning. The set of frequent labels became a popular benchmark in subsequent studies, which kept pushing the state of the art. However, we discovered that most of the results in these studies can not surpass the approach in the original paper if a parameter tuning had been conducted at the time. Thus it is unclear how much progress the subsequent developments have actually brought. The lesson clearly indicates that without enough attention on parameter selection, the research progress in our field can be uncertain or even illusive.

pdf bib
Modeling Discriminative Representations for Out-of-Domain Detection with Supervised Contrastive Learning
Zhiyuan Zeng | Keqing He | Yuanmeng Yan | Zijun Liu | Yanan Wu | Hong Xu | Huixing Jiang | Weiran Xu

Detecting Out-of-Domain (OOD) or unknown intents from user queries is essential in a task-oriented dialog system. A key challenge of OOD detection is to learn discriminative semantic features. Traditional cross-entropy loss only focuses on whether a sample is correctly classified, and does not explicitly distinguish the margins between categories. In this paper, we propose a supervised contrastive learning objective to minimize intra-class variance by pulling together in-domain intents belonging to the same class and maximize inter-class variance by pushing apart samples from different classes. Besides, we employ an adversarial augmentation mechanism to obtain pseudo diverse views of a sample in the latent space. Experiments on two public datasets prove the effectiveness of our method capturing discriminative representations for OOD detection.

pdf bib
Exposing the limits of Zero-shot Cross-lingual Hate Speech Detection
Debora Nozza

Reducing and counter-acting hate speech on Social Media is a significant concern. Most of the proposed automatic methods are conducted exclusively on English and very few consistently labeled, non-English resources have been proposed. Learning to detect hate speech on English and transferring to unseen languages seems an immediate solution. This work is the first to shed light on the limits of this zero-shot, cross-lingual transfer learning framework for hate speech detection. We use benchmark data sets in English, Italian, and Spanish to detect hate speech towards immigrants and women. Investigating post-hoc explanations of the model, we discover that non-hateful, language-specific taboo interjections are misinterpreted as signals of hate speech. Our findings demonstrate that zero-shot, cross-lingual models can not be used as they are, but need to be carefully designed.

pdf bib
Addressing Semantic Drift in Generative Question Answering with Auxiliary Extraction
Chenliang Li | Bin Bi | Ming Yan | Wei Wang | Songfang Huang

Recently, question answering (QA) based on machine reading comprehension has become popular. This work focuses on generative QA which aims to generate an abstractive answer to a given question instead of extracting an answer span from a provided passage. Generative QA often suffers from two critical problems : (1) summarizing content irrelevant to a given question, (2) drifting away from a correct answer during generation. In this paper, we address these problems by a novel Rationale-Enriched Answer Generator (REAG), which incorporates an extractive mechanism into a generative model. Specifically, we add an extraction task on the encoder to obtain the rationale for an answer, which is the most relevant piece of text in an input document to a given question. Based on the extracted rationale and original input, the decoder is expected to generate an answer with high confidence. We jointly train REAG on the MS MARCO QA+NLG task and the experimental results show that REAG improves the quality and semantic accuracy of answers over baseline models.

pdf bib
Improving Model Generalization : A Chinese Named Entity Recognition Case StudyChinese Named Entity Recognition Case Study
Guanqing Liang | Cane Wing-Ki Leung

Generalization is an important ability that helps to ensure that a machine learning model can perform well on unseen data. In this paper, we study the effect of data bias on model generalization, using Chinese Named Entity Recognition (NER) as a case study. Specifically, we analyzed five benchmarking datasets for Chinese NER, and observed the following two types of data bias that can compromise model generalization ability. Firstly, the test sets of all the five datasets contain a significant proportion of entities that have been seen in the training sets. Such test data would therefore not be able to reflect the true generalization ability of a model. Secondly, all datasets are dominated by a few fat-head entities, i.e., entities appearing with particularly high frequency. As a result, a model might be able to produce high prediction accuracy simply by keyword memorization without leveraging context knowledge. To address these data biases, we first refine each test set by excluding seen entities from it, so as to better evaluate a model’s generalization ability. Then, we propose a simple yet effective entity resampling method to make entities within the same category distributed equally, encouraging a model to leverage both name and context knowledge in the training process. Experimental results demonstrate that the proposed entity resampling method significantly improves a model’s ability in detecting unseen entities, especially for company, organization and position categories.

pdf bib
Attentive Multiview Text Representation for Differential Diagnosis
Hadi Amiri | Mitra Mohtarami | Isaac Kohane

We present a text representation approach that can combine different views (representations) of the same input through effective data fusion and attention strategies for ranking purposes. We apply our model to the problem of differential diagnosis, which aims to find the most probable diseases that match with clinical descriptions of patients, using data from the Undiagnosed Diseases Network. Our model outperforms several ranking approaches (including a commercially-supported system) by effectively prioritizing and combining representations obtained from traditional and recent text representation techniques. We elaborate on several aspects of our model and shed light on its improved performance.

pdf bib
MedNLI Is Not Immune : Natural Language Inference Artifacts in the Clinical DomainMedNLI Is Not Immune: Natural Language Inference Artifacts in the Clinical Domain
Christine Herlihy | Rachel Rudinger

Crowdworker-constructed natural language inference (NLI) datasets have been found to contain statistical artifacts associated with the annotation process that allow hypothesis-only classifiers to achieve better-than-random performance (CITATION). We investigate whether MedNLI, a physician-annotated dataset with premises extracted from clinical notes, contains such artifacts (CITATION). We find that entailed hypotheses contain generic versions of specific concepts in the premise, as well as modifiers related to responsiveness, duration, and probability. Neutral hypotheses feature conditions and behaviors that co-occur with, or cause, the condition(s) in the premise. Contradiction hypotheses feature explicit negation of the premise and implicit negation via assertion of good health. Adversarial filtering demonstrates that performance degrades when evaluated on the difficult subset. We provide partition information and recommendations for alternative dataset construction strategies for knowledge-intensive domains.difficult subset. We provide partition information and recommendations for alternative dataset construction strategies for knowledge-intensive domains.

pdf bib
Towards a more Robust Evaluation for Conversational Question Answering
Wissam Siblini | Baris Sayil | Yacine Kessaci

With the explosion of chatbot applications, Conversational Question Answering (CQA) has generated a lot of interest in recent years. Among proposals, reading comprehension models which take advantage of the conversation history (previous QA) seem to answer better than those which only consider the current question. Nevertheless, we note that the CQA evaluation protocol has a major limitation. In particular, models are allowed, at each turn of the conversation, to access the ground truth answers of the previous turns. Not only does this severely prevent their applications in fully autonomous chatbots, it also leads to unsuspected biases in their behavior. In this paper, we highlight this effect and propose new tools for evaluation and training in order to guard against the noted issues. The new results that we bring come to reinforce methods of the current state of the art.

pdf bib
Avoiding Overlap in Data Augmentation for AMR-to-Text GenerationAMR-to-Text Generation
Wenchao Du | Jeffrey Flanigan

Leveraging additional unlabeled data to boost model performance is common practice in machine learning and natural language processing. For generation tasks, if there is overlap between the additional data and the target text evaluation data, then training on the additional data is training on answers of the test set. This leads to overly-inflated scores with the additional data compared to real-world testing scenarios and problems when comparing models. We study the AMR dataset and Gigaword, which is popularly used for improving AMR-to-text generators, and find significant overlap between Gigaword and a subset of the AMR dataset. We propose methods for excluding parts of Gigaword to remove this overlap, and show that our approach leads to a more realistic evaluation of the task of AMR-to-text generation. Going forward, we give simple best-practice recommendations for leveraging additional data in AMR-to-text generation.

pdf bib
Weakly-Supervised Methods for Suicide Risk Assessment : Role of Related Domains
Chenghao Yang | Yudong Zhang | Smaranda Muresan

Social media has become a valuable resource for the study of suicidal ideation and the assessment of suicide risk. Among social media platforms, Reddit has emerged as the most promising one due to its anonymity and its focus on topic-based communities (subreddits) that can be indicative of someone’s state of mind or interest regarding mental health disorders such as r / SuicideWatch, r / Anxiety, r / depression. A challenge for previous work on suicide risk assessment has been the small amount of labeled data. We propose an empirical investigation into several classes of weakly-supervised approaches, and show that using pseudo-labeling based on related issues around mental health (e.g., anxiety, depression) helps improve model performance for suicide risk assessment.

pdf bib
Can Transformer Models Measure Coherence In Text : Re-Thinking the Shuffle Test
Philippe Laban | Luke Dai | Lucas Bandarkar | Marti A. Hearst

The Shuffle Test is the most common task to evaluate whether NLP models can measure coherence in text. Most recent work uses direct supervision on the task ; we show that by simply finetuning a RoBERTa model, we can achieve a near perfect accuracy of 97.8 %, a state-of-the-art. We argue that this outstanding performance is unlikely to lead to a good model of text coherence, and suggest that the Shuffle Test should be approached in a Zero-Shot setting : models should be evaluated without being trained on the task itself. We evaluate common models in this setting, such as Generative and Bi-directional Transformers, and find that larger architectures achieve high-performance out-of-the-box. Finally, we suggest the k-Block Shuffle Test, a modification of the original by increasing the size of blocks shuffled. Even though human reader performance remains high (around 95 % accuracy), model performance drops from 94 % to 78 % as block size increases, creating a conceptually simple challenge to benchmark NLP models.

pdf bib
SaRoCo : Detecting Satire in a Novel Romanian Corpus of News ArticlesSaRoCo: Detecting Satire in a Novel Romanian Corpus of News Articles
Ana-Cristina Rogoz | Gaman Mihaela | Radu Tudor Ionescu

In this work, we introduce a corpus for satire detection in Romanian news. We gathered 55,608 public news articles from multiple real and satirical news sources, composing one of the largest corpora for satire detection regardless of language and the only one for the Romanian language. We provide an official split of the text samples, such that training news articles belong to different sources than test news articles, thus ensuring that models do not achieve high performance simply due to overfitting. We conduct experiments with two state-of-the-art deep neural models, resulting in a set of strong baselines for our novel corpus. Our results show that the machine-level accuracy for satire detection in Romanian is quite low (under 73 % on the test set) compared to the human-level accuracy (87 %), leaving enough room for improvement in future research.


pdf (full)
bib (full)
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

pdf bib
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop
Jad Kabbara | Haitao Lin | Amandalynne Paullada | Jannis Vamvas

pdf bib
Transformer-Based Direct Hidden Markov Model for Machine TranslationMarkov Model for Machine Translation
Weiyue Wang | Zijian Yang | Yingbo Gao | Hermann Ney

The neural hidden Markov model has been proposed as an alternative to attention mechanism in machine translation with recurrent neural networks. However, since the introduction of the transformer models, its performance has been surpassed. This work proposes to introduce the concept of the hidden Markov model to the transformer architecture, which outperforms the transformer baseline. Interestingly, we find that the zero-order model already provides promising performance, giving it an edge compared to a model with first-order dependency, which performs similarly but is significantly slower in training and decoding.

pdf bib
How Low is Too Low? A Computational Perspective on Extremely Low-Resource Languages
Rachit Bansal | Himanshu Choudhary | Ravneet Punia | Niko Schenk | Émilie Pagé-Perron | Jacob Dahl

Despite the recent advancements of attention-based deep learning architectures across a majority of Natural Language Processing tasks, their application remains limited in a low-resource setting because of a lack of pre-trained models for such languages. In this study, we make the first attempt to investigate the challenges of adapting these techniques to an extremely low-resource language Sumerian cuneiform one of the world’s oldest written languages attested from at least the beginning of the 3rd millennium BC. Specifically, we introduce the first cross-lingual information extraction pipeline for Sumerian, which includes part-of-speech tagging, named entity recognition, and machine translation. We introduce InterpretLR, an interpretability toolkit for low-resource NLP and use it alongside human evaluations to gauge the trained models. Notably, all our techniques and most components of our pipeline can be generalised to any low-resource language. We publicly release all our implementations including a novel data set with domain-specific pre-processing to promote further research in this domain.

pdf bib
Long Document Summarization in a Low Resource Setting using Pretrained Language Models
Ahsaas Bajaj | Pavitra Dangati | Kalpesh Krishna | Pradhiksha Ashok Kumar | Rheeya Uppaal | Bradford Windsor | Eliot Brenner | Dominic Dotterrer | Rajarshi Das | Andrew McCallum

Abstractive summarization is the task of compressing a long document into a coherent short document while retaining salient information. Modern abstractive summarization methods are based on deep neural networks which often require large training datasets. Since collecting summarization datasets is an expensive and time-consuming task, practical industrial settings are usually low-resource. In this paper, we study a challenging low-resource setting of summarizing long legal briefs with an average source document length of 4268 words and only 120 available (document, summary) pairs. To account for data scarcity, we used a modern pre-trained abstractive summarizer BART, which only achieves 17.9 ROUGE-L as it struggles with long documents. We thus attempt to compress these long documents by identifying salient sentences in the source which best ground the summary, using a novel algorithm based on GPT-2 language model perplexity scores, that operates within the low resource regime. On feeding the compressed documents to BART, we observe a 6.0 ROUGE-L improvement. Our method also beats several competitive salience detection baselines. Furthermore, the identified salient sentences tend to agree with independent human labeling by domain experts.

pdf bib
Situation-Based Multiparticipant Chat Summarization : a Concept, an Exploration-Annotation Tool and an Example Collection
Anna Smirnova | Evgeniy Slobodkin | George Chernishev

Currently, text chatting is one of the primary means of communication. However, modern text chat still in general does not offer any navigation or even full-featured search, although the high volumes of messages demand it. In order to mitigate these inconveniences, we formulate the problem of situation-based summarization and propose a special data annotation tool intended for developing training and gold-standard data. A situation is a subset of messages revolving around a single event in both temporal and contextual senses : e.g, a group of friends arranging a meeting in chat, agreeing on date, time, and place. Situations can be extracted via information retrieval, natural language processing, and machine learning techniques. Since the task is novel, neither training nor gold-standard datasets for it have been created yet. In this paper, we present the formulation of the situation-based summarization problem. Next, we describe Chat Corpora Annotator (CCA): the first annotation system designed specifically for exploring and annotating chat log data. We also introduce a custom query language for semi-automatic situation extraction. Finally, we present the first gold-standard dataset for situation-based summarization. The software source code and the dataset are publicly available.

pdf bib
Modeling Text using the Continuous Space Topic Model with Pre-Trained Word Embeddings
Seiichi Inoue | Taichi Aida | Mamoru Komachi | Manabu Asai

In this study, we propose a model that extends the continuous space topic model (CSTM), which flexibly controls word probability in a document, using pre-trained word embeddings. To develop the proposed model, we pre-train word embeddings, which capture the semantics of words and plug them into the CSTM. Intrinsic experimental results show that the proposed model exhibits a superior performance over the CSTM in terms of perplexity and convergence speed. Furthermore, extrinsic experimental results show that the proposed model is useful for a document classification task when compared with the baseline model. We qualitatively show that the latent coordinates obtained by training the proposed model are better than those of the baseline model.

pdf bib
Hold on honey, men at work : A semi-supervised approach to detecting sexism in sitcoms
Smriti Singh | Tanvi Anand | Arijit Ghosh Chowdhury | Zeerak Waseem

Television shows play an important role inpropagating societal norms. Owing to the popularity of the situational comedy (sitcom) genre, it contributes significantly to the over-all development of society. In an effort to analyze the content of television shows belong-ing to this genre, we present a dataset of dialogue turns from popular sitcoms annotated for the presence of sexist remarks. We train a text classification model to detect sexism using domain adaptive learning. We apply the model to our dataset to analyze the evolution of sexist content over the years. We propose a domain-specific semi-supervised architecture for the aforementioned detection of sexism. Through extensive experiments, we show that our model often yields better classification performance over generic deep learn-ing based sentence classification that does not employ domain-specific training. We find that while sexism decreases over time on average, the proportion of sexist dialogue for the most sexist sitcom actually increases. A quantitative analysis along with a detailed error analysis presents the case for our proposed methodology

pdf bib
Improving the Robustness of QA Models to Challenge Sets with Variational Question-Answer Pair GenerationQA Models to Challenge Sets with Variational Question-Answer Pair Generation
Kazutoshi Shinoda | Saku Sugawara | Akiko Aizawa

Question answering (QA) models for reading comprehension have achieved human-level accuracy on in-distribution test sets. However, they have been demonstrated to lack robustness to challenge sets, whose distribution is different from that of training sets. Existing data augmentation methods mitigate this problem by simply augmenting training sets with synthetic examples sampled from the same distribution as the challenge sets. However, these methods assume that the distribution of a challenge set is known a priori, making them less applicable to unseen challenge sets. In this study, we focus on question-answer pair generation (QAG) to mitigate this problem. While most existing QAG methods aim to improve the quality of synthetic examples, we conjecture that diversity-promoting QAG can mitigate the sparsity of training sets and lead to better robustness. We present a variational QAG model that generates multiple diverse QA pairs from a paragraph. Our experiments show that our method can improve the accuracy of 12 challenge sets, as well as the in-distribution accuracy.

pdf bib
Edit Distance Based Curriculum Learning for Paraphrase Generation
Sora Kadotani | Tomoyuki Kajiwara | Yuki Arase | Makoto Onizuka

Curriculum learning has improved the quality of neural machine translation, where only source-side features are considered in the metrics to determine the difficulty of translation. In this study, we apply curriculum learning to paraphrase generation for the first time. Different from machine translation, paraphrase generation allows a certain level of discrepancy in semantics between source and target, which results in diverse transformations from lexical substitution to reordering of clauses. Hence, the difficulty of transformations requires considering both source and target contexts. Experiments on formality transfer using GYAFC showed that our curriculum learning with edit distance improves the quality of paraphrase generation. Additionally, the proposed method improves the quality of difficult samples, which was not possible for previous methods.

pdf bib
CMTA : COVID-19 Misinformation Multilingual Analysis on TwitterCMTA: COVID-19 Misinformation Multilingual Analysis on Twitter
Raj Pranesh | Mehrdad Farokhenajd | Ambesh Shekhar | Genoveva Vargas-Solar

The internet has actually come to be an essential resource of health knowledge for individuals around the world in the present situation of the coronavirus condition pandemic(COVID-19). During pandemic situations, myths, sensationalism, rumours and misinformation, generated intentionally or unintentionally, spread rapidly through social networks. Twitter is one of these popular social networks people use to share COVID-19 related news, information, and thoughts that reflect their perception and opinion about the pandemic. Evaluation of tweets for recognizing misinformation can create beneficial understanding to review the top quality and also the readability of online information concerning the COVID-19. This paper presents a multilingual COVID-19 related tweet analysis method, CMTA, that uses BERT, a deep learning model for multilingual tweet misinformation detection and classification. CMTA extracts features from multilingual textual data, which is then categorized into specific information classes. Classification is done by a Dense-CNN model trained on tweets manually annotated into information classes (i.e., ‘false’, ‘partly false’, ‘misleading’). The paper presents an analysis of multilingual tweets from February to June, showing the distribution type of information spread across different languages. To access the performance of the CMTA multilingual model, we performed a comparative analysis of 8 monolingual model and CMTA for the misinformation detection task. The results show that our proposed CMTA model has surpassed various monolingual models which consolidated the fact that through transfer learning a multilingual framework could be developed.


pdf (full)
bib (full)
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations

pdf bib
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations
Heng Ji | Jong C. Park | Rui Xia

pdf bib
IntelliCAT : Intelligent Machine Translation Post-Editing with Quality Estimation and Translation SuggestionIntelliCAT: Intelligent Machine Translation Post-Editing with Quality Estimation and Translation Suggestion
Dongjun Lee | Junhyeong Ahn | Heesoo Park | Jaemin Jo

We present IntelliCAT, an interactive translation interface with neural models that streamline the post-editing process on machine translation output. We leverage two quality estimation (QE) models at different granularities : sentence-level QE, to predict the quality of each machine-translated sentence, and word-level QE, to locate the parts of the machine-translated sentence that need correction. Additionally, we introduce a novel translation suggestion model conditioned on both the left and right contexts, providing alternatives for specific words or phrases for correction. Finally, with word alignments, IntelliCAT automatically preserves the original document’s styles in the translated document. The experimental results show that post-editing based on the proposed QE and translation suggestions can significantly improve translation quality. Furthermore, a user study reveals that three features provided in IntelliCAT significantly accelerate the post-editing task, achieving a 52.9 % speedup in translation time compared to translating from scratch. The interface is publicly available at

pdf bib
TextBox : A Unified, Modularized, and Extensible Framework for Text GenerationTextBox: A Unified, Modularized, and Extensible Framework for Text Generation
Junyi Li | Tianyi Tang | Gaole He | Jinhao Jiang | Xiaoxuan Hu | Puzhao Xie | Zhipeng Chen | Zhuohao Yu | Wayne Xin Zhao | Ji-Rong Wen

In this paper, we release an open-source library, called TextBox, to provide a unified, modularized, and extensible text generation framework. TextBox aims to support a broad set of text generation tasks and models. In our library, we implement 21 text generation models on 9 benchmark datasets, covering the categories of VAE, GAN, and pretrained language models. Meanwhile, our library maintains sufficient modularity and extensibility by properly decomposing the model architecture, inference, and learning process into highly reusable modules, which allows users to easily incorporate new models into our framework. The above features make TextBox especially suitable for researchers and practitioners to quickly reproduce baseline models and develop new models. TextBox is implemented based on PyTorch, and released under Apache License 2.0 at the link.

pdf bib
Inside ASCENT : Exploring a Deep Commonsense Knowledge Base and its Usage in Question AnsweringASCENT: Exploring a Deep Commonsense Knowledge Base and its Usage in Question Answering
Tuan-Phong Nguyen | Simon Razniewski | Gerhard Weikum

ASCENT is a fully automated methodology for extracting and consolidating commonsense assertions from web contents (Nguyen et al., 2021). It advances traditional triple-based commonsense knowledge representation by capturing semantic facets like locations and purposes, and composite concepts, i.e., subgroups and related aspects of subjects. In this demo, we present a web portal that allows users to understand its construction process, explore its content, and observe its impact in the use case of question answering. The demo website ( and an introductory video ( are both available online.

pdf bib
NeurST : Neural Speech Translation ToolkitNeurST: Neural Speech Translation Toolkit
Chengqi Zhao | Mingxuan Wang | Qianqian Dong | Rong Ye | Lei Li

NeurST is an open-source toolkit for neural speech translation. The toolkit mainly focuses on end-to-end speech translation, which is easy to use, modify, and extend to advanced speech translation research and products. NeurST aims at facilitating the speech translation research for NLP researchers and building reliable benchmarks for this field. It provides step-by-step recipes for feature extraction, data preprocessing, distributed training, and evaluation. In this paper, we will introduce the framework design of NeurST and show experimental results for different benchmark datasets, which can be regarded as reliable baselines for future research. The toolkit is publicly available at and we will continuously update the performance of with other counterparts and studies at. and we will continuously update the performance of with other counterparts and studies at

pdf bib
MT-Telescope : An interactive platform for contrastive evaluation of MT systemsMT-Telescope: An interactive platform for contrastive evaluation of MT systems
Ricardo Rei | Ana C Farinha | Craig Stewart | Luisa Coheur | Alon Lavie

We present MT-Telescope, a visualization platform designed to facilitate comparative analysis of the output quality of two Machine Translation (MT) systems. While automated MT evaluation metrics are commonly used to evaluate MT systems at a corpus-level, our platform supports fine-grained segment-level analysis and interactive visualisations that expose the fundamental differences in the performance of the compared systems. MT-Telescope also supports dynamic corpus filtering to enable focused analysis on specific phenomena such as ; translation of named entities, handling of terminology, and the impact of input segment length on translation quality. Furthermore, the platform provides a bootstrapped t-test for statistical significance as a means of evaluating the rigor of the resulting system ranking. MT-Telescope is open source, written in Python, and is built around a user friendly and dynamic web interface. Complementing other existing tools, our platform is designed to facilitate and promote the broader adoption of more rigorous analysis practices in the evaluation of MT quality.

pdf bib
CogIE : An Information Extraction Toolkit for Bridging Texts and CogNetCogIE: An Information Extraction Toolkit for Bridging Texts and CogNet
Zhuoran Jin | Yubo Chen | Dianbo Sui | Chenhao Wang | Zhipeng Xue | Jun Zhao

CogNet is a knowledge base that integrates three types of knowledge : linguistic knowledge, world knowledge and commonsense knowledge. In this paper, we propose an information extraction toolkit, called CogIE, which is a bridge connecting raw texts and CogNet. CogIE has three features : versatile, knowledge-grounded and extensible. First, CogIE is a versatile toolkit with a rich set of functional modules, including named entity recognition, entity typing, entity linking, relation extraction, event extraction and frame-semantic parsing. Second, as a knowledge-grounded toolkit, CogIE can ground the extracted facts to CogNet and leverage different types of knowledge to enrich extracted results. Third, for extensibility, owing to the design of three-tier architecture, CogIE is not only a plug-and-play toolkit for developers but also an extensible programming framework for researchers. We release an open-access online system to visually extract information from texts. Source code, datasets and pre-trained models are publicly available at GitHub, with a short instruction video.

pdf bib
fastHan : A BERT-based Multi-Task Toolkit for Chinese NLPHan: A BERT-based Multi-Task Toolkit for Chinese NLP
Zhichao Geng | Hang Yan | Xipeng Qiu | Xuanjing Huang

We present fastHan, an open-source toolkit for four basic tasks in Chinese natural language processing : Chinese word segmentation (CWS), Part-of-Speech (POS) tagging, named entity recognition (NER), and dependency parsing. The backbone of fastHan is a multi-task model based on a pruned BERT, which uses the first 8 layers in BERT. We also provide a 4-layer base model compressed from the 8-layer model. The joint-model is trained and evaluated on 13 corpora of four tasks, yielding near state-of-the-art (SOTA) performance in dependency parsing and NER, achieving SOTA performance in CWS and POS. Besides, fastHan’s transferability is also strong, performing much better than popular segmentation tools on a non-training corpus. To better meet the need of practical application, we allow users to use their own labeled data to further fine-tune fastHan. In addition to its small size and excellent performance, fastHan is user-friendly. Implemented as a python package, fastHan isolates users from the internal technical details and is convenient to use. The project is released on Github.

pdf bib
Erase and Rewind : Manual Correction of NLP Output through a Web InterfaceNLP Output through a Web Interface
Valentino Frasnelli | Lorenzo Bocchi | Alessio Palmero Aprosio

In this paper, we present Tintful, an NLP annotation software that can be used both to manually annotate texts and to fix mistakes in NLP pipelines, such as Stanford CoreNLP. Using a paradigm similar to wiki-like systems, a user who notices some wrong annotation can easily fix it and submit the resulting (and right) entry back to the tool developers. Moreover, Tintful can be used to easily annotate data from scratch. The input documents do not need to be in a particular format : starting from the plain text, the sentences are first annotated with CoreNLP, then the user can edit the annotations and submit everything back through a user-friendly interface.

pdf bib
ESRA : Explainable Scientific Research AssistantESRA: Explainable Scientific Research Assistant
Pollawat Hongwimol | Peeranuth Kehasukcharoen | Pasit Laohawarutchai | Piyawat Lertvittayakumjorn | Aik Beng Ng | Zhangsheng Lai | Timothy Liu | Peerapon Vateekul

We introduce Explainable Scientific Research Assistant (ESRA), a literature discovery platform that augments search results with relevant details and explanations, aiding users in understanding more about their queries and the returned papers beyond existing literature search systems. Enabled by a knowledge graph we extracted from abstracts of 23k papers on the arXiv’s cs. CL category, ESRA provides three main features : explanation (for why a paper is returned to the user), list of facts (that are relevant to the query), and graph visualization (drawing connections between the query and each paper with surrounding related entities). The experimental results with humans involved show that ESRA can accelerate the users’ search process with paper explanations and helps them better explore the landscape of the topics of interest by exploiting the underlying knowledge graph. We provide the ESRA web application at

pdf bib
REM : Efficient Semi-Automated Real-Time Moderation of Online ForumsREM: Efficient Semi-Automated Real-Time Moderation of Online Forums
Jakob Smedegaard Andersen | Olaf Zukunft | Walid Maalej

This paper presents REM, a novel tool for the semi-automated real-time moderation of large scale online forums. The growing demand for online participation and the increasing number of user comments raise challenges in filtering out harmful and undesirable content from public debates in online forums. Since a manual moderation does not scale well and pure automated approaches often lack the required level of accuracy, we suggest a semi-automated moderation approach. Our approach maximizes the efficiency of manual efforts by targeting only those comments for which human intervention is needed, e.g. due to high classification uncertainty. Our tool offers a rich visual interactive environment enabling the exploration of online debates. We conduct a preliminary evaluation experiment to demonstrate the suitability of our approach and publicly release the source code of REM.

pdf bib
A Graphical Interface for Curating Schemas
Piyush Mishra | Akanksha Malhotra | Susan Windisch Brown | Martha Palmer | Ghazaleh Kazeminejad

Much past work has focused on extracting information like events, entities, and relations from documents. Very little work has focused on analyzing these results for better model understanding. In this paper, we introduce a curation interface that takes an Information Extraction (IE) system’s output in a pre-defined format and generates a graphical representation of its elements. The interface supports editing while curating schemas for complex events like Improvised Explosive Device (IED) based scenarios. We identify various schemas that either have linear event chains or contain parallel events with complicated temporal ordering. We iteratively update an induced schema to uniquely identify events specific to it, add optional events around them, and prune unnecessary events. The resulting schemas are improved and enriched versions of the machine-induced versions.

pdf bib
CLTR : An End-to-End, Transformer-Based System for Cell-Level Table Retrieval and Table Question AnsweringCLTR: An End-to-End, Transformer-Based System for Cell-Level Table Retrieval and Table Question Answering
Feifei Pan | Mustafa Canim | Michael Glass | Alfio Gliozzo | Peter Fox

We present the first end-to-end, transformer-based table question answering (QA) system that takes natural language questions and massive table corpora as inputs to retrieve the most relevant tables and locate the correct table cells to answer the question. Our system, CLTR, extends the current state-of-the-art QA over tables model to build an end-to-end table QA architecture. This system has successfully tackled many real-world table QA problems with a simple, unified pipeline. Our proposed system can also generate a heatmap of candidate columns and rows over complex tables and allow users to quickly identify the correct cells to answer questions. In addition, we introduce two new open domain benchmarks, E2E_WTQ and E2E_GNQ, consisting of 2,005 natural language questions over 76,242 tables. The benchmarks are designed to validate CLTR as well as accommodate future table retrieval and end-to-end table QA research and experiments. Our experiments demonstrate that our system is the current state-of-the-art model on the table retrieval task and produces promising results for end-to-end table QA.

pdf bib
FastSeq : Make Sequence Generation FasterFastSeq: Make Sequence Generation Faster
Yu Yan | Fei Hu | Jiusheng Chen | Nikhil Bhendawade | Ting Ye | Yeyun Gong | Nan Duan | Desheng Cui | Bingyu Chi | Ruofei Zhang

Transformer-based models have made tremendous impacts in natural language generation. However the inference speed is a bottleneck due to large model size and intensive computing involved in auto-regressive decoding process. We develop FastSeq framework to accelerate sequence generation without accuracy loss. The proposed optimization techniques include an attention cache optimization, an efficient algorithm for detecting repeated n-grams, and an asynchronous generation pipeline with parallel I / O. These optimizations are general enough to be applicable to Transformer-based models (e.g., T5, GPT2, and UniLM). Our benchmark results on a set of widely used and diverse models demonstrate 4-9x inference speed gain. Additionally, FastSeq is easy to use with a simple one-line code change. The source code is available at

pdf bib
Ecco : An Open Source Library for the Explainability of Transformer Language Models
J Alammar

Our understanding of why Transformer-based NLP models have been achieving their recent success lags behind our ability to continue scaling these models. To increase the transparency of Transformer-based language models, we present Ecco an open-source library for the explainability of Transformer-based NLP models. Ecco provides a set of tools to capture, analyze, visualize, and interactively explore the inner mechanics of these models. This includes (1) gradient-based feature attribution for natural language generation (2) hidden states and their evolution between model layers (3) convenient access and examination tools for neuron activations in the under-explored Feed-Forward Neural Network sublayer of Transformer layers. (4) convenient examination of activation vectors via canonical correlation analysis (CCA), non-negative matrix factorization (NMF), and probing classifiers. We find that syntactic information can be retrieved from BERT’s FFNN representations in levels comparable to those in hidden state representations. More curiously, we find that the model builds up syntactic information in its hidden states even when intermediate FFNNs indicate diminished levels of syntactic information. Ecco is available at

pdf bib
TweeNLP : A Twitter Exploration Portal for Natural Language ProcessingTweeNLP: A Twitter Exploration Portal for Natural Language Processing
Viraj Shah | Shruti Singh | Mayank Singh

We present TweeNLP, a one-stop portal that organizes Twitter’s natural language processing (NLP) data and builds a visualization and exploration platform. It curates 19,395 tweets (as of April 2021) from various NLP conferences and general NLP discussions. It supports multiple features such as TweetExplorer to explore tweets by topics, visualize insights from Twitter activity throughout the organization cycle of conferences, discover popular research papers and researchers. It also builds a timeline of conference and workshop submission deadlines. We envision TweeNLP to function as a collective memory unit for the NLP community by integrating the tweets pertaining to research papers with the NLPExplorer scientific literature search engine. The current system is hosted at

pdf bib
ReTraCk : A Flexible and Efficient Framework for Knowledge Base Question AnsweringReTraCk: A Flexible and Efficient Framework for Knowledge Base Question Answering
Shuang Chen | Qian Liu | Zhiwei Yu | Chin-Yew Lin | Jian-Guang Lou | Feng Jiang

We present Retriever-Transducer-Checker (ReTraCk), a neural semantic parsing framework for large scale knowledge base question answering (KBQA). ReTraCk is designed as a modular framework to maintain high flexibility. It includes a retriever to retrieve relevant KB items efficiently, a transducer to generate logical form with syntax correctness guarantees and a checker to improve transduction procedure. ReTraCk is ranked at top1 overall performance on the GrailQA leaderboard and obtains highly competitive performance on the typical WebQuestionsSP benchmark. Our system can interact with users timely, demonstrating the efficiency of the proposed framework.

pdf bib
TextFlint : Unified Multilingual Robustness Evaluation Toolkit for Natural Language ProcessingTextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing
Xiao Wang | Qin Liu | Tao Gui | Qi Zhang | Yicheng Zou | Xin Zhou | Jiacheng Ye | Yongxin Zhang | Rui Zheng | Zexiong Pang | Qinzhuo Wu | Zhengyan Li | Chong Zhang | Ruotian Ma | Zichu Fei | Ruijian Cai | Jun Zhao | Xingwu Hu | Zhiheng Yan | Yiding Tan | Yuan Hu | Qiyuan Bian | Zhihua Liu | Shan Qin | Bolin Zhu | Xiaoyu Xing | Jinlan Fu | Yue Zhang | Minlong Peng | Xiaoqing Zheng | Yaqian Zhou | Zhongyu Wei | Xipeng Qiu | Xuanjing Huang

TextFlint is a multilingual robustness evaluation toolkit for NLP tasks that incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analyses. This enables practitioners to automatically evaluate their models from various aspects or to customize their evaluations as desired with just a few lines of code. TextFlint also generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model in terms of its robustness. To guarantee acceptability, all the text transformations are linguistically based and all the transformed data selected (up to 100,000 texts) scored highly under human evaluation. To validate the utility, we performed large-scale empirical evaluations (over 67,000) on state-of-the-art deep learning models, classic supervised methods, and real-world systems. The toolkit is already available at with all the evaluation results demonstrated at

pdf bib
OpenAttack : An Open-source Textual Adversarial Attack ToolkitOpenAttack: An Open-source Textual Adversarial Attack Toolkit
Guoyang Zeng | Fanchao Qi | Qianrui Zhou | Tingji Zhang | Zixian Ma | Bairu Hou | Yuan Zang | Zhiyuan Liu | Maosong Sun

Textual adversarial attacking has received wide and increasing attention in recent years. Various attack models have been proposed, which are enormously distinct and implemented with different programming frameworks and settings. These facts hinder quick utilization and fair comparison of attack models. In this paper, we present an open-source textual adversarial attack toolkit named OpenAttack to solve these issues. Compared with existing other textual adversarial attack toolkits, OpenAttack has its unique strengths in support for all attack types, multilinguality, and parallel processing. Currently, OpenAttack includes 15 typical attack models that cover all attack types. Its highly inclusive modular design not only supports quick utilization of existing attack models, but also enables great flexibility and extensibility. OpenAttack has broad uses including comparing and evaluating attack models, measuring robustness of a model, assisting in developing new attack models, and adversarial training. Source code and documentation can be obtained at


pdf (full)
bib (full)
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Tutorial Abstracts

pdf bib
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Tutorial Abstracts
David Chiang | Min Zhang

pdf bib
Advances in Debating Technologies : Building AI That Can Debate HumansAI That Can Debate Humans
Roy Bar-Haim | Liat Ein-Dor | Matan Orbach | Elad Venezian | Noam Slonim

The tutorial focuses on Debating Technologies, a sub-field of computational argumentation defined as computational technologies developed directly to enhance, support, and engage with human debating (Gurevych et al., 2016). A recent milestone in this field is Project Debater, which was revealed in 2019 as the first AI system that can debate human experts on complex topics. Project Debater is the third in the series of IBM Research AI’s grand challenges, following Deep Blue and Watson. It has been developed for over six years by a large team of researchers and engineers, and its live demonstration in February 2019 received massive media attention. This research effort has resulted in more than 50 scientific papers to date, and many datasets freely available for research purposes. We discuss the scientific challenges that arise when building such a system, including argument mining, argument quality assessment, stance classification, principled argument detection, narrative generation, and rebutting a human opponent. Many of the underlying capabilities of Project Debater have been made freely available for academic research, and the tutorial will include a detailed explanation of how to use and leverage these tools. In addition to discussing individual components, the tutorial also provides a holistic view of a debating system. Such a view is largely missing in the academic literature, where each paper typically addresses a specific problem in isolation. We present a complete pipeline of a debating system, and discuss the information flow and the interaction between the various components. Finally, we discuss practical applications and future challenges of debating technologies.

pdf bib
Pre-training Methods for Neural Machine Translation
Mingxuan Wang | Lei Li

This tutorial provides a comprehensive guide to make the most of pre-training for neural machine translation. Firstly, we will briefly introduce the background of NMT, pre-training methodology, and point out the main challenges when applying pre-training for NMT. Then we will focus on analysing the role of pre-training in enhancing the performance of NMT, how to design a better pre-training model for executing specific NMT tasks and how to better integrate the pre-trained model into NMT system. In each part, we will provide examples, discuss training techniques and analyse what is transferred when applying pre-training.


pdf (full)
bib (full)
Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future

pdf bib
Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future
Kenneth Church | Mark Liberman | Valia Kordoni

pdf bib
How Might We Create Better Benchmarks for Speech Recognition?
Alëna Aksënova | Daan van Esch | James Flynn | Pavel Golik

The applications of automatic speech recognition (ASR) systems are proliferating, in part due to recent significant quality improvements. However, as recent work indicates, even state-of-the-art speech recognition systems some which deliver impressive benchmark results, struggle to generalize across use cases. We review relevant work, and, hoping to inform future benchmark development, outline a taxonomy of speech recognition use cases, proposed for the next generation of ASR benchmarks. We also survey work on metrics, in addition to the de facto standard Word Error Rate (WER) metric, and we introduce a versatile framework designed to describe interactions between linguistic variation and ASR performance metrics.


pdf (full)
bib (full)
Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021)

pdf bib
Proceedings of the 4th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021)
Ali Hürriyetoğlu

pdf bib
Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021): Workshop and Shared Task ReportCASE 2021): Workshop and Shared Task Report
Ali Hürriyetoğlu | Hristo Tanev | Vanni Zavarella | Jakub Piskorski | Reyyan Yeniterzi | Osman Mutlu | Deniz Yuret | Aline Villavicencio

This workshop is the fourth issue of a series of workshops on automatic extraction of socio-political events from news, organized by the Emerging Market Welfare Project, with the support of the Joint Research Centre of the European Commission and with contributions from many other prominent scholars in this field. The purpose of this series of workshops is to foster research and development of reliable, valid, robust, and practical solutions for automatically detecting descriptions of socio-political events, such as protests, riots, wars and armed conflicts, in text streams. This year workshop contributors make use of the state-of-the-art NLP technologies, such as Deep Learning, Word Embeddings and Transformers and cover a wide range of topics from text classification to news bias detection. Around 40 teams have registered and 15 teams contributed to three tasks that are i) multilingual protest news detection detection, ii) fine-grained classification of socio-political events, and iii) discovering Black Lives Matter protest events. The workshop also highlights two keynote and four invited talks about various aspects of creating event data sets and multi- and cross-lingual machine learning in few- and zero-shot settings.

pdf bib
Keynote Abstract : Events on a Global Scale : Towards Language-Agnostic Event Extraction
Elizabeth Boschee

Event extraction is a challenging and exciting task in the world of machine learning & natural language processing. The breadth of events of possible interest, the speed at which surrounding socio-political event contexts evolve, and the complexities involved in generating representative annotated data all contribute to this challenge. One particular dimension of difficulty is the intrinsically global nature of events : many downstream use cases for event extraction involve reporting not just in a few major languages but in a much broader context. The languages of interest for even a fixed task may still shift from day to day, e.g. when a disease emerges in an unexpected location. Early approaches to multi-lingual event extraction (e.g. ACE) relied wholly on supervised data provided in each language of interest. Later approaches leveraged the success of machine translation to side-step the issue, simply translating foreign-language content to English and deploying English models on the result (often leaving some significant portion of the original content behind). Most recently, however, the community has begun to shown significant progress applying zero-shot transfer techniques to the problem, developing models using supervised English data but decoding in a foreign language without translation, typically using embedding spaces specifically designed to capture multi-lingual semantic content. In this talk I will discuss multiple dimensions of these promising new approaches and the linguistic representations that underlie them. I will compare them with approaches based on machine translation (as well as with models trained using in-language training data, where available), and discuss their strengths and weaknesses in different contexts, including the amount of English / foreign bitext available and the nature of the target event ontology. I will also discuss possible future directions with an eye to improving the quality of event extraction no matter its source around the globe.

pdf bib
Keynote Abstract : Machine Learning in Conflict Studies : Reflections on Ethics, Collaboration, and Ongoing Challenges
Kristine Eck

Advances in machine learning are nothing short of revolutionary in their potential to analyze massive amounts of data and in doing so, create new knowledge bases. But there is a responsibility in wielding the power to analyze these data since the public attributes a high degree of confidence to results which are based on big datasets. In this keynote, I will first address our ethical imperative as scholars to get it right. This imperative relates not only to model precision but also to the quality of the underlying data, and to whether the models inadvertently reproduce or obscure political biases in the source material. In considering the ethical imperative to get it right, it is also important to define what is right : what is considered an acceptable threshold for classification success needs to be understood in light of the project’s objectives. I then reflect on the different topics and data which are sourced in this field. Much of the existing research has focused on identifying conflict events (e.g. battles), but scholars are also increasingly turning to ML approaches to address other facets of the conflict environment. Conflict event extraction has long been a challenge for the natural language processing (NLP) community because it requires sophisticated methods for defining event ontologies, creating language resources, and developing algorithmic approaches. NLP machine-learning tools are ill-adapted to the complex, often messy, and diverse data generated during conflicts. Relative to other types of NLP text corpora, conflicts tend to generate less textual data, and texts are generated non-systematically. Conflict-related texts are often lexically idiosyncratic and tend to be written differently across actors, periods, and conflicts. Event definition and adjudication present tough challenges in the context of conflict corpora. Topics which rely on other types of data may be better-suited to NLP and machine learning methods. For example, Twitter and other social media data lend themselves well to studying hate speech, public opinion, social polarization, or discursive aspects of conflictual environments. Likewise, government-produced policy documents have typically been analyzed with historical, qualitative methods but their standardized formats and quantity suggest that ML methods can provide new traction. ML approaches may also allow scholars to exploit local sources and multi-language sources to a greater degree than has been possible. Many challenges remain, and these are best addressed in collaborative projects which build on interdisciplinary expertise. Classification projects need to be anchored in the theoretical interests of scholars of political violence if the data they produce are to be put to analytical use. There are few ontologies for classification that adequately reflect conflict researchers’ interests, which highlights the need for conceptual as well as technical development.

pdf bib
Extracting Events from Industrial Incident Reports
Nitin Ramrakhiyani | Swapnil Hingmire | Sangameshwar Patil | Alok Kumar | Girish Palshikar

Incidents in industries have huge social and political impact and minimizing the consequent damage has been a high priority. However, automated analysis of repositories of incident reports has remained a challenge. In this paper, we focus on automatically extracting events from incident reports. Due to absence of event annotated datasets for industrial incidents we employ a transfer learning based approach which is shown to outperform several baselines. We further provide detailed analysis regarding effect of increase in pre-training data and provide explainability of why pre-training improves the performance.

pdf bib
Multilingual Protest News Detection-Shared Task 1, CASE 2021CASE 2021
Ali Hürriyetoğlu | Osman Mutlu | Erdem Yörük | Farhana Ferdousi Liza | Ritesh Kumar | Shyam Ratan

Benchmarking state-of-the-art text classification and information extraction systems in multilingual, cross-lingual, few-shot, and zero-shot settings for socio-political event information collection is achieved in the scope of the shared task Socio-political and Crisis Events Detection at the workshop CASE @ ACL-IJCNLP 2021. Socio-political event data is utilized for national and international policy- and decision-making. Therefore, the reliability and validity of these datasets are of the utmost importance. We split the shared task into three parts to address the three aspects of data collection (Task 1), fine-grained semantic classification (Task 2), and evaluation (Task 3). Task 1, which is the focus of this report, is on multilingual protest news detection and comprises four subtasks that are document classification (subtask 1), sentence classification (subtask 2), event sentence coreference identification (subtask 3), and event extraction (subtask 4). All subtasks had English, Portuguese, and Spanish for both training and evaluation data. Data in Hindi language was available only for the evaluation of subtask 1. The majority of the submissions, which are 238 in total, are created using multi- and cross-lingual approaches. Best scores are above 77.27 F1-macro for subtask 1, above 85.32 F1-macro for subtask 2, above 84.23 CoNLL 2012 average score for subtask 3, and above 66.20 F1-macro for subtask 4 in all evaluation settings.

pdf bib
IIITT at CASE 2021 Task 1 : Leveraging Pretrained Language Models for Multilingual Protest DetectionIIITT at CASE 2021 Task 1: Leveraging Pretrained Language Models for Multilingual Protest Detection
Pawan Kalyan | Duddukunta Reddy | Adeep Hande | Ruba Priyadharshini | Ratnasingam Sakuntharaj | Bharathi Raja Chakravarthi

In a world abounding in constant protests resulting from events like a global pandemic, climate change, religious or political conflicts, there has always been a need to detect events / protests before getting amplified by news media or social media. This paper demonstrates our work on the sentence classification subtask of multilingual protest detection in CASE@ACL-IJCNLP 2021. We approached this task by employing various multilingual pre-trained transformer models to classify if any sentence contains information about an event that has transpired or not. We performed soft voting over the models, achieving the best results among the models, accomplishing a macro F1-Score of 0.8291, 0.7578, and 0.7951 in English, Spanish, and Portuguese, respectively.

pdf bib
NUS-IDS at CASE 2021 Task 1 : Improving Multilingual Event Sentence Coreference Identification With Linguistic InformationNUS-IDS at CASE 2021 Task 1: Improving Multilingual Event Sentence Coreference Identification With Linguistic Information
Fiona Anting Tan | Sujatha Das Gollapalli | See-Kiong Ng

Event Sentence Coreference Identification (ESCI) aims to cluster event sentences that refer to the same event together for information extraction. We describe our ESCI solution developed for the ACL-CASE 2021 shared tasks on the detection and classification of socio-political and crisis event information in a multilingual setting. For a given article, our proposed pipeline comprises of an accurate sentence pair classifier that identifies coreferent sentence pairs and subsequently uses these predicted probabilities to cluster sentences into groups. Sentence pair representations are constructed from fine-tuned BERT embeddings plus POS embeddings fed through a BiLSTM model, and combined with linguistic-based lexical and semantic similarities between sentences. Our best models ranked 2nd, 1st and 2nd and obtained CoNLL F1 scores of 81.20 %, 93.03 %, 83.15 % for the English, Portuguese and Spanish test sets respectively in the ACL-CASE 2021 competition.

pdf bib
IBM MNLP IE at CASE 2021 Task 1 : Multigranular and Multilingual Event Detection on Protest NewsIBM MNLP IE at CASE 2021 Task 1: Multigranular and Multilingual Event Detection on Protest News
Parul Awasthy | Jian Ni | Ken Barker | Radu Florian

In this paper, we present the event detection models and systems we have developed for Multilingual Protest News Detection-Shared Task 1 at CASE 2021. The shared task has 4 subtasks which cover event detection at different granularity levels (from document level to token level) and across multiple languages (English, Hindi, Portuguese and Spanish). To handle data from multiple languages, we use a multilingual transformer-based language model (XLM-R) as the input text encoder. We apply a variety of techniques and build several transformer-based models that perform consistently well across all the subtasks and languages. Our systems achieve an average F_1 score of 81.2. Out of thirteen subtask-language tracks, our submissions rank 1st in nine and 2nd in four tracks.

pdf bib
ALEM at CASE 2021 Task 1 : Multilingual Text Classification on News ArticlesALEM at CASE 2021 Task 1: Multilingual Text Classification on News Articles
Alaeddin Gürel | Emre Emin

We participated CASE shared task in ACL-IJCNLP 2021. This paper is a summary of our experiments and ideas about this shared task. For each subtask we shared our approach, successful and failed methods and our thoughts about them. We submit our results once for every subtask, except for subtask3, in task submission system and present scores based on our validation set formed from given training samples in this paper. Techniques and models we mentioned includes BERT, Multilingual BERT, oversampling, undersampling, data augmentation and their implications with each other. Most of the experiments we came up with were not completed, as time did not permit, but we share them here as we plan to do them as suggested in the future work part of document.

pdf bib
AMU-EURANOVA at CASE 2021 Task 1 : Assessing the stability of multilingual BERTAMU-EURANOVA at CASE 2021 Task 1: Assessing the stability of multilingual BERT
Léo Bouscarrat | Antoine Bonnefoy | Cécile Capponi | Carlos Ramisch

This paper explains our participation in task 1 of the CASE 2021 shared task. This task is about multilingual event extraction from news. We focused on sub-task 4, event information extraction. This sub-task has a small training dataset and we fine-tuned a multilingual BERT to solve this sub-task. We studied the instability problem on the dataset and tried to mitigate it.

pdf bib
Team DaDeFrNi at CASE 2021 Task 1 : Document and Sentence Classification for Protest Event DetectionDaDeFrNi” at CASE 2021 Task 1: Document and Sentence Classification for Protest Event Detection
Francesco Re | Daniel Vegh | Dennis Atzenhofer | Niklas Stoehr

This paper accompanies our top-performing submission to the CASE 2021 shared task, which is hosted at the workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text. Subtasks 1 and 2 of Task 1 concern the classification of newspaper articles and sentences into conflict versus not conflict-related in four different languages. Our model performs competitively in both subtasks (up to 0.8662 macro F1), obtaining the highest score of all contributions for subtask 1 on Hindi articles (0.7877 macro F1). We describe all experiments conducted with the XLM-RoBERTa (XLM-R) model and report results obtained in each binary classification task. We propose supplementing the original training data with additional data on political conflict events. In addition, we provide an analysis of unigram probability estimates and geospatial references contained within the original training corpus.

pdf bib
Fine-grained Event Classification in News-like Text Snippets-Shared Task 2, CASE 2021CASE 2021
Jacek Haneczok | Guillaume Jacquet | Jakub Piskorski | Nicolas Stefanovitch

This paper describes the Shared Task on Fine-grained Event Classification in News-like Text Snippets. The Shared Task is divided into three sub-tasks : (a) classification of text snippets reporting socio-political events (25 classes) for which vast amount of training data exists, although exhibiting different structure and style vis-a-vis test data, (b) enhancement to a generalized zero-shot learning problem, where 3 additional event types were introduced in advance, but without any training data (‘unseen’ classes), and (c) further extension, which introduced 2 additional event types, announced shortly prior to the evaluation phase. The reported Shared Task focuses on classification of events in English texts and is organized as part of the Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE 2021), co-located with the ACL-IJCNLP 2021 Conference. Four teams participated in the task. Best performing systems for the three aforementioned sub-tasks achieved 83.9 %, 79.7 % and 77.1 % weighted F1 scores respectively.

pdf bib