Annual Meeting of the Association for Computational Linguistics (2020)



pdf (full)
bib (full)
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

pdf bib
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Dan Jurafsky | Joyce Chai | Natalie Schluter | Joel Tetreault

pdf bib
Designing Precise and Robust Dialogue Response Evaluators
Tianyu Zhao | Divesh Lala | Tatsuya Kawahara

Automatic dialogue response evaluator has been proposed as an alternative to automated metrics and human evaluation. However, existing automatic evaluators achieve only moderate correlation with human judgement and they are not robust. In this work, we propose to build a reference-free evaluator and exploit the power of semi-supervised training and pretrained (masked) language models. Experimental results demonstrate that the proposed evaluator achieves a strong correlation (0.6) with human judgement and generalizes robustly to diverse responses and corpora. We open-source the code and data in

pdf bib
Generating Informative Conversational Response using Recurrent Knowledge-Interaction and Knowledge-Copy
Xiexiong Lin | Weiyu Jian | Jianshan He | Taifeng Wang | Wei Chu

Knowledge-driven conversation approaches have achieved remarkable research attention recently. However, generating an informative response with multiple relevant knowledge without losing fluency and coherence is still one of the main challenges. To address this issue, this paper proposes a method that uses recurrent knowledge interaction among response decoding steps to incorporate appropriate knowledge. Furthermore, we introduce a knowledge copy mechanism using a knowledge-aware pointer network to copy words from external knowledge according to knowledge attention distribution. Our joint neural conversation model which integrates recurrent Knowledge-Interaction and knowledge Copy (KIC) performs well on generating informative responses. Experiments demonstrate that our model with fewer parameters yields significant improvements over competitive baselines on two datasets Wizard-of-Wikipedia(average Bleu +87 % ; abs. : 0.034) and DuConv(average Bleu +20 % ; abs. : 0.047)) with different knowledge formats (textual & structured) and different languages (English & Chinese).

pdf bib
Slot-consistent NLG for Task-oriented Dialogue Systems with Iterative Rectification NetworkNLG for Task-oriented Dialogue Systems with Iterative Rectification Network
Yangming Li | Kaisheng Yao | Libo Qin | Wanxiang Che | Xiaolong Li | Ting Liu

Data-driven approaches using neural networks have achieved promising performances in natural language generation (NLG). However, neural generators are prone to make mistakes, e.g., neglecting an input slot value and generating a redundant slot value. Prior works refer this to hallucination phenomenon. In this paper, we study slot consistency for building reliable NLG systems with all slot values of input dialogue act (DA) properly generated in output sentences. We propose Iterative Rectification Network (IRN) for improving general NLG systems to produce both correct and fluent responses. It applies a bootstrapping algorithm to sample training candidates and uses reinforcement learning to incorporate discrete reward related to slot inconsistency into training. Comprehensive studies have been conducted on multiple benchmark datasets, showing that the proposed methods have significantly reduced the slot error rate (ERR) for all strong baselines. Human evaluations also have confirmed its effectiveness.

pdf bib
Zero-Shot Transfer Learning with Synthesized Data for Multi-Domain Dialogue State Tracking
Giovanni Campagna | Agata Foryciarz | Mehrad Moradshahi | Monica Lam

Zero-shot transfer learning for multi-domain dialogue state tracking can allow us to handle new domains without incurring the high cost of data acquisition. This paper proposes new zero-short transfer learning technique for dialogue state tracking where the in-domain training data are all synthesized from an abstract dialogue model and the ontology of the domain. We show that data augmentation through synthesized data can improve the accuracy of zero-shot learning for both the TRADE model and the BERT-based SUMBT model on the MultiWOZ 2.1 dataset. We show training with only synthesized in-domain data on the SUMBT model can reach about 2/3 of the accuracy obtained with the full training dataset. We improve the zero-shot learning state of the art on average across domains by 21 %.

pdf bib
Reverse Engineering Configurations of Neural Text Generation Models
Yi Tay | Dara Bahri | Che Zheng | Clifford Brunk | Donald Metzler | Andrew Tomkins

Recent advances in neural text generation modeling have resulted in a number of societal concerns related to how such approaches might be used in malicious ways. It is therefore desirable to develop a deeper understanding of the fundamental properties of such models. The study of artifacts that emerge in machine generated text as a result of modeling choices is a nascent research area. To this end, the extent and degree to which these artifacts surface in generated text is still unclear. In the spirit of better understanding generative text models and their artifacts, we propose the new task of distinguishing which of several variants of a given model generated some piece of text. Specifically, we conduct an extensive suite of diagnostic tests to observe whether modeling choices (e.g., sampling methods, top-k probabilities, model architectures, etc.) leave detectable artifacts in the text they generate. Our key finding, which is backed by a rigorous set of experiments, is that such artifacts are present and that different modeling choices can be inferred by looking at generated text alone. This suggests that neural text generators may actually be more sensitive to various modeling choices than previously thought.

pdf bib
Unsupervised Paraphrasing by Simulated Annealing
Xianggen Liu | Lili Mou | Fandong Meng | Hao Zhou | Jie Zhou | Sen Song

We propose UPSA, a novel approach that accomplishes Unsupervised Paraphrasing by Simulated Annealing. We model paraphrase generation as an optimization problem and propose a sophisticated objective function, involving semantic similarity, expression diversity, and language fluency of paraphrases. UPSA searches the sentence space towards this objective by performing a sequence of local editing. We evaluate our approach on various datasets, namely, Quora, Wikianswers, MSCOCO, and Twitter. Extensive results show that UPSA achieves the state-of-the-art performance compared with previous unsupervised methods in terms of both automatic and human evaluations. Further, our approach outperforms most existing domain-adapted supervised models, showing the generalizability of UPSA.

pdf bib
Every Document Owns Its Structure : Inductive Text Classification via Graph Neural Networks
Yufeng Zhang | Xueli Yu | Zeyu Cui | Shu Wu | Zhongzhen Wen | Liang Wang

Text classification is fundamental in natural language processing (NLP) and Graph Neural Networks (GNN) are recently applied in this task. However, the existing graph-based works can neither capture the contextual word relationships within each document nor fulfil the inductive learning of new words. Therefore in this work, to overcome such problems, we propose TextING for inductive text classification via GNN. We first build individual graphs for each document and then use GNN to learn the fine-grained word representations based on their local structure, which can also effectively produce embeddings for unseen words in the new document. Finally, the word nodes are aggregated as the document embedding. Extensive experiments on four benchmark datasets show that our method outperforms state-of-the-art text classification methods.

pdf bib
Neural Topic Modeling with Bidirectional Adversarial Training
Rui Wang | Xuemeng Hu | Deyu Zhou | Yulan He | Yuxuan Xiong | Chenchen Ye | Haiyang Xu

Recent years have witnessed a surge of interests of using neural topic models for automatic topic extraction from text, since they avoid the complicated mathematical derivations for model inference as in traditional topic models such as Latent Dirichlet Allocation (LDA). However, these models either typically assume improper prior (e.g. Gaussian or Logistic Normal) over latent topic space or could not infer topic distribution for a given document. To address these limitations, we propose a neural topic modeling approach, called Bidirectional Adversarial Topic (BAT) model, which represents the first attempt of applying bidirectional adversarial training for neural topic modeling. The proposed BAT builds a two-way projection between the document-topic distribution and the document-word distribution. It uses a generator to capture the semantic patterns from texts and an encoder for topic inference. Furthermore, to incorporate word relatedness information, the Bidirectional Adversarial Topic model with Gaussian (Gaussian-BAT) is extended from BAT. To verify the effectiveness of BAT and Gaussian-BAT, three benchmark corpora are used in our experiments. The experimental results show that BAT and Gaussian-BAT obtain more coherent topics, outperforming several competitive baselines. Moreover, when performing text clustering based on the extracted topics, our models outperform all the baselines, with more significant improvements achieved by Gaussian-BAT where an increase of near 6 % is observed in accuracy.

pdf bib
Content Word Aware Neural Machine Translation
Kehai Chen | Rui Wang | Masao Utiyama | Eiichiro Sumita

Neural machine translation (NMT) encodes the source sentence in a universal way to generate the target sentence word-by-word. However, NMT does not consider the importance of word in the sentence meaning, for example, some words (i.e., content words) express more important meaning than others (i.e., function words). To address this limitation, we first utilize word frequency information to distinguish between content and function words in a sentence, and then design a content word-aware NMT to improve translation performance. Empirical results on the WMT14 English-to-German, WMT14 English-to-French, and WMT17 Chinese-to-English translation tasks show that the proposed methods can significantly improve the performance of Transformer-based NMT.

pdf bib
Evaluating Explanation Methods for Neural Machine Translation
Jierui Li | Lemao Liu | Huayang Li | Guanlin Li | Guoping Huang | Shuming Shi

Recently many efforts have been devoted to interpreting the black-box NMT models, but little progress has been made on metrics to evaluate explanation methods. Word Alignment Error Rate can be used as such a metric that matches human understanding, however, it can not measure explanation methods on those target words that are not aligned to any source word. This paper thereby makes an initial attempt to evaluate explanation methods from an alternative viewpoint. To this end, it proposes a principled metric based on fidelity in regard to the predictive behavior of the NMT model. As the exact computation for this metric is intractable, we employ an efficient approach as its approximation. On six standard translation tasks, we quantitatively evaluate several explanation methods in terms of the proposed metric and we reveal some valuable findings for these explanation methods in our experiments.

pdf bib
Jointly Masked Sequence-to-Sequence Model for Non-Autoregressive Neural Machine Translation
Junliang Guo | Linli Xu | Enhong Chen

The masked language model has received remarkable attention due to its effectiveness on various natural language processing tasks. However, few works have adopted this technique in the sequence-to-sequence models. In this work, we introduce a jointly masked sequence-to-sequence model and explore its application on non-autoregressive neural machine translation~(NAT). Specifically, we first empirically study the functionalities of the encoder and the decoder in NAT models, and find that the encoder takes a more important role than the decoder regarding the translation quality. Therefore, we propose to train the encoder more rigorously by masking the encoder input while training. As for the decoder, we propose to train it based on the consecutive masking of the decoder input with an n-gram loss function to alleviate the problem of translating duplicate words. The two types of masks are applied to the model jointly at the training stage. We conduct experiments on five benchmark machine translation tasks, and our model can achieve 27.69/32.24 BLEU scores on WMT14 English-German / German-English tasks with 5 + times speed up compared with an autoregressive model.n-gram loss function to alleviate the problem of translating duplicate words. The two types of masks are applied to the model jointly at the training stage. We conduct experiments on five benchmark machine translation tasks, and our model can achieve 27.69/32.24 BLEU scores on WMT14 English-German/German-English tasks with 5+ times speed up compared with an autoregressive model.

pdf bib
Lipschitz Constrained Parameter Initialization for Deep Transformers
Hongfei Xu | Qiuhui Liu | Josef van Genabith | Deyi Xiong | Jingyi Zhang

The Transformer translation model employs residual connection and layer normalization to ease the optimization difficulties caused by its multi-layer encoder / decoder structure. Previous research shows that even with residual connection and layer normalization, deep Transformers still have difficulty in training, and particularly Transformer models with more than 12 encoder / decoder layers fail to converge. In this paper, we first empirically demonstrate that a simple modification made in the official implementation, which changes the computation order of residual connection and layer normalization, can significantly ease the optimization of deep Transformers. We then compare the subtle differences in computation order in considerable detail, and present a parameter initialization method that leverages the Lipschitz constraint on the initialization of Transformer parameters that effectively ensures training convergence. In contrast to findings in previous research we further demonstrate that with Lipschitz parameter initialization, deep Transformers with the original computation order can converge, and obtain significant BLEU improvements with up to 24 layers. In contrast to previous research which focuses on deep encoders, our approach additionally enables Transformers to also benefit from deep decoders.

pdf bib
Multiscale Collaborative Deep Models for Neural Machine Translation
Xiangpeng Wei | Heng Yu | Yue Hu | Yue Zhang | Rongxiang Weng | Weihua Luo

Recent evidence reveals that Neural Machine Translation (NMT) models with deeper neural networks can be more effective but are difficult to train. In this paper, we present a MultiScale Collaborative (MSC) framework to ease the training of NMT models that are substantially deeper than those used previously. We explicitly boost the gradient back-propagation from top to bottom levels by introducing a block-scale collaboration mechanism into deep NMT models. Then, instead of forcing the whole encoder stack directly learns a desired representation, we let each encoder block learns a fine-grained representation and enhance it by encoding spatial dependencies using a context-scale collaboration. We provide empirical evidence showing that the MSC nets are easy to optimize and can obtain improvements of translation quality from considerably increased depth. On IWSLT translation tasks with three translation directions, our extremely deep models (with 72-layer encoders) surpass strong baselines by +2.2~+3.1 BLEU points. In addition, our deep MSC achieves a BLEU score of 30.56 on WMT14 English-to-German task that significantly outperforms state-of-the-art deep NMT models. We have included the source code in supplementary materials.

pdf bib
Opportunistic Decoding with Timely Correction for Simultaneous Translation
Renjie Zheng | Mingbo Ma | Baigong Zheng | Kaibo Liu | Liang Huang

Simultaneous translation has many important application scenarios and attracts much attention from both academia and industry recently. Most existing frameworks, however, have difficulties in balancing between the translation quality and latency, i.e., the decoding policy is usually either too aggressive or too conservative. We propose an opportunistic decoding technique with timely correction ability, which always (over-)generates a certain mount of extra words at each step to keep the audience on track with the latest information. At the same time, it also corrects, in a timely fashion, the mistakes in the former overgenerated words when observing more source context to ensure high translation quality. Experiments show our technique achieves substantial reduction in latency and up to +3.1 increase in BLEU, with revision rate under 8 % in Chinese-to-English and English-to-Chinese translation.

pdf bib
A Formal Hierarchy of RNN ArchitecturesRNN Architectures
William Merrill | Gail Weiss | Yoav Goldberg | Roy Schwartz | Noah A. Smith | Eran Yahav

We develop a formal hierarchy of the expressive capacity of RNN architectures. The hierarchy is based on two formal properties : space complexity, which measures the RNN’s memory, and rational recurrence, defined as whether the recurrent update can be described by a weighted finite-state machine. We place several RNN variants within this hierarchy. For example, we prove the LSTM is not rational, which formally separates it from the related QRNN (Bradbury et al., 2016). We also show how these models’ expressive capacity is expanded by stacking multiple layers or composing them with different pooling functions. Our results build on the theory of saturated RNNs (Merrill, 2019). While formally extending these findings to unsaturated RNNs is left to future work, we hypothesize that the practical learnable capacity of unsaturated RNNs obeys a similar hierarchy. We provide empirical results to support this conjecture. Experimental findings from training unsaturated networks on formal languages support this conjecture.

pdf bib
Integrating Semantic and Structural Information with Graph Convolutional Network for Controversy Detection
Lei Zhong | Juan Cao | Qiang Sheng | Junbo Guo | Ziang Wang

Identifying controversial posts on social media is a fundamental task for mining public sentiment, assessing the influence of events, and alleviating the polarized views. However, existing methods fail to 1) effectively incorporate the semantic information from content-related posts ; 2) preserve the structural information for reply relationship modeling ; 3) properly handle posts from topics dissimilar to those in the training set. To overcome the first two limitations, we propose Topic-Post-Comment Graph Convolutional Network (TPC-GCN), which integrates the information from the graph structure and content of topics, posts, and comments for post-level controversy detection. As to the third limitation, we extend our model to Disentangled TPC-GCN (DTPC-GCN), to disentangle topic-related and topic-unrelated features and then fuse dynamically. Extensive experiments on two real-world datasets demonstrate that our models outperform existing methods. Analysis of the results and cases proves that our models can integrate both semantic and structural information with significant generalizability.

pdf bib
Efficient Dialogue State Tracking by Selectively Overwriting Memory
Sungdong Kim | Sohee Yang | Gyuwan Kim | Sang-Woo Lee

Recent works in dialogue state tracking (DST) focus on an open vocabulary-based setting to resolve scalability and generalization issues of the predefined ontology-based approaches. However, they are inefficient in that they predict the dialogue state at every turn from scratch. Here, we consider dialogue state as an explicit fixed-sized memory and propose a selectively overwriting mechanism for more efficient DST. This mechanism consists of two steps : (1) predicting state operation on each of the memory slots, and (2) overwriting the memory with new values, of which only a few are generated according to the predicted state operations. Our method decomposes DST into two sub-tasks and guides the decoder to focus only on one of the tasks, thus reducing the burden of the decoder. This enhances the effectiveness of training and DST performance. Our SOM-DST (Selectively Overwriting Memory for Dialogue State Tracking) model achieves state-of-the-art joint goal accuracy with 51.72 % in MultiWOZ 2.0 and 53.01 % in MultiWOZ 2.1 in an open vocabulary-based DST setting. In addition, we analyze the accuracy gaps between the current and the ground truth-given situations and suggest that it is a promising direction to improve state operation prediction to boost the DST performance.

pdf bib
Learning Low-Resource End-To-End Goal-Oriented Dialog for Fast and Reliable System Deployment
Yinpei Dai | Hangyu Li | Chengguang Tang | Yongbin Li | Jian Sun | Xiaodan Zhu

Existing end-to-end dialog systems perform less effectively when data is scarce. To obtain an acceptable success in real-life online services with only a handful of training examples, both fast adaptability and reliable performance are highly desirable for dialog systems. In this paper, we propose the Meta-Dialog System (MDS), which combines the advantages of both meta-learning approaches and human-machine collaboration. We evaluate our methods on a new extended-bAbI dataset and a transformed MultiWOZ dataset for low-resource goal-oriented dialog learning. Experimental results show that MDS significantly outperforms non-meta-learning baselines and can achieve more than 90 % per-turn accuracies with only 10 dialogs on the extended-bAbI dataset.

pdf bib
Response-Anticipated Memory for On-Demand Knowledge Integration in Response Generation
Zhiliang Tian | Wei Bi | Dongkyu Lee | Lanqing Xue | Yiping Song | Xiaojiang Liu | Nevin L. Zhang

Neural conversation models are known to generate appropriate but non-informative responses in general. A scenario where informativeness can be significantly enhanced is Conversing by Reading (CbR), where conversations take place with respect to a given external document. In previous work, the external document is utilized by (1) creating a context-aware document memory that integrates information from the document and the conversational context, and then (2) generating responses referring to the memory. In this paper, we propose to create the document memory with some anticipated responses in mind. This is achieved using a teacher-student framework. The teacher is given the external document, the context, and the ground-truth response, and learns how to build a response-aware document memory from three sources of information. The student learns to construct a response-anticipated document memory from the first two sources, and teacher’s insight on memory creation. Empirical results show that our model outperforms the previous state-of-the-art for the CbR task.

pdf bib
Towards Unsupervised Language Understanding and Generation by Joint Dual Learning
Shang-Yu Su | Chao-Wei Huang | Yun-Nung Chen

In modular dialogue systems, natural language understanding (NLU) and natural language generation (NLG) are two critical components, where NLU extracts the semantics from the given texts and NLG is to construct corresponding natural language sentences based on the input semantic representations. However, the dual property between understanding and generation has been rarely explored. The prior work is the first attempt that utilized the duality between NLU and NLG to improve the performance via a dual supervised learning framework. However, the prior work still learned both components in a supervised manner ; instead, this paper introduces a general learning framework to effectively exploit such duality, providing flexibility of incorporating both supervised and unsupervised learning algorithms to train language understanding and generation models in a joint fashion. The benchmark experiments demonstrate that the proposed approach is capable of boosting the performance of both NLU and NLG. The source code is available at :

pdf bib
USR : An Unsupervised and Reference Free Evaluation Metric for Dialog GenerationUSR: An Unsupervised and Reference Free Evaluation Metric for Dialog Generation
Shikib Mehri | Maxine Eskenazi

The lack of meaningful automatic evaluation metrics for dialog has impeded open-domain dialog research. Standard language generation metrics have been shown to be ineffective for evaluating dialog models. To this end, this paper presents USR, an UnSupervised and Reference-free evaluation metric for dialog. USR is a reference-free metric that trains unsupervised models to measure several desirable qualities of dialog. USR is shown to strongly correlate with human judgment on both Topical-Chat (turn-level : 0.42, system-level : 1.0) and PersonaChat (turn-level : 0.48 and system-level : 1.0). USR additionally produces interpretable measures for several desirable properties of dialog.

pdf bib
Improved Natural Language Generation via Loss Truncation
Daniel Kang | Tatsunori B. Hashimoto

Neural language models are usually trained to match the distributional properties of large-scale corpora by minimizing the log loss. While straightforward to optimize, this approach forces the model to reproduce all variations in the dataset, including noisy and invalid references (e.g., misannotations and hallucinated facts). Even a small fraction of noisy data can degrade the performance of log loss. As an alternative, prior work has shown that minimizing the distinguishability of generated samples is a principled and robust loss that can handle invalid references. However, distinguishability has not been used in practice due to challenges in optimization and estimation. We propose loss truncation : a simple and scalable procedure which adaptively removes high log loss examples as a way to optimize for distinguishability. Empirically, we demonstrate that loss truncation outperforms existing baselines on distinguishability on a summarization task. Furthermore, we show that samples generated by the loss truncation model have factual accuracy ratings that exceed those of baselines and match human references.

pdf bib
Rigid Formats Controlled Text Generation
Piji Li | Haisong Zhang | Xiaojiang Liu | Shuming Shi

Neural text generation has made tremendous progress in various tasks. One common characteristic of most of the tasks is that the texts are not restricted to some rigid formats when generating. However, we may confront some special text paradigms such as Lyrics (assume the music score is given), Sonnet, SongCi (classical Chinese poetry of the Song dynasty), etc. The typical characteristics of these texts are in three folds : (1) They must comply fully with the rigid predefined formats. (2) They must obey some rhyming schemes. (3) Although they are restricted to some formats, the sentence integrity must be guaranteed. To the best of our knowledge, text generation based on the predefined rigid formats has not been well investigated. Therefore, we propose a simple and elegant framework named SongNet to tackle this problem. The backbone of the framework is a Transformer-based auto-regressive language model. Sets of symbols are tailor-designed to improve the modeling performance especially on format, rhyme, and sentence integrity. We improve the attention mechanism to impel the model to capture some future information on the format. A pre-training and fine-tuning framework is designed to further improve the generation quality. Extensive experiments conducted on two collected corpora demonstrate that our proposed framework generates significantly better results in terms of both automatic metrics and the human evaluation.

Syn-QG : Syntactic and Shallow Semantic Rules for Question GenerationQG: Syntactic and Shallow Semantic Rules for Question Generation
Kaustubh Dhole | Christopher D. Manning

Question Generation (QG) is fundamentally a simple syntactic transformation ; however, many aspects of semantics influence what questions are good to form. We implement this observation by developing Syn-QG, a set of transparent syntactic rules leveraging universal dependencies, shallow semantic parsing, lexical resources, and custom rules which transform declarative sentences into question-answer pairs. We utilize PropBank argument descriptions and VerbNet state predicates to incorporate shallow semantic content, which helps generate questions of a descriptive nature and produce inferential and semantically richer questions than existing systems. In order to improve syntactic fluency and eliminate grammatically incorrect questions, we employ back-translation over the output of these syntactic rules. A set of crowd-sourced evaluations shows that our system can generate a larger number of highly grammatical and relevant questions than previous QG systems and that back-translation drastically improves grammaticality at a slight cost of generating irrelevant questions.

pdf bib
Tree-Structured Neural Topic ModelTree-Structured Neural Topic Model
Masaru Isonuma | Junichiro Mori | Danushka Bollegala | Ichiro Sakata

This paper presents a tree-structured neural topic model, which has a topic distribution over a tree with an infinite number of branches. Our model parameterizes an unbounded ancestral and fraternal topic distribution by applying doubly-recurrent neural networks. With the help of autoencoding variational Bayes, our model improves data scalability and achieves competitive performance when inducing latent topics and tree structures, as compared to a prior tree-structured topic model (Blei et al., 2010). This work extends the tree-structured topic model such that it can be incorporated with neural models for downstream tasks.

pdf bib
Unsupervised FAQ Retrieval with Question Generation and BERTFAQ Retrieval with Question Generation and BERT
Yosi Mass | Boaz Carmeli | Haggai Roitman | David Konopnicki

We focus on the task of Frequently Asked Questions (FAQ) retrieval. A given user query can be matched against the questions and/or the answers in the FAQ. We present a fully unsupervised method that exploits the FAQ pairs to train two BERT models. The two models match user queries to FAQ answers and questions, respectively. We alleviate the missing labeled data of the latter by automatically generating high-quality question paraphrases. We show that our model is on par and even outperforms supervised models on existing datasets.

pdf bib
Interpreting Twitter User GeolocationTwitter User Geolocation
Ting Zhong | Tianliang Wang | Fan Zhou | Goce Trajcevski | Kunpeng Zhang | Yi Yang

Identifying user geolocation in online social networks is an essential task in many location-based applications. Existing methods rely on the similarity of text and network structure, however, they suffer from a lack of interpretability on the corresponding results, which is crucial for understanding model behavior. In this work, we adopt influence functions to interpret the behavior of GNN-based models by identifying the importance of training users when predicting the locations of the testing users. This methodology helps with providing meaningful explanations on prediction results. Furthermore, it also initiates an attempt to uncover the so-called black-box GNN-based models by investigating the effect of individual nodes.

pdf bib
A Frame-based Sentence Representation for Machine Reading Comprehension
Shaoru Guo | Ru Li | Hongye Tan | Xiaoli Li | Yong Guan | Hongyan Zhao | Yueping Zhang

Sentence representation (SR) is the most crucial and challenging task in Machine Reading Comprehension (MRC). MRC systems typically only utilize the information contained in the sentence itself, while human beings can leverage their semantic knowledge. To bridge the gap, we proposed a novel Frame-based Sentence Representation (FSR) method, which employs frame semantic knowledge to facilitate sentence modelling. Specifically, different from existing methods that only model lexical units (LUs), Frame Representation Models, which utilize both LUs in frame and Frame-to-Frame (F-to-F) relations, are designed to model frames and sentences with attention schema. Our proposed FSR method is able to integrate multiple-frame semantic information to get much better sentence representations. Our extensive experimental results show that it performs better than state-of-the-art technologies on machine reading comprehension task.

pdf bib
Enhancing Answer Boundary Detection for Multilingual Machine Reading Comprehension
Fei Yuan | Linjun Shou | Xuanyu Bai | Ming Gong | Yaobo Liang | Nan Duan | Yan Fu | Daxin Jiang

Multilingual pre-trained models could leverage the training data from a rich source language (such as English) to improve performance on low resource languages. However, the transfer quality for multilingual Machine Reading Comprehension (MRC) is significantly worse than sentence classification tasks mainly due to the requirement of MRC to detect the word level answer boundary. In this paper, we propose two auxiliary tasks in the fine-tuning stage to create additional phrase boundary supervision : (1) A mixed MRC task, which translates the question or passage to other languages and builds cross-lingual question-passage pairs ; (2) A language-agnostic knowledge masking task by leveraging knowledge phrases mined from web. Besides, extensive experiments on two cross-lingual MRC datasets show the effectiveness of our proposed approach.

pdf bib
Explicit Memory Tracker with Coarse-to-Fine Reasoning for Conversational Machine Reading
Yifan Gao | Chien-Sheng Wu | Shafiq Joty | Caiming Xiong | Richard Socher | Irwin King | Michael Lyu | Steven C.H. Hoi

The goal of conversational machine reading is to answer user questions given a knowledge base text which may require asking clarification questions. Existing approaches are limited in their decision making due to struggles in extracting question-related rules and reasoning about them. In this paper, we present a new framework of conversational machine reading that comprises a novel Explicit Memory Tracker (EMT) to track whether conditions listed in the rule text have already been satisfied to make a decision. Moreover, our framework generates clarification questions by adopting a coarse-to-fine reasoning strategy, utilizing sentence-level entailment scores to weight token-level distributions. On the ShARC benchmark (blind, held-out) testset, EMT achieves new state-of-the-art results of 74.6 % micro-averaged decision accuracy and 49.5 BLEU4. We also show that EMT is more interpretable by visualizing the entailment-oriented reasoning process as the conversation flows. Code and models are released at

pdf bib
Code-Switching Patterns Can Be an Effective Route to Improve Performance of Downstream NLP Applications : A Case Study of Humour, Sarcasm and Hate Speech DetectionNLP Applications: A Case Study of Humour, Sarcasm and Hate Speech Detection
Srijan Bansal | Vishal Garimella | Ayush Suhane | Jasabanta Patro | Animesh Mukherjee

In this paper, we demonstrate how code-switching patterns can be utilised to improve various downstream NLP applications. In particular, we encode various switching features to improve humour, sarcasm and hate speech detection tasks. We believe that this simple linguistic observation can also be potentially helpful in improving other similar NLP applications.

pdf bib
Multimodal Quality Estimation for Machine Translation
Shu Okabe | Frédéric Blain | Lucia Specia

We propose approaches to Quality Estimation (QE) for Machine Translation that explore both text and visual modalities for Multimodal QE. We compare various multimodality integration and fusion strategies. For both sentence-level and document-level predictions, we show that state-of-the-art neural and feature-based QE frameworks obtain better results when using the additional modality.

pdf bib
The TechQA DatasetTechQA Dataset
Vittorio Castelli | Rishav Chakravarti | Saswati Dana | Anthony Ferritto | Radu Florian | Martin Franz | Dinesh Garg | Dinesh Khandelwal | Scott McCarley | Michael McCawley | Mohamed Nasr | Lin Pan | Cezar Pendus | John Pitrelli | Saurabh Pujar | Salim Roukos | Andrzej Sakrajda | Avi Sil | Rosario Uceda-Sosa | Todd Ward | Rong Zhang

We introduce TECHQA, a domain-adaptation question answering dataset for the technical support domain. The TECHQA corpus highlights two real-world issues from the automated customer support domain. First, it contains actual questions posed by users on a technical forum, rather than questions generated specifically for a competition or a task. Second, it has a real-world size 600 training, 310 dev, and 490 evaluation question / answer pairs thus reflecting the cost of creating large labeled datasets with actual data. Hence, TECHQA is meant to stimulate research in domain adaptation rather than as a resource to build QA systems from scratch. TECHQA was obtained by crawling the IBMDeveloper and DeveloperWorks forums for questions with accepted answers provided in an IBM Technotea technical document that addresses a specific technical issue. We also release a collection of the 801,998 Technotes available on the web as of April 4, 2019 as a companion resource that can be used to learn representations of the IT domain language.

pdf bib
Attend, Translate and Summarize : An Efficient Method for Neural Cross-Lingual Summarization
Junnan Zhu | Yu Zhou | Jiajun Zhang | Chengqing Zong

Cross-lingual summarization aims at summarizing a document in one language (e.g., Chinese) into another language (e.g., English). In this paper, we propose a novel method inspired by the translation pattern in the process of obtaining a cross-lingual summary. We first attend to some words in the source text, then translate them into the target language, and summarize to get the final summary. Specifically, we first employ the encoder-decoder attention distribution to attend to the source words. Second, we present three strategies to acquire the translation probability, which helps obtain the translation candidates for each source word. Finally, each summary word is generated either from the neural distribution or from the translation candidates of source words. Experimental results on Chinese-to-English and English-to-Chinese summarization tasks have shown that our proposed method can significantly outperform the baselines, achieving comparable performance with the state-of-the-art.

pdf bib
Improving Truthfulness of Headline Generation
Kazuki Matsumaru | Sho Takase | Naoaki Okazaki

Most studies on abstractive summarization report ROUGE scores between system and reference summaries. However, we have a concern about the truthfulness of generated summaries : whether all facts of a generated summary are mentioned in the source text. This paper explores improving the truthfulness in headline generation on two popular datasets. Analyzing headlines generated by the state-of-the-art encoder-decoder model, we show that the model sometimes generates untruthful headlines. We conjecture that one of the reasons lies in untruthful supervision data used for training the model. In order to quantify the truthfulness of article-headline pairs, we consider the textual entailment of whether an article entails its headline. After confirming quite a few untruthful instances in the datasets, this study hypothesizes that removing untruthful instances from the supervision data may remedy the problem of the untruthful behaviors of the model. Building a binary classifier that predicts an entailment relation between an article and its headline, we filter out untruthful instances from the supervision data. Experimental results demonstrate that the headline generation model trained on filtered supervision data shows no clear difference in ROUGE scores but remarkable improvements in automatic and manual evaluations of the generated headlines.

pdf bib
Self-Attention Guided Copy Mechanism for Abstractive Summarization
Song Xu | Haoran Li | Peng Yuan | Youzheng Wu | Xiaodong He | Bowen Zhou

Copy module has been widely equipped in the recent abstractive summarization models, which facilitates the decoder to extract words from the source into the summary. Generally, the encoder-decoder attention is served as the copy distribution, while how to guarantee that important words in the source are copied remains a challenge. In this work, we propose a Transformer-based model to enhance the copy mechanism. Specifically, we identify the importance of each source word based on the degree centrality with a directed graph built by the self-attention layer in the Transformer. We use the centrality of each source word to guide the copy process explicitly. Experimental results show that the self-attention graph provides useful guidance for the copy distribution. Our proposed models significantly outperform the baseline methods on the CNN / Daily Mail dataset and the Gigaword dataset.

pdf bib
Fast and Accurate Non-Projective Dependency Tree Linearization
Xiang Yu | Simon Tannert | Ngoc Thang Vu | Jonas Kuhn

We propose a graph-based method to tackle the dependency tree linearization task. We formulate the task as a Traveling Salesman Problem (TSP), and use a biaffine attention model to calculate the edge costs. We facilitate the decoding by solving the TSP for each subtree and combining the solution into a projective tree. We then design a transition system as post-processing, inspired by non-projective transition-based parsing, to obtain non-projective sentences. Our proposed method outperforms the state-of-the-art linearizer while being 10 times faster in training and decoding.

pdf bib
In Layman’s Terms : Semi-Open Relation Extraction from Scientific Texts
Ruben Kruiper | Julian Vincent | Jessica Chen-Burger | Marc Desmulliez | Ioannis Konstas

Information Extraction (IE) from scientific texts can be used to guide readers to the central information in scientific documents. But narrow IE systems extract only a fraction of the information captured, and Open IE systems do not perform well on the long and complex sentences encountered in scientific texts. In this work we combine the output of both types of systems to achieve Semi-Open Relation Extraction, a new task that we explore in the Biology domain. First, we present the Focused Open Biological Information Extraction (FOBIE) dataset and use FOBIE to train a state-of-the-art narrow scientific IE system to extract trade-off relations and arguments that are central to biology texts. We then run both the narrow IE system and a state-of-the-art Open IE system on a corpus of 10 K open-access scientific biological texts. We show that a significant amount (65 %) of erroneous and uninformative Open IE extractions can be filtered using narrow IE extractions. Furthermore, we show that the retained extractions are significantly more often informative to a reader.

pdf bib
Probing Linguistic Features of Sentence-Level Representations in Neural Relation Extraction
Christoph Alt | Aleksandra Gabryszak | Leonhard Hennig

Despite the recent progress, little is known about the features captured by state-of-the-art neural relation extraction (RE) models. Common methods encode the source sentence, conditioned on the entity mentions, before classifying the relation. However, the complexity of the task makes it difficult to understand how encoder architecture and supporting linguistic knowledge affect the features learned by the encoder. We introduce 14 probing tasks targeting linguistic properties relevant to RE, and we use them to study representations learned by more than 40 different encoder architecture and linguistic feature combinations trained on two datasets, TACRED and SemEval 2010 Task 8. We find that the bias induced by the architecture and the inclusion of linguistic features are clearly expressed in the probing task performance. For example, adding contextualized word representations greatly increases performance on probing tasks with a focus on named entity and part-of-speech information, and yields better results in RE. In contrast, entity masking improves RE, but considerably lowers performance on entity type related probing tasks.

pdf bib
Character-Level Translation with Self-attention
Yingqiang Gao | Nikola I. Nikolov | Yuhuang Hu | Richard H.R. Hahnloser

We explore the suitability of self-attention models for character-level neural machine translation. We test the standard transformer model, as well as a novel variant in which the encoder block combines information from nearby characters using convolutions. We perform extensive experiments on WMT and UN datasets, testing both bilingual and multilingual translation to English using up to three input languages (French, Spanish, and Chinese). Our transformer variant consistently outperforms the standard transformer at the character-level and converges faster while learning more robust character-level alignments.

pdf bib
End-to-End Neural Word Alignment Outperforms GIZA++GIZA++
Thomas Zenkel | Joern Wuebker | John DeNero

Word alignment was once a core unsupervised learning task in natural language processing because of its essential role in training statistical machine translation (MT) models. Although unnecessary for training neural MT models, word alignment still plays an important role in interactive applications of neural machine translation, such as annotation transfer and lexicon injection. While statistical MT methods have been replaced by neural approaches with superior performance, the twenty-year-old GIZA++ toolkit remains a key component of state-of-the-art word alignment systems. Prior work on neural word alignment has only been able to outperform GIZA++ by using its output during training. We present the first end-to-end neural word alignment method that consistently outperforms GIZA++ on three data sets. Our approach repurposes a Transformer model trained for supervised translation to also serve as an unsupervised word alignment model in a manner that is tightly integrated and does not affect translation quality.

pdf bib
Parallel Sentence Mining by Constrained Decoding
Pinzhen Chen | Nikolay Bogoychev | Kenneth Heafield | Faheem Kirefu

We present a novel method to extract parallel sentences from two monolingual corpora, using neural machine translation. Our method relies on translating sentences in one corpus, but constraining the decoding by a prefix tree built on the other corpus. We argue that a neural machine translation system by itself can be a sentence similarity scorer and it efficiently approximates pairwise comparison with a modified beam search. When benchmarked on the BUCC shared task, our method achieves results comparable to other submissions.

pdf bib
Inflecting When There’s No Majority : Limitations of Encoder-Decoder Neural Networks as Cognitive Models for German PluralsGerman Plurals
Kate McCurdy | Sharon Goldwater | Adam Lopez

Can artificial neural networks learn to represent inflectional morphology and generalize to new words as human speakers do? Kirov and Cotterell (2018) argue that the answer is yes : modern Encoder-Decoder (ED) architectures learn human-like behavior when inflecting English verbs, such as extending the regular past tense form /-(e)d/ to novel words. However, their work does not address the criticism raised by Marcus et al. (1995): that neural models may learn to extend not the regular, but the most frequent class and thus fail on tasks like German number inflection, where infrequent suffixes like /-s/ can still be productively generalized. To investigate this question, we first collect a new dataset from German speakers (production and ratings of plural forms for novel nouns) that is designed to avoid sources of information unavailable to the ED model. The speaker data show high variability, and two suffixes evince ‘regular’ behavior, appearing more often with phonologically atypical inputs. Encoder-decoder models do generalize the most frequently produced plural class, but do not show human-like variability or ‘regular’ extension of these other plural markers. We conclude that modern neural models may still struggle with minority-class generalization.

pdf bib
Modelling Suspense in Short Stories as Uncertainty Reduction over Neural Representation
David Wilmot | Frank Keller

Suspense is a crucial ingredient of narrative fiction, engaging readers and making stories compelling. While there is a vast theoretical literature on suspense, it is computationally not well understood. We compare two ways for modelling suspense : surprise, a backward-looking measure of how unexpected the current state is given the story so far ; and uncertainty reduction, a forward-looking measure of how unexpected the continuation of the story is. Both can be computed either directly over story representations or over their probability distributions. We propose a hierarchical language model that encodes stories and computes surprise and uncertainty reduction. Evaluating against short stories annotated with human suspense judgements, we find that uncertainty reduction over representations is the best predictor, resulting in near human accuracy. We also show that uncertainty reduction can be used to predict suspenseful events in movie synopses.

pdf bib
Automatic Detection of Generated Text is Easiest when Humans are Fooled
Daphne Ippolito | Daniel Duckworth | Chris Callison-Burch | Douglas Eck

Recent advancements in neural language modelling make it possible to rapidly generate vast amounts of human-sounding text. The capabilities of humans and automatic discriminators to detect machine-generated text have been a large source of research interest, but humans and machines rely on different cues to make their decisions. Here, we perform careful benchmarking and analysis of three popular sampling-based decoding strategiestop-_k _, nucleus sampling, and untruncated random samplingand show that improvements in decoding methods have primarily optimized for fooling humans. This comes at the expense of introducing statistical abnormalities that make detection easy for automatic systems. We also show that though both human and automatic detector performance improve with longer excerpt length, even multi-sentence excerpts can fool expert human raters over 30 % of the time. Our findings reveal the importance of using both human and automatic detectors to assess the humanness of text generation systems.

pdf bib
GPT-too : A Language-Model-First Approach for AMR-to-Text GenerationGPT-too: A Language-Model-First Approach for AMR-to-Text Generation
Manuel Mager | Ramón Fernandez Astudillo | Tahira Naseem | Md Arafat Sultan | Young-Suk Lee | Radu Florian | Salim Roukos

Abstract Meaning Representations (AMRs) are broad-coverage sentence-level semantic graphs. Existing approaches to generating text from AMR have focused on training sequence-to-sequence or graph-to-sequence models on AMR annotated data only. In this paper, we propose an alternative approach that combines a strong pre-trained language model with cycle consistency-based re-scoring. Despite the simplicity of the approach, our experimental results show these models outperform all previous techniques on the English LDC2017T10 dataset, including the recent use of transformer architectures. In addition to the standard evaluation metrics, we provide human evaluation experiments that further substantiate the strength of our approach.

pdf bib
Politeness Transfer : A Tag and Generate Approach
Aman Madaan | Amrith Setlur | Tanmay Parekh | Barnabas Poczos | Graham Neubig | Yiming Yang | Ruslan Salakhutdinov | Alan W Black | Shrimai Prabhumoye

This paper introduces a new task of politeness transfer which involves converting non-polite sentences to polite sentences while preserving the meaning. We also provide a dataset of more than 1.39 instances automatically labeled for politeness to encourage benchmark evaluations on this new task. We design a tag and generate pipeline that identifies stylistic attributes and subsequently generates a sentence in the target style while preserving most of the source content. For politeness as well as five other transfer tasks, our model outperforms the state-of-the-art methods on automatic metrics for content preservation, with a comparable or better performance on style transfer accuracy. Additionally, our model surpasses existing methods on human evaluations for grammaticality, meaning preservation and transfer accuracy across all the six style transfer tasks. The data and code is located at

pdf bib
BPE-Dropout : Simple and Effective Subword RegularizationBPE-Dropout: Simple and Effective Subword Regularization
Ivan Provilkov | Dmitrii Emelianenko | Elena Voita

Subword segmentation is widely used to address the open vocabulary problem in machine translation. The dominant approach to subword segmentation is Byte Pair Encoding (BPE), which keeps the most frequent words intact while splitting the rare ones into multiple tokens. While multiple segmentations are possible even with the same vocabulary, BPE splits words into unique sequences ; this may prevent a model from better learning the compositionality of words and being robust to segmentation errors. So far, the only way to overcome this BPE imperfection, its deterministic nature, was to create another subword segmentation algorithm (Kudo, 2018). In contrast, we show that BPE itself incorporates the ability to produce multiple segmentations of the same word. We introduce BPE-dropout-simple and effective subword regularization method based on and compatible with conventional BPE. It stochastically corrupts the segmentation procedure of BPE, which leads to producing multiple segmentations within the same fixed BPE framework. Using BPE-dropout during training and the standard BPE during inference improves translation quality up to 2.3 BLEU compared to BPE and up to 0.9 BLEU compared to the previous subword regularization.

pdf bib
Attend to Medical Ontologies : Content Selection for Clinical Abstractive Summarization
Sajad Sotudeh Gharebagh | Nazli Goharian | Ross Filice

Sequence-to-sequence (seq2seq) network is a well-established model for text summarization task. It can learn to produce readable content ; however, it falls short in effectively identifying key regions of the source. In this paper, we approach the content selection problem for clinical abstractive summarization by augmenting salient ontological terms into the summarizer. Our experiments on two publicly available clinical data sets (107,372 reports of MIMIC-CXR, and 3,366 reports of OpenI) show that our model statistically significantly boosts state-of-the-art results in terms of ROUGE metrics (with improvements : 2.9 % RG-1, 2.5 % RG-2, 1.9 % RG-L), in the healthcare domain where any range of improvement impacts patients’ welfare.

pdf bib
On Faithfulness and Factuality in Abstractive Summarization
Joshua Maynez | Shashi Narayan | Bernd Bohnet | Ryan McDonald

It is well known that the standard likelihood training and approximate decoding objectives in neural text generation models lead to less human-like responses for open-ended tasks such as language modeling and story generation. In this paper we have analyzed limitations of these models for abstractive document summarization and found that these models are highly prone to hallucinate content that is unfaithful to the input document. We conducted a large scale human evaluation of several neural abstractive summarization systems to better understand the types of hallucinations they produce. Our human annotators found substantial amounts of hallucinated content in all model generated summaries. However, our analysis does show that pretrained models are better summarizers not only in terms of raw metrics, i.e., ROUGE, but also in generating faithful and factual summaries as evaluated by humans. Furthermore, we show that textual entailment measures better correlate with faithfulness than standard metrics, potentially leading the way to automatic evaluation metrics as well as training and decoding criteria.

pdf bib
Recurrent Neural Network Language Models Always Learn English-Like Relative Clause AttachmentEnglish-Like Relative Clause Attachment
Forrest Davis | Marten van Schijndel

A standard approach to evaluating language models analyzes how models assign probabilities to valid versus invalid syntactic constructions (i.e. is a grammatical sentence more probable than an ungrammatical sentence). Our work uses ambiguous relative clause attachment to extend such evaluations to cases of multiple simultaneous valid interpretations, where stark grammaticality differences are absent. We compare model performance in English and Spanish to show that non-linguistic biases in RNN LMs advantageously overlap with syntactic structure in English but not Spanish. Thus, English models may appear to acquire human-like syntactic preferences, while models trained on Spanish fail to acquire comparable human-like preferences. We conclude by relating these results to broader concerns about the relationship between comprehension (i.e. typical language model use cases) and production (which generates the training data for language models), suggesting that necessary linguistic biases are not present in the training signal at all.

pdf bib
Speakers enhance contextually confusable words
Eric Meinhardt | Eric Bakovic | Leon Bergen

Recent work has found evidence that natural languages are shaped by pressures for efficient communication e.g. the more contextually predictable a word is, the fewer speech sounds or syllables it has (Piantadosi et al. Research on the degree to which speech and language are shaped by pressures for effective communication robustness in the face of noise and uncertainty has been more equivocal. We develop a measure of contextual confusability during word recognition based on psychoacoustic data. Applying this measure to naturalistic speech corpora, we find evidence suggesting that speakers alter their productions to make contextually more confusable words easier to understand.

pdf bib
Learning to Contextually Aggregate Multi-Source Supervision for Sequence Labeling
Ouyu Lan | Xiao Huang | Bill Yuchen Lin | He Jiang | Liyuan Liu | Xiang Ren

Sequence labeling is a fundamental task for a range of natural language processing problems. When used in practice, its performance is largely influenced by the annotation quality and quantity, and meanwhile, obtaining ground truth labels is often costly. In many cases, ground truth labels do not exist, but noisy annotations or annotations from different domains are accessible. In this paper, we propose a novel framework Consensus Network (ConNet) that can be trained on annotations from multiple sources (e.g., crowd annotation, cross-domain data). It learns individual representation for every source and dynamically aggregates source-specific knowledge by a context-aware attention module. Finally, it leads to a model reflecting the agreement (consensus) among multiple sources. We evaluate the proposed framework in two practical settings of multi-source learning : learning with crowd annotations and unsupervised cross-domain model adaptation. Extensive experimental results show that our model achieves significant improvements over existing methods in both settings. We also demonstrate that the method can apply to various tasks and cope with different encoders.

pdf bib
MixText : Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text ClassificationMixText: Linguistically-Informed Interpolation of Hidden Space for Semi-Supervised Text Classification
Jiaao Chen | Zichao Yang | Diyi Yang

This paper presents MixText, a semi-supervised learning method for text classification, which uses our newly designed data augmentation method called TMix. TMix creates a large amount of augmented training samples by interpolating text in hidden space. Moreover, we leverage recent advances in data augmentation to guess low-entropy labels for unlabeled data, hence making them as easy to use as labeled data. By mixing labeled, unlabeled and augmented data, MixText significantly outperformed current pre-trained and fined-tuned models and other state-of-the-art semi-supervised learning methods on several text classification benchmarks. The improvement is especially prominent when supervision is extremely limited. We have publicly released our code at

pdf bib
SMART : Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized OptimizationSMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization
Haoming Jiang | Pengcheng He | Weizhu Chen | Xiaodong Liu | Jianfeng Gao | Tuo Zhao

Transfer learning has fundamentally changed the landscape of natural language processing (NLP). Many state-of-the-art models are first pre-trained on a large text corpus and then fine-tuned on downstream tasks. However, due to limited data resources from downstream tasks and the extremely high complexity of pre-trained models, aggressive fine-tuning often causes the fine-tuned model to overfit the training data of downstream tasks and fail to generalize to unseen data. To address such an issue in a principled manner, we propose a new learning framework for robust and efficient fine-tuning for pre-trained models to attain better generalization performance. The proposed framework contains two important ingredients : 1. Smoothness-inducing regularization, which effectively manages the complexity of the model ; 2. Bregman proximal point optimization, which is an instance of trust-region methods and can prevent aggressive updating. Our experiments show that the proposed framework achieves new state-of-the-art performance on a number of NLP tasks including GLUE, SNLI, SciTail and ANLI. Moreover, it also outperforms the state-of-the-art T5 model, which is the largest pre-trained model containing 11 billion parameters, on GLUE.

pdf bib
Taxonomy Construction of Unseen Domains via Graph-based Cross-Domain Knowledge Transfer
Chao Shang | Sarthak Dash | Md. Faisal Mahbub Chowdhury | Nandana Mihindukulasooriya | Alfio Gliozzo

Extracting lexico-semantic relations as graph-structured taxonomies, also known as taxonomy construction, has been beneficial in a variety of NLP applications. Recently Graph Neural Network (GNN) has shown to be powerful in successfully tackling many tasks. However, there has been no attempt to exploit GNN to create taxonomies. In this paper, we propose Graph2Taxo, a GNN-based cross-domain transfer framework for the taxonomy construction task. Our main contribution is to learn the latent features of taxonomy construction from existing domains to guide the structure learning of an unseen domain. We also propose a novel method of directed acyclic graph (DAG) generation for taxonomy construction. Specifically, our proposed Graph2Taxo uses a noisy graph constructed from automatically extracted noisy hyponym hypernym candidate pairs, and a set of taxonomies for some known domains for training. The learned model is then used to generate taxonomy for a new unknown domain given a set of terms for that domain. Experiments on benchmark datasets from science and environment domains show that our approach attains significant improvements correspondingly over the state of the art.

pdf bib
A Girl Has A Name : Detecting Authorship Obfuscation
Asad Mahmood | Zubair Shafiq | Padmini Srinivasan

Authorship attribution aims to identify the author of a text based on the stylometric analysis. Authorship obfuscation, on the other hand, aims to protect against authorship attribution by modifying a text’s style. In this paper, we evaluate the stealthiness of state-of-the-art authorship obfuscation methods under an adversarial threat model. An obfuscator is stealthy to the extent an adversary finds it challenging to detect whether or not a text modified by the obfuscator is obfuscated a decision that is key to the adversary interested in authorship attribution. We show that the existing authorship obfuscation methods are not stealthy as their obfuscated texts can be identified with an average F1 score of 0.87. The reason for the lack of stealthiness is that these obfuscators degrade text smoothness, as ascertained by neural language models, in a detectable manner. Our results highlight the need to develop stealthy authorship obfuscation methods that can better protect the identity of an author seeking anonymity.

pdf bib
Efficient Strategies for Hierarchical Text Classification : External Knowledge and Auxiliary Tasks
Kervy Rivas Rojas | Gina Bustamante | Arturo Oncevay | Marco Antonio Sobrevilla Cabezudo

In hierarchical text classification, we perform a sequence of inference steps to predict the category of a document from top to bottom of a given class taxonomy. Most of the studies have focused on developing novels neural network architectures to deal with the hierarchical structure, but we prefer to look for efficient ways to strengthen a baseline model. We first define the task as a sequence-to-sequence problem. Afterwards, we propose an auxiliary synthetic task of bottom-up-classification. Then, from external dictionaries, we retrieve textual definitions for the classes of all the hierarchy’s layers, and map them into the word vector space. We use the class-definition embeddings as an additional input to condition the prediction of the next layer and in an adapted beam search. Whereas the modified search did not provide large gains, the combination of the auxiliary task and the additional input of class-definitions significantly enhance the classification accuracy. With our efficient approaches, we outperform previous studies, using a drastically reduced number of parameters, in two well-known English datasets.

pdf bib
SPECTER : Document-level Representation Learning using Citation-informed TransformersSPECTER: Document-level Representation Learning using Citation-informed Transformers
Arman Cohan | Sergey Feldman | Iz Beltagy | Doug Downey | Daniel Weld

Representation learning is a critical ingredient for natural language processing systems. Recent Transformer language models like BERT learn powerful textual representations, but these models are targeted towards token- and sentence-level training objectives and do not leverage information on inter-document relatedness, which limits their document-level representation power. For applications on scientific documents, such as classification and recommendation, accurate embeddings of documents are a necessity. We propose SPECTER, a new method to generate document-level embedding of scientific papers based on pretraining a Transformer language model on a powerful signal of document-level relatedness : the citation graph. Unlike existing pretrained language models, Specter can be easily applied to downstream applications without task-specific fine-tuning. Additionally, to encourage further research on document-level models, we introduce SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation. We show that Specter outperforms a variety of competitive baselines on the benchmark.

pdf bib
Semantic Scaffolds for Pseudocode-to-Code Generation
Ruiqi Zhong | Mitchell Stern | Dan Klein

We propose a method for program generation based on semantic scaffolds, lightweight structures representing the high-level semantic and syntactic composition of a program. By first searching over plausible scaffolds then using these as constraints for a beam search over programs, we achieve better coverage of the search space when compared with existing techniques. We apply our hierarchical search method to the SPoC dataset for pseudocode-to-code generation, in which we are given line-level natural language pseudocode annotations and aim to produce a program satisfying execution-based test cases. By using semantic scaffolds during inference, we achieve a 10 % absolute improvement in top-100 accuracy over the previous state-of-the-art. Additionally, we require only 11 candidates to reach the top-3000 performance of the previous best approach when tested against unseen problems, demonstrating a substantial improvement in efficiency.

pdf bib
Can We Predict New Facts with Open Knowledge Graph Embeddings? A Benchmark for Open Link Prediction
Samuel Broscheit | Kiril Gashteovski | Yanjie Wang | Rainer Gemulla

Open Information Extraction systems extract (subject text, relation text, object text) triples from raw text. Some triples are textual versions of facts, i.e., non-canonicalized mentions of entities and relations. In this paper, we investigate whether it is possible to infer new facts directly from the open knowledge graph without any canonicalization or any supervision from curated knowledge. For this purpose, we propose the open link prediction task, i.e., predicting test facts by completing (subject text, relation text,?) questions. An evaluation in such a setup raises the question if a correct prediction is actually a new fact that was induced by reasoning over the open knowledge graph or if it can be trivially explained. For example, facts can appear in different paraphrased textual variants, which can lead to test leakage. To this end, we propose an evaluation protocol and a methodology for creating the open link prediction benchmark OlpBench. We performed experiments with a prototypical knowledge graph embedding model for openlink prediction. While the task is very challenging, our results suggests that it is possible to predict genuinely new facts, which can not be trivially explained.

pdf bib
Integrating Multimodal Information in Large Pretrained Transformers
Wasifur Rahman | Md Kamrul Hasan | Sangwu Lee | AmirAli Bagher Zadeh | Chengfeng Mao | Louis-Philippe Morency | Ehsan Hoque

Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning the trained contextual models on task-specific datasets has been the key to achieving superior performance downstream. While fine-tuning these pre-trained models is straightforward for lexical applications (applications with only language modality), it is not trivial for multimodal language (a growing area in NLP focused on modeling face-to-face communication). More specifically, this is due to the fact that pre-trained models do n’t have the necessary components to accept two extra modalities of vision and acoustic. In this paper, we proposed an attachment to BERT and XLNet called Multimodal Adaptation Gate (MAG). MAG allows BERT and XLNet to accept multimodal nonverbal data during fine-tuning. It does so by generating a shift to internal representation of BERT and XLNet ; a shift that is conditioned on the visual and acoustic modalities. In our experiments, we study the commonly used CMU-MOSI and CMU-MOSEI datasets for multimodal sentiment analysis. Fine-tuning MAG-BERT and MAG-XLNet significantly boosts the sentiment analysis performance over previous baselines as well as language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time in the NLP community.

pdf bib
MultiQT : Multimodal learning for real-time question tracking in speechMultiQT: Multimodal learning for real-time question tracking in speech
Jakob D. Havtorn | Jan Latko | Joakim Edin | Lars Maaløe | Lasse Borgholt | Lorenzo Belgrano | Nicolai Jacobsen | Regitze Sdun | Željko Agić

We address a challenging and practical task of labeling questions in speech in real time during telephone calls to emergency medical services in English, which embeds within a broader decision support system for emergency call-takers. We propose a novel multimodal approach to real-time sequence labeling in speech. Our model treats speech and its own textual representation as two separate modalities or views, as it jointly learns from streamed audio and its noisy transcription into text via automatic speech recognition. Our results show significant gains of jointly learning from the two modalities when compared to text or audio only, under adverse noise and limited volume of training data. The results generalize to medical symptoms detection where we observe a similar pattern of improvements with multimodal learning.

pdf bib
Image-Chat : Engaging Grounded Conversations
Kurt Shuster | Samuel Humeau | Antoine Bordes | Jason Weston

To achieve the long-term goal of machines being able to engage humans in conversation, our models should captivate the interest of their speaking partners. Communication grounded in images, whereby a dialogue is conducted based on a given photo, is a setup naturally appealing to humans (Hu et al., 2014). In this work we study large-scale architectures and datasets for this goal. We test a set of neural architectures using state-of-the-art image and text representations, considering various ways to fuse the components. To test such models, we collect a dataset of grounded human-human conversations, where speakers are asked to play roles given a provided emotional mood or style, as the use of such traits is also a key factor in engagingness (Guo et al., 2019). Our dataset, Image-Chat, consists of 202k dialogues over 202k images using 215 possible style traits. Automatic metrics and human evaluations of engagingness show the efficacy of our approach ; in particular, we obtain state-of-the-art performance on the existing IGC task, and our best performing model is almost on par with humans on the Image-Chat test set (preferred 47.7 % of the time).

pdf bib
Neural Generation of Dialogue Response Timings
Matthew Roddy | Naomi Harte

The timings of spoken response offsets in human dialogue have been shown to vary based on contextual elements of the dialogue. We propose neural models that simulate the distributions of these response offsets, taking into account the response turn as well as the preceding turn. The models are designed to be integrated into the pipeline of an incremental spoken dialogue system (SDS). We evaluate our models using offline experiments as well as human listening tests. We show that human listeners consider certain response timings to be more natural based on the dialogue context. The introduction of these models into SDS pipelines could increase the perceived naturalness of interactions.

pdf bib
Bridging the Structural Gap Between Encoding and Decoding for Data-To-Text Generation
Chao Zhao | Marilyn Walker | Snigdha Chaturvedi

Generating sequential natural language descriptions from graph-structured data (e.g., knowledge graph) is challenging, partly because of the structural differences between the input graph and the output text. Hence, popular sequence-to-sequence models, which require serialized input, are not a natural fit for this task. Graph neural networks, on the other hand, can better encode the input graph but broaden the structural gap between the encoder and decoder, making faithful generation difficult. To narrow this gap, we propose DualEnc, a dual encoding model that can not only incorporate the graph structure, but can also cater to the linear structure of the output text. Empirical comparisons with strong single-encoder baselines demonstrate that dual encoding can significantly improve the quality of the generated text.

pdf bib
A Batch Normalized Inference Network Keeps the KL Vanishing AwayKL Vanishing Away
Qile Zhu | Wei Bi | Xiaojiang Liu | Xiyao Ma | Xiaolin Li | Dapeng Wu

Variational Autoencoder (VAE) is widely used as a generative model to approximate a model’s posterior on latent variables by combining the amortized variational inference and deep neural networks. However, when paired with strong autoregressive decoders, VAE often converges to a degenerated local optimum known as posterior collapse. Previous approaches consider the KullbackLeibler divergence (KL) individual for each datapoint. We propose to let the KL follow a distribution across the whole dataset, and analyze that it is sufficient to prevent posterior collapse by keeping the expectation of the KL’s distribution positive. Then we propose Batch Normalized-VAE (BN-VAE), a simple but effective approach to set a lower bound of the expectation by regularizing the distribution of the approximate posterior’s parameters. Without introducing any new model component or modifying the objective, our approach can avoid the posterior collapse effectively and efficiently. We further show that the proposed BN-VAE can be extended to conditional VAE (CVAE). Empirically, our approach surpasses strong autoregressive baselines on language modeling, text classification and dialogue generation, and rivals more complex approaches while keeping almost the same training time as VAE.

pdf bib
Low Resource Sequence Tagging using Sentence Reconstruction
Tal Perl | Sriram Chaudhury | Raja Giryes

This work revisits the task of training sequence tagging models with limited resources using transfer learning. We investigate several proposed approaches introduced in recent works and suggest a new loss that relies on sentence reconstruction from normalized embeddings. Specifically, our method demonstrates how by adding a decoding layer for sentence reconstruction, we can improve the performance of various baselines. We show improved results on the CoNLL02 NER and UD 1.2 POS datasets and demonstrate the power of the method for transfer learning with low-resources achieving 0.6 F1 score in Dutch using only one sample from it.

pdf bib
Span Selection Pre-training for Question Answering
Michael Glass | Alfio Gliozzo | Rishav Chakravarti | Anthony Ferritto | Lin Pan | G P Shrivatsa Bhargav | Dinesh Garg | Avi Sil

BERT (Bidirectional Encoder Representations from Transformers) and related pre-trained Transformers have provided large gains across many language understanding tasks, achieving a new state-of-the-art (SOTA). BERT is pretrained on two auxiliary tasks : Masked Language Model and Next Sentence Prediction. In this paper we introduce a new pre-training task inspired by reading comprehension to better align the pre-training from memorization to understanding. Span Selection PreTraining (SSPT) poses cloze-like training instances, but rather than draw the answer from the model’s parameters, it is selected from a relevant passage. We find significant and consistent improvements over both BERT-BASE and BERT-LARGE on multiple Machine Reading Comprehension (MRC) datasets. Specifically, our proposed model has strong empirical evidence as it obtains SOTA results on Natural Questions, a new benchmark MRC dataset, outperforming BERT-LARGE by 3 F1 points on short answer prediction. We also show significant impact in HotpotQA, improving answer prediction F1 by 4 points and supporting fact prediction F1 by 1 point and outperforming the previous best system. Moreover, we show that our pre-training approach is particularly effective when training data is limited, improving the learning curve by a large amount.

pdf bib
schuBERT : Optimizing Elements of BERTBERT: Optimizing Elements of BERT
Ashish Khetan | Zohar Karnin

Transformers have gradually become a key component for many state-of-the-art natural language representation models. A recent Transformer based model- BERTachieved state-of-the-art results on various natural language processing tasks, including GLUE, SQuAD v1.1, and SQuAD v2.0. This model however is computationally prohibitive and has a huge number of parameters. In this work we revisit the architecture choices of BERT in efforts to obtain a lighter model. We focus on reducing the number of parameters yet our methods can be applied towards other objectives such FLOPs or latency. We show that much efficient light BERT models can be obtained by reducing algorithmically chosen correct architecture design dimensions rather than reducing the number of Transformer encoder layers. In particular, our schuBERT gives 6.6 % higher average accuracy on GLUE and SQuAD datasets as compared to BERT with three encoder layers while having the same number of parameters.

pdf bib
Predicting Degrees of Technicality in Automatic Terminology Extraction
Anna Hätty | Dominik Schlechtweg | Michael Dorna | Sabine Schulte im Walde

While automatic term extraction is a well-researched area, computational approaches to distinguish between degrees of technicality are still understudied. We semi-automatically create a German gold standard of technicality across four domains, and illustrate the impact of a web-crawled general-language corpus on technicality prediction. When defining a classification approach that combines general-language and domain-specific word embeddings, we go beyond previous work and align vector spaces to gain comparative embeddings. We suggest two novel models to exploit general- vs. domain-specific comparisons : a simple neural network model with pre-computed comparative-embedding information as input, and a multi-channel model computing the comparison internally. Both models outperform previous approaches, with the multi-channel model performing best.

pdf bib
Gender Bias in Multilingual Embeddings and Cross-Lingual Transfer
Jieyu Zhao | Subhabrata Mukherjee | Saghar Hosseini | Kai-Wei Chang | Ahmed Hassan Awadallah

Multilingual representations embed words from many languages into a single semantic space such that words with similar meanings are close to each other regardless of the language. These embeddings have been widely used in various settings, such as cross-lingual transfer, where a natural language processing (NLP) model trained on one language is deployed to another language. While the cross-lingual transfer techniques are powerful, they carry gender bias from the source to target languages. In this paper, we study gender bias in multilingual embeddings and how it affects transfer learning for NLP applications. We create a multilingual dataset for bias analysis and propose several ways for quantifying bias in multilingual representations from both the intrinsic and extrinsic perspectives. Experimental results show that the magnitude of bias in the multilingual representations changes differently when we align the embeddings to different target spaces and that the alignment direction can also have an influence on the bias in transfer learning. We further provide recommendations for using the multilingual word representations for downstream tasks.

pdf bib
Is Your Classifier Actually Biased? Measuring Fairness under Uncertainty with Bernstein Bounds
Kawin Ethayarajh

Most NLP datasets are not annotated with protected attributes such as gender, making it difficult to measure classification bias using standard measures of fairness (e.g., equal opportunity). However, manually annotating a large dataset with a protected attribute is slow and expensive. Instead of annotating all the examples, can we annotate a subset of them and use that sample to estimate the bias? While it is possible to do so, the smaller this annotated sample is, the less certain we are that the estimate is close to the true bias. In this work, we propose using Bernstein bounds to represent this uncertainty about the bias estimate as a confidence interval. We provide empirical evidence that a 95 % confidence interval derived this way consistently bounds the true bias. In quantifying this uncertainty, our method, which we call Bernstein-bounded unfairness, helps prevent classifiers from being deemed biased or unbiased when there is insufficient evidence to make either claim. Our findings suggest that the datasets currently used to measure specific biases are too small to conclusively identify bias except in the most egregious cases. For example, consider a co-reference resolution system that is 5 % more accurate on gender-stereotypical sentences to claim it is biased with 95 % confidence, we need a bias-specific dataset that is 3.8 times larger than WinoBias, the largest available.

pdf bib
Mitigating Gender Bias Amplification in Distribution by Posterior Regularization
Shengyu Jia | Tao Meng | Jieyu Zhao | Kai-Wei Chang

Advanced machine learning techniques have boosted the performance of natural language processing. Nevertheless, recent studies, e.g., (CITATION) show that these techniques inadvertently capture the societal bias hidden in the corpus and further amplify it. However, their analysis is conducted only on models’ top predictions. In this paper, we investigate the gender bias amplification issue from the distribution perspective and demonstrate that the bias is amplified in the view of predicted probability distribution over labels. We further propose a bias mitigation approach based on posterior regularization. With little performance loss, our method can almost remove the bias amplification in the distribution. Our study sheds the light on understanding the bias amplification.

pdf bib
Towards Understanding Gender Bias in Relation Extraction
Andrew Gaut | Tony Sun | Shirlyn Tang | Yuxin Huang | Jing Qian | Mai ElSherief | Jieyu Zhao | Diba Mirza | Elizabeth Belding | Kai-Wei Chang | William Yang Wang

Recent developments in Neural Relation Extraction (NRE) have made significant strides towards Automated Knowledge Base Construction. While much attention has been dedicated towards improvements in accuracy, there have been no attempts in the literature to evaluate social biases exhibited in NRE systems. In this paper, we create WikiGenderBias, a distantly supervised dataset composed of over 45,000 sentences including a 10 % human annotated test set for the purpose of analyzing gender bias in relation extraction systems. We find that when extracting spouse-of and hypernym (i.e., occupation) relations, an NRE system performs differently when the gender of the target entity is different. However, such disparity does not appear when extracting relations such as birthDate or birthPlace. We also analyze how existing bias mitigation techniques, such as name anonymization, word embedding debiasing, and data augmentation affect the NRE system in terms of maintaining the test performance and reducing biases. Unfortunately, due to NRE models rely heavily on surface level cues, we find that existing bias mitigation approaches have a negative effect on NRE. Our analysis lays groundwork for future quantifying and mitigating bias in NRE.

pdf bib
Zero-shot Text Classification via Reinforced Self-training
Zhiquan Ye | Yuxia Geng | Jiaoyan Chen | Jingmin Chen | Xiaoxiao Xu | SuHang Zheng | Feng Wang | Jun Zhang | Huajun Chen

Zero-shot learning has been a tough problem since no labeled data is available for unseen classes during training, especially for classes with low similarity. In this situation, transferring from seen classes to unseen classes is extremely hard. To tackle this problem, in this paper we propose a self-training based method to efficiently leverage unlabeled data. Traditional self-training methods use fixed heuristics to select instances from unlabeled data, whose performance varies among different datasets. We propose a reinforcement learning framework to learn data selection strategy automatically and provide more reliable selection. Experimental results on both benchmarks and a real-world e-commerce dataset show that our approach significantly outperforms previous methods in zero-shot text classification

pdf bib
A Relaxed Matching Procedure for Unsupervised BLIBLI
Xu Zhao | Zihao Wang | Yong Zhang | Hao Wu

Recently unsupervised Bilingual Lexicon Induction(BLI) without any parallel corpus has attracted much research interest. One of the crucial parts in methods for the BLI task is the matching procedure. Previous works impose a too strong constraint on the matching and lead to many counterintuitive translation pairings. Thus We propose a relaxed matching procedure to find a more precise matching between two languages. We also find that aligning source and target language embedding space bidirectionally will bring significant improvement. We follow the previous iterative framework to conduct experiments. Results on standard benchmark demonstrate the effectiveness of our proposed method, which substantially outperforms previous unsupervised methods.

pdf bib
Dynamic Programming Encoding for Subword Segmentation in Neural Machine Translation
Xuanli He | Gholamreza Haffari | Mohammad Norouzi

This paper introduces Dynamic Programming Encoding (DPE), a new segmentation algorithm for tokenizing sentences into subword units. We view the subword segmentation of output sentences as a latent variable that should be marginalized out for learning and inference. A mixed character-subword transformer is proposed, which enables exact log marginal likelihood estimation and exact MAP inference to find target segmentations with maximum posterior probability. DPE uses a lightweight mixed character-subword transformer as a means of pre-processing parallel data to segment output sentences using dynamic programming. Empirical results on machine translation suggest that DPE is effective for segmenting output sentences and can be combined with BPE dropout for stochastic segmentation of source sentences. DPE achieves an average improvement of 0.9 BLEU over BPE (Sennrich et al., 2016) and an average improvement of 0.55 BLEU over BPE dropout (Provilkov et al., 2019) on several WMT datasets including English = (German, Romanian, Estonian, Finnish, Hungarian).

pdf bib
Geometry-aware domain adaptation for unsupervised alignment of word embeddings
Pratik Jawanpuria | Mayank Meghwanshi | Bamdev Mishra

We propose a novel manifold based geometric approach for learning unsupervised alignment of word embeddings between the source and the target languages. Our approach formulates the alignment learning problem as a domain adaptation problem over the manifold of doubly stochastic matrices. This viewpoint arises from the aim to align the second order information of the two language spaces. The rich geometry of the doubly stochastic manifold allows to employ efficient Riemannian conjugate gradient algorithm for the proposed formulation. Empirically, the proposed approach outperforms state-of-the-art optimal transport based approach on the bilingual lexicon induction task across several language pairs. The performance improvement is more significant for distant language pairs.

pdf bib
Distinguish Confusing Law Articles for Legal Judgment Prediction
Nuo Xu | Pinghui Wang | Long Chen | Li Pan | Xiaoyan Wang | Junzhou Zhao

Legal Judgement Prediction (LJP) is the task of automatically predicting a law case’s judgment results given a text describing the case’s facts, which has great prospects in judicial assistance systems and handy services for the public. In practice, confusing charges are often presented, because law cases applicable to similar law articles are easily misjudged. To address this issue, existing work relies heavily on domain experts, which hinders its application in different law systems. In this paper, we present an end-to-end model, LADAN, to solve the task of LJP. To distinguish confusing charges, we propose a novel graph neural network, GDL, to automatically learn subtle differences between confusing law articles, and also design a novel attention mechanism that fully exploits the learned differences to attentively extract effective discriminative features from fact descriptions. Experiments conducted on real-world datasets demonstrate the superiority of our LADAN.

pdf bib
HyperCore : Hyperbolic and Co-graph Representation for Automatic ICD CodingHyperCore: Hyperbolic and Co-graph Representation for Automatic ICD Coding
Pengfei Cao | Yubo Chen | Kang Liu | Jun Zhao | Shengping Liu | Weifeng Chong

The International Classification of Diseases (ICD) provides a standardized way for classifying diseases, which endows each disease with a unique code. ICD coding aims to assign proper ICD codes to a medical record. Since manual coding is very laborious and prone to errors, many methods have been proposed for the automatic ICD coding task. However, most of existing methods independently predict each code, ignoring two important characteristics : Code Hierarchy and Code Co-occurrence. In this paper, we propose a Hyperbolic and Co-graph Representation method (HyperCore) to address the above problem. Specifically, we propose a hyperbolic representation method to leverage the code hierarchy. Moreover, we propose a graph convolutional network to utilize the code co-occurrence. Experimental results on two widely used datasets demonstrate that our proposed model outperforms previous state-of-the-art methods.

pdf bib
Effective Inter-Clause Modeling for End-to-End Emotion-Cause Pair Extraction
Penghui Wei | Jiahao Zhao | Wenji Mao

Emotion-cause pair extraction aims to extract all emotion clauses coupled with their cause clauses from a given document. Previous work employs two-step approaches, in which the first step extracts emotion clauses and cause clauses separately, and the second step trains a classifier to filter out negative pairs. However, such pipeline-style system for emotion-cause pair extraction is suboptimal because it suffers from error propagation and the two steps may not adapt to each other well. In this paper, we tackle emotion-cause pair extraction from a ranking perspective, i.e., ranking clause pair candidates in a document, and propose a one-step neural approach which emphasizes inter-clause modeling to perform end-to-end extraction. It models the interrelations between the clauses in a document to learn clause representations with graph attention, and enhances clause pair representations with kernel-based relative position embedding for effective ranking. Experimental results show that our approach significantly outperforms the current two-step systems, especially in the condition of extracting multiple pairs in one document.

pdf bib
Enhancing Cross-target Stance Detection with Transferable Semantic-Emotion Knowledge
Bowen Zhang | Min Yang | Xutao Li | Yunming Ye | Xiaofei Xu | Kuai Dai

Stance detection is an important task, which aims to classify the attitude of an opinionated text towards a given target. Remarkable success has been achieved when sufficient labeled training data is available. However, annotating sufficient data is labor-intensive, which establishes significant barriers for generalizing the stance classifier to the data with new targets. In this paper, we proposed a Semantic-Emotion Knowledge Transferring (SEKT) model for cross-target stance detection, which uses the external knowledge (semantic and emotion lexicons) as a bridge to enable knowledge transfer across different targets. Specifically, a semantic-emotion heterogeneous graph is constructed from external semantic and emotion lexicons, which is then fed into a graph convolutional network to learn multi-hop semantic connections between words and emotion tags. Then, the learned semantic-emotion graph representation, which serves as prior knowledge bridging the gap between the source and target domains, is fully integrated into the bidirectional long short-term memory (BiLSTM) stance classifier by adding a novel knowledge-aware memory unit to the BiLSTM cell. Extensive experiments on a large real-world dataset demonstrate the superiority of SEKT against the state-of-the-art baseline methods.

pdf bib
KinGDOM : Knowledge-Guided DOMain Adaptation for Sentiment AnalysisKinGDOM: Knowledge-Guided DOMain Adaptation for Sentiment Analysis
Deepanway Ghosal | Devamanyu Hazarika | Abhinaba Roy | Navonil Majumder | Rada Mihalcea | Soujanya Poria

Cross-domain sentiment analysis has received significant attention in recent years, prompted by the need to combat the domain gap between different applications that make use of sentiment analysis. In this paper, we take a novel perspective on this task by exploring the role of external commonsense knowledge. We introduce a new framework, KinGDOM, which utilizes the ConceptNet knowledge graph to enrich the semantics of a document by providing both domain-specific and domain-general background concepts. These concepts are learned by training a graph convolutional autoencoder that leverages inter-domain concepts in a domain-invariant manner. Conditioning a popular domain-adversarial baseline method with these learned concepts helps improve its performance over state-of-the-art approaches, demonstrating the efficacy of our proposed framework.

pdf bib
Parallel Data Augmentation for Formality Style Transfer
Yi Zhang | Tao Ge | Xu Sun

The main barrier to progress in the task of Formality Style Transfer is the inadequacy of training data. In this paper, we study how to augment parallel data and propose novel and simple data augmentation methods for this task to obtain useful sentence pairs with easily accessible models and systems. Experiments demonstrate that our augmented parallel data largely helps improve formality style transfer when it is used to pre-train the model, leading to the state-of-the-art results in the GYAFC benchmark dataset.

pdf bib
A Span-based Linearization for Constituent Trees
Yang Wei | Yuanbin Wu | Man Lan

We propose a novel linearization of a constituent tree, together with a new locally normalized model. For each split point in a sentence, our model computes the normalizer on all spans ending with that split point, and then predicts a tree span from them. Compared with global models, our model is fast and parallelizable. Different from previous local models, our linearization method is tied on the spans directly and considers more local features when performing span prediction, which is more interpretable and effective. Experiments on PTB (95.8 F1) and CTB (92.4 F1) show that our model significantly outperforms existing local models and efficiently achieves competitive results with global models.

pdf bib
Efficient Constituency Parsing by Pointing
Thanh-Tung Nguyen | Xuan-Phi Nguyen | Shafiq Joty | Xiaoli Li

We propose a novel constituency parsing model that casts the parsing problem into a series of pointing tasks. Specifically, our model estimates the likelihood of a span being a legitimate tree constituent via the pointing score corresponding to the boundary words of the span. Our parsing model supports efficient top-down decoding and our learning objective is able to enforce structural consistency without resorting to the expensive CKY inference. The experiments on the standard English Penn Treebank parsing task show that our method achieves 92.78 F1 without using pre-trained models, which is higher than all the existing methods with similar time complexity. Using pre-trained BERT, our model achieves 95.48 F1, which is competitive with the state-of-the-art while being faster. Our approach also establishes new state-of-the-art in Basque and Swedish in the SPMRL shared tasks on multilingual constituency parsing.

pdf bib
Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer
Jianfei Yu | Jing Jiang | Li Yang | Rui Xia

In this paper, we study Multimodal Named Entity Recognition (MNER) for social media posts. Existing approaches for MNER mainly suffer from two drawbacks : (1) despite generating word-aware visual representations, their word representations are insensitive to the visual context ; (2) most of them ignore the bias brought by the visual context. To tackle the first issue, we propose a multimodal interaction module to obtain both image-aware word representations and word-aware visual representations. To alleviate the visual bias, we further propose to leverage purely text-based entity span detection as an auxiliary module, and design a Unified Multimodal Transformer to guide the final predictions with the entity span predictions. Experiments show that our unified approach achieves the new state-of-the-art performance on two benchmark datasets.

pdf bib
Stock Embeddings Acquired from News Articles and Price History, and an Application to Portfolio Optimization
Xin Du | Kumiko Tanaka-Ishii

Previous works that integrated news articles to better process stock prices used a variety of neural networks to predict price movements. The textual and price information were both encoded in the neural network, and it is therefore difficult to apply this approach in situations other than the original framework of the notoriously hard problem of price prediction. In contrast, this paper presents a method to encode the influence of news articles through a vector representation of stocks called a stock embedding. The stock embedding is acquired with a deep learning framework using both news articles and price history. Because the embedding takes the operational form of a vector, it is applicable to other financial problems besides price prediction. As one example application, we show the results of portfolio optimization using Reuters & Bloomberg headlines, producing a capital gain 2.8 times larger than that obtained with a baseline method using only stock price data. This suggests that the proposed stock embedding can leverage textual financial semantics to solve financial prediction problems.stock embedding. The stock embedding is acquired with a deep learning framework using both news articles and price history. Because the embedding takes the operational form of a vector, it is applicable to other financial problems besides price prediction. As one example application, we show the results of portfolio optimization using Reuters & Bloomberg headlines, producing a capital gain 2.8 times larger than that obtained with a baseline method using only stock price data. This suggests that the proposed stock embedding can leverage textual financial semantics to solve financial prediction problems.

pdf bib
What Was Written vs. Who Read It : News Media Profiling Using Text Analysis and Social Media Context
Ramy Baly | Georgi Karadzhov | Jisun An | Haewoon Kwak | Yoan Dinkov | Ahmed Ali | James Glass | Preslav Nakov

Predicting the political bias and the factuality of reporting of entire news outlets are critical elements of media profiling, which is an understudied but an increasingly important research direction. The present level of proliferation of fake, biased, and propagandistic content online has made it impossible to fact-check every single suspicious claim, either manually or automatically. Thus, it has been proposed to profile entire news outlets and to look for those that are likely to publish fake or biased content. This makes it possible to detect likely fake news the moment they are published, by simply checking the reliability of their source. From a practical perspective, political bias and factuality of reporting have a linguistic aspect but also a social context. Here, we study the impact of both, namely (i) what was written (i.e., what was published by the target medium, and how it describes itself in Twitter) vs. (ii) who reads it (i.e., analyzing the target medium’s audience on social media). We further study (iii) what was written about the target medium (in Wikipedia). The evaluation results show that what was written matters most, and we further show that putting all information sources together yields huge improvements over the current state-of-the-art.

pdf bib
Roles and Utilization of Attention Heads in Transformer-based Neural Language Models
Jae-young Jo | Sung-Hyon Myaeng

Sentence encoders based on the transformer architecture have shown promising results on various natural language tasks. The main impetus lies in the pre-trained neural language models that capture long-range dependencies among words, owing to multi-head attention that is unique in the architecture. However, little is known for how linguistic properties are processed, represented, and utilized for downstream tasks among hundreds of attention heads inside the pre-trained transformer-based model. For the initial goal of examining the roles of attention heads in handling a set of linguistic features, we conducted a set of experiments with ten probing tasks and three downstream tasks on four pre-trained transformer families (GPT, GPT2, BERT, and ELECTRA). Meaningful insights are shown through the lens of heat map visualization and utilized to propose a relatively simple sentence representation method that takes advantage of most influential attention heads, resulting in additional performance improvements on the downstream tasks.

pdf bib
Understanding Attention for Text Classification
Xiaobing Sun | Wei Lu

Attention has been proven successful in many natural language processing (NLP) tasks. Recently, many researchers started to investigate the interpretability of attention on NLP tasks. Many existing approaches focused on examining whether the local attention weights could reflect the importance of input representations. In this work, we present a study on understanding the internal mechanism of attention by looking into the gradient update process, checking its behavior when approaching a local minimum during training. We propose to analyze for each word token the following two quantities : its polarity score and its attention score, where the latter is a global assessment on the token’s significance. We discuss conditions under which the attention mechanism may become more (or less) interpretable, and show how the interplay between the two quantities can contribute towards model performance.

pdf bib
Enhancing Pre-trained Chinese Character Representation with Word-aligned AttentionChinese Character Representation with Word-aligned Attention
Yanzeng Li | Bowen Yu | Xue Mengge | Tingwen Liu

Most Chinese pre-trained models take character as the basic unit and learn representation according to character’s external contexts, ignoring the semantics expressed in the word, which is the smallest meaningful utterance in Chinese. Hence, we propose a novel word-aligned attention to exploit explicit word information, which is complementary to various character-based Chinese pre-trained language models. Specifically, we devise a pooling mechanism to align the character-level attention to the word level and propose to alleviate the potential issue of segmentation error propagation by multi-source information fusion. As a result, word and character information are explicitly integrated at the fine-tuning procedure. Experimental results on five Chinese NLP benchmark tasks demonstrate that our method achieves significant improvements against BERT, ERNIE and BERT-wwm.

pdf bib
On the Encoder-Decoder Incompatibility in Variational Text Modeling and Beyond
Chen Wu | Prince Zizhuang Wang | William Yang Wang

Variational autoencoders (VAEs) combine latent variables with amortized variational inference, whose optimization usually converges into a trivial local optimum termed posterior collapse, especially in text modeling. By tracking the optimization dynamics, we observe the encoder-decoder incompatibility that leads to poor parameterizations of the data manifold. We argue that the trivial local optimum may be avoided by improving the encoder and decoder parameterizations since the posterior network is part of a transition map between them. To this end, we propose Coupled-VAE, which couples a VAE model with a deterministic autoencoder with the same structure and improves the encoder and decoder parameterizations via encoder weight sharing and decoder signal matching. We apply the proposed Coupled-VAE approach to various VAE models with different regularization, posterior family, decoder structure, and optimization strategy. Experiments on benchmark datasets (i.e., PTB, Yelp, and Yahoo) show consistently improved results in terms of probability estimation and richness of the latent space. We also generalize our method to conditional language modeling and propose Coupled-CVAE, which largely improves the diversity of dialogue generation on the Switchboard dataset.

pdf bib
A Simple and Effective Unified Encoder for Document-Level Machine Translation
Shuming Ma | Dongdong Zhang | Ming Zhou

Most of the existing models for document-level machine translation adopt dual-encoder structures. The representation of the source sentences and the document-level contexts are modeled with two separate encoders. Although these models can make use of the document-level contexts, they do not fully model the interaction between the contexts and the source sentences, and can not directly adapt to the recent pre-training models (e.g., BERT) which encodes multiple sentences with a single encoder. In this work, we propose a simple and effective unified encoder that can outperform the baseline models of dual-encoder models in terms of BLEU and METEOR scores. Moreover, the pre-training models can further boost the performance of our proposed model.

pdf bib
Knowledge Distillation for Multilingual Unsupervised Neural Machine Translation
Haipeng Sun | Rui Wang | Kehai Chen | Masao Utiyama | Eiichiro Sumita | Tiejun Zhao

Unsupervised neural machine translation (UNMT) has recently achieved remarkable results for several language pairs. However, it can only translate between a single language pair and can not produce translation results for multiple language pairs at the same time. That is, research on multilingual UNMT has been limited. In this paper, we empirically introduce a simple method to translate between thirteen languages using a single encoder and a single decoder, making use of multilingual data to improve UNMT for all language pairs. On the basis of the empirical findings, we propose two knowledge distillation methods to further enhance multilingual UNMT performance. Our experiments on a dataset with English translated to and from twelve other languages (including three language families and six language branches) show remarkable results, surpassing strong unsupervised individual baselines while achieving promising performance between non-English language pairs in zero-shot translation scenarios and alleviating poor performance in low-resource language pairs.

pdf bib
Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation
Bo Pang | Erik Nijkamp | Wenjuan Han | Linqi Zhou | Yixian Liu | Kewei Tu

Open-domain dialogue generation has gained increasing attention in Natural Language Processing. Its evaluation requires a holistic means. Human ratings are deemed as the gold standard. As human evaluation is inefficient and costly, an automated substitute is highly desirable. In this paper, we propose holistic evaluation metrics that capture different aspects of open-domain dialogues. Our metrics consist of (1) GPT-2 based context coherence between sentences in a dialogue, (2) GPT-2 based fluency in phrasing, (3) n-gram based diversity in responses to augmented queries, and (4) textual-entailment-inference based logical self-consistency. The empirical validity of our metrics is demonstrated by strong correlations with human judgments. We open source the code and relevant materials.n-gram based diversity in responses to augmented queries, and (4) textual-entailment-inference based logical self-consistency. The empirical validity of our metrics is demonstrated by strong correlations with human judgments. We open source the code and relevant materials.

pdf bib
BiRRE : Learning Bidirectional Residual Relation Embeddings for Supervised Hypernymy DetectionBiRRE: Learning Bidirectional Residual Relation Embeddings for Supervised Hypernymy Detection
Chengyu Wang | Xiaofeng He

The hypernymy detection task has been addressed under various frameworks. Previously, the design of unsupervised hypernymy scores has been extensively studied. In contrast, supervised classifiers, especially distributional models, leverage the global contexts of terms to make predictions, but are more likely to suffer from lexical memorization. In this work, we revisit supervised distributional models for hypernymy detection. Rather than taking embeddings of two terms as classification inputs, we introduce a representation learning framework named Bidirectional Residual Relation Embeddings (BiRRE). In this model, a term pair is represented by a BiRRE vector as features for hypernymy classification, which models the possibility of a term being mapped to another in the embedding space by hypernymy relations. A Latent Projection Model with Negative Regularization (LPMNR) is proposed to simulate how hypernyms and hyponyms are generated by neural language models, and to generate BiRRE vectors based on bidirectional residuals of projections. Experiments verify BiRRE outperforms strong baselines over various evaluation frameworks.

pdf bib
Biomedical Entity Representations with Synonym Marginalization
Mujeen Sung | Hwisang Jeon | Jinhyuk Lee | Jaewoo Kang

Biomedical named entities often play important roles in many biomedical text mining tools. However, due to the incompleteness of provided synonyms and numerous variations in their surface forms, normalization of biomedical entities is very challenging. In this paper, we focus on learning representations of biomedical entities solely based on the synonyms of entities. To learn from the incomplete synonyms, we use a model-based candidate selection and maximize the marginal likelihood of the synonyms present in top candidates. Our model-based candidates are iteratively updated to contain more difficult negative samples as our model evolves. In this way, we avoid the explicit pre-selection of negative samples from more than 400 K candidates. On four biomedical entity normalization datasets having three different entity types (disease, chemical, adverse reaction), our model BioSyn consistently outperforms previous state-of-the-art models almost reaching the upper bound on each dataset.

pdf bib
SentiBERT : A Transferable Transformer-Based Architecture for Compositional Sentiment SemanticsSentiBERT: A Transferable Transformer-Based Architecture for Compositional Sentiment Semantics
Da Yin | Tao Meng | Kai-Wei Chang

We propose SentiBERT, a variant of BERT that effectively captures compositional sentiment semantics. The model incorporates contextualized representation with binary constituency parse tree to capture semantic composition. Comprehensive experiments demonstrate that SentiBERT achieves competitive performance on phrase-level sentiment classification. We further demonstrate that the sentiment composition learned from the phrase-level annotations on SST can be transferred to other sentiment analysis tasks as well as related tasks, such as emotion classification tasks. Moreover, we conduct ablation studies and design visualization methods to understand SentiBERT. We show that SentiBERT is better than baseline approaches in capturing negation and the contrastive relation and model the compositional sentiment semantics.

pdf bib
CH-SIMS : A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of ModalityCH-SIMS: A Chinese Multimodal Sentiment Analysis Dataset with Fine-grained Annotation of Modality
Wenmeng Yu | Hua Xu | Fanyang Meng | Yilin Zhu | Yixiao Ma | Jiele Wu | Jiyun Zou | Kaicheng Yang

Previous studies in multimodal sentiment analysis have used limited datasets, which only contain unified multimodal annotations. However, the unified annotations do not always reflect the independent sentiment of single modalities and limit the model to capture the difference between modalities. In this paper, we introduce a Chinese single- and multi-modal sentiment analysis dataset, CH-SIMS, which contains 2,281 refined video segments in the wild with both multimodal and independent unimodal annotations. It allows researchers to study the interaction between modalities or use independent unimodal annotations for unimodal sentiment analysis. Furthermore, we propose a multi-task learning framework based on late fusion as the baseline. Extensive experiments on the CH-SIMS show that our methods achieve state-of-the-art performance and learn more distinctive unimodal representations. The full dataset and codes are available for use at

pdf bib
Curriculum Pre-training for End-to-End Speech Translation
Chengyi Wang | Yu Wu | Shujie Liu | Ming Zhou | Zhenglu Yang

End-to-end speech translation poses a heavy burden on the encoder because it has to transcribe, understand, and learn cross-lingual semantics simultaneously. To obtain a powerful encoder, traditional methods pre-train it on ASR data to capture speech features. However, we argue that pre-training the encoder only through simple speech recognition is not enough, and high-level linguistic knowledge should be considered. Inspired by this, we propose a curriculum pre-training method that includes an elementary course for transcription learning and two advanced courses for understanding the utterance and mapping words in two languages. The difficulty of these courses is gradually increasing. Experiments show that our curriculum pre-training method leads to significant improvements on En-De and En-Fr speech translation benchmarks.

pdf bib
Meta-Transfer Learning for Code-Switched Speech Recognition
Genta Indra Winata | Samuel Cahyawijaya | Zhaojiang Lin | Zihan Liu | Peng Xu | Pascale Fung

An increasing number of people in the world today speak a mixed-language as a result of being multilingual. However, building a speech recognition system for code-switching remains difficult due to the availability of limited resources and the expense and significant effort required to collect mixed-language data. We therefore propose a new learning method, meta-transfer learning, to transfer learn on a code-switched speech recognition system in a low-resource setting by judiciously extracting information from high-resource monolingual datasets. Our model learns to recognize individual languages, and transfer them so as to better recognize mixed-language speech by conditioning the optimization on the code-switching data. Based on experimental results, our model outperforms existing baselines on speech recognition and language modeling tasks, and is faster to converge.

pdf bib
It Takes Two to Lie : One to Lie, and One to Listen
Denis Peskov | Benny Cheng | Ahmed Elgohary | Joe Barrow | Cristian Danescu-Niculescu-Mizil | Jordan Boyd-Graber

Trust is implicit in many online text conversationsstriking up new friendships, or asking for tech support. But trust can be betrayed through deception. We study the language and dynamics of deception in the negotiation-based game Diplomacy, where seven players compete for world domination by forging and breaking alliances with each other. Our study with players from the Diplomacy community gathers 17,289 messages annotated by the sender for their intended truthfulness and by the receiver for their perceived truthfulness. Unlike existing datasets, this captures deception in long-lasting relationships, where the interlocutors strategically combine truth with lies to advance objectives. A model that uses power dynamics and conversational contexts can predict when a lie occurs nearly as well as human players.

pdf bib
Pre-training Is (Almost) All You Need : An Application to Commonsense Reasoning
Alexandre Tamborrino | Nicola Pellicanò | Baptiste Pannier | Pascal Voitot | Louise Naudin

Fine-tuning of pre-trained transformer models has become the standard approach for solving common NLP tasks. Most of the existing approaches rely on a randomly initialized classifier on top of such networks. We argue that this fine-tuning procedure is sub-optimal as the pre-trained model has no prior on the specific classifier labels, while it might have already learned an intrinsic textual representation of the task. In this paper, we introduce a new scoring method that casts a plausibility ranking task in a full-text format and leverages the masked language modeling head tuned during the pre-training phase. We study commonsense reasoning tasks where the model must rank a set of hypotheses given a premise, focusing on the COPA, Swag, HellaSwag and CommonsenseQA datasets. By exploiting our scoring method without fine-tuning, we are able to produce strong baselines (e.g. 80 % test accuracy on COPA) that are comparable to supervised approaches. Moreover, when fine-tuning directly on the proposed scoring function, we show that our method provides a much more stable training phase across random restarts (e.g x10 standard deviation reduction on COPA test accuracy) and requires less annotated data than the standard classifier approach to reach equivalent performances.

pdf bib
Successfully Applying the Stabilized Lottery Ticket Hypothesis to the Transformer Architecture
Christopher Brix | Parnia Bahar | Hermann Ney

Sparse models require less memory for storage and enable a faster inference by reducing the necessary number of FLOPs. This is relevant both for time-critical and on-device computations using neural networks. The stabilized lottery ticket hypothesis states that networks can be pruned after none or few training iterations, using a mask computed based on the unpruned converged model. On the transformer architecture and the WMT 2014 English-to-German and English-to-French tasks, we show that stabilized lottery ticket pruning performs similar to magnitude pruning for sparsity levels of up to 85 %, and propose a new combination of pruning techniques that outperforms all other techniques for even higher levels of sparsity. Furthermore, we confirm that the parameter’s initial sign and not its specific value is the primary factor for successful training, and show that magnitude pruning can not be used to find winning lottery tickets.

pdf bib
Analysing Lexical Semantic Change with Contextualised Word Representations
Mario Giulianelli | Marco Del Tredici | Raquel Fernández

This paper presents the first unsupervised approach to lexical semantic change that makes use of contextualised word representations. We propose a novel method that exploits the BERT neural language model to obtain representations of word usages, clusters these representations into usage types, and measures change along time with three proposed metrics. We create a new evaluation dataset and show that the model representations and the detected semantic shifts are positively correlated with human judgements. Our extensive qualitative analysis demonstrates that our method captures a variety of synchronic and diachronic linguistic phenomena. We expect our work to inspire further research in this direction.

pdf bib
GoEmotions : A Dataset of Fine-Grained EmotionsGoEmotions: A Dataset of Fine-Grained Emotions
Dorottya Demszky | Dana Movshovitz-Attias | Jeongwoo Ko | Alan Cowen | Gaurav Nemade | Sujith Ravi

Understanding emotion expressed in language has a wide range of applications, from building empathetic chatbots to detecting harmful online behavior. Advancement in this area can be improved using large-scale datasets with a fine-grained typology, adaptable to multiple downstream tasks. We introduce GoEmotions, the largest manually annotated dataset of 58k English Reddit comments, labeled for 27 emotion categories or Neutral. We demonstrate the high quality of the annotations via Principal Preserved Component Analysis. We conduct transfer learning experiments with existing emotion benchmarks to show that our dataset generalizes well to other domains and different emotion taxonomies. Our BERT-based model achieves an average F1-score of.46 across our proposed taxonomy, leaving much room for improvement.

pdf bib
SKEP : Sentiment Knowledge Enhanced Pre-training for Sentiment AnalysisSKEP: Sentiment Knowledge Enhanced Pre-training for Sentiment Analysis
Hao Tian | Can Gao | Xinyan Xiao | Hao Liu | Bolei He | Hua Wu | Haifeng Wang | Feng Wu

Recently, sentiment analysis has seen remarkable advance with the help of pre-training approaches. However, sentiment knowledge, such as sentiment words and aspect-sentiment pairs, is ignored in the process of pre-training, despite the fact that they are widely used in traditional sentiment analysis approaches. In this paper, we introduce Sentiment Knowledge Enhanced Pre-training (SKEP) in order to learn a unified sentiment representation for multiple sentiment analysis tasks. With the help of automatically-mined knowledge, SKEP conducts sentiment masking and constructs three sentiment knowledge prediction objectives, so as to embed sentiment information at the word, polarity and aspect level into pre-trained sentiment representation. In particular, the prediction of aspect-sentiment pairs is converted into multi-label classification, aiming to capture the dependency between words in a pair. Experiments on three kinds of sentiment tasks show that SKEP significantly outperforms strong pre-training baseline, and achieves new state-of-the-art results on most of the test datasets. We release our code at

pdf bib
Exact yet Efficient Graph Parsing, Bi-directional Locality and the Constructivist Hypothesis
Yajie Ye | Weiwei Sun

A key problem in processing graph-based meaning representations is graph parsing, i.e. computing all possible derivations of a given graph according to a (competence) grammar. We demonstrate, for the first time, that exact graph parsing can be efficient for large graphs and with large Hyperedge Replacement Grammars (HRGs). The advance is achieved by exploiting locality as terminal edge-adjacency in HRG rules. In particular, we highlight the importance of 1) a terminal edge-first parsing strategy, 2) a categorization of a subclass of HRG, i.e. what we call Weakly Regular Graph Grammar, and 3) distributing argument-structures to both lexical and phrasal rules.

pdf bib
Neural Reranking for Dependency Parsing : An Evaluation
Bich-Ngoc Do | Ines Rehbein

Recent work has shown that neural rerankers can improve results for dependency parsing over the top k trees produced by a base parser. However, all neural rerankers so far have been evaluated on English and Chinese only, both languages with a configurational word order and poor morphology. In the paper, we re-assess the potential of successful neural reranking models from the literature on English and on two morphologically rich(er) languages, German and Czech. In addition, we introduce a new variation of a discriminative reranker based on graph convolutional networks (GCNs). We show that the GCN not only outperforms previous models on English but is the only model that is able to improve results over the baselines on German and Czech. We explain the differences in reranking performance based on an analysis of a) the gold tree ratio and b) the variety in the k-best lists.

pdf bib
Probing for Referential Information in Language Models
Ionut-Teodor Sorodoc | Kristina Gulordava | Gemma Boleda

Language models keep track of complex information about the preceding context including, e.g., syntactic relations in a sentence. We investigate whether they also capture information beneficial for resolving pronominal anaphora in English. We analyze two state of the art models with LSTM and Transformer architectures, via probe tasks and analysis on a coreference annotated corpus. The Transformer outperforms the LSTM in all analyses. Our results suggest that language models are more successful at learning grammatical constraints than they are at learning truly referential information, in the sense of capturing the fact that we use language to refer to entities in the world. However, we find traces of the latter aspect, too.

pdf bib
Towards Transparent and Explainable Attention Models
Akash Kumar Mohankumar | Preksha Nema | Sharan Narasimhan | Mitesh M. Khapra | Balaji Vasan Srinivasan | Balaraman Ravindran

Recent studies on interpretability of attention distributions have led to notions of faithful and plausible explanations for a model’s predictions. Attention distributions can be considered a faithful explanation if a higher attention weight implies a greater impact on the model’s prediction. They can be considered a plausible explanation if they provide a human-understandable justification for the model’s predictions. In this work, we first explain why current attention mechanisms in LSTM based encoders can neither provide a faithful nor a plausible explanation of the model’s predictions. We observe that in LSTM based encoders the hidden representations at different time-steps are very similar to each other (high conicity) and attention weights in these situations do not carry much meaning because even a random permutation of the attention weights does not affect the model’s predictions. Based on experiments on a wide variety of tasks and datasets, we observe attention distributions often attribute the model’s predictions to unimportant words such as punctuation and fail to offer a plausible explanation for the predictions. To make attention mechanisms more faithful and plausible, we propose a modified LSTM cell with a diversity-driven training objective that ensures that the hidden representations learned at different time steps are diverse. We show that the resulting attention distributions offer more transparency as they (i) provide a more precise importance ranking of the hidden states (ii) are better indicative of words important for the model’s predictions (iii) correlate better with gradient-based attribution methods. Human evaluations indicate that the attention distributions learned by our model offer a plausible explanation of the model’s predictions.

pdf bib
Tchebycheff Procedure for Multi-task Text Classification
Yuren Mao | Shuang Yun | Weiwei Liu | Bo Du

Multi-task Learning methods have achieved great progress in text classification. However, existing methods assume that multi-task text classification problems are convex multiobjective optimization problems, which is unrealistic in real-world applications. To address this issue, this paper presents a novel Tchebycheff procedure to optimize the multi-task classification problems without convex assumption. The extensive experiments back up our theoretical analysis and validate the superiority of our proposals.

pdf bib
Modeling Word Formation in EnglishGerman Neural Machine TranslationEnglish–German Neural Machine Translation
Marion Weller-Di Marco | Alexander Fraser

This paper studies strategies to model word formation in NMT using rich linguistic information, namely a word segmentation approach that goes beyond splitting into substrings by considering fusional morphology. Our linguistically sound segmentation is combined with a method for target-side inflection to accommodate modeling word formation. The best system variants employ source-side morphological analysis and model complex target-side words, improving over a standard system.

pdf bib
Joint Modelling of Emotion and Abusive Language Detection
Santhosh Rajamanickam | Pushkar Mishra | Helen Yannakoudakis | Ekaterina Shutova

The rise of online communication platforms has been accompanied by some undesirable effects, such as the proliferation of aggressive and abusive behaviour online. Aiming to tackle this problem, the natural language processing (NLP) community has experimented with a range of techniques for abuse detection. While achieving substantial success, these methods have so far only focused on modelling the linguistic properties of the comments and the online communities of users, disregarding the emotional state of the users and how this might affect their language. The latter is, however, inextricably linked to abusive behaviour. In this paper, we present the first joint model of emotion and abusive language detection, experimenting in a multi-task learning framework that allows one task to inform the other. Our results demonstrate that incorporating affective features leads to significant improvements in abuse detection performance across datasets.

pdf bib
Target Inference in Argument Conclusion Generation
Milad Alshomary | Shahbaz Syed | Martin Potthast | Henning Wachsmuth

In argumentation, people state premises to reason towards a conclusion. The conclusion conveys a stance towards some target, such as a concept or statement. Often, the conclusion remains implicit, though, since it is self-evident in a discussion or left out for rhetorical reasons. However, the conclusion is key to understanding an argument and, hence, to any application that processes argumentation. We thus study the question to what extent an argument’s conclusion can be reconstructed from its premises. In particular, we argue here that a decisive step is to infer a conclusion’s target, and we hypothesize that this target is related to the premises’ targets. We develop two complementary target inference approaches : one ranks premise targets and selects the top-ranked target as the conclusion target, the other finds a new conclusion target in a learned embedding space using a triplet neural network. Our evaluation on corpora from two domains indicates that a hybrid of both approaches is best, outperforming several strong baselines. According to human annotators, we infer a reasonably adequate conclusion target in 89 % of the cases.

pdf bib
Multimodal Transformer for Multimodal Machine Translation
Shaowei Yao | Xiaojun Wan

Multimodal Machine Translation (MMT) aims to introduce information from other modality, generally static images, to improve the translation quality. Previous works propose various incorporation methods, but most of them do not consider the relative importance of multiple modalities. Equally treating all modalities may encode too much useless information from less important modalities. In this paper, we introduce the multimodal self-attention in Transformer to solve the issues above in MMT. The proposed method learns the representation of images based on the text, which avoids encoding irrelevant information in images. Experiments and visualization analysis demonstrate that our model benefits from visual information and substantially outperforms previous works and competitive baselines in terms of various metrics.

pdf bib
Towards Emotion-aided Multi-modal Dialogue Act Classification
Tulika Saha | Aditya Patra | Sriparna Saha | Pushpak Bhattacharyya

The task of Dialogue Act Classification (DAC) that purports to capture communicative intent has been studied extensively. But these studies limit themselves to text. Non-verbal features (change of tone, facial expressions etc.) can provide cues to identify DAs, thus stressing the benefit of incorporating multi-modal inputs in the task. Also, the emotional state of the speaker has a substantial effect on the choice of the dialogue act, since conversations are often influenced by emotions. Hence, the effect of emotion too on automatic identification of DAs needs to be studied. In this work, we address the role of both multi-modality and emotion recognition (ER) in DAC. DAC and ER help each other by way of multi-task learning. One of the major contributions of this work is a new dataset- multimodal Emotion aware Dialogue Act dataset called EMOTyDA, collected from open-sourced dialogue datasets. To demonstrate the utility of EMOTyDA, we build an attention based (self, inter-modal, inter-task) multi-modal, multi-task Deep Neural Network (DNN) for joint learning of DAs and emotions. We show empirically that multi-modality and multi-tasking achieve better performance of DAC compared to uni-modal and single task DAC variants.both multi-modality and emotion recognition (ER) in DAC. DAC and ER help each other by way of multi-task learning. One of the major contributions of this work is a new dataset- multimodal Emotion aware Dialogue Act dataset called EMOTyDA, collected from open-sourced dialogue datasets. To demonstrate the utility of EMOTyDA, we build an attention based (self, inter-modal, inter-task) multi-modal, multi-task Deep Neural Network (DNN) for joint learning of DAs and emotions. We show empirically that multi-modality and multi-tasking achieve better performance of DAC compared to uni-modal and single task DAC variants.

pdf bib
Learning to Faithfully Rationalize by ConstructionLearning to Faithfully Rationalize by Construction
Sarthak Jain | Sarah Wiegreffe | Yuval Pinter | Byron C. Wallace

In many settings it is important for one to be able to understand why a model made a particular prediction. In NLP this often entails extracting snippets of an input text ‘responsible for’ corresponding model output ; when such a snippet comprises tokens that indeed informed the model’s prediction, it is a faithful explanation. In some settings, faithfulness may be critical to ensure transparency. Lei et al. (2016) proposed a model to produce faithful rationales for neural text classification by defining independent snippet extraction and prediction modules. However, the discrete selection over input tokens performed by this method complicates training, leading to high variance and requiring careful hyperparameter tuning. We propose a simpler variant of this approach that provides faithful explanations by construction. In our scheme, named FRESH, arbitrary feature importance scores (e.g., gradients from a trained model) are used to induce binary labels over token inputs, which an extractor can be trained to predict. An independent classifier module is then trained exclusively on snippets provided by the extractor ; these snippets thus constitute faithful explanations, even if the classifier is arbitrarily complex. In both automatic and manual evaluations we find that variants of this simple framework yield predictive performance superior to ‘end-to-end’ approaches, while being more general and easier to train. Code is available at

pdf bib
DeFormer : Decomposing Pre-trained Transformers for Faster Question AnsweringDeFormer: Decomposing Pre-trained Transformers for Faster Question Answering
Qingqing Cao | Harsh Trivedi | Aruna Balasubramanian | Niranjan Balasubramanian

Transformer-based QA models use input-wide self-attention i.e. across both the question and the input passage at all layers, causing them to be slow and memory-intensive. It turns out that we can get by without input-wide self-attention at all layers, especially in the lower layers. We introduce DeFormer, a decomposed transformer, which substitutes the full self-attention with question-wide and passage-wide self-attentions in the lower layers. This allows for question-independent processing of the input text representations, which in turn enables pre-computing passage representations reducing runtime compute drastically. Furthermore, because DeFormer is largely similar to the original model, we can initialize DeFormer with the pre-training weights of a standard transformer, and directly fine-tune on the target QA dataset. We show DeFormer versions of BERT and XLNet can be used to speed up QA by over 4.3x and with simple distillation-based losses they incur only a 1 % drop in accuracy. We open source the code at

pdf bib
Improving Multi-hop Question Answering over Knowledge Graphs using Knowledge Base Embeddings
Apoorv Saxena | Aditay Tripathi | Partha Talukdar

Knowledge Graphs (KG) are multi-relational graphs consisting of entities as nodes and relations among them as typed edges. Goal of the Question Answering over KG (KGQA) task is to answer natural language queries posed over the KG. Multi-hop KGQA requires reasoning over multiple edges of the KG to arrive at the right answer. KGs are often incomplete with many missing links, posing additional challenges for KGQA, especially for multi-hop KGQA. Recent research on multi-hop KGQA has attempted to handle KG sparsity using relevant external text, which is n’t always readily available. In a separate line of research, KG embedding methods have been proposed to reduce KG sparsity by performing missing link prediction. Such KG embedding methods, even though highly relevant, have not been explored for multi-hop KGQA so far. We fill this gap in this paper and propose EmbedKGQA. EmbedKGQA is particularly effective in performing multi-hop KGQA over sparse KGs. EmbedKGQA also relaxes the requirement of answer selection from a pre-specified neighborhood, a sub-optimal constraint enforced by previous multi-hop KGQA methods. Through extensive experiments on multiple benchmark datasets, we demonstrate EmbedKGQA’s effectiveness over other state-of-the-art baselines.

pdf bib
ParaCrawl : Web-Scale Acquisition of Parallel CorporaParaCrawl: Web-Scale Acquisition of Parallel Corpora
Marta Bañón | Pinzhen Chen | Barry Haddow | Kenneth Heafield | Hieu Hoang | Miquel Esplà-Gomis | Mikel L. Forcada | Amir Kamran | Faheem Kirefu | Philipp Koehn | Sergio Ortiz Rojas | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Elsa Sarrías | Marek Strelec | Brian Thompson | William Waites | Dion Wiggins | Jaume Zaragoza

We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.

pdf bib
Toward Gender-Inclusive Coreference Resolution
Yang Trista Cao | Hal Daumé III

Correctly resolving textual mentions of people fundamentally entails making inferences about those people. Such inferences raise the risk of systemic biases in coreference resolution systems, including biases that can harm binary and non-binary trans and cis stakeholders. To better understand such biases, we foreground nuanced conceptualizations of gender from sociology and sociolinguistics, and develop two new datasets for interrogating bias in crowd annotations and in existing coreference resolution systems. Through these studies, conducted on English text, we confirm that without acknowledging and building systems that recognize the complexity of gender, we build systems that lead to many potential harms.

pdf bib
Information-Theoretic Probing for Linguistic Structure
Tiago Pimentel | Josef Valvoda | Rowan Hall Maudslay | Ran Zmigrod | Adina Williams | Ryan Cotterell

The success of neural networks on a diverse set of NLP tasks has led researchers to question how much these networks actually know about natural language. Probes are a natural way of assessing this. When probing, a researcher chooses a linguistic task and trains a supervised model to predict annotations in that linguistic task from the network’s learned representations. If the probe does well, the researcher may conclude that the representations encode knowledge related to the task. A commonly held belief is that using simpler models as probes is better ; the logic is that simpler models will identify linguistic structure, but not learn the task itself. We propose an information-theoretic operationalization of probing as estimating mutual information that contradicts this received wisdom : one should always select the highest performing probe one can, even if it is more complex, since it will result in a tighter estimate, and thus reveal more of the linguistic information inherent in the representation. The experimental portion of our paper focuses on empirically estimating the mutual information between a linguistic property and BERT, comparing these estimates to several baselines. We evaluate on a set of ten typologically diverse languages often underrepresented in NLP researchplus Englishtotalling eleven languages. Our implementation is available in

pdf bib
Similarity Analysis of Contextual Word Representation Models
John Wu | Yonatan Belinkov | Hassan Sajjad | Nadir Durrani | Fahim Dalvi | James Glass

This paper investigates contextual word representation models from the lens of similarity analysis. Given a collection of trained models, we measure the similarity of their internal representations and attention. Critically, these models come from vastly different architectures. We use existing and novel similarity measures that aim to gauge the level of localization of information in the deep models, and facilitate the investigation of which design factors affect model similarity, without requiring any external linguistic annotation. The analysis reveals that models within the same family are more similar to one another, as may be expected. Surprisingly, different architectures have rather similar representations, but different individual neurons. We also observed differences in information localization in lower and higher layers and found that higher layers are more affected by fine-tuning on downstream tasks.

pdf bib
CraftAssist Instruction Parsing : Semantic Parsing for a Voxel-World AssistantCraftAssist Instruction Parsing: Semantic Parsing for a Voxel-World Assistant
Kavya Srinet | Yacine Jernite | Jonathan Gray | Arthur Szlam

We propose a semantic parsing dataset focused on instruction-driven communication with an agent in the game Minecraft. The dataset consists of 7 K human utterances and their corresponding parses. Given proper world state, the parses can be interpreted and executed in game. We report the performance of baseline models, and analyze their successes and failures.

pdf bib
How does BERT’s attention change when you fine-tune? An analysis methodology and a case study in negation scopeBERT’s attention change when you fine-tune? An analysis methodology and a case study in negation scope
Yiyun Zhao | Steven Bethard

Large pretrained language models like BERT, after fine-tuning to a downstream task, have achieved high performance on a variety of NLP problems. Yet explaining their decisions is difficult despite recent work probing their internal representations. We propose a procedure and analysis methods that take a hypothesis of how a transformer-based model might encode a linguistic phenomenon, and test the validity of that hypothesis based on a comparison between knowledge-related downstream tasks with downstream control tasks, and measurement of cross-dataset consistency. We apply this methodology to test BERT and RoBERTa on a hypothesis that some attention heads will consistently attend from a word in negation scope to the negation cue. We find that after fine-tuning BERT and RoBERTa on a negation scope task, the average attention head improves its sensitivity to negation and its attention consistency across negation datasets compared to the pre-trained models. However, only the base models (not the large models) improve compared to a control task, indicating there is evidence for a shallow encoding of negation only in the base models.

pdf bib
Pretraining with Contrastive Sentence Objectives Improves Discourse Performance of Language Models
Dan Iter | Kelvin Guu | Larry Lansing | Dan Jurafsky

Recent models for unsupervised representation learning of text have employed a number of techniques to improve contextual word representations but have put little focus on discourse-level representations. We propose Conpono, an inter-sentence objective for pretraining language models that models discourse coherence and the distance between sentences. Given an anchor sentence, our model is trained to predict the text k sentences away using a sampled-softmax objective where the candidates consist of neighboring sentences and sentences randomly sampled from the corpus. On the discourse representation benchmark DiscoEval, our model improves over the previous state-of-the-art by up to 13 % and on average 4 % absolute across 7 tasks. Our model is the same size as BERT-Base, but outperforms the much larger BERT-Large model and other more recent approaches that incorporate discourse. We also show that Conpono yields gains of 2%-6 % absolute even for tasks that do not explicitly evaluate discourse : textual entailment (RTE), common sense reasoning (COPA) and reading comprehension (ReCoRD).

pdf bib
Beyond Accuracy : Behavioral Testing of NLP Models with CheckListNLP Models with CheckList
Marco Tulio Ribeiro | Tongshuang Wu | Carlos Guestrin | Sameer Singh

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models either focus on individual tasks or on specific behaviors. Inspired by principles of behavioral testing in software engineering, we introduce CheckList, a task-agnostic methodology for testing NLP models. CheckList includes a matrix of general linguistic capabilities and test types that facilitate comprehensive test ideation, as well as a software tool to generate a large and diverse number of test cases quickly. We illustrate the utility of CheckList with tests for three tasks, identifying critical failures in both commercial and state-of-art models. In a user study, a team responsible for a commercial sentiment analysis model found new and actionable bugs in an extensively tested model. In another user study, NLP practitioners with CheckList created twice as many tests, and found almost three times as many bugs as users without it.

pdf bib
More Diverse Dialogue Datasets via Diversity-Informed Data Collection
Katherine Stasaski | Grace Hui Yang | Marti A. Hearst

Automated generation of conversational dialogue using modern neural architectures has made notable advances. However, these models are known to have a drawback of often producing uninteresting, predictable responses ; this is known as the diversity problem. We introduce a new strategy to address this problem, called Diversity-Informed Data Collection. Unlike prior approaches, which modify model architectures to solve the problem, this method uses dynamically computed corpus-level statistics to determine which conversational participants to collect data from. Diversity-Informed Data Collection produces significantly more diverse data than baseline data collection methods, and better results on two downstream tasks : emotion classification and dialogue generation. This method is generalizable and can be used with other corpus-level metrics.

pdf bib
S2ORC : The Semantic Scholar Open Research CorpusS2ORC: The Semantic Scholar Open Research Corpus
Kyle Lo | Lucy Lu Wang | Mark Neumann | Rodney Kinney | Daniel Weld

We introduce S2ORC, a large corpus of 81.1 M English-language academic papers spanning many academic disciplines. The corpus consists of rich metadata, paper abstracts, resolved bibliographic references, as well as structured full text for 8.1 M open access papers. Full text is annotated with automatically-detected inline mentions of citations, figures, and tables, each linked to their corresponding paper objects. In S2ORC, we aggregate papers from hundreds of academic publishers and digital archives into a unified source, and create the largest publicly-available collection of machine-readable academic text to date. We hope this resource will facilitate research and development of tools and tasks for text mining over academic text.

pdf bib
A Transformer-based Approach for Source Code Summarization
Wasi Ahmad | Saikat Chakraborty | Baishakhi Ray | Kai-Wei Chang

Generating a readable summary that describes the functionality of a program is known as source code summarization. In this task, learning code representation by modeling the pairwise relationship between code tokens to capture their long-range dependencies is crucial. To learn code representation for summarization, we explore the Transformer model that uses a self-attention mechanism and has shown to be effective in capturing long-range dependencies. In this work, we show that despite the approach is simple, it outperforms the state-of-the-art techniques by a significant margin. We perform extensive analysis and ablation studies that reveal several important findings, e.g., the absolute encoding of source code tokens’ position hinders, while relative encoding significantly improves the summarization performance. We have made our code publicly available to facilitate future research.

pdf bib
Discrete Optimization for Unsupervised Sentence Summarization with Word-Level Extraction
Raphael Schumann | Lili Mou | Yao Lu | Olga Vechtomova | Katja Markert

Automatic sentence summarization produces a shorter version of a sentence, while preserving its most important information. A good summary is characterized by language fluency and high information overlap with the source sentence. We model these two aspects in an unsupervised objective function, consisting of language modeling and semantic similarity metrics. We search for a high-scoring summary by discrete optimization. Our proposed method achieves a new state-of-the art for unsupervised sentence summarization according to ROUGE scores. Additionally, we demonstrate that the commonly reported ROUGE F1 metric is sensitive to summary length. Since this is unwillingly exploited in recent work, we emphasize that future evaluation should explicitly group summarization systems by output length brackets.

pdf bib
FEQA : A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive SummarizationFEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization
Esin Durmus | He He | Mona Diab

Neural abstractive summarization models are prone to generate content inconsistent with the source document, i.e. unfaithful. Existing automatic metrics do not capture such mistakes effectively. We tackle the problem of evaluating faithfulness of a generated summary given its source document. We first collected human annotations of faithfulness for outputs from numerous models on two datasets. We find that current models exhibit a trade-off between abstractiveness and faithfulness : outputs with less word overlap with the source document are more likely to be unfaithful. Next, we propose an automatic question answering (QA) based metric for faithfulness, FEQA, which leverages recent advances in reading comprehension. Given question-answer pairs generated from the summary, a QA model extracts answers from the document ; non-matched answers indicate unfaithful information in the summary. Among metrics based on word overlap, embedding similarity, and learned language understanding models, our QA-based metric has significantly higher correlation with human faithfulness scores, especially on highly abstractive summaries.

pdf bib
Unsupervised Opinion Summarization as Copycat-Review Generation
Arthur Bražinskas | Mirella Lapata | Ivan Titov

Opinion summarization is the task of automatically creating summaries that reflect subjective information expressed in multiple documents, such as product reviews. While the majority of previous work has focused on the extractive setting, i.e., selecting fragments from input reviews to produce a summary, we let the model generate novel sentences and hence produce abstractive summaries. Recent progress in summarization has seen the development of supervised models which rely on large quantities of document-summary pairs. Since such training data is expensive to acquire, we instead consider the unsupervised setting, in other words, we do not use any summaries in training. We define a generative model for a review collection which capitalizes on the intuition that when generating a new review given a set of other reviews of a product, we should be able to control the amount of novelty going into the new review or, equivalently, vary the extent to which it deviates from the input. At test time, when generating summaries, we force the novelty to be minimal, and produce a text reflecting consensus opinions. We capture this intuition by defining a hierarchical variational autoencoder model. Both individual reviews and the products they correspond to are associated with stochastic latent codes, and the review generator (decoder) has direct access to the text of input reviews through the pointer-generator mechanism. Experiments on Amazon and Yelp datasets, show that setting at test time the review’s latent code to its mean, allows the model to produce fluent and coherent summaries reflecting common opinions.

pdf bib
Examining Citations of Natural Language Processing Literature
Saif M. Mohammad

We extracted information from the ACL Anthology (AA) and Google Scholar (GS) to examine trends in citations of NLP papers. We explore questions such as : how well cited are papers of different types (journal articles, conference papers, demo papers, etc.)? how well cited are papers from different areas of within NLP? etc. Notably, we show that only about 56 % of the papers in AA are cited ten or more times. CL Journal has the most cited papers, but its citation dominance has lessened in recent years. On average, long papers get almost three times as many citations as short papers ; and papers on sentiment classification, anaphora resolution, and entity recognition have the highest median citations. The analyses presented here, and the associated dataset of NLP papers mapped to citations, have a number of uses including : understanding how the field is growing and quantifying the impact of different types of papers.

pdf bib
How Can We Accelerate Progress Towards Human-like Linguistic Generalization?
Tal Linzen

This position paper describes and critiques the Pretraining-Agnostic Identically Distributed (PAID) evaluation paradigm, which has become a central tool for measuring progress in natural language understanding. This paradigm consists of three stages : (1) pre-training of a word prediction model on a corpus of arbitrary size ; (2) fine-tuning (transfer learning) on a training set representing a classification task ; (3) evaluation on a test set drawn from the same distribution as that training set. This paradigm favors simple, low-bias architectures, which, first, can be scaled to process vast amounts of data, and second, can capture the fine-grained statistical properties of a particular data set, regardless of whether those properties are likely to generalize to examples of the task outside the data set. This contrasts with humans, who learn language from several orders of magnitude less data than the systems favored by this evaluation paradigm, and generalize to new tasks in a consistent way. We advocate for supplementing or replacing PAID with paradigms that reward architectures that generalize as quickly and robustly as humans.

pdf bib
Balancing Objectives in Counseling Conversations : Advancing Forwards or Looking Backwards
Justine Zhang | Cristian Danescu-Niculescu-Mizil

Throughout a conversation, participants make choices that can orient the flow of the interaction. Such choices are particularly salient in the consequential domain of crisis counseling, where a difficulty for counselors is balancing between two key objectives : advancing the conversation towards a resolution, and empathetically addressing the crisis situation. In this work, we develop an unsupervised methodology to quantify how counselors manage this balance. Our main intuition is that if an utterance can only receive a narrow range of appropriate replies, then its likely aim is to advance the conversation forwards, towards a target within that range. Likewise, an utterance that can only appropriately follow a narrow range of possible utterances is likely aimed backwards at addressing a specific situation within that range. By applying this intuition, we can map each utterance to a continuous orientation axis that captures the degree to which it is intended to direct the flow of the conversation forwards or backwards. This unsupervised method allows us to characterize counselor behaviors in a large dataset of crisis counseling conversations, where we show that known counseling strategies intuitively align with this axis. We also illustrate how our measure can be indicative of a conversation’s progress, as well as its effectiveness.

pdf bib
Detecting Perceived Emotions in Hurricane Disasters
Shrey Desai | Cornelia Caragea | Junyi Jessy Li

Natural disasters (e.g., hurricanes) affect millions of people each year, causing widespread destruction in their wake. People have recently taken to social media websites (e.g., Twitter) to share their sentiments and feelings with the larger community. Consequently, these platforms have become instrumental in understanding and perceiving emotions at scale. In this paper, we introduce HurricaneEmo, an emotion dataset of 15,000 English tweets spanning three hurricanes : Harvey, Irma, and Maria. We present a comprehensive study of fine-grained emotions and propose classification tasks to discriminate between coarse-grained emotion groups. Our best BERT model, even after task-guided pre-training which leverages unlabeled Twitter data, achieves only 68 % accuracy (averaged across all groups). HurricaneEmo serves not only as a challenging benchmark for models but also as a valuable resource for analyzing emotions in disaster-centric domains.

pdf bib
Measuring Forecasting Skill from Text
Shi Zong | Alan Ritter | Eduard Hovy

People vary in their ability to make accurate predictions about the future. Prior studies have shown that some individuals can predict the outcome of future events with consistently better accuracy. This leads to a natural question : what makes some forecasters better than others? In this paper we explore connections between the language people use to describe their predictions and their forecasting skill. Datasets from two different forecasting domains are explored : (1) geopolitical forecasts from Good Judgment Open, an online prediction forum and (2) a corpus of company earnings forecasts made by financial analysts. We present a number of linguistic metrics which are computed over text associated with people’s predictions about the future including : uncertainty, readability, and emotion. By studying linguistic factors associated with predictions, we are able to shed some light on the approach taken by skilled forecasters. Furthermore, we demonstrate that it is possible to accurately predict forecasting skill using a model that is based solely on language. This could potentially be useful for identifying accurate predictions or potentially skilled forecasters earlier.

pdf bib
Text-Based Ideal Points
Keyon Vafa | Suresh Naidu | David Blei

Ideal point models analyze lawmakers’ votes to quantify their political positions, or ideal points. But votes are not the only way to express a political position. Lawmakers also give speeches, release press statements, and post tweets. In this paper, we introduce the text-based ideal point model (TBIP), an unsupervised probabilistic topic model that analyzes texts to quantify the political positions of its authors. We demonstrate the TBIP with two types of politicized text data : U.S. Senate speeches and senator tweets. Though the model does not analyze their votes or political affiliations, the TBIP separates lawmakers by party, learns interpretable politicized topics, and infers ideal points close to the classical vote-based ideal points. One benefit of analyzing texts, as opposed to votes, is that the TBIP can estimate ideal points of anyone who authors political texts, including non-voting actors. To this end, we use it to study tweets from the 2020 Democratic presidential candidates. Using only the texts of their tweets, it identifies them along an interpretable progressive-to-moderate spectrum.

pdf bib
Discourse as a Function of Event : Profiling Discourse Structure in News Articles around the Main Event
Prafulla Kumar Choubey | Aaron Lee | Ruihong Huang | Lu Wang

Understanding discourse structures of news articles is vital to effectively contextualize the occurrence of a news event. To enable computational modeling of news structures, we apply an existing theory of functional discourse structure for news articles that revolves around the main event and create a human-annotated corpus of 802 documents spanning over four domains and three media sources. Next, we propose several document-level neural-network models to automatically construct news content structures. Finally, we demonstrate that incorporating system predicted news structures yields new state-of-the-art performance for event coreference resolution. The news documents we annotated are openly available and the annotations are publicly released for future research.

pdf bib
Implicit Discourse Relation Classification : We Need to Talk about Evaluation
Najoung Kim | Song Feng | Chulaka Gunasekara | Luis Lastras

Implicit relation classification on Penn Discourse TreeBank (PDTB) 2.0 is a common benchmark task for evaluating the understanding of discourse relations. However, the lack of consistency in preprocessing and evaluation poses challenges to fair comparison of results in the literature. In this work, we highlight these inconsistencies and propose an improved evaluation protocol. Paired with this protocol, we report strong baseline results from pretrained sentence encoders, which set the new state-of-the-art for PDTB 2.0. Furthermore, this work is the first to explore fine-grained relation classification on PDTB 3.0. We expect our work to serve as a point of comparison for future work, and also as an initiative to discuss models of larger context and possible data augmentations for downstream transferability.

pdf bib
ZPR2 : Joint Zero Pronoun Recovery and Resolution using Multi-Task Learning and BERTZPR2: Joint Zero Pronoun Recovery and Resolution using Multi-Task Learning and BERT
Linfeng Song | Kun Xu | Yue Zhang | Jianshu Chen | Dong Yu

Zero pronoun recovery and resolution aim at recovering the dropped pronoun and pointing out its anaphoric mentions, respectively. We propose to better explore their interaction by solving both tasks together, while the previous work treats them separately. For zero pronoun resolution, we study this task in a more realistic setting, where no parsing trees or only automatic trees are available, while most previous work assumes gold trees. Experiments on two benchmarks show that joint modeling significantly outperforms our baseline that already beats the previous state of the arts.

pdf bib
Double-Hard Debias : Tailoring Word Embeddings for Gender Bias Mitigation
Tianlu Wang | Xi Victoria Lin | Nazneen Fatema Rajani | Bryan McCann | Vicente Ordonez | Caiming Xiong

Word embeddings derived from human-generated corpora inherit strong gender bias which can be further amplified by downstream models. Some commonly adopted debiasing approaches, including the seminal Hard Debias algorithm, apply post-processing procedures that project pre-trained word embeddings into a subspace orthogonal to an inferred gender subspace. We discover that semantic-agnostic corpus regularities such as word frequency captured by the word embeddings negatively impact the performance of these algorithms. We propose a simple but effective technique, Double Hard Debias, which purifies the word embeddings against such corpus regularities prior to inferring and removing the gender subspace. Experiments on three bias mitigation benchmarks show that our approach preserves the distributional semantics of the pre-trained word embeddings while reducing gender bias to a significantly larger degree than prior approaches.

pdf bib
Social Bias Frames : Reasoning about Social and Power Implications of Language
Maarten Sap | Saadia Gabriel | Lianhui Qin | Dan Jurafsky | Noah A. Smith | Yejin Choi

Warning : this paper contains content that may be offensive or upsetting. Language has the power to reinforce stereotypes and project social biases onto others. At the core of the challenge is that it is rarely what is stated explicitly, but rather the implied meanings, that frame people’s judgments about others. For example, given a statement that we should n’t lower our standards to hire more women, most listeners will infer the implicature intended by the speaker-that women (candidates) are less qualified. Most semantic formalisms, to date, do not capture such pragmatic implications in which people express social biases and power differentials in language. We introduce Social Bias Frames, a new conceptual formalism that aims to model the pragmatic frames in which people project social biases and stereotypes onto others. In addition, we introduce the Social Bias Inference Corpus to support large-scale modelling and evaluation with 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups. We then establish baseline approaches that learn to recover Social Bias Frames from unstructured text. We find that while state-of-the-art neural models are effective at high-level categorization of whether a given statement projects unwanted social bias (80 % F1), they are not effective at spelling out more detailed explanations in terms of Social Bias Frames. Our study motivates future work that combines structured pragmatic inference with commonsense reasoning on social implications.

pdf bib
A Re-evaluation of Knowledge Graph Completion Methods
Zhiqing Sun | Shikhar Vashishth | Soumya Sanyal | Partha Talukdar | Yiming Yang

Knowledge Graph Completion (KGC) aims at automatically predicting missing links for large-scale knowledge graphs. A vast number of state-of-the-art KGC techniques have got published at top conferences in several research fields, including data mining, machine learning, and natural language processing. However, we notice that several recent papers report very high performance, which largely outperforms previous state-of-the-art methods. In this paper, we find that this can be attributed to the inappropriate evaluation protocol used by them and propose a simple evaluation protocol to address this problem. The proposed protocol is robust to handle bias in the model, which can substantially affect the final results. We conduct extensive experiments and report performance of several existing methods using our protocol. The reproducible code has been made publicly available.

pdf bib
Evaluating Explainable AI : Which Algorithmic Explanations Help Users Predict Model Behavior?AI: Which Algorithmic Explanations Help Users Predict Model Behavior?
Peter Hase | Mohit Bansal

Algorithmic approaches to interpreting machine learning models have proliferated in recent years. We carry out human subject tests that are the first of their kind to isolate the effect of algorithmic explanations on a key aspect of model interpretability, simulatability, while avoiding important confounding experimental factors. A model is simulatable when a person can predict its behavior on new inputs. Through two kinds of simulation tests involving text and tabular data, we evaluate five explanations methods : (1) LIME, (2) Anchor, (3) Decision Boundary, (4) a Prototype model, and (5) a Composite approach that combines explanations from each method. Clear evidence of method effectiveness is found in very few cases : LIME improves simulatability in tabular classification, and our Prototype method is effective in counterfactual simulation tests. We also collect subjective ratings of explanations, but we do not find that ratings are predictive of how helpful explanations are. Our results provide the first reliable and comprehensive estimates of how explanations influence simulatability across a variety of explanation methods and data domains. We show that (1) we need to be careful about the metrics we use to evaluate explanation methods, and (2) there is significant room for improvement in current methods.

pdf bib
Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions
Xiaochuang Han | Byron C. Wallace | Yulia Tsvetkov

Modern deep learning models for NLP are notoriously opaque. This has motivated the development of methods for interpreting such models, e.g., via gradient-based saliency maps or the visualization of attention weights. Such approaches aim to provide explanations for a particular model prediction by highlighting important words in the corresponding input text. While this might be useful for tasks where decisions are explicitly influenced by individual tokens in the input, we suspect that such highlighting is not suitable for tasks where model decisions should be driven by more complex reasoning. In this work, we investigate the use of influence functions for NLP, providing an alternative approach to interpreting neural text classifiers. Influence functions explain the decisions of a model by identifying influential training examples. Despite the promise of this approach, influence functions have not yet been extensively evaluated in the context of NLP, a gap addressed by this work. We conduct a comparison between influence functions and common word-saliency methods on representative tasks. As suspected, we find that influence functions are particularly useful for natural language inference, a task in which ‘saliency maps’ may not have clear interpretation. Furthermore, we develop a new quantitative measure based on influence functions that can reveal artifacts in training data.

pdf bib
Finding Universal Grammatical Relations in Multilingual BERTBERT
Ethan A. Chi | John Hewitt | Christopher D. Manning

Recent work has found evidence that Multilingual BERT (mBERT), a transformer-based multilingual masked language model, is capable of zero-shot cross-lingual transfer, suggesting that some aspects of its representations are shared cross-lingually. To better understand this overlap, we extend recent work on finding syntactic trees in neural networks’ internal representations to the multilingual setting. We show that subspaces of mBERT representations recover syntactic tree distances in languages other than English, and that these subspaces are approximately shared across languages. Motivated by these results, we present an unsupervised analysis method that provides evidence mBERT learns representations of syntactic dependency labels, in the form of clusters which largely agree with the Universal Dependencies taxonomy. This evidence suggests that even without explicit supervision, multilingual masked language models learn certain linguistic universals.

pdf bib
Rationalizing Text Matching : Learning Sparse Alignments via Optimal TransportLearning Sparse Alignments via Optimal Transport
Kyle Swanson | Lili Yu | Tao Lei

Selecting input features of top relevance has become a popular method for building self-explaining models. In this work, we extend this selective rationalization approach to text matching, where the goal is to jointly select and align text pieces, such as tokens or sentences, as a justification for the downstream prediction. Our approach employs optimal transport (OT) to find a minimal cost alignment between the inputs. However, directly applying OT often produces dense and therefore uninterpretable alignments. To overcome this limitation, we introduce novel constrained variants of the OT problem that result in highly sparse alignments with controllable sparsity. Our model is end-to-end differentiable using the Sinkhorn algorithm for OT and can be trained without any alignment annotations. We evaluate our model on the StackExchange, MultiNews, e-SNLI, and MultiRC datasets. Our model achieves very sparse rationale selections with high fidelity while preserving prediction accuracy compared to strong attention baseline models.

pdf bib
On the Importance of Diversity in Question Generation for QAQA
Md Arafat Sultan | Shubham Chandel | Ramón Fernandez Astudillo | Vittorio Castelli

Automatic question generation (QG) has shown promise as a source of synthetic training data for question answering (QA). In this paper we ask : Is textual diversity in QG beneficial for downstream QA? Using top-p nucleus sampling to derive samples from a transformer-based question generator, we show that diversity-promoting QG indeed provides better QA training than likelihood maximization approaches such as beam search. We also show that standard QG evaluation metrics such as BLEU, ROUGE and METEOR are inversely correlated with diversity, and propose a diversity-aware intrinsic measure of overall QG quality that correlates well with extrinsic evaluation on QA.

pdf bib
Selective Question Answering under Domain Shift
Amita Kamath | Robin Jia | Percy Liang

To avoid giving wrong answers, question answering (QA) models need to know when to abstain from answering. Moreover, users often ask questions that diverge from the model’s training data, making errors more likely and thus abstention more critical. In this work, we propose the setting of selective question answering under domain shift, in which a QA model is tested on a mixture of in-domain and out-of-domain data, and must answer (i.e., not abstain on) as many questions as possible while maintaining high accuracy. Abstention policies based solely on the model’s softmax probabilities fare poorly, since models are overconfident on out-of-domain inputs. Instead, we train a calibrator to identify inputs on which the QA model errs, and abstain when it predicts an error is likely. Crucially, the calibrator benefits from observing the model’s behavior on out-of-domain data, even if from a different domain than the test data. We combine this method with a SQuAD-trained QA model and evaluate on mixtures of SQuAD and five other QA datasets. Our method answers 56 % of questions while maintaining 80 % accuracy ; in contrast, directly using the model’s probabilities only answers 48 % at 80 % accuracy.

pdf bib
The Cascade Transformer : an Application for Efficient Answer Sentence Selection
Luca Soldaini | Alessandro Moschitti

Large transformer-based language models have been shown to be very effective in many classification tasks. However, their computational complexity prevents their use in applications requiring the classification of a large set of candidates. While previous works have investigated approaches to reduce model size, relatively little attention has been paid to techniques to improve batch throughput during inference. In this paper, we introduce the Cascade Transformer, a simple yet effective technique to adapt transformer-based models into a cascade of rankers. Each ranker is used to prune a subset of candidates in a batch, thus dramatically increasing throughput at inference time. Partial encodings from the transformer model are shared among rerankers, providing further speed-up. When compared to a state-of-the-art transformer model, our approach reduces computation by 37 % with almost no impact on accuracy, as measured on two English Question Answering datasets.

pdf bib
STARC : Structured Annotations for Reading ComprehensionSTARC: Structured Annotations for Reading Comprehension
Yevgeni Berzak | Jonathan Malmaud | Roger Levy

We present STARC (Structured Annotations for Reading Comprehension), a new annotation framework for assessing reading comprehension with multiple choice questions. Our framework introduces a principled structure for the answer choices and ties them to textual span annotations. The framework is implemented in OneStopQA, a new high-quality dataset for evaluation and analysis of reading comprehension in English. We use this dataset to demonstrate that STARC can be leveraged for a key new application for the development of SAT-like reading comprehension materials : automatic annotation quality probing via span ablation experiments. We further show that it enables in-depth analyses and comparisons between machine and human reading comprehension behavior, including error distributions and guessing ability. Our experiments also reveal that the standard multiple choice dataset in NLP, RACE, is limited in its ability to measure reading comprehension. 47 % of its questions can be guessed by machines without accessing the passage, and 18 % are unanimously judged by humans as not having a unique correct answer. OneStopQA provides an alternative test set for reading comprehension which alleviates these shortcomings and has a substantially higher human ceiling performance.

pdf bib
A Comprehensive Analysis of Preprocessing for Word Representation Learning in Affective Tasks
Nastaran Babanejad | Ameeta Agrawal | Aijun An | Manos Papagelis

Affective tasks such as sentiment analysis, emotion classification, and sarcasm detection have been popular in recent years due to an abundance of user-generated data, accurate computational linguistic models, and a broad range of relevant applications in various domains. At the same time, many studies have highlighted the importance of text preprocessing, as an integral step to any natural language processing prediction model and downstream task. While preprocessing in affective systems is well-studied, preprocessing in word vector-based models applied to affective systems, is not. To address this limitation, we conduct a comprehensive analysis of the role of preprocessing techniques in affective analysis based on word vector models. Our analysis is the first of its kind and provides useful insights of the importance of each preprocessing technique when applied at the training phase, commonly ignored in pretrained word vector models, and/or at the downstream task phase.

pdf bib
Diverse and Informative Dialogue Generation with Context-Specific Commonsense Knowledge Awareness
Sixing Wu | Ying Li | Dawei Zhang | Yang Zhou | Zhonghai Wu

Generative dialogue systems tend to produce generic responses, which often leads to boring conversations. For alleviating this issue, Recent studies proposed to retrieve and introduce knowledge facts from knowledge graphs. While this paradigm works to a certain extent, it usually retrieves knowledge facts only based on the entity word itself, without considering the specific dialogue context. Thus, the introduction of the context-irrelevant knowledge facts can impact the quality of generations. To this end, this paper proposes a novel commonsense knowledge-aware dialogue generation model, ConKADI. We design a Felicitous Fact mechanism to help the model focus on the knowledge facts that are highly relevant to the context ; furthermore, two techniques, Context-Knowledge Fusion and Flexible Mode Fusion are proposed to facilitate the integration of the knowledge in the ConKADI. We collect and build a large-scale Chinese dataset aligned with the commonsense knowledge for dialogue generation. Extensive evaluations over both an open-released English dataset and our Chinese dataset demonstrate that our approach ConKADI outperforms the state-of-the-art approach CCM, in most experiments.

pdf bib
An Effective Transition-based Model for Discontinuous NERNER
Xiang Dai | Sarvnaz Karimi | Ben Hachey | Cecile Paris

Unlike widely used Named Entity Recognition (NER) data sets in generic domains, biomedical NER data sets often contain mentions consisting of discontinuous spans. Conventional sequence tagging techniques encode Markov assumptions that are efficient but preclude recovery of these mentions. We propose a simple, effective transition-based model with generic neural encoding for discontinuous NER. Through extensive experiments on three biomedical data sets, we show that our model can effectively recognize discontinuous mentions without sacrificing the accuracy on continuous mentions.

pdf bib
Relabel the Noise : Joint Extraction of Entities and Relations via Cooperative Multiagents
Daoyuan Chen | Yaliang Li | Kai Lei | Ying Shen

Distant supervision based methods for entity and relation extraction have received increasing popularity due to the fact that these methods require light human annotation efforts. In this paper, we consider the problem of shifted label distribution, which is caused by the inconsistency between the noisy-labeled training set subject to external knowledge graph and the human-annotated test set, and exacerbated by the pipelined entity-then-relation extraction manner with noise propagation. We propose a joint extraction approach to address this problem by re-labeling noisy instances with a group of cooperative multiagents. To handle noisy instances in a fine-grained manner, each agent in the cooperative group evaluates the instance by calculating a continuous confidence score from its own perspective ; To leverage the correlations between these two extraction tasks, a confidence consensus module is designed to gather the wisdom of all agents and re-distribute the noisy training set with confidence-scored labels. Further, the confidences are used to adjust the training losses of extractors. Experimental results on two real-world datasets verify the benefits of re-labeling noisy instance, and show that the proposed model significantly outperforms the state-of-the-art entity and relation extraction methods.

pdf bib
Simplify the Usage of Lexicon in Chinese NERChinese NER
Ruotian Ma | Minlong Peng | Qi Zhang | Zhongyu Wei | Xuanjing Huang

Recently, many works have tried to augment the performance of Chinese named entity recognition (NER) using word lexicons. As a representative, Lattice-LSTM has achieved new benchmark results on several public Chinese NER datasets. However, Lattice-LSTM has a complex model architecture. This limits its application in many industrial areas where real-time NER responses are needed. In this work, we propose a simple but effective method for incorporating the word lexicon into the character representations. This method avoids designing a complicated sequence modeling architecture, and for any neural NER model, it requires only subtle adjustment of the character representation layer to introduce the lexicon information. Experimental studies on four benchmark Chinese NER datasets show that our method achieves an inference speed up to 6.15 times faster than those of state-of-the-art methods, along with a better performance. The experimental results also show that the proposed method can be easily incorporated with pre-trained models like BERT.

pdf bib
AdvAug : Robust Adversarial Augmentation for Neural Machine TranslationAdvAug: Robust Adversarial Augmentation for Neural Machine Translation
Yong Cheng | Lu Jiang | Wolfgang Macherey | Jacob Eisenstein

In this paper, we propose a new adversarial augmentation method for Neural Machine Translation (NMT). The main idea is to minimize the vicinal risk over virtual sentences sampled from two vicinity distributions, in which the crucial one is a novel vicinity distribution for adversarial sentences that describes a smooth interpolated embedding space centered around observed training sentence pairs. We then discuss our approach, AdvAug, to train NMT models using the embeddings of virtual sentences in sequence-to-sequence learning. Experiments on Chinese-English, English-French, and English-German translation benchmarks show that AdvAug achieves significant improvements over theTransformer (up to 4.9 BLEU points), and substantially outperforms other data augmentation techniques (e.g.back-translation) without using extra corpora.

pdf bib
Emerging Cross-lingual Structure in Pretrained Language Models
Alexis Conneau | Shijie Wu | Haoran Li | Luke Zettlemoyer | Veselin Stoyanov

We study the problem of multilingual masked language modeling, i.e. the training of a single model on concatenated text from multiple languages, and present a detailed study of several factors that influence why these models are so effective for cross-lingual transfer. We show, contrary to what was previously hypothesized, that transfer is possible even when there is no shared vocabulary across the monolingual corpora and also when the text comes from very different domains. The only requirement is that there are some shared parameters in the top layers of the multi-lingual encoder. To better understand this result, we also show that representations from monolingual BERT models in different languages can be aligned post-hoc quite effectively, strongly suggesting that, much like for non-contextual word embeddings, there are universal latent symmetries in the learned embedding spaces. For multilingual masked language modeling, these symmetries are automatically discovered and aligned during the joint training process.

pdf bib
Incorporating External Knowledge through Pre-training for Natural Language to Code Generation
Frank F. Xu | Zhengbao Jiang | Pengcheng Yin | Bogdan Vasilescu | Graham Neubig

Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation : automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2 % absolute BLEU score on the code generation testbed CoNaLa. The code and resources are available at

pdf bib
Word-level Textual Adversarial Attacking as Combinatorial Optimization
Yuan Zang | Fanchao Qi | Chenghao Yang | Zhiyuan Liu | Meng Zhang | Qun Liu | Maosong Sun

Adversarial attacks are carried out to reveal the vulnerability of deep neural networks. Textual adversarial attacking is challenging because text is discrete and a small perturbation can bring significant change to the original input. Word-level attacking, which can be regarded as a combinatorial optimization problem, is a well-studied class of textual attack methods. However, existing word-level attack models are far from perfect, largely because unsuitable search space reduction methods and inefficient optimization algorithms are employed. In this paper, we propose a novel attack model, which incorporates the sememe-based word substitution method and particle swarm optimization-based search algorithm to solve the two problems separately. We conduct exhaustive experiments to evaluate our attack model by attacking BiLSTM and BERT on three benchmark datasets. Experimental results demonstrate that our model consistently achieves much higher attack success rates and crafts more high-quality adversarial examples as compared to baseline methods. Also, further experiments show our model has higher transferability and can bring more robustness enhancement to victim models by adversarial training. All the code and data of this paper can be obtained on

pdf bib
Reasoning Over Semantic-Level Graph for Fact Checking
Wanjun Zhong | Jingjing Xu | Duyu Tang | Zenan Xu | Nan Duan | Ming Zhou | Jiahai Wang | Jian Yin

Fact checking is a challenging task because verifying the truthfulness of a claim requires reasoning about multiple retrievable evidence. In this work, we present a method suitable for reasoning about the semantic-level structure of evidence. Unlike most previous works, which typically represent evidence sentences with either string concatenation or fusing the features of isolated evidence sentences, our approach operates on rich semantic structures of evidence obtained by semantic role labeling. We propose two mechanisms to exploit the structure of evidence while leveraging the advances of pre-trained models like BERT, GPT or XLNet. Specifically, using XLNet as the backbone, we first utilize the graph structure to re-define the relative distances of words, with the intuition that semantically related words should have short distances. Then, we adopt graph convolutional network and graph attention network to propagate and aggregate information from neighboring nodes on the graph. We evaluate our system on FEVER, a benchmark dataset for fact checking, and find that rich structural information is helpful and both our graph-based mechanisms improve the accuracy. Our model is the state-of-the-art system in terms of both official evaluation metrics, namely claim verification accuracy and FEVER score.

pdf bib
Extractive Summarization as Text Matching
Ming Zhong | Pengfei Liu | Yiran Chen | Danqing Wang | Xipeng Qiu | Xuanjing Huang

This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems. Instead of following the commonly used framework of extracting sentences individually and modeling the relationship between sentences, we formulate the extractive summarization task as a semantic text matching problem, in which a source document and candidate summaries will be (extracted from the original text) matched in a semantic space. Notably, this paradigm shift to semantic matching framework is well-grounded in our comprehensive analysis of the inherent gap between sentence-level and summary-level extractors based on the property of the dataset. Besides, even instantiating the framework with a simple form of a matching model, we have driven the state-of-the-art extractive result on CNN / DailyMail to a new level (44.41 in ROUGE-1). Experiments on the other five datasets also show the effectiveness of the matching framework. We believe the power of this matching-based summarization framework has not been fully exploited. To encourage more instantiations in the future, we have released our codes, processed dataset, as well as generated summaries in.

pdf bib
Multi-Granularity Interaction Network for Extractive and Abstractive Multi-Document Summarization
Hanqi Jin | Tianming Wang | Xiaojun Wan

In this paper, we propose a multi-granularity interaction network for extractive and abstractive multi-document summarization, which jointly learn semantic representations for words, sentences, and documents. The word representations are used to generate an abstractive summary while the sentence representations are used to produce an extractive summary. We employ attention mechanisms to interact between different granularity of semantic representations, which helps to capture multi-granularity key information and improves the performance of both abstractive and extractive summarization. Experiment results show that our proposed model substantially outperforms all strong baseline methods and achieves the best results on the Multi-News dataset.

pdf bib
Language (Re)modelling : Towards Embodied Language UnderstandingLanguage (Re)modelling: Towards Embodied Language Understanding
Ronen Tamari | Chen Shani | Tom Hope | Miriam R L Petruck | Omri Abend | Dafna Shahaf

While natural language understanding (NLU) is advancing rapidly, today’s technology differs from human-like language understanding in fundamental ways, notably in its inferior efficiency, interpretability, and generalization. This work proposes an approach to representation and learning based on the tenets of embodied cognitive linguistics (ECL). According to ECL, natural language is inherently executable (like programming languages), driven by mental simulation and metaphoric mappings over hierarchical compositions of structures and schemata learned through embodied interaction. This position paper argues that the use of grounding by metaphoric reasoning and simulation will greatly benefit NLU systems, and proposes a system architecture along with a roadmap towards realizing this vision.

pdf bib
The State and Fate of Linguistic Diversity and Inclusion in the NLP WorldNLP World
Pratik Joshi | Sebastin Santy | Amar Budhiraja | Kalika Bali | Monojit Choudhury

Language technologies contribute to promoting multilingualism and linguistic diversity around the world. However, only a very small number of the over 7000 languages of the world are represented in the rapidly evolving language technologies and applications. In this paper we look at the relation between the types of languages, resources, and their representation in NLP conferences to understand the trajectory that different languages have followed over time. Our quantitative investigation underlines the disparity between languages, especially in terms of their resources, and calls into question the language agnostic status of current models and systems. Through this paper, we attempt to convince the ACL community to prioritise the resolution of the predicaments highlighted here, so that no language is left behind.

pdf bib
The Unstoppable Rise of Computational Linguistics in Deep Learning
James Henderson

In this paper, we trace the history of neural networks applied to natural language understanding tasks, and identify key contributions which the nature of language has made to the development of neural network architectures. We focus on the importance of variable binding and its instantiation in attention-based models, and argue that Transformer is not a sequence model but an induced-structure model. This perspective leads to predictions of the challenges facing research in deep learning architectures for natural language understanding.

A Contextual Hierarchical Attention Network with Adaptive Objective for Dialogue State Tracking
Yong Shan | Zekang Li | Jinchao Zhang | Fandong Meng | Yang Feng | Cheng Niu | Jie Zhou

Recent studies in dialogue state tracking (DST) leverage historical information to determine states which are generally represented as slot-value pairs. However, most of them have limitations to efficiently exploit relevant context due to the lack of a powerful mechanism for modeling interactions between the slot and the dialogue history. Besides, existing methods usually ignore the slot imbalance problem and treat all slots indiscriminately, which limits the learning of hard slots and eventually hurts overall performance. In this paper, we propose to enhance the DST through employing a contextual hierarchical attention network to not only discern relevant information at both word level and turn level but also learn contextual representations. We further propose an adaptive objective to alleviate the slot imbalance problem by dynamically adjust weights of different slots during training. Experimental results show that our approach reaches 52.68 % and 58.55 % joint accuracy on MultiWOZ 2.0 and MultiWOZ 2.1 datasets respectively and achieves new state-of-the-art performance with considerable improvements (+1.24 % and +5.98 %).

pdf bib
Data Manipulation : Towards Effective Instance Learning for Neural Dialogue Generation via Learning to Augment and Reweight
Hengyi Cai | Hongshen Chen | Yonghao Song | Cheng Zhang | Xiaofang Zhao | Dawei Yin

Current state-of-the-art neural dialogue models learn from human conversations following the data-driven paradigm. As such, a reliable training corpus is the crux of building a robust and well-behaved dialogue model. However, due to the open-ended nature of human conversations, the quality of user-generated training data varies greatly, and effective training samples are typically insufficient while noisy samples frequently appear. This impedes the learning of those data-driven neural dialogue models. Therefore, effective dialogue learning requires not only more reliable learning samples, but also fewer noisy samples. In this paper, we propose a data manipulation framework to proactively reshape the data distribution towards reliable samples by augmenting and highlighting effective learning samples as well as reducing the effect of inefficient samples simultaneously. In particular, the data manipulation model selectively augments the training samples and assigns an importance weight to each instance to reform the training data. Note that, the proposed data manipulation framework is fully data-driven and learnable. It not only manipulates training samples to optimize the dialogue generation model, but also learns to increase its manipulation skills through gradient descent with validation samples. Extensive experiments show that our framework can improve the dialogue generation performance with respect to various automatic evaluation metrics and human judgments.

pdf bib
SAS : Dialogue State Tracking via Slot Attention and Slot Information SharingSAS: Dialogue State Tracking via Slot Attention and Slot Information Sharing
Jiaying Hu | Yan Yang | Chencai Chen | Liang He | Zhou Yu

Dialogue state tracker is responsible for inferring user intentions through dialogue history. Previous methods have difficulties in handling dialogues with long interaction context, due to the excessive information. We propose a Dialogue State Tracker with Slot Attention and Slot Information Sharing (SAS) to reduce redundant information’s interference and improve long dialogue context tracking. Specially, we first apply a Slot Attention to learn a set of slot-specific features from the original dialogue and then integrate them using a slot information sharing module. Our model yields a significantly improved performance compared to previous state-of the-art models on the MultiWOZ dataset.

pdf bib
MIE : A Medical Information Extractor towards Medical DialoguesMIE: A Medical Information Extractor towards Medical Dialogues
Yuanzhe Zhang | Zhongtao Jiang | Tao Zhang | Shiwan Liu | Jiarun Cao | Kang Liu | Shengping Liu | Jun Zhao

Electronic Medical Records (EMRs) have become key components of modern medical care systems. Despite the merits of EMRs, many doctors suffer from writing them, which is time-consuming and tedious. We believe that automatically converting medical dialogues to EMRs can greatly reduce the burdens of doctors, and extracting information from medical dialogues is an essential step. To this end, we annotate online medical consultation dialogues in a window-sliding style, which is much easier than the sequential labeling annotation. We then propose a Medical Information Extractor (MIE) towards medical dialogues. MIE is able to extract mentioned symptoms, surgeries, tests, other information and their corresponding status. To tackle the particular challenges of the task, MIE uses a deep matching architecture, taking dialogue turn-interaction into account. The experimental results demonstrate MIE is a promising solution to extract medical information from doctor-patient dialogues.

pdf bib
Span-based Localizing Network for Natural Language Video Localization
Hao Zhang | Aixin Sun | Wei Jing | Joey Tianyi Zhou

Given an untrimmed video and a text query, natural language video localization (NLVL) is to locate a matching span from the video that semantically corresponds to the query. Existing solutions formulate NLVL either as a ranking task and apply multimodal matching architecture, or as a regression task to directly regress the target video span. In this work, we address NLVL task with a span-based QA approach by treating the input video as text passage. We propose a video span localizing network (VSLNet), on top of the standard span-based QA framework, to address NLVL. The proposed VSLNet tackles the differences between NLVL and span-based QA through a simple and yet effective query-guided highlighting (QGH) strategy. The QGH guides VSLNet to search for matching video span within a highlighted region. Through extensive experiments on three benchmark datasets, we show that the proposed VSLNet outperforms the state-of-the-art methods ; and adopting span-based QA framework is a promising direction to solve NLVL.

pdf bib
Words Are n’t Enough, Their Order Matters : On the Robustness of Grounding Visual Referring Expressions
Arjun Akula | Spandana Gella | Yaser Al-Onaizan | Song-Chun Zhu | Siva Reddy

Visual referring expression recognition is a challenging task that requires natural language understanding in the context of an image. We critically examine RefCOCOg, a standard benchmark for this task, using a human study and show that 83.7 % of test instances do not require reasoning on linguistic structure, i.e., words are enough to identify the target object, the word order does n’t matter. To measure the true progress of existing models, we split the test set into two sets, one which requires reasoning on linguistic structure and the other which does n’t. Additionally, we create an out-of-distribution dataset Ref-Adv by asking crowdworkers to perturb in-domain examples such that the target object changes. Using these datasets, we empirically show that existing methods fail to exploit linguistic structure and are 12 % to 23 % lower in performance than the established progress for this task. We also propose two methods, one based on contrastive learning and the other based on multi-task learning, to increase the robustness of ViLBERT, the current state-of-the-art model for this task. Our datasets are publicly available at

pdf bib
Differentiable Window for Dynamic Local Attention
Thanh-Tung Nguyen | Xuan-Phi Nguyen | Shafiq Joty | Xiaoli Li

We propose Differentiable Window, a new neural module and general purpose component for dynamic window selection. While universally applicable, we demonstrate a compelling use case of utilizing Differentiable Window to improve standard attention modules by enabling more focused attentions over the input regions. We propose two variants of Differentiable Window, and integrate them within the Transformer architecture in two novel ways. We evaluate our proposed approach on a myriad of NLP tasks, including machine translation, sentiment analysis, subject-verb agreement and language modeling. Our experimental results demonstrate consistent and sizable improvements across all tasks.

pdf bib
The Right Tool for the Job : Matching Model and Instance Complexities
Roy Schwartz | Gabriel Stanovsky | Swabha Swayamdipta | Jesse Dodge | Noah A. Smith

As NLP models become larger, executing a trained model requires significant computational resources incurring monetary and environmental costs. To better respect a given inference budget, we propose a modification to contextual representation fine-tuning which, during inference, allows for an early (and fast) exit from neural network calculations for simple instances, and late (and accurate) exit for hard instances. To achieve this, we add classifiers to different layers of BERT and use their calibrated confidence scores to make early exit decisions. We test our proposed modification on five different datasets in two tasks : three text classification datasets and two natural language inference benchmarks. Our method presents a favorable speed / accuracy tradeoff in almost all cases, producing models which are up to five times faster than the state of the art, while preserving their accuracy. Our method also requires almost no additional training resources (in either time or parameters) compared to the baseline BERT model. Finally, our method alleviates the need for costly retraining of multiple models at different levels of efficiency ; we allow users to control the inference speed / accuracy tradeoff using a single trained model, by setting a single variable at inference time. We publicly release our code.

pdf bib
Coupling Distant Annotation and Adversarial Training for Cross-Domain Chinese Word SegmentationChinese Word Segmentation
Ning Ding | Dingkun Long | Guangwei Xu | Muhua Zhu | Pengjun Xie | Xiaobin Wang | Haitao Zheng

Fully supervised neural approaches have achieved significant progress in the task of Chinese word segmentation (CWS). Nevertheless, the performance of supervised models always drops gravely if the domain shifts due to the distribution gap across domains and the out of vocabulary (OOV) problem. In order to simultaneously alleviate the issues, this paper intuitively couples distant annotation and adversarial training for cross-domain CWS. 1) We rethink the essence of Chinese words and design an automatic distant annotation mechanism, which does not need any supervision or pre-defined dictionaries on the target domain. The method could effectively explore domain-specific words and distantly annotate the raw texts for the target domain. 2) We further develop a sentence-level adversarial training procedure to perform noise reduction and maximum utilization of the source domain information. Experiments on multiple real-world datasets across various domains show the superiority and robustness of our model, significantly outperforming previous state-of-the-arts cross-domain CWS methods.

pdf bib
Unsupervised Morphological Paradigm Completion
Huiming Jin | Liwei Cai | Yihui Peng | Chen Xia | Arya McCarthy | Katharina Kann

We propose the task of unsupervised morphological paradigm completion. Given only raw text and a lemma list, the task consists of generating the morphological paradigms, i.e., all inflected forms, of the lemmas. From a natural language processing (NLP) perspective, this is a challenging unsupervised task, and high-performing systems have the potential to improve tools for low-resource languages or to assist linguistic annotators. From a cognitive science perspective, this can shed light on how children acquire morphological knowledge. We further introduce a system for the task, which generates morphological paradigms via the following steps : (i) EDIT TREE retrieval, (ii) additional lemma retrieval, (iii) paradigm size discovery, and (iv) inflection generation. We perform an evaluation on 14 typologically diverse languages. Our system outperforms trivial baselines with ease and, for some languages, even obtains a higher accuracy than minimally supervised systems.

pdf bib
Document Modeling with Graph Attention Networks for Multi-grained Machine Reading Comprehension
Bo Zheng | Haoyang Wen | Yaobo Liang | Nan Duan | Wanxiang Che | Daxin Jiang | Ming Zhou | Ting Liu

Natural Questions is a new challenging machine reading comprehension benchmark with two-grained answers, which are a long answer (typically a paragraph) and a short answer (one or more entities inside the long answer). Despite the effectiveness of existing methods on this benchmark, they treat these two sub-tasks individually during training while ignoring their dependencies. To address this issue, we present a novel multi-grained machine reading comprehension framework that focuses on modeling documents at their hierarchical nature, which are different levels of granularity : documents, paragraphs, sentences, and tokens. We utilize graph attention networks to obtain different levels of representations so that they can be learned simultaneously. The long and short answers can be extracted from paragraph-level representation and token-level representation, respectively. In this way, we can model the dependencies between the two-grained answers to provide evidence for each other. We jointly train the two sub-tasks, and our experiments show that our approach significantly outperforms previous systems at both long and short answer criteria.

pdf bib
RikiNet : Reading Wikipedia Pages for Natural Question AnsweringRikiNet: Reading Wikipedia Pages for Natural Question Answering
Dayiheng Liu | Yeyun Gong | Jie Fu | Yu Yan | Jiusheng Chen | Daxin Jiang | Jiancheng Lv | Nan Duan

Reading long documents to answer open-domain questions remains challenging in natural language understanding. In this paper, we introduce a new model, called RikiNet, which reads Wikipedia pages for natural question answering. RikiNet contains a dynamic paragraph dual-attention reader and a multi-level cascaded answer predictor. The reader dynamically represents the document and question by utilizing a set of complementary attention mechanisms. The representations are then fed into the predictor to obtain the span of the short answer, the paragraph of the long answer, and the answer type in a cascaded manner. On the Natural Questions (NQ) dataset, a single RikiNet achieves 74.3 F1 and 57.9 F1 on long-answer and short-answer tasks. To our best knowledge, it is the first single model that outperforms the single human performance. Furthermore, an ensemble RikiNet obtains 76.1 F1 and 61.3 F1 on long-answer and short-answer tasks, achieving the best performance on the official NQ leaderboard.

pdf bib
Semantic Parsing for English as a Second LanguageEnglish as a Second Language
Yuanyuan Zhao | Weiwei Sun | Junjie Cao | Xiaojun Wan

This paper is concerned with semantic parsing for English as a second language (ESL). Motivated by the theoretical emphasis on the learning challenges that occur at the syntax-semantics interface during second language acquisition, we formulate the task based on the divergence between literal and intended meanings. We combine the complementary strengths of English Resource Grammar, a linguistically-precise hand-crafted deep grammar, and TLE, an existing manually annotated ESL UD-TreeBank with a novel reranking model. Experiments demonstrate that in comparison to human annotations, our method can obtain a very promising SemBanking quality. By means of the newly created corpus, we evaluate state-of-the-art semantic parsing as well as grammatical error correction models. The evaluation profiles the performance of neural NLP techniques for handling ESL data and suggests some research directions.

pdf bib
Unsupervised Dual Paraphrasing for Two-stage Semantic Parsing
Ruisheng Cao | Su Zhu | Chenyu Yang | Chen Liu | Rao Ma | Yanbin Zhao | Lu Chen | Kai Yu

One daunting problem for semantic parsing is the scarcity of annotation. Aiming to reduce nontrivial human labor, we propose a two-stage semantic parsing framework, where the first stage utilizes an unsupervised paraphrase model to convert an unlabeled natural language utterance into the canonical utterance. The downstream naive semantic parser accepts the intermediate output and returns the target logical form. Furthermore, the entire training process is split into two phases : pre-training and cycle learning. Three tailored self-supervised tasks are introduced throughout training to activate the unsupervised paraphrase model. Experimental results on benchmarks Overnight and GeoGranno demonstrate that our framework is effective and compatible with supervised training.

pdf bib
DRTS Parsing with Structure-Aware Encoding and DecodingDRTS Parsing with Structure-Aware Encoding and Decoding
Qiankun Fu | Yue Zhang | Jiangming Liu | Meishan Zhang

Discourse representation tree structure (DRTS) parsing is a novel semantic parsing task which has been concerned most recently. State-of-the-art performance can be achieved by a neural sequence-to-sequence model, treating the tree construction as an incremental sequence generation problem. Structural information such as input syntax and the intermediate skeleton of the partial output has been ignored in the model, which could be potentially useful for the DRTS parsing. In this work, we propose a structural-aware model at both the encoder and decoder phase to integrate the structural information, where graph attention network (GAT) is exploited for effectively modeling. Experimental results on a benchmark dataset show that our proposed model is effective and can obtain the best performance in the literature.

pdf bib
Document Translation vs. Query Translation for Cross-Lingual Information Retrieval in the Medical Domain
Shadi Saleh | Pavel Pecina

We present a thorough comparison of two principal approaches to Cross-Lingual Information Retrieval : document translation (DT) and query translation (QT). Our experiments are conducted using the cross-lingual test collection produced within the CLEF eHealth information retrieval tasks in 20132015 containing English documents and queries in several European languages. We exploit the Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) paradigms and train several domain-specific and task-specific machine translation systems to translate the non-English queries into English (for the QT approach) and the English documents to all the query languages (for the DT approach). The results show that the quality of QT by SMT is sufficient enough to outperform the retrieval results of the DT approach for all the languages. NMT then further boosts translation quality and retrieval quality for both QT and DT for most languages, but still, QT provides generally better retrieval results than DT.

pdf bib
Gender in Danger? Evaluating Speech Translation Technology on the MuST-SHE CorpusMuST-SHE Corpus
Luisa Bentivogli | Beatrice Savoldi | Matteo Negri | Mattia A. Di Gangi | Roldano Cattoni | Marco Turchi

Translating from languages without productive grammatical gender like English into gender-marked languages is a well-known difficulty for machines. This difficulty is also due to the fact that the training data on which models are built typically reflect the asymmetries of natural languages, gender bias included. Exclusively fed with textual data, machine translation is intrinsically constrained by the fact that the input sentence does not always contain clues about the gender identity of the referred human entities. But what happens with speech translation, where the input is an audio signal? Can audio provide additional information to reduce gender bias? We present the first thorough investigation of gender bias in speech translation, contributing with : i) the release of a benchmark useful for future studies, and ii) the comparison of different technologies (cascade and end-to-end) on two language directions (English-Italian / French).

pdf bib
Closing the Gap : Joint De-Identification and Concept Extraction in the Clinical Domain
Lukas Lange | Heike Adel | Jannik Strötgen

Exploiting natural language processing in the clinical domain requires de-identification, i.e., anonymization of personal information in texts. However, current research considers de-identification and downstream tasks, such as concept extraction, only in isolation and does not study the effects of de-identification on other tasks. In this paper, we close this gap by reporting concept extraction performance on automatically anonymized data and investigating joint models for de-identification and concept extraction. In particular, we propose a stacked model with restricted access to privacy sensitive information and a multitask model. We set the new state of the art on benchmark datasets in English (96.1 % F1 for de-identification and 88.9 % F1 for concept extraction) and Spanish (91.4 % F1 for concept extraction).

pdf bib
Estimating predictive uncertainty for rumour verification models
Elena Kochkina | Maria Liakata

The inability to correctly resolve rumours circulating online can have harmful real-world consequences. We present a method for incorporating model and data uncertainty estimates into natural language processing models for automatic rumour verification. We show that these estimates can be used to filter out model predictions likely to be erroneous so that these difficult instances can be prioritised by a human fact-checker. We propose two methods for uncertainty-based instance rejection, supervised and unsupervised. We also show how uncertainty estimates can be used to interpret model performance as a rumour unfolds.

pdf bib
Language to Network : Conditional Parameter Adaptation with Natural Language Descriptions
Tian Jin | Zhun Liu | Shengjia Yan | Alexandre Eichenberger | Louis-Philippe Morency

Transfer learning using ImageNet pre-trained models has been the de facto approach in a wide range of computer vision tasks. However, fine-tuning still requires task-specific training data. In this paper, we propose N3 (Neural Networks from Natural Language)-a new paradigm of synthesizing task-specific neural networks from language descriptions and a generic pre-trained model. N3 leverages language descriptions to generate parameter adaptations as well as a new task-specific classification layer for a pre-trained neural network, effectively fine-tuning the network for a new task using only language descriptions as input. To the best of our knowledge, N3 is the first method to synthesize entire neural networks from natural language. Experimental results show that N3 can out-perform previous natural-language based zero-shot learning methods across 4 different zero-shot image classification benchmarks. We also demonstrate a simple method to help identify keywords in language descriptions leveraged by N3 when synthesizing model parameters.N3 (Neural Networks from Natural Language) - a new paradigm of synthesizing task-specific neural networks from language descriptions and a generic pre-trained model. N3 leverages language descriptions to generate parameter adaptations as well as a new task-specific classification layer for a pre-trained neural network, effectively “fine-tuning” the network for a new task using only language descriptions as input. To the best of our knowledge, N3 is the first method to synthesize entire neural networks from natural language. Experimental results show that N3 can out-perform previous natural-language based zero-shot learning methods across 4 different zero-shot image classification benchmarks. We also demonstrate a simple method to help identify keywords in language descriptions leveraged by N3 when synthesizing model parameters.

pdf bib
tBERT : Topic Models and BERT Joining Forces for Semantic Similarity DetectionBERT: Topic Models and BERT Joining Forces for Semantic Similarity Detection
Nicole Peinelt | Dong Nguyen | Maria Liakata

Semantic similarity detection is a fundamental task in natural language understanding. Adding topic information has been useful for previous feature-engineered semantic similarity models as well as neural models for other tasks. There is currently no standard way of combining topics with pretrained contextual representations such as BERT. We propose a novel topic-informed BERT-based architecture for pairwise semantic similarity detection and show that our model improves performance over strong neural baselines across a variety of English language datasets. We find that the addition of topics to BERT helps particularly with resolving domain-specific cases.

pdf bib
Out of the Echo Chamber : Detecting Countering Debate SpeechesDetecting Countering Debate Speeches
Matan Orbach | Yonatan Bilu | Assaf Toledo | Dan Lahav | Michal Jacovi | Ranit Aharonov | Noam Slonim

An educated and informed consumption of media content has become a challenge in modern times. With the shift from traditional news outlets to social media and similar venues, a major concern is that readers are becoming encapsulated in echo chambers and may fall prey to fake news and disinformation, lacking easy access to dissenting views. We suggest a novel task aiming to alleviate some of these concerns that of detecting articles that most effectively counter the arguments and not just the stance made in a given text. We study this problem in the context of debate speeches. Given such a speech, we aim to identify, from among a set of speeches on the same topic and with an opposing stance, the ones that directly counter it. We provide a large dataset of 3,685 such speeches (in English), annotated for this relation, which hopefully would be of general interest to the NLP community. We explore several algorithms addressing this task, and while some are successful, all fall short of expert human performance, suggesting room for further research. All data collected during this work is freely available for research.

pdf bib
KdConv : A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven ConversationKdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation
Hao Zhou | Chujie Zheng | Kaili Huang | Minlie Huang | Xiaoyan Zhu

The research of knowledge-driven conversational systems is largely limited due to the lack of dialog data which consists of multi-turn conversations on multiple topics and with knowledge annotations. In this paper, we propose a Chinese multi-domain knowledge-driven conversation dataset, KdConv, which grounds the topics in multi-turn conversations to knowledge graphs. Our corpus contains 4.5 K conversations from three domains (film, music, and travel), and 86 K utterances with an average turn number of 19.0. These conversations contain in-depth discussions on related topics and natural transition between multiple topics. To facilitate the following research on this corpus, we provide several benchmark models. Comparative results show that the models can be enhanced by introducing background knowledge, yet there is still a large space for leveraging knowledge to model multi-turn conversations for further research. Results also show that there are obvious performance differences between different domains, indicating that it is worth further explore transfer learning and domain adaptation. The corpus and benchmark models are publicly available.

pdf bib
Multi-Domain Dialogue Acts and Response Co-Generation
Kai Wang | Junfeng Tian | Rui Wang | Xiaojun Quan | Jianxing Yu

Generating fluent and informative responses is of critical importance for task-oriented dialogue systems. Existing pipeline approaches generally predict multiple dialogue acts first and use them to assist response generation. There are at least two shortcomings with such approaches. First, the inherent structures of multi-domain dialogue acts are neglected. Second, the semantic associations between acts and responses are not taken into account for response generation. To address these issues, we propose a neural co-generation model that generates dialogue acts and responses concurrently. Unlike those pipeline approaches, our act generation module preserves the semantic structures of multi-domain dialogue acts and our response generation module dynamically attends to different acts as needed. We train the two modules jointly using an uncertainty loss to adjust their task weights adaptively. Extensive experiments are conducted on the large-scale MultiWOZ dataset and the results show that our model achieves very favorable improvement over several state-of-the-art models in both automatic and human evaluations.

pdf bib
Neural Data-to-Text Generation via Jointly Learning the Segmentation and Correspondence
Xiaoyu Shen | Ernie Chang | Hui Su | Cheng Niu | Dietrich Klakow

The neural attention model has achieved great success in data-to-text generation tasks. Though usually excelling at producing fluent text, it suffers from the problem of information missing, repetition and hallucination. Due to the black-box nature of the neural attention architecture, avoiding these problems in a systematic way is non-trivial. To address this concern, we propose to explicitly segment target text into fragment units and align them with their data correspondences. The segmentation and correspondence are jointly learned as latent variables without any human annotations. We further impose a soft statistical constraint to regularize the segmental granularity. The resulting architecture maintains the same expressive power as neural attention models, while being able to generate fully interpretable outputs with several times less computational cost. On both E2E and WebNLG benchmarks, we show the proposed model consistently outperforms its neural attention counterparts.

pdf bib
Null It Out : Guarding Protected Attributes by Iterative Nullspace Projection
Shauli Ravfogel | Yanai Elazar | Hila Gonen | Michael Twiton | Yoav Goldberg

The ability to control for the kinds of information encoded in neural representation has a variety of use cases, especially in light of the challenge of interpreting these models. We present Iterative Null-space Projection (INLP), a novel method for removing information from neural representations. Our method is based on repeated training of linear classifiers that predict a certain property we aim to remove, followed by projection of the representations on their null-space. By doing so, the classifiers become oblivious to that target property, making it hard to linearly separate the data according to it. While applicable for multiple uses, we evaluate our method on bias and fairness use-cases, and show that our method is able to mitigate bias in word embeddings, as well as to increase fairness in a setting of multi-class classification.

pdf bib
Multi-source Meta Transfer for Low Resource Multiple-Choice Question Answering
Ming Yan | Hao Zhang | Di Jin | Joey Tianyi Zhou

Multiple-choice question answering (MCQA) is one of the most challenging tasks in machine reading comprehension since it requires more advanced reading comprehension skills such as logical reasoning, summarization, and arithmetic operations. Unfortunately, most existing MCQA datasets are small in size, which increases the difficulty of model learning and generalization. To address this challenge, we propose a multi-source meta transfer (MMT) for low-resource MCQA. In this framework, we first extend meta learning by incorporating multiple training sources to learn a generalized feature representation across domains. To bridge the distribution gap between training sources and the target, we further introduce the meta transfer that can be integrated into the multi-source meta training. More importantly, the proposed MMT is independent of backbone language models. Extensive experiments demonstrate the superiority of MMT over state-of-the-arts, and continuous improvements can be achieved on different backbone networks on both supervised and unsupervised domain adaptation settings.

pdf bib
Premise Selection in Natural Language Mathematical Texts
Deborah Ferreira | André Freitas

The discovery of supporting evidence for addressing complex mathematical problems is a semantically challenging task, which is still unexplored in the field of natural language processing for mathematical text. The natural language premise selection task consists in using conjectures written in both natural language and mathematical formulae to recommend premises that most likely will be useful to prove a particular statement. We propose an approach to solve this task as a link prediction problem, using Deep Convolutional Graph Neural Networks. This paper also analyses how different baselines perform in this task and shows that a graph structure can provide higher F1-score, especially when considering multi-hop premise selection.

pdf bib
A Call for More Rigor in Unsupervised Cross-lingual Learning
Mikel Artetxe | Sebastian Ruder | Dani Yogatama | Gorka Labaka | Eneko Agirre

We review motivations, definition, approaches, and methodology for unsupervised cross-lingual learning and call for a more rigorous position in each of them. An existing rationale for such research is based on the lack of parallel data for many of the world’s languages. However, we argue that a scenario without any parallel data and abundant monolingual data is unrealistic in practice. We also discuss different training signals that have been used in previous work, which depart from the pure unsupervised setting. We then describe common methodological issues in tuning and evaluation of unsupervised cross-lingual models and present best practices. Finally, we provide a unified outlook for different types of research in this area (i.e., cross-lingual word embeddings, deep multilingual pretraining, and unsupervised machine translation) and argue for comparable evaluation of these models.

pdf bib
Shape of Synth to Come : Why We Should Use Synthetic Data for English Surface RealizationShape of Synth to Come: Why We Should Use Synthetic Data for English Surface Realization
Henry Elder | Robert Burke | Alexander O’Connor | Jennifer Foster

The Surface Realization Shared Tasks of 2018 and 2019 were Natural Language Generation shared tasks with the goal of exploring approaches to surface realization from Universal-Dependency-like trees to surface strings for several languages. In the 2018 shared task there was very little difference in the absolute performance of systems trained with and without additional, synthetically created data, and a new rule prohibiting the use of synthetic data was introduced for the 2019 shared task. Contrary to the findings of the 2018 shared task, we show, in experiments on the English 2018 dataset, that the use of synthetic data can have a substantial positive effect an improvement of almost 8 BLEU points for a previously state-of-the-art system. We analyse the effects of synthetic data, and we argue that its use should be encouraged rather than prohibited so that future research efforts continue to explore systems that can take advantage of such data.

pdf bib
Toward Better Storylines with Sentence-Level Language Models
Daphne Ippolito | David Grangier | Douglas Eck | Chris Callison-Burch

We propose a sentence-level language model which selects the next sentence in a story from a finite set of fluent alternatives. Since it does not need to model fluency, the sentence-level language model can focus on longer range dependencies, which are crucial for multi-sentence coherence. Rather than dealing with individual words, our method treats the story so far as a list of pre-trained sentence embeddings and predicts an embedding for the next sentence, which is more efficient than predicting word embeddings. Notably this allows us to consider a large number of candidates for the next sentence during training. We demonstrate the effectiveness of our approach with state-of-the-art accuracy on the unsupervised Story Cloze task and with promising results on larger-scale next sentence prediction tasks.

pdf bib
A Two-Step Approach for Implicit Event Argument Detection
Zhisong Zhang | Xiang Kong | Zhengzhong Liu | Xuezhe Ma | Eduard Hovy

In this work, we explore the implicit event argument detection task, which studies event arguments beyond sentence boundaries. The addition of cross-sentence argument candidates imposes great challenges for modeling. To reduce the number of candidates, we adopt a two-step approach, decomposing the problem into two sub-problems : argument head-word detection and head-to-span expansion. Evaluated on the recent RAMS dataset (Ebner et al., 2020), our model achieves overall better performance than a strong sequence labeling baseline. We further provide detailed error analysis, presenting where the model mainly makes errors and indicating directions for future improvements. It remains a challenge to detect implicit arguments, calling for more future work of document-level modeling for this task.

pdf bib
Machine Reading of Historical Events
Or Honovich | Lucas Torroba Hennigen | Omri Abend | Shay B. Cohen

Machine reading is an ambitious goal in NLP that subsumes a wide range of text understanding capabilities. Within this broad framework, we address the task of machine reading the time of historical events, compile datasets for the task, and develop a model for tackling it. Given a brief textual description of an event, we show that good performance can be achieved by extracting relevant sentences from Wikipedia, and applying a combination of task-specific and general-purpose feature embeddings for the classification. Furthermore, we establish a link between the historical event ordering task and the event focus time task from the information retrieval literature, showing they also provide a challenging test case for machine reading algorithms.

pdf bib
Improving Disentangled Text Representation Learning with Information-Theoretic Guidance
Pengyu Cheng | Martin Renqiang Min | Dinghan Shen | Christopher Malon | Yizhe Zhang | Yitong Li | Lawrence Carin

Learning disentangled representations of natural language is essential for many NLP tasks, e.g., conditional text generation, style transfer, personalized dialogue systems, etc. Similar problems have been studied extensively for other forms of data, such as images and videos. However, the discrete nature of natural language makes the disentangling of textual representations more challenging (e.g., the manipulation over the data space can not be easily achieved). Inspired by information theory, we propose a novel method that effectively manifests disentangled representations of text, without any supervision on semantics. A new mutual information upper bound is derived and leveraged to measure dependence between style and content. By minimizing this upper bound, the proposed method induces style and content embeddings into two independent low-dimensional spaces. Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representation in terms of content and style preservation.

pdf bib
Good-Enough Compositional Data Augmentation
Jacob Andreas

We propose a simple data augmentation protocol aimed at providing a compositional inductive bias in conditional and unconditional sequence models. Under this protocol, synthetic training examples are constructed by taking real training examples and replacing (possibly discontinuous) fragments with other fragments that appear in at least one similar environment. The protocol is model-agnostic and useful for a variety of tasks. Applied to neural sequence-to-sequence models, it reduces error rate by as much as 87 % on diagnostic tasks from the SCAN dataset and 16 % on a semantic parsing task. Applied to n-gram language models, it reduces perplexity by roughly 1 % on small corpora in several languages.

pdf bib
Towards Open Domain Event Trigger Identification using Adversarial Domain Adaptation
Aakanksha Naik | Carolyn Rose

We tackle the task of building supervised event trigger identification models which can generalize better across domains. Our work leverages the adversarial domain adaptation (ADA) framework to introduce domain-invariance. ADA uses adversarial training to construct representations that are predictive for trigger identification, but not predictive of the example’s domain. It requires no labeled data from the target domain, making it completely unsupervised. Experiments with two domains (English literature and news) show that ADA leads to an average F1 score improvement of 3.9 on out-of-domain data. Our best performing model (BERT-A) reaches 44-49 F1 across both domains, using no labeled target data. Preliminary experiments reveal that finetuning on 1 % labeled data, followed by self-training leads to substantial improvement, reaching 51.5 and 67.2 F1 on literature and news respectively.

pdf bib
Learning Web-based Procedures by Reasoning over Explanations and Demonstrations in Context
Shashank Srivastava | Oleksandr Polozov | Nebojsa Jojic | Christopher Meek

We explore learning web-based tasks from a human teacher through natural language explanations and a single demonstration. Our approach investigates a new direction for semantic parsing that models explaining a demonstration in a context, rather than mapping explanations to demonstrations. By leveraging the idea of inverse semantics from program synthesis to reason backwards from observed demonstrations, we ensure that all considered interpretations are consistent with executable actions in any context, thus simplifying the problem of search over logical forms. We present a dataset of explanations paired with demonstrations for web-based tasks. Our methods show better task completion rates than a supervised semantic parsing baseline (40 % relative improvement on average), and are competitive with simple exploration-and-demonstration based methods, while requiring no exploration of the environment. In learning to align explanations with demonstrations, basic properties of natural language syntax emerge as learned behavior. This is an interesting example of pragmatic language acquisition without any linguistic annotation.

pdf bib
In Neural Machine Translation, What Does Transfer Learning Transfer?
Alham Fikri Aji | Nikolay Bogoychev | Kenneth Heafield | Rico Sennrich

Transfer learning improves quality for low-resource machine translation, but it is unclear what exactly it transfers. We perform several ablation studies that limit information transfer, then measure the quality impact across three language pairs to gain a black-box understanding of transfer learning. Word embeddings play an important role in transfer learning, particularly if they are properly aligned. Although transfer learning can be performed without embeddings, results are sub-optimal. In contrast, transferring only the embeddings but nothing else yields catastrophic results. We then investigate diagonal alignments with auto-encoders over real languages and randomly generated sequences, finding even randomly generated sequences as parents yield noticeable but smaller gains. Finally, transfer learning can eliminate the need for a warm-up phase when training transformer models in high resource language pairs.

pdf bib
Learning a Multi-Domain Curriculum for Neural Machine Translation
Wei Wang | Ye Tian | Jiquan Ngiam | Yinfei Yang | Isaac Caswell | Zarana Parekh

Most data selection research in machine translation focuses on improving a single domain. We perform data selection for multiple domains at once. This is achieved by carefully introducing instance-level domain-relevance features and automatically constructing a training curriculum to gradually concentrate on multi-domain relevant and noise-reduced data batches. Both the choice of features and the use of curriculum are crucial for balancing and improving all domains, including out-of-domain. In large-scale experiments, the multi-domain curriculum simultaneously reaches or outperforms the individual performance and brings solid gains over no-curriculum training.

pdf bib
Reducing Gender Bias in Neural Machine Translation as a Domain Adaptation Problem
Danielle Saunders | Bill Byrne

Training data for NLP tasks often exhibits gender bias in that fewer sentences refer to women than to men. In Neural Machine Translation (NMT) gender bias has been shown to reduce translation quality, particularly when the target language has grammatical gender. The recent WinoMT challenge set allows us to measure this effect directly (Stanovsky et al, 2019) Ideally we would reduce system bias by simply debiasing all data prior to training, but achieving this effectively is itself a challenge. Rather than attempt to create a ‘balanced’ dataset, we use transfer learning on a small set of trusted, gender-balanced examples. This approach gives strong and consistent improvements in gender debiasing with much less computational cost than training from scratch. A known pitfall of transfer learning on new domains is ‘catastrophic forgetting’, which we address at adaptation and inference time. During adaptation we show that Elastic Weight Consolidation allows a performance trade-off between general translation quality and bias reduction. At inference time we propose a lattice-rescoring scheme which outperforms all systems evaluated in Stanovsky et al, 2019 on WinoMT with no degradation of general test set BLEU. We demonstrate our approach translating from English into three languages with varied linguistic properties and data availability.

pdf bib
Variational Neural Machine Translation with Normalizing Flows
Hendra Setiawan | Matthias Sperber | Udhyakumar Nallasamy | Matthias Paulik

Variational Neural Machine Translation (VNMT) is an attractive framework for modeling the generation of target translations, conditioned not only on the source sentence but also on some latent random variables. The latent variable modeling may introduce useful statistical dependencies that can improve translation accuracy. Unfortunately, learning informative latent variables is non-trivial, as the latent space can be prohibitively large, and the latent codes are prone to be ignored by many translation models at training time. Previous works impose strong assumptions on the distribution of the latent code and limit the choice of the NMT architecture. In this paper, we propose to apply the VNMT framework to the state-of-the-art Transformer and introduce a more flexible approximate posterior based on normalizing flows. We demonstrate the efficacy of our proposal under both in-domain and out-of-domain conditions, significantly outperforming strong baselines.

pdf bib
The Paradigm Discovery Problem
Alexander Erdmann | Micha Elsner | Shijie Wu | Ryan Cotterell | Nizar Habash

This work treats the paradigm discovery problem (PDP), the task of learning an inflectional morphological system from unannotated sentences. We formalize the PDP and develop evaluation metrics for judging systems. Using currently available resources, we construct datasets for the task. We also devise a heuristic benchmark for the PDP and report empirical results on five diverse languages. Our benchmark system first makes use of word embeddings and string similarity to cluster forms by cell and by paradigm. Then, we bootstrap a neural transducer on top of the clustered data to predict words to realize the empty paradigm slots. An error analysis of our system suggests clustering by cell across different inflection classes is the most pressing challenge for future work.

pdf bib
Automated Evaluation of Writing 50 Years and Counting
Beata Beigman Klebanov | Nitin Madnani

In this theme paper, we focus on Automated Writing Evaluation (AWE), using Ellis Page’s seminal 1966 paper to frame the presentation. We discuss some of the current frontiers in the field and offer some thoughts on the emergent uses of this technology.

pdf bib
Returning the N to NLP : Towards Contextually Personalized Classification ModelsN to NLP: Towards Contextually Personalized Classification Models
Lucie Flek

Most NLP models today treat language as universal, even though socio- and psycholingustic research shows that the communicated message is influenced by the characteristics of the speaker as well as the target audience. This paper surveys the landscape of personalization in natural language processing and related fields, and offers a path forward to mitigate the decades of deviation of the NLP tools from sociolingustic findings, allowing to flexibly process the natural language of each user rather than enforcing a uniform NLP treatment. It outlines a possible direction to incorporate these aspects into neural NLP models by means of socially contextual personalization, and proposes to shift the focus of our evaluation strategies accordingly.

pdf bib
To Test Machine Comprehension, Start by Defining Comprehension
Jesse Dunietz | Greg Burnham | Akash Bharadwaj | Owen Rambow | Jennifer Chu-Carroll | Dave Ferrucci

Many tasks aim to measure machine reading comprehension (MRC), often focusing on question types presumed to be difficult. Rarely, however, do task designers start by considering what systems should in fact comprehend. In this paper we make two key contributions. First, we argue that existing approaches do not adequately define comprehension ; they are too unsystematic about what content is tested. Second, we present a detailed definition of comprehensiona Template of Understandingfor a widely useful class of texts, namely short narratives. We then conduct an experiment that strongly suggests existing systems are not up to the task of narrative understanding as we define it.

pdf bib
One Size Does Not Fit All : Generating and Evaluating Variable Number of Keyphrases
Xingdi Yuan | Tong Wang | Rui Meng | Khushboo Thaker | Peter Brusilovsky | Daqing He | Adam Trischler

Different texts shall by nature correspond to different number of keyphrases. This desideratum is largely missing from existing neural keyphrase generation models. In this study, we address this problem from both modeling and evaluation perspectives. We first propose a recurrent generative model that generates multiple keyphrases as delimiter-separated sequences. Generation diversity is further enhanced with two novel techniques by manipulating decoder hidden states. In contrast to previous approaches, our model is capable of generating diverse keyphrases and controlling number of outputs. We further propose two evaluation metrics tailored towards the variable-number generation. We also introduce a new dataset StackEx that expands beyond the only existing genre (i.e., academic writing) in keyphrase generation tasks. With both previous and new evaluation metrics, our model outperforms strong baselines on all datasets.

pdf bib
Structural Information Preserving for Graph-to-Text Generation
Linfeng Song | Ante Wang | Jinsong Su | Yue Zhang | Kun Xu | Yubin Ge | Dong Yu

The task of graph-to-text generation aims at producing sentences that preserve the meaning of input graphs. As a crucial defect, the current state-of-the-art models may mess up or even drop the core structural information of input graphs when generating outputs. We propose to tackle this problem by leveraging richer training signals that can guide our model for preserving input information. In particular, we introduce two types of autoencoding losses, each individually focusing on different aspects (a.k.a. views) of input graphs. The losses are then back-propagated to better calibrate our model via multi-task training. Experiments on two benchmarks for graph-to-text generation show the effectiveness of our approach over a state-of-the-art baseline.

pdf bib
A Joint Neural Model for Information Extraction with Global Features
Ying Lin | Heng Ji | Fei Huang | Lingfei Wu

Most existing joint neural models for Information Extraction (IE) use local task-specific classifiers to predict labels for individual instances (e.g., trigger, relation) regardless of their interactions. For example, a victim of a die event is likely to be a victim of an attack event in the same sentence. In order to capture such cross-subtask and cross-instance inter-dependencies, we propose a joint neural framework, OneIE, that aims to extract the globally optimal IE result as a graph from an input sentence. OneIE performs end-to-end IE in four stages : (1) Encoding a given sentence as contextualized word representations ; (2) Identifying entity mentions and event triggers as nodes ; (3) Computing label scores for all nodes and their pairwise links using local classifiers ; (4) Searching for the globally optimal graph with a beam decoder. At the decoding stage, we incorporate global features to capture the cross-subtask and cross-instance interactions. Experiments show that adding global features improves the performance of our model and achieves new state of-the-art on all subtasks. In addition, as OneIE does not use any language-specific feature, we prove it can be easily applied to new languages or trained in a multilingual manner.

pdf bib
From English to Code-Switching : Transfer Learning with Strong Morphological CluesEnglish to Code-Switching: Transfer Learning with Strong Morphological Clues
Gustavo Aguilar | Thamar Solorio

Linguistic Code-switching (CS) is still an understudied phenomenon in natural language processing. The NLP community has mostly focused on monolingual and multi-lingual scenarios, but little attention has been given to CS in particular. This is partly because of the lack of resources and annotated data, despite its increasing occurrence in social media platforms. In this paper, we aim at adapting monolingual models to code-switched text in various tasks. Specifically, we transfer English knowledge from a pre-trained ELMo model to different code-switched language pairs (i.e., Nepali-English, Spanish-English, and Hindi-English) using the task of language identification. Our method, CS-ELMo, is an extension of ELMo with a simple yet effective position-aware attention mechanism inside its character convolutions. We show the effectiveness of this transfer learning step by outperforming multilingual BERT and homologous CS-unaware ELMo models and establishing a new state of the art in CS tasks, such as NER and POS tagging. Our technique can be expanded to more English-paired code-switched languages, providing more resources to the CS community.

pdf bib
Learning Interpretable Relationships between Entities, Relations and Concepts via Bayesian Structure Learning on Open Domain FactsBayesian Structure Learning on Open Domain Facts
Jingyuan Zhang | Mingming Sun | Yue Feng | Ping Li

Concept graphs are created as universal taxonomies for text understanding in the open-domain knowledge. The nodes in concept graphs include both entities and concepts. The edges are from entities to concepts, showing that an entity is an instance of a concept. In this paper, we propose the task of learning interpretable relationships from open-domain facts to enrich and refine concept graphs. The Bayesian network structures are learned from open-domain facts as the interpretable relationships between relations of facts and concepts of entities. We conduct extensive experiments on public English and Chinese datasets. Compared to the state-of-the-art methods, the learned network structures help improving the identification of concepts for entities based on the relations of entities on both datasets.

pdf bib
Sources of Transfer in Multilingual Named Entity Recognition
David Mueller | Nicholas Andrews | Mark Dredze

Named-entities are inherently multilingual, and annotations in any given language may be limited. This motivates us to consider polyglot named-entity recognition (NER), where one model is trained using annotated data drawn from more than one language. However, a straightforward implementation of this simple idea does not always work in practice : naive training of NER models using annotated data drawn from multiple languages consistently underperforms models trained on monolingual data alone, despite having access to more training data. The starting point of this paper is a simple solution to this problem, in which polyglot models are fine-tuned on monolingual data to consistently and significantly outperform their monolingual counterparts. To explain this phenomena, we explore the sources of multilingual transfer in polyglot NER models and examine the weight structure of polyglot models compared to their monolingual counterparts. We find that polyglot models efficiently share many parameters across languages and that fine-tuning may utilize a large number of those parameters.polyglot named-entity recognition (NER), where one model is trained using annotated data drawn from more than one language. However, a straightforward implementation of this simple idea does not always work in practice: naive training of NER models using annotated data drawn from multiple languages consistently underperforms models trained on monolingual data alone, despite having access to more training data. The starting point of this paper is a simple solution to this problem, in which polyglot models are fine-tuned on monolingual data to consistently and significantly outperform their monolingual counterparts. To explain this phenomena, we explore the sources of multilingual transfer in polyglot NER models and examine the weight structure of polyglot models compared to their monolingual counterparts. We find that polyglot models efficiently share many parameters across languages and that fine-tuning may utilize a large number of those parameters.

pdf bib
ZeroShotCeres : Zero-Shot Relation Extraction from Semi-Structured WebpagesZeroShotCeres: Zero-Shot Relation Extraction from Semi-Structured Webpages
Colin Lockard | Prashant Shiralkar | Xin Luna Dong | Hannaneh Hajishirzi

In many documents, such as semi-structured webpages, textual semantics are augmented with additional information conveyed using visual elements including layout, font size, and color. Prior work on information extraction from semi-structured websites has required learning an extraction model specific to a given template via either manually labeled or distantly supervised data from that template. In this work, we propose a solution for zero-shot open-domain relation extraction from webpages with a previously unseen template, including from websites with little overlap with existing sources of knowledge for distant supervision and websites in entirely new subject verticals. Our model uses a graph neural network-based approach to build a rich representation of text fields on a webpage and the relationships between them, enabling generalization to new templates. Experiments show this approach provides a 31 % F1 gain over a baseline for zero-shot extraction in a new subject vertical.

pdf bib
A Prioritization Model for Suicidality Risk Assessment
Han-Chin Shing | Philip Resnik | Douglas Oard

We reframe suicide risk assessment from social media as a ranking problem whose goal is maximizing detection of severely at-risk individuals given the time available. Building on measures developed for resource-bounded document retrieval, we introduce a well founded evaluation paradigm, and demonstrate using an expert-annotated test collection that meaningful improvements over plausible cascade model baselines can be achieved using an approach that jointly ranks individuals and their social media posts.

pdf bib
History for Visual Dialog : Do we really need it?
Shubham Agarwal | Trung Bui | Joon-Young Lee | Ioannis Konstas | Verena Rieser

Visual Dialogue involves understanding the dialogue history (what has been discussed previously) and the current question (what is asked), in addition to grounding information in the image, to accurately generate the correct response. In this paper, we show that co-attention models which explicitly encode dialoh history outperform models that do n’t, achieving state-of-the-art performance (72 % NDCG on val set). However, we also expose shortcomings of the crowdsourcing dataset collection procedure, by showing that dialogue history is indeed only required for a small amount of the data, and that the current evaluation metric encourages generic replies. To that end, we propose a challenging subset (VisdialConv) of the VisdialVal set and the benchmark NDCG of 63 %.

pdf bib
Frugal Paradigm Completion
Alexander Erdmann | Tom Kenter | Markus Becker | Christian Schallhart

Lexica distinguishing all morphologically related forms of each lexeme are crucial to many language technologies, yet building them is expensive. We propose a frugal paradigm completion approach that predicts all related forms in a morphological paradigm from as few manually provided forms as possible. It induces typological information during training which it uses to determine the best sources at test time. We evaluate our language-agnostic approach on 7 diverse languages. Compared to popular alternative approaches, ours reduces manual labor by 16-63 % and is the most robust to typological variation.

pdf bib
Improving Chinese Word Segmentation with Wordhood Memory NetworksChinese Word Segmentation with Wordhood Memory Networks
Yuanhe Tian | Yan Song | Fei Xia | Tong Zhang | Yonggang Wang

Contextual features always play an important role in Chinese word segmentation (CWS). Wordhood information, being one of the contextual features, is proved to be useful in many conventional character-based segmenters. However, this feature receives less attention in recent neural models and it is also challenging to design a framework that can properly integrate wordhood information from different wordhood measures to existing neural frameworks. In this paper, we therefore propose a neural framework, WMSeg, which uses memory networks to incorporate wordhood information with several popular encoder-decoder combinations for CWS. Experimental results on five benchmark datasets indicate the memory mechanism successfully models wordhood information for neural segmenters and helps WMSeg achieve state-of-the-art performance on all those datasets. Further experiments and analyses also demonstrate the robustness of our proposed framework with respect to different wordhood measures and the efficiency of wordhood information in cross-domain experiments.

pdf bib
Phonetic and Visual Priors for Decipherment of Informal RomanizationRomanization
Maria Ryskina | Matthew R. Gormley | Taylor Berg-Kirkpatrick

Informal romanization is an idiosyncratic process used by humans in informal digital communication to encode non-Latin script languages into Latin character sets found on common keyboards. Character substitution choices differ between users but have been shown to be governed by the same main principles observed across a variety of languagesnamely, character pairs are often associated through phonetic or visual similarity. We propose a noisy-channel WFST cascade model for deciphering the original non-Latin script from observed romanized text in an unsupervised fashion. We train our model directly on romanized data from two languages : Egyptian Arabic and Russian. We demonstrate that adding inductive bias through phonetic and visual priors on character mappings substantially improves the model’s performance on both languages, yielding results much closer to the supervised skyline. Finally, we introduce a new dataset of romanized Russian, collected from a Russian social network website and partially annotated for our experiments.

pdf bib
Active Learning for Coreference Resolution using Discrete Annotation
Belinda Z. Li | Gabriel Stanovsky | Luke Zettlemoyer

We improve upon pairwise annotation for active learning in coreference resolution, by asking annotators to identify mention antecedents if a presented mention pair is deemed not coreferent. This simple modification, when combined with a novel mention clustering algorithm for selecting which examples to label, is much more efficient in terms of the performance obtained per annotation budget. In experiments with existing benchmark coreference datasets, we show that the signal from this additional question leads to significant performance gains per human-annotation hour. Future work can use our annotation protocol to effectively develop coreference models for new domains. Our code is publicly available.

pdf bib
Beyond Possession Existence : Duration and Co-Possession
Dhivya Chinnappa | Srikala Murugan | Eduardo Blanco

This paper introduces two tasks : determining (a) the duration of possession relations and (b) co-possessions, i.e., whether multiple possessors possess a possessee at the same time. We present new annotations on top of corpora annotating possession existence and experimental results. Regarding possession duration, we derive the time spans we work with empirically from annotations indicating lower and upper bounds. Regarding co-possessions, we use a binary label. Cohen’s kappa coefficients indicate substantial agreement, and experimental results show that text is more useful than the image for solving these tasks.

pdf bib
Do n’t Stop Pretraining : Adapt Language Models to Domains and Tasks
Suchin Gururangan | Ana Marasović | Swabha Swayamdipta | Kyle Lo | Iz Beltagy | Doug Downey | Noah A. Smith

Language models pretrained on text from a wide variety of sources form the foundation of today’s NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task’s unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multi-phase adaptive pretraining offers large gains in task performance.

pdf bib
Exploring Unexplored Generalization Challenges for Cross-Database Semantic Parsing
Alane Suhr | Ming-Wei Chang | Peter Shaw | Kenton Lee

We study the task of cross-database semantic parsing (XSP), where a system that maps natural language utterances to executable SQL queries is evaluated on databases unseen during training. Recently, several datasets, including Spider, were proposed to support development of XSP systems. We propose a challenging evaluation setup for cross-database semantic parsing, focusing on variation across database schemas and in-domain language use. We re-purpose eight semantic parsing datasets that have been well-studied in the setting where in-domain training data is available, and instead use them as additional evaluation data for XSP systems instead. We build a system that performs well on Spider, and find that it struggles to generalize to our re-purposed set. Our setup uncovers several generalization challenges for cross-database semantic parsing, demonstrating the need to use and develop diverse training and evaluation datasets.

pdf bib
Predicting the Focus of Negation : Model and Error Analysis
Md Mosharaf Hossain | Kathleen Hamilton | Alexis Palmer | Eduardo Blanco

The focus of a negation is the set of tokens intended to be negated, and a key component for revealing affirmative alternatives to negated utterances. In this paper, we experiment with neural networks to predict the focus of negation. Our main novelty is leveraging a scope detector to introduce the scope of negation as an additional input to the network. Experimental results show that doing so obtains the best results to date. Additionally, we perform a detailed error analysis providing insights into the main error categories, and analyze errors depending on whether the model takes into account scope and context information.

pdf bib
Structured Tuning for Semantic Role Labeling
Tao Li | Parth Anand Jawale | Martha Palmer | Vivek Srikumar

Recent neural network-driven semantic role labeling (SRL) systems have shown impressive improvements in F1 scores. These improvements are due to expressive input representations, which, at least at the surface, are orthogonal to knowledge-rich constrained decoding mechanisms that helped linear SRL models. Introducing the benefits of structure to inform neural models presents a methodological challenge. In this paper, we present a structured tuning framework to improve models using softened constraints only at training time. Our framework leverages the expressiveness of neural networks and provides supervision with structured loss components. We start with a strong baseline (RoBERTa) to validate the impact of our approach, and show that our framework outperforms the baseline by learning to comply with declarative constraints. Additionally, our experiments with smaller training sizes show that we can achieve consistent improvements under low-resource scenarios.

pdf bib
Universal Decompositional Semantic Parsing
Elias Stengel-Eskin | Aaron Steven White | Sheng Zhang | Benjamin Van Durme

We introduce a transductive model for parsing into Universal Decompositional Semantics (UDS) representations, which jointly learns to map natural language utterances into UDS graph structures and annotate the graph with decompositional semantic attribute scores. We also introduce a strong pipeline model for parsing into the UDS graph structure, and show that our transductive parser performs comparably while additionally performing attribute prediction. By analyzing the attribute prediction errors, we find the model captures natural relationships between attribute groups.

pdf bib
Unsupervised Cross-lingual Representation Learning at Scale
Alexis Conneau | Kartikay Khandelwal | Naman Goyal | Vishrav Chaudhary | Guillaume Wenzek | Francisco Guzmán | Edouard Grave | Myle Ott | Luke Zettlemoyer | Veselin Stoyanov

This paper shows that pretraining multilingual language models at scale leads to significant performance gains for a wide range of cross-lingual transfer tasks. We train a Transformer-based masked language model on one hundred languages, using more than two terabytes of filtered CommonCrawl data. Our model, dubbed XLM-R, significantly outperforms multilingual BERT (mBERT) on a variety of cross-lingual benchmarks, including +14.6 % average accuracy on XNLI, +13 % average F1 score on MLQA, and +2.4 % F1 score on NER. XLM-R performs particularly well on low-resource languages, improving 15.7 % in XNLI accuracy for Swahili and 11.4 % for Urdu over previous XLM models. We also present a detailed empirical analysis of the key factors that are required to achieve these gains, including the trade-offs between (1) positive transfer and capacity dilution and (2) the performance of high and low resource languages at scale. Finally, we show, for the first time, the possibility of multilingual modeling without sacrificing per-language performance ; XLM-R is very competitive with strong monolingual models on the GLUE and XNLI benchmarks. We will make our code and models publicly available.

pdf bib
Multi-Domain Named Entity Recognition with Genre-Aware and Agnostic Inference
Jing Wang | Mayank Kulkarni | Daniel Preotiuc-Pietro

Named entity recognition is a key component of many text processing pipelines and it is thus essential for this component to be robust to different types of input. However, domain transfer of NER models with data from multiple genres has not been widely studied. To this end, we conduct NER experiments in three predictive setups on data from : a) multiple domains ; b) multiple domains where the genre label is unknown at inference time ; c) domains not encountered in training. We introduce a new architecture tailored to this task by using shared and private domain parameters and multi-task learning. This consistently outperforms all other baseline and competitive methods on all three experimental setups, with differences ranging between +1.95 to +3.11 average F1 across multiple genres when compared to standard approaches. These results illustrate the challenges that need to be taken into account when building real-world NLP applications that are robust to various types of text and the methods that can help, at least partially, alleviate these issues.

pdf bib
Addressing Posterior Collapse with Mutual Information for Improved Variational Neural Machine Translation
Arya D. McCarthy | Xian Li | Jiatao Gu | Ning Dong

This paper proposes a simple and effective approach to address the problem of posterior collapse in conditional variational autoencoders (CVAEs). It thus improves performance of machine translation models that use noisy or monolingual data, as well as in conventional settings. Extending Transformer and conditional VAEs, our proposed latent variable model measurably prevents posterior collapse by (1) using a modified evidence lower bound (ELBO) objective which promotes mutual information between the latent variable and the target, and (2) guiding the latent variable with an auxiliary bag-of-words prediction task. As a result, the proposed model yields improved translation quality compared to existing variational NMT models on WMT RoEn and DeEn. With latent variables being effectively utilized, our model demonstrates improved robustness over non-latent Transformer in handling uncertainty : exploiting noisy source-side monolingual data (up to +3.2 BLEU), and training with weakly aligned web-mined parallel data (up to +4.7 BLEU).

pdf bib
Evaluating Robustness to Input Perturbations for Neural Machine Translation
Xing Niu | Prashant Mathur | Georgiana Dinu | Yaser Al-Onaizan

Neural Machine Translation (NMT) models are sensitive to small perturbations in the input. Robustness to such perturbations is typically measured using translation quality metrics such as BLEU on the noisy input. This paper proposes additional metrics which measure the relative degradation and changes in translation when small perturbations are added to the input. We focus on a class of models employing subword regularization to address robustness and perform extensive evaluations of these models using the robustness measures proposed. Results show that our proposed metrics reveal a clear trend of improved robustness to perturbations when subword regularization methods are used.

pdf bib
Clinical Concept Linking with Contextualized Neural Representations
Elliot Schumacher | Andriy Mulyar | Mark Dredze

In traditional approaches to entity linking, linking decisions are based on three sources of information the similarity of the mention string to an entity’s name, the similarity of the context of the document to the entity, and broader information about the knowledge base (KB). In some domains, there is little contextual information present in the KB and thus we rely more heavily on mention string similarity. We consider one example of this, concept linking, which seeks to link mentions of medical concepts to a medical concept ontology. We propose an approach to concept linking that leverages recent work in contextualized neural models, such as ELMo (Peters et al. 2018), which create a token representation that integrates the surrounding context of the mention and concept name. We find a neural ranking approach paired with contextualized embeddings provides gains over a competitive baseline (Leaman et al. Additionally, we find that a pre-training step using synonyms from the ontology offers a useful initialization for the ranker.

pdf bib
NILE : Natural Language Inference with Faithful Natural Language ExplanationsNILE : Natural Language Inference with Faithful Natural Language Explanations
Sawan Kumar | Partha Talukdar

The recent growth in the popularity and success of deep learning models on NLP classification tasks has accompanied the need for generating some form of natural language explanation of the predicted labels. Such generated natural language (NL) explanations are expected to be faithful, i.e., they should correlate well with the model’s internal decision making. In this work, we focus on the task of natural language inference (NLI) and address the following question : can we build NLI systems which produce labels with high accuracy, while also generating faithful explanations of its decisions? We propose Natural-language Inference over Label-specific Explanations (NILE), a novel NLI method which utilizes auto-generated label-specific NL explanations to produce labels along with its faithful explanation. We demonstrate NILE’s effectiveness over previously reported methods through automated and human evaluation of the produced labels and explanations. Our evaluation of NILE also supports the claim that accurate systems capable of providing testable explanations of their decisions can be designed. We discuss the faithfulness of NILE’s explanations in terms of sensitivity of the decisions to the corresponding explanations. We argue that explicit evaluation of faithfulness, in addition to label and explanation accuracy, is an important step in evaluating model’s explanations. Further, we demonstrate that task-specific probes are necessary to establish such sensitivity.

pdf bib
Uncertain Natural Language Inference
Tongfei Chen | Zhengping Jiang | Adam Poliak | Keisuke Sakaguchi | Benjamin Van Durme

We introduce Uncertain Natural Language Inference (UNLI), a refinement of Natural Language Inference (NLI) that shifts away from categorical labels, targeting instead the direct prediction of subjective probability assessments. We demonstrate the feasibility of collecting annotations for UNLI by relabeling a portion of the SNLI dataset under a probabilistic scale, where items even with the same categorical label differ in how likely people judge them to be true given a premise. We describe a direct scalar regression modeling approach, and find that existing categorically-labeled NLI data can be used in pre-training. Our best models correlate well with humans, demonstrating models are capable of more subtle inferences than the categorical bin assignment employed in current NLI tasks.

pdf bib
Revisiting Higher-Order Dependency Parsers
Erick Fonseca | André F. T. Martins

Neural encoders have allowed dependency parsers to shift from higher-order structured models to simpler first-order ones, making decoding faster and still achieving better accuracy than non-neural parsers. This has led to a belief that neural encoders can implicitly encode structural constraints, such as siblings and grandparents in a tree. We tested this hypothesis and found that neural parsers may benefit from higher-order features, even when employing a powerful pre-trained encoder, such as BERT. While the gains of higher-order features are small in the presence of a powerful encoder, they are consistent for long-range dependencies and long sentences. In particular, higher-order models are more accurate on full sentence parses and on the exact match of modifier lists, indicating that they deal better with larger, more complex structures.

pdf bib
Treebank Embedding Vectors for Out-of-Domain Dependency Parsing
Joachim Wagner | James Barry | Jennifer Foster

A recent advance in monolingual dependency parsing is the idea of a treebank embedding vector, which allows all treebanks for a particular language to be used as training data while at the same time allowing the model to prefer training data from one treebank over others and to select the preferred treebank at test time. We build on this idea by 1) introducing a method to predict a treebank vector for sentences that do not come from a treebank used in training, and 2) exploring what happens when we move away from predefined treebank embedding vectors during test time and instead devise tailored interpolations. We show that 1) there are interpolated vectors that are superior to the predefined ones, and 2) treebank vectors can be predicted with sufficient accuracy, for nine out of ten test languages, to match the performance of an oracle approach that knows the most suitable predefined treebank embedding for the test set.


pdf (full)
bib (full)
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations

pdf bib
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations
Asli Celikyilmaz | Tsung-Hsien Wen

pdf bib
Xiaomingbot : A Multilingual Robot News ReporterXiaomingbot: A Multilingual Robot News Reporter
Runxin Xu | Jun Cao | Mingxuan Wang | Jiaze Chen | Hao Zhou | Ying Zeng | Yuping Wang | Li Chen | Xiang Yin | Xijin Zhang | Songcheng Jiang | Yuxuan Wang | Lei Li

This paper proposes the building of Xiaomingbot, an intelligent, multilingual and multimodal software robot equipped with four inte- gral capabilities : news generation, news translation, news reading and avatar animation. Its system summarizes Chinese news that it automatically generates from data tables. Next, it translates the summary or the full article into multiple languages, and reads the multi- lingual rendition through synthesized speech. Notably, Xiaomingbot utilizes a voice cloning technology to synthesize the speech trained from a real person’s voice data in one input language. The proposed system enjoys several merits : it has an animated avatar, and is able to generate and read multilingual news. Since it was put into practice, Xiaomingbot has written over 600,000 articles, and gained over 150,000 followers on social media platforms.

pdf bib
TextBrewer : An Open-Source Knowledge Distillation Toolkit for Natural Language ProcessingTextBrewer: An Open-Source Knowledge Distillation Toolkit for Natural Language Processing
Ziqing Yang | Yiming Cui | Zhipeng Chen | Wanxiang Che | Ting Liu | Shijin Wang | Guoping Hu

In this paper, we introduce TextBrewer, an open-source knowledge distillation toolkit designed for natural language processing. It works with different neural network models and supports various kinds of supervised learning tasks, such as text classification, reading comprehension, sequence labeling. TextBrewer provides a simple and uniform workflow that enables quick setting up of distillation experiments with highly flexible configurations. It offers a set of predefined distillation methods and can be extended with custom code. As a case study, we use TextBrewer to distill BERT on several typical NLP tasks. With simple configurations, we achieve results that are comparable with or even higher than the public distilled BERT models with similar numbers of parameters.

pdf bib
Personalized PageRank with Syntagmatic Information for Multilingual Word Sense DisambiguationPageRank with Syntagmatic Information for Multilingual Word Sense Disambiguation
Federico Scozzafava | Marco Maru | Fabrizio Brignone | Giovanni Torrisi | Roberto Navigli

Exploiting syntagmatic information is an encouraging research focus to be pursued in an effort to close the gap between knowledge-based and supervised Word Sense Disambiguation (WSD) performance. We follow this direction in our next-generation knowledge-based WSD system, SyntagRank, which we make available via a Web interface and a RESTful API. SyntagRank leverages the disambiguated pairs of co-occurring words included in SyntagNet, a lexical-semantic combination resource, to perform state-of-the-art knowledge-based WSD in a multilingual setting. Our service provides both a user-friendly interface, available at, and a RESTful endpoint to query the system programmatically (accessible at

pdf bib
Stanza : A Python Natural Language Processing Toolkit for Many Human LanguagesStanza: A Python Natural Language Processing Toolkit for Many Human Languages
Peng Qi | Yuhao Zhang | Yuhui Zhang | Jason Bolton | Christopher D. Manning

We introduce Stanza, an open-source Python natural language processing toolkit supporting 66 human languages. Compared to existing widely used toolkits, Stanza features a language-agnostic fully neural pipeline for text analysis, including tokenization, multi-word token expansion, lemmatization, part-of-speech and morphological feature tagging, dependency parsing, and named entity recognition. We have trained Stanza on a total of 112 datasets, including the Universal Dependencies treebanks and other multilingual corpora, and show that the same neural architecture generalizes well and achieves competitive performance on all languages tested. Additionally, Stanza includes a native Python interface to the widely used Java Stanford CoreNLP software, which further extends its functionality to cover other tasks such as coreference resolution and relation extraction. Source code, documentation, and pretrained models for 66 languages are available at

pdf bib
jiant : A Software Toolkit for Research on General-Purpose Text Understanding Models
Yada Pruksachatkun | Phil Yeres | Haokun Liu | Jason Phang | Phu Mon Htut | Alex Wang | Ian Tenney | Samuel R. Bowman

We introduce jiant, an open source toolkit for conducting multitask and transfer learning experiments on English NLU tasks. jiant enables modular and configuration driven experimentation with state-of-the-art models and a broad set of tasks for probing, transfer learning, and multitask training experiments. jiant implements over 50 NLU tasks, including all GLUE and SuperGLUE benchmark tasks. We demonstrate that jiant reproduces published performance on a variety of tasks and models, e.g., RoBERTa and BERT.

pdf bib
The Microsoft Toolkit of Multi-Task Deep Neural Networks for Natural Language UnderstandingMicrosoft Toolkit of Multi-Task Deep Neural Networks for Natural Language Understanding
Xiaodong Liu | Yu Wang | Jianshu Ji | Hao Cheng | Xueyun Zhu | Emmanuel Awa | Pengcheng He | Weizhu Chen | Hoifung Poon | Guihong Cao | Jianfeng Gao

We present MT-DNN, an open-source natural language understanding (NLU) toolkit that makes it easy for researchers and developers to train customized deep learning models. Built upon PyTorch and Transformers, MT-DNN is designed to facilitate rapid customization for a broad spectrum of NLU tasks, using a variety of objectives (classification, regression, structured prediction) and text encoders (e.g., RNNs, BERT, RoBERTa, UniLM). A unique feature of MT-DNN is its built-in support for robust and transferable learning using the adversarial multi-task learning paradigm. To enable efficient production deployment, MT-DNN supports multi-task knowledge distillation, which can substantially compress a deep neural model without significant performance drop. We demonstrate the effectiveness of MT-DNN on a wide range of NLU applications across general and biomedical domains. The software and pre-trained models will be publicly available at

pdf bib
ConvLab-2 : An Open-Source Toolkit for Building, Evaluating, and Diagnosing Dialogue SystemsConvLab-2: An Open-Source Toolkit for Building, Evaluating, and Diagnosing Dialogue Systems
Qi Zhu | Zheng Zhang | Yan Fang | Xiang Li | Ryuichi Takanobu | Jinchao Li | Baolin Peng | Jianfeng Gao | Xiaoyan Zhu | Minlie Huang

We present ConvLab-2, an open-source toolkit that enables researchers to build task-oriented dialogue systems with state-of-the-art models, perform an end-to-end evaluation, and diagnose the weakness of systems. As the successor of ConvLab, ConvLab-2 inherits ConvLab’s framework but integrates more powerful dialogue models and supports more datasets. Besides, we have developed an analysis tool and an interactive tool to assist researchers in diagnosing dialogue systems. The analysis tool presents rich statistics and summarizes common mistakes from simulated dialogues, which facilitates error analysis and system improvement. The interactive tool provides an user interface that allows developers to diagnose an assembled dialogue system by interacting with the system and modifying the output of each system component.

pdf bib
Label Noise in Context
Michael Desmond | Catherine Finegan-Dollak | Jeff Boston | Matt Arnold

Label noiseincorrectly or ambiguously labeled training examplescan negatively impact model performance. Although noise detection techniques have been around for decades, practitioners rarely apply them, as manual noise remediation is a tedious process. Examples incorrectly flagged as noise waste reviewers’ time, and correcting label noise without guidance can be difficult. We propose LNIC, a noise-detection method that uses an example’s neighborhood within the training set to (a) reduce false positives and (b) provide an explanation as to why the ex- ample was flagged as noise. We demonstrate on several short-text classification datasets that LNIC outperforms the state of the art on measures of precision and F0.5-score. We also show how LNIC’s training set context helps a reviewer to understand and correct label noise in a dataset. The LNIC tool lowers the barriers to label noise remediation, increasing its utility for NLP practitioners.

pdf bib
Photon : A Robust Cross-Domain Text-to-SQL SystemPhoton: A Robust Cross-Domain Text-to-SQL System
Jichuan Zeng | Xi Victoria Lin | Steven C.H. Hoi | Richard Socher | Caiming Xiong | Michael Lyu | Irwin King

Natural language interfaces to databases(NLIDB) democratize end user access to relational data. Due to fundamental differences between natural language communication and programming, it is common for end users to issue questions that are ambiguous to the system or fall outside the semantic scope of its underlying query language. We present PHOTON, a robust, modular, cross-domain NLIDB that can flag natural language input to which a SQL mapping can not be immediately determined. PHOTON consists of a strong neural semantic parser (63.2 % structure accuracy on the Spider dev benchmark), a human-in-the-loop question corrector, a SQL executor and a response generator. The question corrector isa discriminative neural sequence editor which detects confusion span(s) in the input question and suggests rephrasing until a translatable input is given by the user or a maximum number of iterations are conducted. Experiments on simulated data show that the proposed method effectively improves the robustness of text-to-SQL system against untranslatable user input. The live demo of our system is available at

pdf bib
NLP Scholar : An Interactive Visual Explorer for Natural Language Processing LiteratureNLP Scholar: An Interactive Visual Explorer for Natural Language Processing Literature
Saif M. Mohammad

As part of the NLP Scholar project, we created a single unified dataset of NLP papers and their meta-information (including citation numbers), by extracting and aligning information from the ACL Anthology and Google Scholar. In this paper, we describe several interconnected interactive visualizations (dashboards) that present various aspects of the data. Clicking on an item within a visualization or entering query terms in the search boxes filters the data in all visualizations in the dashboard. This allows users to search for papers in the area of their interest, published within specific time periods, published by specified authors, etc. The interactive visualizations presented here, and the associated dataset of papers mapped to citations, have additional uses as well including understanding how the field is growing (both overall and across sub-areas), as well as quantifying the impact of different types of papers on subsequent publications.

pdf bib
Prta : A System to Support the Analysis of Propaganda Techniques in the NewsPrta: A System to Support the Analysis of Propaganda Techniques in the News
Giovanni Da San Martino | Shaden Shaar | Yifan Zhang | Seunghak Yu | Alberto Barrón-Cedeño | Preslav Nakov

Recent events, such as the 2016 US Presidential Campaign, Brexit and the COVID-19 infodemic, have brought into the spotlight the dangers of online disinformation. There has been a lot of research focusing on fact-checking and disinformation detection. However, little attention has been paid to the specific rhetorical and psychological techniques used to convey propaganda messages. Revealing the use of such techniques can help promote media literacy and critical thinking, and eventually contribute to limiting the impact of fake news and disinformation campaigns. Prta (Propaganda Persuasion Techniques Analyzer) allows users to explore the articles crawled on a regular basis by highlighting the spans in which propaganda techniques occur and to compare them on the basis of their use of propaganda techniques. The system further reports statistics about the use of such techniques, overall and over time, or according to filtering criteria specified by the user based on time interval, keywords, and/or political orientation of the media. Moreover, it allows users to analyze any text or URL through a dedicated interface or via an API. The system is available online :

pdf bib
MMPE : A Multi-Modal Interface using Handwriting, Touch Reordering, and Speech Commands for Post-Editing Machine TranslationMMPE: A Multi-Modal Interface using Handwriting, Touch Reordering, and Speech Commands for Post-Editing Machine Translation
Nico Herbig | Santanu Pal | Tim Düwel | Kalliopi Meladaki | Mahsa Monshizadeh | Vladislav Hnatovskiy | Antonio Krüger | Josef van Genabith

The shift from traditional translation to post-editing (PE) of machine-translated (MT) text can save time and reduce errors, but it also affects the design of translation interfaces, as the task changes from mainly generating text to correcting errors within otherwise helpful translation proposals. Since this paradigm shift offers potential for modalities other than mouse and keyboard, we present MMPE, the first prototype to combine traditional input modes with pen, touch, and speech modalities for PE of MT. Users can directly cross out or hand-write new text, drag and drop words for reordering, or use spoken commands to update the text in place. All text manipulations are logged in an easily interpretable format to simplify subsequent translation process research. The results of an evaluation with professional translators suggest that pen and touch interaction are suitable for deletion and reordering tasks, while speech and multi-modal combinations of select & speech are considered suitable for replacements and insertions. Overall, experiment participants were enthusiastic about the new modalities and saw them as useful extensions to mouse & keyboard, but not as a complete substitute.

pdf bib
Conversation Learner-A Machine Teaching Tool for Building Dialog Managers for Task-Oriented Dialog SystemsConversation Learner - A Machine Teaching Tool for Building Dialog Managers for Task-Oriented Dialog Systems
Swadheen Shukla | Lars Liden | Shahin Shayandeh | Eslam Kamal | Jinchao Li | Matt Mazzola | Thomas Park | Baolin Peng | Jianfeng Gao

Traditionally, industry solutions for building a task-oriented dialog system have relied on helping dialog authors define rule-based dialog managers, represented as dialog flows. While dialog flows are intuitively interpretable and good for simple scenarios, they fall short of performance in terms of the flexibility needed to handle complex dialogs. On the other hand, purely machine-learned models can handle complex dialogs, but they are considered to be black boxes and require large amounts of training data. In this demonstration, we showcase Conversation Learner, a machine teaching tool for building dialog managers. It combines the best of both approaches by enabling dialog authors to create a dialog flow using familiar tools, converting the dialog flow into a parametric model (e.g., neural networks), and allowing dialog authors to improve the dialog manager (i.e., the parametric model) over time by leveraging user-system dialog logs as training data through a machine teaching interface.

pdf bib
SUPP.AI : finding evidence for supplement-drug interactionsSUPP.AI: finding evidence for supplement-drug interactions
Lucy Wang | Oyvind Tafjord | Arman Cohan | Sarthak Jain | Sam Skjonsberg | Carissa Schoenick | Nick Botner | Waleed Ammar

Dietary supplements are used by a large portion of the population, but information on their pharmacologic interactions is incomplete. To address this challenge, we present SUPP.AI, an application for browsing evidence of supplement-drug interactions (SDIs) extracted from the biomedical literature. We train a model to automatically extract supplement information and identify such interactions from the scientific literature. To address the lack of labeled data for SDI identification, we use labels of the closely related task of identifying drug-drug interactions (DDIs) for supervision. We fine-tune the contextualized word representations of the RoBERTa language model using labeled DDI data, and apply the fine-tuned model to identify supplement interactions. We extract 195k evidence sentences from 22 M articles (P=0.82, R=0.58, F1=0.68) for 60k interactions. We create the SUPP.AI application for users to search evidence sentences extracted by our model. SUPP.AI is an attempt to close the information gap on dietary supplements by making up-to-date evidence on SDIs more discoverable for researchers, clinicians, and consumers. An informational video on how to use SUPP.AI is available at :


pdf (full)
bib (full)
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

pdf bib
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop
Shruti Rijhwani | Jiangming Liu | Yizhong Wang | Rotem Dror

pdf bib
Adaptive Transformers for Learning Multimodal Representations
Prajjwal Bhargava

The usage of transformers has grown from learning about language semantics to forming meaningful visiolinguistic representations. These architectures are often over-parametrized, requiring large amounts of computation. In this work, we extend adaptive approaches to learn more about model interpretability and computational efficiency. Specifically, we study attention spans, sparse, and structured dropout methods to help understand how their attention mechanism extends for vision and language tasks. We further show that these approaches can help us learn more about how the network perceives the complexity of input sequences, sparsity preferences for different modalities, and other related phenomena.

pdf bib
Unsupervised Paraphasia Classification in Aphasic Speech
Sharan Pai | Nikhil Sachdeva | Prince Sachdeva | Rajiv Ratn Shah

Aphasia is a speech and language disorder which results from brain damage, often characterized by word retrieval deficit (anomia) resulting in naming errors (paraphasia). Automatic paraphasia detection has many benefits for both treatment and diagnosis of Aphasia and its type. But supervised learning methods ca nt be properly utilized as there is a lack of aphasic speech data. In this paper, we describe our novel unsupervised method which can be implemented without the need for labeled paraphasia data. Our evaluations show that our method outperforms previous work based on supervised learning and transfer learning approaches for English. We demonstrate the utility of our method as an essential first step in developing augmentative and alternative communication (AAC) devices for patients suffering from aphasia in any language.

pdf bib
Grammatical Error Correction Using Pseudo Learner Corpus Considering Learner’s Error Tendency
Yujin Takahashi | Satoru Katsumata | Mamoru Komachi

Recently, several studies have focused on improving the performance of grammatical error correction (GEC) tasks using pseudo data. However, a large amount of pseudo data are required to train an accurate GEC model. To address the limitations of language and computational resources, we assume that introducing pseudo errors into sentences similar to those written by the language learners is more efficient, rather than incorporating random pseudo errors into monolingual data. In this regard, we study the effect of pseudo data on GEC task performance using two approaches. First, we extract sentences that are similar to the learners’ sentences from monolingual data. Second, we generate realistic pseudo errors by considering error types that learners often make. Based on our comparative results, we observe that F0.5 scores for the Russian GEC task are significantly improved.

pdf bib
Media Bias, the Social Sciences, and NLP : Automating Frame Analyses to Identify Bias by Word Choice and LabelingNLP: Automating Frame Analyses to Identify Bias by Word Choice and Labeling
Felix Hamborg

Media bias can strongly impact the public perception of topics reported in the news. A difficult to detect, yet powerful form of slanted news coverage is called bias by word choice and labeling (WCL). WCL bias can occur, for example, when journalists refer to the same semantic concept by using different terms that frame the concept differently and consequently may lead to different assessments by readers, such as the terms freedom fighters and terrorists, or gun rights and gun control. In this research project, I aim to devise methods that identify instances of WCL bias and estimate the frames they induce, e.g., not only is terrorists of negative polarity but also ascribes to aggression and fear. To achieve this, I plan to research methods using natural language processing and deep learning while employing models and using analysis concepts from the social sciences, where researchers have studied media bias for decades. The first results indicate the effectiveness of this interdisciplinary research approach. My vision is to devise a system that helps news readers to become aware of the differences in media coverage caused by bias.

pdf bib
SCAR : Sentence Compression using Autoencoders for ReconstructionSCAR: Sentence Compression using Autoencoders for Reconstruction
Chanakya Malireddy | Tirth Maniar | Manish Shrivastava

Sentence compression is the task of shortening a sentence while retaining its meaning. Most methods proposed for this task rely on labeled or paired corpora (containing pairs of verbose and compressed sentences), which is often expensive to collect. To overcome this limitation, we present a novel unsupervised deep learning framework (SCAR) for deletion-based sentence compression. SCAR is primarily composed of two encoder-decoder pairs : a compressor and a reconstructor. The compressor masks the input, and the reconstructor tries to regenerate it. The model is entirely trained on unlabeled data and does not require additional inputs such as explicit syntactic information or optimal compression length. SCAR’s merit lies in the novel Linkage Loss function, which correlates the compressor and its effect on reconstruction, guiding it to drop inferable tokens. SCAR achieves higher ROUGE scores on benchmark datasets than the existing state-of-the-art methods and baselines. We also conduct a user study to demonstrate the application of our model as a text highlighting system. Using our model to underscore salient information facilitates speed-reading and reduces the time required to skim a document.

pdf bib
Multi-Task Neural Model for Agglutinative Language Translation
Yirong Pan | Xiao Li | Yating Yang | Rui Dong

Neural machine translation (NMT) has achieved impressive performance recently by using large-scale parallel corpora. However, it struggles in the low-resource and morphologically-rich scenarios of agglutinative language translation task. Inspired by the finding that monolingual data can greatly improve the NMT performance, we propose a multi-task neural model that jointly learns to perform bi-directional translation and agglutinative language stemming. Our approach employs the shared encoder and decoder to train a single model without changing the standard NMT architecture but instead adding a token before each source-side sentence to specify the desired target outputs of the two different tasks. Experimental results on Turkish-English and Uyghur-Chinese show that our proposed approach can significantly improve the translation performance on agglutinative languages by using a small amount of monolingual data.

pdf bib
Why is penguin more similar to polar bear than to sea gull? Analyzing conceptual knowledge in distributional models
Pia Sommerauer

What do powerful models of word mean- ing created from distributional data (e.g. Word2vec (Mikolov et al., 2013) BERT (Devlin et al., 2019) and ELMO (Peters et al., 2018)) represent? What causes words to be similar in the semantic space? What type of information is lacking? This thesis proposal presents a framework for investigating the information encoded in distributional semantic models. Several analysis methods have been suggested, but they have been shown to be limited and are not well understood. This approach pairs observations made on actual corpora with insights obtained from data manipulation experiments. The expected outcome is a better understanding of (1) the semantic information we can infer purely based on linguistic co-occurrence patterns and (2) the potential of distributional semantic models to pick up linguistic evidence.

pdf bib
Efficient Neural Machine Translation for Low-Resource Languages via Exploiting Related Languages
Vikrant Goyal | Sourav Kumar | Dipti Misra Sharma

A large percentage of the world’s population speaks a language of the Indian subcontinent, comprising languages from both Indo-Aryan (e.g. Hindi, Punjabi, Gujarati, etc.) and Dravidian (e.g. Tamil, Telugu, Malayalam, etc.) families. A universal characteristic of Indian languages is their complex morphology, which, when combined with the general lack of sufficient quantities of high-quality parallel data, can make developing machine translation (MT) systems for these languages difficult. Neural Machine Translation (NMT) is a rapidly advancing MT paradigm and has shown promising results for many language pairs, especially in large training data scenarios. Since the condition of large parallel corpora is not met for Indian-English language pairs, we present our efforts towards building efficient NMT systems between Indian languages (specifically Indo-Aryan languages) and English via efficiently exploiting parallel data from the related languages. We propose a technique called Unified Transliteration and Subword Segmentation to leverage language similarity while exploiting parallel data from related language pairs. We also propose a Multilingual Transfer Learning technique to leverage parallel data from multiple related languages to assist translation for low resource language pair of interest. Our experiments demonstrate an overall average improvement of 5 BLEU points over the standard Transformer-based NMT baselines.

pdf bib
Crossing the Line : Where do Demographic Variables Fit into Humor Detection?
J. A. Meaney

Recent humor classification shared tasks have struggled with two issues : either the data comprises a highly constrained genre of humor which does not broadly represent humor, or the data is so indiscriminate that the inter-annotator agreement on its humor content is drastically low. These tasks typically average over all annotators’ judgments, in spite of the fact that humor is a highly subjective phenomenon. We argue that demographic factors influence whether a text is perceived as humorous or not. We propose the addition of demographic information about the humor annotators in order to bin ratings more sensibly. We also suggest the addition of an ‘offensive’ label to distinguish between different generations, in terms of humor. This would allow for more nuanced shared tasks and could lead to better performance on downstream tasks, such as content moderation.

pdf bib
uBLEU : Uncertainty-Aware Automatic Evaluation Method for Open-Domain Dialogue SystemsBLEU: Uncertainty-Aware Automatic Evaluation Method for Open-Domain Dialogue Systems
Yuma Tsuta | Naoki Yoshinaga | Masashi Toyoda

Because open-domain dialogues allow diverse responses, basic reference-based metrics such as BLEU do not work well unless we prepare a massive reference set of high-quality responses for input utterances. To reduce this burden, a human-aided, uncertainty-aware metric, BLEU, has been proposed ; it embeds human judgment on the quality of reference outputs into the computation of multiple-reference BLEU. In this study, we instead propose a fully automatic, uncertainty-aware evaluation method for open-domain dialogue systems, BLEU. This method first collects diverse reference responses from massive dialogue data and then annotates their quality judgments by using a neural network trained on automatically collected training data. Experimental results on massive Twitter data confirmed that BLEU is comparable to BLEU in terms of its correlation with human judgment and that the state of the art automatic evaluation method, RUBER, is improved by integrating BLEU.

pdf bib
Embeddings of Label Components for Sequence Labeling : A Case Study of Fine-grained Named Entity Recognition
Takuma Kato | Kaori Abe | Hiroki Ouchi | Shumpei Miyawaki | Jun Suzuki | Kentaro Inui

In general, the labels used in sequence labeling consist of different types of elements. For example, IOB-format entity labels, such as B-Person and I-Person, can be decomposed into span (B and I) and type information (Person). However, while most sequence labeling models do not consider such label components, the shared components across labels, such as Person, can be beneficial for label prediction. In this work, we propose to integrate label component information as embeddings into models. Through experiments on English and Japanese fine-grained named entity recognition, we demonstrate that the proposed method improves performance, especially for instances with low-frequency labels.

pdf bib
Checkpoint Reranking : An Approach to Select Better Hypothesis for Neural Machine Translation Systems
Vinay Pandramish | Dipti Misra Sharma

In this paper, we propose a method of re-ranking the outputs of Neural Machine Translation (NMT) systems. After the decoding process, we select a few last iteration outputs in the training process as the N-best list. After training a Neural Machine Translation (NMT) baseline system, it has been observed that these iteration outputs have an oracle score higher than baseline up to 1.01 BLEU points compared to the last iteration of the trained system. We come up with a ranking mechanism by solely focusing on the decoder’s ability to generate distinct tokens and without the usage of any language model or data. With this method, we achieved a translation improvement up to +0.16 BLEU points over baseline. We also evaluate our approach by applying the coverage penalty to the training process. In cases of moderate coverage penalty, the oracle scores are higher than the final iteration up to +0.99 BLEU points, and our algorithm gives an improvement up to +0.17 BLEU points. With excessive penalty, there is a decrease in translation quality compared to the baseline system. Still, an increase in oracle scores up to +1.30 is observed with the re-ranking algorithm giving an improvement up to +0.15 BLEU points is found in case of excessive penalty. The proposed re-ranking method is a generic one and can be extended to other language pairs as well.N-best list. After training a Neural Machine Translation (NMT) baseline system, it has been observed that these iteration outputs have an oracle score higher than baseline up to 1.01 BLEU points compared to the last iteration of the trained system.We come up with a ranking mechanism by solely focusing on the decoder’s ability to generate distinct tokens and without the usage of any language model or data. With this method, we achieved a translation improvement up to +0.16 BLEU points over baseline.We also evaluate our approach by applying the coverage penalty to the training process.In cases of moderate coverage penalty, the oracle scores are higher than the final iteration up to +0.99 BLEU points, and our algorithm gives an improvement up to +0.17 BLEU points.With excessive penalty, there is a decrease in translation quality compared to the baseline system. Still, an increase in oracle scores up to +1.30 is observed with the re-ranking algorithm giving an improvement up to +0.15 BLEU points is found in case of excessive penalty.The proposed re-ranking method is a generic one and can be extended to other language pairs as well.

pdf bib
Cross-Lingual Disaster-related Multi-label Tweet Classification with Manifold Mixup
Jishnu Ray Chowdhury | Cornelia Caragea | Doina Caragea

Distinguishing informative and actionable messages from a social media platform like Twitter is critical for facilitating disaster management. For this purpose, we compile a multilingual dataset of over 130 K samples for multi-label classification of disaster-related tweets. We present a masking-based loss function for partially labelled samples and demonstrate the effectiveness of Manifold Mixup in the text domain. Our main model is based on Multilingual BERT, which we further improve with Manifold Mixup. We show that our model generalizes to unseen disasters in the test set. Furthermore, we analyze the capability of our model for zero-shot generalization to new languages. Our code, dataset, and other resources are available on Github.

pdf bib
Exploring the Role of Context to Distinguish Rhetorical and Information-Seeking Questions
Yuan Zhuang | Ellen Riloff

Social media posts often contain questions, but many of the questions are rhetorical and do not seek information. Our work studies the problem of distinguishing rhetorical and information-seeking questions on Twitter. Most work has focused on features of the question itself, but we hypothesize that the prior context plays a role too. This paper introduces a new dataset containing questions in tweets paired with their prior tweets to provide context. We create classification models to assess the difficulty of distinguishing rhetorical and information-seeking questions, and experiment with different properties of the prior context. Our results show that the prior tweet and topic features can improve performance on this task.


pdf (full)
bib (full)
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

pdf bib
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts
Agata Savary | Yue Zhang

pdf bib
Integrating Ethics into the NLP CurriculumNLP Curriculum
Emily M. Bender | Dirk Hovy | Alexandra Schofield

To raise awareness among future NLP practitioners and prevent inertia in the field, we need to place ethics in the curriculum for all NLP studentsnot as an elective, but as a core part of their education. Our goal in this tutorial is to empower NLP researchers and practitioners with tools and resources to teach others about how to ethically apply NLP techniques. We will present both high-level strategies for developing an ethics-oriented curriculum, based on experience and best practices, as well as specific sample exercises that can be brought to a classroom. This highly interactive work session will culminate in a shared online resource page that pools lesson plans, assignments, exercise ideas, reading suggestions, and ideas from the attendees. Though the tutorial will focus particularly on examples for university classrooms, we believe these ideas can extend to company-internal workshops or tutorials in a variety of organizations. In this setting, a key lesson is that there is no single approach to ethical NLP : each project requires thoughtful consideration about what steps can be taken to best support people affected by that project. However, we can learn (and teach) what issues to be aware of, what questions to ask, and what strategies are available to mitigate harm.

pdf bib
Stylized Text Generation : Approaches and Applications
Lili Mou | Olga Vechtomova

Text generation has played an important role in various applications of natural language processing (NLP), and kn recent studies, researchers are paying increasing attention to modeling and manipulating the style of the generation text, which we call stylized text generation. In this tutorial, we will provide a comprehensive literature review in this direction. We start from the definition of style and different settings of stylized text generation, illustrated with various applications. Then, we present different settings of stylized generation, such as style-conditioned generation, style-transfer generation, and style-adversarial generation. In each setting, we delve deep into machine learning methods, including embedding learning techniques to represent style, adversarial learning, and reinforcement learning with cycle consistency to match content but to distinguish different styles. We also introduce current approaches to evaluating stylized text generation systems. We conclude our tutorial by presenting the challenges of stylized text generation and discussing future directions, such as small-data training, non-categorical style modeling, and a generalized scope of style transfer (e.g., controlling the syntax as a style).


pdf (full)
bib (full)
Proceedings of the First Workshop on Advances in Language and Vision Research

pdf bib
Proceedings of the First Workshop on Advances in Language and Vision Research
Xin Wang | Jesse Thomason | Ronghang Hu | Xinlei Chen | Peter Anderson | Qi Wu | Asli Celikyilmaz | Jason Baldridge | William Yang Wang

pdf bib
Visual Question Generation from Radiology Images
Mourad Sarrouti | Asma Ben Abacha | Dina Demner-Fushman

Visual Question Generation (VQG), the task of generating a question based on image contents, is an increasingly important area that combines natural language processing and computer vision. Although there are some recent works that have attempted to generate questions from images in the open domain, the task of VQG in the medical domain has not been explored so far. In this paper, we introduce an approach to generation of visual questions about radiology images called VQGR, i.e. an algorithm that is able to ask a question when shown an image. VQGR first generates new training data from the existing examples, based on contextual word embeddings and image augmentation techniques. It then uses the variational auto-encoders model to encode images into a latent space and decode natural language questions. Experimental automatic evaluations performed on the VQA-RAD dataset of clinical visual questions show that VQGR achieves good performances compared with the baseline system. The source code is available at

pdf bib
On the role of effective and referring questions in GuessWhat? !GuessWhat?!
Mauricio Mazuecos | Alberto Testoni | Raffaella Bernardi | Luciana Benotti

Task success is the standard metric used to evaluate referential visual dialogue systems. In this paper we propose two new metrics that evaluate how each question contributes to the goal. First, we measure how effective each question is by evaluating whether the question discards objects that are not the referent. Second, we define referring questions as those that univocally identify one object in the image. We report the new metrics for human dialogues and for state of the art publicly available models on GuessWhat? !. Regarding our first metric, we find that successful dialogues do not have a higher percentage of effective questions for most models. With respect to the second metric, humans make questions at the end of the dialogue that are referring, confirming their guess before guessing. Human dialogues that use this strategy have a higher task success but models do not seem to learn it.

pdf bib
Latent Alignment of Procedural Concepts in Multimodal Recipes
Hossein Rajaby Faghihi | Roshanak Mirzaee | Sudarshan Paliwal | Parisa Kordjamshidi

We propose a novel alignment mechanism to deal with procedural reasoning on a newly released multimodal QA dataset, named RecipeQA. Our model is solving the textual cloze task which is a reading comprehension on a recipe containing images and instructions. We exploit the power of attention networks, cross-modal representations, and a latent alignment space between instructions and candidate answers to solve the problem. We introduce constrained max-pooling which refines the max pooling operation on the alignment matrix to impose disjoint constraints among the outputs of the model. Our evaluation result indicates a 19 % improvement over the baselines.


pdf (full)
bib (full)
Proceedings of the First Workshop on Automatic Simultaneous Translation

pdf bib
Proceedings of the First Workshop on Automatic Simultaneous Translation
Hua Wu | Collin Cherry | Liang Huang | Zhongjun He | Mark Liberman | James Cross | Yang Liu

pdf bib
Modeling Discourse Structure for Document-level Neural Machine Translation
Junxuan Chen | Xiang Li | Jiarui Zhang | Chulun Zhou | Jianwei Cui | Bin Wang | Jinsong Su

Recently, document-level neural machine translation (NMT) has become a hot topic in the community of machine translation. Despite its success, most of existing studies ignored the discourse structure information of the input document to be translated, which has shown effective in other tasks. In this paper, we propose to improve document-level NMT with the aid of discourse structure information. Our encoder is based on a hierarchical attention network (HAN) (Miculicich et al., 2018). Specifically, we first parse the input document to obtain its discourse structure. Then, we introduce a Transformer-based path encoder to embed the discourse structure information of each word. Finally, we combine the discourse structure information with the word embedding before it is fed into the encoder. Experimental results on the English-to-German dataset show that our model can significantly outperform both Transformer and Transformer+HAN.


pdf (full)
bib (full)
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications
Jill Burstein | Ekaterina Kochmar | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Helen Yannakoudakis | Torsten Zesch

pdf bib
Complementary Systems for Off-Topic Spoken Response Detection
Vatsal Raina | Mark Gales | Kate Knill

Increased demand to learn English for business and education has led to growing interest in automatic spoken language assessment and teaching systems. With this shift to automated approaches it is important that systems reliably assess all aspects of a candidate’s responses. This paper examines one form of spoken language assessment ; whether the response from the candidate is relevant to the prompt provided. This will be referred to as off-topic spoken response detection. Two forms of previously proposed approaches are examined in this work : the hierarchical attention-based topic model (HATM) ; and the similarity grid model (SGM). The work focuses on the scenario when the prompt, and associated responses, have not been seen in the training data, enabling the system to be applied to new test scripts without the need to collect data or retrain the model. To improve the performance of the systems for unseen prompts, data augmentation based on easy data augmentation (EDA) and translation based approaches are applied. Additionally for the HATM, a form of prompt dropout is described. The systems were evaluated on both seen and unseen prompts from Linguaskill Business and General English tests. For unseen data the performance of the HATM was improved using data augmentation, in contrast to the SGM where no gains were obtained. The two approaches were found to be complementary to one another, yielding a combined F0.5 score of 0.814 for off-topic response detection where the prompts have not been seen in training.

pdf bib
CIMA : A Large Open Access Dialogue Dataset for TutoringCIMA: A Large Open Access Dialogue Dataset for Tutoring
Katherine Stasaski | Kimberly Kao | Marti A. Hearst

One-to-one tutoring is often an effective means to help students learn, and recent experiments with neural conversation systems are promising. However, large open datasets of tutoring conversations are lacking. To remedy this, we propose a novel asynchronous method for collecting tutoring dialogue via crowdworkers that is both amenable to the needs of deep learning algorithms and reflective of pedagogical concerns. In this approach, extended conversations are obtained between crowdworkers role-playing as both students and tutors. The CIMA collection, which we make publicly available, is novel in that students are exposed to overlapping grounded concepts between exercises and multiple relevant tutoring responses are collected for the same input. CIMA contains several compelling properties from an educational perspective : student role-players complete exercises in fewer turns during the course of the conversation and tutor players adopt strategies that conform with some educational conversational norms, such as providing hints versus asking questions in appropriate contexts. The dataset enables a model to be trained to generate the next tutoring utterance in a conversation, conditioned on a provided action strategy.

pdf bib
Becoming Linguistically Mature : Modeling English and German Children’s Writing Development Across School GradesEnglish and German Children’s Writing Development Across School Grades
Elma Kerz | Yu Qiao | Daniel Wiechmann | Marcus Ströbel

In this paper we employ a novel approach to advancing our understanding of the development of writing in English and German children across school grades using classification tasks. The data used come from two recently compiled corpora : The English data come from the the GiC corpus (983 school children in second-, sixth-, ninth- and eleventh-grade) and the German data are from the FD-LEX corpus (930 school children in fifth- and ninth-grade). The key to this paper is the combined use of what we refer to as ‘complexity contours’, i.e. series of measurements that capture the progression of linguistic complexity within a text, and Recurrent Neural Network (RNN) classifiers that adequately capture the sequential information in those contours. Our experiments demonstrate that RNN classifiers trained on complexity contours achieve higher classification accuracy than one trained on text-average complexity scores. In a second step, we determine the relative importance of the features from four distinct categories through a Sensitivity-Based Pruning approach.

pdf bib
Can Neural Networks Automatically Score Essay Traits?
Sandeep Mathias | Pushpak Bhattacharyya

Essay traits are attributes of an essay that can help explain how well written (or badly written) the essay is. Examples of traits include Content, Organization, Language, Sentence Fluency, Word Choice, etc. A lot of research in the last decade has dealt with automatic holistic essay scoring-where a machine rates an essay and gives a score for the essay. However, writers need feedback, especially if they want to improve their writing-which is why trait-scoring is important. In this paper, we show how a deep-learning based system can outperform feature-based machine learning systems, as well as a string kernel system in scoring essay traits.

pdf bib
Applications of Natural Language Processing in Bilingual Language Teaching : An Indonesian-English Case StudyIndonesian-English Case Study
Zara Maxwelll-Smith | Simón González Ochoa | Ben Foley | Hanna Suominen

Multilingual corpora are difficult to compile and a classroom setting adds pedagogy to the mix of factors which make this data so rich and problematic to classify. In this paper, we set out methodological considerations of using automated speech recognition to build a corpus of teacher speech in an Indonesian language classroom. Our preliminary results (64 % word error rate) suggest these tools have the potential to speed data collection in this context. We provide practical examples of our data structure, details of our piloted computer-assisted processes, and fine-grained error analysis. Our study is informed and directed by genuine research questions and discussion in both the education and computational linguistics fields. We highlight some of the benefits and risks of using these emerging technologies to analyze the complex work of language teachers and in education more generally.

pdf bib
An empirical investigation of neural methods for content scoring of science explanations
Brian Riordan | Sarah Bichler | Allison Bradford | Jennifer King Chen | Korah Wiley | Libby Gerard | Marcia C. Linn

With the widespread adoption of the Next Generation Science Standards (NGSS), science teachers and online learning environments face the challenge of evaluating students’ integration of different dimensions of science learning. Recent advances in representation learning in natural language processing have proven effective across many natural language processing tasks, but a rigorous evaluation of the relative merits of these methods for scoring complex constructed response formative assessments has not previously been carried out. We present a detailed empirical investigation of feature-based, recurrent neural network, and pre-trained transformer models on scoring content in real-world formative assessment data. We demonstrate that recent neural methods can rival or exceed the performance of feature-based methods. We also provide evidence that different classes of neural models take advantage of different learning cues, and pre-trained transformer models may be more robust to spurious, dataset-specific learning cues, better reflecting scoring rubrics.

pdf bib
GECToR Grammatical Error Correction : Tag, Not RewriteGECToR – Grammatical Error Correction: Tag, Not Rewrite
Kostiantyn Omelianchuk | Vitaliy Atrasevych | Artem Chernodub | Oleksandr Skurzhanskyi

In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages : first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model / ensemble GEC tagger achieves an F_0.5 of 65.3/66.5 on CONLL-2014 (test) and F_0.5 of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.

pdf bib
Interpreting Neural CWI Classifiers’ Weights as Vocabulary SizeCWI Classifiers’ Weights as Vocabulary Size
Yo Ehara

Complex Word Identification (CWI) is a task for the identification of words that are challenging for second-language learners to read. Even though the use of neural classifiers is now common in CWI, the interpretation of their parameters remains difficult. This paper analyzes neural CWI classifiers and shows that some of their parameters can be interpreted as vocabulary size. We present a novel formalization of vocabulary size measurement methods that are practiced in the applied linguistics field as a kind of neural classifier. We also contribute to building a novel dataset for validating vocabulary testing and readability via crowdsourcing.

pdf bib
Predicting the Difficulty and Response Time of Multiple Choice Questions Using Transfer Learning
Kang Xue | Victoria Yaneva | Christopher Runyon | Peter Baldwin

This paper investigates whether transfer learning can improve the prediction of the difficulty and response time parameters for 18,000 multiple-choice questions from a high-stakes medical exam. The type the signal that best predicts difficulty and response time is also explored, both in terms of representation abstraction and item component used as input (e.g., whole item, answer options only, etc.). The results indicate that, for our sample, transfer learning can improve the prediction of item difficulty when response time is used as an auxiliary task but not the other way around. In addition, difficulty was best predicted using signal from the item stem (the description of the clinical case), while all parts of the item were important for predicting the response time.


pdf (full)
bib (full)
Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing

pdf bib
Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing
Dina Demner-Fushman | Kevin Bretonnel Cohen | Sophia Ananiadou | Junichi Tsujii

pdf bib
Interactive Extractive Search over Biomedical Corpora
Hillel Taub Tabib | Micah Shlain | Shoval Sadde | Dan Lahav | Matan Eyal | Yaara Cohen | Yoav Goldberg

We present a system that allows life-science researchers to search a linguistically annotated corpus of scientific texts using patterns over dependency graphs, as well as using patterns over token sequences and a powerful variant of boolean keyword queries. In contrast to previous attempts to dependency-based search, we introduce a light-weight query language that does not require the user to know the details of the underlying linguistic representations, and instead to query the corpus by providing an example sentence coupled with simple markup. Search is performed at an interactive speed due to efficient linguistic graph-indexing and retrieval engine. This allows for rapid exploration, development and refinement of user queries. We demonstrate the system using example workflows over two corpora : the PubMed corpus including 14,446,243 PubMed abstracts and the CORD-19 dataset, a collection of over 45,000 research papers focused on COVID-19 research. The system is publicly available at

pdf bib
Improving Biomedical Analogical Retrieval with Embedding of Structural Dependencies
Amandalynne Paullada | Bethany Percha | Trevor Cohen

Inferring the nature of the relationships between biomedical entities from text is an important problem due to the difficulty of maintaining human-curated knowledge bases in rapidly evolving fields. Neural word embeddings have earned attention for an apparent ability to encode relational information. However, word embedding models that disregard syntax during training are limited in their ability to encode the structural relationships fundamental to cognitive theories of analogy. In this paper, we demonstrate the utility of encoding dependency structure in word embeddings in a model we call Embedding of Structural Dependencies (ESD) as a way to represent biomedical relationships in two analogical retrieval tasks : a relationship retrieval (RR) task, and a literature-based discovery (LBD) task meant to hypothesize plausible relationships between pairs of entities unseen in training. We compare our model to skip-gram with negative sampling (SGNS), using 19 databases of biomedical relationships as our evaluation data, with improvements in performance on 17 (LBD) and 18 (RR) of these sets. These results suggest embeddings encoding dependency path information are of value for biomedical analogy retrieval.

pdf bib
A BERT-based One-Pass Multi-Task Model for Clinical Temporal Relation ExtractionBERT-based One-Pass Multi-Task Model for Clinical Temporal Relation Extraction
Chen Lin | Timothy Miller | Dmitriy Dligach | Farig Sadeque | Steven Bethard | Guergana Savova

Recently BERT has achieved a state-of-the-art performance in temporal relation extraction from clinical Electronic Medical Records text. However, the current approach is inefficient as it requires multiple passes through each input sequence. We extend a recently-proposed one-pass model for relation classification to a one-pass model for relation extraction. We augment this framework by introducing global embeddings to help with long-distance relation inference, and by multi-task learning to increase model performance and generalizability. Our proposed model produces results on par with the state-of-the-art in temporal relation extraction on the THYME corpus and is much greener in computational cost.

pdf bib
Neural Transduction of Letter Position Dyslexia using an Anagram Matrix Representation
Avi Bleiweiss

Research on analyzing reading patterns of dyslectic children has mainly been driven by classifying dyslexia types offline. We contend that a framework to remedy reading errors inline is more far-reaching and will help to further advance our understanding of this impairment. In this paper, we propose a simple and intuitive neural model to reinstate migrating words that transpire in letter position dyslexia, a visual analysis deficit to the encoding of character order within a word. Introduced by the anagram matrix representation of an input verse, the novelty of our work lies in the expansion from one to a two dimensional context window for training. This warrants words that only differ in the disposition of letters to remain interpreted semantically similar in the embedding space. Subject to the apparent constraints of the self-attention transformer architecture, our model achieved a unigram BLEU score of 40.6 on our reconstructed dataset of the Shakespeare sonnets.

pdf bib
Extensive Error Analysis and a Learning-Based Evaluation of Medical Entity Recognition Systems to Approximate User Experience
Isar Nejadgholi | Kathleen C. Fraser | Berry de Bruijn

When comparing entities extracted by a medical entity recognition system with gold standard annotations over a test set, two types of mismatches might occur, label mismatch or span mismatch. Here we focus on span mismatch and show that its severity can vary from a serious error to a fully acceptable entity extraction due to the subjectivity of span annotations. For a domain-specific BERT-based NER system, we showed that 25 % of the errors have the same labels and overlapping span with gold standard entities. We collected expert judgement which shows more than 90 % of these mismatches are accepted or partially accepted by the user. Using the training set of the NER system, we built a fast and lightweight entity classifier to approximate the user experience of such mismatches through accepting or rejecting them. The decisions made by this classifier are used to calculate a learning-based F-score which is shown to be a better approximation of a forgiving user’s experience than the relaxed F-score. We demonstrated the results of applying the proposed evaluation metric for a variety of deep learning medical entity recognition models trained with two datasets.

pdf bib
Global Locality in Biomedical Relation and Event Extraction
Elaheh ShafieiBavani | Antonio Jimeno Yepes | Xu Zhong | David Martinez Iraola

Due to the exponential growth of biomedical literature, event and relation extraction are important tasks in biomedical text mining. Most work only focus on relation extraction, and detect a single entity pair mention on a short span of text, which is not ideal due to long sentences that appear in biomedical contexts. We propose an approach to both relation and event extraction, for simultaneously predicting relationships between all mention pairs in a text. We also perform an empirical study to discuss different network setups for this purpose. The best performing model includes a set of multi-head attentions and convolutions, an adaptation of the transformer architecture, which offers self-attention the ability to strengthen dependencies among related elements, and models the interaction between features extracted by multiple attention heads. Experiment results demonstrate that our approach outperforms the state of the art on a set of benchmark biomedical corpora including BioNLP 2009, 2011, 2013 and BioCreative 2017 shared tasks.


pdf (full)
bib (full)
Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML)

pdf bib
Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML)
Amir Zadeh | Louis-Philippe Morency | Paul Pu Liang | Soujanya Poria

pdf bib
A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews
Edison Marrese-Taylor | Cristian Rodriguez | Jorge Balazs | Stephen Gould | Yutaka Matsuo

Despite the recent advances in opinion mining for written reviews, few works have tackled the problem on other sources of reviews. In light of this issue, we propose a multi-modal approach for mining fine-grained opinions from video reviews that is able to determine the aspects of the item under review that are being discussed and the sentiment orientation towards them. Our approach works at the sentence level without the need for time annotations and uses features derived from the audio, video and language transcriptions of its contents. We evaluate our approach on two datasets and show that leveraging the video and audio modalities consistently provides increased performance over text-only baselines, providing evidence these extra modalities are key in better understanding video reviews.


pdf (full)
bib (full)
Proceedings of The 3rd Workshop on e-Commerce and NLP

pdf bib
Proceedings of The 3rd Workshop on e-Commerce and NLP
Shervin Malmasi | Surya Kallumadi | Nicola Ueffing | Oleg Rokhlenko | Eugene Agichtein | Ido Guy

pdf bib
Bootstrapping Named Entity Recognition in E-Commerce with Positive Unlabeled LearningE-Commerce with Positive Unlabeled Learning
Hanchu Zhang | Leonhard Hennig | Christoph Alt | Changjian Hu | Yao Meng | Chao Wang

In this work, we introduce a bootstrapped, iterative NER model that integrates a PU learning algorithm for recognizing named entities in a low-resource setting. Our approach combines dictionary-based labeling with syntactically-informed label expansion to efficiently enrich the seed dictionaries. Experimental results on a dataset of manually annotated e-commerce product descriptions demonstrate the effectiveness of the proposed framework.

pdf bib
A Deep Learning System for Sentiment Analysis of Service Calls
Yanan Jia

Sentiment analysis is crucial for the advancement of artificial intelligence (AI). Sentiment understanding can help AI to replicate human language and discourse. Studying the formation and response of sentiment state from well-trained Customer Service Representatives (CSRs) can help make the interaction between humans and AI more intelligent. In this paper, a sentiment analysis pipeline is first carried out with respect to real-world multi-party conversations-that is, service calls. Based on the acoustic and linguistic features extracted from the source information, a novel aggregated method for voice sentiment recognition framework is built. Each party’s sentiment pattern during the communication is investigated along with the interaction sentiment pattern between all parties.

pdf bib
SimsterQ : A Similarity based Clustering Approach to Opinion Question AnsweringSimsterQ: A Similarity based Clustering Approach to Opinion Question Answering
Aishwarya Ashok | Ganapathy Natarajan | Ramez Elmasri | Laurel Smith-Stvan

In recent years, there has been an increase in online shopping resulting in an increased number of online reviews. Customers can not delve into the huge amount of data when they are looking for specific aspects of a product. Some of these aspects can be extracted from the product reviews. In this paper we introduced SimsterQ-a clustering based system for answering questions that makes use of word vectors. Clustering was performed using cosine similarity scores between sentence vectors of reviews and questions. Two variants (Sim and Median) with and without stopwords were evaluated against traditional methods that use term frequency. We also used an n-gram approach to study the effect of noise. We used the reviews in the Amazon Reviews dataset to pick the answers. Evaluation was performed both at the individual sentence level using the top sentence from Okapi BM25 as the gold standard and at the whole answer level using review snippets as the gold standard. At the sentence level our system performed slightly better than a more complicated deep learning method. Our system returned answers similar to the review snippets from the Amazon QA Dataset as measured by the cosine similarity. Analysis was also performed on the quality of the clusters generated by our system.

pdf bib
On Application of Bayesian Parametric and Non-parametric Methods for User Cohorting in Product SearchBayesian Parametric and Non-parametric Methods for User Cohorting in Product Search
Shashank Gupta

In this paper, we study the applicability of Bayesian Parametric and Non-parametric methods for user clustering in an E-commerce search setting. To the best of our knowledge, this is the first work that presents a comparative study of various Bayesian clustering methods in the context of product search. Specifically, we cluster users based on their topical patterns from their respective product search queries. To evaluate the quality of the clusters formed, we perform a collaborative query recommendation task. Our findings indicate that simple parametric model like Latent Dirichlet Allocation (LDA) outperforms more sophisticated non-parametric methods like Distance Dependent Chinese Restaurant Process and Dirichlet Process-based clustering in both tasks.


pdf (full)
bib (full)
Proceedings of the Third Workshop on Fact Extraction and VERification (FEVER)

pdf bib
Proceedings of the Third Workshop on Fact Extraction and VERification (FEVER)
Christos Christodoulopoulos | James Thorne | Andreas Vlachos | Oana Cocarascu | Arpit Mittal

pdf bib
Simple Compounded-Label Training for Fact Extraction and Verification
Yixin Nie | Lisa Bauer | Mohit Bansal

Automatic fact checking is an important task motivated by the need for detecting and preventing the spread of misinformation across the web. The recently released FEVER challenge provides a benchmark task that assesses systems’ capability for both the retrieval of required evidence and the identification of authentic claims. Previous approaches share a similar pipeline training paradigm that decomposes the task into three subtasks, with each component built and trained separately. Although achieving acceptable scores, these methods induce difficulty for practical application development due to unnecessary complexity and expensive computation. In this paper, we explore the potential of simplifying the system design and reducing training computation by proposing a joint training setup in which a single sequence matching model is trained with compounded labels that give supervision for both sentence selection and claim verification subtasks, eliminating the duplicate computation that occurs when models are designed and trained separately. Empirical results on FEVER indicate that our method : (1) outperforms the typical multi-task learning approach, and (2) gets comparable results to top performing systems with a much simpler training setup and less training computation (in terms of the amount of data consumed and the number of model parameters), facilitating future works on the automatic fact checking task and its practical usage.

pdf bib
Language Models as Fact Checkers?
Nayeon Lee | Belinda Z. Li | Sinong Wang | Wen-tau Yih | Hao Ma | Madian Khabsa

Recent work has suggested that language models (LMs) store both common-sense and factual knowledge learned from pre-training data. In this paper, we leverage this implicit knowledge to create an effective end-to-end fact checker using a solely a language model, without any external knowledge or explicit retrieval components. While previous work on extracting knowledge from LMs have focused on the task of open-domain question answering, to the best of our knowledge, this is the first work to examine the use of language models as fact checkers. In a closed-book setting, we show that our zero-shot LM approach outperforms a random baseline on the standard FEVER task, and that our finetuned LM compares favorably with standard baselines. Though we do not ultimately outperform methods which use explicit knowledge bases, we believe our exploration shows that this method is viable and has much room for exploration.


pdf (full)
bib (full)
Proceedings of the Second Workshop on Figurative Language Processing

pdf bib
Proceedings of the Second Workshop on Figurative Language Processing
Beata Beigman Klebanov | Ekaterina Shutova | Patricia Lichtenstein | Smaranda Muresan | Chee Wee | Anna Feldman | Debanjan Ghosh

pdf bib
Sarcasm Detection in Tweets with BERT and GloVe EmbeddingsBERT and GloVe Embeddings
Akshay Khatri | Pranav P

Sarcasm is a form of communication in which the person states opposite of what he actually means. In this paper, we propose using machine learning techniques with BERT and GloVe embeddings to detect sarcasm in tweets. The dataset is preprocessed before extracting the embeddings. The proposed model also uses all of the context provided in the dataset to which the user is reacting along with his actual response.

pdf bib
C-Net : Contextual Network for Sarcasm DetectionC-Net: Contextual Network for Sarcasm Detection
Amit Kumar Jena | Aman Sinha | Rohit Agarwal

Automatic Sarcasm Detection in conversations is a difficult and tricky task. Classifying an utterance as sarcastic or not in isolation can be futile since most of the time the sarcastic nature of a sentence heavily relies on its context. This paper presents our proposed model, C-Net, which takes contextual information of a sentence in a sequential manner to classify it as sarcastic or non-sarcastic. Our model showcases competitive performance in the Sarcasm Detection shared task organised on CodaLab and achieved 75.0 % F1-score on the Twitter dataset and 66.3 % F1-score on Reddit dataset.

pdf bib
Sarcasm Identification and Detection in Conversion Context using BERTBERT
Kalaivani A. | Thenmozhi D.

Sarcasm analysis in user conversion text is automatic detection of any irony, insult, hurting, painful, caustic, humour, vulgarity that degrades an individual. It is helpful in the field of sentimental analysis and cyberbullying. As an immense growth of social media, sarcasm analysis helps to avoid insult, hurts and humour to affect someone. In this paper, we present traditional machine learning approaches, deep learning approach (LSTM -RNN) and BERT (Bidirectional Encoder Representations from Transformers) for identifying sarcasm. We have used the approaches to build the model, to identify and categorize how much conversion context or response is needed for sarcasm detection and evaluated on the two social media forums that is twitter conversation dataset and reddit conversion dataset. We compare the performance based on the approaches and obtained the best F1 scores as 0.722, 0.679 for the twitter forums and reddit forums respectively.

pdf bib
Neural Sarcasm Detection using Conversation Context
Nikhil Jaiswal

Social media platforms and discussion forums such as Reddit, Twitter, etc. are filled with figurative languages. Sarcasm is one such category of figurative language whose presence in a conversation makes language understanding a challenging task. In this paper, we present a deep neural architecture for sarcasm detection. We investigate various pre-trained language representation models (PLRMs) like BERT, RoBERTa, etc. and fine-tune it on the Twitter dataset. We experiment with a variety of PLRMs either on the twitter utterance in isolation or utilizing the contextual information along with the utterance. Our findings indicate that by taking into consideration the previous three most recent utterances, the model is more accurately able to classify a conversation as being sarcastic or not. Our best performing ensemble model achieves an overall F1 score of 0.790, which ranks us second on the leaderboard of the Sarcasm Shared Task 2020.

pdf bib
A Novel Hierarchical BERT Architecture for Sarcasm DetectionBERT Architecture for Sarcasm Detection
Himani Srivastava | Vaibhav Varshney | Surabhi Kumari | Saurabh Srivastava

Online discussion platforms are often flooded with opinions from users across the world on a variety of topics. Many such posts, comments, or utterances are often sarcastic in nature, i.e., the actual intent is hidden in the sentence and is different from its literal meaning, making the detection of such utterances challenging without additional context. In this paper, we propose a novel deep learning-based approach to detect whether an utterance is sarcastic or non-sarcastic by utilizing the given contexts ina hierarchical manner. We have used datasets from two online discussion platforms-Twitter and Reddit1for our experiments. Experimental and error analysis shows that the hierarchical models can make full use of history to obtain a better representation of contexts and thus, in turn, can outperform their sequential counterparts.

pdf bib
Detecting Sarcasm in Conversation Context Using Transformer-Based ModelsDetecting Sarcasm in Conversation Context Using Transformer-Based Models
Adithya Avvaru | Sanath Vobilisetty | Radhika Mamidi

Sarcasm detection, regarded as one of the sub-problems of sentiment analysis, is a very typical task because the introduction of sarcastic words can flip the sentiment of the sentence itself. To date, many research works revolve around detecting sarcasm in one single sentence and there is very limited research to detect sarcasm resulting from multiple sentences. Current models used Long Short Term Memory (LSTM) variants with or without attention to detect sarcasm in conversations. We showed that the models using state-of-the-art Bidirectional Encoder Representations from Transformers (BERT), to capture syntactic and semantic information across conversation sentences, performed better than the current models. Based on the data analysis, we estimated that the number of sentences in the conversation that can contribute to the sarcasm and the results agrees to this estimation. We also perform a comparative study of our different versions of BERT-based model with other variants of LSTM model and XLNet (both using the estimated number of conversation sentences) and find out that BERT-based models outperformed them.

pdf bib
Using Conceptual Norms for Metaphor Detection
Mingyu Wan | Kathleen Ahrens | Emmanuele Chersoni | Menghan Jiang | Qi Su | Rong Xiang | Chu-Ren Huang

This paper reports a linguistically-enriched method of detecting token-level metaphors for the second shared task on Metaphor Detection. We participate in all four phases of competition with both datasets, i.e. Verbs and AllPOS on the VUA and the TOFEL datasets. We use the modality exclusivity and embodiment norms for constructing a conceptual representation of the nodes and the context. Our system obtains an F-score of 0.652 for the VUA Verbs track, which is 5 % higher than the strong baselines. The experimental results across models and datasets indicate the salient contribution of using modality exclusivity and modality shift information for predicting metaphoricity.

pdf bib
Character aware models with similarity learning for metaphor detection
Tarun Kumar | Yashvardhan Sharma

Recent work on automatic sequential metaphor detection has involved recurrent neural networks initialized with different pre-trained word embeddings and which are sometimes combined with hand engineered features. To capture lexical and orthographic information automatically, in this paper we propose to add character based word representation. Also, to contrast the difference between literal and contextual meaning, we utilize a similarity network. We explore these components via two different architectures-a BiLSTM model and a Transformer Encoder model similar to BERT to perform metaphor identification. We participate in the Second Shared Task on Metaphor Detection on both the VUA and TOFEL datasets with the above models. The experimental results demonstrate the effectiveness of our method as it outperforms all the systems which participated in the previous shared task.

pdf bib
Recognizing Euphemisms and Dysphemisms Using Sentiment Analysis
Christian Felt | Ellen Riloff

This paper presents the first research aimed at recognizing euphemistic and dysphemistic phrases with natural language processing. Euphemisms soften references to topics that are sensitive, disagreeable, or taboo. Conversely, dysphemisms refer to sensitive topics in a harsh or rude way. For example, passed away and departed are euphemisms for death, while croaked and six feet under are dysphemisms for death. Our work explores the use of sentiment analysis to recognize euphemistic and dysphemistic language. First, we identify near-synonym phrases for three topics (firing, lying, and stealing) using a bootstrapping algorithm for semantic lexicon induction. Next, we classify phrases as euphemistic, dysphemistic, or neutral using lexical sentiment cues and contextual sentiment analysis. We introduce a new gold standard data set and present our experimental results for this task.

pdf bib
Generating Ethnographic Models from Communities’ Online Data
Tomek Strzalkowski | Anna Newheiser | Nathan Kemper | Ning Sa | Bharvee Acharya | Gregorios Katsios

In this paper we describe computational ethnography study to demonstrate how machine learning techniques can be utilized to exploit bias resident in language data produced by communities with online presence. Specifically, we leverage the use of figurative language (i.e., the choice of metaphors) in online text (e.g., news media, blogs) produced by distinct communities to obtain models of community worldviews that can be shown to be distinctly biased and thus different from other communities’ models. We automatically construct metaphor-based community models for two distinct scenarios : gun rights and marriage equality. We then conduct a series of experiments to validate the hypothesis that the metaphors found in each community’s online language convey the bias in the community’s worldview.

pdf bib
Augmenting Neural Metaphor Detection with Concreteness
Ghadi Alnafesah | Harish Tayyar Madabushi | Mark Lee

The idea that a shift in concreteness within a sentence indicates the presence of a metaphor has been around for a while. However, recent methods of detecting metaphor that have relied on deep neural models have ignored concreteness and related psycholinguistic information. We hypothesis that this information is not available to these models and that their addition will boost the performance of these models in detecting metaphor. We test this hypothesis on the Metaphor Detection Shared Task 2020 and find that the addition of concreteness information does in fact boost deep neural models. We also run tests on data from a previous shared task and show similar results.

pdf bib
Metaphor Detection using Ensembles of Bidirectional Recurrent Neural Networks
Jennifer Brooks | Abdou Youssef

In this paper we present our results from the Second Shared Task on Metaphor Detection, hosted by the Second Workshop on Figurative Language Processing. We use an ensemble of RNN models with bidirectional LSTMs and bidirectional attention mechanisms. Some of the models were trained on all parts of speech. Each of the other models was trained on one of four categories for parts of speech : nouns, verbs, adverbs / adjectives, or other. The models were combined into voting pools and the voting pools were combined using the logical OR operator.

pdf bib
Testing the role of metadata in metaphor identification
Egon Stemle | Alexander Onysko

This paper describes the adaptation and application of a neural network system for the automatic detection of metaphors. The LSTM BiRNN system participated in the shared task of metaphor identification that was part of the Second Workshop of Figurative Language Processing (FigLang2020) held at the Annual Conference of the Association for Computational Linguistics (ACL2020). The particular focus of our approach is on the potential influence that the metadata given in the ETS Corpus of Non-Native Written English might have on the automatic detection of metaphors in this dataset. The article first discusses the annotated ETS learner data, highlighting some of its peculiarities and inherent biases of metaphor use. A series of evaluations follow in order to test whether specific metadata influence the system performance in the task of automatic metaphor identification. The system is available under the APLv2 open-source license.


pdf (full)
bib (full)
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

pdf bib
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies
Gosse Bouma | Yuji Matsumoto | Stephan Oepen | Kenji Sagae | Djamé Seddah | Weiwei Sun | Anders Søgaard | Reut Tsarfaty | Dan Zeman

pdf bib
Syntactic Parsing in Humans and Machines
Paola Merlo

To process the syntactic structures of a language in ways that are compatible with human expectations, we need computational representations of lexical and syntactic properties that form the basis of human knowledge of words and sentences. Recent neural-network-based and distributed semantics techniques have developed systems of considerable practical success and impressive performance. As has been advocated by many, however, such systems still lack human-like properties. In particular, linguistic, psycholinguistic and neuroscientific investigations have shown that human processing of sentences is sensitive to structure and unbounded relations. In the spirit of better understanding the structure building and long-distance properties of neural networks, I will present an overview of recent results on agreement and island effects in syntax in several languages. While certain sets of results in the literature indicate that neural language models exhibit long-distance agreement abilities, other finer-grained investigation of how these effects are calculated indicates that that the similarity spaces they define do not correlate with human experimental results on intervention similarity in long-distance dependencies. This opens the way to reflections on how to better match the syntactic properties of natural languages in the representations of neural models.

pdf bib
Semi-supervised Parsing with a Variational Autoencoding Parser
Xiao Zhang | Dan Goldwasser

We propose an end-to-end variational autoencoding parsing (VAP) model for semi-supervised graph-based projective dependency parsing. It encodes the input using continuous latent variables in a sequential manner by deep neural networks (DNN) that can utilize the contextual information, and reconstruct the input using a generative model. The VAP model admits a unified structure with different loss functions for labeled and unlabeled data with shared parameters. We conducted experiments on the WSJ data sets, showing the proposed model can use the unlabeled data to increase the performance on a limited amount of labeled data, on a par with a recently proposed semi-supervised parser with faster inference.

pdf bib
Obfuscation for Privacy-preserving Syntactic Parsing
Zhifeng Hu | Serhii Havrylov | Ivan Titov | Shay B. Cohen

The goal of homomorphic encryption is to encrypt data such that another party can operate on it without being explicitly exposed to the content of the original data. We introduce an idea for a privacy-preserving transformation on natural language data, inspired by homomorphic encryption. Our primary tool is obfuscation, relying on the properties of natural language. Specifically, a given English text is obfuscated using a neural model that aims to preserve the syntactic relationships of the original sentence so that the obfuscated sentence can be parsed instead of the original one. The model works at the word level, and learns to obfuscate each word separately by changing it into a new word that has a similar syntactic role. The text obfuscated by our model leads to better performance on three syntactic parsers (two dependency and one constituency parsers) in comparison to an upper-bound random substitution baseline. More specifically, the results demonstrate that as more terms are obfuscated (by their part of speech), the substitution upper bound significantly degrades, while the neural model maintains a relatively high performing parser. All of this is done without much sacrifice of privacy compared to the random substitution upper bound. We also further analyze the results, and discover that the substituted words have similar syntactic properties, but different semantic content, compared to the original words.obfuscation, relying on the properties of natural language. Specifically, a given English text is obfuscated using a neural model that aims to preserve the syntactic relationships of the original sentence so that the obfuscated sentence can be parsed instead of the original one. The model works at the word level, and learns to obfuscate each word separately by changing it into a new word that has a similar syntactic role. The text obfuscated by our model leads to better performance on three syntactic parsers (two dependency and one constituency parsers) in comparison to an upper-bound random substitution baseline. More specifically, the results demonstrate that as more terms are obfuscated (by their part of speech), the substitution upper bound significantly degrades, while the neural model maintains a relatively high performing parser. All of this is done without much sacrifice of privacy compared to the random substitution upper bound. We also further analyze the results, and discover that the substituted words have similar syntactic properties, but different semantic content, compared to the original words.

pdf bib
Tensors over Semirings for Latent-Variable Weighted Logic Programs
Esma Balkir | Daniel Gildea | Shay B. Cohen

Semiring parsing is an elegant framework for describing parsers by using semiring weighted logic programs. In this paper we present a generalization of this concept : latent-variable semiring parsing. With our framework, any semiring weighted logic program can be latentified by transforming weights from scalar values of a semiring to rank-n arrays, or tensors, of semiring values, allowing the modelling of latent-variable models within the semiring parsing framework. Semiring is too strong a notion when dealing with tensors, and we have to resort to a weaker structure : a partial semiring. We prove that this generalization preserves all the desired properties of the original semiring framework while strictly increasing its expressiveness.

pdf bib
Self-Training for Unsupervised Parsing with PRPNPRPN
Anhad Mohananey | Katharina Kann | Samuel R. Bowman

Neural unsupervised parsing (UP) models learn to parse without access to syntactic annotations, while being optimized for another task like language modeling. In this work, we propose self-training for neural UP models : we leverage aggregated annotations predicted by copies of our model as supervision for future copies. To be able to use our model’s predictions during training, we extend a recent neural UP architecture, the PRPN (Shen et al., 2018a), such that it can be trained in a semi-supervised fashion. We then add examples with parses predicted by our model to our unlabeled UP training data. Our self-trained model outperforms the PRPN by 8.1 % F1 and the previous state of the art by 1.6 % F1. In addition, we show that our architecture can also be helpful for semi-supervised parsing in ultra-low-resource settings.

pdf bib
Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal Dependency ParsingUniversal Dependency Parsing
Han He | Jinho D. Choi

This paper presents our enhanced dependency parsing approach using transformer encoders, coupled with a simple yet powerful ensemble algorithm that takes advantage of both tree and graph dependency parsing. Two types of transformer encoders are compared, a multilingual encoder and language-specific encoders. Our dependency tree parsing (DTP) approach generates only primary dependencies to form trees whereas our dependency graph parsing (DGP) approach handles both primary and secondary dependencies to form graphs. Since DGP does not guarantee the generated graphs are acyclic, the ensemble algorithm is designed to add secondary arcs predicted by DGP to primary arcs predicted by DTP. Our results show that models using the multilingual encoder outperform ones using the language specific encoders for most languages. The ensemble models generally show higher labeled attachment score on enhanced dependencies (ELAS) than the DTP and DGP models. As the result, our best models rank the third place on the macro-average ELAS over 17 languages.

pdf bib
Linear Neural Parsing and Hybrid Enhancement for Enhanced Universal DependenciesUniversal Dependencies
Giuseppe Attardi | Daniele Sartiano | Maria Simi

To accomplish the shared task on dependency parsing we explore the use of a linear transition-based neural dependency parser as well as a combination of three of them by means of a linear tree combination algorithm. We train separate models for each language on the shared task data. We compare our base parser with two biaffine parsers and also present an ensemble combination of all five parsers, which achieves an average UAS 1.88 point lower than the top official submission. For producing the enhanced dependencies, we exploit a hybrid approach, coupling an algorithmic graph transformation of the dependency tree with predictions made by a multitask machine learning model.

pdf bib
How Much of Enhanced UD Is Contained in UD?UD Is Contained in UD?
Adam Ek | Jean-Philippe Bernardy

In this paper, we present the submission of team CLASP to the IWPT 2020 Shared Task on parsing enhanced universal dependencies. We develop a tree-to-graph transformation algorithm based on dependency patterns. This algorithm can transform gold UD trees to EUD graphs with an ELAS score of 81.55 and a EULAS score of 96.70. These results show that much of the information needed to construct EUD graphs from UD trees are present in the UD trees. Coupled with a standard UD parser, the method applies to the official test data and yields and ELAS score of 67.85 and a EULAS score is 80.18.


pdf (full)
bib (full)
Proceedings of the 17th International Conference on Spoken Language Translation

pdf bib
Proceedings of the 17th International Conference on Spoken Language Translation
Marcello Federico | Alex Waibel | Kevin Knight | Satoshi Nakamura | Hermann Ney | Jan Niehues | Sebastian Stüker | Dekai Wu | Joseph Mariani | Francois Yvon

pdf bib
SRPOL’s System for the IWSLT 2020 End-to-End Speech Translation TaskSRPOL’s System for the IWSLT 2020 End-to-End Speech Translation Task
Tomasz Potapczyk | Pawel Przybysz

We took part in the offline End-to-End English to German TED lectures translation task. We based our solution on our last year’s submission. We used a slightly altered Transformer architecture with ResNet-like convolutional layer preparing the audio input to Transformer encoder. To improve the model’s quality of translation we introduced two regularization techniques and trained on machine translated Librispeech corpus in addition to iwslt-corpus, TEDLIUM2 andMust_C corpora. Our best model scored almost 3 BLEU higher than last year’s model. To segment 2020 test set we used exactly the same procedure as last year.

pdf bib
The University of Helsinki Submission to the IWSLT2020 Offline SpeechTranslation TaskUniversity of Helsinki Submission to the IWSLT2020 Offline SpeechTranslation Task
Raúl Vázquez | Mikko Aulamo | Umut Sulubacak | Jörg Tiedemann

This paper describes the University of Helsinki Language Technology group’s participation in the IWSLT 2020 offline speech translation task, addressing the translation of English audio into German text. In line with this year’s task objective, we train both cascade and end-to-end systems for spoken language translation. We opt for an end-to-end multitasking architecture with shared internal representations and a cascade approach that follows a standard procedure consisting of ASR, correction, and MT stages. We also describe the experiments that served as a basis for the submitted systems. Our experiments reveal that multitasking training with shared internal representations is not only possible but allows for knowledge-transfer across modalities.

pdf bib
The AFRL IWSLT 2020 Systems : Work-From-Home EditionAFRL IWSLT 2020 Systems: Work-From-Home Edition
Brian Ore | Eric Hansen | Tim Anderson | Jeremy Gwinnup

This report summarizes the Air Force Research Laboratory (AFRL) submission to the offline spoken language translation (SLT) task as part of the IWSLT 2020 evaluation campaign. As in previous years, we chose to adopt the cascade approach of using separate systems to perform speech activity detection, automatic speech recognition, sentence segmentation, and machine translation. All systems were neural based, including a fully-connected neural network for speech activity detection, a Kaldi factorized time delay neural network with recurrent neural network (RNN) language model rescoring for speech recognition, a bidirectional RNN with attention mechanism for sentence segmentation, and transformer networks trained with OpenNMT and Marian for machine translation. Our primary submission yielded BLEU scores of 21.28 on tst2019 and 23.33 on tst2020.

pdf bib
OPPO’s Machine Translation System for the IWSLT 2020 Open Domain Translation TaskOPPO’s Machine Translation System for the IWSLT 2020 Open Domain Translation Task
Qian Zhang | Xiaopu Li | Dawei Dang | Tingxun Shi | Di Ai | Zhengshan Xue | Jie Hao

In this paper, we demonstrate our machine translation system applied for the Chinese-Japanese bidirectional translation task (aka. open domain translation task) for the IWSLT 2020. Our model is based on Transformer (Vaswani et al., 2017), with the help of many popular, widely proved effective data preprocessing and augmentation methods. Experiments show that these methods can improve the baseline model steadily and significantly.

pdf bib
Character Mapping and Ad-hoc Adaptation : Edinburgh’s IWSLT 2020 Open Domain Translation SystemEdinburgh’s IWSLT 2020 Open Domain Translation System
Pinzhen Chen | Nikolay Bogoychev | Ulrich Germann

This paper describes the University of Edinburgh’s neural machine translation systems submitted to the IWSLT 2020 open domain JapaneseChinese translation task. On top of commonplace techniques like tokenisation and corpus cleaning, we explore character mapping and unsupervised decoding-time adaptation. Our techniques focus on leveraging the provided data, and we show the positive impact of each technique through the gradual improvement of BLEU.\\leftrightarrowChinese translation task. On top of commonplace techniques like tokenisation and corpus cleaning, we explore character mapping and unsupervised decoding-time adaptation. Our techniques focus on leveraging the provided data, and we show the positive impact of each technique through the gradual improvement of BLEU.

pdf bib
CASIA’s System for IWSLT 2020 Open Domain TranslationCASIA’s System for IWSLT 2020 Open Domain Translation
Qian Wang | Yuchen Liu | Cong Ma | Yu Lu | Yining Wang | Long Zhou | Yang Zhao | Jiajun Zhang | Chengqing Zong

This paper describes the CASIA’s system for the IWSLT 2020 open domain translation task. This year we participate in both ChineseJapanese and JapaneseChinese translation tasks. Our system is neural machine translation system based on Transformer model. We augment the training data with knowledge distillation and back translation to improve the translation performance. Domain data classification and weighted domain model ensemble are introduced to generate the final translation result. We compare and analyze the performance on development data with different model settings and different data processing techniques.

pdf bib
Deep Blue Sonics’ Submission to IWSLT 2020 Open Domain Translation TaskIWSLT 2020 Open Domain Translation Task
Enmin Su | Yi Ren

We present in this report our submission to IWSLT 2020 Open Domain Translation Task. We built a data pre-processing pipeline to efficiently handle large noisy web-crawled corpora, which boosts the BLEU score of a widely used transformer model in this translation task. To tackle the open-domain nature of this task, back- translation is applied to further improve the translation performance.

pdf bib
ISTIC’s Neural Machine Translation System for IWSLT’2020ISTIC’s Neural Machine Translation System for IWSLT’2020
Jiaze Wei | Wenbin Liu | Zhenfeng Wu | You Pan | Yanqing He

This paper introduces technical details of machine translation system of Institute of Scientific and Technical Information of China (ISTIC) for the 17th International Conference on Spoken Language Translation (IWSLT 2020). ISTIC participated in both translation tasks of the Open Domain Translation track : Japanese-to-Chinese MT task and Chinese-to-Japanese MT task. The paper mainly elaborates on the model framework, data preprocessing methods and decoding strategies adopted in our system. In addition, the system performance on the development set are given under different settings.

pdf bib
The HW-TSC Video Speech Translation System at IWSLT 2020HW-TSC Video Speech Translation System at IWSLT 2020
Minghan Wang | Hao Yang | Yao Deng | Ying Qin | Lizhi Lei | Daimeng Wei | Hengchao Shang | Ning Xie | Xiaochun Li | Jiaxian Guo

The paper presents details of our system in the IWSLT Video Speech Translation evaluation. The system works in a cascade form, which contains three modules : 1) A proprietary ASR system. 2) A disfluency correction system aims to remove interregnums or other disfluent expressions with a fine-tuned BERT and a series of rule-based algorithms. 3) An NMT System based on the Transformer and trained with massive publicly available corpus.

pdf bib