North American Chapter of the Association for Computational Linguistics (2021)


Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Kristina Toutanova | Anna Rumshisky | Luke Zettlemoyer | Dilek Hakkani-Tur | Iz Beltagy | Steven Bethard | Ryan Cotterell | Tanmoy Chakraborty | Yichao Zhou

Knowledge Router: Learning Disentangled Representations for Knowledge Graphs
Shuai Zhang | Xi Rao | Yi Tay | Ce Zhang

The design of expressive representations of entities and relations in a knowledge graph is an important endeavor. While many of the existing approaches have primarily focused on learning from relational patterns and structural information, the intrinsic complexity of KG entities has been more or less overlooked. More concretely, we hypothesize KG entities may be more complex than we think, i.e., an entity may wear many hats and relational triplets may form due to more than a single reason. To this end, this paper proposes to learn disentangled representations of KG entities, a new method that disentangles the inner latent properties of KG entities. Our disentangling process operates at the graph level and a neighborhood mechanism is leveraged to disentangle the hidden properties of each entity. This disentangled representation learning approach is model agnostic and compatible with canonical KG embedding approaches. We conduct extensive experiments on several benchmark datasets, equipping a variety of models (DistMult, SimplE, and QuatE) with our proposed disentangling mechanism. Experimental results demonstrate that our proposed approach substantially improves performance on key metrics.

Cross-Task Instance Representation Interactions and Label Dependencies for Joint Information Extraction with Graph Convolutional Networks
Minh Van Nguyen | Viet Lai | Thien Huu Nguyen

Existing works on information extraction (IE) have mainly solved the four main tasks separately (entity mention recognition, relation extraction, event trigger detection, and argument extraction), thus failing to benefit from inter-dependencies between tasks. This paper presents a novel deep learning model to simultaneously solve the four tasks of IE in a single model (called FourIE). Compared to the few prior works on jointly performing four IE tasks, FourIE features two novel contributions to capture inter-dependencies between tasks. First, at the representation level, we introduce an interaction graph between instances of the four tasks that is used to enrich the prediction representation for one instance with those from related instances of other tasks. Second, at the label level, we propose a dependency graph for the information types in the four IE tasks that captures the connections between the types expressed in an input sentence. A new regularization mechanism is introduced to enforce the consistency between the golden and predicted type dependency graphs to improve representation learning. We show that the proposed model achieves state-of-the-art performance for joint IE in both monolingual and multilingual learning settings with three different languages.

Multilingual Language Models Predict Human Reading Behavior
Nora Hollenstein | Federico Pirovano | Ce Zhang | Lena Jäger | Lisa Beinborn

We analyze if large language models are able to predict patterns of human reading behavior. We compare the performance of language-specific and multilingual pretrained transformer models to predict reading time measures reflecting natural human sentence processing on Dutch, English, German, and Russian texts. This results in accurate models of human reading behavior, which indicates that transformer models implicitly encode relative importance in language in a way that is comparable to human processing mechanisms. We find that BERT and XLM models successfully predict a range of eye tracking features. In a series of experiments, we analyze the cross-domain and cross-language abilities of these models and show how they reflect human sentence processing.

A Non-Linear Structural Probe
Jennifer C. White | Tiago Pimentel | Naomi Saphra | Ryan Cotterell

Probes are models devised to investigate the encoding of knowledge (e.g., syntactic structure) in contextual representations. Probes are often designed for simplicity, which has led to restrictions on probe design that may not allow for the full exploitation of the structure of encoded information; one such restriction is linearity. We examine the case of a structural probe (Hewitt and Manning, 2019), which aims to investigate the encoding of syntactic structure in contextual representations through learning only linear transformations. By observing that the structural probe learns a metric, we are able to kernelize it and develop a novel non-linear variant with an identical number of parameters. We test on 6 languages and find that the radial-basis function (RBF) kernel, in conjunction with regularization, achieves a statistically significant improvement over the baseline in all languages, implying that at least part of the syntactic knowledge is encoded non-linearly. We conclude by discussing how the RBF kernel resembles BERT’s self-attention layers and speculate that this resemblance leads to the RBF-based probe’s stronger performance.
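
The idea of kernelizing a learned metric can be illustrated with a small sketch. This is not the authors' implementation: the function names, the matrix `B`, and the `gamma` parameter are assumptions made for illustration, and the RBF variant uses the standard kernel-induced squared distance k(x, x) - 2k(x, y) + k(y, y), which may differ from the paper's exact parameterization.

```python
import math

def linear_probe_sq_dist(B, h_i, h_j):
    """Squared distance ||B(h_i - h_j)||^2 under a linear map B (list of rows)."""
    diff = [a - b for a, b in zip(h_i, h_j)]
    projected = [sum(w * d for w, d in zip(row, diff)) for row in B]
    return sum(p * p for p in projected)

def rbf_probe_sq_dist(B, h_i, h_j, gamma=1.0):
    """Kernel-induced squared distance for an RBF kernel on projected vectors.

    With k(x, y) = exp(-gamma * ||Bx - By||^2) we have k(x, x) = 1, so
    k(x, x) - 2 k(x, y) + k(y, y) simplifies to 2 - 2 k(x, y).
    gamma is a hypothetical bandwidth parameter chosen for this sketch.
    """
    k_xy = math.exp(-gamma * linear_probe_sq_dist(B, h_i, h_j))
    return 2.0 - 2.0 * k_xy

# Toy usage: identity projection on 2-dimensional "representations".
B = [[1.0, 0.0], [0.0, 1.0]]
d_lin = linear_probe_sq_dist(B, [1.0, 0.0], [0.0, 0.0])
d_rbf = rbf_probe_sq_dist(B, [1.0, 0.0], [0.0, 0.0])
```

Both functions share the same parameters (the entries of `B`), which mirrors the abstract's point that the non-linear variant need not add parameters.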

Concealed Data Poisoning Attacks on NLP Models
Eric Wallace | Tony Zhao | Shi Feng | Sameer Singh

Adversarial attacks alter NLP model predictions by perturbing test-time inputs. However, it is much less understood whether, and how, predictions can be manipulated with small, concealed changes to the training data. In this work, we develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input. For instance, we insert 50 poison examples into a sentiment model’s training set that causes the model to frequently predict Positive whenever the input contains “James Bond”. Crucially, we craft these poison examples using a gradient-based procedure so that they do not mention the trigger phrase. We also apply our poison attack to language modeling (“Apple iPhone” triggers negative generations) and machine translation (“iced coffee” mistranslated as “hot coffee”). We conclude by proposing three defenses that can mitigate our attack at some cost in prediction accuracy or extra human annotation.

Backtranslation Feedback Improves User Confidence in MT, Not Quality
Vilém Zouhar | Michal Novák | Matúš Žilinec | Ondřej Bojar | Mateo Obregón | Robin L. Hill | Frédéric Blain | Marina Fomicheva | Lucia Specia | Lisa Yankovskaya

Translating text into a language unknown to the text’s author, dubbed outbound translation, is a modern need for which the user experience has significant room for improvement, beyond the basic machine translation facility. We demonstrate this by showing three ways in which user confidence in the outbound translation, as well as its overall final quality, can be affected: backward translation, quality estimation (with alignment) and source paraphrasing. In this paper, we describe an experiment on outbound translation from English to Czech and Estonian. We examine the effects of each proposed feedback module and further focus on how the quality of machine translation systems influences these findings and the user perception of success. We show that backward translation feedback has a mixed effect on the whole process: it increases user confidence in the produced translation, but not the objective quality.

Neural Machine Translation without Embeddings
Uri Shaham | Omer Levy

Many NLP models operate over sequences of subword tokens produced by hand-crafted tokenization rules and heuristic subword induction algorithms. A simple universal alternative is to represent every computerized text as a sequence of bytes via UTF-8, obviating the need for an embedding layer since there are fewer token types (256) than dimensions. Surprisingly, replacing the ubiquitous embedding layer with one-hot representations of each byte does not hurt performance; experiments on byte-to-byte machine translation from English to 10 different languages show a consistent improvement in BLEU, rivaling character-level and even standard subword-level models. A deeper investigation reveals that the combination of embeddingless models with decoder-input dropout amounts to token dropout, which benefits byte-to-byte models in particular.
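
The byte-level input representation described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code; the function name `one_hot_bytes` is hypothetical, and a real model would pad these vectors to the model dimension and batch them as tensors.

```python
def one_hot_bytes(text):
    """Encode text as UTF-8 and map each byte to a 256-dimensional one-hot vector."""
    vectors = []
    for byte in text.encode("utf-8"):
        row = [0] * 256      # one slot per possible byte value
        row[byte] = 1        # mark the observed byte
        vectors.append(row)
    return vectors

# "é" takes two bytes in UTF-8, so the two-character string "né" yields three vectors.
vectors = one_hot_bytes("né")
byte_values = [row.index(1) for row in vectors]  # [110, 195, 169]
```

Because the vocabulary is fixed at 256 byte values, no tokenizer or learned vocabulary is needed to produce the model input.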

Counterfactual Data Augmentation for Neural Machine Translation
Qi Liu | Matt Kusner | Phil Blunsom

We propose a data augmentation method for neural machine translation. It works by interpreting language models and phrasal alignment causally. Specifically, it creates augmented parallel translation corpora by generating (path-specific) counterfactual aligned phrases. We generate these by sampling new source phrases from a masked language model, then sampling an aligned counterfactual target phrase by noting that a translation language model can be interpreted as a Gumbel-Max Structural Causal Model (Oberst and Sontag, 2019). Compared to previous work, our method takes both context and alignment into account to maintain the symmetry between source and target sequences. Experiments on IWSLT’15 English-Vietnamese, WMT’17 English-German, WMT’18 English-Turkish, and WMT’19 robust English-French show that the method can improve the performance of translation, backtranslation and translation robustness.

DATE: Detecting Anomalies in Text via Self-Supervision of Transformers
Andrei Manolache | Florin Brad | Elena Burceanu

Leveraging deep learning models for Anomaly Detection (AD) has seen widespread use in recent years due to superior performances over traditional methods. Recent deep methods for anomalies in images learn better features of normality in an end-to-end self-supervised setting. These methods train a model to discriminate between different transformations applied to visual data and then use the output to compute an anomaly score. We use this approach for AD in text, by introducing a novel pretext task on text sequences. We learn our DATE model end-to-end, enforcing two independent and complementary self-supervision signals, one at the token-level and one at the sequence-level. Under this new task formulation, we show strong quantitative and qualitative results on the 20Newsgroups and AG News datasets. In the semi-supervised setting, we outperform state-of-the-art results by +13.5% and +6.9%, respectively (AUROC). In the unsupervised configuration, DATE surpasses all other methods even when 10% of its training data is contaminated with outliers (compared with 0% for the others).

Fast and Scalable Dialogue State Tracking with Explicit Modular Decomposition
Dingmin Wang | Chenghua Lin | Qi Liu | Kam-Fai Wong

We present a fast and scalable architecture called Explicit Modular Decomposition (EMD), in which we incorporate both classification-based and extraction-based methods and design four modules (for classification and sequence labelling) to jointly extract dialogue states. Experimental results based on the MultiWoz 2.0 dataset validate the superiority of our proposed model in terms of both complexity and scalability when compared to the state-of-the-art methods, especially in the scenario of multi-domain dialogues entangled with many turns of utterances.

Augmented SBERT: Data Augmentation Method for Improving Bi-Encoders for Pairwise Sentence Scoring Tasks
Nandan Thakur | Nils Reimers | Johannes Daxenberger | Iryna Gurevych

There are two approaches for pairwise sentence scoring: Cross-encoders, which perform full-attention over the input pair, and Bi-encoders, which map each input independently to a dense vector space. While cross-encoders often achieve higher performance, they are too slow for many practical use cases. Bi-encoders, on the other hand, require substantial training data and fine-tuning over the target task to achieve competitive performance. We present a simple yet efficient data augmentation strategy called Augmented SBERT, where we use the cross-encoder to label a larger set of input pairs to augment the training data for the bi-encoder. We show that, in this process, selecting the sentence pairs is non-trivial and crucial for the success of the method. We evaluate our approach on multiple tasks (in-domain) as well as on a domain adaptation task. Augmented SBERT achieves an improvement of up to 6 points for in-domain and of up to 37 points for domain adaptation tasks compared to the original bi-encoder performance.
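
The labeling step described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' implementation: `augment_training_data` and `toy_cross_encoder` are hypothetical names, the toy scorer (word-overlap Jaccard similarity) merely stands in for a trained cross-encoder, and the pair-selection strategy that the abstract calls crucial is not modeled here.

```python
def toy_cross_encoder(sent_a, sent_b):
    """Hypothetical stand-in for a trained cross-encoder: Jaccard word overlap."""
    words_a = set(sent_a.lower().split())
    words_b = set(sent_b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0

def augment_training_data(gold, unlabeled_pairs, score_fn):
    """Return gold examples plus 'silver' examples labeled by score_fn.

    gold: list of (sentence_a, sentence_b, score) triples with human labels.
    unlabeled_pairs: list of (sentence_a, sentence_b) pairs without labels.
    """
    silver = [(a, b, score_fn(a, b)) for a, b in unlabeled_pairs]
    return list(gold) + silver

gold = [("a cat", "a cat", 1.0)]
unlabeled = [("a cat sat", "a dog sat")]
train_set = augment_training_data(gold, unlabeled, toy_cross_encoder)
```

The resulting `train_set` would then be used to fine-tune the bi-encoder, which is the part that stays fast at inference time.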

SGL: Speaking the Graph Languages of Semantic Parsing via Multilingual Translation
Luigi Procopio | Rocco Tripodi | Roberto Navigli

Graph-based semantic parsing aims to represent textual meaning through directed graphs. As one of the most promising general-purpose meaning representations, these structures and their parsing have gained significant momentum in recent years, with several diverse formalisms being proposed. Yet, owing to this very heterogeneity, most of the research effort has focused mainly on solutions specific to a given formalism. In this work, instead, we reframe semantic parsing towards multiple formalisms as Multilingual Neural Machine Translation (MNMT), and propose SGL, a many-to-many seq2seq architecture trained with an MNMT objective. Backed by several experiments, we show that this framework is indeed effective once the learning procedure is enhanced with large parallel corpora coming from Machine Translation: we report competitive performances on AMR and UCCA parsing, especially once paired with pre-trained architectures. Furthermore, we find that models trained under this configuration scale remarkably well to tasks such as cross-lingual AMR parsing: SGL outperforms all its competitors by a large margin without even explicitly seeing non-English to AMR examples at training time and, once these examples are included as well, sets an unprecedented state of the art in this task. We release our code and our models for research purposes at https://github.com/SapienzaNLP/sgl.

Meta-Learning for Domain Generalization in Semantic Parsing
Bailin Wang | Mirella Lapata | Ivan Titov

The importance of building semantic parsers which can be applied to new domains and generate programs unseen at training has long been acknowledged, and datasets testing out-of-domain performance are becoming increasingly available. However, little or no attention has been devoted to learning algorithms or objectives which promote domain generalization, with virtually all existing approaches relying on standard supervised learning. In this work, we use a meta-learning framework which targets zero-shot domain generalization for semantic parsing. We apply a model-agnostic training algorithm that simulates zero-shot parsing by constructing virtual train and test sets from disjoint domains. The learning objective capitalizes on the intuition that gradient steps that improve source-domain performance should also improve target-domain performance, thus encouraging a parser to generalize to unseen target domains. Experimental results on the (English) Spider and Chinese Spider datasets show that the meta-learning objective significantly boosts the performance of a baseline parser.

APo-VAE: Text Generation in Hyperbolic Space
Shuyang Dai | Zhe Gan | Yu Cheng | Chenyang Tao | Lawrence Carin | Jingjing Liu

Natural language often exhibits inherent hierarchical structure ingrained with complex syntax and semantics. However, most state-of-the-art deep generative models learn embeddings only in Euclidean vector space, without accounting for this structural property of language. In this paper, we investigate text generation in a hyperbolic latent space to learn continuous hierarchical representations. An Adversarial Poincaré Variational Autoencoder (APo-VAE) is presented, where both the prior and variational posterior of latent variables are defined over a Poincaré ball via wrapped normal distributions. By adopting the primal-dual formulation of Kullback-Leibler divergence, an adversarial learning procedure is introduced to empower robust model training. Extensive experiments in language modeling, unaligned style transfer, and dialog-response generation demonstrate the effectiveness of the proposed APo-VAE model over VAEs in Euclidean latent space, thanks to its superb capabilities in capturing latent language hierarchies in hyperbolic space.

DART: Open-Domain Structured Data Record to Text Generation
Linyong Nan | Dragomir Radev | Rui Zhang | Amrit Rau | Abhinand Sivaprasad | Chiachun Hsieh | Xiangru Tang | Aadit Vyas | Neha Verma | Pranav Krishna | Yangxiaokang Liu | Nadia Irwanto | Jessica Pan | Faiaz Rahman | Ahmad Zaidi | Mutethia Mutuma | Yasin Tarabar | Ankit Gupta | Tao Yu | Yi Chern Tan | Xi Victoria Lin | Caiming Xiong | Richard Socher | Nazneen Fatema Rajani

We present DART, an open domain structured DAta Record to Text generation dataset with over 82k instances (DARTs). Data-to-text annotations can be a costly process, especially when dealing with tables which are the major source of structured data and contain nontrivial structures. To this end, we propose a procedure of extracting semantic triples from tables that encodes their structures by exploiting the semantic dependencies among table headers and the table title. Our dataset construction framework effectively merged heterogeneous sources from open domain semantic parsing and spoken dialogue systems by utilizing techniques including tree ontology annotation, question-answer pair to declarative sentence conversion, and predicate unification, all with minimum post-editing. We present systematic evaluation on DART as well as new state-of-the-art results on WebNLG 2017 to show that DART (1) poses new challenges to existing data-to-text datasets and (2) facilitates out-of-domain generalization. Our data and code can be found at https://github.com/Yale-LILY/dart.

Multi-Adversarial Learning for Cross-Lingual Word Embeddings
Haozhou Wang | James Henderson | Paola Merlo

Generative adversarial networks (GANs) have succeeded in inducing cross-lingual word embeddings (maps of matching words across languages) without supervision. Despite these successes, GANs’ performance for the difficult case of distant languages is still not satisfactory. These limitations have been explained by GANs’ incorrect assumption that source and target embedding spaces are related by a single linear mapping and are approximately isomorphic. We assume instead that, especially across distant languages, the mapping is only piece-wise linear, and propose a multi-adversarial learning method. This novel method induces the seed cross-lingual dictionary through multiple mappings, each induced to fit the mapping for one subspace. Our experiments on unsupervised bilingual lexicon induction and cross-lingual document classification show that this method improves performance over previous single-mapping methods, especially for distant languages.

Multi-view Subword Regularization
Xinyi Wang | Sebastian Ruder | Graham Neubig

Multilingual pretrained representations generally rely on subword segmentation algorithms to create a shared multilingual vocabulary. However, standard heuristic algorithms often lead to sub-optimal segmentation, especially for languages with limited amounts of data. In this paper, we take two major steps towards alleviating this problem. First, we demonstrate empirically that applying existing subword regularization methods (Kudo, 2018; Provilkov et al., 2020) during fine-tuning of pre-trained multilingual representations improves the effectiveness of cross-lingual transfer. Second, to take full advantage of different possible input segmentations, we propose Multi-view Subword Regularization (MVR), a method that enforces the consistency of predictors between using inputs tokenized by the standard and probabilistic segmentations. Results on the XTREME multilingual benchmark (Hu et al., 2020) show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.

MetaXL: Meta Representation Transformation for Low-resource Cross-lingual Learning
Mengzhou Xia | Guoqing Zheng | Subhabrata Mukherjee | Milad Shokouhi | Graham Neubig | Ahmed Hassan Awadallah

The combination of multilingual pre-trained representations and cross-lingual transfer learning is one of the most effective methods for building functional NLP systems for low-resource languages. However, for extremely low-resource languages without large-scale monolingual corpora for pre-training or sufficient annotated data for fine-tuning, transfer learning remains an understudied and challenging task. Moreover, recent work shows that multilingual representations are surprisingly disjoint across languages, bringing additional challenges for transfer onto extremely low-resource languages. In this paper, we propose MetaXL, a meta-learning based framework that learns to transform representations judiciously from auxiliary languages to a target one and brings their representation spaces closer for effective transfer. Extensive experiments on real-world low-resource languages without access to large-scale monolingual corpora or large amounts of labeled data for tasks like cross-lingual sentiment analysis and named entity recognition show the effectiveness of our approach. Code for MetaXL is publicly available at github.com/microsoft/MetaXL.

Open Domain Question Answering over Tables via Dense Retrieval
Jonathan Herzig | Thomas Müller | Syrine Krichene | Julian Eisenschlos

Recent advances in open-domain QA have led to strong models based on dense retrieval, but only focused on retrieving textual passages. In this work, we tackle open-domain QA over tables for the first time, and show that retrieval can be improved by a retriever designed to handle tabular context. We present an effective pre-training procedure for our retriever and improve retrieval quality with mined hard negatives. As relevant datasets are missing, we extract a subset of Natural Questions (Kwiatkowski et al., 2019) into a Table QA dataset. We find that our retriever improves retrieval results from 72.0 to 81.1 recall@10 and end-to-end QA results from 33.8 to 37.7 exact match, over a BERT based retriever.
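
For readers unfamiliar with the retrieval metric quoted above, here is a minimal sketch of recall@k. It assumes the common convention that a question counts as a hit when at least one relevant table appears among the top k retrieved results; the function name and the toy identifiers are hypothetical, and the paper's exact evaluation script may differ.

```python
def recall_at_k(rankings, relevant, k=10):
    """Fraction of questions with at least one relevant item in the top k.

    rankings: one ranked list of retrieved ids per question (best first).
    relevant: one set of relevant ids per question, aligned with rankings.
    """
    hits = 0
    for ranked_ids, relevant_ids in zip(rankings, relevant):
        if any(item in relevant_ids for item in ranked_ids[:k]):
            hits += 1
    return hits / len(rankings)

# Toy example with two questions and hypothetical table ids.
rankings = [["t3", "t1"], ["t9", "t8"]]
relevant = [{"t1"}, {"t2"}]
score_at_2 = recall_at_k(rankings, relevant, k=2)  # first question hits, second misses
score_at_1 = recall_at_k(rankings, relevant, k=1)  # neither question hits at rank 1
```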

Open-Domain Question Answering Goes Conversational via Question Rewriting
Raviteja Anantha | Svitlana Vakulenko | Zhucheng Tu | Shayne Longpre | Stephen Pulman | Srinivas Chappidi

We introduce a new dataset for Question Rewriting in Conversational Context (QReCC), which contains 14K conversations with 80K question-answer pairs. The task in QReCC is to find answers to conversational questions within a collection of 10M web pages (split into 54M passages). Answers to questions in the same conversation may be distributed across several web pages. QReCC provides annotations that allow us to train and evaluate individual subtasks of question rewriting, passage retrieval and reading comprehension required for the end-to-end conversational question answering (QA) task. We report the effectiveness of a strong baseline approach that combines the state-of-the-art model for question rewriting, and competitive models for open-domain QA. Our results set the first baseline for the QReCC dataset with F1 of 19.10, compared to the human upper bound of 75.45, indicating the difficulty of the setup and a large room for improvement.

XOR QA: Cross-lingual Open-Retrieval Question Answering
Akari Asai | Jungo Kasai | Jonathan Clark | Kenton Lee | Eunsol Choi | Hannaneh Hajishirzi

Multilingual question answering tasks typically assume that answers exist in the same language as the question. Yet in practice, many languages face both information scarcity, where languages have few reference articles, and information asymmetry, where questions reference concepts from other cultures. This work extends open-retrieval question answering to a cross-lingual setting enabling questions from one language to be answered via answer content from another language. We construct a large-scale dataset built on 40K information-seeking questions across 7 diverse non-English languages that TyDi QA could not find same-language answers for. Based on this dataset, we introduce a task framework, called Cross-lingual Open-Retrieval Question Answering (XOR QA), that consists of three new tasks involving cross-lingual document retrieval from multilingual and English resources. We establish baselines with state-of-the-art machine translation systems and cross-lingual pretrained models. Experimental results suggest that XOR QA is a challenging task that will facilitate the development of novel techniques for multilingual question answering. Our data and code are available at https://nlp.cs.washington.edu/xorqa/.

On learning and representing social meaning in NLP: a sociolinguistic perspective
Dong Nguyen | Laura Rosseel | Jack Grieve

The field of NLP has made substantial progress in building meaning representations. However, an important aspect of linguistic meaning, social meaning, has been largely overlooked. We introduce the concept of social meaning to NLP and discuss how insights from sociolinguistics can inform work on representation learning in NLP. We also identify key challenges for this new line of research.

Representing Numbers in NLP: a Survey and a Vision
Avijit Thawani | Jay Pujara | Filip Ilievski | Pedro Szekely

NLP systems rarely give special consideration to numbers found in text. This starkly contrasts with the consensus in neuroscience that, in the brain, numbers are represented differently from words. We arrange recent NLP work on numeracy into a comprehensive taxonomy of tasks and methods. We break down the subjective notion of numeracy into 7 subtasks, arranged along two dimensions: granularity (exact vs approximate) and units (abstract vs grounded). We analyze the myriad representational choices made by over a dozen previously published number encoders and decoders. We synthesize best practices for representing numbers in text and articulate a vision for holistic numeracy in NLP, comprised of design trade-offs and a unified evaluation.

Identifying Helpful Sentences in Product Reviews
Iftah Gamzu | Hila Gonen | Gilad Kutiel | Ran Levy | Eugene Agichtein

In recent years online shopping has gained momentum and become an important venue for customers wishing to save time and simplify their shopping process. A key advantage of shopping online is the ability to read what other customers are saying about products of interest. In this work, we aim to maintain this advantage in situations where extreme brevity is needed, for example, when shopping by voice. We suggest a novel task of extracting a single representative helpful sentence from a set of reviews for a given product. The selected sentence should meet two conditions: first, it should be helpful for a purchase decision and second, the opinion it expresses should be supported by multiple reviewers. This task is closely related to the task of Multi Document Summarization in the product reviews domain but differs in its objective and its level of conciseness. We collect a dataset in English of sentence helpfulness scores via crowd-sourcing and demonstrate its reliability despite the inherent subjectivity involved. Next, we describe a complete model that extracts representative helpful sentences with positive and negative sentiment towards the product and demonstrate that it outperforms several baselines.

Noisy Self-Knowledge Distillation for Text Summarization
Yang Liu | Sheng Shen | Mirella Lapata

In this paper we apply self-knowledge distillation to text summarization which we argue can alleviate problems with maximum-likelihood training on single reference and noisy datasets. Instead of relying on one-hot annotation labels, our student summarization model is trained with guidance from a teacher which generates smoothed labels to help regularize training. Furthermore, to better model uncertainty during training, we introduce multiple noise signals for both teacher and student models. We demonstrate experimentally on three benchmarks that our framework boosts the performance of both pretrained and non-pretrained summarizers achieving state-of-the-art results.

Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation
Alexander Fabbri | Simeng Han | Haoyuan Li | Haoran Li | Marjan Ghazvininejad | Shafiq Joty | Dragomir Radev | Yashar Mehdad

Models pretrained with self-supervised objectives on large text corpora achieve state-of-the-art performance on English text summarization tasks. However, these models are typically fine-tuned on hundreds of thousands of data points, an infeasible requirement when applying summarization to new, niche domains. In this work, we introduce a novel and generalizable method, called WikiTransfer, for fine-tuning pretrained models for summarization in an unsupervised, dataset-specific manner. WikiTransfer fine-tunes pretrained models on pseudo-summaries, produced from generic Wikipedia data, which contain characteristics of the target dataset, such as the length and level of abstraction of the desired summaries. WikiTransfer models achieve state-of-the-art, zero-shot abstractive summarization performance on the CNN-DailyMail dataset and demonstrate the effectiveness of our approach on three additional diverse datasets. These models are more robust to noisy data and also achieve better or comparable few-shot performance using 10 and 100 training examples when compared to few-shot transfer from other summarization datasets. To further boost performance, we employ data augmentation via round-trip translation as well as introduce a regularization term for improved few-shot transfer. To understand the role of dataset aspects in transfer performance and the quality of the resulting output summaries, we further study the effect of the components of our unsupervised fine-tuning data and analyze few-shot performance using both automatic and human evaluation.

Nice Try, Kiddo: Investigating Ad Hominems in Dialogue Responses
Emily Sheng | Kai-Wei Chang | Prem Natarajan | Nanyun Peng

Ad hominem attacks are those that target some feature of a person’s character instead of the position the person is maintaining. These attacks are harmful because they propagate implicit biases and diminish a person’s credibility. Since dialogue systems respond directly to user input, it is important to study ad hominems in dialogue responses. To this end, we propose categories of ad hominems, compose an annotated dataset, and build a classifier to analyze human and dialogue system responses to English Twitter posts. We specifically compare responses to Twitter topics about marginalized communities (#BlackLivesMatter, #MeToo) versus other topics (#Vegan, #WFH), because the abusive language of ad hominems could further amplify the skew of power away from marginalized populations. Furthermore, we propose a constrained decoding technique that uses salient n-gram similarity as a soft constraint for top-k sampling to reduce the amount of ad hominems generated. Our results indicate that 1) responses from both humans and DialoGPT contain more ad hominems for discussions around marginalized communities, 2) different quantities of ad hominems in the training data can influence the likelihood of generating ad hominems, and 3) we can use constrained decoding techniques to reduce ad hominems in generated dialogue responses.

pdf bib
Spoken Language Understanding for Task-oriented Dialogue Systems with Augmented Memory Networks
Jie Wu | Ian Harris | Hongzhi Zhao

Spoken language understanding, usually including intent detection and slot filling, is a core component of a spoken dialog system. Recent research shows promising results from jointly learning these two tasks, based on the fact that slot filling and intent detection share semantic knowledge. Furthermore, attention mechanisms boost joint learning to achieve state-of-the-art results. However, current joint learning models ignore the following important facts: (1) long-term slot context, which is crucial for future slot filling, is not traced effectively; (2) slot tagging and intent detection could be mutually rewarding, but bi-directional interaction between slot filling and intent detection remains seldom explored. In this paper, we propose a novel approach to model long-term slot context and to fully utilize the semantic correlation between slots and intents. We adopt a key-value memory network to model slot context dynamically and to track more important slot tags decoded before, which are then fed into our decoder for slot tagging. Furthermore, gated memory information is utilized to perform intent detection, mutually improving both tasks through global optimization. Experiments on benchmark ATIS and Snips datasets show that our model achieves state-of-the-art performance and outperforms other methods, especially for the slot filling task.

pdf bib
Self-Training with Weak Supervision
Giannis Karamanolakis | Subhabrata Mukherjee | Guoqing Zheng | Ahmed Hassan Awadallah

State-of-the-art deep neural networks require large-scale labeled training data that is often expensive to obtain or not available for many tasks. Weak supervision in the form of domain-specific rules has been shown to be useful in such settings to automatically generate weakly labeled training data. However, learning with weak rules is challenging due to their inherent heuristic and noisy nature. An additional challenge is rule coverage and overlap, where prior work on weak supervision only considers instances that are covered by weak rules, thus leaving valuable unlabeled data behind. In this work, we develop a weak supervision framework (ASTRA) that leverages all the available data for a given task. To this end, we leverage task-specific unlabeled data through self-training with a model (student) that considers contextualized representations and predicts pseudo-labels for instances that may not be covered by weak rules. We further develop a rule attention network (teacher) that learns how to aggregate student pseudo-labels with weak rule labels, conditioned on their fidelity and the underlying context of an instance. Finally, we construct a semi-supervised learning objective for end-to-end training with unlabeled data, domain-specific rules, and a small amount of labeled data. Extensive experiments on six benchmark datasets for text classification demonstrate the effectiveness of our approach with significant improvements over state-of-the-art baselines.

pdf bib
Probabilistic Box Embeddings for Uncertain Knowledge Graph Reasoning
Xuelu Chen | Michael Boratko | Muhao Chen | Shib Sankar Dasgupta | Xiang Lorraine Li | Andrew McCallum

Knowledge bases often consist of facts which are harvested from a variety of sources, many of which are noisy and some of which conflict, resulting in a level of uncertainty for each triple. Knowledge bases are also often incomplete, prompting the use of embedding methods to generalize from known facts. However, existing embedding methods only model triple-level uncertainty, and reasoning results lack global consistency. To address these shortcomings, we propose BEUrRE, a novel uncertain knowledge graph embedding method with calibrated probabilistic semantics. BEUrRE models each entity as a box (i.e., an axis-aligned hyperrectangle) and relations between two entities as affine transforms on the head and tail entity boxes. The geometry of the boxes allows for efficient calculation of intersections and volumes, endowing the model with calibrated probabilistic semantics and facilitating the incorporation of relational constraints. Extensive experiments on two benchmark datasets show that BEUrRE consistently outperforms baselines on confidence prediction and fact ranking due to its probabilistic calibration and ability to capture high-order dependencies among facts.
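
As a rough illustration of why box geometry makes intersections and volumes cheap to compute, the sketch below treats each entity as a hard axis-aligned box and reads a conditional probability off the ratio of volumes. This is only a minimal sketch under stated assumptions: the helper names (`box_volume`, `intersect`, `conditional_prob`) and the example boxes are hypothetical, relation-specific affine transforms are omitted, and nothing here is taken from the BEUrRE implementation itself.

```python
from math import prod

def box_volume(lo, hi):
    """Volume of an axis-aligned box; zero if any side is empty."""
    return prod(max(h - l, 0.0) for l, h in zip(lo, hi))

def intersect(box_a, box_b):
    """Intersection of two boxes, each given as (lower corner, upper corner)."""
    (lo_a, hi_a), (lo_b, hi_b) = box_a, box_b
    lo = [max(a, b) for a, b in zip(lo_a, lo_b)]
    hi = [min(a, b) for a, b in zip(hi_a, hi_b)]
    return lo, hi

def conditional_prob(box_a, box_b):
    """Assumed reading: P(A | B) = Vol(A ∩ B) / Vol(B)."""
    vol_b = box_volume(*box_b)
    if vol_b == 0.0:
        return 0.0
    return box_volume(*intersect(box_a, box_b)) / vol_b

# Two hypothetical 2-D entity boxes.
box_a = ([0.0, 0.0], [2.0, 2.0])
box_b = ([1.0, 1.0], [3.0, 2.0])

p_a_given_b = conditional_prob(box_a, box_b)
p_b_given_a = conditional_prob(box_b, box_a)
```

Because the score is a ratio of volumes, it always lies in [0, 1] and is asymmetric, which is what allows a geometric model of this kind to carry probabilistic semantics.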

pdf bib
Document-Level Event Argument Extraction by Conditional Generation
Sha Li | Heng Ji | Jiawei Han

Event extraction has long been treated as a sentence-level task in the IE community. We argue that this setting does not match human information-seeking behavior and leads to incomplete and uninformative extraction results. We propose a document-level neural event argument extraction model by formulating the task as conditional generation following event templates. We also compile a new document-level event extraction benchmark dataset, WikiEvents, which includes complete event and coreference annotation. On the task of argument extraction, we achieve an absolute gain of 7.6% F1 and 5.7% F1 over the next best model on the RAMS and WikiEvents datasets, respectively. On the more challenging task of informative argument extraction, which requires implicit coreference reasoning, we achieve a 9.3% F1 gain over the best baseline. To demonstrate the portability of our model, we also create the first end-to-end zero-shot event extraction framework and achieve 97% of the fully supervised model’s trigger extraction performance and 82% of the argument extraction performance given only access to 10 out of the 33 types on ACE.

pdf bib
Template Filling with Generative Transformers
Xinya Du | Alexander Rush | Claire Cardie

Template filling is generally tackled by a pipeline of two separate supervised systems: one for role-filler extraction and another for template/event recognition. Since pipelines consider events in isolation, they can suffer from error propagation. We introduce a framework based on end-to-end generative transformers for this task (i.e., GTT). It naturally models the dependence between entities both within a single event and across the multiple events described in a document. Experiments demonstrate that this framework substantially outperforms pipeline-based approaches and other neural end-to-end baselines that do not model between-event dependencies. We further show that our framework specifically improves performance on documents containing multiple events.

pdf bib
On Attention Redundancy: A Comprehensive Study
Yuchen Bian | Jiaji Huang | Xingyu Cai | Jiahong Yuan | Kenneth Church

The multi-layer multi-head self-attention mechanism is widely applied in modern neural language models. Attention redundancy has been observed among attention heads but has not been deeply studied in the literature. Using the BERT-base model as an example, this paper provides a comprehensive study of attention redundancy, which is helpful for model interpretation and model compression. We analyze attention redundancy with the Five Ws and How. (What) We define and focus the study on redundancy matrices generated from pre-trained and fine-tuned BERT-base models for GLUE datasets. (How) We use both token-based and sentence-based distance functions to measure the redundancy. (Where) Clear and similar redundancy patterns (cluster structure) are observed among attention heads. (When) Redundancy patterns are similar in both pre-training and fine-tuning phases. (Who) We discover that redundancy patterns are task-agnostic. Similar redundancy patterns even exist for randomly generated token sequences. (Why) We also evaluate the influence of the pre-training dropout ratios on attention redundancy. Based on the phase-independent and task-agnostic attention redundancy patterns, we propose a simple zero-shot pruning method as a case study. Experiments on fine-tuning GLUE tasks verify its effectiveness. The comprehensive analyses of attention redundancy make model understanding and zero-shot model pruning promising.

pdf bib
Does BERT Pretrained on Clinical Notes Reveal Sensitive Data?
Eric Lehman | Sarthak Jain | Karl Pichotta | Yoav Goldberg | Byron Wallace

Large Transformers pretrained over clinical notes from Electronic Health Records (EHR) have afforded substantial gains in performance on predictive clinical tasks. The cost of training such models (and the necessity of data access to do so) coupled with their utility motivates parameter sharing, i.e., the release of pretrained models such as ClinicalBERT. While most efforts have used deidentified EHR, many researchers have access to large sets of sensitive, non-deidentified EHR with which they might train a BERT model (or similar). Would it be safe to release the weights of such a model if they did? In this work, we design a battery of approaches intended to recover Personal Health Information (PHI) from a trained BERT. Specifically, we attempt to recover patient names and conditions with which they are associated. We find that simple probing methods are not able to meaningfully extract sensitive information from BERT trained over the MIMIC-III corpus of EHR. However, more sophisticated attacks may succeed in doing so. To facilitate such research, we make our experimental setup and baseline probing models available at https://github.com/elehman16/exposing_patient_data_release.

pdf bib
Low-Complexity Probing via Finding Subnetworks
Steven Cao | Victor Sanh | Alexander Rush

The dominant approach in probing neural networks for linguistic properties is to train a new shallow multi-layer perceptron (MLP) on top of the model’s internal representations. This approach can detect properties encoded in the model, but at the cost of adding new parameters that may learn the task directly. We instead propose a subtractive pruning-based probe, where we find an existing subnetwork that performs the linguistic task of interest. Compared to an MLP, the subnetwork probe achieves both higher accuracy on pre-trained models and lower accuracy on random models, so it is both better at finding properties of interest and worse at learning on its own. Next, by varying the complexity of each probe, we show that subnetwork probing Pareto-dominates MLP probing in that it achieves higher accuracy given any budget of probe complexity. Finally, we analyze the resulting subnetworks across various tasks to locate where each task is encoded, and we find that lower-level tasks are captured in lower layers, reproducing similar findings in past work.

pdf bib
An Empirical Comparison of Instance Attribution Methods for NLP
Pouya Pezeshkpour | Sarthak Jain | Byron Wallace | Sameer Singh

Widespread adoption of deep models has motivated a pressing need for approaches to interpret network outputs and to facilitate model debugging. Instance attribution methods constitute one means of accomplishing these goals by retrieving training instances that (may have) led to a particular prediction. Influence functions (IF; Koh and Liang 2017) provide machinery for doing this by quantifying the effect that perturbing individual train instances would have on a specific test prediction. However, even approximating the IF is computationally expensive, to a degree that may be prohibitive in many cases. Might simpler approaches (e.g., retrieving train examples most similar to a given test point) perform comparably? In this work, we evaluate the degree to which different potential instance attribution methods agree with respect to the importance of training samples. We find that simple retrieval methods yield training instances that differ from those identified via gradient-based methods (such as IFs), but that nonetheless exhibit desirable characteristics similar to more complex attribution methods. Code for all methods and experiments in this paper is available at https://github.com/successar/instance_attributions_NLP.

pdf bib
Generalization in Instruction Following Systems
Soham Dan | Michael Zhou | Dan Roth

Understanding and executing natural language instructions in a grounded domain is one of the hallmarks of artificial intelligence. In this paper, we focus on instruction understanding in the blocks world domain and investigate the language understanding abilities of two top-performing systems for the task. We aim to understand if the test performance of these models indicates an understanding of the spatial domain and of the natural language instructions relative to it, or whether they merely over-fit spurious signals in the dataset. We formulate a set of expectations one might have from an instruction following model and concretely characterize the different dimensions of robustness such a model should possess. Despite decent test performance, we find that state-of-the-art models fall short of these expectations and are extremely brittle. We then propose a learning strategy that involves data augmentation and show through extensive experiments that the proposed learning strategy yields models that are competitive on the original test set while satisfying our expectations much better.

pdf bib
MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal Language Sequences
Jianing Yang | Yongxin Wang | Ruitao Yi | Yuying Zhu | Azaan Rehman | Amir Zadeh | Soujanya Poria | Louis-Philippe Morency

Human communication is multimodal in nature; it is through multiple modalities such as language, voice, and facial expressions that opinions and emotions are expressed. Data in this domain exhibits complex multi-relational and temporal interactions. Learning from this data is a fundamentally challenging research problem. In this paper, we propose Modal-Temporal Attention Graph (MTAG). MTAG is an interpretable graph-based neural model that provides a suitable framework for analyzing multimodal sequential data. We first introduce a procedure to convert unaligned multimodal sequence data into a graph with heterogeneous nodes and edges that captures the rich interactions across modalities and through time. Then, a novel graph fusion operation, called MTAG fusion, along with a dynamic pruning and read-out technique, is designed to efficiently process this modal-temporal graph and capture various interactions. By learning to focus only on the important interactions within the graph, MTAG achieves state-of-the-art performance on multimodal sentiment analysis and emotion recognition benchmarks, while utilizing significantly fewer model parameters.

pdf bib
Grounding Open-Domain Instructions to Automate Web Support Tasks
Nancy Xu | Sam Masling | Michael Du | Giovanni Campagna | Larry Heck | James Landay | Monica Lam

Grounding natural language instructions on the web to perform previously unseen tasks enables accessibility and automation. We introduce a task and dataset to train AI agents from open-domain, step-by-step instructions originally written for people. We build RUSS (Rapid Universal Support Service) to tackle this problem. RUSS consists of two models. First, a BERT-LSTM with pointers parses instructions to WebLang, a domain-specific language we design for grounding natural language on the web. Then, a grounding model retrieves the unique IDs of any webpage elements requested in the WebLang. RUSS may interact with the user through a dialogue (e.g., ask for an address) or execute a web operation (e.g., click a button) inside the web runtime. To augment training, we synthesize natural language instructions mapped to WebLang. Our dataset consists of 80 different customer service problems from help websites, with a total of 741 step-by-step instructions and their corresponding actions. RUSS achieves 76.7% end-to-end accuracy predicting agent actions from single instructions. It outperforms state-of-the-art models that directly map instructions to actions without WebLang. Our user study shows that RUSS is preferred by actual users over web navigation.

pdf bib
Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information
Jialu Li | Hao Tan | Mohit Bansal

Vision language navigation is the task that requires an agent to navigate through a 3D environment based on natural language instructions. One key challenge in this task is to ground instructions with the current visual information that the agent perceives. Most of the existing work employs soft attention over individual words to locate the instruction required for the next action. However, different words have different functions in a sentence (e.g., modifiers convey attributes, verbs convey actions). Syntax information like dependencies and phrase structures can help the agent locate important parts of the instruction. Hence, in this paper, we propose a navigation agent that utilizes syntax information derived from a dependency tree to enhance alignment between the instruction and the current visual scenes. Empirically, our agent outperforms the baseline model that does not use syntax information on the Room-to-Room dataset, especially in the unseen environment. In addition, our agent achieves a new state of the art on the Room-Across-Room dataset, which contains instructions in 3 languages (English, Hindi, and Telugu). We also show that our agent is better at aligning instructions with the current visual information via qualitative visualizations.

pdf bib
Understanding Hard Negatives in Noise Contrastive Estimation
Wenzheng Zhang | Karl Stratos

The choice of negative examples is important in noise contrastive estimation. Recent works find that hard negatives (the highest-scoring incorrect examples under the model) are effective in practice, but they are used without a formal justification. We develop analytical tools to understand the role of hard negatives. Specifically, we view the contrastive loss as a biased estimator of the gradient of the cross-entropy loss, and show both theoretically and empirically that setting the negative distribution to be the model distribution results in bias reduction. We also derive a general form of the score function that unifies various architectures used in text retrieval. By combining hard negatives with appropriate score functions, we obtain strong results on the challenging task of zero-shot entity linking.
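
To make the notion of hard negatives concrete, the following minimal sketch selects the highest-scoring incorrect candidates under a model and plugs them into a softmax-style contrastive loss. The function names, the toy scores, and the choice of k are hypothetical and chosen purely for illustration; this is not the authors' code, and the loss shown is a generic contrastive cross-entropy rather than the exact estimator analyzed in the paper.

```python
import math

def hardest_negatives(scores, gold_index, k):
    """Indices of the k highest-scoring incorrect candidates."""
    negatives = [i for i in range(len(scores)) if i != gold_index]
    negatives.sort(key=lambda i: scores[i], reverse=True)
    return negatives[:k]

def contrastive_loss(scores, gold_index, negative_indices):
    """Cross-entropy of the gold candidate against the chosen negatives only."""
    logits = [scores[gold_index]] + [scores[i] for i in negative_indices]
    peak = max(logits)  # subtract the max for numerical stability
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_z - scores[gold_index]

# Hypothetical model scores for five candidates; candidate 0 is correct.
scores = [2.0, 0.5, 1.8, -1.0, 1.9]
gold = 0

hard = hardest_negatives(scores, gold, k=2)
loss_hard = contrastive_loss(scores, gold, hard)
loss_easy = contrastive_loss(scores, gold, [1, 3])
loss_full = contrastive_loss(scores, gold, [1, 2, 3, 4])
```

On these toy scores the loss computed with hard negatives lies much closer to the full cross-entropy than the loss computed with low-scoring negatives, which gives some intuition for the bias-reduction argument in the abstract.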

pdf bib
DReCa: A General Task Augmentation Strategy for Few-Shot Natural Language Inference
Shikhar Murty | Tatsunori B. Hashimoto | Christopher Manning

Meta-learning promises few-shot learners that can adapt to new distributions by repurposing knowledge acquired from previous training. However, we believe meta-learning has not yet succeeded in NLP due to the lack of a well-defined task distribution, leading to attempts that treat datasets as tasks. Such an ad hoc task distribution causes problems of quantity and quality. Since there’s only a handful of datasets for any NLP problem, meta-learners tend to overfit their adaptation mechanism and, since NLP datasets are highly heterogeneous, many learning episodes have poor transfer between their support and query sets, which discourages the meta-learner from adapting. To alleviate these issues, we propose DReCA (Decomposing datasets into Reasoning Categories), a simple method for discovering and using latent reasoning categories in a dataset to form additional high-quality tasks. DReCA works by splitting examples into label groups, embedding them with a finetuned BERT model, and then clustering each group into reasoning categories. Across four few-shot NLI problems, we demonstrate that using DReCA improves the accuracy of meta-learners by 1.5-4%.

pdf bib
Harnessing Multilinguality in Unsupervised Machine Translation for Rare Languages
Xavier Garcia | Aditya Siddhant | Orhan Firat | Ankur Parikh

Unsupervised translation has reached impressive performance on resource-rich language pairs such as English-French and English-German. However, early studies have shown that in more realistic settings involving low-resource, rare languages, unsupervised translation performs poorly, achieving less than 3.0 BLEU. In this work, we show that multilinguality is critical to making unsupervised systems practical for low-resource settings. In particular, we present a single model for 5 low-resource languages (Gujarati, Kazakh, Nepali, Sinhala, and Turkish) to and from English directions, which leverages monolingual and auxiliary parallel data from other high-resource language pairs via a three-stage training scheme. We outperform all current state-of-the-art unsupervised baselines for these languages, achieving gains of up to 14.4 BLEU. Additionally, we outperform strong supervised baselines for various language pairs as well as match the performance of the current state-of-the-art supervised model for Nepali-English. We conduct a series of ablation studies to establish the robustness of our model under different degrees of data quality, as well as to analyze the factors which led to the superior performance of the proposed approach over traditional unsupervised models.

pdf bib
Assessing Reference-Free Peer Evaluation for Machine Translation
Sweta Agrawal | George Foster | Markus Freitag | Colin Cherry

Reference-free evaluation has the potential to make machine translation evaluation substantially more scalable, allowing us to pivot easily to new languages or domains. It has recently been shown that the probabilities given by a large, multilingual model can achieve state-of-the-art results when used as a reference-free metric. We experiment with various modifications to this model, and demonstrate that by scaling it up we can match the performance of BLEU. We analyze various potential weaknesses of the approach, and find that it is surprisingly robust and likely to offer reasonable performance across a broad spectrum of domains and different system qualities.

pdf bib
The Curious Case of Hallucinations in Neural Machine Translation
Vikas Raunak | Arul Menezes | Marcin Junczys-Dowmunt

In this work, we study hallucinations in Neural Machine Translation (NMT), which lie at an extreme end on the spectrum of NMT pathologies. Firstly, we connect the phenomenon of hallucinations under source perturbation to the Long-Tail theory of Feldman, and present an empirically validated hypothesis that explains hallucinations under source perturbation. Secondly, we consider hallucinations under corpus-level noise (without any source perturbation) and demonstrate that two prominent types of natural hallucinations (detached and oscillatory outputs) could be generated and explained through specific corpus-level noise patterns. Finally, we elucidate the phenomenon of hallucination amplification in popular data-generation processes such as Backtranslation and sequence-level Knowledge Distillation. We have released the datasets and code to replicate our results.

pdf bib
Towards Modeling the Style of Translators in Neural Machine Translation
Yue Wang | Cuong Hoang | Marcello Federico

One key ingredient of neural machine translation is the use of large datasets from different domains and resources (e.g., Europarl, TED talks). These datasets contain documents translated by professional translators using different but consistent translation styles. Despite that, the model is usually trained in a way that neither explicitly captures the variety of translation styles present in the data nor translates new data in different and controllable styles. In this work, we investigate methods to augment the state-of-the-art Transformer model with translator information that is available in part of the training data. We show that our style-augmented translation models are able to capture the style variations of translators and to generate translations with different styles on new data. Indeed, the generated variations differ significantly, up to +4.5 BLEU score difference. Despite that, human evaluation confirms that the translations are of the same quality.

pdf bib
Self-Supervised Test-Time Learning for Reading Comprehension
Pratyay Banerjee | Tejas Gokhale | Chitta Baral

Recent work on unsupervised question answering has shown that models can be trained with procedurally generated question-answer pairs and can achieve performance competitive with supervised methods. In this work, we consider the task of unsupervised reading comprehension and present a method that performs test-time learning (TTL) on a given context (text passage), without requiring training on large-scale human-authored datasets containing context-question-answer triplets. This method operates directly on a single test context, uses self-supervision to train models on synthetically generated question-answer pairs, and then infers answers to unseen human-authored questions for this context. Our method achieves accuracies competitive with fully supervised methods and significantly outperforms current unsupervised methods. TTL methods with a smaller model are also competitive with the current state-of-the-art in unsupervised reading comprehension.

pdf bib
Capturing Row and Column Semantics in Transformer Based Question Answering over Tables
Michael Glass | Mustafa Canim | Alfio Gliozzo | Saneem Chemmengath | Vishwajeet Kumar | Rishav Chakravarti | Avi Sil | Feifei Pan | Samarth Bharadwaj | Nicolas Rodolfo Fauceglia

Transformer-based architectures have recently been used for the task of answering questions over tables. In order to improve the accuracy on this task, specialized pre-training techniques have been developed and applied on millions of open-domain web tables. In this paper, we propose two novel approaches demonstrating that one can achieve superior performance on the table QA task without using any of these specialized pre-training techniques. The first model, called RCI interaction, leverages a transformer-based architecture that independently classifies rows and columns to identify relevant cells. While this model yields extremely high accuracy at finding cell values on recent benchmarks, a second model we propose, called RCI representation, provides a significant efficiency advantage for online QA systems over tables by materializing embeddings for existing tables. Experiments on recent benchmarks prove that the proposed methods can effectively locate cell values on tables (up to ~98% Hit@1 accuracy on WikiSQL lookup questions). Also, the interaction model outperforms the state-of-the-art transformer-based approaches pre-trained on very large table corpora (TAPAS and TaBERT), achieving ~3.4% and ~18.86% additional precision improvement on the standard WikiSQL benchmark.

pdf bib
Robust Question Answering Through Sub-part Alignment
Jifan Chen | Greg Durrett

Current textual question answering (QA) models achieve strong performance on in-domain test sets, but often do so by fitting surface-level patterns, so they fail to generalize to out-of-distribution settings. To make a more robust and understandable QA system, we model question answering as an alignment problem. We decompose both the question and context into smaller units based on off-the-shelf semantic representations (here, semantic roles), and align the question to a subgraph of the context in order to find the answer. We formulate our model as a structured SVM, with alignment scores computed via BERT, and we can train end-to-end despite using beam search for approximate inference. Our use of explicit alignments allows us to explore a set of constraints with which we can prohibit certain types of bad model behavior arising in cross-domain settings. Furthermore, by investigating differences in scores across different potential answers, we can seek to understand what particular aspects of the input lead the model to choose the answer without relying on post-hoc explanation techniques. We train our model on SQuAD v1.1 and test it on several adversarial and out-of-domain datasets. The results show that our model is more robust than the standard BERT QA model, and constraints derived from alignment scores allow us to effectively trade off coverage and accuracy.

pdf bib
RECONSIDER: Improved Re-Ranking using Span-Focused Cross-Attention for Open Domain Question Answering
Srinivasan Iyer | Sewon Min | Yashar Mehdad | Wen-tau Yih

State-of-the-art Machine Reading Comprehension (MRC) models for Open-domain Question Answering (QA) are typically trained for span selection using distantly supervised positive examples and heuristically retrieved negative examples. This training scheme possibly explains empirical observations that these models achieve a high recall amongst their top few predictions, but a low overall accuracy, motivating the need for answer re-ranking. We develop a successful re-ranking approach (RECONSIDER) for span-extraction tasks that improves upon the performance of MRC models, even beyond large-scale pre-training. RECONSIDER is trained on positive and negative examples extracted from high confidence MRC model predictions, and uses in-passage span annotations to perform span-focused re-ranking over a smaller candidate set. As a result, RECONSIDER learns to eliminate close false positives, achieving a new extractive state of the art on four QA tasks, with 45.5% Exact Match accuracy on Natural Questions with real user questions, and 61.7% on TriviaQA. We will release all related data, models, and code.

pdf bib
Looking Beyond Sentence-Level Natural Language Inference for Question Answering and Text Summarization
Anshuman Mishra | Dhruvesh Patel | Aparna Vijayakumar | Xiang Lorraine Li | Pavan Kapanipathi | Kartik Talamadupula

Natural Language Inference (NLI) has garnered significant attention in recent years; however, the promise of applying NLI breakthroughs to other downstream NLP tasks has remained unfulfilled. In this work, we use the multiple-choice reading comprehension (MCRC) and checking factual correctness of textual summarization (CFCS) tasks to investigate potential reasons for this. Our findings show that: (1) the relatively shorter length of premises in traditional NLI datasets is the primary challenge prohibiting usage in downstream applications (which do better with longer contexts); (2) this challenge can be addressed by automatically converting resource-rich reading comprehension datasets into longer-premise NLI datasets; and (3) models trained on the converted, longer-premise datasets outperform those trained using short-premise traditional NLI datasets on downstream tasks primarily due to the difference in premise lengths.

pdf bib
A New Approach to Overgenerating and Scoring Abstractive Summaries
Kaiqiang Song | Bingqing Wang | Zhe Feng | Fei Liu

We propose a new approach to generate multiple variants of the target summary with diverse content and varying lengths, then score and select admissible ones according to users’ needs. Abstractive summarizers trained on single reference summaries may struggle to produce outputs that achieve multiple desirable properties, i.e., capturing the most important information, being faithful to the original, and being grammatical and fluent. In this paper, we propose a two-staged strategy to generate a diverse set of candidate summaries from the source text in stage one, then score and select admissible ones in stage two. Importantly, our generator gives precise control over the length of the summary, which is especially useful when space is limited. Our selectors are designed to predict the optimal summary length and put special emphasis on faithfulness to the original text. Both stages can be effectively trained, optimized and evaluated. Our experiments on benchmark summarization datasets suggest that this paradigm can achieve state-of-the-art performance.

pdf bib
D2S: Document-to-Slide Generation Via Query-Based Text Summarization
Edward Sun | Yufang Hou | Dakuo Wang | Yunfeng Zhang | Nancy X. R. Wang

Presentations are critical for communication in all areas of our lives, yet the creation of slide decks is often tedious and time-consuming. Limited research has aimed to automate the document-to-slides generation process, and all such efforts face a critical challenge: no publicly available dataset for training and benchmarking. In this work, we first contribute a new dataset, SciDuet, consisting of pairs of papers and their corresponding slide decks from recent years’ NLP and ML conferences (e.g., ACL). Secondly, we present D2S, a novel system that tackles the document-to-slides task with a two-step approach: 1) use slide titles to retrieve relevant and engaging text, figures, and tables; 2) summarize the retrieved context into bullet points with long-form question answering. Our evaluation suggests that long-form QA outperforms state-of-the-art summarization baselines on both automated ROUGE metrics and qualitative human evaluation.

pdf bib
Efficient Attentions for Long Document Summarization
Luyang Huang | Shuyang Cao | Nikolaus Parulian | Heng Ji | Lu Wang

The quadratic computational and memory complexities of large Transformers have limited their scalability for long document summarization. In this paper, we propose Hepos, a novel efficient encoder-decoder attention with head-wise positional strides to effectively pinpoint salient information from the source. We further conduct a systematic study of existing efficient self-attentions. Combined with Hepos, we are able to process ten times more tokens than existing models that use full attentions. For evaluation, we present a new dataset, GovReport, with significantly longer documents and summaries. Results show that our models produce significantly higher ROUGE scores than competitive comparisons, including new state-of-the-art results on PubMed. Human evaluation also shows that our models generate more informative summaries with fewer unfaithful errors.

pdf bib
RefSum: Refactoring Neural Summarization
Yixin Liu | Zi-Yi Dou | Pengfei Liu

Although some recent works show potential complementarity among different state-of-the-art systems, few works try to investigate this problem in text summarization. Researchers in other areas commonly refer to the techniques of reranking or stacking to approach this problem. In this work, we highlight several limitations of previous methods, which motivates us to present a new framework, Refactor, that provides a unified view of text summarization and summaries combination. Experimentally, we perform a comprehensive evaluation that involves twenty-two base systems, four datasets, and three different application scenarios. Besides new state-of-the-art results on the CNN/DailyMail dataset (46.18 ROUGE-1), we also elaborate on how our proposed method addresses the limitations of the traditional methods, and the effectiveness of the Refactor model offers insight into how performance can be improved. Our system can be directly used by other researchers as an off-the-shelf tool to achieve further performance improvements. We open-source all the code and provide a convenient interface to use it: https://github.com/yixinL7/Refactoring-Summarization.

pdf bib
Annotating and Modeling Fine-grained Factuality in Summarization
Tanya Goyal | Greg Durrett

Recent pre-trained abstractive summarization systems have started to achieve credible performance, but a major barrier to their use in practice is their propensity to output summaries that are not faithful to the input and that contain factual errors. While a number of annotated datasets and statistical models for assessing factuality have been explored, there is no clear picture of what errors are most important to target or where current techniques are succeeding and failing. We explore both synthetic and human-labeled data sources for training models to identify factual errors in summarization, and study factuality at the word-, dependency-, and sentence-level. Our observations are threefold. First, exhibited factual errors differ significantly across datasets, and commonly-used training sets of simple synthetic errors do not reflect errors made on abstractive datasets like XSum. Second, human-labeled data with fine-grained annotations provides a more effective training signal than sentence-level annotations or synthetic data. Finally, we show that our best factuality detection model enables training of more factual XSum summarization models by allowing us to identify non-factual tokens in the training data.

pdf bib
Larger-Context Tagging: When and Why Does It Work?
Jinlan Fu | Liangjing Feng | Qi Zhang | Xuanjing Huang | Pengfei Liu

The development of neural networks and pretraining techniques has spawned many sentence-level tagging systems that achieved superior performance on typical benchmarks. However, a relatively less discussed topic is what if more context information is introduced into current top-scoring tagging systems. Although several existing works have attempted to shift tagging systems from sentence-level to document-level, there is still no consensus conclusion about when and why it works, which limits the applicability of the larger-context approach in tagging tasks. In this paper, instead of pursuing a state-of-the-art tagging system by architectural exploration, we focus on investigating when and why the larger-context training, as a general strategy, can work. To this end, we conduct a thorough comparative study on four proposed aggregators for context information collecting and present an attribute-aided evaluation method to interpret the improvement brought by larger-context training. Experimentally, we set up a testbed based on four tagging tasks and thirteen datasets. Hopefully, our preliminary observations can deepen the understanding of larger-context training and enlighten more follow-up works on the use of contextual information.

pdf bib
Neural Sequence Segmentation as Determining the Leftmost Segments
Yangming Li | Lemao Liu | Kaisheng Yao

Prior methods for text segmentation mostly operate at the token level. Despite their adequacy, this limits their potential to capture the long-term dependencies among segments. In this work, we propose a novel framework that incrementally segments natural language sentences at the segment level. At every step of segmentation, it recognizes the leftmost segment of the remaining sequence. Implementations involve the LSTM-minus technique to construct the phrase representations and recurrent neural networks (RNN) to model the iterations of determining the leftmost segments. We have conducted extensive experiments on syntactic chunking and Chinese part-of-speech (POS) tagging across 3 datasets, demonstrating that our methods significantly outperform all previous baselines and achieve new state-of-the-art results. Moreover, qualitative analysis and the study on segmenting long sentences verify its effectiveness in modeling long-term dependencies.

pdf bib
Put Chatbot into Its Interlocutor’s Shoes: New Framework to Learn Chatbot Responding with Intention
Hsuan Su | Jiun-Hao Jhan | Fan-yun Sun | Saurav Sahay | Hung-yi Lee

Most chatbot literature that focuses on improving the fluency and coherence of a chatbot is dedicated to making chatbots more human-like. However, very little work delves into what really separates humans from chatbots: humans intrinsically understand the effect their responses have on the interlocutor and often respond with an intention such as proposing an optimistic view to make the interlocutor feel better. This paper proposes an innovative framework to train chatbots to possess human-like intentions. Our framework includes a guiding chatbot and an interlocutor model that plays the role of humans. The guiding chatbot is assigned an intention and learns to induce the interlocutor to reply with responses matching the intention, for example, long responses, joyful responses, responses with specific words, etc. We examined our framework using three experimental setups and evaluated the guiding chatbot with four different metrics to demonstrate flexibility and performance advantages. Additionally, we performed trials with human interlocutors to substantiate the guiding chatbot’s effectiveness in influencing the responses of humans to a certain extent. Code will be made available to the public.

pdf bib
Adding Chit-Chat to Enhance Task-Oriented Dialogues
Kai Sun | Seungwhan Moon | Paul Crook | Stephen Roller | Becka Silvert | Bing Liu | Zhiguang Wang | Honglei Liu | Eunjoon Cho | Claire Cardie

Existing dialogue corpora and models are typically designed under two disjoint motives: while task-oriented systems focus on achieving functional goals (e.g., booking hotels), open-domain chatbots aim at making socially engaging conversations. In this work, we propose to integrate both types of systems by Adding Chit-Chat to ENhance Task-ORiented dialogues (ACCENTOR), with the goal of making virtual assistant conversations more engaging and interactive. Specifically, we propose a Human-AI collaborative data collection approach for generating diverse chit-chat responses to augment task-oriented dialogues with minimal annotation effort. We then present our new chit-chat-based annotations to 23.8K dialogues from two popular task-oriented datasets (Schema-Guided Dialogue and MultiWOZ 2.1) and demonstrate their advantage over the originals via human evaluation. Lastly, we propose three new models for adding chit-chat to task-oriented dialogues, explicitly trained to predict user goals and to generate contextually relevant chit-chat responses. Automatic and human evaluations show that, compared with the state-of-the-art task-oriented baseline, our models can code-switch between task and chit-chat to be more engaging, interesting, knowledgeable, and humanlike, while maintaining competitive task performance.

pdf bib
Did they answer? Subjective acts and intents in conversational discourse
Elisa Ferracane | Greg Durrett | Junyi Jessy Li | Katrin Erk

Discourse signals are often implicit, leaving it up to the interpreter to draw the required inferences. At the same time, discourse is embedded in a social context, meaning that interpreters apply their own assumptions and beliefs when resolving these inferences, leading to multiple, valid interpretations. However, current discourse data and frameworks ignore the social aspect, expecting only a single ground truth. We present the first discourse dataset with multiple and subjective interpretations of English conversation in the form of perceived conversation acts and intents. We carefully analyze our dataset and create computational models to (1) confirm our hypothesis that taking into account the bias of the interpreters leads to better predictions of the interpretations, and (2) show that disagreements are nuanced and require a deeper understanding of the different contextual factors. We share our dataset and code at http://github.com/elisaF/subjective_discourse.

pdf bib
Evaluating the Impact of a Hierarchical Discourse Representation on Entity Coreference Resolution Performance
Sopan Khosla | James Fiacco | Carolyn Rosé

Recent work on entity coreference resolution (CR) follows current trends in Deep Learning applied to embeddings and relatively simple task-related features. SOTA models do not make use of hierarchical representations of discourse structure. In this work, we leverage automatically constructed discourse parse trees within a neural approach and demonstrate a significant improvement on two benchmark entity coreference-resolution datasets. We explore how the impact varies depending upon the type of mention.

pdf bib
Bridging Resolution: Making Sense of the State of the Art
Hideo Kobayashi | Vincent Ng

While Yu and Poesio (2020) have recently demonstrated the superiority of their neural multi-task learning (MTL) model to rule-based approaches for bridging anaphora resolution, there is little understanding of (1) how it is better than the rule-based approaches (e.g., are the two approaches making similar or complementary mistakes?) and (2) what should be improved. To shed light on these issues, we (1) propose a hybrid rule-based and MTL approach that would enable a better understanding of their comparative strengths and weaknesses; and (2) perform a manual analysis of the errors made by the MTL model.

pdf bib
Mask Attention Networks: Rethinking and Strengthen Transformer
Zhihao Fan | Yeyun Gong | Dayiheng Liu | Zhongyu Wei | Siyuan Wang | Jian Jiao | Nan Duan | Ruofei Zhang | Xuanjing Huang

Transformer is an attention-based neural network, which consists of two sublayers, namely, Self-Attention Network (SAN) and Feed-Forward Network (FFN). Existing research explores enhancing the two sublayers separately to improve the capability of Transformer for text representation. In this paper, we present a novel understanding of SAN and FFN as Mask Attention Networks (MANs) and show that they are two special cases of MANs with static mask matrices. However, their static mask matrices limit the capability for localness modeling in text representation learning. We therefore introduce a new layer named dynamic mask attention network (DMAN) with a learnable mask matrix which is able to model localness adaptively. To incorporate advantages of DMAN, SAN, and FFN, we propose a sequential layered structure to combine the three types of layers. Extensive experiments on various tasks, including neural machine translation and text summarization, demonstrate that our model outperforms the original Transformer.

pdf bib
ERNIE-Gram: Pre-Training with Explicitly N-Gram Masked Language Modeling for Natural Language Understanding
Dongling Xiao | Yu-Kun Li | Han Zhang | Yu Sun | Hao Tian | Hua Wu | Haifeng Wang

Coarse-grained linguistic information, such as named entities or phrases, facilitates adequate representation learning in pre-training. Previous works mainly focus on extending the objective of BERT’s Masked Language Modeling (MLM) from masking individual tokens to contiguous sequences of n tokens. We argue that such a contiguous masking method neglects to model the intra-dependencies and inter-relation of coarse-grained linguistic information. As an alternative, we propose ERNIE-Gram, an explicitly n-gram masking method to enhance the integration of coarse-grained information into pre-training. In ERNIE-Gram, n-grams are masked and predicted directly using explicit n-gram identities rather than contiguous sequences of n tokens. Furthermore, ERNIE-Gram employs a generator model to sample plausible n-gram identities as optional n-gram masks and predict them in both coarse-grained and fine-grained manners to enable comprehensive n-gram prediction and relation modeling. We pre-train ERNIE-Gram on English and Chinese text corpora and fine-tune on 19 downstream tasks. Experimental results show that ERNIE-Gram outperforms previous pre-training models like XLNet and RoBERTa by a large margin, and achieves comparable results with state-of-the-art methods. The source codes and pre-trained models have been released at https://github.com/PaddlePaddle/ERNIE.

pdf bib
Lattice-BERT: Leveraging Multi-Granularity Representations in Chinese Pre-trained Language Models
Yuxuan Lai | Yijia Liu | Yansong Feng | Songfang Huang | Dongyan Zhao

Chinese pre-trained language models usually process text as a sequence of characters, while ignoring coarser granularities, e.g., words. In this work, we propose a novel pre-training paradigm for Chinese, Lattice-BERT, which explicitly incorporates word representations along with characters and thus can model a sentence in a multi-granularity manner. Specifically, we construct a lattice graph from the characters and words in a sentence and feed all these text units into transformers. We design a lattice position attention mechanism to exploit the lattice structures in self-attention layers. We further propose a masked segment prediction task to push the model to learn from rich but redundant information inherent in lattices, while avoiding learning unexpected tricks. Experiments on 11 Chinese natural language understanding tasks show that our model can bring an average increase of 1.5% under the 12-layer setting, which achieves new state-of-the-art among base-size models on the CLUE benchmarks. Further analysis shows that Lattice-BERT can harness the lattice structures, and the improvement comes from the exploration of redundant information and multi-granularity representations. Our code will be available at https://github.com/alibaba/pretrained-language-models/LatticeBERT.

pdf bib
UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus
George Michalopoulos | Yuanxin Wang | Hussam Kaka | Helen Chen | Alexander Wong

Contextual word embedding models, such as BioBERT and Bio_ClinicalBERT, have achieved state-of-the-art results in biomedical natural language processing tasks by focusing their pre-training process on domain-specific corpora. However, such models do not take into consideration structured expert domain knowledge from a knowledge base. We introduce UmlsBERT, a contextual embedding model that integrates domain knowledge during the pre-training process via a novel knowledge augmentation strategy. More specifically, the augmentation on UmlsBERT with the Unified Medical Language System (UMLS) Metathesaurus is performed in two ways: i) connecting words that have the same underlying ‘concept’ in UMLS and ii) leveraging semantic type knowledge in UMLS to create clinically meaningful input embeddings. By applying these two strategies, UmlsBERT can encode clinical domain knowledge into word embeddings and outperform existing domain-specific models on common named-entity recognition (NER) and clinical natural language inference tasks.

pdf bib
Why Do Document-Level Polarity Classifiers Fail?
Karen Martins | Pedro O.S Vaz-de-Melo | Rodrygo Santos

Machine learning solutions are often criticized for the lack of explanation of their successes and failures. Understanding which instances are misclassified and why is essential to improve the learning process. This work helps to fill this gap by proposing a methodology to characterize, quantify and measure the impact of hard instances in the task of polarity classification of movie reviews. We characterize such instances into two categories: neutrality, where the text does not convey a clear polarity, and discrepancy, where the polarity of the text is the opposite of its true rating. We quantify the number of hard instances in polarity classification of movie reviews and provide empirical evidence about the need to pay attention to such problematic instances, as they are much harder to classify, for both machine and human classifiers. To the best of our knowledge, this is the first systematic analysis of the impact of hard instances in polarity detection from well-formed textual reviews.

pdf bib
Domain Divergences: A Survey and Empirical Analysis
Abhinav Ramesh Kashyap | Devamanyu Hazarika | Min-Yen Kan | Roger Zimmermann

Domain divergence plays a significant role in estimating the performance of a model in new domains. While there is a significant literature on divergence measures, researchers find it hard to choose an appropriate divergence for a given NLP application. We address this shortcoming by both surveying the literature and through an empirical study. We develop a taxonomy of divergence measures consisting of three classes (Information-theoretic, Geometric, and Higher-order measures) and identify the relationships between them. Further, to understand the common use-cases of these measures, we recognise three novel applications, 1) Data Selection, 2) Learning Representation, and 3) Decisions in the Wild, and use them to organise our literature. From this, we identify that Information-theoretic measures are prevalent for 1) and 3), and Higher-order measures are more common for 2). To further help researchers choose appropriate measures to predict drop in performance, an important aspect of Decisions in the Wild, we perform correlation analysis spanning 130 domain adaptation scenarios, 3 varied NLP tasks and 12 divergence measures identified from our survey. To calculate these divergences, we consider the current contextual word representations (CWR) and contrast with the older distributed representations. We find that traditional measures over word distributions still serve as strong baselines, while higher-order measures with CWR are effective.

pdf bib
Target-Aware Data Augmentation for Stance Detection
Yingjie Li | Cornelia Caragea

The goal of stance detection is to identify whether the author of a text is in favor of, neutral toward, or against a specific target. Despite substantial progress on this task, one of the remaining challenges is the scarcity of annotations. Data augmentation is commonly used to address annotation scarcity by generating more training samples. However, the augmented sentences that are generated by existing methods are either less diversified or inconsistent with the given target and stance label. In this paper, we formulate the data augmentation of stance detection as a conditional masked language modeling task and augment the dataset by predicting the masked word conditioned on both its context and the auxiliary sentence that contains target and label information. Moreover, we propose another simple yet effective method that generates target-aware sentences by replacing a target mention with another. Experimental results show that our proposed methods significantly outperform previous augmentation methods on 11 targets.

pdf bib
Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks
Siddharth Dalmia | Brian Yan | Vikas Raunak | Florian Metze | Shinji Watanabe

End-to-end approaches for sequence tasks are becoming increasingly popular. Yet for complex sequence tasks, like speech translation, systems that cascade several models trained on sub-tasks have been shown to be superior, suggesting that the compositionality of cascaded systems simplifies learning and enables sophisticated search capabilities. In this work, we present an end-to-end framework that exploits compositionality to learn searchable hidden representations at intermediate stages of a sequence model using decomposed sub-tasks. These hidden intermediates can be improved using beam search to enhance the overall performance and can also incorporate external models at intermediate stages of the network to re-score or adapt towards out-of-domain data. One instance of the proposed framework is a Multi-Decoder model for speech translation that extracts the searchable hidden intermediates from a speech recognition sub-task. The model demonstrates the aforementioned benefits and outperforms the previous state-of-the-art by around +6 and +3 BLEU on the two test sets of Fisher-CallHome and by around +3 and +4 BLEU on the English-German and English-French test sets of MuST-C.

pdf bib
Worldly Wise (WoW) - Cross-Lingual Knowledge Fusion for Fact-based Visual Spoken-Question Answering
Kiran Ramnath | Leda Sari | Mark Hasegawa-Johnson | Chang Yoo

Although Question-Answering has long been of research interest, its accessibility to users through a speech interface and its support for multiple languages have not been addressed in prior studies. Towards these ends, we present a new task and a synthetically-generated dataset to do Fact-based Visual Spoken-Question Answering (FVSQA). FVSQA is based on the FVQA dataset, which requires a system to retrieve an entity from Knowledge Graphs (KGs) to answer a question about an image. In FVSQA, the question is spoken rather than typed. Three sub-tasks are proposed: (1) speech-to-text based, (2) end-to-end, without speech-to-text as an intermediate component, and (3) cross-lingual, in which the question is spoken in a language different from that in which the KG is recorded. The end-to-end and cross-lingual tasks are the first to require world knowledge from a multi-relational KG as a differentiable layer in an end-to-end spoken language understanding task, hence the proposed reference implementation is called Worldly-Wise (WoW). WoW is shown to perform end-to-end cross-lingual FVSQA at the same levels of accuracy across 3 languages: English, Hindi, and Turkish.

pdf bib
Align-Refine: Non-Autoregressive Speech Recognition via Iterative Realignment
Ethan A. Chi | Julian Salazar | Katrin Kirchhoff

Non-autoregressive encoder-decoder models greatly improve decoding speed over autoregressive models, at the expense of generation quality. To mitigate this, iterative decoding models repeatedly infill or refine the proposal of a non-autoregressive model. However, editing at the level of output sequences limits model flexibility. We instead propose iterative realignment, which by refining latent alignments allows more flexible edits in fewer steps. Our model, Align-Refine, is an end-to-end Transformer which iteratively realigns connectionist temporal classification (CTC) alignments. On the WSJ dataset, Align-Refine matches an autoregressive baseline with a 14x decoding speedup; on LibriSpeech, we reach an LM-free test-other WER of 9.0% (19% relative improvement on comparable work) in three iterations. We release our code at https://github.com/amazon-research/align-refine.
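
As background for the distinction between editing output sequences and editing alignments, the sketch below shows only the standard CTC collapsing rule that maps a frame-level alignment to its output sequence (merge repeated labels, then drop blanks). It is not the Align-Refine model; the `_` blank symbol and the toy alignment are assumptions made for illustration.

```python
from itertools import groupby

BLANK = "_"  # hypothetical blank symbol; real systems reserve a token id


def ctc_collapse(alignment):
    """Map a frame-level CTC alignment to its output sequence."""
    # Step 1: merge runs of identical consecutive labels.
    merged = [label for label, _ in groupby(alignment)]
    # Step 2: drop the blank symbol.
    return [label for label in merged if label != BLANK]


# Toy alignment: the blank between the two "l" runs keeps the double letter.
alignment = list("__hh_e_ll_lloo_")
print("".join(ctc_collapse(alignment)))
```

Because many alignments collapse to the same output, a change to a single frame label can insert, delete, or substitute an output token.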

pdf bib
Everything Has a Cause: Leveraging Causal Inference in Legal Text Analysis
Xiao Liu | Da Yin | Yansong Feng | Yuting Wu | Dongyan Zhao

Causal inference is the process of capturing cause-effect relationships among variables. Most existing works focus on dealing with structured data, while mining causal relationships among factors from unstructured data, like text, has been less examined, but is of great importance, especially in the legal domain. In this paper, we propose a novel Graph-based Causal Inference (GCI) framework, which builds causal graphs from fact descriptions without much human involvement and enables causal inference to help legal practitioners make proper decisions. We evaluate the framework on a challenging similar charge disambiguation task. Experimental results show that GCI can capture the nuance from fact descriptions among multiple confusing charges and provide explainable discrimination, especially in few-shot settings. We also observe that the causal knowledge contained in GCI can be effectively injected into powerful neural networks for better performance and interpretability.

pdf bib
Counterfactual Supporting Facts Extraction for Explainable Medical Record Based Diagnosis with Graph Network
Haoran Wu | Wei Chen | Shuang Xu | Bo Xu

Providing a reliable explanation for clinical diagnosis based on the Electronic Medical Record (EMR) is fundamental to the application of Artificial Intelligence in the medical field. Current methods mostly treat the EMR as a text sequence and provide explanations based on a precise medical knowledge base, which is disease-specific and difficult to obtain for experts in reality. Therefore, we propose a counterfactual multi-granularity graph supporting facts extraction (CMGE) method to extract supporting facts from irregular EMR itself without external knowledge bases in this paper. Specifically, we first structure the sequence of EMR into a hierarchical graph network and then obtain the causal relationship between multi-granularity features and diagnosis results through counterfactual intervention on the graph. Features having the strongest causal connection with the results provide interpretive support for the diagnosis. Experimental results on real Chinese EMR of the lymphedema demonstrate that our method can diagnose four types of EMR correctly, and can provide accurate supporting facts for the results. More importantly, the results on different diseases demonstrate the robustness of our approach, which represents the potential application in the medical field.

pdf bib
Personalized Response Generation via Generative Split Memory Network
Yuwei Wu | Xuezhe Ma | Diyi Yang

Despite the impressive successes of generation and dialogue systems, how to endow a text generation system with particular personality traits to deliver more personalized responses remains under-investigated. In this work, we look at how to generate personalized responses for questions on Reddit by utilizing personalized user profiles and posting histories. Specifically, we release an open-domain single-turn dialog dataset made up of 1.5M conversation pairs together with 300k profiles of users and related comments. We then propose a memory network to generate personalized responses in dialogue that utilizes a novel mechanism of splitting memories: one for user profile meta attributes and the other for user-generated information like comment histories. Experimental results show the quantitative and qualitative improvements of our simple split memory network model over the state-of-the-art response generation baselines.

pdf bib
Towards Few-shot Fact-Checking via Perplexity
Nayeon Lee | Yejin Bang | Andrea Madotto | Pascale Fung

Few-shot learning has drawn researchers’ attention to overcome the problem of data scarcity. Recently, large pre-trained language models have shown great performance in few-shot learning for various downstream tasks, such as question answering and machine translation. Nevertheless, little exploration has been made to achieve few-shot learning for the fact-checking task. However, fact-checking is an important problem, especially when the amount of information online is growing exponentially every day. In this paper, we propose a new way of utilizing the powerful transfer learning ability of a language model via a perplexity score. The most notable strength of our methodology lies in its capability in few-shot learning. With only two training samples, our methodology can already outperform the Major Class baseline by more than an absolute 10% on the F1-Macro metric across multiple datasets. Through experiments, we empirically verify the plausibility of the rather surprising usage of the perplexity score in the context of fact-checking and highlight the strength of our few-shot methodology by comparing it to strong fine-tuning-based baseline models. Moreover, we construct and publicly release two new fact-checking datasets related to COVID-19.
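
One way a perplexity score could serve as a few-shot classifier is sketched below. The abstract does not spell out the decision rule, so the threshold fitting, the "low perplexity means supported" direction, and the made-up token log-probabilities are all assumptions; in practice the log-probabilities would come from a pre-trained language model.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean token log-probability (natural log)."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def fit_threshold(examples):
    """Pick the perplexity cut-off that best separates a few labeled examples.

    `examples` is a list of (perplexity, is_supported) pairs. Candidate
    thresholds are the observed perplexities; ties keep the smallest one.
    """
    best_threshold, best_correct = None, -1
    for threshold in sorted(p for p, _ in examples):
        correct = sum((p <= threshold) == label for p, label in examples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def predict_supported(ppl, threshold):
    """Hypothetical decision rule: low perplexity -> claim treated as supported."""
    return ppl <= threshold


# Two training samples with made-up token log-probabilities.
supported_ppl = perplexity([-0.2, -0.1, -0.3])
unsupported_ppl = perplexity([-2.0, -1.5, -2.5])
threshold = fit_threshold([(supported_ppl, True), (unsupported_ppl, False)])
print(predict_supported(3.0, threshold))
```

With only a single scalar to fit, two labeled samples are already enough to fix a decision boundary in this sketch, which is the kind of setting the abstract describes.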

pdf bib
Model Extraction and Adversarial Transferability, Your BERT is Vulnerable!
Xuanli He | Lingjuan Lyu | Lichao Sun | Qiongkai Xu

Natural language processing (NLP) tasks, ranging from text classification to text generation, have been revolutionised by pretrained language models such as BERT. This allows corporations to easily build powerful APIs by encapsulating fine-tuned BERT models for downstream tasks. However, when a fine-tuned BERT model is deployed as a service, it may suffer from different attacks launched by malicious users. In this work, we first present how an adversary can steal a BERT-based API service (the victim/target model) on multiple benchmark datasets with limited prior knowledge and queries. We further show that the extracted model can lead to highly transferable adversarial attacks against the victim model. Our studies indicate that the potential vulnerabilities of BERT-based API services still hold, even when there is an architectural mismatch between the victim model and the attack model. Finally, we investigate two defence strategies to protect the victim model, and find that unless the performance of the victim model is sacrificed, both model extraction and adversarial transferability can effectively compromise the target models.

pdf bib
DA-Transformer: Distance-aware Transformer
Chuhan Wu | Fangzhao Wu | Yongfeng Huang

Transformer has achieved great success in the NLP field by composing various advanced models like BERT and GPT. However, Transformer and its existing variants may not be optimal in capturing token distances because the position or distance embeddings used by these methods usually can not keep the precise information of real distances, which may not be beneficial for modeling the orders and relations of contexts. In this paper, we propose DA-Transformer, which is a distance-aware Transformer that can exploit the real distance. We propose to incorporate the real distances between tokens to re-scale the raw self-attention weights, which are computed by the relevance between attention query and key. Concretely, in different self-attention heads the relative distance between each pair of tokens is weighted by different learnable parameters, which control the different preferences on long- or short-term information of these heads. Since the raw weighted real distances may not be optimal for adjusting self-attention weights, we propose a learnable sigmoid function to map them into re-scaled coefficients that have proper ranges. We first clip the raw self-attention weights via the ReLU function to keep non-negativity and introduce sparsity, and then multiply them with the re-scaled coefficients to encode real distance information into self-attention. Extensive experiments on five benchmark datasets show that DA-Transformer can effectively improve the performance of many tasks and outperform the vanilla Transformer and its several variants.
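The re-scaling described in the abstract can be sketched for a single attention head as follows. This is a minimal illustration, not the authors' implementation: the specific sigmoid form `(1 + e^v) / (1 + e^(v - x))`, the use of absolute token offsets as the "real distance", the `1/sqrt(d)` scaling, and the final softmax are assumptions; `w` and `v` stand in for the per-head learnable parameters and are fixed by hand here.

```python
import math

def learnable_sigmoid(x, v):
    """Assumed form: equals 1 at x = 0 and is bounded in (0, 1 + e^v)."""
    return (1.0 + math.exp(v)) / (1.0 + math.exp(v - x))

def distance_aware_attention(queries, keys, w, v):
    """Single-head attention weights re-scaled by weighted token distances.

    w: distance weight for this head (negative favours nearby tokens).
    v: shape parameter of the sigmoid (hypothetical name).
    """
    dim = len(queries[0])
    weights = []
    for i, query in enumerate(queries):
        row = []
        for j, key in enumerate(keys):
            raw = sum(q * k for q, k in zip(query, key))
            clipped = max(raw, 0.0)                         # ReLU keeps scores non-negative
            coeff = learnable_sigmoid(w * abs(i - j), v)    # re-scaled distance coefficient
            row.append(clipped * coeff / math.sqrt(dim))
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]            # numerically stable softmax
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Identical queries and keys: any difference in attention comes from distance alone.
tokens = [[1.0, 0.0]] * 3
local_head = distance_aware_attention(tokens, tokens, w=-1.0, v=0.0)
neutral_head = distance_aware_attention(tokens, tokens, w=0.0, v=0.0)
```

With `w = 0` every coefficient is 1 and the head ignores distance; a negative `w` shrinks the coefficients of distant tokens, so the head prefers short-range context.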

pdf bib
KPQA: A Metric for Generative Question Answering Using Keyphrase Weights
Hwanhee Lee | Seunghyun Yoon | Franck Dernoncourt | Doo Soon Kim | Trung Bui | Joongbo Shin | Kyomin Jung

In the automatic evaluation of generative question answering (GenQA) systems, it is difficult to assess the correctness of generated answers due to the free-form of the answer. Especially, widely used n-gram similarity metrics often fail to discriminate the incorrect answers since they equally consider all of the tokens. To alleviate this problem, we propose KPQA metric, a new metric for evaluating the correctness of GenQA. Specifically, our new metric assigns different weights to each token via keyphrase prediction, thereby judging whether a generated answer sentence captures the key meaning of the reference answer. To evaluate our metric, we create high-quality human judgments of correctness on two GenQA datasets. Using our human-evaluation datasets, we show that our proposed metric has a significantly higher correlation with human judgments than existing metrics in various datasets. Code for KPQA-metric will be available at https://github.com/hwanheelee1993/KPQA.

pdf bib
Modeling Framing in Immigration Discourse on Social Media
Julia Mendelsohn | Ceren Budak | David Jurgens

The framing of political issues can influence policy and public opinion. Even though the public plays a key role in creating and spreading frames, little is known about how ordinary people on social media frame political issues. By creating a new dataset of immigration-related tweets labeled for multiple framing typologies from political communication theory, we develop supervised models to detect frames. We demonstrate how users’ ideology and region impact framing choices, and how a message’s framing influences audience responses. We find that the more commonly-used issue-generic frames obscure important ideological and regional patterns that are only revealed by immigration-specific frames. Furthermore, frames oriented towards human interests, culture, and politics are associated with higher user engagement. This large-scale analysis of a complex social and linguistic phenomenon contributes to both NLP and social science research.

pdf bib
Learning to Recognize Dialect Features
Dorottya Demszky | Devyani Sharma | Jonathan Clark | Vinodkumar Prabhakaran | Jacob Eisenstein

Building NLP systems that serve everyone requires accounting for dialect differences. But dialects are not monolithic entities: rather, distinctions between and within dialects are captured by the presence, absence, and frequency of dozens of dialect features in speech and text, such as the deletion of the copula in He running. In this paper, we introduce the task of dialect feature detection, and present two multitask learning approaches, both based on pretrained transformers. For most dialects, large-scale annotated corpora for these features are unavailable, making it difficult to train recognizers. We train our models on a small number of minimal pairs, building on how linguists typically define dialect features. Evaluation on a test set of 22 dialect features of Indian English demonstrates that these models learn to recognize many features with high accuracy, and that a few minimal pairs can be as effective for training as thousands of labeled examples. We also demonstrate the downstream applicability of dialect feature detection both as a measure of dialect density and as a dialect classifier.

pdf bib
Highly Efficient Knowledge Graph Embedding Learning with Orthogonal Procrustes Analysis
Xutan Peng | Guanyi Chen | Chenghua Lin | Mark Stevenson

Knowledge Graph Embeddings (KGEs) have been intensively explored in recent years due to their promise for a wide range of applications. However, existing studies focus on improving the final model performance without acknowledging the computational cost of the proposed approaches, in terms of execution time and environmental impact. This paper proposes a simple yet effective KGE framework which can reduce the training time and carbon footprint by orders of magnitude compared with state-of-the-art approaches, while producing competitive performance. We highlight three technical innovations: full batch learning via relational matrices, closed-form Orthogonal Procrustes Analysis for KGEs, and non-negative-sampling training. In addition, as the first KGE method whose entity embeddings also store full relation information, our trained models encode rich semantics and are highly interpretable. Comprehensive experiments and ablation studies involving 13 strong baselines and two standard datasets verify the effectiveness and efficiency of our algorithm.
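For reference, the classical Orthogonal Procrustes problem has the closed-form solution below. The abstract does not say how the matrices are built, so reading X and Y as stacked head-entity and tail-entity embeddings for one relation is an assumption made here for illustration.

```latex
\Omega^{\star}
  = \operatorname*{arg\,min}_{\Omega \,:\, \Omega^{\top}\Omega = I}
    \lVert X\Omega - Y \rVert_{F},
\qquad
X^{\top} Y = U \Sigma V^{\top}
\;\Longrightarrow\;
\Omega^{\star} = U V^{\top}
```

Because the optimum follows from a single singular value decomposition, no iterative gradient updates are needed for the orthogonal map itself, which is consistent with the training-time savings the abstract reports.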

pdf bib
Rethinking Network Pruning under the Pre-train and Fine-tune Paradigm
Dongkuan Xu | Ian En-Hsu Yen | Jinxi Zhao | Zhibin Xiao

Transformer-based pre-trained language models have significantly improved the performance of various natural language processing (NLP) tasks in recent years. While effective and prevalent, these models are usually prohibitively large for resource-limited deployment scenarios. A thread of research has thus been working on applying network pruning techniques under the pretrain-then-finetune paradigm widely adopted in NLP. However, the existing pruning results on benchmark transformers, such as BERT, are not as remarkable as the pruning results in the literature of convolutional neural networks (CNNs). In particular, common wisdom in CNN pruning states that sparse pruning compresses a model more than reducing the number of channels and layers does, while existing works on sparse pruning of BERT yield inferior results compared with small-dense counterparts such as TinyBERT. In this work, we aim to fill this gap by studying how knowledge is transferred and lost during the pre-train, fine-tune, and pruning process, and proposing a knowledge-aware sparse pruning process that achieves significantly better results than the existing literature. We show for the first time that sparse pruning compresses a BERT model significantly more than reducing its number of channels and layers. Experiments on multiple datasets of the GLUE benchmark show that our method outperforms the leading competitors with a 20-times weight/FLOPs compression and negligible loss in prediction accuracy.

pdf bib
Detoxifying Language Models Risks Marginalizing Minority Voices
Albert Xu | Eshaan Pathak | Eric Wallace | Suchin Gururangan | Maarten Sap | Dan Klein

Language models (LMs) must be both safe and equitable to be responsibly deployed in practice. With safety in mind, numerous detoxification techniques (e.g., Dathathri et al. 2020; Krause et al. 2020) have been proposed to mitigate toxic LM generations. In this work, we show that these detoxification techniques hurt equity: they decrease the utility of LMs on language used by marginalized groups (e.g., African-American English and minority identity mentions). In particular, we perform automatic and human evaluations of text generation quality when LMs are conditioned on inputs with different dialects and group identifiers. We find that detoxification makes LMs more brittle to distribution shift, especially on language used by marginalized groups. We identify that these failures stem from detoxification methods exploiting spurious correlations in toxicity datasets. Overall, our results highlight the tension between the controllability and distributional robustness of LMs.

pdf bib
HONEST: Measuring Hurtful Sentence Completion in Language Models
Debora Nozza | Federico Bianchi | Dirk Hovy

Language models have revolutionized the field of NLP. However, language models capture and proliferate hurtful stereotypes, especially in text generation. Our results show that 4.3% of the time, language models complete a sentence with a hurtful word. These cases are not random, but follow language- and gender-specific patterns. We propose a score to measure hurtful sentence completions in language models (HONEST). It uses a systematic template- and lexicon-based bias evaluation methodology for six languages. Our findings suggest that these models replicate and amplify deep-seated societal stereotypes about gender roles. Sentence completions refer to sexual promiscuity 9% of the time when the target is female, and to homosexuality 4% of the time when the target is male. The results raise questions about the use of these models in production settings.
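A template- and lexicon-based score of this kind can be sketched as the share of model completions that appear in a lexicon of hurtful words. The function name, the data layout (one list of completions per filled template), and the placeholder lexicon entries below are assumptions for illustration; the abstract does not give the exact formula or the lexicon used.

```python
def hurtful_completion_rate(completions_per_template, hurtful_lexicon):
    """Fraction of completions found in the lexicon (case-insensitive match)."""
    lexicon = {word.lower() for word in hurtful_lexicon}
    total = 0
    hurtful = 0
    for completions in completions_per_template:
        for word in completions:
            total += 1
            if word.lower() in lexicon:
                hurtful += 1
    if total == 0:
        raise ValueError("no completions to score")
    return hurtful / total

# Placeholder tokens stand in for real lexicon entries and model outputs.
toy_lexicon = {"HURTFUL_A", "HURTFUL_B"}
toy_completions = [
    ["teacher", "hurtful_a"],   # completions for the first template
    ["HURTFUL_B", "doctor"],    # completions for the second template
]
rate = hurtful_completion_rate(toy_completions, toy_lexicon)
```

Grouping completions by template makes it straightforward to break the rate down further, for example by the gender of the template's target.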

pdf bib
DeCEMBERT: Learning from Noisy Instructional Videos via Dense Captions and Entropy Minimization
Zineng Tang | Jie Lei | Mohit Bansal

Leveraging large-scale unlabeled web videos such as instructional videos for pre-training followed by task-specific finetuning has become the de facto approach for many video-and-language tasks. However, these instructional videos are very noisy, the accompanying ASR narrations are often incomplete, and can be irrelevant to or temporally misaligned with the visual content, limiting the performance of the models trained on such data. To address these issues, we propose an improved video-and-language pre-training method that first adds automatically-extracted dense region captions from the video frames as auxiliary text input, to provide informative visual cues for learning better video and language associations. Second, to alleviate the temporal misalignment issue, our method incorporates an entropy minimization-based constrained attention loss, to encourage the model to automatically focus on the correct caption from a pool of candidate ASR captions. Our overall approach is named DeCEMBERT (Dense Captions and Entropy Minimization). Comprehensive experiments on three video-and-language tasks (text-to-video retrieval, video captioning, and video question answering) across five datasets demonstrate that our approach outperforms previous state-of-the-art methods. Ablation studies on pre-training and downstream tasks show that adding dense captions and constrained attention loss help improve the model performance. Lastly, we also provide attention visualization to show the effect of applying the proposed constrained attention loss.

pdf bib
Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models
Po-Yao Huang | Mandela Patrick | Junjie Hu | Graham Neubig | Florian Metze | Alexander Hauptmann

This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextual multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy, and collect a new multilingual instructional video dataset (Multi-HowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX, as well as in multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M are available at http://github.com/berniebear/Multi-HT100M.

pdf bib
Video Question Answering with Phrases via Semantic Roles
Arka Sadhu | Kan Chen | Ram Nevatia

Video Question Answering (VidQA) evaluation metrics have been limited to a single-word answer or selecting a phrase from a fixed set of phrases. These metrics limit the VidQA models’ application scenario. In this work, we leverage semantic roles derived from video descriptions to mask out certain phrases, to introduce VidQAP which poses VidQA as a fill-in-the-phrase task. To enable evaluation of answer phrases, we compute the relative improvement of the predicted answer compared to an empty string. To reduce the influence of language bias in VidQA datasets, we retrieve a video having a different answer for the same question. To facilitate research, we construct ActivityNet-SRL-QA and Charades-SRL-QA and benchmark them by extending three vision-language models. We perform extensive analysis and ablative studies to guide future work. Code and data are public.

pdf bib
From Masked Language Modeling to Translation: Non-English Auxiliary Tasks Improve Zero-shot Spoken Language Understanding
Rob van der Goot | Ibrahim Sharaf | Aizhan Imankulova | Ahmet Üstün | Marija Stepanović | Alan Ramponi | Siti Oryza Khairunnisa | Mamoru Komachi | Barbara Plank

The lack of publicly available evaluation data for low-resource languages limits progress in Spoken Language Understanding (SLU). As key tasks like intent classification and slot filling require abundant training data, it is desirable to reuse existing data in high-resource languages to develop models for low-resource scenarios. We introduce xSID, a new benchmark for cross-lingual (x) Slot and Intent Detection in 13 languages from 6 language families, including a very low-resource dialect. To tackle the challenge, we propose a joint learning approach, with English SLU training data and non-English auxiliary tasks from raw text, syntax and translation for transfer. We study two setups which differ by type and language coverage of the pre-trained embeddings. Our results show that jointly learning the main tasks with masked language modeling is effective for slots, while machine translation transfer works best for intent classification.

pdf bib
Challenging distributional models with a conceptual network of philosophical terms
Yvette Oortwijn | Jelke Bloem | Pia Sommerauer | Francois Meyer | Wei Zhou | Antske Fokkens

Computational linguistic research on language change through distributional semantic (DS) models has inspired researchers from fields such as philosophy and literary studies, who use these methods for the exploration and comparison of comparatively small datasets traditionally analyzed by close reading. Research on methods for small data is still in early stages and it is not clear which methods achieve the best results. We investigate the possibilities and limitations of using distributional semantic models for analyzing philosophical data by means of a realistic use-case. We provide a ground truth for evaluation created by philosophy experts and a blueprint for using DS models in a sound methodological setup. We compare three methods for creating specialized models from small datasets. Though the models do not perform well enough to directly support philosophers yet, we find that models designed for small data yield promising directions for future work.

pdf bib
KILT: a Benchmark for Knowledge Intensive Language Tasks
Fabio Petroni | Aleksandra Piktus | Angela Fan | Patrick Lewis | Majid Yazdani | Nicola De Cao | James Thorne | Yacine Jernite | Vladimir Karpukhin | Jean Maillard | Vassilis Plachouras | Tim Rocktäschel | Sebastian Riedel

Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure. To catalyze research on models that condition on specific information in large textual resources, we present a benchmark for knowledge-intensive language tasks (KILT). All tasks in KILT are grounded in the same snapshot of Wikipedia, reducing engineering turnaround through the re-use of components, as well as accelerating research into task-agnostic memory architectures. We test both task-specific and general baselines, evaluating downstream performance in addition to the ability of the models to provide provenance. We find that a shared dense vector index coupled with a seq2seq model is a strong baseline, outperforming more tailor-made approaches for fact checking, open-domain question answering and dialogue, and yielding competitive results on entity linking and slot filling, by generating disambiguated text. KILT data and code are available at https://github.com/facebookresearch/KILT.

pdf bib
UDALM: Unsupervised Domain Adaptation through Language Modeling
Constantinos Karouzos | Georgios Paraskevopoulos | Alexandros Potamianos

In this work we explore Unsupervised Domain Adaptation (UDA) of pretrained language models for downstream tasks. We introduce UDALM, a fine-tuning procedure, using a mixed classification and Masked Language Model loss, that can adapt to the target domain distribution in a robust and sample efficient manner. Our experiments show that performance of models trained with the mixed loss scales with the amount of available target data and the mixed loss can be effectively used as a stopping criterion during UDA training. Furthermore, we discuss the relationship between A-distance and the target error and explore some limitations of the Domain Adversarial Training approach. Our method is evaluated on twelve domain pairs of the Amazon Reviews Sentiment dataset, yielding 91.74% accuracy, which is a 1.11% absolute improvement over the state-of-the-art.

pdf bib
Beyond Black & White: Leveraging Annotator Disagreement via Soft-Label Multi-Task Learning
Tommaso Fornaciari | Alexandra Uma | Silviu Paun | Barbara Plank | Dirk Hovy | Massimo Poesio

Supervised learning assumes that a ground truth label exists. However, the reliability of this ground truth depends on human annotators, who often disagree. Prior work has shown that this disagreement can be helpful in training models. We propose a novel method to incorporate this disagreement as information: in addition to the standard error computation, we use soft-labels (i.e., probability distributions over the annotator labels) as an auxiliary task in a multi-task neural network. We measure the divergence between the predictions and the target soft-labels with several loss-functions and evaluate the models on various NLP tasks. We find that the soft-label prediction auxiliary task reduces the penalty for errors on ambiguous entities, and thereby mitigates overfitting. It significantly improves performance across tasks, beyond the standard approach and prior work.
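The soft-label idea can be sketched as follows: annotator votes are turned into a probability distribution, and a divergence between that distribution and the model's auxiliary prediction is added to the usual hard-label loss. The abstract mentions several loss functions without naming them, so the choice of KL divergence, the `aux_weight` parameter, and all function names are assumptions for this example.

```python
import math
from collections import Counter

def soft_label(annotations, classes):
    """Turn a list of annotator votes into a distribution over classes."""
    counts = Counter(annotations)
    return [counts[c] / len(annotations) for c in classes]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); terms with zero target mass contribute nothing."""
    return sum(t * math.log(t / max(p, eps))
               for t, p in zip(target, predicted) if t > 0)

def multitask_loss(gold_index, main_probs, target_soft, aux_probs, aux_weight=1.0):
    """Hard-label cross-entropy plus a weighted soft-label divergence (assumed form)."""
    cross_entropy = -math.log(max(main_probs[gold_index], 1e-12))
    return cross_entropy + aux_weight * kl_divergence(target_soft, aux_probs)

classes = ["pos", "neg"]
target = soft_label(["pos", "pos", "neg"], classes)   # two of three annotators say "pos"
loss_matching = multitask_loss(0, [0.7, 0.3], target, target)
loss_overconfident = multitask_loss(0, [0.7, 0.3], target, [0.99, 0.01])
```

An auxiliary head that is overconfident on an item the annotators disagreed about incurs a larger loss than one that reproduces the disagreement.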

pdf bib
Clustering-based Inference for Biomedical Entity Linking
Rico Angell | Nicholas Monath | Sunil Mohan | Nishant Yadav | Andrew McCallum

Due to the large number of entities in biomedical knowledge bases, only a small fraction of entities have corresponding labelled training data. This necessitates entity linking models which are able to link mentions of unseen entities using learned representations of entities. Previous approaches link each mention independently, ignoring the relationships within and across documents between the entity mentions. These relations can be very useful for linking mentions in biomedical text where linking decisions are often difficult due to mentions having a generic or a highly specialized form. In this paper, we introduce a model in which linking decisions can be made not merely by linking to a knowledge base entity but also by grouping multiple mentions together via clustering and jointly making linking predictions. In experiments on the largest publicly available biomedical dataset, we improve the best independent prediction for entity linking by 3.0 points of accuracy, and our clustering-based inference model further improves entity linking by 2.3 points.

pdf bib
Diversity-Aware Batch Active Learning for Dependency Parsing
Tianze Shi | Adrian Benton | Igor Malioutov | Ozan İrsoy

While the predictive performance of modern statistical dependency parsers relies heavily on the availability of expensive expert-annotated treebank data, not all annotations contribute equally to the training of the parsers. In this paper, we attempt to reduce the number of labeled examples needed to train a strong dependency parser using batch active learning (AL). In particular, we investigate whether enforcing diversity in the sampled batches, using determinantal point processes (DPPs), can improve over their diversity-agnostic counterparts. Simulation experiments on an English newswire corpus show that selecting diverse batches with DPPs is superior to strong selection strategies that do not enforce batch diversity, especially during the initial stages of the learning process. Additionally, our diversity-aware strategy is robust under a corpus duplication setting, where diversity-agnostic sampling strategies exhibit significant degradation.

pdf bib
Can Latent Alignments Improve Autoregressive Machine Translation?
Adi Haviv | Lior Vassertail | Omer Levy

Latent alignment objectives such as CTC and AXE significantly improve non-autoregressive machine translation models. Can they improve autoregressive models as well? We explore the possibility of training autoregressive machine translation models with latent alignment objectives, and observe that, in practice, this approach results in degenerate models. We provide a theoretical explanation for these empirical results, and prove that latent alignment objectives are incompatible with teacher forcing.

pdf bib
Smoothing and Shrinking the Sparse Seq2Seq Search Space
Ben Peters | André F. T. Martins

Current sequence-to-sequence models are trained to minimize cross-entropy and use softmax to compute the locally normalized probabilities over target sequences. While this setup has led to strong results in a variety of tasks, one unsatisfying aspect is its length bias: models give high scores to short, inadequate hypotheses and often make the empty string the argmax (the so-called cat got your tongue problem). Recently proposed entmax-based sparse sequence-to-sequence models present a possible solution, since they can shrink the search space by assigning zero probability to bad hypotheses, but their ability to handle word-level tasks with transformers has never been tested. In this work, we show that entmax-based models effectively solve the cat got your tongue problem, removing a major source of model error for neural machine translation. In addition, we generalize label smoothing, a critical regularization technique, to the broader family of Fenchel-Young losses, which includes both cross-entropy and the entmax losses. Our resulting label-smoothed entmax loss models set a new state of the art on multilingual grapheme-to-phoneme conversion and deliver improvements and better calibration properties on cross-lingual morphological inflection and machine translation for 7 language pairs.
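To illustrate how a sparse normalizer can assign exactly zero probability where softmax cannot, the sketch below implements sparsemax, which is the entmax transformation with α = 2. The abstract does not state which α the authors use, so this is only one member of the family, chosen here because it has a simple sort-based solution; the scores are toy values.

```python
import math

def softmax(scores):
    """Dense baseline: every entry receives strictly positive probability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sparsemax(scores):
    """Euclidean projection of the scores onto the probability simplex."""
    if not scores:
        raise ValueError("need at least one score")
    ordered = sorted(scores, reverse=True)
    cumulative = 0.0
    support_size = 0
    support_sum = 0.0
    for k, value in enumerate(ordered, start=1):
        cumulative += value
        if 1.0 + k * value > cumulative:   # entry k is still inside the support
            support_size = k
            support_sum = cumulative
    tau = (support_sum - 1.0) / support_size   # threshold subtracted from every score
    return [max(s - tau, 0.0) for s in scores]

toy_scores = [1.0, 0.8, -1.0]
dense = softmax(toy_scores)
sparse = sparsemax(toy_scores)
```

The lowest-scoring entry gets a small but non-zero probability under softmax and exactly zero under sparsemax, which is the mechanism that lets sparse models prune hypotheses from the search space.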

pdf bib
Cross-Lingual Word Embedding Refinement by ℓ1 Norm Optimisation
Xutan Peng | Chenghua Lin | Mark Stevenson

Cross-Lingual Word Embeddings (CLWEs) encode words from two or more languages in a shared high-dimensional space in which vectors representing words with similar meaning (regardless of language) are closely located. Existing methods for building high-quality CLWEs learn mappings that minimise the ℓ2 norm loss function. However, this optimisation objective has been demonstrated to be sensitive to outliers. Based on the more robust Manhattan norm (a.k.a. ℓ1 norm) goodness-of-fit criterion, this paper proposes a simple post-processing step to improve CLWEs. An advantage of this approach is that it is fully agnostic to the training process of the original CLWEs and can therefore be applied widely. Extensive experiments are performed involving ten diverse languages and embeddings trained on different corpora. Evaluation results based on bilingual lexicon induction and cross-lingual transfer for natural language inference tasks show that the ℓ1 refinement substantially outperforms four state-of-the-art baselines in both supervised and unsupervised settings. It is therefore recommended that this strategy be adopted as a standard for CLWE methods.

pdf bib
Learning to Synthesize Data for Semantic Parsing
Bailin Wang | Wenpeng Yin | Xi Victoria Lin | Caiming Xiong

Synthesizing data for semantic parsing has gained increasing attention recently. However, most methods require handcrafted (high-precision) rules in their generative process, hindering the exploration of diverse unseen data. In this work, we propose a generative model which features a (non-neural) PCFG that models the composition of programs (e.g., SQL), and a BART-based translation model that maps a program to an utterance. Due to the simplicity of PCFG and pre-trained BART, our generative model can be efficiently learned from existing data at hand. Moreover, explicitly modeling compositions using PCFG leads to better exploration of unseen programs, and thus generates more diverse data. We evaluate our method in both in-domain and out-of-domain settings of text-to-SQL parsing on the standard benchmarks of GeoQuery and Spider, respectively. Our empirical results show that the synthesized data generated from our model can substantially help a semantic parser achieve better compositional and domain generalization.

pdf bib
Edge: Enriching Knowledge Graph Embeddings with External Text
Saed Rezayi | Handong Zhao | Sungchul Kim | Ryan Rossi | Nedim Lipka | Sheng Li

Knowledge graphs suffer from sparsity which degrades the quality of representations generated by various methods. While there is an abundance of textual information throughout the web and many existing knowledge bases, aligning information across these diverse data sources remains a challenge in the literature. Previous work has partially addressed this issue by enriching knowledge graph entities based on hard co-occurrence of words present in the entities of the knowledge graphs and external text, while we achieve soft augmentation by proposing a knowledge graph enrichment and embedding framework named Edge. Given an original knowledge graph, we first generate a rich but noisy augmented graph using external texts in semantic and structural level. To distill the relevant knowledge and suppress the introduced noise, we design a graph alignment term in a shared embedding space between the original graph and augmented graph. To enhance the embedding learning on the augmented graph, we further regularize the locality relationship of target entity based on negative sampling. Experimental results on four benchmark datasets demonstrate the robustness and effectiveness of Edge in link prediction and node classification.

pdf bib
Compositional Generalization for Neural Semantic Parsing via Span-level Supervised Attention
Pengcheng Yin | Hao Fang | Graham Neubig | Adam Pauls | Emmanouil Antonios Platanios | Yu Su | Sam Thomson | Jacob Andreas

We describe a span-level supervised attention loss that improves compositional generalization in semantic parsers. Our approach builds on existing losses that encourage attention maps in neural sequence-to-sequence models to imitate the output of classical word alignment algorithms. Where past work has used word-level alignments, we focus on spans; borrowing ideas from phrase-based machine translation, we align subtrees in semantic parses to spans of input sentences, and encourage neural attention mechanisms to mimic these alignments. This method improves the performance of transformers, RNNs, and structured decoders on three benchmarks of compositional generalization.

pdf bib
Graph Ensemble Learning over Multiple Dependency Trees for Aspect-level Sentiment Classification
Xiaochen Hou | Peng Qi | Guangtao Wang | Rex Ying | Jing Huang | Xiaodong He | Bowen Zhou

Recent work on aspect-level sentiment classification has demonstrated the efficacy of incorporating syntactic structures such as dependency trees with graph neural networks (GNN), but these approaches are usually vulnerable to parsing errors. To better leverage syntactic information in the face of unavoidable errors, we propose a simple yet effective graph ensemble technique, GraphMerge, to make use of the predictions from different parsers. Instead of assigning one set of model parameters to each dependency tree, we first combine the dependency relations from different parses before applying GNNs over the resulting graph. This allows GNN models to be robust to parse errors at no additional computational cost, and helps avoid overparameterization and overfitting from GNN layer stacking by introducing more connectivity into the ensemble graph. Our experiments on the SemEval 2014 Task 4 and ACL 14 Twitter datasets show that our GraphMerge model not only outperforms models with single dependency tree, but also beats other ensemble models without adding model parameters.

pdf bib
Emotion-Infused Models for Explainable Psychological Stress Detection
Elsbeth Turcan | Smaranda Muresan | Kathleen McKeown

The problem of detecting psychological stress in online posts, and more broadly, of detecting people in distress or in need of help, is a sensitive application for which the ability to interpret models is vital. Here, we present work exploring the use of a semantically related task, emotion detection, for equally competent but more explainable and human-like psychological stress detection as compared to a black-box model. In particular, we explore the use of multi-task learning as well as emotion-based language model fine-tuning. With our emotion-infused models, we see comparable results to state-of-the-art BERT. Our analysis of the words used for prediction shows that our emotion-infused models mirror psychological components of stress.

pdf bib
Aspect-based Sentiment Analysis with Type-aware Graph Convolutional Networks and Layer Ensemble
Yuanhe Tian | Guimin Chen | Yan Song

It is popular that neural graph-based models are applied in existing aspect-based sentiment analysis (ABSA) studies for utilizing word relations through dependency parses to facilitate the task with better semantic guidance for analyzing context and aspect words. However, most of these studies only leverage dependency relations without considering their dependency types, and are limited in lacking efficient mechanisms to distinguish the important relations as well as learn from different layers of graph based models. To address such limitations, in this paper, we propose an approach to explicitly utilize dependency types for ABSA with type-aware graph convolutional networks (T-GCN), where attention is used in T-GCN to distinguish different edges (relations) in the graph and attentive layer ensemble is proposed to comprehensively learn from different layers of T-GCN. The validity and effectiveness of our approach are demonstrated in the experimental results, where state-of-the-art performance is achieved on six English benchmark datasets. Further experiments are conducted to analyze the contributions of each component in our approach and illustrate how different layers in T-GCN help ABSA with quantitative and qualitative analysis.

pdf bib
Bot-Adversarial Dialogue for Safe Conversational Agents
Jing Xu | Da Ju | Margaret Li | Y-Lan Boureau | Jason Weston | Emily Dinan

Conversational agents trained on large unlabeled corpora of human interactions will learn patterns and mimic behaviors therein, which include offensive or otherwise toxic behavior. We introduce a new human-and-model-in-the-loop framework for evaluating the toxicity of such models, and compare a variety of existing methods in both the cases of non-adversarial and adversarial users that expose their weaknesses. We then go on to propose two novel methods for safe conversational agents, by either training on data from our new human-and-model-in-the-loop framework in a two-stage system, or baking-in safety to the generative model itself. We find our new techniques are (i) safer than existing models, while (ii) maintaining usability metrics such as engagingness relative to state-of-the-art chatbots. In contrast, we expose serious safety issues in existing standard systems like GPT2, DialoGPT, and BlenderBot.

pdf bib
Action-Based Conversations Dataset: A Corpus for Building More In-Depth Task-Oriented Dialogue Systems
Derek Chen | Howard Chen | Yi Yang | Alexander Lin | Zhou Yu

Existing goal-oriented dialogue datasets focus mainly on identifying slots and values. However, customer support interactions in reality often involve agents following multi-step procedures derived from explicitly-defined company policies as well. To study customer service dialogue systems in more realistic settings, we introduce the Action-Based Conversations Dataset (ABCD), a fully-labeled dataset with over 10K human-to-human dialogues containing 55 distinct user intents requiring unique sequences of actions constrained by policies to achieve task success. We propose two additional dialog tasks, Action State Tracking and Cascading Dialogue Success, and establish a series of baselines involving large-scale, pre-trained language models on this dataset. Empirical results demonstrate that while more sophisticated networks outperform simpler models, a considerable gap (50.8% absolute accuracy) still exists to reach human-level performance on ABCD.

pdf bib
COIL: Revisit Exact Lexical Match in Information Retrieval with Contextualized Inverted List
Luyu Gao | Zhuyun Dai | Jamie Callan

Classical information retrieval systems such as BM25 rely on exact lexical match and can carry out search efficiently with an inverted list index. Recent neural IR models shift towards soft matching of all query and document terms, but they lose the computational efficiency of exact match systems. This paper presents COIL, a contextualized exact match retrieval architecture, where scoring is based on overlapping query document tokens’ contextualized representations. The new architecture stores contextualized token representations in inverted lists, bringing together the efficiency of exact match and the representation power of deep language models. Our experimental results show COIL outperforms classical lexical retrievers and state-of-the-art deep LM retrievers with similar or smaller latency.
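The idea of scoring only on overlapping tokens while comparing contextualized vectors can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function names `build_index` and `score` are hypothetical, vectors are plain Python lists, and the aggregation (maximum dot product per query token, summed over query tokens) is one plausible reading of the abstract.

```python
from collections import defaultdict


def build_index(docs):
    """Build an inverted list: token -> [(doc_id, contextualized vector), ...].

    docs maps a doc_id to a list of (token, vector) pairs, one per token occurrence.
    """
    index = defaultdict(list)
    for doc_id, tokens in docs.items():
        for token, vec in tokens:
            index[token].append((doc_id, vec))
    return index


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def score(query, index):
    """Score documents using only tokens that exactly match a query token.

    Assumed aggregation: for each query token, take the best-matching
    occurrence of that token in the document, then sum over query tokens.
    """
    totals = defaultdict(float)
    for q_token, q_vec in query:
        best = {}
        for doc_id, d_vec in index.get(q_token, []):
            s = dot(q_vec, d_vec)
            if doc_id not in best or s > best[doc_id]:
                best[doc_id] = s
        for doc_id, s in best.items():
            totals[doc_id] += s
    return dict(totals)


# Toy data: the vectors stand in for contextualized token representations.
docs = {
    "d1": [("bank", [1.0, 0.0]), ("river", [0.0, 1.0]), ("bank", [0.5, 0.5])],
    "d2": [("bank", [0.0, 1.0]), ("loan", [1.0, 0.0])],
}
query = [("bank", [0.0, 1.0]), ("loan", [1.0, 0.0])]
print(score(query, build_index(docs)))
```

Only posting lists for tokens that appear in the query are visited, which is what keeps the lookup pattern of a classical inverted index; the vectors let two occurrences of the same surface token score differently depending on context.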

pdf bib
Exploring the Relationship Between Algorithm Performance, Vocabulary, and Run-Time in Text Classification
Wilson Fearn | Orion Weller | Kevin Seppi

Text classification is a significant branch of natural language processing, and has many applications including document classification and sentiment analysis. Unsurprisingly, those who do text classification are concerned with the run-time of their algorithms, many of which depend on the size of the corpus’ vocabulary due to their bag-of-words representation. Although many studies have examined the effect of preprocessing techniques on vocabulary size and accuracy, none have examined how these methods affect a model’s run-time. To fill this gap, we provide a comprehensive study that examines how preprocessing techniques affect the vocabulary size, model performance, and model run-time, evaluating ten techniques over four models and two datasets. We show that some individual methods can reduce run-time with no loss of accuracy, while some combinations of methods can trade 2-5% of the accuracy for up to a 65% reduction of run-time. Furthermore, some combinations of preprocessing techniques can even provide a 15% reduction in run-time while simultaneously improving model accuracy.

pdf bib
You Sound Like Someone Who Watches Drama Movies: Towards Predicting Movie Preferences from Conversational Interactions
Sergey Volokhin | Joyce Ho | Oleg Rokhlenko | Eugene Agichtein

The increasing popularity of voice-based personal assistants provides new opportunities for conversational recommendation. One particularly interesting area is movie recommendation, which can benefit from an open-ended interaction with the user, through a natural conversation. We explore one promising direction for conversational recommendation: mapping a conversational user, for whom there is limited or no data available, to the most similar external reviewers, whose preferences are known, by representing the conversation as a user’s interest vector, and adapting collaborative filtering techniques to estimate the current user’s preferences for new movies. We call our proposed method ConvExtr (Conversational Collaborative Filtering using External Data), which 1) infers a user’s sentiment towards an entity from the conversation context, and 2) transforms the ratings of similar external reviewers to predict the current user’s preferences. We implement these steps by adapting contextual sentiment prediction techniques, and domain adaptation, respectively. To evaluate our method, we develop and make available a finely annotated dataset of movie recommendation conversations, which we call MovieSent. Our results demonstrate that ConvExtr can improve the accuracy of predicting users’ ratings for new movies by exploiting conversation content and external data.

pdf bib
Reading and Acting while Blindfolded: The Need for Semantics in Text Game Agents
Shunyu Yao | Karthik Narasimhan | Matthew Hausknecht

Text-based games simulate worlds and interact with players using natural language. Recent work has used them as a testbed for autonomous language-understanding agents, with the motivation being that understanding the meanings of words or semantics is a key component of how humans understand, reason, and act in these worlds. However, it remains unclear to what extent artificial agents utilize semantic understanding of the text. To this end, we perform experiments to systematically reduce the amount of semantic information available to a learning agent. Surprisingly, we find that an agent is capable of achieving high scores even in the complete absence of language semantics, indicating that the currently popular experimental setup and models may be poorly designed to understand and leverage game texts. To remedy this deficiency, we propose an inverse dynamics decoder to regularize the representation space and encourage exploration, which shows improved performance on several games including Zork I. We discuss the implications of our findings for designing future agents with stronger semantic understanding.

pdf bib
CaSiNo: A Corpus of Campsite Negotiation Dialogues for Automatic Negotiation Systems
Kushal Chawla | Jaysa Ramirez | Rene Clever | Gale Lucas | Jonathan May | Jonathan Gratch

Automated systems that negotiate with humans have broad applications in pedagogy and conversational AI. To advance the development of practical negotiation systems, we present CaSiNo: a novel corpus of over a thousand negotiation dialogues in English. Participants take the role of campsite neighbors and negotiate for food, water, and firewood packages for their upcoming trip. Our design results in diverse and linguistically rich negotiations while maintaining a tractable, closed-domain environment. Inspired by the literature in human-human negotiations, we annotate persuasion strategies and perform correlation analysis to understand how the dialogue behaviors are associated with the negotiation performance. We further propose and evaluate a multi-task framework to recognize these strategies in a given utterance. We find that multi-task learning substantially improves the performance for all strategy labels, especially for the ones that are the most skewed. We release the dataset, annotations, and the code to propel future work in human-machine negotiations: https://github.com/kushalchawla/CaSiNo

pdf bib
News Headline Grouping as a Challenging NLU Task
Philippe Laban | Lucas Bandarkar | Marti A. Hearst

Recent progress in Natural Language Understanding (NLU) has seen the latest models outperform human performance on many standard tasks. These impressive results have led the community to introspect on dataset limitations, and iterate on more nuanced challenges. In this paper, we introduce the task of HeadLine Grouping (HLG) and a corresponding dataset (HLGD) consisting of 20,056 pairs of news headlines, each labeled with a binary judgement as to whether the pair belongs within the same group. On HLGD, human annotators achieve high performance of around 0.9 F-1, while current state-of-the-art Transformer models only reach 0.75 F-1, opening the path for further improvements. We further propose a novel unsupervised Headline Generator Swap model for the task of HeadLine Grouping that achieves within 3 F-1 of the best supervised model. Finally, we analyze high-performing models with consistency tests, and find that models are not consistent in their predictions, revealing modeling limits of current architectures.

pdf bib
Ensemble of MRR and NDCG models for Visual Dialog
Idan Schwartz

Assessing an AI agent that can converse in human language and understand visual content is challenging. Generation metrics, such as BLEU scores, favor correct syntax over semantics. Hence a discriminative approach is often used, where an agent ranks a set of candidate options. The mean reciprocal rank (MRR) metric evaluates the model performance by taking into account the rank of a single human-derived answer. This approach, however, raises a new challenge: the ambiguity and synonymy of answers, for instance, semantic equivalence (e.g., ‘yeah’ and ‘yes’). To address this, the normalized discounted cumulative gain (NDCG) metric has been used to capture the relevance of all the correct answers via dense annotations. However, the NDCG metric favors the usually applicable uncertain answers such as ‘I don’t know.’ Crafting a model that excels on both MRR and NDCG metrics is challenging. Ideally, an AI agent should give a human-like reply and validate the correctness of any answer. To address this issue, we describe a two-step non-parametric ranking approach that can merge strong MRR and NDCG models. Using our approach, we manage to keep most MRR state-of-the-art performance (70.41% vs. 71.24%) and the NDCG state-of-the-art performance (72.16% vs. 75.35%). Moreover, our approach won the recent Visual Dialog 2020 challenge. Source code is available at https://github.com/idansc/mrr-ndcg.
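For readers unfamiliar with the two metrics contrasted above, the following minimal sketch computes them in their textbook form. It is an illustration under stated assumptions, not the challenge's official scorer: the function names are hypothetical, gains are linear in the relevance score, and the cutoff `k` is passed explicitly.

```python
import math


def mean_reciprocal_rank(ranks):
    """ranks: 1-based rank of the single ground-truth answer for each question."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def ndcg_at_k(relevances_in_ranked_order, k):
    """relevances_in_ranked_order: dense relevance scores of the candidates,
    listed in the order the model ranked them (best first)."""

    def dcg(rels):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances_in_ranked_order, reverse=True))
    return dcg(relevances_in_ranked_order) / ideal if ideal > 0 else 0.0


# MRR only looks at where the one human answer landed in each ranking.
print(mean_reciprocal_rank([1, 2, 4]))
# NDCG credits every relevant candidate, weighted by its position.
print(ndcg_at_k([1.0, 0.0, 0.5], k=3))
```

Because MRR rewards placing one specific answer first while NDCG rewards placing all acceptable answers high, a ranking can do well on one and poorly on the other, which is the tension the abstract describes.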

pdf bib
CREAD: Combined Resolution of Ellipses and Anaphora in Dialogues
Bo-Hsiang Tseng | Shruti Bhargava | Jiarui Lu | Joel Ruben Antony Moniz | Dhivya Piraviperumal | Lin Li | Hong Yu

Anaphora and ellipses are two common phenomena in dialogues. Without resolving referring expressions and information omission, dialogue systems may fail to generate consistent and coherent responses. Traditionally, anaphora is resolved by coreference resolution and ellipses by query rewrite. In this work, we propose a novel joint learning framework of modeling coreference resolution and query rewriting for complex, multi-turn dialogue understanding. Given an ongoing dialogue between a user and a dialogue assistant, for the user query, our joint learning model first predicts coreference links between the query and the dialogue context, and then generates a self-contained rewritten user query. To evaluate our model, we annotate a dialogue-based coreference resolution dataset, MuDoCo, with rewritten queries. Results show that the performance of query rewrite can be substantially boosted (+2.3% F1) with the aid of coreference modeling. Furthermore, our joint model outperforms the state-of-the-art coreference resolution model (+2% F1) on this dataset.

pdf bib
Knowledge-Driven Slot Constraints for Goal-Oriented Dialogue Systems
Piyawat Lertvittayakumjorn | Daniele Bonadiman | Saab Mansour

In goal-oriented dialogue systems, users provide information through slot values to achieve specific goals. Practically, some combinations of slot values can be invalid according to external knowledge. For example, a combination of cheese pizza (a menu item) and oreo cookies (a topping) from an input utterance Can I order a cheese pizza with oreo cookies on top? exemplifies such invalid combinations according to the menu of a restaurant business. Traditional dialogue systems allow execution of validation rules as a post-processing step after slots have been filled, which can lead to error accumulation. In this paper, we formalize knowledge-driven slot constraints and present a new task of constraint violation detection accompanied with benchmarking data. Then, we propose methods to integrate the external knowledge into the system and model constraint violation detection as an end-to-end classification task and compare it to the traditional rule-based pipeline approach. Experiments on two domains of the MultiDoGO dataset reveal challenges of constraint violation detection and set the stage for future work and improvements.

pdf bib
Clipping Loops for Sample-Efficient Dialogue Policy Optimisation
Yen-Chen Wu | Carl Edward Rasmussen

Training dialogue agents requires a large number of interactions with users: agents have no idea which responses are bad within a lengthy dialogue. In this paper, we propose loop-clipping policy optimisation (LCPO) to eliminate useless responses. LCPO consists of two stages: loop clipping and advantage clipping. In loop clipping, we clip off useless responses (called loops) from dialogue history (called trajectories). The clipped trajectories are more succinct than the original ones, and the estimation of state-value is more accurate. Second, in advantage clipping, we estimate and clip the advantages of useless responses and normal ones separately. The clipped advantages distinguish useless actions from others and reduce the probabilities of useless actions efficiently. In experiments on the Cambridge Restaurant Dialogue System, LCPO uses only 260 training dialogues to achieve an 80% success rate, while the PPO baseline requires 2160 dialogues. Besides, LCPO receives a score of 3.7/5 in human evaluation, where the agent interactively collects 100 real-user dialogues in the training phase.
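The loop-clipping stage can be pictured with a small sketch. The abstract does not define a loop precisely, so the code below assumes one: a stretch of turns that brings the dialogue back to a state it has already visited. The function name `clip_loops`, the state labels, and the action labels are all hypothetical, and advantage clipping is not covered.

```python
def clip_loops(trajectory):
    """Remove loops from a trajectory of (state, action) steps.

    Assumption: a loop is any segment that returns to a previously visited
    state; the steps taken since that earlier visit are dropped and only the
    latest step from that state is kept. States must be hashable.
    """
    clipped = []
    position = {}  # state -> index of that state in `clipped`
    for state, action in trajectory:
        if state in position:
            start = position[state]
            # Forget every state inside the loop, then cut the loop out.
            for old_state, _ in clipped[start:]:
                del position[old_state]
            del clipped[start:]
        position[state] = len(clipped)
        clipped.append((state, action))
    return clipped


# The agent repeats a question (s1 -> s1) before the dialogue moves on.
trajectory = [
    ("s0", "request_food"),
    ("s1", "request_area"),
    ("s1", "request_area"),
    ("s2", "inform"),
]
print(clip_loops(trajectory))
```

Under this assumption the clipped trajectory visits each state at most once, which matches the abstract's description of clipped trajectories as more succinct than the originals.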

pdf bib
TABBIE: Pretrained Representations of Tabular Data
Hiroshi Iida | Dung Thai | Varun Manjunatha | Mohit Iyyer

Existing work on tabular representation-learning jointly models tables and associated text using self-supervised objective functions derived from pretrained language models such as BERT. While this joint pretraining improves tasks involving paired tables and text (e.g., answering questions about tables), we show that it underperforms on tasks that operate over tables without any associated text (e.g., populating missing cells). We devise a simple pretraining objective (corrupt cell detection) that learns exclusively from tabular data and reaches the state-of-the-art on a suite of table-based prediction tasks. Unlike competing approaches, our model (TABBIE) provides embeddings of all table substructures (cells, rows, and columns), and it also requires far less compute to train. A qualitative analysis of our model’s learned cell, column, and row representations shows that it understands complex table semantics and numerical trends.

pdf bib
Multi-Style Transfer with Discriminative Feedback on Disjoint Corpus
Navita Goyal | Balaji Vasan Srinivasan | Anandhavelu N | Abhilasha Sancheti

Style transfer has been widely explored in natural language generation with non-parallel corpora by directly or indirectly extracting a notion of style from source and target domain corpora. A common shortcoming of existing approaches is the prerequisite of joint annotations across all the stylistic dimensions under consideration. The limited availability of such datasets across combinations of styles restricts the extension of these setups to multiple style dimensions. While cascading single-dimensional models across multiple styles is a possibility, it suffers from content loss, especially when the style dimensions are not completely independent of each other. In our work, we relax this requirement of jointly annotated data across multiple styles by using independently acquired data across different style dimensions without any additional annotations. We initialize an encoder-decoder setup with a transformer-based language model pre-trained on a generic corpus and enhance its re-writing capability to multiple target style dimensions by employing multiple style-aware language models as discriminators. Through quantitative and qualitative evaluation, we show the ability of our model to control styles across multiple style dimensions while preserving the content of the input text. We compare it against baselines involving cascaded state-of-the-art uni-dimensional style transfer models.

pdf bib
InfoXLM: An Information-Theoretic Framework for Cross-Lingual Language Model Pre-Training
Zewen Chi | Li Dong | Furu Wei | Nan Yang | Saksham Singhal | Wenhui Wang | Xia Song | Xian-Ling Mao | Heyan Huang | Ming Zhou

In this work, we present an information-theoretic framework that formulates cross-lingual language model pre-training as maximizing mutual information between multilingual-multi-granularity texts. The unified view helps us to better understand the existing methods for learning cross-lingual representations. More importantly, inspired by the framework, we propose a new pre-training task based on contrastive learning. Specifically, we regard a bilingual sentence pair as two views of the same meaning and encourage their encoded representations to be more similar than the negative examples. By leveraging both monolingual and parallel corpora, we jointly train the pretext tasks to improve the cross-lingual transferability of pre-trained models. Experimental results on several benchmarks show that our approach achieves considerably better performance. The code and pre-trained models are available at https://aka.ms/infoxlm.
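The contrastive objective described above, pulling a translation pair together while pushing negatives apart, is commonly written as an InfoNCE-style loss. The sketch below is a generic illustration and not the paper's training code: the function name `info_nce`, the dot-product similarity, and the `temperature` parameter are assumptions, and sentence encodings are stood in for by small lists of floats.

```python
import math


def info_nce(query, positive, negatives, temperature=1.0):
    """Contrastive loss for one example.

    query and positive are the encodings of a bilingual sentence pair;
    negatives are encodings of unrelated sentences. The loss is the negative
    log-probability of picking the positive among all candidates.
    """

    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# The loss is small when the translation is closer to the query than the negative.
aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
print(aligned, misaligned)
```

Minimizing a loss of this form maximizes a lower bound on the mutual information between the two views, which is the connection to the information-theoretic framing in the abstract.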

pdf bib
X-METRA-ADA: Cross-lingual Meta-Transfer learning Adaptation to Natural Language Understanding and Question Answering
Meryem M’hamdi | Doo Soon Kim | Franck Dernoncourt | Trung Bui | Xiang Ren | Jonathan May

Multilingual models, such as M-BERT and XLM-R, have gained increasing popularity, due to their zero-shot cross-lingual transfer learning capabilities. However, their generalization ability is still inconsistent for typologically diverse languages and across different benchmarks. Recently, meta-learning has garnered attention as a promising technique for enhancing transfer learning under low-resource scenarios, particularly for cross-lingual transfer in Natural Language Understanding (NLU). In this work, we propose X-METRA-ADA, a cross-lingual MEta-TRAnsfer learning ADAptation approach for NLU. Our approach adapts MAML, an optimization-based meta-learning approach, to learn to adapt to new languages. We extensively evaluate our framework on two challenging cross-lingual NLU tasks: multilingual task-oriented dialog and typologically diverse question answering. We show that our approach outperforms naive fine-tuning, reaching competitive performance on both tasks for most languages. Our analysis reveals that X-METRA-ADA can leverage limited data for faster adaptation.

pdf bib
Adaptable and Interpretable Neural Memory Over Symbolic Knowledge
Pat Verga | Haitian Sun | Livio Baldini Soares | William Cohen

Past research has demonstrated that large neural language models (LMs) encode surprising amounts of factual information; however, augmenting or modifying this information requires modifying a corpus and retraining, which is computationally expensive. To address this problem, we develop a neural LM that includes an interpretable neuro-symbolic KB in the form of a fact memory. Each element of the fact memory is formed from a triple of vectors, where each vector corresponds to a KB entity or relation. Our LM improves performance on knowledge-intensive question-answering tasks, sometimes dramatically, including a 27 point increase in one setting of WebQuestionsSP over a state-of-the-art open-book model, despite using 5% of the parameters. Most interestingly, we demonstrate that the model can be modified, without any re-training, by updating the fact memory.

pdf bib
Refining Targeted Syntactic Evaluation of Language Models
Benjamin Newman | Kai-Siang Ang | Julia Gong | John Hewitt

Targeted syntactic evaluation of subject-verb number agreement in English (TSE) evaluates language models’ syntactic knowledge using hand-crafted minimal pairs of sentences that differ only in the main verb’s conjugation. The method evaluates whether language models rate each grammatical sentence as more likely than its ungrammatical counterpart. We identify two distinct goals for TSE. First, evaluating the systematicity of a language model’s syntactic knowledge: given a sentence, can it conjugate arbitrary verbs correctly? Second, evaluating a model’s likely behavior: given a sentence, does the model concentrate its probability mass on correctly conjugated verbs, even if only on a subset of the possible verbs? We argue that current implementations of TSE do not directly capture either of these goals, and propose new metrics to capture each goal separately. Under our metrics, we find that TSE overestimates systematicity of language models, but that models score up to 40% better on verbs that they predict are likely in context.
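The baseline minimal-pair procedure that the abstract starts from can be stated in a few lines. The sketch below covers only that baseline, not the new metrics the paper proposes. It assumes each sentence has already been scored by some language model; the function name `tse_accuracy` and the log-probability values are made up for illustration.

```python
def tse_accuracy(pairs):
    """pairs: (log-probability of the grammatical sentence,
    log-probability of its ungrammatical counterpart) for each minimal pair.

    Returns the fraction of pairs where the grammatical sentence is rated
    strictly more likely.
    """
    correct = sum(1 for good, bad in pairs if good > bad)
    return correct / len(pairs)


# Hypothetical scores for four minimal pairs, e.g.
# "The keys to the cabinet are ..." vs. "The keys to the cabinet is ...".
pairs = [(-2.1, -3.4), (-5.0, -4.2), (-1.0, -1.5), (-3.3, -6.0)]
print(tse_accuracy(pairs))
```

Since each pair is built around a single hand-picked verb, a score computed this way says nothing about other verbs that could fill the same slot, which is the gap the two goals in the abstract point at.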

pdf bib
Dynamically Disentangling Social Bias from Task-Oriented Representations with Adversarial Attack
Liwen Wang | Yuanmeng Yan | Keqing He | Yanan Wu | Weiran Xu

Representation learning is widely used in NLP for a vast range of tasks. However, representations derived from text corpora often reflect social biases. This phenomenon is pervasive and consistent across different neural models, causing serious concern. Previous methods mostly rely on a pre-specified, user-provided direction or suffer from unstable training. In this paper, we propose an adversarial disentangled debiasing model to dynamically decouple social bias attributes from the intermediate representations trained on the main task. We aim to denoise bias information while training on the downstream task, rather than completely remove social bias and pursue static unbiased representations. Experiments show the effectiveness of our method, both on the effect of debiasing and the main task performance.

pdf bib
On the Impact of Random Seeds on the Fairness of Clinical Classifiers
Silvio Amir | Jan-Willem van de Meent | Byron Wallace

Recent work has shown that fine-tuning large networks is surprisingly sensitive to changes in random seed(s). We explore the implications of this phenomenon for model fairness across demographic groups in clinical prediction tasks over electronic health records (EHR) in MIMIC-III, the standard dataset in clinical NLP research. Apparent subgroup performance varies substantially for seeds that yield similar overall performance, although there is no evidence of a trade-off between overall and subgroup performance. However, we also find that the small sample sizes inherent to looking at intersections of minority groups and somewhat rare conditions limit our ability to accurately estimate disparities. Further, we find that jointly optimizing for high overall performance and low disparities does not yield statistically significant improvements. Our results suggest that fairness work using MIMIC-III should carefully account for variations in apparent differences that may arise from stochasticity and small sample sizes.

pdf bib
Topic Model or Topic Twaddle? Re-evaluating Semantic Interpretability Measures
Caitlin Doogan | Wray Buntine

When developing topic models, a critical question that should be asked is: How well will this model work in an applied setting? Because standard performance evaluation of topic interpretability uses automated measures modeled on human evaluation tests that are dissimilar to applied usage, these models’ generalizability remains in question. In this paper, we probe the issue of validity in topic model evaluation and assess how informative coherence measures are for specialized collections used in an applied setting. Informed by the literature, we propose four understandings of interpretability. We evaluate these using a novel experimental framework reflective of varied applied settings, including human evaluations using open labeling, typical of applied research. These evaluations show that for some specialized collections, standard coherence measures may not inform the most appropriate topic model or the optimal number of topics, and current interpretability performance validation methods are challenged as a means to confirm model quality in the absence of ground truth data.

pdf bib
Learning to Learn to be Right for the Right Reasons
Pride Kavumba | Benjamin Heinzerling | Ana Brassard | Kentaro Inui

Improving model generalization on held-out data is one of the core objectives in commonsense reasoning. Recent work has shown that models trained on the dataset with superficial cues tend to perform well on the easy test set with superficial cues but perform poorly on the hard test set without superficial cues. Previous approaches have resorted to manual methods of encouraging models not to overfit to superficial cues. While some of the methods have improved performance on hard instances, they also lead to degraded performance on easy instances. Here, we propose to explicitly learn a model that does well on both the easy test set with superficial cues and the hard test set without superficial cues. Using a meta-learning objective, we learn such a model that improves performance on both the easy test set and the hard test set. By evaluating our models on Choice of Plausible Alternatives (COPA) and Commonsense Explanation, we show that our proposed method leads to improved performance on both the easy test set and the hard test set upon which we observe up to 16.5 percentage points improvement over the baseline.

pdf bib
Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation
Chong Zhang | Jieyu Zhao | Huan Zhang | Kai-Wei Chang | Cho-Jui Hsieh

Robustness and counterfactual bias are usually evaluated on a test dataset. However, are these evaluations robust? If the test dataset is perturbed slightly, will the evaluation results stay the same? In this paper, we propose a double perturbation framework to uncover model weaknesses beyond the test dataset. The framework first perturbs the test dataset to construct abundant natural sentences similar to the test data, and then diagnoses the prediction change regarding a single-word substitution. We apply this framework to study two perturbation-based approaches that are used to analyze models’ robustness and counterfactual bias in English. (1) For robustness, we focus on synonym substitutions and identify vulnerable examples where prediction can be altered. Our proposed attack attains high success rates (96.0%-99.8%) in finding vulnerable examples on both original and robustly trained CNNs and Transformers. (2) For counterfactual bias, we focus on substituting demographic tokens (e.g., gender, race) and measure the shift of the expected prediction among constructed sentences. Our method is able to reveal the hidden model biases not directly shown in the test dataset. Our code is available at https://github.com/chong-z/nlp-second-order-attack.

pdf bib
Almost Free Semantic Draft for Neural Machine Translation
Xi Ai | Bin Fang

Translation quality can be improved by global information from the required target sentence because the decoder can understand both past and future information. However, the model incurs additional cost to produce and consider such global information. In this work, to inject global information while saving cost, we present an efficient method to sample and consider a semantic draft as global information from semantic space for decoding at almost no cost. Unlike other successful adaptations, we do not have to perform an EM-like process that repeatedly samples a possible semantic from the semantic space. Empirical experiments show that the presented method can achieve competitive performance on common language pairs with a clear advantage in inference efficiency. We will open all our source code on GitHub.

pdf bib
Pruning-then-Expanding Model for Domain Adaptation of Neural Machine Translation
Shuhao Gu | Yang Feng | Wanying Xie

Domain adaptation is widely used in practical applications of neural machine translation, which aims to achieve good performance on both general domain and in-domain data. However, the existing methods for domain adaptation usually suffer from catastrophic forgetting, large domain divergence, and model explosion. To address these three problems, we propose a method of divide and conquer which is based on the importance of neurons or parameters for the translation model. In this method, we first prune the model and only keep the important neurons or parameters, making them responsible for both general-domain and in-domain translation. Then we further train the pruned model supervised by the original whole model with knowledge distillation. Finally, we expand the model to the original size and fine-tune the added parameters for the in-domain translation. We conducted experiments on different language pairs and domains and the results show that our method can achieve significant improvements compared with several strong baselines.

pdf bib
Continual Learning for Neural Machine Translation
Yue Cao | Hao-Ran Wei | Boxing Chen | Xiaojun Wan

Neural machine translation (NMT) models are data-driven and require a large-scale training corpus. In practical applications, NMT models are usually trained on a general domain corpus and then fine-tuned by continuing training on the in-domain corpus. However, this bears the risk of catastrophic forgetting, in which performance on the general domain decreases drastically. In this work, we propose a new continual learning framework for NMT models. We consider a scenario where the training comprises multiple stages and propose a dynamic knowledge distillation technique to alleviate the problem of catastrophic forgetting systematically. We also find that a bias exists in the output linear projection when fine-tuning on the in-domain corpus, and propose a bias-correction module to eliminate the bias. We conduct experiments on three representative settings of NMT application. Experimental results show that the proposed method achieves superior performance compared to baseline models in all settings.

pdf bib
ER-AE: Differentially Private Text Generation for Authorship Anonymization
Haohan Bo | Steven H. H. Ding | Benjamin C. M. Fung | Farkhund Iqbal

Most privacy protection studies for textual data focus on removing explicit sensitive identifiers. However, personal writing style, a strong indicator of authorship, is often neglected. Recent studies, such as SynTF, have shown promising results on privacy-preserving text mining. However, their anonymization algorithm can only output numeric term vectors, which are difficult for recipients to interpret. We propose a novel text generation model with a two-set exponential mechanism for authorship anonymization. By augmenting the semantic information through a REINFORCE training reward function, the model can generate differentially private text that has close semantics and a similar grammatical structure to the original text while removing personal traits of the writing style. It does not assume any conditioned labels or parallel text data for training. We evaluate the performance of the proposed model on a real-life peer review dataset and the Yelp review dataset. The results suggest that our model outperforms the state of the art on semantic preservation, authorship obfuscation, and stylometric transformation.

pdf bib
A recipe for annotating grounded clarifications
Luciana Benotti | Patrick Blackburn

In order to interpret the communicative intents of an utterance, it needs to be grounded in something that is outside of language; that is, grounded in world modalities. In this paper, we argue that dialogue clarification mechanisms make explicit the process of interpreting the communicative intents of the speaker’s utterances by grounding them in the various modalities in which the dialogue is situated. This paper frames dialogue clarification mechanisms as an understudied research problem and a key missing piece in the giant jigsaw puzzle of natural language understanding. We discuss both the theoretical background and practical challenges posed by this problem and propose a recipe for obtaining grounding annotations. We conclude by highlighting ethical issues that need to be addressed in future work.

pdf bib
Grey-box Adversarial Attack And Defence For Sentiment Classification
Ying Xu | Xu Zhong | Antonio Jimeno Yepes | Jey Han Lau

We introduce a grey-box adversarial attack and defence framework for sentiment classification. We address the issues of differentiability, label preservation and input reconstruction for adversarial attack and defence in one unified framework. Our results show that once trained, the attacking model is capable of generating high-quality adversarial examples substantially faster (one order of magnitude less in time) than state-of-the-art attacking methods. These examples also preserve the original sentiment according to human evaluation. Additionally, our framework produces an improved classifier that is robust in defending against multiple adversarial attacking methods. Code is available at: https://github.com/ibm-aur-nlp/adv-def-text-dist.

pdf bib
Dynabench: Rethinking Benchmarking in NLP
Douwe Kiela | Max Bartolo | Yixin Nie | Divyansh Kaushik | Atticus Geiger | Zhengxuan Wu | Bertie Vidgen | Grusha Prasad | Amanpreet Singh | Pratik Ringshia | Zhiyi Ma | Tristan Thrush | Sebastian Riedel | Zeerak Waseem | Pontus Stenetorp | Robin Jia | Mohit Bansal | Christopher Potts | Adina Williams

We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.

pdf bib
Predicting Discourse Trees from Transformer-based Neural Summarizers
Wen Xiao | Patrick Huber | Giuseppe Carenini

Previous work indicates that discourse information benefits summarization. In this paper, we explore whether this synergy between discourse and summarization is bidirectional, by inferring document-level discourse trees from pre-trained neural summarizers. In particular, we generate unlabeled RST-style discourse trees from the self-attention matrices of the transformer model. Experiments across models and datasets reveal that the summarizer learns both dependency- and constituency-style discourse information, which is typically encoded in a single head, covering long- and short-distance discourse dependencies. Overall, the experimental results suggest that the learned discourse information is general and transferable across domains.

pdf bib
Stay Together: A System for Single and Split-antecedent Anaphora Resolution
Juntao Yu | Nafise Sadat Moosavi | Silviu Paun | Massimo Poesio

The state of the art on basic, single-antecedent anaphora has greatly improved in recent years. Researchers have therefore started to pay more attention to more complex cases of anaphora such as split-antecedent anaphora, as in “Time-Warner is considering a legal challenge to Telecommunications Inc’s plan to buy half of Showtime Networks Inc - a move that could lead to all-out war between the two powerful companies.” Split-antecedent anaphora is rarer and more complex to resolve than single-antecedent anaphora; as a result, it is not annotated in many datasets designed to test coreference, and previous work on resolving this type of anaphora was carried out in unrealistic conditions that assume gold mentions and/or gold split-antecedent anaphors are available. These systems also focus on split-antecedent anaphors only. In this work, we introduce a system that resolves both single and split-antecedent anaphors, and evaluate it in a more realistic setting that uses predicted mentions. We also start addressing the question of how to evaluate single and split-antecedent anaphors together using standard coreference evaluation metrics.

pdf bib
CoRT: Complementary Rankings from Transformers
Marco Wrzalik | Dirk Krechel

Many recent approaches towards neural information retrieval mitigate their computational costs by using a multi-stage ranking pipeline. In the first stage, a number of potentially relevant candidates are retrieved using an efficient retrieval model such as BM25. Although BM25 has proven to perform decently as a first-stage ranker, it tends to miss relevant passages. In this context we propose CoRT, a simple neural first-stage ranking model that leverages contextual representations from pretrained language models such as BERT to complement term-based ranking functions while causing no significant delay at query time. Using the MS MARCO dataset, we show that CoRT significantly increases the candidate recall by complementing BM25 with missing candidates. Consequently, we find subsequent re-rankers achieve superior results with fewer candidates. We further demonstrate that passage retrieval using CoRT can be realized with surprisingly low latencies.
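
To make the notion of complementing a term-based ranker concrete, the sketch below merges a BM25 candidate list with a neural candidate list and measures candidate recall. The abstract does not state how the two rankings are combined, so the interleaving rule, the function names, and the toy document ids are all assumptions for illustration.

```python
def complement_candidates(bm25_ranked, neural_ranked, k):
    """Interleave two ranked lists of document ids, skipping duplicates.

    Assumed merging rule: take rank 1 from BM25, rank 1 from the neural
    ranker, rank 2 from BM25, and so on, then truncate to k candidates.
    """
    merged, seen = [], set()
    for rank in range(max(len(bm25_ranked), len(neural_ranked))):
        for ranking in (bm25_ranked, neural_ranked):
            if rank < len(ranking) and ranking[rank] not in seen:
                seen.add(ranking[rank])
                merged.append(ranking[rank])
    return merged[:k]


def candidate_recall(candidates, relevant):
    """Fraction of the relevant documents that appear among the candidates."""
    return len(set(candidates) & set(relevant)) / len(relevant)


if __name__ == "__main__":
    bm25 = ["d1", "d2", "d3"]       # hypothetical term-based ranking
    neural = ["d2", "d7", "d1"]     # hypothetical neural ranking
    relevant = ["d1", "d7"]         # hypothetical relevance judgments
    merged = complement_candidates(bm25, neural, k=4)
    print(merged, candidate_recall(bm25, relevant), candidate_recall(merged, relevant))
```

In this toy case the neural list contributes `d7`, a relevant document BM25 missed, so recall over the same candidate budget rises.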

pdf bib
Multi-source Neural Topic Modeling in Multi-view Embedding Spaces
Pankaj Gupta | Yatin Chaudhary | Hinrich Schütze

Though word embeddings and topics are complementary representations, several past works have only used pretrained word embeddings in (neural) topic modeling to address data sparsity in short-text or small collections of documents. This work presents a novel neural topic modeling framework using multi-view embedding spaces: (1) pretrained topic embeddings, and (2) pretrained word embeddings (context-insensitive from GloVe and context-sensitive from BERT models) jointly from one or many sources to improve topic quality and better deal with polysemy. In doing so, we first build respective pools of pretrained topic (i.e., TopicPool) and word embeddings (i.e., WordPool). We then identify one or more relevant source domain(s) and transfer knowledge to guide meaningful learning in the sparse target domain. Within neural topic modeling, we quantify the quality of topics and document representations via generalization (perplexity), interpretability (topic coherence) and information retrieval (IR) using short-text, long-text, small and large document collections from news and medical domains. Introducing the multi-source multi-view embedding spaces, we have shown state-of-the-art neural topic modeling using 6 source (high-resource) and 5 target (low-resource) corpora.

pdf bib
Self-Alignment Pretraining for Biomedical Entity Representations
Fangyu Liu | Ehsan Shareghi | Zaiqiao Meng | Marco Basaldella | Nigel Collier

Despite the widespread success of self-supervised learning via masked language models (MLM), accurately capturing fine-grained semantic relationships in the biomedical domain remains a challenge. This is of paramount importance for entity-level tasks such as entity linking where the ability to model entity relations (especially synonymy) is pivotal. To address this challenge, we propose SapBERT, a pretraining scheme that self-aligns the representation space of biomedical entities. We design a scalable metric learning framework that can leverage UMLS, a massive collection of biomedical ontologies with 4M+ concepts. In contrast with previous pipeline-based hybrid systems, SapBERT offers an elegant one-model-for-all solution to the problem of medical entity linking (MEL), achieving a new state-of-the-art (SOTA) on six MEL benchmarking datasets. In the scientific domain, we achieve SOTA even without task-specific supervision. With substantial improvement over various domain-specific pretrained MLMs such as BioBERT, SciBERT and PubMedBERT, our pretraining scheme proves to be both effective and robust.

pdf bib
TaxoClass: Hierarchical Multi-Label Text Classification Using Only Class Names
Jiaming Shen | Wenda Qiu | Yu Meng | Jingbo Shang | Xiang Ren | Jiawei Han

Hierarchical multi-label text classification (HMTC) aims to tag each document with a set of classes from a taxonomic class hierarchy. Most existing HMTC methods train classifiers using massive human-labeled documents, which are often too costly to obtain in real-world applications. In this paper, we explore conducting HMTC based on only class surface names as supervision signals. We observe that to perform HMTC, human experts typically first pinpoint a few most essential classes for the document as its core classes, and then check the core classes’ ancestor classes to ensure coverage. To mimic human experts, we propose a novel HMTC framework, named TaxoClass. Specifically, TaxoClass (1) calculates document-class similarities using a textual entailment model, (2) identifies a document’s core classes and utilizes confident core classes to train a taxonomy-enhanced classifier, and (3) generalizes the classifier via multi-label self-training. Our experiments on two challenging datasets show TaxoClass can achieve around 0.71 Example-F1 using only class names, outperforming the best previous method by 25%.

pdf bib
MERMAID: Metaphor Generation with Symbolism and Discriminative Decoding
Tuhin Chakrabarty | Xurui Zhang | Smaranda Muresan | Nanyun Peng

Generating metaphors is a challenging task as it requires a proper understanding of abstract concepts, making connections between unrelated concepts, and deviating from the literal meaning. In this paper, we aim to generate a metaphoric sentence given a literal expression by replacing relevant verbs. Based on a theoretically-grounded connection between metaphors and symbols, we propose a method to automatically construct a parallel corpus by transforming a large number of metaphorical sentences from the Gutenberg Poetry corpus (CITATION) to their literal counterpart using recent advances in masked language modeling coupled with commonsense inference. For the generation task, we incorporate a metaphor discriminator to guide the decoding of a sequence-to-sequence model fine-tuned on our parallel data to generate high-quality metaphors. Human evaluation on an independent test set of literal statements shows that our best model generates metaphors better than three well-crafted baselines 66% of the time on average. A task-based evaluation shows that human-written poems enhanced with metaphors proposed by our model are preferred 68% of the time compared to poems without metaphors.

pdf bib
Ask what’s missing and what’s useful: Improving Clarification Question Generation using Global Knowledge
Bodhisattwa Prasad Majumder | Sudha Rao | Michel Galley | Julian McAuley

The ability to generate clarification questions, i.e., questions that identify useful missing information in a given context, is important in reducing ambiguity. Humans use previous experience with similar contexts to form a global view and compare it to the given context to ascertain what is missing and what is useful in the context. Inspired by this, we propose a model for clarification question generation where we first identify what is missing by taking a difference between the global and the local view and then train a model to identify what is useful and generate a question about it. Our model outperforms several baselines as judged by both automatic metrics and humans.

pdf bib
“I’m Not Mad”: Commonsense Implications of Negation and Contradiction
Liwei Jiang | Antoine Bosselut | Chandra Bhagavatula | Yejin Choi

Natural language inference requires reasoning about contradictions, negations, and their commonsense implications. Given a simple premise (e.g., I’m mad at you), humans can reason about the varying shades of contradictory statements ranging from straightforward negations (I’m not mad at you) to commonsense contradictions (I’m happy). Moreover, these negated or contradictory statements shift the commonsense implications of the original premise in interesting and nontrivial ways. For example, while I’m mad implies I’m unhappy about something, negating the premise does not necessarily negate the corresponding commonsense implications. In this paper, we present the first comprehensive study focusing on commonsense implications of negated statements and contradictions. We introduce ANION, a new commonsense knowledge graph with 624K if-then rules focusing on negated and contradictory events. We then present joint generative and discriminative inference models for this new resource, providing novel empirical insights on how logical negations and commonsense contradictions reshape the commonsense implications of their original premises.

pdf bib
Identifying Medical Self-Disclosure in Online Communities
Mina Valizadeh | Pardis Ranjbar-Noiey | Cornelia Caragea | Natalie Parde

Self-disclosure in online health conversations may offer a host of benefits, including earlier detection and treatment of medical issues that may have otherwise gone unaddressed. However, research analyzing medical self-disclosure in online communities is limited. We address this shortcoming by introducing a new dataset of health-related posts collected from online social platforms, categorized into three groups (No Self-Disclosure, Possible Self-Disclosure, and Clear Self-Disclosure) with high inter-annotator agreement (k = 0.88). We make this data available to the research community. We also release a predictive model trained on this dataset that achieves an accuracy of 81.02%, establishing a strong performance benchmark for this task.

pdf bib
Language in a (Search) Box: Grounding Language Learning in Real-World Human-Machine Interaction
Federico Bianchi | Ciro Greco | Jacopo Tagliabue

We investigate grounded language learning through real-world data, by modelling teacher-learner dynamics through the natural interactions occurring between users and search engines; in particular, we explore the emergence of semantic generalization from unsupervised dense representations outside of synthetic environments. A grounding domain, a denotation function and a composition function are learned from user data only. We show how the resulting semantics for noun phrases exhibits compositional properties while being fully learnable without any explicit labelling. We benchmark our grounded semantics on compositionality and zero-shot inference tasks, and we show that it provides better results and better generalizations than SOTA non-grounded models, such as word2vec and BERT.

pdf bib
Finding Concept-specific Biases in Form-Meaning Associations
Tiago Pimentel | Brian Roark | Søren Wichmann | Ryan Cotterell | Damián Blasi

This work presents an information-theoretic operationalisation of cross-linguistic non-arbitrariness. It is not a new idea that there are small, cross-linguistic associations between the forms and meanings of words. For instance, it has been claimed (Blasi et al., 2016) that the word for tongue is more likely than chance to contain the phone [l]. By controlling for the influence of language family and geographic proximity within a very large concept-aligned, cross-lingual lexicon, we extend methods previously used to detect within-language non-arbitrariness (Pimentel et al., 2019) to measure cross-linguistic associations. We find that there is a significant effect of non-arbitrariness, but it is unsurprisingly small (less than 0.5% on average according to our information-theoretic estimate). We also provide a concept-level analysis which shows that a quarter of the concepts considered in our work exhibit a significant level of cross-linguistic non-arbitrariness. In sum, the paper provides new methods to detect cross-linguistic associations at scale, and confirms their effects are minor.

pdf bib
Linguistic Complexity Loss in Text-Based Therapy
Jason Wei | Kelly Finn | Emma Templeton | Thalia Wheatley | Soroush Vosoughi

The complexity loss paradox, which posits that individuals suffering from disease exhibit surprisingly predictable behavioral dynamics, has been observed in a variety of both human and animal physiological systems. The recent advent of online text-based therapy presents a new opportunity to analyze the complexity loss paradox in a novel operationalization: linguistic complexity loss in text-based therapy conversations. In this paper, we analyze linguistic complexity correlates of mental health in the online therapy messages sent between therapists and 7,170 clients who provided 30,437 corresponding survey responses on their anxiety. We found that when clients reported more anxiety, they showed reduced lexical diversity as estimated by the moving average type-token ratio. Therapists, on the other hand, used language of higher reading difficulty, syntactic complexity, and age of acquisition when clients were more anxious. Finally, we found that clients, and to an even greater extent, therapists, exhibited consistent levels of many linguistic complexity measures. These results demonstrate how linguistic analysis of text-based communication can be leveraged as a marker for anxiety, an exciting prospect in a time of both increased online communication and increased mental health issues.
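
The lexical diversity measure named in this abstract, the moving average type-token ratio (MATTR), can be sketched as follows: slide a fixed-size window over the token sequence, compute the ratio of distinct tokens to window length in each position, and average. The window size of 50, the whitespace tokenization in the usage lines, and the fallback for short texts are assumptions; the abstract does not state the settings the authors used.

```python
def mattr(tokens, window=50):
    """Moving average type-token ratio over a list of tokens.

    For texts no longer than the window, this sketch falls back to the
    plain type-token ratio (an assumed convention).
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    if len(tokens) <= window:
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[start:start + window])) / window
        for start in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


if __name__ == "__main__":
    # Toy messages; lowercasing and whitespace splitting stand in for a tokenizer.
    varied = "the quick brown fox jumps over a lazy dog".split()
    repetitive = "i feel bad i feel bad i feel bad".split()
    print(mattr(varied, window=4), mattr(repetitive, window=4))
```

Because each window has the same length, the score is far less sensitive to message length than the plain type-token ratio, which is why it suits comparing texts of different sizes.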

pdf bib
Ab Antiquo: Neural Proto-language Reconstruction
Carlo Meloni | Shauli Ravfogel | Yoav Goldberg

Historical linguists have identified regularities in the process of historic sound change. The comparative method utilizes those regularities to reconstruct proto-words based on observed forms in daughter languages. Can this process be efficiently automated? We address the task of proto-word reconstruction, in which the model is exposed to cognates in contemporary daughter languages, and has to predict the proto-word in the ancestor language. We provide a novel dataset for this task, encompassing over 8,000 comparative entries, and show that neural sequence models outperform conventional methods applied to this task so far. Error analysis reveals variability in the ability of neural models to capture different phonological changes, correlating with the complexity of the changes. Analysis of learned embeddings reveals the models learn phonologically meaningful generalizations, corresponding to well-attested phonological shifts documented by historical linguistics.

pdf bib
Adapting Coreference Resolution for Processing Violent Death Narratives
Ankith Uppunda | Susan Cochran | Jacob Foster | Alina Arseniev-Koehler | Vickie Mays | Kai-Wei Chang

Coreference resolution is an important component in analyzing narrative text from administrative data (e.g., clinical or police sources). However, existing coreference models trained on general language corpora suffer from poor transferability due to domain gaps, especially when they are applied to gender-inclusive data with lesbian, gay, bisexual, and transgender (LGBT) individuals. In this paper, we analyzed the challenges of coreference resolution in an exemplary form of administrative text written in English: violent death narratives from the USA’s Centers for Disease Control’s (CDC) National Violent Death Reporting System. We developed a set of data augmentation rules to improve model performance using a probabilistic data programming framework. Experiments on narratives from an administrative database, as well as existing gender-inclusive coreference datasets, demonstrate the effectiveness of data augmentation in training coreference models that can better handle text data about LGBT individuals.

pdf bib
Does Structure Matter? Encoding Documents for Machine Reading Comprehension
Hui Wan | Song Feng | Chulaka Gunasekara | Siva Sankalp Patel | Sachindra Joshi | Luis Lastras

Machine reading comprehension is a challenging task, especially for querying documents with deep and interconnected contexts. Transformer-based methods have shown strong performance on this task; however, most of them still treat documents as a flat sequence of tokens. This work proposes a new Transformer-based method that reads a document as tree slices. It contains two modules for identifying the most relevant text passage and the best answer span respectively, which are not only jointly trained but also jointly consulted at inference time. Our evaluation results show that our proposed method outperforms several competitive baseline approaches on two datasets from varied domains.

pdf bib
Constructing Taxonomies from Pretrained Language Models
Catherine Chen | Kevin Lin | Dan Klein

We present a method for constructing taxonomic trees (e.g., WordNet) using pretrained language models. Our approach is composed of two modules, one that predicts parenthood relations and another that reconciles those pairwise predictions into trees. The parenthood prediction module produces likelihood scores for each potential parent-child pair, creating a graph of parent-child relation scores. The tree reconciliation module treats the task as a graph optimization problem and outputs the maximum spanning tree of this graph. We train our model on subtrees sampled from WordNet, and test on non-overlapping WordNet subtrees. We show that incorporating web-retrieved glosses can further improve performance. On the task of constructing subtrees of English WordNet, the model achieves 66.7 ancestor F1, a 20.0% relative increase over the previous best published result on this task. In addition, we convert the original English dataset into nine other languages using Open Multilingual WordNet and extend our results across these languages.
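
The reconciliation step described in this abstract can be illustrated with a small maximum spanning tree routine. The sketch below is a simplification under stated assumptions: it treats the pairwise scores as undirected edge weights and uses Kruskal's algorithm with union-find, whereas a directed parent-child graph would call for a maximum spanning arborescence (the Chu-Liu/Edmonds algorithm). The terms and scores are made up for illustration and do not come from the paper.

```python
def maximum_spanning_tree(nodes, scored_edges):
    """Kruskal's algorithm on (score, u, v) edges, highest score first.

    Returns the list of (u, v, score) edges kept in the tree.
    """
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the component root, compressing the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    tree = []
    for score, u, v in sorted(scored_edges, key=lambda edge: edge[0], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:          # keep the edge only if it joins two components
            parent[root_u] = root_v
            tree.append((u, v, score))
    return tree


if __name__ == "__main__":
    terms = ["animal", "dog", "cat", "poodle"]
    # Hypothetical parenthood likelihoods for candidate (parent, child) pairs.
    scores = [
        (0.9, "animal", "dog"),
        (0.8, "animal", "cat"),
        (0.7, "dog", "poodle"),
        (0.3, "animal", "poodle"),
        (0.2, "cat", "dog"),
    ]
    for parent_term, child_term, score in maximum_spanning_tree(terms, scores):
        print(parent_term, "->", child_term, score)
```

The tree constraint is what resolves conflicts between pairwise predictions: `poodle` attaches to `dog` rather than `animal` because that edge scores higher and a second parent would create a cycle.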

pdf bib
Adapting BERT for Continual Learning of a Sequence of Aspect Sentiment Classification Tasks
Zixuan Ke | Hu Xu | Bing Liu

This paper studies continual learning (CL) of a sequence of aspect sentiment classification (ASC) tasks. Although some CL techniques have been proposed for document sentiment classification, we are not aware of any CL work on ASC. A CL system that incrementally learns a sequence of ASC tasks should address the following two issues: (1) transfer knowledge learned from previous tasks to the new task to help it learn a better model, and (2) maintain the performance of the models for previous tasks so that they are not forgotten. This paper proposes a novel capsule network based model called B-CL to address these issues. B-CL markedly improves the ASC performance on both the new task and the old tasks via forward and backward knowledge transfer. The effectiveness of B-CL is demonstrated through extensive experiments.

pdf bib
Enriching Transformers with Structured Tensor-Product Representations for Abstractive Summarization
Yichen Jiang | Asli Celikyilmaz | Paul Smolensky | Paul Soulos | Sudha Rao | Hamid Palangi | Roland Fernandez | Caitlin Smith | Mohit Bansal | Jianfeng Gao

Abstractive summarization, the task of generating a concise summary of input documents, requires: (1) reasoning over the source document to determine the salient pieces of information scattered across the long document, and (2) composing a cohesive text by reconstructing these salient facts into a shorter summary that faithfully reflects the complex relations connecting these facts. In this paper, we adapt TP-Transformer (Schlag et al., 2019), an architecture that enriches the original Transformer (Vaswani et al., 2017) with the explicitly compositional Tensor Product Representation (TPR), for the task of abstractive summarization. The key feature of our model is a structural bias that we introduce by encoding two separate representations for each token to represent the syntactic structure (with role vectors) and semantic content (with filler vectors) separately. The model then binds the role and filler vectors into the TPR as the layer output. We argue that the structured intermediate representations enable the model to take better control of the contents (salient facts) and structures (the syntax that connects the facts) when generating the summary. Empirically, we show that our TP-Transformer outperforms the Transformer and the original TP-Transformer significantly on several abstractive summarization datasets based on both automatic and human evaluations. On several syntactic and semantic probing tasks, we demonstrate the emergent structural information in the role vectors and the performance gain by information specificity of the role vectors and improved syntactic interpretability in the TPR layer outputs. (Code and models are available at https://github.com/jiangycTarheel/TPT-Summ)

pdf bib
Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics
Artidoro Pagnoni | Vidhisha Balachandran | Yulia Tsvetkov

Modern summarization models generate highly fluent but often factually unreliable outputs. This motivated a surge of metrics attempting to measure the factuality of automatically generated summaries. Due to the lack of common benchmarks, these metrics cannot be compared. Moreover, all these methods treat factuality as a binary concept and fail to provide deeper insights on the kinds of inconsistencies made by different systems. To address these limitations, we devise a typology of factual errors and use it to collect human annotations of generated summaries from state-of-the-art summarization systems for the CNN/DM and XSum datasets. Through these annotations we identify the proportion of different categories of factual errors and benchmark factuality metrics, showing their correlation with human judgement as well as their specific strengths and weaknesses.

pdf bib
GSum: A General Framework for Guided Neural Abstractive Summarization
Zi-Yi Dou | Pengfei Liu | Hiroaki Hayashi | Zhengbao Jiang | Graham Neubig

Neural abstractive summarization models are flexible and can produce coherent summaries, but they are sometimes unfaithful and can be difficult to control. While previous studies attempt to provide different types of guidance to control the output and increase faithfulness, it is not clear how these strategies compare and contrast to each other. In this paper, we propose a general and extensible guided summarization framework (GSum) that can effectively take different kinds of external guidance as input, and we perform experiments across several different varieties. Experiments demonstrate that this model is effective, achieving state-of-the-art performance according to ROUGE on 4 popular summarization datasets when using highlighted sentences as guidance. In addition, we show that our guided model can generate more faithful summaries and demonstrate how different types of guidance generate qualitatively different summaries, lending a degree of controllability to the learned models.

pdf bib
TuringAdvice: A Generative and Dynamic Evaluation of Language Use
Rowan Zellers | Ari Holtzman | Elizabeth Clark | Lianhui Qin | Ali Farhadi | Yejin Choi

We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other. Empirical results show that today’s models struggle at TuringAdvice, even multibillion parameter models finetuned on 600k in-domain training examples. The best model, T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.

pdf bib
Identifying inherent disagreement in natural language inference
Xinliang Frederick Zhang | Marie-Catherine de Marneffe

Natural language inference (NLI) is the task of determining whether a piece of text is entailed, contradicted by or unrelated to another piece of text. In this paper, we investigate how to tease systematic inferences (i.e., items for which people agree on the NLI label) apart from disagreement items (i.e., items which lead to different annotations), which most prior work has overlooked. To distinguish systematic inferences from disagreement items, we propose Artificial Annotators (AAs) to simulate the uncertainty in the annotation process by capturing the modes in annotations. Results on the CommitmentBank, a corpus of naturally occurring discourses in English, confirm that our approach performs statistically significantly better than all baselines. We further show that AAs learn linguistic patterns and context-dependent reasoning.

pdf bib
Modeling Human Mental States with an Entity-based Narrative Graph
I-Ta Lee | Maria Leonor Pacheco | Dan Goldwasser

Understanding narrative text requires capturing characters’ motivations, goals, and mental states. This paper proposes an Entity-based Narrative Graph (ENG) to model the internal states of characters in a story. We explicitly model entities, their interactions and the context in which they appear, and learn rich representations for them. We experiment with different task-adaptive pre-training objectives, in-domain training, and symbolic inference to capture dependencies between different decisions in the output space. We evaluate our model on two narrative understanding tasks, predicting character mental states and desire fulfillment, and conduct a qualitative analysis.

pdf bib
Hurdles to Progress in Long-form Question Answering
Kalpesh Krishna | Aurko Roy | Mohit Iyyer

The task of long-form question answering (LFQA) involves retrieving documents relevant to a given question and using them to generate a paragraph-length answer. While many models have recently been proposed for LFQA, we show in this paper that the task formulation raises fundamental challenges regarding evaluation and dataset creation that currently preclude meaningful modeling progress. To demonstrate these challenges, we first design a new system that relies on sparse attention and contrastive retriever learning to achieve state-of-the-art performance on the ELI5 LFQA dataset. While our system tops the public leaderboard, a detailed analysis reveals several troubling trends: (1) our system’s generated answers are not actually grounded in the documents that it retrieves; (2) ELI5 contains significant train/validation overlap, as at least 81% of ELI5 validation questions occur in paraphrased form in the training set; (3) ROUGE-L is not an informative metric of generated answer quality and can be easily gamed; and (4) human evaluations used for other text generation tasks are unreliable for LFQA. We offer suggestions to mitigate each of these issues, which we hope will lead to more rigorous LFQA research and meaningful progress in the future.

pdf bib
Attention Head Masking for Inference Time Content Selection in Abstractive Summarization
Shuyang Cao | Lu Wang

How can we effectively inform content selection in Transformer-based abstractive summarization models? In this work, we present a simple-yet-effective attention head masking technique, which is applied on encoder-decoder attentions to pinpoint salient content at inference time. Using attention head masking, we are able to reveal the relation between encoder-decoder attentions and content selection behaviors of summarization models. We then demonstrate its effectiveness on three document summarization datasets based on both in-domain and cross-domain settings. Importantly, our models outperform prior state-of-the-art models on CNN/Daily Mail and New York Times datasets. Moreover, our inference-time masking technique is also data-efficient, requiring only 20% of the training samples to outperform BART fine-tuned on the full CNN/DailyMail dataset.

pdf bib
Factual Probing Is [MASK]: Learning vs. Learning to Recall
Zexuan Zhong | Dan Friedman | Danqi Chen

Petroni et al. (2019) demonstrated that it is possible to retrieve world facts from a pre-trained language model by expressing them as cloze-style prompts and interpret the model’s prediction accuracy as a lower bound on the amount of factual information it encodes. Subsequent work has attempted to tighten the estimate by searching for better prompts, using a disjoint set of facts as training data. In this work, we make two complementary contributions to better understand these factual probing techniques. First, we propose OptiPrompt, a novel and efficient method which directly optimizes in continuous embedding space. We find this simple method is able to predict an additional 6.4% of facts in the LAMA benchmark. Second, we raise a more important question: Can we really interpret these probing results as a lower bound? Is it possible that these prompt-search methods learn from the training data too? We find, somewhat surprisingly, that the training data used by these methods contains certain regularities of the underlying fact distribution, and all the existing prompt methods, including ours, are able to exploit them for better fact prediction. We conduct a set of control experiments to disentangle learning from learning to recall, providing a more detailed picture of what different prompts can reveal about pre-trained language models.

pdf bib
Contextualized Perturbation for Textual Adversarial Attack
Dianqi Li | Yizhe Zhang | Hao Peng | Liqun Chen | Chris Brockett | Ming-Ting Sun | Bill Dolan

Adversarial examples expose the vulnerabilities of natural language processing (NLP) models, and can be used to evaluate and improve their robustness. Existing techniques of generating such examples are typically driven by local heuristic rules that are agnostic to the context, often resulting in unnatural and ungrammatical outputs. This paper presents CLARE, a ContextuaLized AdversaRial Example generation model that produces fluent and grammatical outputs through a mask-then-infill procedure. CLARE builds on a pre-trained masked language model and modifies the inputs in a context-aware manner. We propose three contextualized perturbations, Replace, Insert and Merge, that allow for generating outputs of varied lengths. CLARE can flexibly combine these perturbations and apply them at any position in the inputs, and is thus able to attack the victim model more effectively with fewer edits. Extensive experiments and human evaluation demonstrate that CLARE outperforms the baselines in terms of attack success rate, textual similarity, fluency and grammaticality.

pdf bib
Evaluating the Values of Sources in Transfer Learning
Md Rizwan Parvez | Kai-Wei Chang

Transfer learning that adapts a model trained on data-rich sources to low-resource targets has been widely applied in natural language processing (NLP). However, when training a transfer model over multiple sources, not every source is equally useful for the target. To better transfer a model, it is essential to understand the values of the sources. In this paper, we develop, an efficient source valuation framework for quantifying the usefulness of the sources (e.g.,) in transfer learning based on the Shapley value method. Experiments and comprehensive analyses on both cross-domain and cross-lingual transfers demonstrate that our framework is not only effective in choosing useful transfer sources but also the source values match the intuitive source-target similarity.
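To make the Shapley value idea concrete, here is a minimal sketch that computes exact Shapley values for a handful of sources. It is not the paper's framework: the abstract describes an efficient valuation method, whereas this brute-force enumeration is exponential in the number of sources. The names `shapley_values`, `toy_utility`, and `TOY_SCORES` are hypothetical, and the toy scores are made-up numbers standing in for the target-task score of a model trained on each subset of sources.

```python
from itertools import combinations
from math import factorial


def shapley_values(sources, utility):
    """Exact Shapley value of each source under a set-valued utility function."""
    n = len(sources)
    values = {}
    for s in sources:
        others = [o for o in sources if o != s]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain s.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                coalition = frozenset(subset)
                # Marginal gain from adding s to this coalition.
                total += weight * (utility(coalition | {s}) - utility(coalition))
        values[s] = total
    return values


# Hypothetical utility table (illustrative numbers only, not from the paper).
TOY_SCORES = {
    frozenset(): 0.0,
    frozenset({"a"}): 0.5,
    frozenset({"b"}): 0.25,
    frozenset({"a", "b"}): 0.75,
}


def toy_utility(coalition):
    return TOY_SCORES[frozenset(coalition)]


values = shapley_values(["a", "b"], toy_utility)
```

In this toy setting the values sum to the utility of using all sources together, which is the efficiency property that makes Shapley values attractive for ranking sources.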

pdf bib
Limitations of Autoregressive Models and Their Alternatives
Chu-Cheng Lin | Aaron Jaech | Xin Li | Matthew R. Gormley | Jason Eisner

Standard autoregressive language models perform only polynomial-time computation to compute the probability of the next symbol. While this is attractive, it means they cannot model distributions whose next-symbol probability is hard to compute. Indeed, they cannot even model them well enough to solve associated easy decision problems for which an engineer might want to consult a language model. These limitations apply no matter how much computation and data are used to train the model, unless the model is given access to oracle parameters that grow superpolynomially in sequence length. Thus, simply training larger autoregressive language models is not a panacea for NLP. Alternatives include energy-based models (which give up efficient sampling) and latent-variable autoregressive models (which give up efficient scoring of a given string). Both are powerful enough to escape the above limitations.

pdf bib
On the Transformer Growth for Progressive BERT Training
Xiaotao Gu | Liyuan Liu | Hongkun Yu | Jing Li | Chen Chen | Jiawei Han

As the excessive pre-training cost arouses the need to improve efficiency, considerable efforts have been made to train BERT progressively: start from an inferior but low-cost model and gradually increase the computational complexity. Our objective is to help advance the understanding of such Transformer growth and discover principles that guide progressive training. First, we find that similar to network architecture selection, Transformer growth also favors compound scaling. Specifically, while existing methods only conduct network growth in a single dimension, we observe that it is beneficial to use compound growth operators and balance multiple dimensions (e.g., depth, width, and input length of the model). Moreover, we explore alternative growth operators in each dimension via controlled comparison to give practical guidance for operator selection. In light of our analyses, the proposed method CompoundGrow speeds up BERT pre-training by 73.6% and 82.2% for the base and large models respectively while achieving comparable performances.

pdf bib
ReadTwice: Reading Very Large Documents with Memories
Yury Zemlyanskiy | Joshua Ainslie | Michiel de Jong | Philip Pham | Ilya Eckstein | Fei Sha

Knowledge-intensive tasks such as question answering often require assimilating information from different sections of large inputs such as books or article collections. We propose ReadTwice, a simple and effective technique that combines several strengths of prior approaches to model long-range dependencies with Transformers. The main idea is to read text in small segments, in parallel, summarizing each segment into a memory table to be used in a second read of the text. We show that the method outperforms models of comparable size on several question answering (QA) datasets and sets a new state of the art on the challenging NarrativeQA task, with questions about entire books.

pdf bib
Learning How to Ask: Querying LMs with Mixtures of Soft Prompts
Guanghui Qin | Jason Eisner

Natural-language prompts have recently been used to coax pretrained language models into performing other AI tasks, using a fill-in-the-blank paradigm (Petroni et al., 2019) or a few-shot extrapolation paradigm (Brown et al., 2020). For example, language models retain factual knowledge from their training corpora that can be extracted by asking them to fill in the blank in a sentential prompt. However, where does this prompt come from? We explore the idea of learning prompts by gradient descent, either fine-tuning prompts taken from previous work, or starting from random initialization. Our prompts consist of soft words, i.e., continuous vectors that are not necessarily word type embeddings from the language model. Furthermore, for each task, we optimize a mixture of prompts, learning which prompts are most effective and how to ensemble them. Across multiple English LMs and tasks, our approach hugely outperforms previous methods, showing that the implicit factual knowledge in language models was previously underestimated. Moreover, this knowledge is cheap to elicit: random initialization is nearly as good as informed initialization.

pdf bib
SpanPredict: Extraction of Predictive Document Spans with Neural Attention
Vivek Subramanian | Matthew Engelhard | Sam Berchuck | Liqun Chen | Ricardo Henao | Lawrence Carin

In many natural language processing applications, identifying predictive text can be as important as the predictions themselves. When predicting medical diagnoses, for example, identifying predictive content in clinical notes not only enhances interpretability, but also allows unknown, descriptive (i.e., text-based) risk factors to be identified. We here formalize this problem as predictive extraction and address it using a simple mechanism based on linear attention. Our method preserves differentiability, allowing scalable inference via stochastic gradient descent. Further, the model decomposes predictions into a sum of contributions of distinct text spans. Importantly, we require only document labels, not ground-truth spans. Results show that our model identifies semantically-cohesive spans and assigns them scores that agree with human ratings, while preserving classification performance.

pdf bib
Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation
Yasuhide Miura | Yuhao Zhang | Emily Tsai | Curtis Langlotz | Dan Jurafsky

Neural image-to-text radiology report generation systems offer the potential to improve radiology reporting by reducing the repetitive process of report drafting and identifying possible medical errors. However, existing report generation systems, despite achieving high performances on natural language generation metrics such as CIDEr or BLEU, still suffer from incomplete and inconsistent generations. Here we introduce two new simple rewards to encourage the generation of factually complete and consistent radiology reports: one that encourages the system to generate radiology domain entities consistent with the reference, and one that uses natural language inference to encourage these entities to be described in inferentially consistent ways. We combine these with the novel use of an existing semantic equivalence metric (BERTScore). We further propose a report generation system that optimizes these rewards via reinforcement learning. On two open radiology report datasets, our system substantially improved the F1 score of a clinical information extraction performance by +22.1 (Delta +63.9%). We further show via a human evaluation and a qualitative analysis that our system leads to generations that are more factually complete and consistent compared to the baselines.

pdf bib
MIMOQA: Multimodal Input Multimodal Output Question Answering
Hrituraj Singh | Anshul Nasery | Denil Mehta | Aishwarya Agarwal | Jatin Lamba | Balaji Vasan Srinivasan

Multimodal research has picked up significantly in the space of question answering with the task being extended to visual question answering, charts question answering as well as multimodal input question answering. However, all these explorations produce a unimodal textual output as the answer. In this paper, we propose a novel task, MIMOQA (Multimodal Input Multimodal Output Question Answering), in which the output is also multimodal. Through human experiments, we empirically show that such multimodal outputs provide better cognitive understanding of the answers. We also propose a novel multimodal question-answering framework, MExBERT, that incorporates a joint textual and visual attention towards producing such a multimodal output. Our method relies on a novel multimodal dataset curated for this problem from publicly available unimodal datasets. We show the superior performance of MExBERT against strong baselines on both the automatic as well as human metrics.

pdf bib
Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions
Liunian Harold Li | Haoxuan You | Zhecan Wang | Alireza Zareian | Shih-Fu Chang | Kai-Wei Chang

Pre-trained contextual vision-and-language (V&L) models have achieved impressive performance on various benchmarks. However, existing models require a large amount of parallel image-caption data for pre-training. Such data are costly to collect and require cumbersome curation. Inspired by unsupervised machine translation, we investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora. In particular, we propose to conduct mask-and-predict pre-training on text-only and image-only corpora and introduce the object tags detected by an object recognition model as anchor points to bridge two modalities. We find that such a simple approach achieves performance close to a model pre-trained with aligned data, on four English V&L benchmarks. Our work challenges the widely held notion that aligned data is necessary for V&L pre-training, while significantly reducing the amount of supervision needed for V&L models.

pdf bib
Multitasking Inhibits Semantic Drift
Athul Paul Jacob | Mike Lewis | Jacob Andreas

When intelligent agents communicate to accomplish shared goals, how do these goals shape the agents’ language? We study the dynamics of learning in latent language policies (LLPs), in which instructor agents generate natural-language subgoal descriptions and executor agents map these descriptions to low-level actions. LLPs can solve challenging long-horizon reinforcement learning problems and provide a rich model for studying task-oriented language use. But previous work has found that LLP training is prone to semantic drift (use of messages in ways inconsistent with their original natural language meanings). Here, we demonstrate theoretically and empirically that multitask training is an effective counter to this problem: we prove that multitask training eliminates semantic drift in a well-studied family of signaling games, and show that multitask training of neural LLPs in a complex strategy game reduces drift while improving sample efficiency.

pdf bib
Neural Quality Estimation with Multiple Hypotheses for Grammatical Error Correction
Zhenghao Liu | Xiaoyuan Yi | Maosong Sun | Liner Yang | Tat-Seng Chua

Grammatical Error Correction (GEC) aims to correct writing errors and help language learners improve their writing skills. However, existing GEC models tend to produce spurious corrections or fail to detect many errors. A quality estimation model is necessary to ensure learners get accurate GEC results and avoid being misled by poorly corrected sentences. Well-trained GEC models can generate several high-quality hypotheses through decoding, such as beam search, which provide valuable GEC evidence and can be used to evaluate GEC quality. However, existing models neglect the possible GEC evidence from different hypotheses. This paper presents the Neural Verification Network (VERNet) for GEC quality estimation with multiple hypotheses. VERNet establishes interactions among hypotheses with a reasoning graph and conducts two kinds of attention mechanisms to propagate GEC evidence to verify the quality of generated hypotheses. Our experiments on four GEC datasets show that VERNet achieves state-of-the-art grammatical error detection performance, achieves the best quality estimation results, and significantly improves GEC performance by reranking hypotheses. All data and source codes are available at https://github.com/thunlp/VERNet.

pdf bib
Few-Shot Text Classification with Triplet Networks, Data Augmentation, and Curriculum Learning
Jason Wei | Chengyu Huang | Soroush Vosoughi | Yu Cheng | Shiqi Xu

Few-shot text classification is a fundamental NLP task in which a model aims to classify text into a large number of categories, given only a few training examples per category. This paper explores data augmentation, a technique particularly suitable for training with limited data, for this few-shot, highly-multiclass text classification setting. On four diverse text classification tasks, we find that common data augmentation techniques can improve the performance of triplet networks by up to 3.0% on average. To further boost performance, we present a simple training strategy called curriculum data augmentation, which leverages curriculum learning by first training on only original examples and then introducing augmented data as training progresses. We explore a two-stage and a gradual schedule, and find that, compared with standard single-stage training, curriculum data augmentation trains faster, improves performance, and remains robust to high amounts of noising from augmentation.
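The two-stage schedule described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the abstract does not give the switch point, the batch construction, or the augmentation operations, so `two_stage_batches`, `switch_step`, and `toy_augment` are hypothetical names, and uniform sampling from the pool is an arbitrary choice.

```python
import random


def two_stage_batches(original, augment, num_steps, switch_step, batch_size, seed=0):
    """Yield batches: original examples only before switch_step, then original plus augmented."""
    rng = random.Random(seed)
    augmented = [augment(x) for x in original]
    for step in range(num_steps):
        # Stage 1 uses clean data; stage 2 mixes in the augmented copies.
        pool = original if step < switch_step else original + augmented
        yield [rng.choice(pool) for _ in range(batch_size)]


def toy_augment(text):
    # Placeholder augmentation (assumption); a real one would perturb the text.
    return text + " [aug]"


batches = list(
    two_stage_batches(
        ["good film", "bad film"],
        toy_augment,
        num_steps=6,
        switch_step=3,
        batch_size=4,
    )
)
```

A gradual schedule could be approximated in the same loop by growing the share of augmented examples in `pool` step by step instead of switching once.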

pdf bib
User-Generated Text Corpus for Evaluating Japanese Morphological Analysis and Lexical Normalization
Shohei Higashiyama | Masao Utiyama | Taro Watanabe | Eiichiro Sumita

Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT). To evaluate and compare different MA/LN systems, we have constructed a publicly available Japanese UGT corpus. Our corpus comprises 929 sentences annotated with morphological and normalization information, along with category information we classified for frequent UGT-specific phenomena. Experiments on the corpus demonstrated the low performance of existing MA/LN methods for non-general words and non-standard forms, indicating that the corpus would be a challenging benchmark for further research on UGT.

pdf bib
Contextualized and Generalized Sentence Representations by Contrastive Self-Supervised Learning: A Case Study on Discourse Relation Analysis
Hirokazu Kiyomaru | Sadao Kurohashi

We propose a method to learn contextualized and generalized sentence representations using contrastive self-supervised learning. In the proposed method, a model is given a text consisting of multiple sentences. One sentence is randomly selected as a target sentence. The model is trained to maximize the similarity between the representation of the target sentence with its context and that of the masked target sentence with the same context. Simultaneously, the model minimizes the similarity between the latter representation and the representation of a random sentence with the same context. We apply our method to discourse relation analysis in English and Japanese and show that it outperforms strong baseline methods based on BERT, XLNet, and RoBERTa.

pdf bib
Unsupervised Concept Representation Learning for Length-Varying Text Similarity
Xuchao Zhang | Bo Zong | Wei Cheng | Jingchao Ni | Yanchi Liu | Haifeng Chen

Measuring document similarity plays an important role in natural language processing tasks. Most existing document similarity approaches suffer from the information gap caused by context and vocabulary mismatches when comparing varying-length texts. In this paper, we propose an unsupervised concept representation learning approach to address the above issues. Specifically, we propose a novel Concept Generation Network (CGNet) to learn concept representations from the perspective of the entire text corpus. Moreover, a concept-based document matching method is proposed to leverage advances in the recognition of local phrase features and corpus-level concept features. Extensive experiments on real-world data sets demonstrate that the new method can achieve a considerable improvement in comparing length-varying texts. In particular, our model achieved a 6.5% better F1 score compared to the best of the baseline models for a concept-project benchmark dataset.

pdf bib
Adversarial Self-Supervised Learning for Out-of-Domain Detection
Zhiyuan Zeng | Keqing He | Yuanmeng Yan | Hong Xu | Weiran Xu

Detecting out-of-domain (OOD) intents is crucial for the deployed task-oriented dialogue system. Previous unsupervised OOD detection methods only extract discriminative features of different in-domain intents while supervised counterparts can directly distinguish OOD and in-domain intents but require extensive labeled OOD data. To combine the benefits of both types, we propose a self-supervised contrastive learning framework to model discriminative semantic features of both in-domain intents and OOD intents from unlabeled data. Besides, we introduce an adversarial augmentation neural module to improve the efficiency and robustness of contrastive learning. Experiments on two public benchmark datasets show that our method can consistently outperform the baselines with a statistically significant margin.

pdf bib
Hierarchical Transformer for Task Oriented Dialog Systems
Bishal Santra | Potnuru Anusha | Pawan Goyal

Generative models for dialog systems have gained much interest because of the recent success of RNN and Transformer based models in tasks like question answering and summarization. Although the task of dialog response generation is generally seen as a sequence to sequence (Seq2Seq) problem, researchers in the past have found it challenging to train dialog systems using the standard Seq2Seq models. Therefore, to help the model learn meaningful utterance and conversation level features, Sordoni et al. (2015b), Serban et al. (2016) proposed Hierarchical RNN architecture, which was later adopted by several other RNN based dialog systems. With the transformer-based models dominating the seq2seq problems lately, the natural question to ask is the applicability of the notion of hierarchy in transformer-based dialog systems. In this paper, we propose a generalized framework for Hierarchical Transformer Encoders and show how a standard transformer can be morphed into any hierarchical encoder, including HRED and HIBERT like models, by using specially designed attention masks and positional encodings. We demonstrate that Hierarchical Encoding helps achieve better natural language understanding of the contexts in transformer-based models for task-oriented dialog systems through a wide range of experiments.

pdf bib
RTFE: A Recursive Temporal Fact Embedding Framework for Temporal Knowledge Graph Completion
Youri Xu | Haihong E | Meina Song | Wenyu Song | Xiaodong Lv | Wang Haotian | Yang Jinrui

Static knowledge graph (SKG) embedding (SKGE) has been studied intensively in the past years. Recently, temporal knowledge graph (TKG) embedding (TKGE) has emerged. In this paper, we propose a Recursive Temporal Fact Embedding (RTFE) framework to transplant SKGE models to TKGs and to enhance the performance of existing TKGE models for TKG completion. Different from previous work which ignores the continuity of states of TKG in time evolution, we treat the sequence of graphs as a Markov chain, which transitions from the previous state to the next state. RTFE takes the SKGE to initialize the embeddings of TKG. Then it recursively tracks the state transition of TKG by passing updated parameters / features between timestamps. Specifically, at each timestamp, we approximate the state transition as the gradient update process. Since RTFE learns each timestamp recursively, it can naturally transit to future timestamps. Experiments on five TKG datasets show the effectiveness of RTFE.

pdf bib
Multi-Grained Knowledge Distillation for Named Entity Recognition
Xuan Zhou | Xiao Zhang | Chenyang Tao | Junya Chen | Bing Xu | Wei Wang | Jing Xiao

Although pre-trained big models (e.g., BERT, ERNIE, XLNet, GPT3 etc.) have delivered top performance in Seq2seq modeling, their deployments in real-world applications are often hindered by the excessive computations and memory demand involved. For many applications, including named entity recognition (NER), matching the state-of-the-art result under budget has attracted considerable attention. Drawing power from the recent advance in knowledge distillation (KD), this work presents a novel distillation scheme to efficiently transfer the knowledge learned from big models to their more affordable counterpart. Our solution highlights the construction of surrogate labels through the k-best Viterbi algorithm to distill knowledge from the teacher model. To maximally assimilate knowledge into the student model, we propose a multi-grained distillation scheme, which integrates cross entropy involved in conditional random field (CRF) and fuzzy learning. To validate the effectiveness of our proposal, we conducted a comprehensive evaluation on five NER benchmarks, reporting across-the-board performance gains relative to competing prior art. We further discuss ablation results to dissect our gains.

pdf bib
TR-BERT: Dynamic Token Reduction for Accelerating BERT Inference
Deming Ye | Yankai Lin | Yufei Huang | Maosong Sun

Existing pre-trained language models (PLMs) are often computationally expensive in inference, making them impractical in various resource-limited real-world applications. To address this issue, we propose a dynamic token reduction approach to accelerate PLMs’ inference, named TR-BERT, which could flexibly adapt the layer number of each token in inference to avoid redundant calculation. Specifically, TR-BERT formulates the token reduction process as a multi-step token selection problem and automatically learns the selection strategy via reinforcement learning. The experimental results on several downstream NLP tasks show that TR-BERT is able to speed up BERT by 2-5 times to satisfy various performance demands. Moreover, TR-BERT can also achieve better performance with less computation in a suite of long-text tasks since its token-level layer number adaption greatly accelerates the self-attention operation in PLMs. The source code and experiment details of this paper can be obtained from https://github.com/thunlp/TR-BERT.

pdf bib
Breadth First Reasoning Graph for Multi-hop Question Answering
Yongjie Huang | Meng Yang

Recently, Graph Neural Networks (GNNs) have been used as a promising tool in the multi-hop question answering task. However, unnecessary updates and simple edge constructions prevent accurate answer span extraction in a more direct and interpretable way. In this paper, we propose a novel model of Breadth First Reasoning Graph (BFR-Graph), which presents a new way of message passing that better conforms to the reasoning process. In BFR-Graph, the reasoning message is required to start from the question node and pass to the next sentence nodes hop by hop until all the edges have been passed, which can effectively prevent each node from over-smoothing or being updated multiple times unnecessarily. To introduce more semantics, we also define the reasoning graph as a weighted graph by considering the number of co-occurring entities and the distance between sentences. Then we present a more direct and interpretable way to aggregate scores from different levels of granularity based on the GNN. On the HotpotQA leaderboard, the proposed BFR-Graph achieves state-of-the-art results on answer span prediction.
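The hop-by-hop passing order described above can be illustrated with a plain breadth-first traversal that starts at the question node and uses every edge exactly once. This is a sketch of the ordering idea only, not the BFR-Graph model: there are no learned messages or edge weights here, and the function name, the adjacency format, and the toy graph are all assumptions made for illustration.

```python
from collections import deque


def breadth_first_message_order(adjacency, start):
    """Return (sender, receiver) pairs hop by hop, passing each undirected edge once."""
    passed = set()          # edges that already carried a message
    visited = {start}       # nodes already queued as senders
    order = []
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        for neighbor in adjacency.get(node, []):
            edge = frozenset((node, neighbor))
            if edge in passed:
                continue  # never send along the same edge twice
            passed.add(edge)
            order.append((node, neighbor))
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append(neighbor)
    return order


# Toy graph (hypothetical): one question node "q" and three sentence nodes.
toy_graph = {
    "q": ["s1", "s2"],
    "s1": ["q", "s3"],
    "s2": ["q", "s3"],
    "s3": ["s1", "s2"],
}
message_order = breadth_first_message_order(toy_graph, "q")
```

Because an edge is marked as passed the first time it is used, no message travels back toward the question node, which is one simple way to avoid repeated updates of the same node along the same edge.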

pdf bib
Unsupervised Multi-hop Question Answering by Question Generation
Liangming Pan | Wenhu Chen | Wenhan Xiong | Min-Yen Kan | William Yang Wang

Obtaining training data for multi-hop question answering (QA) is time-consuming and resource-intensive. We explore the possibility of training a well-performing multi-hop QA model without referencing any human-labeled multi-hop question-answer pairs, i.e., unsupervised multi-hop QA. We propose MQA-QG, an unsupervised framework that can generate human-like multi-hop training data from both homogeneous and heterogeneous data sources. MQA-QG generates questions by first selecting/generating relevant information from each data source and then integrating the multiple pieces of information to form a multi-hop question. Using only generated training data, we can train a competent multi-hop QA model which achieves 61% and 83% of the supervised learning performance for the HybridQA and the HotpotQA dataset, respectively. We also show that pretraining the QA system with the generated data would greatly reduce the demand for human-annotated training data. Our codes are publicly available at https://github.com/teacherpeterpan/Unsupervised-Multi-hop-QA.

Sliding Selector Network with Dynamic Memory for Extractive Summarization of Long Documents
Peng Cui | Le Hu

Neural summarization models suffer from the length limitation of the text encoder. Long documents have to be truncated before they are sent to the model, which results in a huge loss of summary-relevant content. To address this issue, we propose the sliding selector network with dynamic memory for extractive summarization of long-form documents, which employs a sliding window to extract summary sentences segment by segment. Moreover, we adopt a memory mechanism to preserve and update the history information dynamically, allowing semantic flow across different windows. Experimental results on two large-scale datasets that consist of scientific papers demonstrate that our model substantially outperforms previous state-of-the-art models. In addition, we perform qualitative and quantitative investigations of how our model works and where the performance gain comes from.
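The segment-by-segment selection with a running memory can be sketched in a few lines. This is a heavily simplified illustration, not the paper's model: the neural selector and dynamic memory are replaced by a hypothetical word-novelty score and a plain list of previously selected sentences, and all names and data below are assumptions.

```python
def novelty_score(sentence, selected):
    """Toy scorer (an assumption): count words not yet covered by the memory."""
    seen = set()
    for previous in selected:
        seen |= set(previous.lower().split())
    return len(set(sentence.lower().split()) - seen)


def sliding_window_select(sentences, window_size, per_window, score_fn):
    """Pick `per_window` sentences from each window, conditioning on memory."""
    memory = []  # indices of sentences selected in earlier windows
    for start in range(0, len(sentences), window_size):
        window = range(start, min(start + window_size, len(sentences)))
        history = [sentences[i] for i in memory]
        ranked = sorted(
            window, key=lambda i: score_fn(sentences[i], history), reverse=True
        )
        memory.extend(sorted(ranked[:per_window]))  # keep document order
    return [sentences[i] for i in memory]


document = [
    "the model reads text",
    "the model",
    "the model reads text",
    "memory keeps earlier context",
]
summary = sliding_window_select(document, window_size=2, per_window=1,
                                score_fn=novelty_score)
print(summary)
```

Because the second window is scored against what the first window already selected, the repeated sentence is skipped in favor of the one that adds new words.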

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations

Avi Sil | Xi Victoria Lin

PhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing
Linh The Nguyen | Dat Quoc Nguyen

We present the first multi-task learning model, named PhoNLP, for joint Vietnamese part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT (Nguyen and Nguyen, 2020) for each task independently. We publicly release PhoNLP as an open-source toolkit under the Apache License 2.0. Although we specify PhoNLP for Vietnamese, our PhoNLP training and evaluation command scripts can in fact work directly for other languages that have a pre-trained BERT-based language model and gold annotated corpora available for the three tasks of POS tagging, NER and dependency parsing. We hope that PhoNLP can serve as a strong baseline and useful toolkit for future NLP research and applications, not only for Vietnamese but also for other languages. PhoNLP is available at https://github.com/VinAIResearch/PhoNLP

NAMER: A Node-Based Multitasking Framework for Multi-Hop Knowledge Base Question Answering
Minhao Zhang | Ruoyu Zhang | Lei Zou | Yinnian Lin | Sen Hu

We present NAMER, an open-domain Chinese knowledge base question answering system based on a novel node-based framework that better grasps the structural mapping between questions and KB queries by aligning the nodes in a query with their corresponding mentions in the question. Equipped with techniques including data augmentation and multitasking, the proposed framework outperforms the previous SoTA on the CCKS CKBQA dataset. Moreover, we develop a novel data annotation strategy that facilitates the node-to-mention alignment. A dataset (https://github.com/ridiculouz/CKBQA) annotated with this strategy is also published to promote further research. An online demo of NAMER (http://kbqademo.gstore.cn) is provided to visualize our framework and supply extra information for users. A video illustration (https://youtu.be/yetnVye_hg4) of NAMER is also available.

FITAnnotator: A Flexible and Intelligent Text Annotation System
Yanzeng Li | Bowen Yu | Li Quangang | Tingwen Liu

In this paper, we introduce FITAnnotator, a generic web-based tool for efficient text annotation. Benefiting from its fully modular architecture, FITAnnotator provides a systematic solution for the annotation of a variety of natural language processing tasks, including classification, sequence tagging and semantic role annotation, regardless of the language. Three kinds of interfaces are developed to annotate instances, evaluate annotation quality and manage the annotation task for annotators, reviewers and managers, respectively. FITAnnotator also gives intelligent annotations by introducing a task-specific assistant to support and guide the annotators based on active learning and incremental learning strategies. This assistant is able to update effectively from annotator feedback and easily handle incremental labeling scenarios.

Robustness Gym: Unifying the NLP Evaluation Landscape
Karan Goel | Nazneen Fatema Rajani | Jesse Vig | Zachary Taschdjian | Mohit Bansal | Christopher Ré

Despite impressive performance on standard benchmarks, natural language processing (NLP) models are often brittle when deployed in real-world systems. In this work, we identify challenges with evaluating NLP systems and propose a solution in the form of Robustness Gym (RG), a simple and extensible evaluation toolkit that unifies four standard evaluation paradigms: subpopulations, transformations, evaluation sets, and adversarial attacks. By providing a common platform for evaluation, RG enables practitioners to compare results from disparate evaluation paradigms with a single click, and to easily develop and share novel evaluation methods using a built-in set of abstractions. RG is under active development and we welcome feedback and contributions from the community.

EventPlus: A Temporal Event Understanding Pipeline
Mingyu Derek Ma | Jiao Sun | Mu Yang | Kung-Hsiang Huang | Nuan Wen | Shikhar Singh | Rujun Han | Nanyun Peng

We present EventPlus, a temporal event understanding pipeline that integrates various state-of-the-art event understanding components, including event trigger and type detection, event argument detection, event duration and temporal relation extraction. Event information, especially event temporal knowledge, is a type of common sense knowledge that helps people understand how stories evolve and provides predictive hints for future events. EventPlus, as the first comprehensive temporal event understanding pipeline, provides a convenient tool for users to quickly obtain annotations about events and their temporal information for any user-provided document. Furthermore, we show that EventPlus can be easily adapted to other domains (e.g., the biomedical domain). We make EventPlus publicly available to facilitate event-related information extraction and downstream applications.

ActiveAnno: General-Purpose Document-Level Annotation Tool with Active Learning Integration
Max Wiechmann | Seid Muhie Yimam | Chris Biemann

ActiveAnno is an annotation tool focused on document-level annotation tasks developed both for industry and research settings. It is designed to be a general-purpose tool with a wide variety of use cases. It features a modern and responsive web UI for creating annotation projects, conducting annotations, adjudicating disagreements, and analyzing annotation results. ActiveAnno embeds a highly configurable and interactive user interface. The tool also integrates a RESTful API that enables integration into other software systems, including an API for machine learning integration. ActiveAnno is built with extensible design and easy deployment in mind, all to enable users to perform annotation tasks with high efficiency and high-quality annotation results.

TextEssence: A Tool for Interactive Analysis of Semantic Shifts Between Corpora
Denis Newman-Griffis | Venkatesh Sivaraman | Adam Perer | Eric Fosler-Lussier | Harry Hochheiser

Embeddings of words and concepts capture syntactic and semantic regularities of language; however, they have seen limited use as tools to study characteristics of different corpora and how they relate to one another. We introduce TextEssence, an interactive system designed to enable comparative analysis of corpora using embeddings. TextEssence includes visual, neighbor-based, and similarity-based modes of embedding analysis in a lightweight, web-based interface. We further propose a new measure of embedding confidence based on nearest neighborhood overlap, to assist in identifying high-quality embeddings for corpus analysis. A case study on COVID-19 scientific literature illustrates the utility of the system. TextEssence can be found at https://textessence.github.io.

RESIN: A Dockerized Schema-Guided Cross-document Cross-lingual Cross-media Information Extraction and Event Tracking System
Haoyang Wen | Ying Lin | Tuan Lai | Xiaoman Pan | Sha Li | Xudong Lin | Ben Zhou | Manling Li | Haoyu Wang | Hongming Zhang | Xiaodong Yu | Alexander Dong | Zhenhailong Wang | Yi Fung | Piyush Mishra | Qing Lyu | Dídac Surís | Brian Chen | Susan Windisch Brown | Martha Palmer | Chris Callison-Burch | Carl Vondrick | Jiawei Han | Dan Roth | Shih-Fu Chang | Heng Ji

We present a new information extraction system that can automatically construct temporal event graphs from a collection of news documents from multiple sources, multiple languages (English and Spanish for our experiment), and multiple data modalities (speech, text, image and video). The system advances the state of the art in two respects: (1) extending from sentence-level event extraction to cross-document cross-lingual cross-media event extraction, coreference resolution and temporal event tracking; (2) using a human-curated event schema library to match and enhance the extraction output. We have made the dockerized system publicly available for research purposes on GitHub, with a demo video.

MUDES: Multilingual Detection of Offensive Spans
Tharindu Ranasinghe | Marcos Zampieri

Interest in offensive content identification in social media has grown substantially in recent years. Previous work has dealt mostly with post-level annotations. However, identifying offensive spans is useful in many ways. To help cope with this important challenge, we present MUDES, a multilingual system to detect offensive spans in texts. MUDES features pre-trained models, a Python API for developers, and a user-friendly web-based interface. A detailed description of MUDES’ components is presented in this paper.

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Student Research Workshop

Esin Durmus | Vivek Gupta | Nelson Liu | Nanyun Peng | Yu Su

Shuffled-token Detection for Refining Pre-trained RoBERTa
Subhadarshi Panda | Anjali Agrawal | Jeewon Ha | Benjamin Bloch

State-of-the-art transformer models have achieved robust performance on a variety of NLP tasks. Many of these approaches have employed domain-agnostic pre-training tasks to train models that yield highly generalized sentence representations that can be fine-tuned for specific downstream tasks. We propose refining a pre-trained NLP model using the objective of detecting shuffled tokens. We use a sequential approach, starting with the pre-trained RoBERTa model and training it further with this objective. Applying a random shuffling strategy at the word level, we found that our approach enables the RoBERTa model to achieve better performance on 4 out of 7 GLUE tasks. Our results indicate that learning to detect shuffled tokens is a promising approach to learning more coherent sentence representations.
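A minimal sketch of how word-level training examples for a shuffled-token detection objective might be built is given below. The shuffle fraction, the labeling rule (1 where a position no longer holds its original word), and the function name are assumptions for illustration; the abstract does not specify these details.

```python
import random


def make_shuffled_example(words, shuffle_fraction, rng):
    """Permute a random subset of word positions and label changed positions.

    Returns (shuffled_words, labels), where labels[i] is 1 if position i no
    longer holds its original word and 0 otherwise (assumed labeling rule).
    """
    n = len(words)
    k = min(n, max(2, round(n * shuffle_fraction)))
    positions = rng.sample(range(n), k)   # positions taking part in the shuffle
    targets = positions[:]
    rng.shuffle(targets)                  # where each chosen word moves to
    shuffled = list(words)
    for source, target in zip(positions, targets):
        shuffled[target] = words[source]
    labels = [int(shuffled[i] != words[i]) for i in range(n)]
    return shuffled, labels


sentence = "learning to detect shuffled tokens may help sentence representations".split()
shuffled_words, labels = make_shuffled_example(sentence, 0.3, random.Random(0))
print(list(zip(shuffled_words, labels)))
```

A token-classification head on top of the pre-trained encoder could then be trained to predict `labels` from `shuffled_words`; that training loop is omitted here.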

Morphology-Aware Meta-Embeddings for Tamil
Arjun Sai Krishnan | Seyoon Ragavan

In this work, we explore generating morphologically enhanced word embeddings for Tamil, a highly agglutinative South Indian language with rich morphology that remains low-resource with regard to NLP tasks. We present the first-ever word analogy dataset for Tamil, consisting of 4499 hand-curated word tetrads across 10 semantic and 13 morphological relation types. Using a rule-based segmenter to capture morphology as well as meta-embedding techniques, we train meta-embeddings that outperform existing baselines by 16% on our analogy task and appear to mitigate a previously observed trade-off between semantic and morphological accuracy.

Seed Word Selection for Weakly-Supervised Text Classification with Unsupervised Error Estimation
Yiping Jin | Akshay Bhatia | Dittaya Wanvarie

Weakly-supervised text classification aims to induce text classifiers from only a few user-provided seed words. The vast majority of previous work assumes high-quality seed words are given. However, expert-annotated seed words are sometimes non-trivial to come up with. Furthermore, in the weakly-supervised learning setting, we do not have any labeled documents to measure the seed words’ efficacy, making the seed word selection process a walk in the dark. In this work, we remove the need for expert-curated seed words by first mining (noisy) candidate seed words associated with the category names. We then train interim models with individual candidate seed words. Lastly, we estimate the interim models’ error rates in an unsupervised manner. The seed words that yield the lowest estimated error rates are added to the final seed word set. A comprehensive evaluation of six binary classification tasks on four popular datasets demonstrates that the proposed method outperforms a baseline using only category name seed words and obtains performance comparable to a counterpart using expert-annotated seed words.

Multi-Task Learning of Generation and Classification for Emotion-Aware Dialogue Response Generation
Tatsuya Ide | Daisuke Kawahara

For a computer to interact naturally with a human, it needs to be human-like. In this paper, we propose a neural response generation model with multi-task learning of generation and classification, focusing on emotion. Our model, based on BART (Lewis et al., 2020), a pre-trained transformer encoder-decoder model, is trained to generate responses and recognize emotions simultaneously. Furthermore, we weight the losses for the tasks to control the update of parameters. Automatic evaluations and crowdsourced manual evaluations show that the proposed model makes generated responses more emotionally aware.

Comparison of Grammatical Error Correction Using Back-Translation Models
Aomi Koyama | Kengo Hotate | Masahiro Kaneko | Mamoru Komachi

Grammatical error correction (GEC) suffers from a lack of sufficient parallel data. Studies on GEC have proposed several methods to generate pseudo data, which comprise pairs of grammatical and artificially produced ungrammatical sentences. Currently, a mainstream approach to generate pseudo data is back-translation (BT). Most previous studies using BT have employed the same architecture for both the GEC and BT models. However, GEC models have different correction tendencies depending on the architecture of their models. Thus, in this study, we compare the correction tendencies of GEC models trained on pseudo data generated by three BT models with different architectures, namely, Transformer, CNN, and LSTM. The results confirm that the correction tendencies for each error type are different for every BT model. In addition, we investigate the correction tendencies when using a combination of pseudo data generated by different BT models. As a result, we find that the combination of different BT models improves or interpolates the performance of each error type compared with using a single BT model with different seeds.

Hie-BART: Document Summarization with Hierarchical BART
Kazuki Akiyama | Akihiro Tamura | Takashi Ninomiya

This paper proposes a new abstractive document summarization model, hierarchical BART (Hie-BART), which captures the hierarchical structures of a document (i.e., sentence-word structures) in the BART model. Although the existing BART model has achieved state-of-the-art performance on document summarization tasks, the model does not capture interactions between sentence-level information and word-level information. In machine translation tasks, the performance of neural machine translation models has been improved by incorporating multi-granularity self-attention (MG-SA), which captures the relationships between words and phrases. Inspired by this previous work, the proposed Hie-BART model incorporates MG-SA into the encoder of the BART model to capture sentence-word structures. Evaluations on the CNN/Daily Mail dataset show that the proposed Hie-BART model outperforms some strong baselines and improves the performance of a non-hierarchical BART model (+0.23 ROUGE-L).

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Tutorials

Greg Kondrak | Kalina Bontcheva | Dan Gillick

A Tutorial on Evaluation Metrics used in Natural Language Generation
Mitesh M. Khapra | Ananya B. Sai

The advent of Deep Learning and the availability of large-scale datasets have accelerated research on Natural Language Generation, with a focus on newer tasks and better models. With such rapid progress, it is vital to assess the extent of scientific progress made and identify the areas and components that need improvement. To accomplish this in an automatic and reliable manner, the NLP community has actively pursued the development of automatic evaluation metrics. Especially in the last few years, there has been an increasing focus on evaluation metrics, with several criticisms of existing metrics and proposals for several new metrics. This tutorial presents the evolution of automatic evaluation metrics to their current state, along with the emerging trends in this field, by specifically addressing the following questions: (i) What makes NLG evaluation challenging? (ii) Why do we need automatic evaluation metrics? (iii) What are the existing automatic evaluation metrics and how can they be organised in a coherent taxonomy? (iv) What are the criticisms and shortcomings of existing metrics? (v) What are the possible future directions of research?

Crowdsourcing Natural Language Data at Scale: A Hands-On Tutorial
Alexey Drutsa | Dmitry Ustalov | Valentina Fedorova | Olga Megorskaya | Daria Baidakova

In this tutorial, we present a portion of unique industry experience in efficient natural language data annotation via crowdsourcing shared by both leading researchers and engineers from Yandex. We will make an introduction to data labeling via public crowdsourcing marketplaces and will present the key components of efficient label collection. This will be followed by a practical session, where participants address a real-world language resource production task, experiment with selecting settings for the labeling process, and launch their label collection project on one of the largest crowdsourcing marketplaces. The projects will be run on real crowds within the tutorial session and we will present useful quality control techniques and provide the attendees with an opportunity to discuss their own annotation ideas.

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers

Young-bum Kim | Yunyao Li | Owen Rambow

When does text prediction benefit from additional context? An exploration of contextual signals for chat and email messages
Stojan Trajanovski | Chad Atalla | Kunho Kim | Vipul Agarwal | Milad Shokouhi | Chris Quirk

Email and chat communication tools are increasingly important for completing daily tasks. Accurate real-time phrase completion can save time and bolster productivity. Modern text prediction algorithms are based on large language models, which typically rely on the prior words in a message to predict a completion. We examine how additional contextual signals (from previous messages, time, and subject) affect the performance of a commercial text prediction model. We compare contextual text prediction in chat and email messages from two of the largest commercial platforms, Microsoft Teams and Outlook, finding that contextual signals contribute to performance differently between these scenarios. On emails, time context is most beneficial, with small relative gains of 2% over the baseline. In chat scenarios, by contrast, using a tailored set of previous messages as context yields relative improvements over the baseline of between 9.3% and 18.6% across various critical service-oriented text prediction metrics.

Proteno: Text Normalization with Limited Data for Fast Deployment in Text to Speech Systems
Shubhi Tyagi | Antonio Bonafonte | Jaime Lorenzo-Trueba | Javier Latorre

Developing Text Normalization (TN) systems for Text-to-Speech (TTS) in new languages is hard. We propose a novel architecture to facilitate it for multiple languages while using less than 3% of the data used by the state-of-the-art results on English. We treat TN as a sequence classification problem and propose a granular tokenization mechanism that enables the system to learn the majority of the classes and their normalizations from the training data itself. This is further combined with minimal precoded linguistic knowledge for other classes. We publish the first results on TN for TTS in Spanish and Tamil and also demonstrate that the performance of the approach is comparable with previous work on English. All annotated datasets used for experimentation will be released.

Autocorrect in the Process of Translation: Multi-task Learning Improves Dialogue Machine Translation
Tao Wang | Chengqi Zhao | Mingxuan Wang | Lei Li | Deyi Xiong

Automatic translation of dialogue texts is much needed in many real-life scenarios. However, existing neural machine translation delivers unsatisfying results. In this paper, we conduct a deep analysis of a dialogue corpus and summarize three major issues in dialogue translation: pronoun dropping, punctuation dropping, and typos. In response to these challenges, we propose a joint learning method to identify omissions and typos, and utilize context to translate dialogue utterances. To properly evaluate the performance, we propose a manually annotated dataset with 1,931 Chinese-English parallel utterances from 300 dialogues as a benchmark testbed for dialogue translation. Our experiments show that the proposed method improves translation quality by 3.2 BLEU over the baselines. It also raises the recovery rate of omitted pronouns from 26.09% to 47.16%. We will publish the code and dataset publicly at https://xxx.xx.

Practical Transformer-based Multilingual Text Classification
Cindy Wang | Michele Banko

Transformer-based methods are appealing for multilingual text classification, but common research benchmarks like XNLI (Conneau et al., 2018) do not reflect the data availability and task variety of industry applications. We present an empirical comparison of transformer-based text classification models in a variety of practical monolingual and multilingual pretraining and fine-tuning settings. We evaluate these methods on two distinct tasks in five different languages. Departing from prior work, our results show that multilingual language models can outperform monolingual ones in some downstream tasks and target languages. We additionally show that practical modifications such as task- and domain-adaptive pretraining and data augmentation can improve classification performance without the need for additional labeled data.

Graph-based Multilingual Product Retrieval in E-Commerce Search
Hanqing Lu | Youna Hu | Tong Zhao | Tony Wu | Yiwei Song | Bing Yin

Nowadays, with many e-commerce platforms conducting global business, e-commerce search systems are required to handle product retrieval in multilingual scenarios. Moreover, compared with maintaining per-country e-commerce search systems, having a universal system across countries can further reduce operational and computational costs and facilitate business expansion to new countries. In this paper, we introduce a universal end-to-end multilingual retrieval system, and discuss our learnings and technical details from training and deploying the system to serve billion-scale product retrieval for e-commerce search. In particular, we propose a multilingual graph attention based retrieval network that leverages recent advances in transformer-based multilingual language models and graph neural network architectures to capture the interactions between search queries and items in e-commerce search. Offline experiments on data from five countries show that our algorithm outperforms the state-of-the-art baselines by 35% recall and 25% mAP on average. Moreover, the proposed model shows a significant increase in conversion and revenue in online A/B experiments and has been deployed in production for multiple countries.

Query2Prod2Vec: Grounded Word Embeddings for eCommerce
Federico Bianchi | Jacopo Tagliabue | Bingqing Yu

We present Query2Prod2Vec, a model that grounds lexical representations for product search in product embeddings: in our model, meaning is a mapping between words and a latent space of products in a digital shop. We leverage shopping sessions to learn the underlying space and use merchandising annotations to build lexical analogies for evaluation: our experiments show that our model is more accurate than known techniques from the NLP and IR literature. Finally, we stress the importance of data efficiency for product search outside of retail giants, and highlight how Query2Prod2Vec fits with practical constraints faced by most practitioners.

An Architecture for Accelerated Large-Scale Inference of Transformer-Based Language Models
Amir Ganiev | Colton Chapin | Anderson De Andrade | Chen Liu

This work demonstrates the development process of a machine learning architecture for inference that can scale to a large volume of requests. We used a BERT model that was fine-tuned for emotion analysis, returning a probability distribution of emotions given a paragraph. The model was deployed as a gRPC service on Kubernetes. Apache Spark was used to perform inference in batches by calling the service. We encountered some performance and concurrency challenges and created solutions to achieve faster running time. Starting with 200 successful inference requests per minute, we were able to achieve as high as 18 thousand successful requests per minute with the same batch job resource allocation. As a result, we successfully stored emotion probabilities for 95 million paragraphs within 96 hours.
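The throughput figures in this abstract can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted above; the variable names are illustrative, and the calculation assumes the 96 hours were spent entirely on inference.

```python
# Figures quoted in the abstract.
paragraphs = 95_000_000
hours = 96
initial_rate = 200        # successful requests per minute at the start
peak_rate = 18_000        # best successful requests per minute reported

# Average rate needed to finish the whole job in the stated time.
average_per_minute = paragraphs / (hours * 60)
speedup = peak_rate / initial_rate

print(f"average rate: {average_per_minute:.0f} requests/minute")
print(f"peak speedup over the starting rate: {speedup:.0f}x")
```

The implied average of roughly 16.5 thousand requests per minute sits just below the reported peak of 18 thousand, so the two figures are consistent with each other.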

Cost-effective Deployment of BERT Models in Serverless Environment
Marek Suppa | Katarína Benešová | Andrej Švec

In this study, we demonstrate the viability of deploying BERT-style models to AWS Lambda in a production environment. Since the freely available pre-trained models are too large to be deployed in this environment, we utilize knowledge distillation and fine-tune the models on proprietary datasets for two real-world tasks: sentiment analysis and semantic textual similarity. As a result, we obtain models that are tuned for a specific domain and deployable in the serverless environment. The subsequent performance analysis shows that this solution not only achieves latency levels acceptable for production use but is also a cost-effective alternative for small-to-medium size deployments of BERT models, all without any infrastructure overhead.

Noise Robust Named Entity Understanding for Voice Assistants
Deepak Muralidharan | Joel Ruben Antony Moniz | Sida Gao | Xiao Yang | Justine Kao | Stephen Pulman | Atish Kothari | Ray Shen | Yinying Pan | Vivek Kaul | Mubarak Seyed Ibrahim | Gang Xiang | Nan Dun | Yidan Zhou | Andy O | Yuan Zhang | Pooja Chitkara | Xuan Wang | Alkesh Patel | Kushal Tayal | Roger Zheng | Peter Grasch | Jason D Williams | Lin Li

Named Entity Recognition (NER) and Entity Linking (EL) play an essential role in voice assistant interaction, but are challenging due to the special difficulties associated with spoken user queries. In this paper, we propose a novel architecture that jointly solves the NER and EL tasks by combining them in a joint reranking module. We show that our proposed framework improves NER accuracy by up to 3.13% and EL accuracy by up to 3.6% in F1 score. The features used also lead to better accuracies in other natural language understanding tasks, such as domain classification and semantic parsing.

Intent Features for Rich Natural Language Understanding
Brian Lester | Sagnik Ray Choudhury | Rashmi Prasad | Srinivas Bangalore

Complex natural language understanding modules in dialog systems have a richer understanding of user utterances, and thus are critical in providing a better user experience. However, these models are often created from scratch, for specific clients and use cases and require the annotation of large datasets. This encourages the sharing of annotated data across multiple clients. To facilitate this we introduce the idea of intent features: domain and topic agnostic properties of intents that can be learnt from the syntactic cues only, and hence can be shared. We introduce a new neural network architecture, the Global-Local model, that shows significant improvement over strong baselines for identifying these features in a deployed, multi-intent natural language understanding module, and more generally in a classification setting where a part of an utterance has to be classified utilizing the whole context.

Ad Headline Generation using Self-Critical Masked Language Model
Yashal Shakti Kanungo | Sumit Negi | Aruna Rajan

For any E-commerce website it is a nontrivial problem to build enduring advertisements that attract shoppers. It is hard to pass the creative quality bar of the website, especially at a large scale. We thus propose a programmatic solution to generate product advertising headlines using retail content. We propose a state of the art application of Reinforcement Learning (RL) Policy gradient methods on Transformer (Vaswani et al., 2017) based Masked Language Models (Devlin et al., 2019). Our method creates the advertising headline by jointly conditioning on multiple products that a seller wishes to advertise. We demonstrate that our method outperforms existing Transformer and LSTM + RL methods in overlap metrics and quality audits. We also show that our model generated headlines outperform human submitted headlines in terms of both grammar and creative quality as determined by audits.

pdf bib
LATEX-Numeric: Language Agnostic Text Attribute Extraction for Numeric Attributes
Kartik Mehta | Ioana Oprea | Nikhil Rasiwasia

In this paper, we present LATEX-Numeric, a high-precision, fully automated, scalable framework for extracting E-commerce numeric attributes from unstructured product text such as product descriptions. Most past work on attribute extraction is not scalable because it relies on manually curated training data, either with or without the use of active learning. We rely on distant supervision for training data generation, removing the dependency on manual labels. One issue with distant supervision is that it leads to incomplete training annotation due to missing attribute values while matching. We propose a multi-task learning architecture to deal with missing labels in the training data, leading to an F1 improvement of 9.2% for numeric attributes over a state-of-the-art single-task architecture. While the multi-task architecture benefits both numeric and non-numeric attributes, we present automated techniques to further improve the numeric attribute extraction models. Numeric attributes require a list of units (or aliases) for better matching with distant supervision. We propose an automated algorithm for alias creation using unstructured text and attribute values, leading to a 20.2% F1 improvement. Extensive experiments on real-world datasets for 20 numeric attributes across 5 product categories and 3 English marketplaces show that LATEX-Numeric achieves a high F1-score without any manual intervention, making it suitable for practical applications. Finally, we show that the improvements are language-agnostic and that LATEX-Numeric achieves a 13.9% F1 improvement for 3 non-English languages.
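
The role of unit aliases in distant supervision can be illustrated with a minimal sketch. It is not the LATEX-Numeric implementation: the function name `label_numeric_attribute`, the sample product text, and the alias sets are all hypothetical. It only shows how a known catalog value is matched against free text, and how a missing alias leaves a mention unlabeled.

```python
import re


def label_numeric_attribute(text, value, unit_aliases):
    """Return (start, end) character spans where `value` is followed by a known unit alias.

    Hypothetical helper: the catalog value and the alias list act as the
    distant-supervision signal; no manual labels are involved.
    """
    number = re.escape(value)
    # Try longer aliases first so "litre" is preferred over "l".
    units = "|".join(
        re.escape(u) for u in sorted(unit_aliases, key=len, reverse=True)
    )
    # The lookbehind prevents matching "1.5" inside a longer number such as "21.5".
    pattern = re.compile(rf"(?<![\d.]){number}\s*(?:{units})\b", re.IGNORECASE)
    return [m.span() for m in pattern.finditer(text)]


# Hypothetical product description and catalog value.
text = "Powerful 1200 W motor with a 1.5 litre jar; weighs 3 kg."

# With a rich alias list the mention is found and can be used as a training label.
spans = label_numeric_attribute(text, "1.5", {"l", "litre", "liter"})
matched = [text[start:end] for start, end in spans]

# With only the abbreviation "l" the same mention is missed, which is the
# kind of incomplete annotation that better alias lists are meant to reduce.
missed = label_numeric_attribute(text, "1.5", {"l"})
```

Here `matched` holds the labeled mention and `missed` is empty, since "litre" is not covered by the single alias "l".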

pdf bib
Training Language Models under Resource Constraints for Adversarial Advertisement Detection
Eshwar Shamanna Girishekar | Shiv Surya | Nishant Nikhil | Dyut Kumar Sil | Sumit Negi | Aruna Rajan

Advertising on e-commerce and social media sites delivers ad impressions at web scale on a daily basis, driving value to both shoppers and advertisers. This scale necessitates programmatic ways of detecting unsuitable content in ads to safeguard customer experience and trust. This paper focusses on techniques for training text classification models under resource constraints, built as part of automated solutions for advertising content moderation. We show how weak supervision, curriculum learning and multi-lingual training can be applied effectively to fine-tune BERT and its variants for text classification tasks in conjunction with different data augmentation strategies. Our extensive experiments on multiple languages show that these techniques detect adversarial ad categories with a substantial gain in precision at a high recall threshold over the baseline.

pdf bib
Industry Scale Semi-Supervised Learning for Natural Language Understanding
Luoxin Chen | Francisco Garcia | Varun Kumar | He Xie | Jianhua Lu

This paper presents a production Semi-Supervised Learning (SSL) pipeline based on the student-teacher framework, which leverages millions of unlabeled examples to improve Natural Language Understanding (NLU) tasks. We investigate two questions related to the use of unlabeled data in a production SSL context: 1) how to select samples from a huge unlabeled data pool that are beneficial for SSL training, and 2) how the selected data affects the performance of different state-of-the-art SSL techniques. We compare four widely used SSL techniques, Pseudo-label (PL), Knowledge Distillation (KD), Virtual Adversarial Training (VAT) and Cross-View Training (CVT), in conjunction with two data selection methods: committee-based selection and submodular optimization based selection. We further examine the benefits and drawbacks of these techniques when applied to intent classification (IC) and named entity recognition (NER) tasks, and provide guidelines specifying when each of these methods might be beneficial to improve large scale NLU systems.

Proceedings of the Second Workshop on Advances in Language and Vision Research

pdf bib
Proceedings of the Second Workshop on Advances in Language and Vision Research
Xin | Ronghang Hu | Drew Hudson | Tsu-Jui Fu | Marcus Rohrbach | Daniel Fried

pdf bib
Leveraging Partial Dependency Trees to Control Image Captions
Wenjie Zhong | Yusuke Miyao

Controlling the generation of image captions has attracted much attention recently. In this paper, we propose a framework leveraging partial syntactic dependency trees as control signals to make image captions include specified words and their syntactic structures. To achieve this purpose, we propose a Syntactic Dependency Structure Aware Model (SDSAM), which explicitly learns to generate the syntactic structures of image captions to include given partial dependency trees. In addition, we propose a metric to evaluate how many specified words and their syntactic dependencies are included in generated captions. We carry out experiments on two standard datasets: Microsoft COCO and Flickr30k. Empirical results show that image captions generated by our model are effectively controlled in terms of specified words and their syntactic structures. The code is available on GitHub.

pdf bib
Grounding Plural Phrases: Countering Evaluation Biases by Individuation
Julia Suter | Letitia Parcalabescu | Anette Frank

Phrase grounding (PG) is a multimodal task that grounds language in images. PG systems are evaluated on well-known benchmarks, using Intersection over Union (IoU) as evaluation metric. This work highlights a disconcerting bias in the evaluation of grounded plural phrases, which arises from representing sets of objects as a union box covering all component bounding boxes, in conjunction with the IoU metric. We detect, analyze and quantify an evaluation bias in the grounding of plural phrases and define a novel metric, c-IoU, based on a union box’s component boxes. We experimentally show that our new metric greatly alleviates this bias and recommend using it for fairer evaluation of plural phrases in PG tasks.
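
To make the setup concrete, here is a minimal sketch of the standard IoU metric and of a union box built from component boxes. The box coordinates are hypothetical, and the example shows only one way a union-box representation can be uninformative; it is not necessarily the bias the authors quantify. The c-IoU metric is not defined in this abstract, so it is not implemented here.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def union_box(boxes):
    """Smallest axis-aligned box covering all component boxes."""
    xs1, ys1, xs2, ys2 = zip(*boxes)
    return (min(xs1), min(ys1), max(xs2), max(ys2))


def area(box):
    x1, y1, x2, y2 = box
    return (x2 - x1) * (y2 - y1)


# Hypothetical plural phrase ("two dogs") with two small, far-apart objects.
components = [(0, 0, 2, 2), (8, 8, 10, 10)]
gold = union_box(components)

# A prediction that simply covers the whole region scores perfectly against
# the union box, although the objects fill only a small part of that box.
prediction = (0, 0, 10, 10)
score = iou(prediction, gold)
object_coverage = sum(area(c) for c in components) / area(gold)
```

In this toy case the prediction reaches an IoU of 1.0 while the two objects cover only 8% of the union box, which is why a metric that looks at the component boxes can be more informative.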

pdf bib
Learning to Learn Semantic Factors in Heterogeneous Image Classification
Boyue Fan | Zhenting Liu

Few-shot learning aims to recognize novel classes with a few labeled samples per class. Although numerous meta-learning methods have made significant progress, they struggle to directly address the heterogeneity of training and evaluating task distributions, resulting in the domain shift problem when transitioning to new tasks with disjoint spaces. In this paper, we propose a novel method to deal with the heterogeneity. Specifically, by simulating class-difference domain shift during the meta-train phase, a bilevel optimization procedure is applied to learn a transferable representation space that can rapidly adapt to heterogeneous tasks. Experiments demonstrate the effectiveness of our proposed method.

Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

pdf bib
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas
Manuel Mager | Arturo Oncevay | Annette Rios | Ivan Vladimir Meza Ruiz | Alexis Palmer | Graham Neubig | Katharina Kann

pdf bib
Investigating variation in written forms of Nahuatl using character-based language models
Robert Pugh | Francis Tyers

We describe experiments with character-based language modeling for written variants of Nahuatl. Using a standard LSTM model and publicly available Bible translations, we explore how character language models can be applied to the tasks of estimating mutual intelligibility, identifying genetic similarity, and distinguishing written variants. We demonstrate that these simple language models are able to capture similarities and differences that have been described in the linguistic literature.

pdf bib
Morphological Segmentation for Seneca
Zoey Liu | Robert Jimerson | Emily Prud’hommeaux

This study takes up the task of low-resource morphological segmentation for Seneca, a critically endangered and morphologically complex Native American language primarily spoken in what is now New York State and Ontario. The labeled data in our experiments comes from two sources: one digitized from a publicly available grammar book and the other collected from informal sources. We treat these two sources as distinct domains and investigate different evaluation designs for model selection. The first design abides by standard practices and evaluates models with the in-domain development set, while the second one carries out evaluation using a development domain, or the out-of-domain development set. Across a series of monolingual and crosslinguistic training settings, our results demonstrate the utility of neural encoder-decoder architecture when coupled with multi-task learning.

pdf bib
Representation of Yine [Arawak] Morphology by Finite State Transducer Formalism
Adriano Ingunza Torres | John Miller | Arturo Oncevay | Roberto Zariquiey Biondi

We represent the complexity of Yine (Arawak) morphology with a finite state transducer (FST) based morphological analyzer. Yine is a low-resource indigenous polysynthetic Peruvian language spoken by approximately 3,000 people and is classified as ‘definitely endangered’ by UNESCO. We review Yine morphology focusing on morphophonology, possessive constructions and verbal predicates. Then we develop FSTs to model these components proposing techniques to solve challenging problems such as complex patterns of incorporating open and closed category arguments. This is a work in progress and we still have more to do in the development and verification of our analyzer. Our analyzer will serve both as a tool to better document the Yine language and as a component of natural language processing (NLP) applications such as spell checking and correction.

pdf bib
Expanding Universal Dependencies for Polysynthetic Languages: A Case of St. Lawrence Island Yupik
Hyunji Hayley Park | Lane Schwartz | Francis Tyers

This paper describes the development of the first Universal Dependencies (UD) treebank for St. Lawrence Island Yupik, an endangered language spoken in the Bering Strait region. While the UD guidelines provided a general framework for our annotations, language-specific decisions were made necessary by the rich morphology of the polysynthetic language. Most notably, we annotated a corpus at the morpheme level as well as the word level. The morpheme level annotation was conducted using an existing morphological analyzer and manual disambiguation. By comparing the two resulting annotation schemes, we argue that morpheme-level annotation is essential for polysynthetic languages like St. Lawrence Island Yupik. Word-level annotation results in degenerate trees for some Yupik sentences and often fails to capture syntactic relations that can be manifested at the morpheme level. Dependency parsing experiments provide further support for morpheme-level annotation. Implications for UD annotation of other polysynthetic languages are discussed.

pdf bib
Findings of the AmericasNLP 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas
Manuel Mager | Arturo Oncevay | Abteen Ebrahimi | John Ortega | Annette Rios | Angela Fan | Ximena Gutierrez-Vasques | Luis Chiruzzo | Gustavo Giménez-Lugo | Ricardo Ramos | Ivan Vladimir Meza Ruiz | Rolando Coto-Solano | Alexis Palmer | Elisabeth Mager-Hois | Vishrav Chaudhary | Graham Neubig | Ngoc Thang Vu | Katharina Kann

This paper presents the results of the 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas. The shared task featured two independent tracks, and participants submitted machine translation systems for up to 10 indigenous languages. Overall, 8 teams participated with a total of 214 submissions. We provided training sets consisting of data collected from various sources, as well as manually translated sentences for the development and test sets. An official baseline trained on this data was also provided. Team submissions featured a variety of architectures, including both statistical and neural models, and for the majority of languages, many teams were able to considerably improve over the baseline. The best performing systems achieved 12.97 ChrF higher than baseline, when averaged across languages.

pdf bib
The REPU CS’ Spanish–Quechua Submission to the AmericasNLP 2021 Shared Task on Open Machine Translation
Oscar Moreno

We present the submission of REPUcs to the AmericasNLP machine translation shared task for the low resource language pair Spanish–Quechua. Our neural machine translation system ranked first in Track two (development set not used for training) and third in Track one (training includes development data). Our contribution is focused on: (i) the collection of new parallel data from different web sources (poems, lyrics, lexicons, handbooks), and (ii) using large Spanish–English data for pre-training and then fine-tuning the Spanish–Quechua system. This paper describes the new parallel corpora and our approach in detail.

pdf bib
The Helsinki submission to the AmericasNLP shared task
Raúl Vázquez | Yves Scherrer | Sami Virpioja | Jörg Tiedemann

The University of Helsinki participated in the AmericasNLP shared task for all ten language pairs. Our multilingual NMT models reached the first rank on all language pairs in track 1, and first rank on nine out of ten language pairs in track 2. We focused our efforts on three aspects: (1) the collection of additional data from various sources such as Bibles and political constitutions, (2) the cleaning and filtering of training data with the OpusFilter toolkit, and (3) different multilingual training techniques enabled by the latest version of the OpenNMT-py toolkit to make the most efficient use of the scarce data. This paper describes our efforts in detail.

Proceedings of the Second Workshop on Automatic Simultaneous Translation

pdf bib
Proceedings of the Second Workshop on Automatic Simultaneous Translation
Hua Wu | Colin Cherry | Liang Huang | Zhongjun He | Qun Liu | Maha Elbayad | Mark Liberman | Haifeng Wang | Mingbo Ma | Ruiqing Zhang

pdf bib
ICT’s System for AutoSimTrans 2021: Robust Char-Level Simultaneous Translation
Shaolei Zhang | Yang Feng

Simultaneous translation (ST) outputs the translation while still reading the input sentence, and is an important component of simultaneous interpretation. In this paper, we describe our submitted ST system, which won first place in the streaming transcription input track of the Chinese-English translation task of AutoSimTrans 2021. Aiming at the robustness of ST, we first propose char-level simultaneous translation and apply the wait-k policy to it. Meanwhile, we apply two data processing methods and combine two training methods for domain adaptation. Our method enhances the ST model with stronger robustness and domain adaptability. Experiments on streaming transcription show that our method outperforms the baseline at all latency levels; at low latency in particular, it improves by about 6 BLEU. In addition, our ablation studies verify the effectiveness of each module in the proposed method.
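
As a rough illustration of the wait-k policy mentioned above, the following sketch generates the READ/WRITE schedule in its usual formulation: read k source tokens, then alternate between writing one target token and reading one more source token until the source is exhausted. The function name `wait_k_schedule` is hypothetical, and the sketch covers only the scheduling, not the submitted system or its translation model. For char-level translation the source tokens would simply be characters.

```python
def wait_k_schedule(num_source, num_target, k):
    """Return the list of READ/WRITE actions taken under a wait-k policy.

    Before emitting target token t (0-indexed), the policy must have read
    min(k + t, num_source) source tokens.
    """
    actions = []
    read = 0
    for written in range(num_target):
        needed = min(k + written, num_source)
        while read < needed:
            actions.append("READ")
            read += 1
        actions.append("WRITE")
    return actions


# Hypothetical char-level source of four characters, four target tokens, k = 2.
source_chars = list("abcd")
schedule = wait_k_schedule(len(source_chars), 4, k=2)
```

With k = 2 the schedule starts with two READ actions, alternates WRITE and READ while source characters remain, and finishes with WRITE actions only. A smaller k lowers latency but gives the model less source context per output token.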

pdf bib
XMU’s Simultaneous Translation System at NAACL 2021
Shuangtao Li | Jinming Hu | Boli Wang | Xiaodong Shi | Yidong Chen

This paper describes our two systems submitted to the simultaneous translation evaluation at the 2nd automatic simultaneous translation workshop.

Proceedings of the 20th Workshop on Biomedical Language Processing

pdf bib
Proceedings of the 20th Workshop on Biomedical Language Processing
Dina Demner-Fushman | Kevin Bretonnel Cohen | Sophia Ananiadou | Junichi Tsujii

pdf bib
Scalable Few-Shot Learning of Robust Biomedical Name Representations
Pieter Fivez | Simon Suster | Walter Daelemans

Recent research on robust representations of biomedical names has focused on modeling large amounts of fine-grained conceptual distinctions using complex neural encoders. In this paper, we explore the opposite paradigm: training a simple encoder architecture using only small sets of names sampled from high-level biomedical concepts. Our encoder post-processes pretrained representations of biomedical names, and is effective for various types of input representations, both domain-specific and unsupervised. We validate our proposed few-shot learning approach on multiple biomedical relatedness benchmarks, and show that it allows for continual learning, where we accumulate information from various conceptual hierarchies to consistently improve encoder performance. Given these findings, we propose our approach as a low-cost alternative for exploring the impact of conceptual distinctions on robust biomedical name representations.

pdf bib
SAFFRON: tranSfer leArning For Food-disease RelatiOn extractioN
Gjorgjina Cenikj | Tome Eftimov | Barbara Koroušić Seljak

The accelerating growth of big data in the biomedical domain, with an endless amount of electronic health records and more than 30 million citations and abstracts in PubMed, introduces the need for automatic structuring of textual biomedical data. In this paper, we develop a method for detecting relations between food and disease entities from raw text. Due to the lack of annotated data on food with respect to health, we explore the feasibility of transfer learning by training BERT-based models on existing datasets annotated for the presence of cause and treat relations among different types of biomedical entities, and using them to recognize the same relations between food and disease entities in a dataset created for the purposes of this study. The best models achieve macro averaged F1 scores of 0.847 and 0.900 for the cause and treat relations, respectively.

pdf bib
Overview of the MEDIQA 2021 Shared Task on Summarization in the Medical Domain
Asma Ben Abacha | Yassine Mrabet | Yuhao Zhang | Chaitanya Shivade | Curtis Langlotz | Dina Demner-Fushman

The MEDIQA 2021 shared tasks at the BioNLP 2021 workshop addressed three tasks on summarization for medical text: (i) a question summarization task aimed at exploring new approaches to understanding complex real-world consumer health queries, (ii) a multi-answer summarization task that targeted aggregation of multiple relevant answers to a biomedical question into one concise and relevant answer, and (iii) a radiology report summarization task addressing the development of clinically relevant impressions from radiology report findings. Thirty-five teams participated in these shared tasks with sixteen working notes submitted (fifteen accepted) describing a wide variety of models developed and tested on the shared and external datasets. In this paper, we describe the tasks, the datasets, the models and techniques developed by various teams, the results of the evaluation, and a study of correlations among various summarization evaluation measures. We hope that these shared tasks will bring new research and insights in biomedical text summarization and evaluation.

pdf bib
WBI at MEDIQA 2021: Summarizing Consumer Health Questions with Generative Transformers
Mario Sänger | Leon Weber | Ulf Leser

This paper describes our contribution to the MEDIQA-2021 Task 1 question summarization competition. We model the task as a conditional generation problem. Our concrete pipeline fine-tunes the large pretrained generative transformers PEGASUS (Zhang et al., 2020a) and BART (Lewis et al., 2020). We used the resulting models as strong baselines and experimented with (i) integrating structured knowledge via entity embeddings, (ii) ensembling multiple generative models with the generator-discriminator framework and (iii) disentangling summarization and interrogative prediction to achieve further improvements. Our best performing model, a fine-tuned vanilla PEGASUS, reached second place in the competition with a ROUGE-2-F1 score of 15.99. We observed that all of our additional measures hurt performance (up to 5.2 pp) on the official test set. In a post-hoc experimental analysis using a larger validation set, results indicate slight performance improvements through the proposed extensions. However, further analysis is needed to provide stronger evidence.

pdf bib
BDKG at MEDIQA 2021: System Report for the Radiology Report Summarization Task
Songtai Dai | Quan Wang | Yajuan Lyu | Yong Zhu

This paper presents our winning system at the Radiology Report Summarization track of the MEDIQA 2021 shared task. Radiology report summarization automatically summarizes radiology findings into free-text impressions. This year’s task emphasizes the generalization and transfer ability of participating systems. Our system is built upon a pre-trained Transformer encoder-decoder architecture, i.e., PEGASUS, deployed with an additional domain adaptation module specifically to handle the transfer and generalization issue. Heuristics like ensembling and text normalization are also used. Our system is conceptually simple yet highly effective, achieving a ROUGE-2 score of 0.436 on the test set and ranking first among all participating systems.

pdf bib
damo_nlp at MEDIQA 2021: Knowledge-based Preprocessing and Coverage-oriented Reranking for Medical Question Summarization
Yifan He | Mosha Chen | Songfang Huang

Medical question summarization is an important but difficult task, where the input is often complex and erroneous while annotated data is expensive to acquire. We report our participation in the MEDIQA 2021 question summarization task in which we are required to address these challenges. We start from pre-trained conditional generative language models, use knowledge bases to help correct input errors, and rerank single system outputs to boost coverage. Experimental results show significant improvement in string-based metrics.

pdf bib
Stress Test Evaluation of Biomedical Word Embeddings
Vladimir Araujo | Andrés Carvallo | Carlos Aspillaga | Camilo Thorne | Denis Parra

The success of pretrained word embeddings has motivated their use in the biomedical domain, with contextualized embeddings yielding remarkable results in several biomedical NLP tasks. However, there is a lack of research on quantifying their behavior under severe stress scenarios. In this work, we systematically evaluate three language models with adversarial examples, that is, automatically constructed tests that allow us to examine how robust the models are. We propose two types of stress scenarios focused on the biomedical named entity recognition (NER) task, one inspired by spelling errors and another based on the use of synonyms for medical terms. Our experiments with three benchmarks show that the performance of the original models decreases considerably, in addition to revealing their weaknesses and strengths. Finally, we show that adversarial training causes the models to improve their robustness and even to exceed the original performance in some cases.

pdf bib
BioELECTRA: Pretrained Biomedical text Encoder using Discriminators
Kamal raj Kanakarajan | Bhuvana Kundumani | Malaikannan Sankarasubbu

Recent advancements in pretraining strategies in NLP have shown a significant improvement in the performance of models on various text mining tasks. We apply the ‘replaced token detection’ pretraining technique proposed by ELECTRA and pretrain a biomedical language model from scratch using biomedical text and vocabulary. We introduce BioELECTRA, a biomedical domain-specific language encoder model that adapts ELECTRA for the biomedical domain. We evaluate our model on the BLURB and BLUE biomedical NLP benchmarks. BioELECTRA outperforms the previous models and achieves state of the art (SOTA) on all 13 datasets in the BLURB benchmark and on all 4 clinical datasets from the BLUE benchmark across 7 different NLP tasks. BioELECTRA pretrained on PubMed and PMC full text articles performs very well on clinical datasets as well. BioELECTRA achieves a new SOTA of 86.34% (1.39% accuracy improvement) on MedNLI and 64% (2.98% accuracy improvement) on the PubMedQA dataset.

pdf bib
Word centrality constrained representation for keyphrase extraction
Zelalem Gero | Joyce Ho

To keep pace with the increased generation and digitization of documents, automated methods that can improve search, discovery and mining of the vast body of literature are essential. Keyphrases provide a concise representation by identifying salient concepts in a document. Various supervised approaches model keyphrase extraction using local context to predict the label for each token and perform much better than their unsupervised counterparts. Unfortunately, this method fails for short documents where the context is unclear. Moreover, keyphrases, which are usually the gist of a document, need to be the central theme. We propose a new extraction model that introduces a centrality constraint to enrich the word representation of a bidirectional long short-term memory network. Performance evaluation on 2 publicly available datasets demonstrates that our model outperforms existing state-of-the-art approaches.

pdf bib
End-to-end Biomedical Entity Linking with Span-based Dictionary Matching
Shogo Ujiie | Hayate Iso | Shuntaro Yada | Shoko Wakamiya | Eiji Aramaki

Disease name recognition and normalization is a fundamental process in biomedical text mining. Recently, neural joint learning of both tasks has been proposed to utilize the mutual benefits. While this approach achieves high performance, disease concepts that do not appear in the training dataset cannot be accurately predicted. This study introduces a novel end-to-end approach that combines span representations with dictionary-matching features to address this problem. Our model handles unseen concepts by referring to a dictionary while maintaining the performance of neural network-based models. Experiments using two major datasets demonstrate that our model achieves results competitive with strong baselines, especially for concepts unseen during training.

pdf bib
Context-aware query design combines knowledge and data for efficient reading and reasoning
Emilee Holtzapple | Brent Cochran | Natasa Miskov-Zivanov

The amount of biomedical literature has vastly increased over the past few decades. As a result, the sheer quantity of accessible information is overwhelming, and complicates manual information retrieval. Automated methods seek to speed up information retrieval from biomedical literature. However, such automated methods are still too time-intensive to survey all existing biomedical literature. We present a methodology for automatically generating literature queries that select relevant papers based on biological data. By using differentially expressed genes to inform our literature searches, we focus information extraction on mechanistic signaling details that are crucial for the disease or context of interest.

pdf bib
Measuring the relative importance of full text sections for information retrieval from scientific literature.
Lana Yeganova | Won Gyu Kim | Donald Comeau | W John Wilbur | Zhiyong Lu

With the growing availability of full-text articles, integrating abstracts and full texts of documents into a unified representation is essential for comprehensive search of scientific literature. However, previous studies have shown that naively merging abstracts with full texts of articles does not consistently yield better performance. Balancing the contribution of query terms appearing in the abstract and in sections of different importance in full text articles remains a challenge both for traditional bag-of-words IR approaches and for neural retrieval methods. In this work we establish the connection between the BM25 score of a query term appearing in a section of a full text document and the probability of that document being clicked or identified as relevant. Probability is computed using Pool Adjacent Violators (PAV), an isotonic regression algorithm, providing a maximum likelihood estimate based on the observed data. Using this probabilistic transformation of BM25 scores we show an improved performance on the PubMed Click dataset developed and presented in this study, as well as the 2007 TREC Genomics collection.
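
The Pool Adjacent Violators step can be sketched in a few lines. This is a generic textbook version of PAV, not the authors' code: the function name `pav` and the toy click data are hypothetical. Observations are assumed to be sorted by increasing BM25 score, each with a 0/1 click label, and the output is a non-decreasing sequence of estimated click probabilities.

```python
def pav(values, weights=None):
    """Pool Adjacent Violators: non-decreasing least-squares fit to `values`.

    `values` must already be ordered by the predictor (here, by BM25 score).
    """
    if weights is None:
        weights = [1.0] * len(values)
    blocks = []  # each block is [mean, total_weight, number_of_points]
    for value, weight in zip(values, weights):
        blocks.append([float(value), float(weight), 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            mean2, weight2, count2 = blocks.pop()
            mean1, weight1, count1 = blocks.pop()
            total = weight1 + weight2
            pooled = (mean1 * weight1 + mean2 * weight2) / total
            blocks.append([pooled, total, count1 + count2])
    fitted = []
    for mean, _, count in blocks:
        fitted.extend([mean] * count)
    return fitted


# Hypothetical observations, sorted by increasing BM25 score of a query term.
bm25_scores = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
clicked = [0, 1, 0, 0, 1, 1]

# Estimated click probability for each observation, monotone in the score.
click_probability = pav(clicked)
```

On this toy data the click at score 1.0 is pooled with the two non-clicks that follow it, so those three observations share a probability of one third, while the lowest score gets 0 and the two highest get 1.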

pdf bib
SB_NITK at MEDIQA 2021: Leveraging Transfer Learning for Question Summarization in Medical Domain
Spandana Balumuri | Sony Bachina | Sowmya Kamath S

Recent strides in the healthcare domain have resulted in vast quantities of streaming data available for building intelligent knowledge-based applications. However, the challenges introduced by the huge volume, velocity of generation, variety and variability of this medical data have to be adequately addressed. In this paper, we describe the model and results for our submission at the MEDIQA 2021 Question Summarization shared task. In order to improve the performance of summarization of consumer health questions, our method explores the use of transfer learning to utilize the knowledge of NLP transformers like BART, T5 and PEGASUS. The proposed models utilize the knowledge of pre-trained NLP transformers to achieve improved results when compared to conventional deep learning models such as LSTM, RNN etc. Our team SB_NITK ranked 12th among the total 22 submissions in the official final rankings. Our BART based model achieved a ROUGE-2 F1 score of 0.139.

pdf bib
QIAI at MEDIQA 2021: Multimodal Radiology Report Summarization
Jean-Benoit Delbrouck | Cassie Zhang | Daniel Rubin

This paper describes the solution of the QIAI lab submitted to the Radiology Report Summarization (RRS) challenge at MEDIQA 2021. This paper aims to investigate whether using multimodality during training improves the summarization performance of the model at test time. Our preliminary results show that taking advantage of the visual features from the x-rays associated with the radiology reports leads to higher evaluation metrics compared to a text-only baseline system. These improvements are reported according to the automatic evaluation metrics METEOR, BLEU and ROUGE. Our experiments can be fully replicated at the following address: https://github.com/jbdel/vilmedic.

pdf bib
MNLP at MEDIQA 2021: Fine-Tuning PEGASUS for Consumer Health Question Summarization
Jooyeon Lee | Huong Dang | Ozlem Uzuner | Sam Henry

This paper details a Consumer Health Question (CHQ) summarization model submitted to MEDIQA 2021 for shared task 1: Question Summarization. Many CHQs are composed of multiple sentences with typos or unnecessary information, which can interfere with automated question answering systems. Question summarization mitigates this issue by removing this unnecessary information, aiding automated systems in generating a more accurate summary. Our summarization approach focuses on applying multiple pre-processing techniques, including question focus identification on the input and the development of an ensemble method to combine question focus with an abstractive summarization method. We use the state-of-the-art abstractive summarization model, PEGASUS (Pre-training with Extracted Gap-sentences for Abstractive Summarization), to generate abstractive summaries. Our experiments show that using our ensemble method, which combines abstractive summarization with question focus identification, improves performance over using summarization alone. Our model shows a ROUGE-2 F-measure of 11.14% against the official test dataset.

Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

pdf bib
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching
Thamar Solorio | Shuguang Chen | Alan W. Black | Mona Diab | Sunayana Sitaram | Victor Soto | Emre Yilmaz | Anirudh Srinivasan

pdf bib
Translate and Classify: Improving Sequence Level Classification for English-Hindi Code-Mixed Data
Devansh Gautam | Kshitij Gupta | Manish Shrivastava

Code-mixing is a common phenomenon in multilingual societies around the world and is especially common in social media texts. Traditional NLP systems, usually trained on monolingual corpora, do not perform well on code-mixed texts. Training specialized models for code-switched texts is difficult due to the lack of large-scale datasets. Translating code-mixed data into standard languages like English could improve performance on various code-mixed tasks since we can use transfer learning from state-of-the-art English models for processing the translated data. This paper focuses on two sequence-level classification tasks for English-Hindi code-mixed texts, which are part of the GLUECoS benchmark: Natural Language Inference and Sentiment Analysis. We propose using various pre-trained models that have been fine-tuned for similar English-only tasks and have shown state-of-the-art performance. We further fine-tune these models on the translated code-mixed datasets and achieve state-of-the-art performance in both tasks. To translate English-Hindi code-mixed data to English, we use mBART, a pre-trained multilingual sequence-to-sequence model that has shown competitive performance on various low-resource machine translation pairs and has also shown performance gains in languages that were not in its pre-training corpus.

pdf bib
Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-Mixing
Ganesh Jawahar | El Moatez Billah Nagoudi | Muhammad Abdul-Mageed | Laks V.S. Lakshmanan

We describe models focused on the understudied problem of translating between monolingual and code-mixed language pairs. More specifically, we offer a wide range of models that convert monolingual English text into Hinglish (code-mixed Hindi and English). Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models (i.e., mT5 and mBART) on the task, finding both to work well. Given the paucity of training data for code-mixing, we also propose a dependency-free method for generating code-mixed texts from bilingual distributed representations that we exploit for improving language model performance. In particular, armed with this additional data, we adopt a curriculum learning approach where we first finetune the language models on synthetic data, then on gold code-mixed data. We find that, although simple, our synthetic code-mixing method is competitive with (and in some cases even superior to) several standard methods (backtranslation, a method based on equivalence constraint theory) under a diverse set of conditions. Our work shows that the mT5 model, finetuned following the curriculum learning procedure, achieves the best translation performance (12.67 BLEU). Our models place first in the overall ranking of the English-Hinglish official shared task.

pdf bib
CoMeT: Towards Code-Mixed Translation Using Parallel Monolingual Sentences
Devansh Gautam | Prashant Kodali | Kshitij Gupta | Anmol Goel | Manish Shrivastava | Ponnurangam Kumaraguru

Code-mixed languages are very popular in multilingual societies around the world, yet the resources needed to enable robust systems for such languages lag behind. A major contributing factor is the informal nature of these languages, which makes it difficult to collect code-mixed data. In this paper, we propose our system for Task 1 of CALCS 2021 to generate a machine translation system for English to Hinglish in a supervised setting. Translating in the given direction can help expand the set of resources for several tasks by translating valuable datasets from high-resource languages. We propose to use mBART, a pre-trained multilingual sequence-to-sequence model, and fully utilize the pre-training of the model by transliterating the Roman Hindi words in the code-mixed sentences to Devanagari script. We evaluate how expanding the input by concatenating Hindi translations of the English sentences improves mBART’s performance. Our system gives a BLEU score of 12.22 on the test set. Further, we perform a detailed error analysis of our proposed systems and explore the limitations of the provided dataset and metrics.

pdf bib
On the logistical difficulties and findings of Jopara Sentiment Analysis
Marvin Agüero-Torales | David Vilares | Antonio López-Herrera

This paper addresses the problem of sentiment analysis for Jopara, a code-switching language between Guarani and Spanish. We first collect a corpus of Guarani-dominant tweets and discuss the difficulties of finding quality data for even relatively easy-to-annotate tasks, such as sentiment analysis. Then, we train a set of neural models, including pre-trained language models, and explore whether they perform better than traditional machine learning ones in this low-resource setup. Transformer architectures obtain the best results, despite not considering Guarani during pre-training, but traditional machine learning models perform close to them due to the low-resource nature of the problem.

pdf bib
CodemixedNLP: An Extensible and Open NLP Toolkit for Code-Mixing
Sai Muralidhar Jayanthi | Kavya Nerella | Khyathi Raghavi Chandu | Alan W Black

The NLP community has recently witnessed steep progress in a variety of tasks across the realms of monolingual and multilingual language processing. These successes, in conjunction with the proliferating mixed language interactions on social media, have boosted interest in modeling code-mixed texts. In this work, we present CodemixedNLP, an open-source library with the goals of bringing together the advances in code-mixed NLP and opening it up to a wider machine learning community. The library consists of tools to develop and benchmark versatile model architectures that are tailored for mixed texts, methods to expand training sets, techniques to quantify mixing styles, and fine-tuned state-of-the-art models for 7 tasks in Hinglish. We believe this work has the potential to foster a distributed yet collaborative and sustainable ecosystem in an otherwise dispersed space of code-mixing research. The toolkit is designed to be simple, easily extensible, and resourceful to both researchers and practitioners. Demo: http://k-ikkees.pc.cs.cmu.edu:5000 and Library: https://github.com/murali1996/CodemixedNLP

pdf bib
Normalization and Back-Transliteration for Code-Switched Data
Dwija Parikh | Thamar Solorio

Code-switching is an omnipresent phenomenon in multilingual communities all around the world but remains a challenge for NLP systems due to the lack of proper data and processing techniques. Hindi-English code-switched text on social media is often transliterated to the Roman script, which prevents the use of monolingual resources available in the native Devanagari script. In this paper, we propose a method to normalize and back-transliterate code-switched Hindi-English text. In addition, we present a grapheme-to-phoneme (G2P) conversion technique for romanized Hindi data. We also release a dataset of script-corrected Hindi-English code-switched sentences labeled for the named entity recognition and part-of-speech tagging tasks to facilitate further research.

pdf bib
Abusive content detection in transliterated Bengali-English social media corpus
Salim Sazzed

Abusive text detection in low-resource languages such as Bengali is a challenging task due to the inadequacy of resources and tools. The ubiquity of transliterated Bengali comments in social media makes the task even more involved, as monolingual approaches cannot capture them. Unfortunately, no transliterated Bengali corpus is publicly available yet for abusive content analysis. Therefore, in this paper, we introduce an annotated corpus of 3000 transliterated Bengali comments categorized into two classes, abusive and non-abusive, with 1500 comments in each. For baseline evaluations, we employ several supervised machine learning (ML) and deep learning-based classifiers. We find that the support vector machine (SVM) shows the highest efficacy for identifying abusive content. We make the annotated corpus freely available to researchers to aid abusive content detection in Bengali social media data.

pdf bib
Are Multilingual Models Effective in Code-Switching?
Genta Indra Winata | Samuel Cahyawijaya | Zihan Liu | Zhaojiang Lin | Andrea Madotto | Pascale Fung

Multilingual language models have shown decent performance in multilingual and cross-lingual natural language understanding tasks. However, the power of these multilingual models in code-switching tasks has not been fully explored. In this paper, we study the effectiveness of multilingual language models to understand their capability and adaptability to the mixed-language setting by considering the inference speed, performance, and number of parameters to measure their practicality. We conduct experiments in three language pairs on named entity recognition and part-of-speech tagging and compare them with existing methods, such as using bilingual embeddings and multilingual meta-embeddings. Our findings suggest that pre-trained multilingual models do not necessarily guarantee high-quality representations on code-switching, while using meta-embeddings achieves similar results with significantly fewer parameters.

Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access

pdf bib
Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access
Nazli Goharian | Philip Resnik | Andrew Yates | Molly Ireland | Kate Niederhoffer | Rebecca Resnik

pdf bib
Understanding who uses Reddit: Profiling individuals with a self-reported bipolar disorder diagnosis
Glorianna Jagfeld | Fiona Lobban | Paul Rayson | Steven Jones

Recently, research on mental health conditions using public online data, including Reddit, has surged in NLP and health research but has not reported user characteristics, which are important to judge the generalisability of findings. This paper shows how existing NLP methods can yield information on clinical, demographic, and identity characteristics of almost 20K Reddit users who self-report a bipolar disorder diagnosis. This population consists of slightly more feminine- than masculine-gendered, mainly young or middle-aged, US-based adults who often report additional mental health diagnoses; these characteristics are compared with general Reddit statistics and epidemiological studies. Additionally, this paper carefully evaluates all methods and discusses ethical issues.

pdf bib
Individual Differences in the Movement-Mood Relationship in Digital Life Data
Glen Coppersmith | Alex Fine | Patrick Crutchley | Joshua Carroll

Our increasingly digitized lives generate troves of data that reflect our behavior, beliefs, mood, and wellbeing. Such digital life data provides crucial insight into the lives of patients outside the healthcare setting that has long been lacking, from a better understanding of mundane patterns of exercise and sleep routines to harbingers of emotional crisis. Moreover, information about individual differences and personalities is encoded in digital life data. In this paper we examine the relationship between mood and movement using linguistic and biometric data, respectively. Does increased physical activity (movement) have an effect on a person’s mood (or vice versa)? We find that weak group-level relationships between movement and mood mask interesting and often strong relationships between the two for individuals within the group. We describe these individual differences, and argue that individual variability in the relationship between movement and mood is one of many such factors that ought to be taken into account in wellbeing-focused apps and AI systems.

pdf bib
Suicide Risk Prediction by Tracking Self-Harm Aspects in Tweets: NUS-IDS at the CLPsych 2021 Shared Task
Sujatha Das Gollapalli | Guilherme Augusto Zagatti | See-Kiong Ng

We describe our system, developed for the CLPsych 2021 Shared Task, for identifying users at risk for suicide based on their tweets. Based on research in mental health studies linking self-harm tendencies with suicide, we attempt to characterize self-harm aspects expressed in user tweets over a period of time. To this end, we design SHTM, a Self-Harm Topic Model that combines Latent Dirichlet Allocation with a self-harm dictionary for modeling daily tweets of users. Next, differences in moods and topics over time are captured as features to train a deep learning model for suicide prediction.

pdf bib
Using Psychologically-Informed Priors for Suicide Prediction in the CLPsych 2021 Shared Task
Avi Gamoran | Yonatan Kaplan | Almog Simchon | Michael Gilead

This paper describes our approach to the CLPsych 2021 Shared Task, in which we aimed to predict suicide attempts based on Twitter feed data. We addressed this challenge by emphasizing reliance on prior domain knowledge. We engineered novel theory-driven features, and integrated prior knowledge with empirical evidence in a principled manner using Bayesian modeling. While this theory-guided approach increases bias and lowers accuracy on the training set, it was successful in preventing over-fitting. The models provided reasonable classification accuracy on unseen test data (0.68 ≤ AUC ≤ 0.84). Our approach may be particularly useful in prediction tasks trained on a relatively small data set.

pdf bib
Analysis of Behavior Classification in Motivational Interviewing
Leili Tavabi | Trang Tran | Kalin Stefanov | Brian Borsari | Joshua Woolley | Stefan Scherer | Mohammad Soleymani

Analysis of client and therapist behavior in counseling sessions can provide helpful insights for assessing the quality of the session and, consequently, the client’s behavioral outcome. In this paper, we study the automatic classification of standardized behavior codes (annotations) used for assessment of psychotherapy sessions in Motivational Interviewing (MI). We develop models and examine the classification of client behaviors throughout MI sessions, comparing the performance of models trained on large pretrained embeddings (RoBERTa) versus interpretable and expert-selected features (LIWC). Our best performing model using the pretrained RoBERTa embeddings beats the baseline model, achieving an F1 score of 0.66 in the subject-independent 3-class classification. Through statistical analysis on the classification results, we identify prominent LIWC features that may not have been captured by the model using pretrained embeddings. Although classification using LIWC features underperforms RoBERTa, our findings motivate the future direction of incorporating auxiliary tasks in the classification of MI codes.

pdf bib
Automatic Detection and Prediction of Psychiatric Hospitalizations From Social Media Posts
Zhengping Jiang | Jonathan Zomick | Sarah Ita Levitan | Mark Serper | Julia Hirschberg

We address the problem of predicting psychiatric hospitalizations using linguistic features drawn from social media posts. We formulate this novel task and develop an approach to automatically extract time spans of self-reported psychiatric hospitalizations. Using this dataset, we build predictive models of psychiatric hospitalization, comparing feature sets, user vs. post classification, and model performance using a varying time window of posts. Our best model achieves an F1 of 0.718 using 7 days of posts. Our results suggest that this is a useful framework for collecting hospitalization data, and that social media data can be leveraged to predict acute psychiatric crises before they occur, potentially saving lives and improving outcomes for individuals with mental illness.

pdf bib
Automated coherence measures fail to index thought disorder in individuals at risk for psychosis
Kasia Hitczenko | Henry Cowan | Vijay Mittal | Matthew Goldrick

Thought disorder (linguistic disturbances including incoherence and derailment of topic) is seen in individuals both with and at risk for psychosis. Methods from computational linguistics have increasingly sought to quantify thought disorder to detect group differences between clinical populations and healthy controls. While previous work has been quite successful at these classification tasks, the lack of interpretability of the computational metrics has made it unclear whether they are in fact measuring thought disorder. In this paper, we dive into these measures to try to better understand what they reflect. While we find group differences between at-risk and healthy control populations, we also find that the measures mostly do not correlate with existing measures of thought disorder symptoms (what they are intended to measure), but rather correlate with surface properties of the speech (e.g., sentence length) and sociodemographic properties of the speaker (e.g., race). These results highlight the importance of considering interpretability front and center as the field continues to grow. Ethical use of computational measures like those studied here, especially in the high-stakes context of clinical care, requires us to devote substantial attention to potential biases in our measures.

pdf bib
Detecting Cognitive Distortions from Patient-Therapist Interactions
Sagarika Shreevastava | Peter Foltz

An important part of Cognitive Behavioral Therapy (CBT) is to recognize and restructure certain negative thinking patterns that are also known as cognitive distortions. The aim of this project is to detect these distortions using natural language processing. We compare and contrast different types of linguistic features as well as different classification algorithms and explore the limitations of applying these techniques on a small dataset. We find that using pre-trained Sentence-BERT embeddings to train an SVM classifier yields the best results, with an F1-score of 0.79. Lastly, we discuss how this work provides insights into the types of linguistic features that are inherent in cognitive distortions.

pdf bib
Evaluating Automatic Speech Recognition Quality and Its Impact on Counselor Utterance Coding
Do June Min | Verónica Pérez-Rosas | Rada Mihalcea

Automatic speech recognition (ASR) is a crucial step in many natural language processing (NLP) applications, as the available data often consists mainly of raw speech. Since the result of the ASR step is treated as a meaningful, informative input to later steps in the NLP pipeline, it is important to understand the behavior and failure modes of this step. In this work, we analyze the quality of ASR in the psychotherapy domain, using motivational interviewing conversations between therapists and clients. We conduct domain-agnostic and domain-relevant evaluations using standard evaluation metrics and also identify domain-relevant keywords in the ASR output. Moreover, we empirically study the effect of mixing ASR and manual data during the training of a downstream NLP model, and also demonstrate how additional local context can help alleviate the error introduced by noisy ASR transcripts.

pdf bib
Safeguarding against spurious AI-based predictions: The case of automated verbal memory assessment
Chelsea Chandler | Peter Foltz | Alex Cohen | Terje Holmlund | Brita Elvevåg

A growing amount of psychiatric research incorporates machine learning and natural language processing methods; however, findings have yet to be translated into actual clinical decision support systems. Many of these studies are based on relatively small datasets in homogeneous populations, which has the associated risk that the models may not perform adequately on new data in real clinical practice. The nature of serious mental illness is that it is hard to define, hard to capture, and requires frequent monitoring, which leads to imperfect data where attribute and class noise are common. With the goal of an effective AI-mediated clinical decision support system, there must be computational safeguards placed on the models used in order to avoid spurious predictions and thus allow humans to review data in the settings where models are unstable or bound not to generalize. This paper describes two approaches to implementing safeguards: (1) the determination of cases in which models are unstable by means of attribute- and class-based outlier detection and (2) finding the extent to which models show inductive bias. These safeguards are illustrated in the automated scoring of a story recall task via natural language processing methods. With the integration of human-in-the-loop machine learning in the clinical implementation process, incorporating safeguards such as these into the models will offer patients increased protection from spurious predictions.
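
As a rough illustration of the first safeguard, the sketch below flags an input for human review when one of its attributes lies far from the training data. It is not the authors' method: it swaps in a plain z-score test on a single attribute, and the function name, the threshold, and the sample values are all hypothetical.

```python
import statistics


def needs_human_review(train_values, new_value, z_threshold=3.0):
    """Return True when new_value is an outlier relative to train_values.

    Assumption: a simple z-score test on one attribute stands in for the
    attribute-based outlier detection mentioned in the abstract.
    """
    mean = statistics.fmean(train_values)
    stdev = statistics.stdev(train_values)  # sample standard deviation
    if stdev == 0:
        # All training values are identical; anything different is unusual.
        return new_value != mean
    return abs(new_value - mean) / stdev > z_threshold


# Hypothetical attribute: word counts of story recalls seen in training.
train_word_counts = [10, 11, 9, 10, 10, 11, 9]
flag_typical = needs_human_review(train_word_counts, 10.5)
flag_unusual = needs_human_review(train_word_counts, 20)
```

In a deployed system, a flagged case would be routed to a clinician instead of being scored automatically.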

pdf bib
Towards Understanding the Role of Gender in Deploying Social Media-Based Mental Health Surveillance Models
Eli Sherman | Keith Harrigian | Carlos Aguirre | Mark Dredze

Spurred by advances in machine learning and natural language processing, developing social media-based mental health surveillance models has received substantial recent attention. For these models to be maximally useful, it is necessary to understand how they perform on various subgroups, especially those defined in terms of protected characteristics. In this paper we study the relationship between user demographics, focusing on gender, and depression. Considering a population of Reddit users with known genders and depression statuses, we analyze the degree to which depression predictions are subject to biases along gender lines using domain-informed classifiers. We then study our models’ parameters to gain qualitative insight into the differences in posting behavior across genders.

Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics

pdf bib
Proceedings of the Workshop on Cognitive Modeling and Computational Linguistics
Emmanuele Chersoni | Nora Hollenstein | Cassandra Jacobs | Yohei Oseki | Laurent Prévot | Enrico Santus

pdf bib
Modeling Incremental Language Comprehension in the Brain with Combinatory Categorial Grammar
Miloš Stanojević | Shohini Bhattasali | Donald Dunagan | Luca Campanelli | Mark Steedman | Jonathan Brennan | John Hale

Hierarchical sentence structure plays a role in word-by-word human sentence comprehension, but it remains unclear how best to characterize this structure and unknown how exactly it would be recognized in a step-by-step process model. With a view towards sharpening this picture, we model the time course of hemodynamic activity within the brain during an extended episode of naturalistic language comprehension using Combinatory Categorial Grammar (CCG). CCG has well-defined incremental parsing algorithms, surface compositional semantics, and can explain long-range dependencies as well as complicated cases of coordination. We find that CCG-derived predictors improve a regression model of fMRI time course in six language-relevant brain regions, over and above predictors derived from context-free phrase structure. Adding a special Revealing operator to CCG parsing, one designed to handle right-adjunction, improves the fit in three of these regions. This evidence for CCG from neuroimaging bolsters the more general case for mildly context-sensitive grammars in the cognitive science of language.

pdf bib
That Looks Hard: Characterizing Linguistic Complexity in Humans and Language Models
Gabriele Sarti | Dominique Brunato | Felice Dell’Orletta

This paper investigates the relationship between two complementary perspectives in the human assessment of sentence complexity and how they are modeled in a neural language model (NLM). The first perspective takes into account multiple online behavioral metrics obtained from eye-tracking recordings. The second one concerns the offline perception of complexity measured by explicit human judgments. Using a broad spectrum of linguistic features modeling lexical, morpho-syntactic, and syntactic properties of sentences, we perform a comprehensive analysis of linguistic phenomena associated with the two complexity viewpoints and report similarities and differences. We then show the effectiveness of linguistic features when explicitly leveraged by a regression model for predicting sentence complexity and compare its results with the ones obtained by a fine-tuned neural language model. We finally probe the NLM’s linguistic competence before and after fine-tuning, highlighting how linguistic information encoded in representations changes when the model learns to predict complexity.

pdf bib
LangResearchLab_NC at CMCL2021 Shared Task: Predicting Gaze Behaviour Using Linguistic Features and Tree Regressors
Raksha Agarwal | Niladri Chatterjee

Analysis of gaze data behaviour has gained momentum in recent years for different NLP applications. The present paper aims at modelling gaze data behaviour of tokens in the context of a sentence. We have experimented with various machine learning regression algorithms on a feature space comprising the linguistic features of the target tokens for prediction of five eye-tracking features. The CatBoost Regressor performed the best and achieved fourth position in terms of MAE-based accuracy measurement for the ZuCo Dataset.

pdf bib
TorontoCL at CMCL 2021 Shared Task: RoBERTa with Multi-Stage Fine-Tuning for Eye-Tracking Prediction
Bai Li | Frank Rudzicz

Eye movement data during reading is a useful source of information for understanding language comprehension processes. In this paper, we describe our submission to the CMCL 2021 shared task on predicting human reading patterns. Our model uses RoBERTa with a regression layer to predict 5 eye-tracking features. We train the model in two stages: we first fine-tune on the Provo corpus (another eye-tracking dataset), then fine-tune on the task data. We compare different Transformer models and apply ensembling methods to improve the performance. Our final submission achieves an MAE score of 3.929, ranking 3rd out of 13 teams that participated in this shared task.
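
For readers unfamiliar with the metric, the following sketch shows how a single MAE score over several predicted features could be computed. It assumes an unweighted average of per-feature mean absolute errors; the shared task's exact scoring script is not described in this abstract, and the feature names and values below are placeholders.

```python
def mean_absolute_error(y_true, y_pred):
    """Mean of |gold - predicted| over paired values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def average_mae(gold, predicted):
    """Return (unweighted mean of per-feature MAEs, per-feature MAEs).

    Assumption: every feature contributes equally to the final score.
    """
    per_feature = {
        name: mean_absolute_error(gold[name], predicted[name]) for name in gold
    }
    return sum(per_feature.values()) / len(per_feature), per_feature


# Placeholder feature names and values, two tokens per feature.
gold = {"feature_a": [1.0, 2.0], "feature_b": [10.0, 20.0]}
predicted = {"feature_a": [2.0, 2.0], "feature_b": [10.0, 24.0]}
overall, per_feature = average_mae(gold, predicted)
```

Lower values are better, which is why systems in this task are ranked by ascending MAE.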

pdf bib
LAST at CMCL 2021 Shared Task: Predicting Gaze Data During Reading with a Gradient Boosting Decision Tree Approach
Yves Bestgen

A LightGBM model fed with target word lexical characteristics and features obtained from word frequency lists, psychometric data and bigram association measures has been optimized for the 2021 CMCL Shared Task on Eye-Tracking Data Prediction. It obtained the best performance of all teams on two of the five eye-tracking measures to predict, allowing it to rank first on the official challenge criterion and to outperform all deep-learning based systems participating in the challenge.

pdf bib
PIHKers at CMCL 2021 Shared Task: Cosine Similarity and Surprisal to Predict Human Reading Patterns.
Lavinia Salicchi | Alessandro Lenci

Eye-tracking psycholinguistic studies have revealed that context-word semantic coherence and predictability influence language processing. In this paper we show our approach to predict eye-tracking features from the ZuCo dataset for the shared task of the Cognitive Modeling and Computational Linguistics (CMCL2021) workshop. Using both cosine similarity and surprisal within a regression model, we significantly improved the baseline Mean Absolute Error computed among five eye-tracking features.
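
The two predictors named in this abstract have standard definitions: cosine similarity measures how aligned two vectors are, and surprisal is the negative log probability of a word given its context. The sketch below computes both on made-up inputs; it is not the authors' implementation, and the function names, vectors, and probability are illustrative assumptions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)


def surprisal(probability):
    """Surprisal in bits: -log2 P(word | context)."""
    if not 0 < probability <= 1:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


# Made-up word and context vectors, and a made-up conditional probability.
word_vector = [1.0, 0.0]
context_vector = [1.0, 0.0]
similarity = cosine_similarity(word_vector, context_vector)
bits = surprisal(0.25)
```

In a regression setup of the kind the abstract describes, values like these would be computed per token and used as input features.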

pdf bib
TALEP at CMCL 2021 Shared Task: Non Linear Combination of Low and High-Level Features for Predicting Eye-Tracking Data
Franck Dary | Alexis Nasr | Abdellah Fourtassi

In this paper we describe our contribution to the CMCL 2021 Shared Task, which consists in predicting 5 different eye-tracking variables from English tokenized text. Our approach is based on a neural network that combines raw textual features we extracted from the text with parser-based features that include linguistic predictions (e.g., part of speech) and complexity metrics (e.g., entropy of parsing). We found that both the features we considered and the architecture of the neural model that combined these features played a role in the overall performance. Our system achieved relatively high accuracy on the test data of the challenge and was ranked 2nd out of 13 competing teams and a total of 30 submissions.

pdf bib
MTL782_IITD at CMCL 2021 Shared Task: Prediction of Eye-Tracking Features Using BERT Embeddings and Linguistic Features
Shivani Choudhary | Kushagri Tandon | Raksha Agarwal | Niladri Chatterjee

Reading and comprehension are quintessentially cognitive tasks. Eye movement acts as a surrogate to understand which part of a sentence is critical to the process of comprehension. The aim of the shared task is to predict five eye-tracking features for a given word of the input sentence. We experimented with several models based on LGBM (Light Gradient Boosting Machine) Regression, ANN (Artificial Neural Network), and CNN (Convolutional Neural Network), using BERT embeddings and some combination of linguistic features. Our submission using CNN achieved an average MAE of 4.0639 and ranked 7th in the shared task. The average MAE was further lowered to 3.994 in post-task evaluation.

pdf bib
Enhancing Cognitive Models of Emotions with Representation Learning
Yuting Guo | Jinho D. Choi

We present a novel deep learning-based framework to generate embedding representations of fine-grained emotions that can be used to computationally describe psychological models of emotions. Our framework integrates a contextualized embedding encoder with a multi-head probing model that makes it possible to interpret dynamically learned representations optimized for an emotion classification task. Our model is evaluated on the Empathetic Dialogue dataset and achieves a state-of-the-art result for classifying 32 emotions. Our layer analysis can derive an emotion graph to depict hierarchical relations among the emotions. Our emotion representations can be used to generate an emotion wheel directly comparable to the one from Plutchik’s model, and also to augment the values of missing emotions in the PAD emotional state model.

pdf bib
Clause Final Verb Prediction in Hindi: Evidence for Noisy Channel Model of Communication
Kartik Sharma | Niyati Bafna | Samar Husain

Verbal prediction has been shown to be critical during online comprehension of Subject-Object-Verb (SOV) languages. In this work we present three computational models to predict clause-final verbs in Hindi given their prior arguments. The models differ in their use of prior context during the prediction process: the context is either noisy or noise-free. Model predictions are compared with the sentence completion data obtained from Hindi native speakers. Results show that models that assume noisy context outperform the noise-free model. In particular, a lossy context model that assumes prior context to be affected by predictability and recency captures the distribution of the predicted verb class and error sources best. The success of the predictability-recency lossy context model is consistent with the noisy channel hypothesis for sentence comprehension and supports the idea that the reconstruction of the context during prediction is driven by prior linguistic exposure. These results also shed light on the nature of the noise that affects the reconstruction process. Overall, the results pose a challenge to the adaptability hypothesis that assumes use of noise-free preverbal context for robust verbal prediction.

Sentence Complexity in Context
Benedetta Iavarone | Dominique Brunato | Felice Dell’Orletta

We study the influence of context on how humans evaluate the complexity of a sentence in English. We collect a new dataset of sentences, where each sentence is rated for perceived complexity within different contextual windows. We carry out an in-depth analysis to detect which linguistic features correlate more with complexity judgments and with the degree of agreement among annotators. We train several regression models, using either explicit linguistic features or contextualized word embeddings, to predict the mean complexity values assigned to sentences in the different contextual windows, as well as their standard deviation. Results show that models leveraging explicit features capturing morphosyntactic and syntactic phenomena always perform better, especially when they have access to features extracted from all contextual sentences.

Evaluating the Acquisition of Semantic Knowledge from Cross-situational Learning in Artificial Neural Networks
Mitja Nikolaus | Abdellah Fourtassi

When learning their native language, children acquire the meanings of words and sentences from highly ambiguous input without much explicit supervision. One possible learning mechanism is cross-situational learning, which has been successfully tested in laboratory experiments with children. Here we use Artificial Neural Networks to test if this mechanism scales up to more natural language and visual scenes using a large dataset of crowd-sourced images with corresponding descriptions. We evaluate learning using a series of tasks inspired by methods commonly used in laboratory studies of language acquisition. We show that the model acquires rich semantic knowledge both at the word- and sentence-level, mirroring the patterns and trajectory of learning in early childhood. Our work highlights the usefulness of low-level co-occurrence statistics across modalities in facilitating the early acquisition of higher-level semantic knowledge.

Representation and Pre-Activation of Lexical-Semantic Knowledge in Neural Language Models
Steven Derby | Paul Miller | Barry Devereux

In this paper, we perform a systematic analysis of how closely the intermediate layers from LSTM and transformer language models correspond to human semantic knowledge. Furthermore, in order to make more meaningful comparisons with theories of human language comprehension in psycholinguistics, we focus on two key stages where the meaning of a particular target word may arise: immediately before the word’s presentation to the model (comparable to forward inferencing), and immediately after the word token has been input into the network. Our results indicate that the transformer models are better at capturing semantic knowledge relating to lexical concepts, both during word prediction and when retention is required.

Graph-theoretic Properties of the Class of Phonological Neighbourhood Networks
Rory Turnbull

This paper concerns the structure of phonological neighbourhood networks, which are a graph-theoretic representation of the phonological lexicon. These networks represent each word as a node and links are placed between words which are phonological neighbours, usually defined as a string edit distance of one. Phonological neighbourhood networks have been used to study many aspects of the mental lexicon and psycholinguistic theories of speech production and perception. This paper offers preliminary graph-theoretic observations about phonological neighbourhood networks considered as a class. To aid this exploration, this paper introduces the concept of the hyperlexicon, the network consisting of all possible words for a given symbol set and their neighbourhood relations. The construction of the hyperlexicon is discussed, and basic properties are derived. This work is among the first to directly address the nature of phonological neighbourhood networks from an analytic perspective.
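
The network construction described in this abstract can be sketched in a few lines. This is a minimal illustration rather than the author's code: it assumes that each character of a string stands in for one phonological symbol, that the lexicon contains no duplicate words, and that "neighbour" means a string edit distance of exactly one (a single substitution, insertion, or deletion). The names `is_neighbour`, `neighbourhood_network`, and `toy_lexicon` are hypothetical.

```python
from itertools import combinations


def is_neighbour(a, b):
    """Return True if a and b are at string edit distance exactly one."""
    if a == b:
        return False
    if len(a) == len(b):
        # Same length: exactly one substituted symbol.
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    short, long_ = sorted((a, b), key=len)
    # Lengths differ by one: deleting one symbol from the longer
    # word must yield the shorter word.
    return any(long_[:i] + long_[i + 1:] == short for i in range(len(long_)))


def neighbourhood_network(lexicon):
    """Build an undirected graph (adjacency sets) over the words in lexicon."""
    graph = {word: set() for word in lexicon}
    for a, b in combinations(lexicon, 2):
        if is_neighbour(a, b):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Toy lexicon (assumption: orthographic strings stand in for phoneme strings).
toy_lexicon = ["cat", "bat", "at", "cart", "dog"]
network = neighbourhood_network(toy_lexicon)
print(network["cat"])  # neighbours of "cat"
print(network["dog"])  # "dog" has no neighbours in this lexicon
```

The pairwise comparison is quadratic in lexicon size, which is fine for a toy example but would likely need indexing for a realistic lexicon.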

Proceedings of the Second Workshop on Data Science with Human in the Loop: Language Advances

Eduard Dragut | Yunyao Li | Lucian Popa | Slobodan Vucetic

ViziTex: Interactive Visual Sense-Making of Text Corpora
Natraj Raman | Sameena Shah | Tucker Balch | Manuela Veloso

Information visualization is critical to analytical reasoning and knowledge discovery. We present an interactive studio that integrates perceptive visualization techniques with powerful text analytics algorithms to assist humans in sense-making of large complex text corpora. The novel visual representations introduced here encode the features delivered by modern text mining models using advanced metaphors such as hypergraphs, nested topologies and tessellated planes. They enhance the human-computer interaction experience for various tasks such as summarization, exploration, organization and labeling of documents. We demonstrate the ability of the visuals to surface the structure, relations and concepts from documents across different domains.

Bridging Multi-disciplinary Collaboration Challenges in ML Development via Domain Knowledge Elicitation
Soya Park

Building a machine learning model in a sophisticated domain is a time-consuming process, partially due to the steep learning curve of domain knowledge for data scientists. We introduce Ziva, an interface for conveying domain knowledge from domain experts to data scientists in two ways: (1) a concept creation interface where domain experts extract important concepts of the domain and (2) five kinds of justification elicitation interfaces that solicit how the domain concepts are expressed in data instances.

Towards integrated, interactive, and extensible text data analytics with Leam
Peter Griggs | Cagatay Demiralp | Sajjadur Rahman

From tweets to product reviews, text is ubiquitous on the web and often contains valuable information for both enterprises and consumers. However, the online text is generally noisy and incomplete, requiring users to process and analyze the data to extract insights. While there are systems effective for different stages of text analysis, users lack extensible platforms to support interactive text analysis workflows end-to-end. To facilitate integrated text analytics, we introduce LEAM, which aims at combining the strengths of spreadsheets, computational notebooks, and interactive visualizations. LEAM supports interactive analysis via GUI-based interactions and provides a declarative specification language, implemented based on a visual text algebra, to enable user-guided analysis. We evaluate LEAM through two case studies using two popular Kaggle text analytics workflows to understand the strengths and weaknesses of the system.

Data Cleaning Tools for Token Classification Tasks
Karthik Muthuraman | Frederick Reiss | Hong Xu | Bryan Cutler | Zachary Eichenberger

Human-in-the-loop systems for cleaning NLP training data rely on automated sieves to isolate potentially-incorrect labels for manual review. We have developed a novel technique for flagging potentially-incorrect labels with high sensitivity in named entity recognition corpora. We incorporated our sieve into an end-to-end system for cleaning NLP corpora, implemented as a modular collection of Jupyter notebooks built on extensions to the Pandas DataFrame library. We used this system to identify incorrect labels in the CoNLL-2003 corpus for English-language named entity recognition (NER), one of the most influential corpora for NER model research. Unlike previous work that only looked at a subset of the corpus’s validation fold, our automated sieve enabled us to examine the entire corpus in depth. Across the entire CoNLL-2003 corpus, we identified over 1300 incorrect labels (out of 35089 in the corpus). We have published our corrections, along with the code we used in our experiments. We are developing a repeatable version of the process we used on the CoNLL-2003 corpus as an open-source library.

Building Low-Resource NER Models Using Non-Speaker Annotations
Tatiana Tsygankova | Francesca Marini | Stephen Mayhew | Dan Roth

In low-resource natural language processing (NLP), the key problems are a lack of target language training data, and a lack of native speakers to create it. Cross-lingual methods have had notable success in addressing these concerns, but in certain common circumstances, such as insufficient pre-training corpora or languages far from the source language, their performance suffers. In this work we propose a complementary approach to building low-resource Named Entity Recognition (NER) models using non-speaker (NS) annotations, provided by annotators with no prior experience in the target language. We recruit 30 participants in a carefully controlled annotation experiment with Indonesian, Russian, and Hindi. We show that the use of NS annotators produces results that are consistently on par with or better than cross-lingual methods built on modern contextual representations, and has the potential to outperform them with additional effort. We conclude with observations of common annotation patterns and recommended implementation practices, and motivate how NS annotations can be used in addition to prior methods for improved performance.

CrossCheck: Rapid, Reproducible, and Interpretable Model Evaluation
Dustin Arendt | Zhuanyi Shaw | Prasha Shrestha | Ellyn Ayton | Maria Glenski | Svitlana Volkova

Evaluation beyond aggregate performance metrics, e.g. F1-score, is crucial to both establish an appropriate level of trust in machine learning models and identify avenues for future model improvements. In this paper we demonstrate CrossCheck, an interactive capability for rapid cross-model comparison and reproducible error analysis. We describe the tool, discuss design and implementation details, and present three NLP use cases (named entity recognition, reading comprehension, and clickbait detection) that show the benefits of using the tool for model evaluation. CrossCheck enables users to make informed decisions when choosing between multiple models, identify when the models are correct and for which examples, investigate whether the models are making the same mistakes as humans, evaluate models’ generalizability and highlight models’ limitations, strengths and weaknesses. Furthermore, CrossCheck is implemented as a Jupyter widget, which allows for rapid and convenient integration into existing model development workflows.

TopGuNN: Fast NLP Training Data Augmentation using Large Corpora
Rebecca Iglesias-Flores | Megha Mishra | Ajay Patel | Akanksha Malhotra | Reno Kriz | Martha Palmer | Chris Callison-Burch

Acquiring training data for natural language processing systems can be expensive and time-consuming. Given a few training examples crafted by experts, large corpora can be mined for thousands of semantically similar examples that provide useful variability to improve model generalization. We present TopGuNN, a fast contextualized k-NN retrieval system that can efficiently index and search over contextual embeddings generated from large corpora. TopGuNN is demonstrated for a training data augmentation use case over the Gigaword corpus. Using approximate k-NN and an efficient architecture, TopGuNN performs queries over an embedding space of 4.63 TB (approximately 1.5B embeddings) in less than a day.

A Computational Model for Interactive Transcription
William Lane | Mat Bettinson | Steven Bird

Transcribing low resource languages can be challenging in the absence of a good lexicon and trained transcribers. Accordingly, we seek a way to enable interactive transcription whereby the machine amplifies human efforts. This paper presents a data model and a system architecture for interactive transcription, supporting multiple modes of interactivity, increasing the likelihood of finding tasks that engage local participation in language work. The approach also supports other applications which are useful in our context, including spoken document retrieval and language learning.

Proceedings of Deep Learning Inside Out (DeeLIO): The 2nd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures

Eneko Agirre | Marianna Apidianaki | Ivan Vulić

Augmenting Topic Aware Knowledge-Grounded Conversations with Dynamic Built Knowledge Graphs
Junjie Wu | Hao Zhou

Dialog topic management and background knowledge selection are essential factors for the success of knowledge-grounded open-domain conversations. However, existing models primarily operate with symmetric knowledge bases or are stylized with pre-defined roles between conversational partners, while people usually have their own knowledge before a real chit-chat. To address this problem, we propose a dynamic knowledge graph-based topical conversation model (DKGT). Given a dialog history context, our model first builds knowledge graphs from the context as an imitation of the human ability to form logical relationships between known and unknown topics during a conversation. This logical information is fed into a topic predictor to promote topic management, and then facilitates background knowledge selection and response generation. To the best of our knowledge, this is the first attempt to dynamically form knowledge graphs between chatting topics to assist dialog topic management during a conversation. Experimental results show that our model can properly schedule conversational topics and pick suitable knowledge to generate informative responses compared to several strong baselines.

What Makes My Model Perplexed? A Linguistic Investigation on Neural Language Models Perplexity
Alessio Miaschi | Dominique Brunato | Felice Dell’Orletta | Giulia Venturi

This paper presents an investigation aimed at studying how the linguistic structure of a sentence affects the perplexity of two of the most popular Neural Language Models (NLMs), BERT and GPT-2. We first compare the sentence-level likelihood computed with BERT and GPT-2’s perplexity, showing that the two metrics are correlated. In addition, we exploit linguistic features capturing a wide set of morpho-syntactic and syntactic phenomena, showing how they contribute to predicting the perplexity of the two NLMs.

What BERTs and GPTs know about your brand? Probing contextual language models for affect associations
Vivek Srivastava | Stephen Pilli | Savita Bhat | Niranjan Pedanekar | Shirish Karande

Investigating brand perception is fundamental to marketing strategies. In this regard, brand image, defined by a set of attributes (Aaker, 1997), is recognized as a key element in indicating how a brand is perceived by various stakeholders such as consumers and competitors. Traditional approaches (e.g., surveys) to monitor brand perceptions are time-consuming and inefficient. In the era of digital marketing, both brand managers and consumers engage with a vast amount of digital marketing content. The exponential growth of digital content has propelled the emergence of pre-trained language models such as BERT and GPT as essential tools in solving a myriad of challenges with textual data. This paper seeks to investigate the extent of brand perceptions (i.e., brand and image attribute associations) these language models encode. We believe that any kind of bias for a brand and attribute pair may influence customer-centric downstream tasks such as recommender systems, sentiment analysis, and question-answering, e.g., suggesting a specific brand consistently when queried for innovative products. We use synthetic data and real-life data and report comparison results for five contextual LMs, viz. BERT, RoBERTa, DistilBERT, ALBERT and BART.

Attention vs non-attention for a Shapley-based explanation method
Tom Kersten | Hugh Mee Wong | Jaap Jumelet | Dieuwke Hupkes

The field of explainable AI has recently seen an explosion in the number of explanation methods for highly non-linear deep neural networks. The extent to which such methods, which are often proposed and tested in the domain of computer vision, are appropriate to address the explainability challenges in NLP is as yet relatively unexplored. In this work, we consider Contextual Decomposition (CD), a Shapley-based input feature attribution method that has been shown to work well for recurrent NLP models, and we test the extent to which it is useful for models that contain attention operations. To this end, we extend CD to cover the operations necessary for attention-based models. We then compare how long distance subject-verb relationships are processed by models with and without attention, considering a number of different syntactic structures in two different languages: English and Dutch. Our experiments confirm that CD can successfully be applied for attention-based models as well, providing an alternative Shapley-based attribution method for modern neural networks. In particular, using CD, we show that the English and Dutch models demonstrate similar processing behaviour, but that under the hood there are consistent differences between our attention and non-attention models.

Proceedings of the Third Workshop on Multimodal Artificial Intelligence

Amir Zadeh | Louis-Philippe Morency | Paul Pu Liang | Candace Ross | Ruslan Salakhutdinov | Soujanya Poria | Erik Cambria | Kelly Shi

Multi Task Learning based Framework for Multimodal Classification
Danting Zeng

Large-scale multi-modal classification aims to distinguish between different multi-modal data, and it has drawn considerable attention over the last decade. In this paper, we propose a multi-task learning-based framework for the multimodal classification task, which consists of two branches: a multi-modal autoencoder branch and an attention-based multi-modal modeling branch. The multi-modal autoencoder receives multi-modal features and obtains the interactive information, which is called the multi-modal encoder feature, and uses this feature to reconstruct all the input data. In addition, the multi-modal encoder feature can be used to enrich the raw dataset and improve the performance of downstream tasks (such as classification). As for the attention-based multimodal modeling branch, we first employ an attention mechanism to make the model focus on important features, then we use the multi-modal encoder feature to enrich the input information and achieve better performance. We conduct extensive experiments on different datasets, and the results demonstrate the effectiveness of the proposed framework.

A Package for Learning on Tabular and Text Data with Transformers
Ken Gu | Akshay Budhkar

Recent progress in natural language processing has led to Transformer architectures becoming the predominant model used for natural language tasks. However, in many real-world datasets, additional modalities are included which the Transformer does not directly leverage. We present Multimodal-Toolkit, an open-source Python package to incorporate text and tabular (categorical and numerical) data with Transformers for downstream applications. Our toolkit integrates well with Hugging Face’s existing API, such as tokenization and the model hub, which allows easy download of different pre-trained models.

Learning to Select Question-Relevant Relations for Visual Question Answering
Jaewoong Lee | Heejoon Lee | Hwanhee Lee | Kyomin Jung

Existing visual question answering (VQA) systems commonly use graph neural networks (GNNs) to extract visual relationships such as semantic relations or spatial relations. However, studies that use GNNs typically ignore the importance of each relation and simply concatenate outputs from multiple relation encoders. In this paper, we propose a novel layer architecture that fuses multiple visual relations through an attention mechanism to address this issue. Specifically, we develop a model that uses question embedding and joint embedding of the encoders to obtain dynamic attention weights with regard to the type of questions. Using the learnable attention weights, the proposed model can efficiently use the necessary visual relation features for a given question. Experimental results on the VQA 2.0 dataset demonstrate that the proposed model outperforms existing graph attention network-based architectures. Additionally, we visualize the attention weight and show that the proposed model assigns a higher weight to relations that are more relevant to the question.

Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

Anna Feldman | Giovanni Da San Martino | Chris Leberknight | Preslav Nakov

Identifying Automatically Generated Headlines using Transformers
Antonis Maronikolakis | Hinrich Schütze | Mark Stevenson

False information spread via the internet and social media influences public opinion and user activity, while generative models enable fake content to be generated faster and more cheaply than had previously been possible. In the not so distant future, identifying fake content generated by deep learning models will play a key role in protecting users from misinformation. To this end, a dataset containing human and computer-generated headlines was created and a user study indicated that humans were only able to identify the fake headlines in 47.8% of the cases. However, the most accurate automatic approach, transformers, achieved an overall accuracy of 85.7%, indicating that content generated from language models can be filtered out accurately.

Improving Cross-Domain Hate Speech Detection by Reducing the False Positive Rate
Ilia Markov | Walter Daelemans

Hate speech detection is an actively growing field of research with a variety of recently proposed approaches that have advanced the state of the art. One of the challenges of such automated approaches, namely recent deep learning models, is the risk of false positives (i.e., false accusations), which may lead to over-blocking or removal of harmless social media content in applications with little moderator intervention. We evaluate deep learning models under both in-domain and cross-domain hate speech detection conditions, and introduce an SVM approach that significantly improves the state-of-the-art results when combined with the deep learning models through a simple majority-voting ensemble. The improvement is mainly due to a reduction of the false positive rate.
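
A simple majority-voting ensemble of the kind mentioned in this abstract can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the three prediction lists are made-up binary labels (1 = hate speech, 0 = not), and the names `majority_vote`, `deep_model_a`, `deep_model_b`, and `svm_model` are hypothetical. With an odd number of binary classifiers, ties cannot occur.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists into one list by majority vote.

    predictions_per_model: list of equal-length label lists, one per model.
    """
    ensemble = []
    for labels in zip(*predictions_per_model):
        # Pick the label predicted by the most models for this example.
        ensemble.append(Counter(labels).most_common(1)[0][0])
    return ensemble


# Hypothetical predictions from two deep learning models and one SVM.
deep_model_a = [1, 0, 1, 1]
deep_model_b = [1, 1, 0, 1]
svm_model = [0, 0, 0, 1]

combined = majority_vote([deep_model_a, deep_model_b, svm_model])
print(combined)  # [1, 0, 0, 1]
```

In this toy example, the second and third items are each flagged by only one model and are not flagged by the ensemble, which illustrates how voting can lower the false positive rate when the flagging model is wrong.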

Leveraging Community and Author Context to Explain the Performance and Bias of Text-Based Deception Detection Models
Galen Weld | Ellyn Ayton | Tim Althoff | Maria Glenski

Deceptive news posts shared in online communities can be detected with NLP models, and much recent research has focused on the development of such models. In this work, we use characteristics of online communities and authors (the context of how and where content is posted) to explain the performance of a neural network deception detection model and identify sub-populations who are disproportionately affected by model accuracy or failure. We examine who is posting the content, and where the content is posted. We find that while author characteristics are better predictors of deceptive content than community characteristics, both characteristics are strongly correlated with model performance. Traditional performance metrics such as F1 score may fail to capture poor model performance on isolated sub-populations such as specific authors, and as such, more nuanced evaluation of deception detection models is critical.

DamascusTeam at NLP4IF2021: Fighting the Arabic COVID-19 Infodemic on Twitter Using AraBERT
Ahmad Hussein | Nada Ghneim | Ammar Joukhadar

The objective of this work was the introduction of an effective approach based on the AraBERT language model for fighting the COVID-19 infodemic on Twitter. It was arranged in the form of a two-step pipeline, where the first step involved a series of pre-processing procedures to transform Twitter jargon, including emojis and emoticons, into plain text, and the second step exploited a version of AraBERT, which was pre-trained on plain text, to fine-tune and classify the tweets with respect to their label. The use of language models pre-trained on plain texts rather than on tweets was motivated by the necessity to address two critical issues shown by the scientific literature, namely (1) pre-trained language models are widely available in many languages, avoiding the time-consuming and resource-intensive model training directly on tweets from scratch and allowing a focus only on their fine-tuning; (2) available plain text corpora are larger than tweet-only ones, allowing for better performance.

NARNIA at NLP4IF-2021: Identification of Misinformation in COVID-19 Tweets Using BERTweet
Ankit Kumar | Naman Jhunjhunwala | Raksha Agarwal | Niladri Chatterjee

The spread of COVID-19 has been accompanied by widespread misinformation on social media. In particular, the Twitterverse has seen a huge increase in the dissemination of distorted facts and figures. The present work aims at identifying tweets regarding COVID-19 which contain harmful and false information. We have experimented with a number of Deep Learning-based models, including different word embeddings, such as GloVe and ELMo, among others. The BERTweet model achieved the best overall F1-score of 0.881 and secured the third rank on the above task.

iCompass at NLP4IF-2021–Fighting the COVID-19 Infodemic
Wassim Henia | Oumayma Rjab | Hatem Haddad | Chayma Fourati

This paper provides a detailed overview of the system and its outcomes, which were produced as part of the NLP4IF Shared Task on Fighting the COVID-19 Infodemic at NAACL 2021. This task is accomplished using a variety of techniques. We used state-of-the-art contextualized text representation models that were fine-tuned for the downstream task at hand. ARBERT, MARBERT, AraBERT, Arabic ALBERT and BERT-base-arabic were used. According to the results, BERT-base-arabic achieved the highest F1 score, 0.784, on the test set.

Proceedings of the Third Workshop on Narrative Understanding

Nader Akoury | Faeze Brahman | Snigdha Chaturvedi | Elizabeth Clark | Mohit Iyyer | Lara J. Martin

Document-level Event Extraction with Efficient End-to-end Learning of Cross-event Dependencies
Kung-Hsiang Huang | Nanyun Peng

Fully understanding narratives often requires identifying events in the context of whole documents and modeling the event relations. However, document-level event extraction is a challenging task as it requires the extraction of event and entity coreference, and capturing arguments that span across different sentences. Existing work on event extraction is usually confined to extracting events from single sentences, which fails to capture the relationships between the event mentions at the scale of a document, as well as the event arguments that appear in a different sentence than the event trigger. In this paper, we propose an end-to-end model leveraging Deep Value Networks (DVN), a structured prediction algorithm, to efficiently capture cross-event dependencies for document-level event extraction. Experimental results show that our approach achieves comparable performance to CRF-based models on ACE05, while enjoying significantly higher computational efficiency.

Gender and Representation Bias in GPT-3 Generated Stories
Li Lucy | David Bamman

Using topic modeling and lexicon-based word similarity, we find that stories generated by GPT-3 exhibit many known gender stereotypes. Generated stories depict different topics and descriptions depending on GPT-3’s perceived gender of the character in a prompt, with feminine characters more likely to be associated with family and appearance, and described as less powerful than masculine characters, even when associated with high power verbs in a prompt. Our study raises questions on how one can avoid unintended social biases when using large language models for storytelling.

Transformer-based Screenplay Summarization Using Augmented Learning Representation with Dialogue Information
Myungji Lee | Hongseok Kwon | Jaehun Shin | WonKee Lee | Baikjin Jung | Jong-Hyeok Lee

Screenplay summarization is the task of extracting informative scenes from a screenplay. The screenplay contains turning point (TP) events that change the story direction and thus define the story structure decisively. Accordingly, this task can be defined as the TP identification task. We suggest using dialogue information, one attribute of screenplays, motivated by previous work that discovered that TPs have a relation with dialogues appearing in screenplays. To teach a model this characteristic, we add a dialogue feature to the input embedding. Moreover, in an attempt to improve the model architecture of previous studies, we replace LSTM with Transformer. We observed that the model can better identify TPs in a screenplay by using dialogue information and that a model adopting Transformer outperforms LSTM-based models.

Plug-and-Blend: A Framework for Controllable Story Generation with Blended Control Codes
Zhiyu Lin | Mark Riedl

We describe a Plug-and-Play controllable language generation framework, Plug-and-Blend, that allows a human user to input multiple control codes (topics). In the context of automated story generation, this gives a human user loose or fine-grained control of the topics that will appear in the generated story, and can even allow for overlapping, blended topics. We show that our framework, working with different generation models, controls the generation towards given continuous-weighted control codes while keeping the generated sentences fluent, demonstrating strong blending capability.

Proceedings of the Third Workshop on Privacy in Natural Language Processing

Oluwaseyi Feyisetan | Sepideh Ghanavati | Shervin Malmasi | Patricia Thaine

Learning and Evaluating a Differentially Private Pre-trained Language Model
Shlomo Hoory | Amir Feder | Avichai Tendler | Alon Cohen | Sofia Erell | Itay Laish | Hootan Nakhost | Uri Stemmer | Ayelet Benjamini | Avinatan Hassidim | Yossi Matias

Contextual language models have led to significantly better results on a plethora of language understanding tasks, especially when pre-trained on the same data as the downstream task. While this additional pre-training usually improves performance, it can lead to information leakage and therefore risks the privacy of individuals mentioned in the training data. One method to guarantee the privacy of such individuals is to train a differentially-private model, but this usually comes at the expense of model performance. Moreover, it is hard to tell, given a privacy parameter $\epsilon$, what the effect on the trained representation was. In this work we aim to guide future practitioners and researchers on how to improve privacy while maintaining good model performance. We demonstrate how to train a differentially-private pre-trained language model (i.e., BERT) with a privacy guarantee of $\epsilon=1$ and with only a small degradation in performance. We experiment on a dataset of clinical notes with a model trained on a target entity extraction task, and compare it to a similar model trained without differential privacy. Finally, we present experiments showing how to interpret the differentially-private representation and understand the information lost and maintained in this process.

pdf bib
Using Confidential Data for Domain Adaptation of Neural Machine Translation
Sohyung Kim | Arianna Bisazza | Fatih Turkmen

We study the problem of domain adaptation in Neural Machine Translation (NMT) when domain-specific data cannot be shared due to confidentiality or copyright issues. As a first step, we propose to fragment data into phrase pairs and use a random sample to fine-tune a generic NMT model instead of the full sentences. Despite the loss of long segments for the sake of confidentiality protection, we find that NMT quality can considerably benefit from this adaptation, and that further gains can be obtained with a simple tagging technique.

pdf bib
Private Text Classification with Convolutional Neural Networks
Samuel Adams | David Melanson | Martine De Cock

Text classifiers are regularly applied to personal texts, leaving users of these classifiers vulnerable to privacy breaches. We propose a solution for privacy-preserving text classification that is based on Convolutional Neural Networks (CNNs) and Secure Multiparty Computation (MPC). Our method enables the inference of a class label for a personal text in such a way that (1) the owner of the personal text does not have to disclose their text to anyone in an unencrypted manner, and (2) the owner of the text classifier does not have to reveal the trained model parameters to the text owner or to anyone else. To demonstrate the feasibility of our protocol for practical private text classification, we implemented it in the PyTorch-based MPC framework CrypTen, using a well-known additive secret sharing scheme in the honest-but-curious setting. We test the runtime of our privacy-preserving text classifier, which is fast enough to be used in practice.
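The additive secret sharing idea mentioned in this abstract can be illustrated with a minimal plain-Python sketch. This is not the CrypTen API and not the authors' protocol; the ring size `MODULUS` and the function names `share`, `reconstruct`, and `add_shared` are assumptions chosen for illustration only.

```python
import secrets

# Illustrative ring size (an assumption, not taken from the paper).
MODULUS = 2**64


def share(value, n_parties=2):
    """Split an integer into n additive shares modulo MODULUS."""
    # All but the last share are uniformly random.
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    # The last share is chosen so that all shares sum to the secret.
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def reconstruct(shares):
    """Recover the secret by summing all shares modulo MODULUS."""
    return sum(shares) % MODULUS


def add_shared(a_shares, b_shares):
    """Add two shared secrets; each party only adds its own local shares."""
    return [(a + b) % MODULUS for a, b in zip(a_shares, b_shares)]
```

In this sketch no single share reveals the secret, yet additions can be computed locally by each party. Multiplications, which a CNN also needs, require extra machinery that is omitted here.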

Proceedings of the Second Workshop on Scholarly Document Processing

pdf bib
Proceedings of the Second Workshop on Scholarly Document Processing
Iz Beltagy | Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Keith Hall | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Robert M. Patton | Michal Shmueli-Scheuer | Anita de Waard | Kuansan Wang | Lucy Lu Wang

pdf bib
Keyphrase Extraction from Scientific Articles via Extractive Summarization
Chrysovalantis Giorgos Kontoulis | Eirini Papagiannopoulou | Grigorios Tsoumakas

Automatically extracting keyphrases from scholarly documents leads to a valuable concise representation that humans can understand and machines can process for tasks such as information retrieval, article clustering and article classification. This paper is concerned with the parts of a scientific article that should be given as input to keyphrase extraction methods. Recent deep learning methods take titles and abstracts as input due to the increased computational complexity in processing long sequences, whereas traditional approaches can also work with full-texts. Titles and abstracts are dense in keyphrases, but often miss important aspects of the articles, while full-texts on the other hand are richer in keyphrases but much noisier. To address this trade-off, we propose the use of extractive summarization models on the full-texts of scholarly documents. Our empirical study on 3 article collections using 3 keyphrase extraction methods shows promising results.

pdf bib
The Effect of Pretraining on Extractive Summarization for Scientific Documents
Yash Gupta | Pawan Sasanka Ammanamanchi | Shikha Bordia | Arjun Manoharan | Deepak Mittal | Ramakanth Pasunuru | Manish Shrivastava | Maneesh Singh | Mohit Bansal | Preethi Jyothi

Large pretrained models have seen enormous success in extractive summarization tasks. In this work, we investigate the influence of pretraining on a BERT-based extractive summarization system for scientific documents. We derive significant performance improvements using an intermediate pretraining step that leverages existing summarization datasets and report state-of-the-art results on a recently released scientific summarization dataset, SciTLDR. We systematically analyze the intermediate pretraining step by varying the size and domain of the pretraining corpus, changing the length of the input sequence in the target task and varying target tasks. We also investigate how intermediate pretraining interacts with contextualized word embeddings trained on different domains.

pdf bib
Finding Pragmatic Differences Between Disciplines
Lee Kezar | Jay Pujara

Scholarly documents have a great degree of variation, both in terms of content (semantics) and structure (pragmatics). Prior work in scholarly document understanding emphasizes semantics through document summarization and corpus topic modeling but tends to omit pragmatics such as document organization and flow. Using a corpus of scholarly documents across 19 disciplines and state-of-the-art language modeling techniques, we learn a fixed set of domain-agnostic descriptors for document sections and retrofit the corpus to these descriptors (also referred to as normalization). Then, we analyze the position and ordering of these descriptors across documents to understand the relationship between discipline and structure. We report within-discipline structural archetypes, variability, and between-discipline comparisons, supporting the hypothesis that scholarly communities, despite their size, diversity, and breadth, share similar avenues for expressing their work. Our findings lay the foundation for future work in assessing research quality, domain style transfer, and further pragmatic analysis.

pdf bib
Extractive Research Slide Generation Using Windowed Labeling Ranking
Athar Sefid | Prasenjit Mitra | Jian Wu | C Lee Giles

Presentation slides generated from original research papers provide an efficient form to present research innovations. Manually generating presentation slides is labor-intensive. We propose a method to automatically generate slides for scientific articles based on a corpus of 5000 paper-slide pairs compiled from conference proceedings websites. The sentence labeling module of our method is based on SummaRuNNer, a neural sequence model for extractive summarization. Instead of ranking sentences based on semantic similarities in the whole document, our algorithm measures the importance and novelty of sentences by combining semantic and lexical features within a sentence window. Our method outperforms several baseline methods including SummaRuNNer by a significant margin in terms of ROUGE score.

pdf bib
Unsupervised document summarization using pre-trained sentence embeddings and graph centrality
Juan Ramirez-Orta | Evangelos Milios

This paper describes our submission for the LongSumm task in SDP 2021. We propose a method for incorporating sentence embeddings produced by deep language models into extractive summarization techniques based on graph centrality in an unsupervised manner. The proposed method is simple, fast, can summarize any kind of document of any size and can satisfy any length constraints for the summaries produced. The method offers competitive performance to more sophisticated supervised methods and can serve as a proxy for abstractive summarization techniques.
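As a rough illustration of centrality-based extractive summarization over sentence embeddings, the sketch below scores each sentence by its summed cosine similarity to all other sentences (degree centrality) and keeps the top-scoring ones in document order. The choice of degree centrality, the function names, and the toy embeddings in the usage are assumptions; the abstract does not specify which centrality measure or embedding model the authors use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def summarize(sentences, embeddings, max_sentences):
    """Pick the most central sentences and return them in document order."""
    n = len(sentences)
    # Degree centrality: total similarity of a sentence to every other one.
    centrality = [
        sum(cosine(embeddings[i], embeddings[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    ranked = sorted(range(n), key=lambda i: centrality[i], reverse=True)
    chosen = sorted(ranked[:max_sentences])  # restore original ordering
    return [sentences[i] for i in chosen]
```

A hypothetical call would be `summarize(sents, vectors, max_sentences=3)`, where `vectors` come from any sentence encoder. The length constraint is expressed here as a sentence count, which is a simplification.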

pdf bib
QMUL-SDS at SCIVER: Step-by-Step Binary Classification for Scientific Claim Verification
Xia Zeng | Arkaitz Zubiaga

Scientific claim verification is a unique challenge that is attracting increasing interest. The SCIVER shared task offers a benchmark scenario to test and compare claim verification approaches by participating teams and consists of three steps: relevant abstract selection, rationale selection and label prediction. In this paper, we present team QMUL-SDS’s participation in the shared task. We propose an approach that performs scientific claim verification by doing binary classifications step-by-step. We trained a BioBERT-large classifier to select abstracts based on pairwise relevance assessments for each (claim, abstract title) pair and continued to train it to select rationales out of each retrieved abstract based on (claim, sentence) pairs. We then propose a two-step setting for label prediction, i.e., first predicting NOT_ENOUGH_INFO or ENOUGH_INFO, then labeling those marked as ENOUGH_INFO as either SUPPORT or CONTRADICT. Compared to the baseline system, we achieve substantial improvements on the dev set. As a result, our team is the No. 4 team on the leaderboard.
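The two-step label prediction described in this abstract can be sketched as a cascade of two binary decisions. The function name `predict_label` and the two classifier callables are hypothetical placeholders; in the paper these decisions are made by fine-tuned neural classifiers, which are not reproduced here.

```python
def predict_label(claim, rationale, has_enough_info, supports):
    """Cascade two binary classifiers into a three-way label.

    has_enough_info and supports are placeholder callables that take
    (claim, rationale) and return a bool; they stand in for trained models.
    """
    # Step 1: decide whether the rationale is informative at all.
    if not has_enough_info(claim, rationale):
        return "NOT_ENOUGH_INFO"
    # Step 2: only informative pairs are split into SUPPORT or CONTRADICT.
    return "SUPPORT" if supports(claim, rationale) else "CONTRADICT"
```

One consequence of this design is that the second classifier never sees pairs rejected by the first, so each model only has to learn a single binary boundary.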

Proceedings of the Third Workshop on Computational Typology and Multilingual NLP

pdf bib
Proceedings of the Third Workshop on Computational Typology and Multilingual NLP
Ekaterina Vylomova | Elizabeth Salesky | Sabrina Mielke | Gabriella Lapesa | Ritesh Kumar | Harald Hammarström | Ivan Vulić | Anna Korhonen | Roi Reichart | Edoardo Maria Ponti | Ryan Cotterell

pdf bib
OTEANN: Estimating the Transparency of Orthographies with an Artificial Neural Network
Xavier Marjou

To transcribe spoken language to written medium, most alphabets enable an unambiguous sound-to-letter rule. However, some writing systems have distanced themselves from this simple concept and little work exists in Natural Language Processing (NLP) on measuring such distance. In this study, we use an Artificial Neural Network (ANN) model to evaluate the transparency between written words and their pronunciation, hence its name Orthographic Transparency Estimation with an ANN (OTEANN). Based on datasets derived from Wikimedia dictionaries, we trained and tested this model to score the percentage of false predictions in phoneme-to-grapheme and grapheme-to-phoneme translation tasks. The scores obtained on 17 orthographies were in line with the estimations of other studies. Interestingly, the model also provided insight into typical mistakes made by learners who only consider the phonemic rule in reading and writing.

pdf bib
Improving Cross-Lingual Sentiment Analysis via Conditional Language Adversarial Nets
Hemanth Kandula | Bonan Min

Sentiment analysis has come a long way for high-resource languages due to the availability of large annotated corpora. However, it still suffers from lack of training data for low-resource languages. To tackle this problem, we propose Conditional Language Adversarial Network (CLAN), an end-to-end neural architecture for cross-lingual sentiment analysis without cross-lingual supervision. CLAN differs from prior work in that it allows the adversarial training to be conditioned on both learned features and the sentiment prediction, to increase discriminativity for learned representation in the cross-lingual setting. Experimental results demonstrate that CLAN outperforms previous methods on the multilingual multi-domain Amazon review dataset. Our source code is released at https://github.com/hemanthkandula/clan.

pdf bib
Anlirika: An LSTM-CNN Flow Twister for Spoken Language Identification
Andreas Scherbakov | Liam Whittle | Ritesh Kumar | Siddharth Singh | Matthew Coleman | Ekaterina Vylomova

The paper presents Anlirika’s submission to the SIGTYP 2021 Shared Task on Robust Spoken Language Identification. The task aims at building a robust system that generalizes well across different domains and speakers. The training data is limited to a single domain with predominantly a single speaker per language, while the validation and test data samples are derived from diverse datasets and multiple speakers. We experiment with a neural system comprising a combination of dense, convolutional, and recurrent layers that are designed to perform better generalization and obtain speaker-invariant representations. We demonstrate that the task in its constrained form (without making use of external data or augmenting the training set with samples from the validation set) is still challenging. Our best system trained on the data augmented with validation samples achieves 29.9% accuracy on the test data.

Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task

pdf bib
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task
Arjun Magge | Ari Klein | Antonio Miranda-Escalada | Mohammed Ali Al-garadi | Ilseyar Alimova | Zulfat Miftahutdinov | Eulalia Farre-Maduell | Salvador Lima Lopez | Ivan Flores | Karen O'Connor | Davy Weissenbacher | Elena Tutubalina | Abeed Sarker | Juan M Banda | Martin Krallinger | Graciela Gonzalez-Hernandez

pdf bib
View Distillation with Unlabeled Data for Extracting Adverse Drug Effects from User-Generated Data
Payam Karisani | Jinho D. Choi | Li Xiong

We present an algorithm based on multi-layer transformers for identifying Adverse Drug Reactions (ADR) in social media data. Our model relies on the properties of the problem and the characteristics of contextual word embeddings to extract two views from documents. Then a classifier is trained on each view to label a set of unlabeled documents to be used as an initializer for a new classifier in the other view. Finally, the initialized classifier in each view is further trained using the initial training examples. We evaluated our model on the largest publicly available ADR dataset. The experiments show that our model significantly outperforms the transformer-based models pretrained on domain-specific data.

pdf bib
The ProfNER shared task on automatic recognition of occupation mentions in social media: systems, evaluation, guidelines, embeddings and corpora
Antonio Miranda-Escalada | Eulàlia Farré-Maduell | Salvador Lima-López | Luis Gascó | Vicent Briva-Iglesias | Marvin Agüero-Torales | Martin Krallinger

Detection of occupations in texts is relevant for a range of important application scenarios, like competitive intelligence, sociodemographic analysis, legal NLP or health-related occupational data mining. Despite the importance and heterogeneous data types that mention occupations, text mining efforts to recognize them have been limited. This is due to the lack of clear annotation guidelines and high-quality Gold Standard corpora. Social media data can be regarded as a relevant source of information for real-time monitoring of at-risk occupational groups in the context of pandemics like the COVID-19 one, facilitating intervention strategies for occupations in direct contact with infectious agents or affected by mental health issues. To evaluate current NLP methods and to generate resources, we have organized the ProfNER track at SMM4H 2021, providing ProfNER participants with a Gold Standard corpus of manually annotated tweets (human IAA of 0.919) following annotation guidelines available in Spanish and English, an occupation gazetteer, a machine-translated version of tweets, and FastText embeddings. Out of 35 registered teams, 11 submitted a total of 27 runs. Best-performing participants built systems based on recent NLP technologies (e.g., transformers) and achieved 0.93 F-score in Text Classification and 0.839 in Named Entity Recognition. Corpus: https://doi.org/10.5281/zenodo.4309356

pdf bib
Transformer-based Multi-Task Learning for Adverse Effect Mention Analysis in Tweets
George-Andrei Dima | Dumitru-Clementin Cercel | Mihai Dascalu

This paper presents our contribution to the Social Media Mining for Health Applications Shared Task 2021. We addressed all three subtasks of Task 1: Subtask A (classification of tweets containing adverse effects), Subtask B (extraction of text spans containing adverse effects) and Subtask C (adverse effects resolution). We explored various pre-trained transformer-based language models and we focused on a multi-task training architecture. For the first subtask, we also applied adversarial augmentation techniques and we formed model ensembles in order to improve the robustness of the prediction. Our system ranked first at Subtask B with 0.51 F1 score, 0.514 precision and 0.514 recall. For Subtask A we obtained 0.44 F1 score, 0.49 precision and 0.39 recall and for Subtask C we obtained 0.16 F1 score with 0.16 precision and 0.17 recall.

pdf bib
UACH-INAOE at SMM4H: a BERT based approach for classification of COVID-19 Twitter posts
Alberto Valdes | Jesus Lopez | Manuel Montes

This work describes the participation of the Universidad Autónoma de Chihuahua - Instituto Nacional de Astrofísica, Óptica y Electrónica team at the Social Media Mining for Health Applications (SMM4H) 2021 shared task. Our team participated in Tasks 5 and 6, both focused on the automatic classification of Twitter posts related to COVID-19. Task 5 was oriented towards solving a binary classification problem, trying to identify self-reporting tweets of potential cases of COVID-19. The objective of Task 6 was to classify tweets containing COVID-19 symptoms. For both tasks we used models based on bidirectional encoder representations from transformers (BERT). Our objective was to determine if a model pretrained on a corpus in the domain of interest can outperform one trained on a much larger general domain corpus. Our F1 results were encouraging, 0.77 and 0.95 for Tasks 5 and 6 respectively, having achieved the highest score among all the participants in the latter.

pdf bib
Word Embeddings, Cosine Similarity and Deep Learning for Identification of Professions & Occupations in Health-related Social Media
Sergio Santamaría Carrasco | Roberto Cuervo Rosillo

ProfNER-ST focuses on the recognition of professions and occupations from Twitter using Spanish data. Our participation is based on a combination of word-level embeddings, including pre-trained Spanish BERT, as well as cosine similarity computed over a subset of entities that serve as input for an encoder-decoder architecture with attention mechanism. Finally, our best score achieved an F1-measure of 0.823 in the official test set.

pdf bib
A Joint Training Approach to Tweet Classification and Adverse Effect Extraction and Normalization for SMM4H 2021
Mohab Elkaref | Lamiece Hassan

In this work we describe our submissions to the Social Media Mining for Health (SMM4H) 2021 Shared Task. We investigated the effectiveness of a joint training approach to Task 1, specifically classification, extraction and normalization of Adverse Drug Effect (ADE) mentions in English tweets. Our approach performed well on the normalization task, achieving an above average F1 score of 24%, but less so on classification and extraction, with F1 scores of 22% and 37% respectively. Our experiments also showed that a larger dataset with more negative results led to stronger results than a smaller more balanced dataset, even when both datasets have the same positive examples. Finally we also submitted a tuned BERT model for Task 6: Classification of Covid-19 tweets containing symptoms, which achieved an above average F1 score of 96%.

pdf bib
Identification of profession & occupation in Health-related Social Media using tweets in Spanish
Victoria Pachón | Jacinto Mata Vázquez | Juan Luís Domínguez Olmedo

In this paper we present our approach and system description for Task 7a of ProfNER-ST: Identification of profession & occupation in Health related Social Media. Our main contribution is to show the effectiveness of using BETO-Spanish BERT as a model based on transformers pretrained with a Spanish Corpus for classification tasks. In our experiments we compared several architectures based on transformers with others based on classical machine learning algorithms. With this approach, we achieved an F1-score of 0.92 in the evaluation process.

pdf bib
UoB at ProfNER 2021: Data Augmentation for Classification Using Machine Translation
Frances Adriana Laureano De Leon | Harish Tayyar Madabushi | Mark Lee

This paper describes the participation of the UoB-NLP team in the ProfNER-ST shared subtask 7a. The task was aimed at detecting the mention of professions in social media text. Our team experimented with two methods of improving the performance of pre-trained models. Specifically, we experimented with data augmentation through translation and the merging of multiple language inputs to meet the objective of the task. While the best performing model on the test data consisted of mBERT fine-tuned on augmented data using back-translation, the improvement is minor, possibly because multi-lingual pre-trained models such as mBERT already have access to the kind of information provided through back-translation and bilingual data.

pdf bib
PAII-NLP at SMM4H 2021: Joint Extraction and Normalization of Adverse Drug Effect Mentions in Tweets
Zongcheng Ji | Tian Xia | Mei Han

This paper describes our system developed for subtask 1c of the sixth Social Media Mining for Health Applications (SMM4H) shared task in 2021. The aim of the subtask is to recognize the adverse drug effect (ADE) mentions from tweets and normalize the identified mentions to their mapping MedDRA preferred term IDs. Our system is based on a neural transition-based joint model, which performs recognition and normalization simultaneously. Our final two submissions outperform the average F1 score by 1-2%.

Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media

pdf bib
Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media
Lun-Wei Ku | Cheng-Te Li

pdf bib
A Case Study of In-House Competition for Ranking Constructive Comments in a News Service
Hayato Kobayashi | Hiroaki Taguchi | Yoshimune Tabuchi | Chahine Koleejan | Ken Kobayashi | Soichiro Fujita | Kazuma Murao | Takeshi Masuyama | Taichi Yatsuka | Manabu Okumura | Satoshi Sekine

Ranking the user comments posted on a news article is important for online news services because comment visibility directly affects the user experience. Research on ranking comments with different metrics to measure the comment quality has shown constructiveness used in argument analysis is promising from a practical standpoint. In this paper, we report a case study in which this constructiveness is examined in the real world. Specifically, we examine an in-house competition to improve the performance of ranking constructive comments and demonstrate the effectiveness of the best obtained model for a commercial service.

pdf bib
Quantifying the Effects of COVID-19 on Restaurant Reviews
Ivy Cao | Zizhou Liu | Giannis Karamanolakis | Daniel Hsu | Luis Gravano

The COVID-19 pandemic has implications beyond physical health, affecting society and economies. Government efforts to slow down the spread of the virus have had a severe impact on many businesses, including restaurants. Mandatory policies such as restaurant closures, bans on social gatherings, and social distancing restrictions have affected restaurant operations as well as customer preferences (e.g., prompting a demand for stricter hygiene standards). As of now, however, it is not clear how and to what extent the pandemic has affected restaurant reviews, an analysis of which could potentially inform policies for addressing this ongoing situation. In this work, we present our efforts to understand the effects of COVID-19 on restaurant reviews, with a focus on Yelp reviews produced during the pandemic for New York City and Los Angeles County restaurants. Overall, we make the following contributions. First, we assemble a dataset of 600 reviews with manual annotations of fine-grained COVID-19 aspects related to restaurants (e.g., hygiene practices, service changes, sympathy and support for local businesses). Second, we address COVID-19 aspect detection using supervised classifiers, weakly-supervised approaches based on keywords, and unsupervised topic modeling approaches, and experimentally show that classifiers based on pre-trained BERT representations achieve the best performance (F1=0.79). Third, we analyze the number and evolution of COVID-related aspects over time and show that the resulting time series have substantial correlation (Spearman’s $\rho = 0.84$) with critical statistics related to the COVID-19 pandemic, including the number of new COVID-19 cases. To our knowledge, this is the first work analyzing the effects of COVID-19 on Yelp restaurant reviews and could potentially inform policies by public health departments, for example, to cover resource utilization.
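For readers unfamiliar with the Spearman correlation reported above, the following is a minimal pure-Python sketch of how such a coefficient can be computed between two time series: each series is converted to ranks (ties receive the average rank) and the Pearson correlation of the ranks is returned. The function names are illustrative assumptions, and this is not the authors' analysis code.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

The sketch assumes both series have the same length and that neither is constant, since a constant series has zero rank variance and the ratio is undefined.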

pdf bib
Assessing Cognitive Linguistic Influences in the Assignment of Blame
Karen Zhou | Ana Smith | Lillian Lee

Lab studies in cognition and the psychology of morality have proposed some thematic and linguistic factors that influence moral reasoning. This paper assesses how well the findings of these studies generalize to a large corpus of over 22,000 descriptions of fraught situations posted to a dedicated forum. At this social-media site, users judge whether or not an author is in the wrong with respect to the event that the author described. We find that, consistent with lab studies, there are statistically significant differences in uses of first-person passive voice, as well as first-person agents and patients, between descriptions of situations that receive different blame judgments. These features also aid performance in the task of predicting the eventual collective verdicts.

pdf bib
Evaluating Deception Detection Model Robustness To Linguistic Variation
Maria Glenski | Ellyn Ayton | Robin Cosbey | Dustin Arendt | Svitlana Volkova

With the increasing use of machine-learning driven algorithmic judgements, it is critical to develop models that are robust to evolving or manipulated inputs. We propose an extensive analysis of model robustness against linguistic variation in the setting of deceptive news detection, an important task in the context of misinformation spread online. We consider two prediction tasks and compare three state-of-the-art embeddings to highlight consistent trends in model performance, high confidence misclassifications, and high impact failures. By measuring the effectiveness of adversarial defense strategies and evaluating model susceptibility to adversarial attacks using character- and word-perturbed text, we find that character or mixed ensemble models are the most effective defenses and that character perturbation-based attack tactics are more successful.

pdf bib
Using Noisy Self-Reports to Predict Twitter User Demographics
Zach Wood-Doughty | Paiheng Xu | Xiao Liu | Mark Dredze

Computational social science studies often contextualize content analysis within standard demographics. Since demographics are unavailable on many social media platforms (e.g. Twitter), numerous studies have inferred demographics automatically. Despite many studies presenting proof-of-concept inference of race and ethnicity, training of practical systems remains elusive since there are few annotated datasets. Existing datasets are small, inaccurate, or fail to cover the four most common racial and ethnic groups in the United States. We present a method to identify self-reports of race and ethnicity from Twitter profile descriptions. Despite the noise of automated supervision, our self-report datasets enable improvements in classification performance on gold standard self-report survey data. The result is a reproducible method for creating large-scale training resources for race and ethnicity.

pdf bib
PANDORA Talks: Personality and Demographics on Reddit
Matej Gjurković | Mladen Karan | Iva Vukojević | Mihaela Bošnjak | Jan Snajder

Personality and demographics are important variables in social sciences and computational sociolinguistics. However, datasets with both personality and demographic labels are scarce. To address this, we present PANDORA, the first dataset of Reddit comments of 10k users partially labeled with three personality models and demographics (age, gender, and location), including 1.6k users labeled with the well-established Big 5 personality model. We showcase the usefulness of this dataset on three experiments, where we leverage the more readily available data from other personality models to predict the Big 5 traits, analyze gender classification biases arising from psycho-demographic variables, and carry out a confirmatory and exploratory analysis based on psychological theories. Finally, we present benchmark prediction models for all personality and demographic variables.

pdf bib
Room to Grow: Understanding Personal Characteristics Behind Self Improvement Using Social Media
MeiXing Dong | Xueming Xu | Yiwei Zhang | Ian Stewart | Rada Mihalcea

Many people aim for change, but not everyone succeeds. While there are a number of social psychology theories that propose motivation-related characteristics of those who persist with change, few computational studies have explored the motivational stage of personal change. In this paper, we investigate a new dataset consisting of the writings of people who manifest intention to change, some of whom persist while others do not. Using a variety of linguistic analysis techniques, we first examine the writing patterns that distinguish the two groups of people. Persistent people tend to reference more topics related to long-term self-improvement and use a more complicated writing style. Drawing on these consistent differences, we build a classifier that can reliably identify the people more likely to persist, based on their language. Our experiments provide new insights into the motivation-related behavior of people who persist with their intention to change.

pdf bib
Jujeop: Korean Puns for K-pop Stars on Social Media
Soyoung Oh | Jisu Kim | Seungpeel Lee | Eunil Park

Jujeop is a type of pun and a unique way for fans to express their love for the K-pop stars they follow using Korean. One of the unique characteristics of Jujeop is its use of exaggerated expressions to compliment K-pop stars, which contain or lead to humor. Based on this characteristic, Jujeop can be separated into four distinct types, with their own lexical collocations: (1) Fragmenting words to create a twist, (2) Homophones and homographs, (3) Repetition, and (4) Nonsense. Thus, the current study first defines the concept of Jujeop in Korean, manually labels 8.6K comments and annotates the comments to one of the four Jujeop types. With the given annotated corpus, this study presents distinctive characteristics of Jujeop comments compared to the other comments through a classification task. Moreover, with the clustering approach, we proposed a structural dependency within each Jujeop type. We have made our dataset publicly available for future research of Jujeop expressions.

Proceedings of the Fifth Workshop on Teaching NLP

pdf bib
Proceedings of the Fifth Workshop on Teaching NLP
David Jurgens | Varada Kolhatkar | Lucy Li | Margot Mieskes | Ted Pedersen

pdf bib
Teaching a Massive Open Online Course on Natural Language Processing
Ekaterina Artemova | Murat Apishev | Denis Kirianov | Veronica Sarkisyan | Sergey Aksenov | Oleg Serikov

In this paper we present a new Massive Open Online Course on Natural Language Processing, targeted at non-English speaking students. The course lasts 12 weeks; every week consists of lectures, practical sessions and quiz assignments. Three weeks out of 12 are followed by Kaggle-style coding assignments. Our course intends to serve multiple purposes: (i) to familiarize students with the core concepts and methods in NLP, such as language modelling or word or sentence representations, (ii) to show that recent advances, including pre-trained Transformer-based models, are built upon these concepts, (iii) to introduce architectures for the most demanded real-life applications, and (iv) to develop practical skills to process texts in multiple languages. The course was prepared and recorded during 2020 and has so far received positive feedback.

pdf bib
Natural Language Processing 4 All (NLP4All): A New Online Platform for Teaching and Learning NLP Concepts
Rebekah Baglini | Hermes Hjorth

Natural Language Processing offers new insights into language data across almost all disciplines and domains, and allows us to corroborate and/or challenge existing knowledge. The primary hurdles to widening participation in and use of these new research tools are, first, a lack of coding skills in students across K-16, and in the population at large, and second, a lack of knowledge of how NLP-methods can be used to answer questions of disciplinary interest outside of linguistics and/or computer science. To broaden participation in NLP and improve NLP-literacy, we introduced a new web-based tool called Natural Language Processing 4 All (NLP4All). The intended purpose of NLP4All is to help teachers facilitate learning with and about NLP, by providing easy-to-use interfaces to NLP-methods, data, and analyses, making it possible for non- and novice-programmers to learn NLP concepts interactively.

pdf bib
A New Broad NLP Training from Speech to Knowledge
Maxime Amblard | Miguel Couceiro

In 2018, the Master Sc. in NLP opened at IDMC (Institut des Sciences du Digital, du Management et de la Cognition), Université de Lorraine, Nancy, France. Far from being a creation ex nihilo, it is the product of a history and many reflections on the field and its teaching. This article proposes epistemological and critical elements on the opening and maintenance of this still-new master's program in NLP.

pdf bib
A Crash Course on Ethics for Natural Language Processing
Annemarie Friedrich | Torsten Zesch

It is generally agreed in the natural language processing (NLP) community that ethics should be integrated into any curriculum. Being aware of and understanding the relevant core concepts is a prerequisite for following and participating in the discourse on ethical NLP. We here present ready-made teaching material in the form of slides and practical exercises on ethical issues in NLP, which is primarily intended to be integrated into introductory NLP or computational linguistics courses. By making this material freely available, we aim to lower the threshold for adding ethics to the curriculum. We hope that increased awareness will enable students to identify potentially unethical behavior.

pdf bib
MiniVQA - A resource to build your tailored VQA competition
Jean-Benoit Delbrouck

MiniVQA is a Jupyter notebook to build a tailored VQA competition for your students. The notebook generates everything needed to create a classroom competition that engages and inspires your students on the free, self-service Kaggle platform. InClass competitions make machine learning fun.

pdf bib