International Conference on Computational Linguistics (2020)



Proceedings of the 28th International Conference on Computational Linguistics
Donia Scott | Nuria Bel | Chengqing Zong

CharBERT: Character-aware Pre-trained Language Model
Wentao Ma | Yiming Cui | Chenglei Si | Ting Liu | Shijin Wang | Guoping Hu

Most pre-trained language models (PLMs) construct word representations at the subword level with Byte-Pair Encoding (BPE) or its variations, by which out-of-vocabulary (OOV) words are almost entirely avoided. However, those methods split a word into subword units and make the representation incomplete and fragile. In this paper, we propose a character-aware pre-trained language model named CharBERT, which improves on previous methods (such as BERT and RoBERTa) to tackle these problems. We first construct the contextual word embedding for each token from the sequential character representations, then fuse the representations of characters and the subword representations by a novel heterogeneous interaction module. We also propose a new pre-training task named NLM (Noisy LM) for unsupervised character representation learning. We evaluate our method on question answering, sequence labeling, and text classification tasks, both on the original datasets and adversarial misspelling test sets. The experimental results show that our method can significantly improve the performance and robustness of PLMs simultaneously.

A Graph Representation of Semi-structured Data for Web Question Answering
Xingyao Zhang | Linjun Shou | Jian Pei | Ming Gong | Lijie Wen | Daxin Jiang

The abundant semi-structured data on the Web, such as HTML-based tables and lists, provide commercial search engines with a rich information source for question answering (QA). Unlike plain text passages in Web documents, Web tables and lists have inherent structures, which carry semantic correlations among various elements in tables and lists. Many existing studies treat tables and lists as flat documents with pieces of text and do not make good use of semantic information hidden in structures. In this paper, we propose a novel graph representation of Web tables and lists based on a systematic categorization of the components in semi-structured data as well as their relations. We also develop pre-training and reasoning techniques on the graph model for the QA task. Extensive experiments on several real datasets collected from a commercial engine verify the effectiveness of our approach. Our method improves F1 score by 3.90 points over the state-of-the-art baselines.

Is Killed More Significant than Fled? A Contextual Model for Salient Event Detection
Disha Jindal | Daniel Deutsch | Dan Roth

Identifying the key events in a document is critical to holistically understanding its important information. Although measuring the salience of events is highly contextual, most previous work has used a limited representation of events that omits essential information. In this work, we propose a highly contextual model of event salience that uses a rich representation of events, incorporates document-level information and allows for interactions between latent event encodings. Our experimental results on an event salience dataset demonstrate that our model improves over previous work by an absolute 2-4% on standard metrics, establishing a new state-of-the-art performance for the task. We also propose a new evaluation metric that addresses flaws in previous evaluation methodologies. Finally, we discuss the importance of salient event detection for the downstream task of summarization.

Appraisal Theories for Emotion Classification in Text
Jan Hofmann | Enrica Troiano | Kai Sassenberg | Roman Klinger

Automatic emotion categorization has been predominantly formulated as text classification in which textual units are assigned to an emotion from a predefined inventory, for instance following the fundamental emotion classes proposed by Paul Ekman (fear, joy, anger, disgust, sadness, surprise) or Robert Plutchik (adding trust, anticipation). This approach ignores existing psychological theories to some degree, which provide explanations regarding the perception of events. For instance, the description that somebody discovers a snake is associated with fear, based on the appraisal as being an unpleasant and non-controllable situation. This emotion reconstruction is even possible without having access to explicit reports of a subjective feeling (for instance expressing this with the words "I am afraid."). Automatic classification approaches therefore need to learn properties of events as latent variables (for instance that the uncertainty and the mental or physical effort associated with the encounter of a snake leads to fear). With this paper, we propose to make such interpretations of events explicit, following theories of cognitive appraisal of events, and show their potential for emotion classification when being encoded in classification models. Our results show that high-quality appraisal dimension assignments in event descriptions lead to an improvement in the classification of discrete emotion categories. We make our corpus of appraisal-annotated emotion-associated event descriptions publicly available.

A Symmetric Local Search Network for Emotion-Cause Pair Extraction
Zifeng Cheng | Zhiwei Jiang | Yafeng Yin | Hua Yu | Qing Gu

Emotion-cause pair extraction (ECPE) is a new task which aims at extracting the potential clause pairs of emotions and corresponding causes in a document. To tackle this task, a previous study proposed a two-step method which first extracted emotion clauses and cause clauses individually, then paired the emotion and cause clauses, and filtered out the pairs without causality. In contrast to this method, which separates the detection and the matching of emotion and cause into two steps, we propose a Symmetric Local Search Network (SLSN) model to perform the detection and matching simultaneously by local search. SLSN consists of two symmetric subnetworks, namely the emotion subnetwork and the cause subnetwork. Each subnetwork is composed of a clause representation learner and a local pair searcher. The local pair searcher is a specially-designed cross-subnetwork component which can extract the local emotion-cause pairs. Experimental results on the ECPE corpus demonstrate the superiority of our SLSN over existing state-of-the-art methods.

METNet: A Mutual Enhanced Transformation Network for Aspect-based Sentiment Analysis
Bin Jiang | Jing Hou | Wanyue Zhou | Chao Yang | Shihan Wang | Liang Pang

Aspect-based sentiment analysis (ABSA) aims to determine the sentiment polarity of each specific aspect in a given sentence. Existing research has recognized the importance of the aspect for the ABSA task and has derived many interactive learning methods that model context based on the specific aspect. However, current interaction mechanisms are ill-equipped to learn complex sentences with multiple aspects, and these methods underestimate the representation learning of the aspect. In order to solve these two problems, we propose a mutual enhanced transformation network (METNet) for the ABSA task. First, the aspect enhancement module in METNet improves the representation learning of the aspect with contextual semantic features, which gives the aspect more abundant information. Second, METNet designs and implements a hierarchical structure, which enhances the representations of aspect and context iteratively. Experimental results on SemEval 2014 Datasets demonstrate the effectiveness of METNet, and we further show that METNet is outstanding in multi-aspect scenarios.

Affective and Contextual Embedding for Sarcasm Detection
Nastaran Babanejad | Heidar Davoudi | Aijun An | Manos Papagelis

Automatic sarcasm detection from text is an important classification task that can help identify the actual sentiment in user-generated data, such as reviews or tweets. Despite its usefulness, sarcasm detection remains a challenging task, due to a lack of any vocal intonation or facial gestures in textual data. To date, most of the approaches to addressing the problem have relied on hand-crafted affect features, or pre-trained models of non-contextual word embeddings, such as Word2vec. However, these models inherit limitations that render them inadequate for the task of sarcasm detection. In this paper, we propose two novel deep neural network models for sarcasm detection, namely ACE 1 and ACE 2. Given as input a text passage, the models predict whether it is sarcastic (or not). Our models extend the architecture of BERT by incorporating both affective and contextual features. To the best of our knowledge, this is the first attempt to directly alter BERT’s architecture and train it from scratch to build a sarcasm classifier. Extensive experiments on different datasets demonstrate that the proposed models outperform state-of-the-art models for sarcasm detection with significant margins.

Understanding Pre-trained BERT for Aspect-based Sentiment Analysis
Hu Xu | Lei Shu | Philip Yu | Bing Liu

This paper analyzes the pre-trained hidden representations learned from reviews on BERT for tasks in aspect-based sentiment analysis (ABSA). Our work is motivated by the recent progress in BERT-based language models for ABSA. However, it is not clear how the general proxy task of a (masked) language model trained on an unlabeled corpus without annotations of aspects or opinions can provide important features for downstream tasks in ABSA. By leveraging the annotated datasets in ABSA, we investigate both the attentions and the learned representations of BERT pre-trained on reviews. We found that BERT uses very few self-attention heads to encode context words (such as prepositions or pronouns that indicate an aspect) and opinion words for an aspect. Most features in the representation of an aspect are dedicated to the fine-grained semantics of the domain (or product category) and the aspect itself, instead of carrying summarized opinions from its context. We hope this investigation can help future research in improving self-supervised learning, unsupervised learning and fine-tuning for ABSA. The pre-trained model and code can be found at

Integrating External Event Knowledge for Script Learning
Shangwen Lv | Fuqing Zhu | Songlin Hu

Script learning aims to predict the subsequent event according to the existing event chain. Recent studies focus on event co-occurrence to solve this problem. However, few studies integrate external event knowledge. According to our observations, external event knowledge can provide additional information, such as temporal or causal knowledge, for better understanding the event chain and predicting the right subsequent event. In this work, we integrate event knowledge from the ASER (Activities, States, Events and their Relations) knowledge base to help predict the next event. We propose a new approach consisting of a knowledge retrieval stage and a knowledge integration stage. In the knowledge retrieval stage, we select relevant external event knowledge from ASER. In the knowledge integration stage, we propose three methods to integrate external knowledge into our model and infer final answers. Experiments on the widely-used Multi-Choice Narrative Cloze (MCNC) task show our approach achieves state-of-the-art performance compared to other methods.

Heterogeneous Graph Neural Networks to Predict What Happen Next
Jianming Zheng | Fei Cai | Yanxiang Ling | Honghui Chen

Given an incomplete event chain, script learning aims to predict the missing event, which can support a series of NLP applications. Existing work cannot represent the heterogeneous relations well, nor capture the discontinuous event segments that are common in the event chain. To address these issues, we introduce a heterogeneous-event (HeterEvent) graph network. In particular, we employ each unique word and individual event as nodes in the graph, and explore three kinds of edges based on realistic relations (e.g., the relations of word-and-word, word-and-event, event-and-event). We also design a message passing process to realize information interactions among homogeneous or heterogeneous nodes. The discontinuous event segments can then be explicitly modeled by finding the specific path between corresponding nodes in the graph. The experimental results on one-step and multi-step inference tasks demonstrate that our ensemble model HeterEvent[W+E] can outperform existing baselines.
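The graph described in this abstract, with word nodes, event nodes and three edge types, can be illustrated with a small sketch. The edge rules below are assumptions made for illustration (a word links to the events that contain it, words link when they share an event, and consecutive events link to each other); the abstract does not spell out the paper's exact relations, and `build_hetero_graph` is a hypothetical helper name.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Event = Tuple[str, ...]  # an event given as a tuple of words (assumed encoding)

def build_hetero_graph(chain: List[Event]) -> Tuple[List[str], Dict[str, Set[tuple]]]:
    """Build word nodes plus three typed edge sets from an event chain.

    Events are identified by their position in the chain. The edge rules
    are illustrative assumptions, not the paper's definition.
    """
    word_nodes = sorted({w for ev in chain for w in ev})
    edges: Dict[str, Set[tuple]] = {
        "word-word": set(),
        "word-event": set(),
        "event-event": set(),
    }
    for i, ev in enumerate(chain):
        words = sorted(set(ev))
        for w in words:
            edges["word-event"].add((w, i))   # word occurs in event i
        for a, b in combinations(words, 2):
            edges["word-word"].add((a, b))    # words co-occur in one event
        if i + 1 < len(chain):
            edges["event-event"].add((i, i + 1))  # adjacent events in the chain
    return word_nodes, edges

chain = [("customer", "order", "food"), ("waiter", "bring", "food")]
word_nodes, edges = build_hetero_graph(chain)
```

Because "food" is shared by both events, the two events are connected through a word node as well as through the direct event-event edge, which is the kind of path a message passing step could exploit.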

Predicting Stance Change Using Modular Architectures
Aldo Porco | Dan Goldwasser

The ability to change a person’s mind on a given issue depends both on the arguments they are presented with and on their underlying perspectives and biases on that issue. Predicting stance changes requires characterizing both aspects and the interaction between them, especially in realistic settings in which stance changes are very rare. In this paper, we suggest a modular learning approach, which decomposes the task into multiple modules, focusing on different aspects of the interaction between users, their beliefs, and the arguments they are exposed to. Our experiments show that our modular approach achieves significantly better results compared to the end-to-end approach using BERT over the same inputs.

Multimodal Review Generation with Privacy and Fairness Awareness
Xuan-Son Vu | Thanh-Son Nguyen | Duc-Trong Le | Lili Jiang

Users express their opinions towards entities (e.g., restaurants) via online reviews which can be in diverse forms such as text, ratings, and images. Modeling reviews is advantageous for user behavior understanding which, in turn, supports various user-oriented tasks such as recommendation, sentiment analysis, and review generation. In this paper, we propose MG-PriFair, a multimodal neural-based framework, which generates personalized reviews with privacy and fairness awareness. Motivated by the fact that reviews might contain personal information and sentiment bias, we propose a novel differentially private (dp)-embedding model for training privacy-guaranteed embeddings and an evaluation approach for sentiment fairness in the food-review domain. Experiments on our novel review dataset show that MG-PriFair is capable of generating plausibly long reviews while controlling the amount of exploited user data and using the least sentiment-biased word embeddings. To the best of our knowledge, we are the first to bring user privacy and sentiment fairness into the review generation task. The dataset and source codes are available at

Improving Abstractive Dialogue Summarization with Graph Structures and Topic Words
Lulu Zhao | Weiran Xu | Jun Guo

Recently, the abstractive dialogue summarization task has been attracting more attention. Since information flows are exchanged between at least two interlocutors and key elements of an event often span multiple utterances, it is necessary for researchers to explore the inherent relations and structures of dialogue contents. However, existing approaches often process the dialogue with sequence-based models, which struggle to capture long-distance inter-sentence relations. In this paper, we propose a Topic-word Guided Dialogue Graph Attention (TGDGA) network to model the dialogue as an interaction graph according to the topic word information. A masked graph self-attention mechanism is used to integrate cross-sentence information flows and focus more on the related utterances, which helps the model understand the dialogue better. Moreover, the topic word features are introduced to assist the decoding process. We evaluate our model on the SAMSum Corpus and Automobile Master Corpus. The experimental results show that our method outperforms most of the baselines.

Recent Neural Methods on Slot Filling and Intent Classification for Task-Oriented Dialogue Systems: A Survey
Samuel Louvan | Bernardo Magnini

In recent years, fostered by deep learning technologies and by the high demand for conversational AI, various approaches have been proposed that address the capacity to elicit and understand users’ needs in task-oriented dialogue systems. We focus on two core tasks, slot filling (SF) and intent classification (IC), and survey how neural-based models have rapidly evolved to address natural language understanding in dialogue systems. We introduce three neural architectures: independent models, which model SF and IC separately; joint models, which exploit the mutual benefit of the two tasks simultaneously; and transfer learning models, which scale the model to new domains. We discuss the current state of the research in SF and IC, and highlight challenges that still require attention.

Re-framing Incremental Deep Language Models for Dialogue Processing with Multi-task Learning
Morteza Rohanian | Julian Hough

We present a multi-task learning framework to enable the training of one universal incremental dialogue processing model with four tasks of disfluency detection, language modelling, part-of-speech tagging and utterance segmentation in a simple deep recurrent setting. We show that these tasks provide positive inductive biases to each other with optimal contribution of each one relying on the severity of the noise from the task. Our live multi-task model outperforms similar individual tasks, delivers competitive performance and is beneficial for future use in conversational agents in psychiatric treatment.

TIMBERT: Toponym Identifier For The Medical Domain Based on BERT
MohammadReza Davari | Leila Kosseim | Tien Bui

In this paper, we propose an approach to automate the process of place name detection in the medical domain to enable epidemiologists to better study and model the spread of viruses. We created a family of Toponym Identification Models based on BERT (TIMBERT), in order to learn in an end-to-end fashion the mapping from an input sentence to the associated sentence labeled with toponyms. When evaluated with the SemEval 2019 task 12 test set (Weissenbacher et al., 2019), our best TIMBERT model achieves an F1 score of 90.85%, a significant improvement compared to the state-of-the-art of 89.13% (Wang et al., 2019).

Identifying Depressive Symptoms from Tweets: Figurative Language Enabled Multitask Learning Framework
Shweta Yadav | Jainish Chauhan | Joy Prakash Sain | Krishnaprasad Thirunarayan | Amit Sheth | Jeremiah Schumm

Existing studies on using social media for deriving mental health status of users focus on the depression detection task. However, for case management and referral to psychiatrists, health-care workers require a practical and scalable depressive disorder screening and triage system. This study aims to design and evaluate a decision support system (DSS) to reliably determine the depressive triage level by capturing fine-grained depressive symptoms expressed in user tweets through the emulation of the Patient Health Questionnaire-9 (PHQ-9) that is routinely used in clinical practice. The reliable detection of depressive symptoms from tweets is challenging because the 280-character limit on tweets incentivizes the use of creative artifacts in the utterances and figurative usage contributes to effective expression. We propose a novel BERT-based robust multi-task learning framework to accurately identify the depressive symptoms using the auxiliary task of figurative usage detection. Specifically, our proposed novel task sharing mechanism, co-task aware attention, enables automatic selection of optimal information across the BERT layers and tasks by soft-sharing of parameters. Our results show that modeling figurative usage can demonstrably improve the model’s robustness and reliability for distinguishing the depression symptoms.

Probing Multimodal Embeddings for Linguistic Properties: the Visual-Semantic Case
Adam Dahlgren Lindström | Johanna Björklund | Suna Bensch | Frank Drewes

Semantic embeddings have advanced the state of the art for countless natural language processing tasks, and various extensions to multimodal domains, such as visual-semantic embeddings, have been proposed. While the power of visual-semantic embeddings comes from the distillation and enrichment of information through machine learning, their inner workings are poorly understood and there is a shortage of analysis tools. To address this problem, we generalize the notion of probing tasks to the visual-semantic case. To this end, we (i) discuss the formalization of probing tasks for embeddings of image-caption pairs, (ii) define three concrete probing tasks within our general framework, (iii) train classifiers to probe for those properties, and (iv) compare various state-of-the-art embeddings under the lens of the proposed probing tasks. Our experiments reveal an up to 16% increase in accuracy on visual-semantic embeddings compared to the corresponding unimodal embeddings, which suggests that the text and image dimensions represented in the former do complement each other.

Aspect-Category based Sentiment Analysis with Hierarchical Graph Convolutional Network
Hongjie Cai | Yaofeng Tu | Xiangsheng Zhou | Jianfei Yu | Rui Xia

Most aspect-based sentiment analysis research aims at identifying the sentiment polarities toward explicit aspect terms while ignoring implicit aspects in text. To capture both explicit and implicit aspects, we focus on aspect-category based sentiment analysis, which involves joint aspect category detection and category-oriented sentiment classification. However, currently only a few simple studies have focused on this problem. The shortcomings in the way they defined the task make it difficult for their approaches to effectively learn the inner-relations between categories and the inter-relations between categories and sentiments. In this work, we re-formalize the task as a category-sentiment hierarchy prediction problem, which contains a hierarchy output structure to first identify multiple aspect categories in a piece of text, and then predict the sentiment for each of the identified categories. Specifically, we propose a Hierarchical Graph Convolutional Network (Hier-GCN), where a lower-level GCN models the inner-relations among multiple categories, and a higher-level GCN captures the inter-relations between aspect categories and sentiments. Extensive evaluations demonstrate that our hierarchy output structure is superior to existing ones, and the Hier-GCN model can consistently achieve the best results on four benchmarks.

Constituency Lattice Encoding for Aspect Term Extraction
Yunyi Yang | Kun Li | Xiaojun Quan | Weizhou Shen | Qinliang Su

One of the remaining challenges for aspect term extraction in sentiment analysis lies in the extraction of phrase-level aspect terms, since determining the boundaries of such terms is non-trivial. In this paper, we aim to address this issue by incorporating the span annotations of constituents of a sentence to leverage the syntactic information in neural network models. To this end, we first construct a constituency lattice structure based on the constituents of a constituency tree. Then, we present two approaches to encoding the constituency lattice using BiLSTM-CRF and BERT as the base models, respectively. We experimented on two benchmark datasets to evaluate the two models, and the results confirm their superiority, with respective gains of 3.17 and 1.35 points in F1-Measure over the current state of the art. The improvements justify the effectiveness of the constituency lattice for aspect term extraction.

A Dataset and Evaluation Framework for Complex Geographical Description Parsing
Egoitz Laparra | Steven Bethard

Much previous work on geoparsing has focused on identifying and resolving individual toponyms in text like Adrano, S.Maria di Licodia or Catania. However, geographical locations occur not only as individual toponyms, but also as compositions of reference geolocations joined and modified by connectives, e.g., between the towns of Adrano and S.Maria di Licodia, 32 kilometres northwest of Catania. Ideally, a geoparser should be able to take such text, and the geographical shapes of the toponyms referenced within it, and parse these into a geographical shape, formed by a set of coordinates, that represents the location described. But creating a dataset for this complex geoparsing task is difficult and, if done manually, would require a huge amount of effort to annotate the geographical shapes of not only the geolocation described but also the reference toponyms. We present an approach that automates most of the process by combining Wikipedia and OpenStreetMap. As a result, we have gathered a collection of 360,187 uncurated complex geolocation descriptions, from which we have manually curated 1,000 examples intended to be used as a test set. To accompany the data, we define a new geoparsing evaluation framework along with a scoring methodology and a set of baselines.

DocBank: A Benchmark Dataset for Document Layout Analysis
Minghao Li | Yiheng Xu | Lei Cui | Shaohan Huang | Furu Wei | Zhoujun Li | Ming Zhou

Document layout analysis usually relies on computer vision models to understand documents while ignoring textual information that is vital to capture. Meanwhile, high-quality labeled datasets with both visual and textual information are still insufficient. In this paper, we present DocBank, a benchmark dataset that contains 500K document pages with fine-grained token-level annotations for document layout analysis. DocBank is constructed using a simple yet effective way with weak supervision from the LaTeX documents available on the With DocBank, models from different modalities can be compared fairly and multi-modal approaches will be further investigated and boost the performance of document layout analysis. We build several strong baselines and manually split train/dev/test sets for evaluation. Experimental results show that models trained on DocBank accurately recognize the layout information for a variety of documents. The DocBank dataset is publicly available at

A High Precision Pipeline for Financial Knowledge Graph Construction
Sarah Elhammadi | Laks V.S. Lakshmanan | Raymond Ng | Michael Simpson | Baoxing Huai | Zhefeng Wang | Lanjun Wang

Motivated by applications such as question answering, fact checking, and data integration, there is significant interest in constructing knowledge graphs by extracting information from unstructured information sources, particularly text documents. Knowledge graphs have emerged as a standard for structured knowledge representation, whereby entities and their inter-relations are represented and conveniently stored as (subject, predicate, object) triples in a graph that can be used to power various downstream applications. The proliferation of financial news sources reporting on companies, markets, currencies, and stocks presents an opportunity for extracting valuable knowledge about this crucial domain. In this paper, we focus on constructing a knowledge graph automatically by information extraction from a large corpus of financial news articles. For that purpose, we develop a high precision knowledge extraction pipeline tailored for the financial domain. This pipeline combines multiple information extraction techniques with a financial dictionary that we built, all working together to produce over 342,000 compact extractions from over 288,000 financial news articles, with a precision of 78% at the top-100 extractions. The extracted triples are stored in a knowledge graph making them readily available for use in downstream applications.
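As a rough illustration of how (subject, predicate, object) triples can be stored and queried as a graph, here is a minimal in-memory sketch. The `TripleStore` class and the company names are invented for illustration only and are not taken from the paper's pipeline or data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

class TripleStore:
    """Tiny adjacency-list store for (subject, predicate, object) triples."""

    def __init__(self) -> None:
        # subject -> list of (predicate, object) outgoing edges
        self._out: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        edge = (predicate, obj)
        if edge not in self._out[subject]:  # skip exact duplicates
            self._out[subject].append(edge)

    def objects(self, subject: str, predicate: str) -> List[str]:
        """Return all objects linked to `subject` by `predicate`."""
        return [o for p, o in self._out.get(subject, []) if p == predicate]

# Hypothetical extractions, made up for this example.
kg = TripleStore()
kg.add("AcmeCorp", "acquired", "WidgetCo")
kg.add("AcmeCorp", "acquired", "GadgetInc")
kg.add("AcmeCorp", "listed_on", "NYSE")
acquisitions = kg.objects("AcmeCorp", "acquired")
```

A production knowledge graph would more likely live in a dedicated graph database or RDF store; the dictionary here only shows the shape of the data.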

Automatic Charge Identification from Facts: A Few Sentence-Level Charge Annotations is All You Need
Shounak Paul | Pawan Goyal | Saptarshi Ghosh

Automatic Charge Identification (ACI) is the task of identifying the relevant charges given the facts of a situation and the statutory laws that define these charges, and is a crucial aspect of the judicial process. Existing works focus on learning charge-side representations by modeling relationships between the charges, but not much effort has been made in improving fact-side representations. We observe that only a small fraction of sentences in the facts actually indicates the charges. We show that by using a very small subset (3%) of fact descriptions annotated with sentence-level charges, we can achieve an improvement across a range of different ACI models, as compared to modeling just the main document-level task on a much larger dataset. Additionally, we propose a novel model that utilizes sentence-level charge labels as an auxiliary task, coupled with the main task of document-level charge identification in a multi-task learning framework. The proposed model comprehensively outperforms a large number of recent baselines for ACI. The improvement in performance is particularly noticeable for the rare charges which are known to be especially challenging to identify.

Context-Aware Text Normalisation for Historical Dialects
Maria Sukhareva

Context-aware historical text normalisation is a severely under-researched area. To fill the gap we propose a context-aware normalisation approach that relies on the state-of-the-art methods in neural machine translation and transfer learning. We propose a multidialect normaliser with a context-aware reranking of the candidates. The reranker relies on a word-level n-gram language model that is applied to the five best normalisation candidates. The results are evaluated on the historical multidialect datasets of German, Spanish, Portuguese and Slovene. We show that incorporating dialectal information into the training leads to an accuracy improvement on all the datasets. The context-aware reranking gives further improvement over the baseline. For three out of six datasets, we reach a significantly higher accuracy than reported in the previous studies. The other three results are comparable with the current state-of-the-art. The code for the reranker is published as open-source.
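The reranking step described in this abstract, where a word-level n-gram language model rescores the five best normalisation candidates, can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a bigram model with add-one smoothing trained on a toy corpus, and the function names are hypothetical rather than taken from the published reranker.

```python
import math
from collections import Counter
from typing import List, Tuple

def train_bigram(sentences: List[List[str]]) -> Tuple[Counter, Counter]:
    """Count unigrams and bigrams, with a start-of-sentence marker."""
    unigrams: Counter = Counter()
    bigrams: Counter = Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_logprob(prev: str, word: str, unigrams: Counter, bigrams: Counter) -> float:
    """Add-one smoothed log P(word | prev)."""
    vocab_size = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

def rerank(prev: str, candidates: List[str], unigrams: Counter,
           bigrams: Counter, k: int = 5) -> List[str]:
    """Rescore the k best candidates by how well they follow `prev`."""
    top_k = candidates[:k]
    return sorted(top_k,
                  key=lambda w: bigram_logprob(prev, w, unigrams, bigrams),
                  reverse=True)

# Toy normalised corpus (assumed); candidates are ordered by the normaliser.
corpus = [["the", "king", "spoke"], ["the", "king", "left"], ["a", "kind", "word"]]
unigrams, bigrams = train_bigram(corpus)
best_first = rerank("the", ["kind", "king", "kin"], unigrams, bigrams)
```

Here the normaliser's first guess "kind" is demoted because "the king" is attested in the toy corpus while "the kind" is not; a real system would use a longer context and a properly smoothed model.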

Exploring Amharic Sentiment Analysis from Social Media Texts: Building Annotation Tools and Classification Models
Seid Muhie Yimam | Hizkiel Mitiku Alemayehu | Abinew Ayele | Chris Biemann

This paper presents the study of sentiment analysis for Amharic social media texts. As the number of social media users is ever-increasing, social media platforms would like to understand the latent meaning and sentiments of a text to enhance decision-making procedures. However, low-resource languages such as Amharic have received less attention due to several reasons such as lack of well-annotated datasets, unavailability of computing resources, and fewer or no expert researchers in the area. This research addresses three main research questions. We first explore the suitability of existing tools for the sentiment analysis task. Annotation tools are scarce to support large-scale annotation tasks in Amharic. Also, the existing crowdsourcing platforms do not support Amharic text annotation. Hence, we build a social-network-friendly annotation tool called ‘ASAB’ using the Telegram bot. We collect 9.4k tweets, where each tweet is annotated by three Telegram users. Moreover, we explore the suitability of machine learning approaches for Amharic sentiment analysis. The FLAIR deep learning text classifier, based on network embeddings that are computed from a distributional thesaurus, outperforms other supervised classifiers. We further investigate the challenges in building a sentiment analysis system for Amharic and we found that the widespread usage of sarcasm and figurative speech are the main issues in dealing with the problem. To advance the sentiment analysis research in Amharic and other related low-resource languages, we release the dataset, the annotation tool, source code, and models publicly under a permissive.

pdf bib
Effective Few-Shot Classification with Transfer Learning
Aakriti Gupta | Kapil Thadani | Neil O’Hare

Few-shot learning addresses the problem of learning based on a small amount of training data. Although more well-studied in the domain of computer vision, recent work has adapted the Amazon Review Sentiment Classification (ARSC) text dataset for use in the few-shot setting. In this work, we use the ARSC dataset to study a simple application of transfer learning approaches to few-shot classification. We train a single binary classifier to learn all few-shot classes jointly by prefixing class identifiers to the input text. Given the text and class, the model then makes a binary prediction for that text/class pair. Our results show that this simple approach can outperform most published results on this dataset. Surprisingly, we also show that including domain information as part of the task definition only leads to a modest improvement in model accuracy, and zero-shot classification, without further fine-tuning on few-shot domains, performs equivalently to few-shot classification. These results suggest that the classes in the ARSC few-shot task, which are defined by the intersection of domain and rating, are actually very similar to each other, and that a more suitable dataset is needed for the study of few-shot text classification.
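The input construction described above (one binary classifier for all classes, with the class identifier prefixed to the text) might look roughly like the sketch below. The bracketed prefix format, the class identifiers such as `books.t2`, and the `score_fn` stand-in for a trained model are all hypothetical; the abstract does not specify them.

```python
def prefix_input(class_id, text):
    """Prepend the class identifier so a single model can serve every class."""
    return f"[{class_id}] {text}"

def build_joint_training_set(examples_by_class):
    """Flatten per-class (text, label) examples into one binary training set."""
    return [
        (prefix_input(class_id, text), label)
        for class_id, examples in examples_by_class.items()
        for text, label in examples
    ]

def predict(class_id, text, score_fn, threshold=0.5):
    """Binary decision for one text/class pair.

    score_fn is a placeholder for the trained classifier: it maps the
    prefixed input string to a probability-like score.
    """
    return score_fn(prefix_input(class_id, text)) >= threshold

# Hypothetical class identifiers and toy examples.
examples = {
    "books.t2": [("a dull read", 0), ("could not put it down", 1)],
    "dvd.t4": [("great film", 1)],
}
training_set = build_joint_training_set(examples)

# Trivial stand-in scorer, used only to show the calling convention.
def toy_score(prefixed_text):
    return 0.9 if "great" in prefixed_text else 0.1

print(training_set)
print(predict("dvd.t4", "great film", toy_score))
```

Because the class identifier is part of the input, adding a new class needs no change to the model's output layer, only new prefixed examples.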

pdf bib
Meet Changes with Constancy: Learning Invariance in Multi-Source Translation
Jianfeng Liu | Ling Luo | Xiang Ao | Yan Song | Haoran Xu | Jian Ye

Multi-source neural machine translation aims to translate from parallel sources of information (e.g. languages, images, etc.) to a single target language, which has shown better performance than most one-to-one systems. Despite the remarkable success of existing models, they usually neglect the fact that multiple source inputs may have inconsistencies. Such differences might bring noise to the task and limit the performance of existing multi-source NMT approaches due to their indiscriminate usage of input sources for target word predictions. In this paper, we attempt to leverage the potential complementary information among distinct sources and alleviate the occasional conflicts among them. To accomplish that, we propose a source invariance network to learn the invariant information of parallel sources. Such a network can be easily integrated with multi-encoder based multi-source NMT methods (e.g. multi-encoder RNN and transformer) to enhance the translation results. Extensive experiments on two multi-source translation tasks demonstrate that the proposed approach not only achieves clear gains in translation quality but also captures implicit invariance between different sources.

pdf bib
How Relevant Are Selectional Preferences for Transformer-based Language Models?
Eleni Metheniti | Tim Van de Cruys | Nabil Hathout

Selectional preference is defined as the tendency of a predicate to favor particular arguments within a certain linguistic context, and likewise, reject others that result in conflicting or implausible meanings. The stellar success of contextual word embedding models such as BERT in NLP tasks has led many to question whether these models have learned linguistic information, but up till now, most research has focused on syntactic information. We investigate whether BERT contains information on the selectional preferences of words, by examining the probability it assigns to the dependent word given the presence of a head word in a sentence. We use word pairs of head-dependent words in five different syntactic relations from the SP-10K corpus of selectional preference (Zhang et al., 2019b), in sentences from the ukWaC corpus, and calculate the correlation between the plausibility score (from SP-10K) and the model probabilities. Our results show that overall, there is no strong positive or negative correlation in any syntactic relation. However, we do find that certain head words have a strong correlation and that masking all words but the head word yields the most positive correlations in most scenarios, which indicates that the semantics of the predicate is indeed an integral and influential factor for the selection of the argument.

pdf bib
A Retrofitting Model for Incorporating Semantic Relations into Word Embeddings
Sapan Shah | Sreedhar Reddy | Pushpak Bhattacharyya

We present a novel retrofitting model that can leverage relational knowledge available in a knowledge resource to improve word embeddings. The knowledge is captured in terms of relation inequality constraints that compare similarity of related and unrelated entities in the context of an anchor entity. These constraints are used as training data to learn a non-linear transformation function that maps original word vectors to a vector space respecting these constraints. The transformation function is learned in a similarity metric learning setting using Triplet network architecture. We applied our model to synonymy, antonymy and hypernymy relations in WordNet and observed large gains in performance over original distributional models as well as other retrofitting approaches on word similarity task and significant overall improvement on lexical entailment detection task.
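A relation inequality constraint of the kind described above (a related entity should be more similar to the anchor than an unrelated one) is commonly turned into a triplet hinge loss. The sketch below shows only that loss on fixed vectors; the choice of cosine similarity and the margin value are assumptions, and the learned non-linear transformation from the abstract is omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def triplet_loss(anchor, related, unrelated, margin=0.5):
    """Hinge loss that is zero once sim(anchor, related) exceeds
    sim(anchor, unrelated) by at least the margin (margin value is assumed)."""
    return max(0.0, margin - cosine(anchor, related) + cosine(anchor, unrelated))

# Constraint satisfied: the related vector is the closer one.
print(triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
# Constraint violated: the unrelated vector is the closer one.
print(triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0]))
```

In a full model the three vectors would first pass through the transformation being trained, and the loss gradient would update that transformation rather than the original embeddings.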

pdf bib
Lexical Relation Mining in Neural Word Embeddings
Aishwarya Jadhav | Yifat Amir | Zachary Pardos

Work with neural word embeddings and lexical relations has largely focused on confirmatory experiments which use human-curated examples of semantic and syntactic relations to validate against. In this paper, we explore the degree to which lexical relations, such as those found in popular validation sets, can be derived and extended from a variety of neural embeddings using classical clustering methods. We show that the Word2Vec space of word-pairs (i.e., offset vectors) significantly outperforms other more contemporary methods, even in the presence of a large number of noisy offsets. Moreover, we show that via a simple nearest neighbor approach in the offset space, new examples of known relations can be discovered. Our results speak to the amenability of offset vectors from non-contextual neural embeddings to find semantically coherent clusters. This simple approach has implications for the exploration of emergent regularities and their examples, such as emerging trends on social media and their related posts.
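The nearest-neighbor search in offset space mentioned above can be sketched on toy data. The three-dimensional vectors below are invented for illustration and are not real Word2Vec embeddings; cosine similarity as the neighbor criterion is also an assumption.

```python
import math
from itertools import permutations

def offset(emb, a, b):
    """Offset vector of the ordered word pair (a, b): emb[b] - emb[a]."""
    return [y - x for x, y in zip(emb[a], emb[b])]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(p * q for p, q in zip(u, v))
    norm = math.sqrt(sum(p * p for p in u)) * math.sqrt(sum(q * q for q in v))
    return dot / norm if norm else 0.0

def nearest_pairs(emb, seed_pair, k=1):
    """Rank all other ordered word pairs by how closely their offset
    matches the offset of a known example of the relation."""
    target = offset(emb, *seed_pair)
    candidates = [pair for pair in permutations(emb, 2) if pair != seed_pair]
    candidates.sort(key=lambda pair: cosine(offset(emb, *pair), target),
                    reverse=True)
    return candidates[:k]

# Invented toy embeddings, for illustration only.
toy_emb = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [3.0, 0.0, 1.0],
    "queen": [3.0, 1.0, 1.0],
    "apple": [0.0, 0.0, 5.0],
}
print(nearest_pairs(toy_emb, ("man", "woman")))
```

Enumerating all ordered pairs is quadratic in vocabulary size, so a realistic setting would need to restrict or index the candidate pairs.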

pdf bib
BERT-based Cohesion Analysis of Japanese Texts
Nobuhiro Ueda | Daisuke Kawahara | Sadao Kurohashi

The meaning of natural language text is supported by cohesion among various kinds of entities, including coreference relations, predicate-argument structures, and bridging anaphora relations. However, predicate-argument structures for nominal predicates and bridging anaphora relations have not been studied well, and their analysis remains very difficult. Recent advances in neural networks, in particular, self training-based language models including BERT (Devlin et al., 2019), have significantly improved many natural language processing tasks, making it possible to dive into the study of cohesion analysis over the whole text. In this study, we tackle an integrated analysis of cohesion in Japanese texts. Our results significantly outperformed existing studies in each task, with improvements of about 10 to 20 points for both zero anaphora and coreference resolution in particular. Furthermore, we also showed that coreference resolution is different in nature from the other tasks and should be treated specially.

pdf bib
Schema Aware Semantic Reasoning for Interpreting Natural Language Queries in Enterprise Settings
Jaydeep Sen | Tanaya Babtiwale | Kanishk Saxena | Yash Butala | Sumit Bhatia | Karthik Sankaranarayanan

Natural Language Query interfaces allow end-users to access the desired information without the need to know any specialized query language, data storage, or schema details. Even with the recent advances in the NLP research space, state-of-the-art QA systems fall short of understanding the implicit intents of real-world Business Intelligence (BI) queries in enterprise systems, since Natural Language Understanding still remains an AI-hard problem. We posit that deploying ontology reasoning over domain semantics can help in achieving better natural language understanding for QA systems. In this paper, we specifically focus on building a Schema Aware Semantic Reasoning Framework that translates natural language interpretation into a sequence of tasks solvable by an ontology reasoner. We apply our framework on top of an ontology-based, state-of-the-art natural language question-answering system, ATHENA, and experiment with 4 benchmarks focused on BI queries. Our experimental numbers empirically show that Schema Aware Semantic Reasoning indeed helps in achieving significantly better results for handling BI queries, with an average accuracy improvement of ~30%.

pdf bib
What Can We Learn from Noun Substitutions in Revision Histories?
Talita Anthonio | Michael Roth

In community-edited resources such as wikiHow, sentences are subject to revisions on a daily basis. Recent work has shown that resulting improvements over time can be modelled computationally, assuming that each revision contributes to the improvement. We take a closer look at a subset of such revisions, for which we attempt to improve a computational model and validate to what extent the assumption that ‘revised means better’ actually holds. The subset of revisions considered here is noun substitutions, which often involve interesting semantic relations, including synonymy, antonymy and hypernymy. Despite the high semantic relatedness, we find that a supervised classifier can distinguish the revised version of a sentence from an original version with an accuracy close to 70%, when taking context into account. In a human annotation study, we observe that annotators identify the revised sentence as the ‘better version’ with similar performance. Our analysis reveals a fair agreement among annotators when a revision improves fluency. In contrast, noun substitutions that involve other lexical-semantic relationships are often perceived as being equally good or tend to cause disagreements. While these findings are also reflected in classification scores, a comparison of results shows that our model fails in cases where humans can resort to factual knowledge or intuitions about the required level of specificity.

pdf bib
Specializing Unsupervised Pretraining Models for Word-Level Semantic Similarity
Anne Lauscher | Ivan Vulić | Edoardo Maria Ponti | Anna Korhonen | Goran Glavaš

Unsupervised pretraining models have been shown to facilitate a wide range of downstream NLP applications. These models, however, retain some of the limitations of traditional static word embeddings. In particular, they encode only the distributional knowledge available in raw text corpora, incorporated through language modeling objectives. In this work, we complement such distributional knowledge with external lexical knowledge, that is, we integrate the discrete knowledge on word-level semantic similarity into pretraining. To this end, we generalize the standard BERT model to a multi-task learning setting where we couple BERT’s masked language modeling and next sentence prediction objectives with an auxiliary task of binary word relation classification. Our experiments suggest that our Lexically Informed BERT (LIBERT), specialized for the word-level semantic similarity, yields better performance than the lexically blind vanilla BERT on several language understanding tasks. Concretely, LIBERT outperforms BERT in 9 out of 10 tasks of the GLUE benchmark and is on a par with BERT in the remaining one. Moreover, we show consistent gains on 3 benchmarks for lexical simplification, a task where knowledge about word-level semantic similarity is paramount, as well as large gains on lexical reasoning probes.

pdf bib
A Deep Generative Distance-Based Classifier for Out-of-Domain Detection with Mahalanobis Space
Hong Xu | Keqing He | Yuanmeng Yan | Sihong Liu | Zijun Liu | Weiran Xu

Detecting out-of-domain (OOD) input intents is critical in the task-oriented dialog system. Different from most existing methods that rely heavily on manually labeled OOD samples, we focus on the unsupervised OOD detection scenario where there are no labeled OOD samples except for labeled in-domain data. In this paper, we propose a simple but strong generative distance-based classifier to detect OOD samples. We estimate the class-conditional distribution on feature spaces of DNNs via Gaussian discriminant analysis (GDA) to avoid over-confidence problems. And we use two distance functions, Euclidean and Mahalanobis distances, to measure the confidence score of whether a test sample belongs to OOD. Experiments on four benchmark datasets show that our method can consistently outperform the baselines.
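The distance-based scoring described above can be sketched as follows: fit class means and a covariance matrix on in-domain features, then score a test sample by its Mahalanobis distance to the closest class. To stay dependency-free the sketch is restricted to two-dimensional features so the covariance can be inverted in closed form, and it assumes a single covariance shared across classes; both are simplifications for illustration, and the toy points stand in for DNN features.

```python
def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def fit_gda(features_by_class):
    """Estimate class means and the inverse of one covariance matrix
    shared by all classes (shared covariance is an assumption here)."""
    means = {c: mean(pts) for c, pts in features_by_class.items()}
    total = sum(len(pts) for pts in features_by_class.values())
    cov = [[0.0, 0.0], [0.0, 0.0]]
    for c, pts in features_by_class.items():
        for p in pts:
            d = [p[0] - means[c][0], p[1] - means[c][1]]
            for i in range(2):
                for j in range(2):
                    cov[i][j] += d[i] * d[j] / total
    # Closed-form inverse of a 2x2 matrix.
    det = cov[0][0] * cov[1][1] - cov[0][1] * cov[1][0]
    inv = [[cov[1][1] / det, -cov[0][1] / det],
           [-cov[1][0] / det, cov[0][0] / det]]
    return means, inv

def mahalanobis_sq(x, mu, inv):
    """Squared Mahalanobis distance between x and a class mean."""
    d = [x[0] - mu[0], x[1] - mu[1]]
    return sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))

def ood_score(x, means, inv):
    """Distance to the closest in-domain class; larger suggests OOD."""
    return min(mahalanobis_sq(x, mu, inv) for mu in means.values())

# Toy in-domain features for two hypothetical intents.
train = {
    "weather": [(0, 0), (1, 0), (0, 1), (1, 1)],
    "music": [(5, 5), (6, 5), (5, 6), (6, 6)],
}
means, inv = fit_gda(train)
print(ood_score((1, 1), means, inv), ood_score((3, -3), means, inv))
```

A sample would be flagged as out-of-domain when its score exceeds a threshold chosen on held-out in-domain data; how that threshold is set is not covered by this sketch.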

pdf bib
Contrastive Zero-Shot Learning for Cross-Domain Slot Filling with Adversarial Attack
Keqing He | Jinchao Zhang | Yuanmeng Yan | Weiran Xu | Cheng Niu | Jie Zhou

Zero-shot slot filling has widely arisen to cope with data scarcity in target domains. However, previous approaches often ignore constraints between slot value representation and related slot description representation in the latent space and lack enough model robustness. In this paper, we propose a Contrastive Zero-Shot Learning with Adversarial Attack (CZSL-Adv) method for the cross-domain slot filling. The contrastive loss aims to map slot value contextual representations to the corresponding slot description representations. And we introduce an adversarial attack training strategy to improve model robustness. Experimental results show that our model significantly outperforms state-of-the-art baselines under both zero-shot and few-shot settings.

pdf bib
Contextual Argument Component Classification for Class Discussions
Luca Lugini | Diane Litman

Argument mining systems often consider contextual information, i.e. information outside of an argumentative discourse unit, when trained to accomplish tasks such as argument component identification, classification, and relation extraction. However, prior work has not carefully analyzed the utility of different contextual properties in context-aware models. In this work, we show how two different types of contextual information, local discourse context and speaker context, can be incorporated into a computational model for classifying argument components in multi-party classroom discussions. We find that both context types can improve performance, although the improvements are dependent on context size and position.

pdf bib
Pre-trained Language Model Based Active Learning for Sentence Matching
Guirong Bai | Shizhu He | Kang Liu | Jun Zhao | Zaiqing Nie

Active learning is able to significantly reduce the annotation cost for data-driven techniques. However, previous active learning approaches for natural language processing mainly depend on the entropy-based uncertainty criterion, and ignore the characteristics of natural language. In this paper, we propose a pre-trained language model based active learning approach for sentence matching. Differing from previous active learning, it can provide linguistic criteria from the pre-trained language model to measure instances and help select more effective instances for annotation. Experiments demonstrate our approach can achieve greater accuracy with fewer labeled training instances.

pdf bib
Using a Penalty-based Loss Re-estimation Method to Improve Implicit Discourse Relation Classification
Xiao Li | Yu Hong | Huibin Ruan | Zhen Huang

We tackle implicit discourse relation classification, a task of automatically determining semantic relationships between arguments. The attention-worthy words in arguments are crucial clues for classifying the discourse relations. Attention mechanisms have been proven effective in highlighting the attention-worthy words during encoding. However, our survey shows that some inessential words are unintentionally misjudged as attention-worthy words and, therefore, assigned heavier attention weights than they should be. We propose a penalty-based loss re-estimation method to regulate the attention learning process, integrating penalty coefficients into the computation of loss by means of overstability of attention weight distributions. We conduct experiments on the Penn Discourse TreeBank (PDTB) corpus. The test results show that our loss re-estimation method leads to substantial improvements for a variety of attention mechanisms, and it obtains highly competitive performance compared to the state-of-the-art methods.

pdf bib
Knowledge Graph Embedding with Atrous Convolution and Residual Learning
Feiliang Ren | Juchen Li | Huihui Zhang | Shilei Liu | Bochao Li | Ruicheng Ming | Yujia Bai

Knowledge graph embedding is an important task that benefits many downstream applications. Currently, methods based on deep neural networks achieve state-of-the-art performance. However, most of these existing methods are very complex and need much time for training and inference. To address this issue, we propose a simple but effective atrous convolution based knowledge graph embedding method. Compared with existing state-of-the-art methods, our method has the following main characteristics. First, it effectively increases feature interactions by using atrous convolutions. Second, to address the issue of forgetting the original information and the vanishing/exploding gradient issue, it uses the residual learning method. Third, it has a simpler structure but much higher parameter efficiency. We evaluate our method on six benchmark datasets with different evaluation metrics. Extensive experiments show that our model is very effective. On these diverse datasets, it achieves better results than the compared state-of-the-art methods on most of the evaluation metrics. The source codes of our model could be found at

pdf bib
TPLinker: Single-stage Joint Extraction of Entities and Relations Through Token Pair Linking
Yucheng Wang | Bowen Yu | Yueyang Zhang | Tingwen Liu | Hongsong Zhu | Limin Sun

Extracting entities and relations from unstructured text has attracted increasing attention in recent years but remains challenging, due to the intrinsic difficulty in identifying overlapping relations with shared entities. Prior works show that joint learning can result in a noticeable performance gain. However, they usually involve sequential interrelated steps and suffer from the problem of exposure bias. At training time, they predict with the ground truth conditions, while at inference they have to extract from scratch. This discrepancy leads to error accumulation. To mitigate the issue, we propose in this paper a one-stage joint extraction model, namely, TPLinker, which is capable of discovering overlapping relations sharing one or both entities while being immune from the exposure bias. TPLinker formulates joint extraction as a token pair linking problem and introduces a novel handshaking tagging scheme that aligns the boundary tokens of entity pairs under each relation type. Experiment results show that TPLinker performs significantly better on overlapping and multiple relation extraction, and achieves state-of-the-art performance on two public datasets.

pdf bib
Unsupervised Deep Language and Dialect Identification for Short Texts
Koustava Goswami | Rajdeep Sarkar | Bharathi Raja Chakravarthi | Theodorus Fransen | John P. McCrae

Automatic Language Identification (LI) or Dialect Identification (DI) of short texts of closely related languages or dialects is one of the primary steps in many natural language processing pipelines. Language identification is considered a solved task in many cases; however, in the case of very closely related languages, or in an unsupervised scenario (where the languages are not known in advance), performance is still poor. In this paper, we propose the Unsupervised Deep Language and Dialect Identification (UDLDI) method, which can simultaneously learn sentence embeddings and cluster assignments from short texts. The UDLDI model understands the sentence constructions of languages by applying attention to character relations, which helps to optimize the clustering of languages. We have performed our experiments on three short-text datasets for different language families, each consisting of closely related languages or dialects, with very minimal training sets. Our experimental evaluations on these datasets have shown significant improvement over state-of-the-art unsupervised methods and our model has outperformed state-of-the-art LI and DI systems in supervised settings.

pdf bib
Improving Long-Tail Relation Extraction with Collaborating Relation-Augmented Attention
Yang Li | Tao Shen | Guodong Long | Jing Jiang | Tianyi Zhou | Chengqi Zhang

The wrong labeling problem and long-tail relations are two main challenges caused by distant supervision in relation extraction. Recent works alleviate wrong labeling by selective attention via multi-instance learning, but cannot handle long-tail relations well even if hierarchies of the relations are introduced to share knowledge. In this work, we propose a novel neural network, Collaborating Relation-augmented Attention (CoRA), to handle both wrong labeling and long-tail relations. In particular, we first propose a relation-augmented attention network as the base model. It operates on a sentence bag with a sentence-to-relation attention to minimize the effect of wrong labeling. Then, facilitated by the proposed base model, we introduce collaborating relation features shared among relations in the hierarchies to promote the relation-augmenting process and balance the training data for long-tail relations. Besides the main training objective to predict the relation of a sentence bag, an auxiliary objective is utilized to guide the relation-augmenting process for a more accurate bag-level representation. In the experiments on the popular benchmark dataset NYT, the proposed CoRA improves the prior state-of-the-art performance by a large margin in terms of Precision@N, AUC and Hits@K. Further analyses verify its superior capability in handling long-tail relations in contrast to the competitors.

pdf bib
ToHRE: A Top-Down Classification Strategy with Hierarchical Bag Representation for Distantly Supervised Relation Extraction
Erxin Yu | Wenjuan Han | Yuan Tian | Yi Chang

Distantly Supervised Relation Extraction (DSRE) has proven to be effective to find relational facts from texts, but it still suffers from two main problems: the wrong labeling problem and the long-tail problem. Most of the existing approaches address these two problems through flat classification, which lacks hierarchical information of relations. To leverage the informative relation hierarchies, we formulate DSRE as a hierarchical classification task and propose a novel hierarchical classification framework, which extracts the relation in a top-down manner. Specifically, in our proposed framework, 1) we use a hierarchically-refined representation method to achieve hierarchy-specific representation; 2) a top-down classification strategy is introduced instead of training a set of local classifiers. The experiments on NYT dataset demonstrate that our approach significantly outperforms other state-of-the-art approaches, especially for the long-tail problem.

pdf bib
Combining Event Semantics and Degree Semantics for Natural Language Inference
Izumi Haruta | Koji Mineshima | Daisuke Bekki

In formal semantics, there are two well-developed semantic frameworks: event semantics, which treats verbs and adverbial modifiers using the notion of event, and degree semantics, which analyzes adjectives and comparatives using the notion of degree. However, it is not obvious whether these frameworks can be combined to handle cases in which the phenomena in question are interacting with each other. Here, we study this issue by focusing on natural language inference (NLI). We implement a logic-based NLI system that combines event semantics and degree semantics and their interaction with lexical knowledge. We evaluate the system on various NLI datasets containing linguistically challenging problems. The results show that the system achieves high accuracies on these datasets in comparison with previous logic-based systems and deep-learning-based systems. This suggests that the two semantic frameworks can be combined consistently to handle various combinations of linguistic phenomena without compromising the advantage of either framework.

pdf bib
Detecting de minimis Code-Switching in Historical German Books
Shijia Liu | David Smith

Code-switching has long interested linguists, with computational work in particular focusing on speech and social media data (Sitaram et al., 2019). This paper contrasts these informal instances of code-switching to its appearance in more formal registers, by examining the mixture of languages in the Deutsches Textarchiv (DTA), a corpus of 1406 primarily German books from the 17th to 19th centuries. We automatically annotate and manually inspect spans of six embedded languages (Latin, French, English, Italian, Spanish, and Greek) in the corpus. We quantitatively analyze the differences between code-switching patterns in these books and those in more typically studied speech and social media corpora. Furthermore, we address the practical task of predicting code-switching from features of the matrix language alone in the DTA corpus. Such classifiers can help reduce errors when optical character recognition or speech transcription is applied to a large corpus with rare embedded languages.

pdf bib
Connecting the Dots Between Fact Verification and Fake News Detection
Qifei Li | Wangchunshu Zhou

Fact verification models have enjoyed a fast advancement in the last two years with the development of pre-trained language models like BERT and the release of large scale datasets such as FEVER. However, the challenging problem of fake news detection has not benefited from the improvement of fact verification models, which is closely related to fake news detection. In this paper, we propose a simple yet effective approach to connect the dots between fact verification and fake news detection. Our approach first employs a text summarization model pre-trained on news corpora to summarize the long news article into a short claim. Then we use a fact verification model pre-trained on the FEVER dataset to detect whether the input news article is real or fake. Our approach makes use of the recent success of fact verification models and enables zero-shot fake news detection, alleviating the need of large scale training data to train fake news detection models. Experimental results on FakenewsNet, a benchmark dataset for fake news detection, demonstrate the effectiveness of our proposed approach.

pdf bib
Reasoning Step-by-Step: Temporal Sentence Localization in Videos via Deep Rectification-Modulation Network
Daizong Liu | Xiaoye Qu | Jianfeng Dong | Pan Zhou

Temporal sentence localization in videos aims to ground the best matched segment in an untrimmed video according to a given sentence query. Previous works in this field mainly rely on attentional frameworks to align the temporal boundaries by a soft selection. Although they focus on the visual content relevant to the query, these single-step attention mechanisms are insufficient to model complex video contents and fall short of the higher-level reasoning this task demands. In this paper, we propose a novel deep rectification-modulation network (RMN), transforming this task into a multi-step reasoning process by repeating rectification and modulation. In each rectification-modulation layer, unlike existing methods that directly conduct the cross-modal interaction, we first devise a rectification module to correct implicit attention misalignment, which focuses on the wrong position during the cross-interaction process. Then, a modulation module is developed to capture the frame-to-frame relation with the help of sentence information for better correlating and composing the video contents over time. With multiple such layers cascaded in depth, our RMN progressively refines video and query interactions, thus enabling more precise localization. Experimental evaluations on three public datasets show that the proposed method achieves state-of-the-art performance. Extensive ablation studies are carried out for the comprehensive analysis of the proposed method.

pdf bib
Language-Driven Region Pointer Advancement for Controllable Image Captioning
Annika Lindh | Robert Ross | John Kelleher

Controllable Image Captioning is a recent sub-field in the multi-modal task of Image Captioning wherein constraints are placed on which regions in an image should be described in the generated natural language caption. This puts a stronger focus on producing more detailed descriptions, and opens the door for more end-user control over results. A vital component of the Controllable Image Captioning architecture is the mechanism that decides the timing of attending to each region through the advancement of a region pointer. In this paper, we propose a novel method for predicting the timing of region pointer advancement by treating the advancement step as a natural part of the language structure via a NEXT-token, motivated by a strong correlation to the sentence structure in the training data. We find that our timing agrees with the ground-truth timing in the Flickr30k Entities test data with a precision of 86.55% and a recall of 97.92%. Our model implementing this technique improves the state-of-the-art on standard captioning metrics while additionally demonstrating a considerably larger effective vocabulary size.

pdf bib
An Enhanced Knowledge Injection Model for Commonsense Generation
Zhihao Fan | Yeyun Gong | Zhongyu Wei | Siyuan Wang | Yameng Huang | Jian Jiao | Xuanjing Huang | Nan Duan | Ruofei Zhang

Commonsense generation aims at generating a plausible everyday scenario description based on a set of provided concepts. Digging out the relationships among concepts from scratch is non-trivial; therefore, we retrieve prototypes from external knowledge to assist the understanding of the scenario for better description generation. We integrate two additional modules into the pretrained encoder-decoder model for prototype modeling to enhance the knowledge injection procedure. We conduct experiments on the CommonGen benchmark, and the experimental results show that our method significantly improves the performance on all the metrics.

pdf bib
How Positive Are You: Text Style Transfer using Adaptive Style Embedding
Heejin Kim | Kyung-Ah Sohn

The prevalent approach for unsupervised text style transfer is disentanglement between content and style. However, it is difficult to completely separate style information from the content. Other approaches allow the latent text representation to contain style and the target style to affect the generated output more than the latent representation does. In both approaches, however, it is impossible to adjust the strength of the style in the generated output. Moreover, those previous approaches typically perform both the sentence reconstruction and style control tasks in a single model, which complicates the overall architecture. In this paper, we address these issues by separating the model into a sentence reconstruction module and a style module. We use the Transformer-based autoencoder model for sentence reconstruction and the adaptive style embedding is learned directly in the style module. Because of this separation, each module can better focus on its own task. Moreover, we can vary the style strength of the generated sentence by changing the style of the embedding expression. Therefore, our approach not only controls the strength of the style, but also simplifies the model architecture. Experimental results show that our approach achieves better style transfer performance and content preservation than previous approaches.

pdf bib
Grammatical error detection in transcriptions of spoken English
Andrew Caines | Christian Bentz | Kate Knill | Marek Rei | Paula Buttery

We describe the collection of transcription corrections and grammatical error annotations for the CrowdED Corpus of spoken English monologues on business topics. The corpus recordings were crowdsourced from native speakers of English and learners of English with German as their first language. The new transcriptions and annotations are obtained from different crowdworkers: we analyse the 1108 new crowdworker submissions and propose that they can be used for automatic transcription post-editing and grammatical error correction for speech. To further explore the data we train grammatical error detection models with various configurations including pre-trained and contextual word representations as input, additional features and auxiliary objectives, and extra training data from written error-annotated corpora. We find that a model concatenating pre-trained and contextual word representations as input performs best, and that additional information does not lead to further performance gains.

pdf bib
Style versus Content: A distinction without a (learnable) difference?
Somayeh Jafaritazehjani | Gwénolé Lecorvé | Damien Lolive | John Kelleher

Textual style transfer involves modifying the style of a text while preserving its content. This assumes that it is possible to separate style from content. This paper investigates whether this separation is possible. We use sentiment transfer as our case study for style transfer analysis. Our experimental methodology frames style transfer as a multi-objective problem, balancing style shift with content preservation and fluency. Due to the lack of parallel data for style transfer, we employ a variety of adversarial encoder-decoder networks in our experiments. We also use a probing methodology to analyse how these models encode style-related features in their latent spaces. The results of our experiments, which are further confirmed by a human evaluation, reveal the inherent trade-off between the multiple style transfer objectives, which indicates that style cannot be usefully separated from content within these style-transfer systems.

pdf bib
Heterogeneous Recycle Generation for Chinese Grammatical Error Correction
Charles Hinson | Hen-Hsen Huang | Hsin-Hsi Chen

Most recent works in the field of grammatical error correction (GEC) rely on neural machine translation-based models. Although these models boast impressive performance, they require a massive amount of data to properly train. Furthermore, NMT-based systems treat GEC purely as a translation task and overlook the editing aspect of it. In this work we propose a heterogeneous approach to Chinese GEC, composed of an NMT-based model, a sequence editing model, and a spell checker. Our methodology not only achieves a new state-of-the-art performance for Chinese GEC, but also does so without relying on data augmentation or GEC-specific architecture changes. We further experiment with all possible configurations of our system with respect to model composition order and number of rounds of correction. A detailed analysis of each model and their contributions to the correction process is performed by adapting the ERRANT scorer to be able to score Chinese sentences.

pdf bib
Formality Style Transfer with Shared Latent Space
Yunli Wang | Yu Wu | Lili Mou | Zhoujun Li | WenHan Chao

Conventional approaches for formality style transfer borrow models from neural machine translation, which typically requires massive parallel data for training. However, the dataset for formality style transfer is considerably smaller than translation corpora. Moreover, we observe that informal and formal sentences closely resemble each other, which is different from the translation task where two languages have different vocabularies and grammars. In this paper, we present a new approach, Sequence-to-Sequence with Shared Latent Space (S2S-SLS), for formality style transfer, where we propose two auxiliary losses and adopt joint training of bi-directional transfer and auto-encoding. Experimental results show that S2S-SLS (with either RNN or Transformer architectures) consistently outperforms baselines in various settings, especially when we have limited data.

pdf bib
Keep it Consistent: Topic-Aware Storytelling from an Image Stream via Iterative Multi-agent Communication
Ruize Wang | Zhongyu Wei | Ying Cheng | Piji Li | Haijun Shan | Ji Zhang | Qi Zhang | Xuanjing Huang

Visual storytelling aims to generate a narrative paragraph from a sequence of images automatically. Existing approaches construct text description independently for each image and roughly concatenate them as a story, which leads to the problem of generating semantically incoherent content. In this paper, we propose a new way for visual storytelling by introducing a topic description task to detect the global semantic context of an image stream. A story is then constructed with the guidance of the topic description. In order to combine the two generation tasks, we propose a multi-agent communication framework that regards the topic description generator and the story generator as two agents and learns them simultaneously via an iterative updating mechanism. We validate our approach on the VIST dataset, where quantitative results, ablations, and human evaluation demonstrate our method’s good ability in generating stories with higher quality compared to state-of-the-art methods.

pdf bib
Referring to what you know and do not know: Making Referring Expression Generation Models Generalize To Unseen Entities
Rossana Cunha | Thiago Castro Ferreira | Adriana Pagano | Fabio Alves

Data-to-text Natural Language Generation (NLG) is the computational process of generating natural language in the form of text or voice from non-linguistic data. A core micro-planning task within NLG is referring expression generation (REG), which aims to automatically generate noun phrases to refer to entities mentioned as discourse unfolds. A limitation of novel REG models is not being able to generate referring expressions to entities not encountered during the training process. To solve this problem, we propose two extensions to NeuralREG, a state-of-the-art encoder-decoder REG model. The first is a copy mechanism, whereas the second consists of representing the gender and type of the referent as inputs to the model. Drawing on the results of automatic and human evaluation as well as an ablation study using the WebNLG corpus, we contend that our proposal contributes to the generation of more meaningful referring expressions to unseen entities than the original system and related work. Code and all produced data are publicly available.

pdf bib
Automatic Detection of Machine Generated Text: A Critical Survey
Ganesh Jawahar | Muhammad Abdul-Mageed | Laks Lakshmanan, V.S.

Text generative models (TGMs) excel in producing text that matches the style of human language reasonably well. Such TGMs can be misused by adversaries, e.g., by automatically generating fake news and fake product reviews that can look authentic and fool humans. Detectors that can distinguish text generated by TGM from human written text play a vital role in mitigating such misuse of TGMs. Recently, there has been a flurry of works from both natural language processing (NLP) and machine learning (ML) communities to build accurate detectors for English. Despite the importance of this problem, there is currently no work that surveys this fast-growing literature and introduces newcomers to important research challenges. In this work, we fill this void by providing a critical survey and review of this literature to facilitate a comprehensive understanding of this problem. We conduct an in-depth error analysis of the state-of-the-art detector and discuss research directions to guide future work in this exciting area.

pdf bib
Learning with Contrastive Examples for Data-to-Text Generation
Yui Uehara | Tatsuya Ishigaki | Kasumi Aoki | Hiroshi Noji | Keiichi Goshima | Ichiro Kobayashi | Hiroya Takamura | Yusuke Miyao

Existing models for data-to-text tasks generate fluent but sometimes incorrect sentences, e.g., "Nikkei gains" is generated when "Nikkei drops" is expected. We investigate models trained on contrastive examples, i.e., incorrect sentences or terms, in addition to correct ones to reduce such errors. We first create rules to produce contrastive examples from correct ones by replacing frequent crucial terms such as "gain" or "drop". We then use learning methods with several losses that exploit contrastive examples. Experiments on the market comment generation task show that 1) exploiting contrastive examples improves the capability of generating sentences with better lexical choice, without degrading the fluency, 2) the choice of the loss function is an important factor because the performances on different metrics depend on the types of loss functions, and 3) the use of the examples produced by some specific rules further improves performance. Human evaluation also supports the effectiveness of using contrastive examples.

pdf bib
MedWriter: Knowledge-Aware Medical Text Generation
Youcheng Pan | Qingcai Chen | Weihua Peng | Xiaolong Wang | Baotian Hu | Xin Liu | Junying Chen | Wenxiu Zhou

Exploiting domain knowledge to guarantee the correctness of generated text has been a hot topic in recent years, especially for highly professional domains such as medicine. However, most recent works only consider the information of unstructured text rather than the structured information of the knowledge graph. In this paper, we focus on the medical topic-to-text generation task and adapt a knowledge-aware text generation model to the medical domain, named MedWriter, which not only introduces the specific knowledge from the external MKG but also is capable of learning graph-level representation. We conduct experiments on a medical literature dataset collected from medical journals, each of which has a set of topic words, an abstract of medical literature and a corresponding knowledge graph from CMeKG. Experimental results demonstrate that incorporating a knowledge graph into the generation model can improve the quality of the generated text and has robust superiority over the competitor methods.

pdf bib
Context Dependent Semantic Parsing: A Survey
Zhuang Li | Lizhen Qu | Gholamreza Haffari

Semantic parsing is the task of translating natural language utterances into machine-readable meaning representations. Currently, most semantic parsing methods are not able to utilize the contextual information (e.g. dialogue and comments history), which has a great potential to boost the semantic parsing systems. To address this issue, context dependent semantic parsing has recently drawn a lot of attention. In this survey, we investigate progress on the methods for the context dependent semantic parsing, together with the current datasets and tasks. We then point out open problems and challenges for future research in this area.

pdf bib
A Survey of Unsupervised Dependency Parsing
Wenjuan Han | Yong Jiang | Hwee Tou Ng | Kewei Tu

Syntactic dependency parsing is an important task in natural language processing. Unsupervised dependency parsing aims to learn a dependency parser from sentences that have no annotation of their correct parse trees. Despite its difficulty, unsupervised parsing is an interesting research direction because of its capability of utilizing almost unlimited unannotated text data. It also serves as the basis for other research in low-resource parsing. In this paper, we survey existing approaches to unsupervised dependency parsing, identify two major classes of approaches, and discuss recent trends. We hope that our survey can provide insights for researchers and facilitate future research on this topic.

pdf bib
Exploring Question-Specific Rewards for Generating Deep Questions
Yuxi Xie | Liangming Pan | Dongzhe Wang | Min-Yen Kan | Yansong Feng

Recent question generation (QG) approaches often utilize the sequence-to-sequence framework (Seq2Seq) to optimize the log likelihood of ground-truth questions using teacher forcing. However, this training objective is inconsistent with actual question quality, which is often reflected by certain global properties such as whether the question can be answered by the document. As such, we directly optimize for QG-specific objectives via reinforcement learning to improve question quality. We design three different rewards that target to improve the fluency, relevance, and answerability of generated questions. We conduct both automatic and human evaluations in addition to thorough analysis to explore the effect of each QG-specific reward. We find that optimizing on question-specific rewards generally leads to better performance in automatic evaluation metrics. However, only the rewards that correlate well with human judgement (e.g., relevance) lead to real improvement in question quality. Optimizing for the others, especially answerability, introduces incorrect bias to the model, resulting in poorer question quality. The code is publicly available at

pdf bib
CHIME: Cross-passage Hierarchical Memory Network for Generative Review Question Answering
Junru Lu | Gabriele Pergola | Lin Gui | Binyang Li | Yulan He

We introduce CHIME, a cross-passage hierarchical memory network for question answering (QA) via text generation. It extends XLNet introducing an auxiliary memory module consisting of two components: the context memory collecting cross-passage evidences, and the answer memory working as a buffer continually refining the generated answers. Empirically, we show the efficacy of the proposed architecture in the multi-passage generative QA, outperforming the state-of-the-art baselines with better syntactically well-formed answers and increased precision in addressing the questions of the AmazonQA review dataset. An additional qualitative analysis revealed the interpretability introduced by the memory module.

pdf bib
Bi-directional CognitiveThinking Network for Machine Reading Comprehension
Wei Peng | Yue Hu | Luxi Xing | Yuqiang Xie | Jing Yu | Yajing Sun | Xiangpeng Wei

We propose a novel Bi-directional Cognitive Knowledge Framework (BCKF) for reading comprehension from the perspective of complementary learning systems theory. It aims to simulate two ways of thinking in the brain to answer questions, including reverse thinking and inertial thinking. To validate the effectiveness of our framework, we design a corresponding Bi-directional Cognitive Thinking Network (BCTN) to encode the passage and generate a question (answer) given an answer (question) and decouple the bi-directional knowledge. The model has the ability to reverse reasoning questions which can assist inertial thinking to generate more accurate answers. Competitive improvement is observed in DuReader dataset, confirming our hypothesis that bi-directional knowledge helps the QA task. The novel framework shows an interesting perspective on machine reading comprehension and cognitive science.

pdf bib
Molweni: A Challenge Multiparty Dialogues-based Machine Reading Comprehension Dataset with Discourse Structure
Jiaqi Li | Ming Liu | Min-Yen Kan | Zihao Zheng | Zekun Wang | Wenqiang Lei | Ting Liu | Bing Qin

Research into the area of multiparty dialog has grown considerably over recent years. We present the Molweni dataset, a machine reading comprehension (MRC) dataset with discourse structure built over multiparty dialog. Molweni’s source samples from the Ubuntu Chat Corpus, including 10,000 dialogs comprising 88,303 utterances. We annotate 30,066 questions on this corpus, including both answerable and unanswerable questions. Molweni also uniquely contributes discourse dependency annotations in a modified Segmented Discourse Representation Theory (SDRT; Asher et al., 2016) style for all of its multiparty dialogs, contributing large-scale (78,245 annotated discourse relations) data to bear on the task of multiparty dialog discourse parsing. Our experiments show that Molweni is a challenging dataset for current MRC models: BERT-wwm, a current, strong SQuAD 2.0 performer, achieves only 67.7% F1 on Molweni’s questions, a 20+% significant drop as compared against its SQuAD 2.0 performance.

pdf bib
Conversational Machine Comprehension: a Literature Review
Somil Gupta | Bhanu Pratap Singh Rawat | Hong Yu

Conversational Machine Comprehension (CMC), a research track in conversational AI, expects the machine to understand an open-domain natural language text and thereafter engage in a multi-turn conversation to answer questions related to the text. While most of the research in Machine Reading Comprehension (MRC) revolves around single-turn question answering (QA), multi-turn CMC has recently gained prominence, thanks to the advancement in natural language understanding via neural language models such as BERT and the introduction of large-scale conversational datasets such as CoQA and QuAC. The rise in interest has, however, led to a flurry of concurrent publications, each with a different yet structurally similar modeling approach and an inconsistent view of the surrounding literature. With the volume of model submissions to conversational datasets increasing every year, there exists a need to consolidate the scattered knowledge in this domain to streamline future research. This literature review attempts to provide a holistic overview of CMC with an emphasis on the common trends across recently published models, specifically in their approach to tackling conversational history. The review synthesizes a generic framework for CMC models while highlighting the differences in recent approaches and intends to serve as a compendium of CMC for future researchers.

pdf bib
Reinforced Multi-task Approach for Multi-hop Question Generation
Deepak Gupta | Hardik Chauhan | Ravi Tej Akella | Asif Ekbal | Pushpak Bhattacharyya

Question generation (QG) attempts to solve the inverse of the question answering (QA) problem by generating a natural language question given a document and an answer. While sequence-to-sequence neural models surpass rule-based systems for QG, they are limited in their capacity to focus on more than one supporting fact. For QG, we often require multiple supporting facts to generate high-quality questions. Inspired by recent works on multi-hop reasoning in QA, we take up multi-hop question generation, which aims at generating relevant questions based on supporting facts in the context. We employ multitask learning with the auxiliary task of answer-aware supporting fact prediction to guide the question generator. In addition, we propose a question-aware reward function in a Reinforcement Learning (RL) framework to maximize the utilization of the supporting facts. We demonstrate the effectiveness of our approach through experiments on the multi-hop question answering dataset, HotPotQA. Empirical evaluation shows our model to outperform the single-hop neural question generation models on both automatic evaluation metrics such as BLEU, METEOR, and ROUGE and human evaluation metrics for quality and coverage of the generated questions.

pdf bib
Does Chinese BERT Encode Word Structure?
Yile Wang | Leyang Cui | Yue Zhang

Contextualized representations give significantly improved results for a wide range of NLP tasks. Much work has been dedicated to analyzing the features captured by representative models such as BERT. Existing work finds that syntactic, semantic and word sense knowledge are encoded in BERT. However, little work has investigated word features for character languages such as Chinese. We investigate Chinese BERT using both attention weight distribution statistics and probing tasks, finding that (1) word information is captured by BERT; (2) word-level features are mostly in the middle representation layers; (3) downstream tasks make different use of word features in BERT, with POS tagging and chunking relying the most on word features, and natural language inference relying the least on such features.

pdf bib
One Comment from One Perspective: An Effective Strategy for Enhancing Automatic Music Comment
Tengfei Huo | Zhiqiang Liu | Jinchao Zhang | Jie Zhou

The automatic generation of music comments is of great significance for increasing the popularity of music and the music platform’s activity. In human music comments, there exists high distinction and diverse perspectives for the same song. In other words, for a song, different comments stem from different musical perspectives. However, to date, this characteristic has not been considered well in research on automatic comment generation. The existing methods tend to generate common and meaningless comments. In this paper, we propose an effective multi-perspective strategy to enhance the diversity of the generated comments. The experiment results on two music comment datasets show that our proposed model can effectively generate a series of diverse music comments based on different perspectives, which outperforms state-of-the-art baselines by a substantial margin.

pdf bib
A Tale of Two Linkings: Dynamically Gating between Schema Linking and Structural Linking for Text-to-SQL Parsing
Sanxing Chen | Aidan San | Xiaodong Liu | Yangfeng Ji

In Text-to-SQL semantic parsing, selecting the correct entities (tables and columns) for the generated SQL query is both crucial and challenging; the parser is required to connect the natural language (NL) question and the SQL query to the structured knowledge in the database. We formulate two linking processes to address this challenge: schema linking, which links explicit NL mentions to the database, and structural linking, which links the entities in the output SQL with their structural relationships in the database schema. Intuitively, the effectiveness of these two linking processes changes based on the entity being generated, thus we propose to dynamically choose between them using a gating mechanism. Integrating the proposed method with two graph neural network-based semantic parsers together with BERT representations demonstrates substantial gains in parsing accuracy on the challenging Spider dataset. Analyses show that our proposed method helps to enhance the structure of the model output when generating complicated SQL queries and offers more explainable predictions.

pdf bib
Autoregressive Affective Language Forecasting: A Self-Supervised Task
Matthew Matero | H. Andrew Schwartz

Human natural language is mentioned at a specific point in time while human emotions change over time. While much work has established a strong link between language use and emotional states, few have attempted to model emotional language in time. Here, we introduce the task of affective language forecasting: predicting future change in language based on past changes of language, a task with real-world applications such as treating mental health or forecasting trends in consumer confidence. We establish some of the fundamental autoregressive characteristics of the task (necessary history size, static versus dynamic length, varying time-step resolutions) and then build on popular sequence models for words to instead model sequences of language-based emotion in time. Over a novel Twitter dataset of 1,900 users and weekly + daily scores for 6 emotions and 2 additional linguistic attributes, we find a novel dual-sequence GRU model with decayed hidden states achieves best results (r = .66) significantly out-predicting, e.g., a moving averaging based on the past time-steps (r = .49). We make our anonymized dataset as well as task setup and evaluation code available for others to build on.

pdf bib
End to End Chinese Lexical Fusion Recognition with Sememe Knowledge
Yijiang Liu | Meishan Zhang | Donghong Ji

In this paper, we present Chinese lexical fusion recognition, a new task which could be regarded as one kind of coreference recognition. First, we introduce the task in detail, showing the relationship with coreference recognition and differences from the existing tasks. Second, we propose an end-to-end model for the task, handling mentions as well as coreference relationship jointly. The model exploits the state-of-the-art contextualized BERT representations as an encoder, and is further enhanced with the sememe knowledge from HowNet by graph attention networks. We manually annotate a benchmark dataset for the task and then conduct experiments on it. Results demonstrate that our final model is effective and competitive for the task. Detailed analysis is offered for comprehensively understanding the new task and our proposed model.

pdf bib
Comparison by Conversion: Reverse-Engineering UCCA from Syntax and Lexical Semantics
Daniel Hershcovich | Nathan Schneider | Dotan Dvir | Jakob Prange | Miryam de Lhoneux | Omri Abend

Building robust natural language understanding systems will require a clear characterization of whether and how various linguistic meaning representations complement each other. To perform a systematic comparative analysis, we evaluate the mapping between meaning representations from different frameworks using two complementary methods: (i) a rule-based converter, and (ii) a supervised delexicalized parser that parses to one framework using only information from the other as features. We apply these methods to convert the STREUSLE corpus (with syntactic and lexical semantic annotations) to UCCA (a graph-structured full-sentence meaning representation). Both methods yield surprisingly accurate target representations, close to fully supervised UCCA parser quality, indicating that UCCA annotations are partially redundant with STREUSLE annotations. Despite this substantial convergence between frameworks, we find several important areas of divergence.

pdf bib
Normalizing Compositional Structures Across Graphbanks
Lucia Donatelli | Jonas Groschwitz | Matthias Lindemann | Alexander Koller | Pia Weißenhorn

The emergence of a variety of graph-based meaning representations (MRs) has sparked an important conversation about how to adequately represent semantic structure. MRs exhibit structural differences that reflect different theoretical and design considerations, presenting challenges to uniform linguistic analysis and cross-framework semantic parsing. Here, we ask the question of which design differences between MRs are meaningful and semantically-rooted, and which are superficial. We present a methodology for normalizing discrepancies between MRs at the compositional level (Lindemann et al., 2019), finding that we can normalize the majority of divergent phenomena using linguistically-grounded rules. Our work significantly increases the match in compositional structure between MRs and improves multi-task learning (MTL) in a low-resource setting, serving as a proof of concept for future broad-scale cross-MR normalization.

pdf bib
Finding the Evidence: Localization-aware Answer Prediction for Text Visual Question Answering
Wei Han | Hantao Huang | Tao Han

Image text carries essential information to understand the scene and perform reasoning. Text-based visual question answering (text VQA) task focuses on visual questions that require reading text in images. Existing text VQA systems generate an answer by selecting from optical character recognition (OCR) texts or a fixed vocabulary. Positional information of text is underused and there is a lack of evidence for the generated answer. As such, this paper proposes a localization-aware answer prediction network (LaAP-Net) to address this challenge. Our LaAP-Net not only generates the answer to the question but also predicts a bounding box as evidence of the generated answer. Moreover, a context-enriched OCR representation (COR) for multimodal fusion is proposed to facilitate the localization task. Our proposed LaAP-Net outperforms existing approaches on three benchmark datasets for the text VQA task by a noticeable margin.

pdf bib
Multi-level Alignment Pretraining for Multi-lingual Semantic Parsing
Bo Shao | Yeyun Gong | Weizhen Qi | Nan Duan | Xiaola Lin

In this paper, we present a multi-level alignment pretraining method in a unified architecture for multi-lingual semantic parsing. In this architecture, we use an adversarial training method to align the space of different languages and use sentence level and word level parallel corpus as supervision information to align the semantic of different languages. Finally, we jointly train the multi-level alignment and semantic parsing tasks. We conduct experiments on a publicly available multi-lingual semantic parsing dataset ATIS and a newly constructed dataset. Experimental results show that our model outperforms state-of-the-art methods on both datasets.

pdf bib
Conception: Multilingually-Enhanced, Human-Readable Concept Vector Representations
Simone Conia | Roberto Navigli

To date, the most successful word, word sense, and concept modelling techniques have used large corpora and knowledge resources to produce dense vector representations that capture semantic similarities in a relatively low-dimensional space. Most current approaches, however, suffer from a monolingual bias, with their strength depending on the amount of data available across languages. In this paper we address this issue and propose Conception, a novel technique for building language-independent vector representations of concepts which places multilinguality at its core while retaining explicit relationships between concepts. Our approach results in high-coverage representations that outperform the state of the art in multilingual and cross-lingual Semantic Word Similarity and Word Sense Disambiguation, proving particularly robust on low-resource languages. Conception (its software and the complete set of representations) is available at

pdf bib
Sentence Matching with Syntax- and Semantics-Aware BERT
Tao Liu | Xin Wang | Chengguo Lv | Ranran Zhen | Guohong Fu

Sentence matching aims to identify the special relationship between two sentences, and plays a key role in many natural language processing tasks. However, previous studies mainly focused on exploiting either syntactic or semantic information for sentence matching, and no studies consider integrating both of them. In this study, we propose integrating syntax and semantics into BERT with sentence matching. In particular, we use an implicit syntax and semantics integration method that is less sensitive to the output structure information. Thus the implicit integration can alleviate the error propagation problem. The experimental results show that our approach has achieved state-of-the-art or competitive performance on several sentence matching datasets, demonstrating the benefits of implicitly integrating syntactic and semantic features in sentence matching.

Homonym normalisation by word sense clustering: a case in Japanese
Yo Sato | Kevin Heffernan

This work presents a method of word sense clustering that differentiates homonyms and merges homophones, taking Japanese as an example, where orthographical variation causes problems for language processing. It uses contextualised embeddings (BERT) to cluster tokens into distinct sense groups, and we use these groups to normalise synonymous instances to a single representative form. We see the benefit of this normalisation in language modelling, as well as in transliteration.

An Unsupervised Method for Learning Representations of Multi-word Expressions for Semantic Classification
Robert Vacareanu | Marco A. Valenzuela-Escárcega | Rebecca Sharp | Mihai Surdeanu

This paper explores an unsupervised approach to learning a compositional representation function for multi-word expressions (MWEs), and evaluates it on the Tratz dataset, which associates two-word expressions with the semantic relation between the compound constituents (e.g. the label employer is associated with the noun compound government agency) (Tratz, 2011). The composition function is based on recurrent neural networks, and is trained using the Skip-Gram objective to predict the words in the context of MWEs. Thus our approach can naturally leverage large unlabeled text sources. Further, our method can make use of provided MWEs when available, but can also function as a completely unsupervised algorithm, using MWE boundaries predicted by a single, domain-agnostic part-of-speech pattern. With pre-defined MWE boundaries, our method outperforms the previous state-of-the-art performance on the coarse-grained evaluation of the Tratz dataset (Tratz, 2011), with an F1 score of 50.4%. The unsupervised version of our method approaches the performance of the supervised one, and even outperforms it in some configurations.

Sentence Analogies: Linguistic Regularities in Sentence Embeddings
Xunjie Zhu | Gerard de Melo

While important properties of word vector representations have been studied extensively, far less is known about the properties of sentence vector representations. Word vectors are often evaluated by assessing to what degree they exhibit regularities with regard to relationships of the sort considered in word analogies. In this paper, we investigate to what extent commonly used sentence vector representation spaces also reflect certain kinds of regularities. We propose a number of schemes to induce evaluation data, based on lexical analogy data as well as semantic relationships between sentences. Our experiments consider a wide range of sentence embedding methods, including ones based on BERT-style contextual embeddings. We find that different models differ substantially in their ability to reflect such regularities.

Manifold Learning-based Word Representation Refinement Incorporating Global and Local Information
Wenyu Zhao | Dong Zhou | Lin Li | Jinjun Chen

Recent studies show that word embedding models often underestimate similarities between similar words and overestimate similarities between distant words. This makes the word similarity results obtained from embedding models inconsistent with human judgment. Manifold learning-based methods are widely utilized to refine word representations by re-embedding word vectors from the original embedding space to a new refined semantic space. These methods mainly focus on preserving local geometry information through performing weighted locally linear combination between words and their neighbors twice. However, these reconstruction weights are easily influenced by different selections of neighboring words and the whole combination process is time-consuming. In this paper, we propose two novel word representation refinement methods leveraging isometry feature mapping and local tangent space respectively. Unlike previous methods, our first method corrects pre-trained word embeddings by preserving global geometry information of all words instead of local geometry information between words and their neighbors. Our second method refines word representations by aligning original and refined embedding spaces based on local tangent space instead of performing weighted locally linear combination twice. Experimental results obtained from standard semantic relatedness and semantic similarity tasks show that our methods outperform various state-of-the-art baselines for word representation refinement.

Optimizing Transformer for Low-Resource Neural Machine Translation
Ali Araabi | Christof Monz

Language pairs with limited amounts of parallel data, also known as low-resource languages, remain a challenge for neural machine translation. While the Transformer model has achieved significant improvements for many language pairs and has become the de facto mainstream architecture, its capability under low-resource conditions has not been fully investigated yet. Our experiments on different subsets of the IWSLT14 training data show that the effectiveness of Transformer under low-resource conditions is highly dependent on the hyper-parameter settings. Our experiments show that using an optimized Transformer for low-resource conditions improves the translation quality up to 7.3 BLEU points compared to using the Transformer default settings.

Towards the First Machine Translation System for Sumerian Transliterations
Ravneet Punia | Niko Schenk | Christian Chiarcos | Émilie Pagé-Perron

The Sumerian cuneiform script was invented more than 5,000 years ago and represents one of the oldest in history. We present the first attempt to translate Sumerian texts into English automatically. We publicly release high-quality corpora for standardized training and evaluation and report results on experiments with supervised, phrase-based, and transfer learning techniques for machine translation. Quantitative and qualitative evaluations indicate the usefulness of the translations. Our proposed methodology provides a broader audience of researchers with novel access to the data, accelerates the costly and time-consuming manual translation process, and helps them better explore the relationships between Sumerian cuneiform and Mesopotamian culture.

Using Bilingual Patents for Translation Training
John Lee | Benjamin Tsou | Tianyuan Cai

While bilingual corpora have been instrumental for machine translation, their utility for training translators has been less explored. We investigate the use of bilingual corpora as pedagogical tools for translation in the technical domain. In a user study, novice translators revised Chinese translations of English patents through bilingual concordancing. Results show that concordancing with an in-domain bilingual corpus can yield greater improvement in translation quality of technical terms than a general-domain bilingual corpus.

Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation
Hang Le | Juan Pino | Changhan Wang | Jiatao Gu | Didier Schwab | Laurent Besacier

We introduce dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST). Our models are based on the original Transformer architecture (Vaswani et al., 2017) but consist of two decoders, each responsible for one task (ASR or ST). Our major contribution lies in how these decoders interact with each other: one decoder can attend to different information sources from the other via a dual-attention mechanism. We propose two variants of these architectures corresponding to two different levels of dependencies between the decoders, called the parallel and cross dual-decoder Transformers, respectively. Extensive experiments on the MuST-C dataset show that our models outperform the previously-reported highest translation performance in the multilingual settings, and also outperform bilingual one-to-one results. Furthermore, our parallel models demonstrate no trade-off between ASR and ST compared to the vanilla multi-task architecture. Our code and pre-trained models are available at

Multitask Learning-Based Neural Bridging Reference Resolution
Juntao Yu | Massimo Poesio

We propose a multi-task learning-based neural model for resolving bridging references, tackling two key challenges. The first challenge is the lack of large corpora annotated with bridging references. To address this, we use multi-task learning to help bridging reference resolution with coreference resolution. We show that substantial improvements of up to 8 p.p. can be achieved on full bridging resolution with this architecture. The second challenge is the different definitions of bridging used in different corpora, meaning that hand-coded systems or systems using special features designed for one corpus do not work well with other corpora. Our neural model only uses a small number of corpus-independent features, and thus can be applied to different corpora. Evaluations with very different bridging corpora (ARRAU, ISNOTES, BASHI and SCICORP) suggest that our architecture works equally well on all corpora, and achieves the SoTA results on full bridging resolution for all corpora, outperforming the best reported results by up to 36.3 p.p.

Automatic Discovery of Heterogeneous Machine Learning Pipelines: An Application to Natural Language Processing
Suilan Estevez-Velarde | Yoan Gutiérrez | Andres Montoyo | Yudivián Almeida Cruz

This paper presents AutoGOAL, a system for automatic machine learning (AutoML) that uses heterogeneous techniques. In contrast with existing AutoML approaches, our contribution can automatically build machine learning pipelines that combine techniques and algorithms from different frameworks, including shallow classifiers, natural language processing tools, and neural networks. We define the heterogeneous AutoML optimization problem as the search for the best sequence of algorithms that transforms specific input data into the desired output. This provides a novel theoretical and practical approach to AutoML. Our proposal is experimentally evaluated in diverse machine learning problems and compared with alternative approaches, showing that it is competitive with other AutoML alternatives in standard benchmarks. Furthermore, it can be applied to novel scenarios, such as several NLP tasks, where existing alternatives cannot be directly deployed. The system is freely available and includes in-built compatibility with a large number of popular machine learning frameworks, which makes our approach useful for solving practical problems with relative ease.

Incorporating Noisy Length Constraints into Transformer with Length-aware Positional Encodings
Yui Oka | Katsuki Chousa | Katsuhito Sudoh | Satoshi Nakamura

Neural Machine Translation often suffers from an under-translation problem due to its limited modeling of output sequence lengths. In this work, we propose a novel approach to training a Transformer model using length constraints based on length-aware positional encoding (PE). Since length constraints with exact target sentence lengths degrade translation performance, we add random noise within a certain window size to the length constraints in the PE during the training. In the inference step, we predict the output lengths using input sequences and a BERT-based length prediction model. Experimental results in an ASPEC English-to-Japanese translation showed the proposed method produced translations with lengths close to the reference ones and outperformed a vanilla Transformer (especially in short sentences) by 3.22 points in BLEU. The average translation results using our length prediction model were also better than another baseline method using input lengths for the length constraints. The proposed noise injection improved robustness for length prediction errors, especially within the window size.
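The idea in this abstract can be sketched roughly as follows: a sinusoidal positional encoding is driven by the remaining length (length constraint minus position), and the constraint is perturbed with integer noise drawn from a window during training. The length-difference form of the encoding, the function names, and the window value are assumptions made for illustration; the abstract does not specify them.

```python
import math
import random


def noisy_length(target_len, window, rng):
    """Perturb the length constraint by an integer drawn from [-window, window]."""
    return max(1, target_len + rng.randint(-window, window))


def length_aware_pe(pos, length, d_model):
    """Sinusoidal encoding of the remaining length (length - pos).

    Assumption: a length-difference variant of the standard sinusoidal
    encoding; the paper's exact formulation may differ.
    """
    remaining = length - pos
    pe = []
    for i in range(0, d_model, 2):
        angle = remaining / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]


# Training-time usage: the true target length (12) is noised before encoding.
rng = random.Random(0)
constraint = noisy_length(target_len=12, window=2, rng=rng)
encodings = [length_aware_pe(pos, constraint, d_model=8) for pos in range(constraint)]
```

At inference time the same encoding would be fed a predicted length instead of a noised reference length; training with noise is what lets the model tolerate small prediction errors.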

Deep Inside-outside Recursive Autoencoder with All-span Objective
Ruyue Hong | Jiong Cai | Kewei Tu

Deep inside-outside recursive autoencoder (DIORA) is a neural-based model designed for unsupervised constituency parsing. During its forward computation, it provides phrase and contextual representations for all spans in the input sentence. By utilizing the contextual representation of each leaf-level span, the span of length 1, to reconstruct the word inside the span, the model is trained without labeled data. In this work, we extend the training objective of DIORA by making use of all spans instead of only leaf-level spans. We test our new training objective on datasets of two languages: English and Japanese, and empirically show that our method achieves improvement in parsing accuracy over the original DIORA.

Picking BERT’s Brain: Probing for Linguistic Dependencies in Contextualized Embeddings Using Representational Similarity Analysis
Michael Lepori | R. Thomas McCoy

As the name implies, contextualized representations of language are typically motivated by their ability to encode context. Which aspects of context are captured by such representations? We introduce an approach to address this question using Representational Similarity Analysis (RSA). As case studies, we investigate the degree to which a verb embedding encodes the verb’s subject, a pronoun embedding encodes the pronoun’s antecedent, and a full-sentence representation encodes the sentence’s head word (as determined by a dependency parse). In all cases, we show that BERT’s contextualized embeddings reflect the linguistic dependency being studied, and that BERT encodes these dependencies to a greater degree than it encodes less linguistically-salient controls. These results demonstrate the ability of our approach to adjudicate between hypotheses about which aspects of context are encoded in representations of language.
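For readers unfamiliar with Representational Similarity Analysis, the following is a minimal sketch of the general technique, not the authors' code: each representation space is summarised by the pairwise dissimilarities among the same set of items, and the two sets of dissimilarities are then rank-correlated. The use of cosine distance, the toy vectors, and all function names are assumptions for illustration.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def pairwise_distances(vectors):
    """Flattened upper triangle of the dissimilarity matrix."""
    return [cosine_distance(vectors[i], vectors[j])
            for i, j in combinations(range(len(vectors)), 2)]


def ranks(values):
    """Rank of each value (ties broken by position; adequate for a sketch)."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    result = [0.0] * len(values)
    for r, k in enumerate(order):
        result[k] = float(r)
    return result


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rsa_score(space_a, space_b):
    """Spearman-style correlation between the two spaces' pairwise dissimilarities."""
    return pearson(ranks(pairwise_distances(space_a)),
                   ranks(pairwise_distances(space_b)))


# Toy data: four items in two hypothetical spaces. The second space is a
# rescaled copy of the first, so the two geometries should agree closely.
embeddings = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.3], [0.0, 1.0, 0.5], [0.1, 0.8, 0.9]]
hypothesis = [[2.0, 0.0, 0.4], [1.8, 0.2, 0.6], [0.0, 2.0, 1.0], [0.2, 1.6, 1.8]]
score = rsa_score(embeddings, hypothesis)
```

In a probing setting, one space would hold the contextualized embeddings and the other a hypothesis-driven representation of the same items; a higher score suggests the embeddings encode the hypothesised structure more strongly.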

The Devil is in the Details: Evaluating Limitations of Transformer-based Methods for Granular Tasks
Brihi Joshi | Neil Shah | Francesco Barbieri | Leonardo Neves

Contextual embeddings derived from transformer-based neural language models have shown state-of-the-art performance for various tasks such as question answering, sentiment analysis, and textual similarity in recent years. Extensive work shows how accurately such models can represent abstract, semantic information present in text. In this expository work, we explore a tangent direction and analyze such models’ performance on tasks that require a more granular level of representation. We focus on the problem of textual similarity from two perspectives: matching documents on a granular level (requiring embeddings to capture fine-grained attributes in the text), and an abstract level (requiring embeddings to capture overall textual semantics). We empirically demonstrate, across two datasets from different domains, that despite high performance in abstract document matching as expected, contextual embeddings are consistently (and at times, vastly) outperformed by simple baselines like TF-IDF for more granular tasks. We then propose a simple but effective method to incorporate TF-IDF into models that use contextual embeddings, achieving relative improvements of up to 36% on granular tasks.
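To make the TF-IDF baseline concrete, here is a small self-contained sketch of TF-IDF document matching by cosine similarity. The smoothed IDF formula, the toy documents, and the `blended_similarity` helper are illustrative assumptions; in particular, the linear blend is a hypothetical way of combining the two signals and is not claimed to be the method proposed in the paper.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: tf * idf} dictionary."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumption; other variants are common).
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    return [{t: count * idf[t] for t, count in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def blended_similarity(tfidf_sim, embedding_sim, alpha=0.5):
    """Hypothetical linear blend; not the combination method of the paper."""
    return alpha * tfidf_sim + (1.0 - alpha) * embedding_sim


docs = [
    "order number 4471 shipped to berlin".split(),
    "order number 4471 delayed in berlin".split(),
    "the weather in berlin is mild".split(),
]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])  # shares fine-grained tokens such as "4471"
sim_far = cosine(vecs[0], vecs[2])    # shares only "berlin"
```

The first two documents match on specific tokens, which is the kind of granular overlap that exact-term weighting rewards directly.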

CoLAKE: Contextualized Language and Knowledge Embedding
Tianxiang Sun | Yunfan Shao | Xipeng Qiu | Qipeng Guo | Yaru Hu | Xuanjing Huang | Zheng Zhang

With the emerging branch of incorporating factual knowledge into pre-trained language models such as BERT, most existing models consider shallow, static, and separately pre-trained entity embeddings, which limits the performance gains of these models. Few works explore the potential of deep contextualized knowledge representation when injecting knowledge. In this paper, we propose the Contextualized Language and Knowledge Embedding (CoLAKE), which jointly learns contextualized representation for both language and knowledge with the extended MLM objective. Instead of injecting only entity embeddings, CoLAKE extracts the knowledge context of an entity from large-scale knowledge bases. To handle the heterogeneity of knowledge context and language context, we integrate them in a unified data structure, word-knowledge graph (WK graph). CoLAKE is pre-trained on large-scale WK graphs with the modified Transformer encoder. We conduct experiments on knowledge-driven tasks, knowledge probing tasks, and language understanding tasks. Experimental results show that CoLAKE outperforms previous counterparts on most of the tasks. Besides, CoLAKE achieves surprisingly high performance on our synthetic task called word-knowledge graph completion, which shows the superiority of simultaneously contextualizing language and knowledge representation.

Target Word Masking for Location Metonymy Resolution
Haonan Li | Maria Vasardani | Martin Tomko | Timothy Baldwin

Existing metonymy resolution approaches rely on features extracted from external resources like dictionaries and hand-crafted lexical resources. In this paper, we propose an end-to-end word-level classification approach based only on BERT, without dependencies on taggers, parsers, curated dictionaries of place names, or other external resources. We show that our approach achieves the state-of-the-art on 5 datasets, surpassing conventional BERT models and benchmarks by a large margin. We also show that our approach generalises well to unseen data.

What Meaning-Form Correlation Has to Compose With: A Study of MFC on Artificial and Natural Language
Timothee Mickus | Timothée Bernard | Denis Paperno

Compositionality is a widely discussed property of natural languages, although its exact definition has been elusive. We focus on the proposal that compositionality can be assessed by measuring meaning-form correlation. We analyze meaning-form correlation on three sets of languages: (i) artificial toy languages tailored to be compositional, (ii) a set of English dictionary definitions, and (iii) a set of English sentences drawn from literature. We find that linguistic phenomena such as synonymy and ungrounded stop-words weigh on MFC measurements, and that straightforward methods to mitigate their effects have widely varying results depending on the dataset they are applied to. Data and code are made publicly available.

Evaluating Pretrained Transformer-based Models on the Task of Fine-Grained Named Entity Recognition
Cedric Lothritz | Kevin Allix | Lisa Veiber | Tegawendé F. Bissyandé | Jacques Klein

Named Entity Recognition (NER) is a fundamental Natural Language Processing (NLP) task and has remained an active research field. In recent years, transformer models and more specifically the BERT model developed at Google revolutionised the field of NLP. While the performance of transformer-based approaches such as BERT has been studied for NER, there has not yet been a study for the fine-grained Named Entity Recognition (FG-NER) task. In this paper, we compare three transformer-based models (BERT, RoBERTa, and XLNet) to two non-transformer-based models (CRF and BiLSTM-CNN-CRF). Furthermore, we apply each model to a multitude of distinct domains. We find that transformer-based models incrementally outperform the studied non-transformer-based models in most domains with respect to the F1 score. Furthermore, we find that the choice of domains significantly influenced the performance regardless of the respective data size or the model chosen.

A Unifying Theory of Transition-based and Sequence Labeling Parsing
Carlos Gómez-Rodríguez | Michalina Strzyz | David Vilares

We define a mapping from transition-based parsing algorithms that read sentences from left to right to sequence labeling encodings of syntactic trees. This not only establishes a theoretical relation between transition-based parsing and sequence-labeling parsing, but also provides a method to obtain new encodings for fast and simple sequence labeling parsing from the many existing transition-based parsers for different formalisms. Applying it to dependency parsing, we implement sequence labeling versions of four algorithms, showing that they are learnable and obtain comparable performance to existing encodings.

Semi-supervised Domain Adaptation for Dependency Parsing via Improved Contextualized Word Representations
Ying Li | Zhenghua Li | Min Zhang

In recent years, parsing performance has improved dramatically on in-domain texts thanks to the rapid progress of deep neural network models. The major challenge for current parsing research is to improve parsing performance on out-of-domain texts that are very different from the in-domain training data when only a small amount of labeled out-of-domain data is available. To deal with this problem, we propose to improve the contextualized word representations via adversarial learning and fine-tuning BERT processes. Concretely, we apply adversarial learning to three representative semi-supervised domain adaptation methods, i.e., direct concatenation (CON), feature augmentation (FA), and domain embedding (DE) with two useful strategies, i.e., fused target-domain word representations and orthogonality constraints, thus enabling the modeling of purer yet effective domain-specific and domain-invariant representations. Simultaneously, we utilize large-scale unlabeled target-domain data to fine-tune BERT with only the language model loss, thus obtaining reliable contextualized word representations that benefit cross-domain dependency parsing. Experiments on a benchmark dataset show that our proposed adversarial approaches achieve consistent improvement, and fine-tuning BERT further boosts parsing accuracy by a large margin. Our single model achieves the same state-of-the-art performance as the top submitted system in the NLPCC-2019 shared task, which uses ensemble models and BERT.

Learning to Prune Dependency Trees with Rethinking for Neural Relation Extraction
Bowen Yu | Xue Mengge | Zhenyu Zhang | Tingwen Liu | Wang Yubin | Bin Wang

Dependency trees have been shown to be effective in capturing long-range relations between target entities. Nevertheless, how to selectively emphasize target-relevant information and remove irrelevant content from the tree is still an open problem. Existing approaches employing pre-defined rules to eliminate noise may not always yield optimal results due to the complexity and variability of natural language. In this paper, we present a novel architecture named Dynamically Pruned Graph Convolutional Network (DP-GCN), which learns to prune the dependency tree with rethinking in an end-to-end scheme. In each layer of DP-GCN, we employ a selection module to concentrate on nodes expressing the target relation by a set of binary gates, and then augment the pruned tree with a pruned semantic graph to ensure the connectivity. After that, we introduce a rethinking mechanism to guide and refine the pruning operation by feeding back the high-level learned features repeatedly. Extensive experimental results demonstrate that our model achieves impressive results compared to strong competitors.

How Far Does BERT Look At: Distance-based Clustering and Analysis of BERT’s Attention
Yue Guan | Jingwen Leng | Chao Li | Quan Chen | Minyi Guo

Recent research on the multi-head attention mechanism, especially that in pre-trained models such as BERT, has shown us heuristics and clues in analyzing various aspects of the mechanism. As most of the research focuses on probing tasks or hidden states, previous works have found some primitive patterns of attention head behavior by heuristic analytical methods, but a more systematic analysis specific to the attention patterns still remains primitive. In this work, we clearly cluster the attention heatmaps into significantly different patterns through unsupervised clustering on top of a set of proposed features, which corroborates previous observations. We further study their corresponding functions through analytical study. In addition, our proposed features can be used to explain and calibrate different attention heads in Transformer models.

An Analysis of Simple Data Augmentation for Named Entity Recognition
Xiang Dai | Heike Adel

Simple yet effective data augmentation techniques have been proposed for sentence-level and sentence-pair natural language processing tasks. Inspired by these efforts, we design and compare data augmentation for named entity recognition, which is usually modeled as a token-level sequence labeling problem. Through experiments on two data sets from the biomedical and materials science domains (i2b2-2010 and MaSciP), we show that simple augmentation can boost performance for both recurrent and transformer-based models, especially for small training sets.

Integrating Domain Terminology into Neural Machine Translation
Elise Michon | Josep Crego | Jean Senellart

This paper extends existing work on terminology integration into Neural Machine Translation, a common industrial practice to dynamically adapt translation to a specific domain. Our method, based on the use of placeholders complemented with morphosyntactic annotation, efficiently taps into the ability of the neural network to deal with symbolic knowledge to surpass the surface generalization shown by alternative techniques. We compare our approach to state-of-the-art systems and benchmark them through a well-defined evaluation framework, focusing on actual application of terminology and not just on the overall performance. Results indicate the suitability of our method in the use-case where terminology is used in a system trained on generic data only.

Neural Machine Translation Models with Back-Translation for the Extremely Low-Resource Indigenous Language Bribri
Isaac Feldman | Rolando Coto-Solano

This paper presents a neural machine translation model and dataset for the Chibchan language Bribri, with an average performance of BLEU 16.9±1.7. This was trained on an extremely small dataset (5923 Bribri-Spanish pairs), providing evidence for the applicability of NMT in extremely low-resource environments. We discuss the challenges entailed in managing training input from languages without standard orthographies, we provide evidence of successful learning of Bribri grammar, and also examine the translations of structures that are infrequent in major Indo-European languages, such as positional verbs, ergative markers, numerical classifiers and complex demonstrative systems. In addition to this, we perform an experiment of augmenting the dataset through iterative back-translation (Sennrich et al., 2016a; Hoang et al., 2018) by using Spanish sentences to create synthetic Bribri sentences. This improves the score by an average of 1.0 BLEU, but only when the new Spanish sentences belong to the same domain as the other Spanish examples. This contributes to the small but growing body of research on Chibchan NLP.

Dynamic Curriculum Learning for Low-Resource Neural Machine Translation
Chen Xu | Bojie Hu | Yufan Jiang | Kai Feng | Zeyang Wang | Shen Huang | Qi Ju | Tong Xiao | Jingbo Zhu

Large amounts of data have made neural machine translation (NMT) a big success in recent years. But it is still a challenge if we train these models on small-scale corpora. In this case, the way of using data appears to be more important. Here, we investigate the effective use of training data for low-resource NMT. In particular, we propose a dynamic curriculum learning (DCL) method to reorder training samples in training. Unlike previous work, we do not use a static scoring function for reordering. Instead, the order of training samples is dynamically determined in two ways: loss decline and model competence. This eases training by highlighting easy samples that the current model has enough competence to learn. We test our DCL method in a Transformer-based system. Experimental results show that DCL outperforms several strong baselines on three low-resource machine translation benchmarks and different sized data of WMT’16 En-De.

How LSTM Encodes Syntax: Exploring Context Vectors and Semi-Quantization on Natural Text
Chihiro Shibata | Kei Uchiumi | Daichi Mochihashi

Long Short-Term Memory recurrent neural network (LSTM) is widely used and known to capture informative long-term syntactic dependencies. However, how such information is reflected in its internal vectors for natural text has not yet been sufficiently investigated. We analyze them by learning a language model where syntactic structures are implicitly given. We empirically show that the context update vectors, i.e. outputs of internal gates, are approximately quantized to binary or ternary values to help the language model to count the depth of nesting accurately, as Suzgun et al. (2019) recently show for synthetic Dyck languages. For some dimensions in the context vector, we show that their activations are highly correlated with the depth of phrase structures, such as VP and NP. Moreover, with an L1 regularization, we also found that it can accurately predict whether a word is inside a phrase structure or not from a small number of components of the context vector. Even for the case of learning from raw text, context vectors are shown to still correlate well with the phrase structures. Finally, we show that natural clusters of the functional words and the parts of speech that trigger phrases are represented in a small but principal subspace of the context-update vector of LSTM.

When and Who? Conversation Transition Based on Bot-Agent Symbiosis Learning Network
Yipeng Yu | Ran Guan | Jie Ma | Zhuoxuan Jiang | Jingchang Huang

In online customer service applications, multiple chatbots that are specialized in various topics are typically developed separately and are then merged with other human agents to a single platform, presenting to the users with a unified interface. Ideally the conversation can be transparently transferred between different sources of customer support so that domain-specific questions can be answered timely and this is what we coined as a Bot-Agent symbiosis. Conversation transition is a major challenge in such online customer service and our work formalises the challenge as two core problems, namely, when to transfer and which bot or agent to transfer to and introduces a deep neural networks based approach that addresses these problems. Inspired by the net promoter score (NPS), our research reveals how the problems can be effectively solved by providing user feedback and developing deep neural networks that predict the conversation category distribution and the NPS of the dialogues. Experiments on realistic data generated from an online service support platform demonstrate that the proposed approach outperforms state-of-the-art methods and shows promising perspective for transparent conversation transition.

Translation vs. Dialogue: A Comparative Analysis of Sequence-to-Sequence Modeling
Wenpeng Hu | Ran Le | Bing Liu | Jinwen Ma | Dongyan Zhao | Rui Yan

Understanding neural models is a major topic of interest in the deep learning community. In this paper, we propose to interpret a general neural model comparatively. Specifically, we study the sequence-to-sequence (Seq2Seq) model in the contexts of two mainstream NLP tasks, machine translation and dialogue response generation, as they both use the seq2seq model. We investigate how the two tasks are different and how their task difference results in major differences in the behaviors of the resulting translation and dialogue generation systems. This study allows us to make several interesting observations and gain valuable insights, which can be used to help develop better translation and dialogue generation models. To our knowledge, no such comparative study has been done so far.

pdf bib
Diverse dialogue generation with context dependent dynamic loss function
Ayaka Ueyama | Yoshinobu Kano

Dialogue systems using deep learning have achieved generation of fluent response sentences to user utterances. Nevertheless, they tend to produce responses that are not diverse and which are less context-dependent. To address these shortcomings, we propose a new loss function, an Inverse N-gram loss (INF), which incorporates contextual fluency and diversity at the same time by a simple formula. Our INF loss can adjust its loss dynamically by a weight using the inverse frequency of the tokens’ n-gram applied to Softmax Cross-Entropy loss, so that rare tokens appear more likely while retaining the fluency of the generated sentences. We trained Transformer using English and Japanese Twitter replies as single-turn dialogues using different loss functions. Our INF loss model outperformed the baselines of SCE loss and ITF loss models in automatic evaluations such as DIST-N and ROUGE, and also achieved higher scores on our human evaluations of coherence and richness.
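The weighting idea described in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact formula, so everything below is an assumption made for illustration: the weight of each target token is taken to be the reciprocal of the training-corpus count of the n-gram ending at that token, and the function names (`ngram_counts`, `inverse_ngram_weights`, `weighted_cross_entropy`) are hypothetical rather than taken from the paper.

```python
import math
from collections import Counter


def ngram_counts(corpus, n):
    """Count n-grams over tokenized sentences, left-padding with <s>."""
    counts = Counter()
    for sent in corpus:
        padded = ["<s>"] * (n - 1) + sent
        for i in range(n - 1, len(padded)):
            counts[tuple(padded[i - n + 1 : i + 1])] += 1
    return counts


def inverse_ngram_weights(target, counts, n):
    """Assumed weighting: 1 / count of the n-gram ending at each token.

    Unseen n-grams are treated as if they had been seen once.
    """
    padded = ["<s>"] * (n - 1) + target
    weights = []
    for i in range(n - 1, len(padded)):
        gram = tuple(padded[i - n + 1 : i + 1])
        weights.append(1.0 / max(counts[gram], 1))
    return weights


def weighted_cross_entropy(gold_probs, weights):
    """Mean of per-token negative log-likelihoods scaled by their weights."""
    return -sum(w * math.log(p) for w, p in zip(weights, gold_probs)) / len(weights)


corpus = [["i", "see"], ["i", "see"], ["i", "know"]]
counts = ngram_counts(corpus, n=2)
weights = inverse_ngram_weights(["i", "know"], counts, n=2)
loss = weighted_cross_entropy([0.5, 0.5], weights)
```

In this toy corpus the bigram ending in "know" is rarer than the one ending in "see", so an error on "know" costs more, which is the behaviour the abstract attributes to inverse-frequency weighting.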

pdf bib
Towards Topic-Guided Conversational Recommender System
Kun Zhou | Yuanhang Zhou | Wayne Xin Zhao | Xiaoke Wang | Ji-Rong Wen

Conversational recommender systems (CRS) aim to recommend high-quality items to users through interactive conversations. To develop an effective CRS, the support of high-quality datasets is essential. Existing CRS datasets mainly focus on immediate requests from users, while lacking proactive guidance to the recommendation scenario. In this paper, we contribute a new CRS dataset named TG-ReDial (Recommendation through Topic-Guided Dialog). Our dataset has two major features. First, it incorporates topic threads to enforce natural semantic transitions towards the recommendation scenario. Second, it is created in a semi-automatic way, hence human annotation is more reasonable and controllable. Based on TG-ReDial, we present the task of topic-guided conversational recommendation, and propose an effective approach to this task. Extensive experiments have demonstrated the effectiveness of our approach on three sub-tasks, namely topic prediction, item recommendation and response generation. TG-ReDial is available at blue.

pdf bib
Summarize before Aggregate: A Global-to-local Heterogeneous Graph Inference Network for Conversational Emotion Recognition
Dongming Sheng | Dong Wang | Ying Shen | Haitao Zheng | Haozhuang Liu

Conversational Emotion Recognition (CER) is a crucial task in Natural Language Processing (NLP) with wide applications. Prior works in CER generally focus on modeling emotion influences solely with utterance-level features, with little attention paid to phrase-level semantic connections between utterances. Phrases carry sentiments when they refer to emotional events under certain topics, providing a global semantic connection between utterances throughout the entire conversation. In this work, we propose a two-stage Summarization and Aggregation Graph Inference Network (SumAggGIN), which seamlessly integrates inference for topic-related emotional phrases and local dependency reasoning over neighbouring utterances in a global-to-local fashion. Topic-related emotional phrases, which constitute the global topic-related emotional connections, are recognized by our proposed heterogeneous Summarization Graph. Local dependencies, which capture short-term emotional effects between neighbouring utterances, are further injected via an Aggregation Graph to distinguish the subtle differences between utterances containing emotional phrases. The two steps of graph inference are tightly coupled for a comprehensive understanding of emotional fluctuation. Experimental results on three CER benchmark datasets verify the effectiveness of our proposed model, which outperforms the state-of-the-art approaches.

pdf bib
Deconstruct to Reconstruct a Configurable Evaluation Metric for Open-Domain Dialogue Systems
Vitou Phy | Yang Zhao | Akiko Aizawa

Many automatic evaluation metrics have been proposed to score the overall quality of a response in open-domain dialogue. Generally, the overall quality comprises various aspects, such as relevancy, specificity, and empathy, and the importance of each aspect differs according to the task. For instance, specificity is mandatory in a food-ordering dialogue task, whereas fluency is preferred in a language-teaching dialogue system. However, existing metrics are not designed to cope with such flexibility. For example, BLEU score fundamentally relies only on word overlap, whereas BERTScore relies on semantic similarity between reference and candidate response. Thus, they are not guaranteed to capture the required aspects, e.g., specificity. To design a metric that is flexible to a task, we first propose making these qualities manageable by grouping them into three groups: understandability, sensibleness, and likability, where likability is a combination of qualities that are essential for a task. We also propose a simple method to combine the metrics of each aspect into a single metric called USL-H, which stands for Understandability, Sensibleness, and Likability in Hierarchy. We demonstrate that the USL-H score achieves good correlations with human judgment and maintains its configurability towards different aspects and metrics.

pdf bib
HiTrans: A Transformer-Based Context- and Speaker-Sensitive Model for Emotion Detection in Conversations
Jingye Li | Donghong Ji | Fei Li | Meishan Zhang | Yijiang Liu

Emotion detection in conversations (EDC) is to detect the emotion for each utterance in conversations that have multiple speakers. Different from the traditional non-conversational emotion detection, the model for EDC should be context-sensitive (e.g., understanding the whole conversation rather than one utterance) and speaker-sensitive (e.g., understanding which utterance belongs to which speaker). In this paper, we propose a transformer-based context- and speaker-sensitive model for EDC, namely HiTrans, which consists of two hierarchical transformers. We utilize BERT as the low-level transformer to generate local utterance representations, and feed them into another high-level transformer so that utterance representations could be sensitive to the global context of the conversation. Moreover, we exploit an auxiliary task to make our model speaker-sensitive, called pairwise utterance speaker verification (PUSV), which aims to classify whether two utterances belong to the same speaker. We evaluate our model on three benchmark datasets, namely EmoryNLP, MELD and IEMOCAP. Results show that our model outperforms previous state-of-the-art models.

pdf bib
A Co-Attentive Cross-Lingual Neural Model for Dialogue Breakdown Detection
Qian Lin | Souvik Kundu | Hwee Tou Ng

Ensuring smooth communication is essential in a chat-oriented dialogue system, so that a user can obtain meaningful responses through interactions with the system. Most prior work on dialogue research does not focus on preventing dialogue breakdown. One of the major challenges is that a dialogue system may generate an undesired utterance leading to a dialogue breakdown, which degrades the overall interaction quality. Hence, it is crucial for a machine to detect dialogue breakdowns in an ongoing conversation. In this paper, we propose a novel dialogue breakdown detection model that jointly incorporates a pretrained cross-lingual language model and a co-attention network. Our proposed model leverages effective word embeddings trained on one hundred different languages to generate contextualized representations. Co-attention aims to capture the interaction between the latest utterance and the conversation history, and thereby determines whether the latest utterance causes a dialogue breakdown. Experimental results show that our proposed model outperforms all previous approaches on all evaluation metrics in both the Japanese and English tracks in Dialogue Breakdown Detection Challenge 4 (DBDC4 at IWSDS2019).

pdf bib
Improving Low-Resource NMT through Relevance Based Linguistic Features Incorporation
Abhisek Chakrabarty | Raj Dabre | Chenchen Ding | Masao Utiyama | Eiichiro Sumita

In this study, linguistic knowledge at different levels is incorporated into the neural machine translation (NMT) framework to improve translation quality for language pairs with extremely limited data. Integrating manually designed or automatically extracted features into the NMT framework is known to be beneficial. However, this study emphasizes that the relevance of the features is crucial to the performance. Specifically, we propose two methods, 1) self relevance and 2) word-based relevance, to improve the representation of features for NMT. Experiments are conducted on translation tasks from English to eight Asian languages, with no more than twenty thousand sentences for training. The proposed methods improve translation quality for all tasks by up to 3.09 BLEU points. Discussions with visualization provide the explainability of the proposed methods, where we show that the relevance methods provide weights to features, thereby enhancing their impact on low-resource machine translation.

pdf bib
Filtering Back-Translated Data in Unsupervised Neural Machine Translation
Jyotsana Khatri | Pushpak Bhattacharyya

Unsupervised neural machine translation (NMT) utilizes only monolingual data for training. The quality of back-translated data plays an important role in the performance of NMT systems. In back-translation, not all generated pseudo parallel sentence pairs are of the same quality. Taking inspiration from domain adaptation, where in-domain sentences are given more weight in training, in this paper we propose an approach to filter back-translated data as part of the training process of unsupervised NMT. Our approach gives more weight to good pseudo parallel sentence pairs in the back-translation phase. We calculate the weight of each pseudo parallel sentence pair using a sentence-wise round-trip BLEU score, which is normalized batch-wise. We compare our approach with the current state-of-the-art approaches for unsupervised NMT.
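A minimal sketch of batch-wise weighting follows. The abstract does not state how the round-trip BLEU scores are normalized, so dividing each score by the batch total is an assumption, and the helper names `batch_normalize` and `weighted_batch_loss` are hypothetical. The scores are taken as precomputed inputs; computing sentence-level BLEU is out of scope here.

```python
def batch_normalize(scores, eps=1e-8):
    """Turn per-sentence round-trip scores into weights that sum to 1.

    Assumed scheme: divide by the batch total; fall back to uniform
    weights when every score in the batch is (near) zero.
    """
    total = sum(scores)
    if total <= eps:
        return [1.0 / len(scores)] * len(scores)
    return [s / total for s in scores]


def weighted_batch_loss(losses, scores):
    """Weight each pseudo-parallel pair's loss by its normalized score."""
    weights = batch_normalize(scores)
    return sum(w * loss for w, loss in zip(weights, losses))


# Three back-translated pairs: per-pair training losses and round-trip scores.
pair_losses = [1.0, 2.0, 3.0]
round_trip_scores = [20.0, 60.0, 20.0]
batch_loss = weighted_batch_loss(pair_losses, round_trip_scores)
```

Under this scheme a pair whose round trip reproduces the original sentence well contributes more to the update than a poorly reconstructed one, without discarding any pair outright.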

pdf bib
Lost in Back-Translation: Emotion Preservation in Neural Machine Translation
Enrica Troiano | Roman Klinger | Sebastian Padó

Machine translation provides powerful methods to convert text between languages, and is therefore a technology enabling a multilingual world. An important part of communication, however, takes place at the non-propositional level (e.g., politeness, formality, emotions), and it is far from clear whether current MT methods properly translate this information. This paper investigates the specific hypothesis that the non-propositional level of emotions is at least partially lost in MT. We carry out a number of experiments in a back-translation setup and establish that (1) emotions are indeed partially lost during translation; (2) this tendency can be reversed almost completely with a simple re-ranking approach informed by an emotion classifier, taking advantage of diversity in the n-best list; (3) the re-ranking approach can also be applied to change emotions, obtaining a model for emotion style transfer. An in-depth qualitative analysis reveals that there are recurring linguistic changes through which emotions are toned down or amplified, such as change of modality.
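The re-ranking step mentioned in point (2) can be sketched as follows. This is only an illustration under stated assumptions: the selection rule (pick the hypothesis to which the classifier assigns the highest score for the desired emotion) is an assumption, and `toy_classifier` is a hypothetical keyword lookup standing in for a trained emotion classifier, not the one used in the paper.

```python
def rerank_by_emotion(nbest, target_emotion, classify):
    """Return the n-best hypothesis that scores highest for target_emotion.

    `classify` maps a sentence to a dict of emotion -> score.
    """
    return max(nbest, key=lambda hyp: classify(hyp).get(target_emotion, 0.0))


def toy_classifier(text):
    """Hypothetical stand-in classifier based on a tiny keyword lexicon."""
    lexicon = {"furious": "anger", "angry": "anger",
               "delighted": "joy", "happy": "joy"}
    hits = [lexicon[w] for w in text.lower().split() if w in lexicon]
    if not hits:
        return {}
    return {emotion: hits.count(emotion) / len(hits) for emotion in set(hits)}


nbest = ["I am not pleased", "I am angry", "I am happy"]
preserved = rerank_by_emotion(nbest, "anger", toy_classifier)
transferred = rerank_by_emotion(nbest, "joy", toy_classifier)
```

Passing the source emotion as the target corresponds to emotion preservation, while passing a different emotion corresponds to the style-transfer use described in point (3).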

pdf bib
Context-Aware Cross-Attention for Non-Autoregressive Translation
Liang Ding | Longyue Wang | Di Wu | Dacheng Tao | Zhaopeng Tu

Non-autoregressive translation (NAT) significantly accelerates the inference process by predicting the entire target sequence. However, due to the lack of target dependency modelling in the decoder, the conditional generation process heavily depends on the cross-attention. In this paper, we reveal a localness perception problem in NAT cross-attention, for which it is difficult to adequately capture source context. To alleviate this problem, we propose to enhance signals of neighbour source tokens into conventional cross-attention. Experimental results on several representative datasets show that our approach can consistently improve translation quality over strong NAT baselines. Extensive analyses demonstrate that the enhanced cross-attention achieves better exploitation of source contexts by leveraging both local and global information.

pdf bib
Does Gender Matter? Towards Fairness in Dialogue Systems
Haochen Liu | Jamell Dacon | Wenqi Fan | Hui Liu | Zitao Liu | Jiliang Tang

Recently, there have been increasing concerns about the fairness of Artificial Intelligence (AI) in real-world applications such as computer vision and recommendations. For example, recognition algorithms in computer vision have been found to treat Black people unfairly, for instance by detecting their faces poorly and inappropriately identifying them as gorillas. As one crucial application of AI, dialogue systems have been extensively applied in our society. They are usually built with real human conversational data; thus they could inherit some fairness issues which are held in the real world. However, the fairness of dialogue systems has not been well investigated. In this paper, we perform a pioneering study about the fairness issues in dialogue systems. In particular, we construct a benchmark dataset and propose quantitative measures to understand fairness in dialogue models. Our studies demonstrate that popular dialogue models show significant prejudice towards different genders and races. Besides, to mitigate the bias in dialogue systems, we propose two simple but effective debiasing methods. Experiments show that our methods can reduce the bias in dialogue systems significantly. The dataset and the implementation are released to foster fairness research in dialogue systems.

pdf bib
Knowledge Aware Emotion Recognition in Textual Conversations via Multi-Task Incremental Transformer
Duzhen Zhang | Xiuyi Chen | Shuang Xu | Bo Xu

Emotion recognition in textual conversations (ERTC) plays an important role in a wide range of applications, such as opinion mining and recommender systems. ERTC, however, is a challenging task. For one thing, speakers often rely on the context and commonsense knowledge to express emotions; for another, most utterances in conversations carry neutral emotion, so the confusion between a few non-neutral utterances and the far more numerous neutral ones restrains the emotion recognition performance. In this paper, we propose a novel Knowledge Aware Incremental Transformer with Multi-task Learning (KAITML) to address these challenges. Firstly, we devise a dual-level graph attention mechanism to leverage commonsense knowledge, which augments the semantic information of the utterance. Then we apply the Incremental Transformer to encode multi-turn contextual utterances. Moreover, we are the first to introduce multi-task learning to alleviate the aforementioned confusion and thus further improve the emotion recognition performance. Extensive experimental results show that our KAITML model outperforms the state-of-the-art models across five benchmark datasets.

pdf bib
Leveraging Discourse Rewards for Document-Level Neural Machine Translation
Inigo Jauregi Unanue | Nazanin Esmaili | Gholamreza Haffari | Massimo Piccardi

Document-level machine translation focuses on the translation of entire documents from a source to a target language. It is widely regarded as a challenging task since the translation of the individual sentences in the document needs to retain aspects of the discourse at document level. However, document-level translation models are usually not trained to explicitly ensure discourse quality. Therefore, in this paper we propose a training approach that explicitly optimizes two established discourse metrics, lexical cohesion and coherence, by using a reinforcement learning objective. Experiments over four different language pairs and three translation domains have shown that our training approach has been able to achieve more cohesive and coherent document translations than other competitive approaches, yet without compromising the faithfulness to the reference translation. In the case of the Zh-En language pair, our method has achieved an improvement of 2.46 percentage points (pp) in LC and 1.17 pp in COH over the runner-up, while at the same time improving 0.63 pp in BLEU score and 0.47 pp in F-BERT.

pdf bib
Effective Use of Target-side Context for Neural Machine Translation
Hideya Mino | Hitoshi Ito | Isao Goto | Ichiro Yamada | Takenobu Tokunaga

In this paper, we deal with two problems in Japanese-English machine translation of news articles. The first problem is the quality of parallel corpora. Neural machine translation (NMT) systems suffer degraded performance when trained with noisy data. Because there is no clean Japanese-English parallel data for news articles, we build a novel parallel news corpus consisting of Japanese news articles translated into English in a content-equivalent manner. This is the first content-equivalent Japanese-English news corpus translated specifically for training NMT systems. The second problem involves the domain-adaptation technique. NMT systems suffer degraded performance when trained with mixed data having different features, such as noisy data and clean data. Though the existing methods try to overcome this problem by using tags for distinguishing the differences between corpora, it is not sufficient. We thus extend a domain-adaptation method using multi-tags to train an NMT model effectively with the clean corpus and existing parallel news corpora with some types of noise. Experimental results show that our corpus increases the translation quality, and that our domain-adaptation method is more effective for learning with the multiple types of corpora than existing domain-adaptation methods are.

pdf bib
Is MAP Decoding All You Need? The Inadequacy of the Mode in Neural Machine Translation
Bryan Eikema | Wilker Aziz

Recent studies have revealed a number of pathologies of neural machine translation (NMT) systems. Hypotheses explaining these mostly suggest there is something fundamentally wrong with NMT as a model or its training algorithm, maximum likelihood estimation (MLE). Most of this evidence was gathered using maximum a posteriori (MAP) decoding, a decision rule aimed at identifying the highest-scoring translation, i.e. the mode. We argue that the evidence corroborates the inadequacy of MAP decoding more than casts doubt on the model and its training algorithm. In this work, we show that translation distributions do reproduce various statistics of the data well, but that beam search strays from such statistics. We show that some of the known pathologies and biases of NMT are due to MAP decoding and not to NMT’s statistical assumptions nor MLE. In particular, we show that the most likely translations under the model accumulate so little probability mass that the mode can be considered essentially arbitrary. We therefore advocate for the use of decision rules that take into account the translation distribution holistically. We show that an approximation to minimum Bayes risk decoding gives competitive results confirming that NMT models do capture important aspects of translation well in expectation.
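The contrast between picking the mode and using a decision rule over the whole distribution can be sketched with a sampling-based approximation to minimum Bayes risk (MBR) decoding. This is a minimal sketch, not the authors' implementation: the candidates are assumed to be the samples themselves, the utility is a simple unigram F1 chosen for illustration, and the names `unigram_f1` and `mbr_decode` are hypothetical.

```python
from collections import Counter


def unigram_f1(hyp, ref):
    """Illustrative utility: F1 over whitespace-token multiset overlap."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(h.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)


def mbr_decode(samples, utility=unigram_f1):
    """Pick the sample with the highest average utility against all samples.

    The samples act both as candidates and as a Monte Carlo estimate of
    the model's translation distribution.
    """
    def expected_utility(candidate):
        return sum(utility(candidate, s) for s in samples) / len(samples)

    return max(samples, key=expected_utility)


# Hypothetical translations sampled from a model for one source sentence.
samples = ["the cat sat", "the cat sat down", "a cat sat", "hello"]
best = mbr_decode(samples)
```

The chosen output is the one most similar on average to everything the model might produce, so an isolated outlier such as "hello" is not selected even if it happened to receive slightly more probability than any single alternative.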

pdf bib
Domain Transfer based Data Augmentation for Neural Query Translation
Liang Yao | Baosong Yang | Haibo Zhang | Boxing Chen | Weihua Luo

Query translation (QT) serves as a critical factor in successful cross-lingual information retrieval (CLIR). Due to the lack of parallel query samples, neural-based QT models are usually optimized with synthetic data derived from large-scale monolingual queries. Nevertheless, this kind of pseudo corpus is mostly produced by a general-domain translation model, making it insufficient to guide the learning of the QT model. In this paper, we extend the data augmentation with a domain transfer procedure that revises synthetic candidates into search-aware examples. Specifically, the domain transfer model is built upon an advanced Transformer, in which layer coordination and mixed attention are exploited to speed up the refining process and leverage parameters from a pre-trained cross-lingual language model. In order to examine the effectiveness of the proposed method, we collected French-to-English and Spanish-to-English QT test sets, each of which consists of 10,000 parallel query pairs with careful manual checking. Qualitative and quantitative analyses reveal that our model significantly outperforms strong baselines and the related domain transfer methods on both translation quality and retrieval accuracy.

pdf bib
Aspectuality Across Genre: A Distributional Semantics Approach
Thomas Kober | Malihe Alikhani | Matthew Stone | Mark Steedman

The interpretation of the lexical aspect of verbs in English plays a crucial role in tasks such as recognizing textual entailment and learning discourse-level inferences. We show that two elementary dimensions of aspectual class, states vs. events, and telic vs. atelic events, can be modelled effectively with distributional semantics. We find that a verb’s local context is most indicative of its aspectual class, and we demonstrate that closed class words tend to be stronger discriminating contexts than content words. Our approach outperforms previous work on three datasets. Further, we present a new dataset of human-human conversations annotated with lexical aspects and present experiments that show the correlation of telicity with genre and discourse goals.

pdf bib
Joint Persian Word Segmentation Correction and Zero-Width Non-Joiner Recognition Using BERT
Ehsan Doostmohammadi | Minoo Nassajian | Adel Rahimi

Words are properly segmented in the Persian writing system; in practice, however, these writing rules are often neglected, resulting in single words being written disjointedly and multiple words written without any white spaces between them. This paper addresses the problems of word segmentation and zero-width non-joiner (ZWNJ) recognition in Persian, which we approach jointly as a sequence labeling problem. We achieved a macro-averaged F1-score of 92.40% on a carefully collected corpus of 500 sentences with a high level of difficulty.

pdf bib
Syllable-based Neural Thai Word Segmentation
Pattarawat Chormai | Ponrawee Prasertsom | Jin Cheevaprawatdomrong | Attapol Rutherford

Word segmentation is a challenging pre-processing step for Thai Natural Language Processing due to the lack of explicit word boundaries. Previous systems rely on powerful neural network architectures alone and ignore the linguistic substructures of Thai words. We utilize the linguistic observation that Thai strings can be segmented into syllables, which should narrow down the search space for the word boundaries and provide helpful features. Here, we propose a neural Thai Word Segmenter that uses syllable embeddings to capture linguistic constraints and uses dilated CNN filters to capture the environment of each character. Toward this goal, we develop the first ML-based Thai orthographical syllable segmenter, which yields syllable embeddings to be used as features by the word segmenter. Our word segmentation system outperforms the previous state-of-the-art system in both speed and accuracy on both in-domain and out-of-domain datasets.

pdf bib
Morphological disambiguation from stemming data
Antoine Nzeyimana

Morphological analysis and disambiguation is an important task and a crucial preprocessing step in natural language processing of morphologically rich languages. Kinyarwanda, a morphologically rich language, currently lacks tools for automated morphological analysis. While linguistically curated finite state tools can be easily developed for morphological analysis, the morphological richness of the language allows many ambiguous analyses to be produced, requiring effective disambiguation. In this paper, we propose learning to morphologically disambiguate Kinyarwanda verbal forms from a new stemming dataset collected through crowd-sourcing. Using feature engineering and a feed-forward neural network based classifier, we achieve about 89% non-contextualized disambiguation accuracy. Our experiments reveal that inflectional properties of stems and morpheme association rules are the most discriminative features for disambiguation.

pdf bib
Cross-lingual Transfer Learning for Grammatical Error Correction
Ikumi Yamashita | Satoru Katsumata | Masahiro Kaneko | Aizhan Imankulova | Mamoru Komachi

In this study, we explore cross-lingual transfer learning in grammatical error correction (GEC) tasks. Many languages lack the resources required to train GEC models. Cross-lingual transfer learning from high-resource languages (the source models) is effective for training models of low-resource languages (the target models) for various tasks. However, in GEC tasks, the possibility of transferring grammatical knowledge (e.g., grammatical functions) across languages is not evident. Therefore, we investigate cross-lingual transfer learning methods for GEC. Our results demonstrate that transfer learning from other languages can improve the accuracy of GEC. We also demonstrate that proximity to source languages has a significant impact on the accuracy of correcting certain types of errors.

pdf bib
ContraCAT: Contrastive Coreference Analytical Templates for Machine Translation
Dario Stojanovski | Benno Krojer | Denis Peskov | Alexander Fraser

Recent high scores on pronoun translation using context-aware neural machine translation have suggested that current approaches work well. ContraPro is a notable example of a contrastive challenge set for English-to-German pronoun translation. The high scores achieved by transformer models may suggest that they are able to effectively model the complicated set of inferences required to carry out pronoun translation. This entails the ability to determine which entities could be referred to, identify which entity a source-language pronoun refers to (if any), and access the target-language grammatical gender for that entity. We first show through a series of targeted adversarial attacks that in fact current approaches are not able to model all of this information well. Inserting small amounts of distracting information is enough to strongly reduce scores, which should not be the case. We then create a new template test set ContraCAT, designed to individually assess the ability to handle the specific steps necessary for successful pronoun translation. Our analyses show that current approaches to context-aware NMT rely on a set of surface heuristics, which break down when translations require real reasoning. We also propose an approach for augmenting the training data, with some improvements.

pdf bib
A Human Evaluation of AMR-to-English Generation Systems
Emma Manning | Shira Wein | Nathan Schneider

Most current state-of-the-art systems for generating English text from Abstract Meaning Representation (AMR) have been evaluated only using automated metrics, such as BLEU, which are known to be problematic for natural language generation. In this work, we present the results of a new human evaluation which collects fluency and adequacy scores, as well as categorization of error types, for several recent AMR generation systems. We discuss the relative quality of these systems and how our results compare to those of automatic metrics, finding that while the metrics are mostly successful in ranking systems overall, collecting human judgments allows for more nuanced comparisons. We also analyze common errors made by these systems.

pdf bib
Manual Clustering and Spatial Arrangement of Verbs for Multilingual Evaluation and Typology Analysis
Olga Majewska | Ivan Vulić | Diana McCarthy | Anna Korhonen

We present the first evaluation of the applicability of a spatial arrangement method (SpAM) to a typologically diverse language sample, and its potential to produce semantic evaluation resources to support multilingual NLP, with a focus on verb semantics. We demonstrate SpAM’s utility in allowing for quick bottom-up creation of large-scale evaluation datasets that balance cross-lingual alignment with language specificity. Starting from a shared sample of 825 English verbs, translated into Chinese, Japanese, Finnish, Polish, and Italian, we apply a two-phase annotation process which produces (i) semantic verb classes and (ii) fine-grained similarity scores for nearly 130 thousand verb pairs. We use the two types of verb data to (a) examine cross-lingual similarities and variation, and (b) evaluate the capacity of static and contextualised representation models to accurately reflect verb semantics, contrasting the performance of large language specific pretraining models with their multilingual equivalent on semantic clustering and lexical similarity, across different domains of verb meaning. We release the data from both phases as a large-scale multilingual resource, comprising 85 verb classes and nearly 130k pairwise similarity scores, offering a wealth of possibilities for further evaluation and research on multilingual verb semantics.

pdf bib
Measuring Correlation-to-Causation Exaggeration in Press Releases
Bei Yu | Jun Wang | Lu Guo | Yingya Li

Press releases have an increasingly strong influence on media coverage of health research; however, they have been found to contain seriously exaggerated claims that can misinform the public and undermine public trust in science. In this study we propose an NLP approach to identify exaggerated causal claims made in health press releases that report on observational studies, which are designed to establish correlational findings, but are often exaggerated as causal. We developed a new corpus and trained models that can identify causal claims in the main statements in a press release. By comparing the claims made in a press release with the corresponding claims in the original research paper, we found that 22% of press releases made exaggerated causal claims from correlational findings in observational studies. Furthermore, universities exaggerated more often than journal publishers by a ratio of 1.5 to 1. Encouragingly, the exaggeration rate has slightly decreased over the past 10 years, despite the increase of the total number of press releases. More research is needed to understand the cause of the decreasing pattern.

pdf bib
Inflating Topic Relevance with Ideology: A Case Study of Political Ideology Bias in Social Topic Detection Models
Meiqi Guo | Rebecca Hwa | Yu-Ru Lin | Wen-Ting Chung

We investigate the impact of political ideology biases in training data. Through a set of comparison studies, we examine the propagation of biases in several widely-used NLP models and its effect on the overall retrieval accuracy. Our work highlights the susceptibility of large, complex models to propagating the biases from human-selected input, which may lead to a deterioration of retrieval accuracy, and the importance of controlling for these biases. Finally, as a way to mitigate the bias, we propose to learn a text representation that is invariant to political ideology while still judging topic relevance.

pdf bib
Balanced Joint Adversarial Training for Robust Intent Detection and Slot Filling
Xu Cao | Deyi Xiong | Chongyang Shi | Chao Wang | Yao Meng | Changjian Hu

Joint intent detection and slot filling has recently achieved tremendous success in advancing the performance of utterance understanding. However, many joint models still suffer from the robustness problem, especially on noisy inputs or rare/unseen events. To address this issue, we propose a Joint Adversarial Training (JAT) model to improve the robustness of joint intent detection and slot filling, which consists of two parts: (1) automatically generating joint adversarial examples to attack the joint model, and (2) training the model to defend against the joint adversarial examples so as to robustify the model on small perturbations. As the generated joint adversarial examples have different impacts on the intent detection and slot filling loss, we further propose a Balanced Joint Adversarial Training (BJAT) model that applies a balance factor as a regularization term to the final loss function, which yields a stable training procedure. Extensive experiments and analyses on the lightweight models show that our proposed methods achieve significantly higher scores and substantially improve the robustness of both intent detection and slot filling. In addition, the combination of our BJAT with BERT-large achieves state-of-the-art results on two datasets.

pdf bib
Understanding Unnatural Questions Improves Reasoning over Text
Xiaoyu Guo | Yuan-Fang Li | Gholamreza Haffari

Complex question answering (CQA) over raw text is a challenging task. A prominent approach to this task is based on the programmer-interpreter framework, where the programmer maps the question into a sequence of reasoning actions and the interpreter then executes these actions on the raw text. Learning an effective CQA model requires large amounts of human-annotated data, consisting of the ground-truth sequence of reasoning actions, which is time-consuming and expensive to collect at scale. In this paper, we address the challenge of learning a high-quality programmer (parser) by projecting natural human-generated questions into unnatural machine-generated questions which are more convenient to parse. We firstly generate synthetic (question, action sequence) pairs by a data generator, and train a semantic parser that associates synthetic questions with their corresponding action sequences. To capture the diversity when applied to natural questions, we learn a projection model to map natural questions into their most similar unnatural questions for which the parser can work well. Without any natural training data, our projection model provides high-quality action sequences for the CQA task. Experimental results show that the QA model trained exclusively with synthetic data outperforms its state-of-the-art counterpart trained on human-labeled data.

pdf bib
A Mixture-of-Experts Model for Learning Multi-Facet Entity Embeddings
Rana Alshaikh | Zied Bouraoui | Shelan Jeawak | Steven Schockaert

Various methods have already been proposed for learning entity embeddings from text descriptions. Such embeddings are commonly used for inferring properties of entities, for recommendation and entity-oriented search, and for injecting background knowledge into neural architectures, among others. Entity embeddings essentially serve as a compact encoding of a similarity relation, but similarity is an inherently multi-faceted notion. By representing entities as single vectors, existing methods leave it to downstream applications to identify these different facets, and to select the most relevant ones. In this paper, we propose a model that instead learns several vectors for each entity, each of which intuitively captures a different aspect of the considered domain. We use a mixture-of-experts formulation to jointly learn these facet-specific embeddings. The individual entity embeddings are learned using a variant of the GloVe model, which has the advantage that we can easily identify which properties are modelled well in which of the learned embeddings. This is exploited by an associated gating network, which uses pre-trained word vectors to encourage the properties that are modelled by a given embedding to be semantically coherent, i.e. to encourage each of the individual embeddings to capture a meaningful facet.

pdf bib
Classifier Probes May Just Learn from Linear Context Features
Jenny Kunz | Marco Kuhlmann

Classifiers trained on auxiliary probing tasks are a popular tool to analyze the representations learned by neural sentence encoders such as BERT and ELMo. While many authors are aware of the difficulty to distinguish between extracting the linguistic structure encoded in the representations and learning the probing task, the validity of probing methods calls for further research. Using a neighboring word identity prediction task, we show that the token embeddings learned by neural sentence encoders contain a significant amount of information about the exact linear context of the token, and hypothesize that, with such information, learning standard probing tasks may be feasible even without additional linguistic structure. We develop this hypothesis into a framework in which analysis efforts can be scrutinized and argue that, with current models and baselines, conclusions that representations contain linguistic structure are not well-founded. Current probing methodology, such as restricting the classifier’s expressiveness or using strong baselines, can help to better estimate the complexity of learning, but not build a foundation for speculations about the nature of the linguistic structure encoded in the learned representations.

pdf bib
Priorless Recurrent Networks Learn Curiously
Jeff Mitchell | Jeffrey Bowers

Recently, domain-general recurrent neural networks, without explicit linguistic inductive biases, have been shown to successfully reproduce a range of human language behaviours, such as accurately predicting number agreement between nouns and verbs. We show that such networks will also learn number agreement within unnatural sentence structures, i.e. structures that are not found within any natural languages and which humans struggle to process. These results suggest that the models are learning from their input in a manner that is substantially different from human language acquisition, and we undertake an analysis of how the learned knowledge is stored in the weights of the network. We find that while the model has an effective understanding of singular versus plural for individual sentences, there is a lack of a unified concept of number agreement connecting these processes across the full range of inputs. Moreover, the weights handling natural and unnatural structures overlap substantially, in a way that underlines the non-human-like nature of the knowledge learned by the network.

pdf bib
Identifying Motion Entities in Natural Language and A Case Study for Named Entity Recognition
Ngoc Phuoc An Vo | Irene Manotas | Vadim Sheinin | Octavian Popescu

Motion recognition is one of the basic cognitive capabilities of many life forms; however, detecting and understanding motion in text is not a trivial task. In addition, identifying motion entities in natural language is not only challenging but also beneficial for better natural language understanding. In this paper, we present a Motion Entity Tagging (MET) model to identify entities in motion in a text, using the Literal-Motion-in-Text (LiMiT) dataset for training and evaluating the model. Then we propose a new method to split clauses and phrases from complex and long motion sentences to improve the performance of our MET model. We also present results showing that motion features, in particular entities in motion, benefit the Named-Entity Recognition (NER) task. Finally, we present an analysis of the special co-occurrence relation between the person category in NER and animate entities in motion, which significantly improves the classification performance for the person category in NER.

pdf bib
User Memory Reasoning for Conversational Recommendation
Hu Xu | Seungwhan Moon | Honglei Liu | Bing Liu | Pararth Shah | Bing Liu | Philip Yu

We study an end-to-end approach for conversational recommendation that dynamically manages and reasons over users’ past (offline) preferences and current (online) requests through a structured and cumulative user memory knowledge graph. This formulation extends existing state tracking beyond the boundary of a single dialog to user state tracking (UST). For this study, we create a new Memory Graph (MG)-Conversational Recommendation parallel corpus called MGConvRex with 7K+ human-to-human role-playing dialogs, grounded on a large-scale user memory bootstrapped from real-world user scenarios. MGConvRex captures human-level reasoning over user memory and has disjoint training/testing sets of users for zero-shot (cold-start) reasoning for recommendation. We propose a simple yet expandable formulation for constructing and updating the MG, and an end-to-end graph-based reasoning model that updates MG from unstructured utterances and predicts optimal dialog policies (e.g., recommendation) based on updated MG. The prediction of our proposed model inherits the graph structure, providing a natural way to explain policies. Experiments are conducted for both offline metrics and online simulation, showing competitive results.

pdf bib
Diverse and Non-redundant Answer Set Extraction on Community QA based on DPPs
Shogo Fujita | Tomohide Shibata | Manabu Okumura

In community-based question answering (CQA) platforms, it takes time for a user to get useful information from among many answers. Although one solution is an answer ranking method, the user still needs to read through the top-ranked answers carefully. This paper proposes a new task of selecting a diverse and non-redundant answer set rather than ranking the answers. Our method is based on determinantal point processes (DPPs), and it calculates the answer importance and similarity between answers by using BERT. We built a dataset focusing on a Japanese CQA site, and the experiments on this dataset demonstrated that the proposed method outperformed several baseline methods.
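
The selection idea in this abstract can be illustrated with a minimal sketch of greedy subset selection under a DPP-style kernel. This is not the authors' implementation: the quality scores and similarity matrix below are made-up numbers standing in for the BERT-based importance and similarity the abstract mentions, and the function names `determinant` and `greedy_dpp_select` are hypothetical.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result


def greedy_dpp_select(quality, similarity, k):
    """Greedily pick up to k items that maximize det(L_S),
    where L[i][j] = quality[i] * similarity[i][j] * quality[j]."""
    n = len(quality)
    kernel = [[quality[i] * similarity[i][j] * quality[j] for j in range(n)]
              for i in range(n)]
    selected = []
    for _ in range(k):
        best_item, best_det = None, 0.0
        for i in range(n):
            if i in selected:
                continue
            idx = selected + [i]
            sub = [[kernel[a][b] for b in idx] for a in idx]
            d = determinant(sub)
            if d > best_det:
                best_item, best_det = i, d
        if best_item is None:  # no candidate adds positive volume
            break
        selected.append(best_item)
    return selected


# Illustrative (assumed) scores: answers 0 and 1 are near-duplicates,
# answer 2 is less important but covers different content.
quality = [1.0, 0.9, 0.8]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(greedy_dpp_select(quality, similarity, 2))  # [0, 2]
```

A plain ranking by quality would return answers 0 and 1; the determinant penalizes their overlap, so the sketch picks the less similar answer 2 instead.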

pdf bib
An empirical analysis of existing systems and datasets toward general simple question answering
Namgi Han | Goran Topic | Hiroshi Noji | Hiroya Takamura | Yusuke Miyao

In this paper, we evaluate the progress of our field toward solving simple factoid questions over a knowledge base, a practically important problem in natural language interface to database. As in other natural language understanding tasks, a common practice for this task is to train and evaluate a model on a single dataset, and recent studies suggest that SimpleQuestions, the most popular and largest dataset, is nearly solved under this setting. However, this common setting does not evaluate the robustness of the systems outside of the distribution of the used training data. We rigorously evaluate such robustness of existing systems using different datasets. Our analysis, including shifting of training and test datasets and training on a union of the datasets, suggests that our progress in solving SimpleQuestions dataset does not indicate the success of more general simple question answering. We discuss a possible future direction toward this goal.

pdf bib
Scientific Keyphrase Identification and Classification by Pre-Trained Language Models Intermediate Task Transfer Learning
Seoyeon Park | Cornelia Caragea

Scientific keyphrase identification and classification is the task of detecting and classifying keyphrases from scholarly text with their types from a set of predefined classes. This task has a wide range of benefits, but it is still challenging in performance due to the lack of large amounts of labeled data required for training deep neural models. In order to overcome this challenge, we explore pre-trained language models BERT and SciBERT with intermediate task transfer learning, using 42 data-rich related intermediate-target task combinations. We reveal that intermediate task transfer learning on SciBERT induces a better starting point for target task fine-tuning compared with BERT and achieves competitive performance in scientific keyphrase identification and classification compared to both previous works and strong baselines. Interestingly, we observe that BERT with intermediate task transfer learning fails to improve the performance of scientific keyphrase identification and classification potentially due to significant catastrophic forgetting. This result highlights that scientific knowledge achieved during the pre-training of language models on large scientific collections plays an important role in the target tasks. We also observe that sequence tagging related intermediate tasks, especially syntactic structure learning tasks such as POS Tagging, tend to work best for scientific keyphrase identification and classification.

pdf bib
Exploiting Microblog Conversation Structures to Detect Rumors
Jiawen Li | Yudianto Sujana | Hung-Yu Kao

As one of the most popular social media platforms, Twitter has become a primary source of information for many people. Unfortunately, both valid information and rumors are propagated on Twitter due to the lack of an automatic information verification system. Twitter users communicate by replying to other users’ messages, forming a conversation structure. Using this structure, users can decide whether the information in the source tweet is a rumor by reading the tweet’s replies, which voice other users’ stances on the tweet. The majority of rumor detection researchers process such tweets based on time, ignoring the conversation structure. To reap the benefits of the Twitter conversation structure, we developed a model to detect rumors by modeling conversation structure as a graph. Thus, our model’s improved representation of the conversation structure enhances its rumor detection accuracy. The experimental results on two rumor datasets show that our model outperforms several baseline models, including a state-of-the-art model.

pdf bib
Words are the Window to the Soul: Language-based User Representations for Fake News Detection
Marco Del Tredici | Raquel Fernández

Cognitive and social traits of individuals are reflected in language use. Moreover, individuals who are prone to spread fake news online often share common traits. Building on these ideas, we introduce a model that creates representations of individuals on social media based only on the language they produce, and use them to detect fake news. We show that language-based user representations are beneficial for this task. We also present an extended analysis of the language of fake news spreaders, showing that its main features are mostly domain independent and consistent across two English datasets. Finally, we exploit the relation between language use and connections in the social graph to assess the presence of the Echo Chamber effect in our data.

pdf bib
Go Simple and Pre-Train on Domain-Specific Corpora: On the Role of Training Data for Text Classification
Aleksandra Edwards | Jose Camacho-Collados | Hélène De Ribaupierre | Alun Preece

Pre-trained language models provide the foundations for state-of-the-art performance across a wide range of natural language processing tasks, including text classification. However, most classification datasets assume a large amount of labeled data, which is commonly not the case in practical settings. In particular, in this paper we compare the performance of a light-weight linear classifier based on word embeddings, i.e., fastText (Joulin et al., 2017), versus a pre-trained language model, i.e., BERT (Devlin et al., 2019), across a wide range of datasets and classification tasks. In general, results show the importance of domain-specific unlabeled data, both in the form of word embeddings or language models. As for the comparison, BERT outperforms all baselines in standard datasets with large training sets. However, in settings with small training datasets a simple method like fastText coupled with domain-specific word embeddings performs equally well or better than BERT, even when pre-trained on domain-specific data.

pdf bib
Exploiting Narrative Context and A Priori Knowledge of Categories in Textual Emotion Classification
Hikari Tanabe | Tetsuji Ogawa | Tetsunori Kobayashi | Yoshihiko Hayashi

Recognition of the mental state of a human character in text is a major challenge in natural language processing. In this study, we investigate the efficacy of the narrative context in recognizing the emotional states of human characters in text and discuss an approach to make use of a priori knowledge regarding the employed emotion category system. Specifically, we experimentally show that the accuracy of emotion classification is substantially increased by encoding the preceding context of the target sentence using a BERT-based text encoder. We also compare ways to incorporate a priori knowledge of emotion categories by altering the loss function used in training, in which our proposal of multi-task learning that jointly learns to classify positive/negative polarity of emotions is included. The experimental results suggest that, when using Plutchik’s Wheel of Emotions, it is better to jointly classify the basic emotion categories with positive/negative polarity rather than directly exploiting its characteristic structure in which eight basic categories are arranged in a wheel.

pdf bib
Few-Shot Text Classification with Edge-Labeling Graph Neural Network-Based Prototypical Network
Chen Lyu | Weijie Liu | Ping Wang

In this paper, we propose a new few-shot text classification method. Compared with supervised learning methods which require a large corpus of labeled documents, our method aims to make it possible to classify unlabeled text with few labeled data. To achieve this goal, we take advantage of an advanced pre-trained language model to extract the semantic features of each document. Furthermore, we utilize an edge-labeling graph neural network to implicitly model the intra-cluster similarity and the inter-cluster dissimilarity of the documents. Finally, we take the results of the graph neural network as the input of a prototypical network to classify the unlabeled texts. We verify the effectiveness of our method on a sentiment analysis dataset and a relation classification dataset and achieve state-of-the-art performance on both tasks.

pdf bib
Automatically Identifying Words That Can Serve as Labels for Few-Shot Text Classification
Timo Schick | Helmut Schmid | Hinrich Schütze

A recent approach for few-shot text classification is to convert textual inputs to cloze questions that contain some form of task description, process them with a pretrained language model and map the predicted words to labels. Manually defining this mapping between words and labels requires both domain expertise and an understanding of the language model’s abilities. To mitigate this issue, we devise an approach that automatically finds such a mapping given small amounts of training data. For a number of tasks, the mapping found by our approach performs almost as well as hand-crafted label-to-word mappings.
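
To make the idea of an automatically found word-to-label mapping concrete, here is a deliberately simplified sketch. It is not the method of the paper: it assumes we already have, for each labeled training example, the language model's probabilities for candidate words at the cloze position (the numbers below are invented), and it picks for each label the word whose mean probability is highest on that label's examples relative to the other examples. The function name `find_label_words` is hypothetical.

```python
def find_label_words(examples, vocabulary):
    """Pick one word per label.

    examples: list of (label, {word: probability}) pairs, where the
        probabilities are assumed to come from a masked language model
        at the cloze position (hypothetical input format).
    vocabulary: candidate words to choose from.
    """
    labels = sorted({label for label, _ in examples})
    mapping = {}
    for label in labels:
        own = [probs for lab, probs in examples if lab == label]
        other = [probs for lab, probs in examples if lab != label]

        def score(word):
            mean_own = sum(p.get(word, 0.0) for p in own) / len(own)
            mean_other = (sum(p.get(word, 0.0) for p in other) / len(other)
                          if other else 0.0)
            return mean_own - mean_other

        mapping[label] = max(vocabulary, key=score)
    return mapping


# Invented probabilities for a two-class sentiment task.
examples = [
    ("positive", {"great": 0.6, "bad": 0.1, "the": 0.3}),
    ("positive", {"great": 0.5, "bad": 0.2, "the": 0.3}),
    ("negative", {"great": 0.1, "bad": 0.6, "the": 0.3}),
    ("negative", {"great": 0.2, "bad": 0.5, "the": 0.3}),
]
print(find_label_words(examples, ["great", "bad", "the"]))
```

Subtracting the mean probability on other labels keeps frequent but uninformative words such as "the" from being chosen, which is the main design point of the sketch.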

pdf bib
IntKB: A Verifiable Interactive Framework for Knowledge Base Completion
Bernhard Kratzwald | Guo Kunpeng | Stefan Feuerriegel | Dennis Diefenbach

Knowledge bases (KBs) are essential for many downstream NLP tasks, yet their prime shortcoming is that they are often incomplete. State-of-the-art frameworks for KB completion often lack sufficient accuracy to work fully automated without human supervision. As a remedy, we propose IntKB: a novel interactive framework for KB completion from text based on a question answering pipeline. Our framework is tailored to the specific needs of a human-in-the-loop paradigm: (i) We generate facts that are aligned with text snippets and are thus immediately verifiable by humans. (ii) Our system is designed such that it continuously learns during the KB completion task and, therefore, significantly improves its performance upon initial zero- and few-shot relations over time. (iii) We only trigger human interactions when there is enough information for a correct prediction. Therefore, we train our system with negative examples and a fold-option if there is no answer. Our framework yields a favorable performance: it achieves a hit@1 ratio of 29.7% for initially unseen relations, upon which it gradually improves to 46.2%.

pdf bib
Multimodal Sentence Summarization via Multimodal Selective Encoding
Haoran Li | Junnan Zhu | Jiajun Zhang | Xiaodong He | Chengqing Zong

This paper studies the problem of generating a summary for a given sentence-image pair. Existing multimodal sequence-to-sequence approaches mainly focus on enhancing the decoder by visual signals, while ignoring that the image can improve the ability of the encoder to identify highlights of a news event or a document. Thus, we propose a multimodal selective gate network that considers reciprocal relationships between textual and multi-level visual features, including global image descriptor, activation grids, and object proposals, to select highlights of the event when encoding the source sentence. In addition, we introduce a modality regularization to encourage the summary to capture the highlights embedded in the image more accurately. To verify the generalization of our model, we adopt the multimodal selective gate to the text-based decoder and multimodal-based decoder. Experimental results on a public multimodal sentence summarization dataset demonstrate the advantage of our models over baselines. Further analysis suggests that our proposed multimodal selective gate network can effectively select important information in the input sentence.

pdf bib
How Domain Terminology Affects Meeting Summarization Performance
Jia Jin Koay | Alexander Roustai | Xiaojin Dai | Dillon Burns | Alec Kerrigan | Fei Liu

Meetings are essential to modern organizations. Numerous meetings are held and recorded daily, more than can ever be comprehended. A meeting summarization system that identifies salient utterances from the transcripts to automatically generate meeting minutes can help. It empowers users to rapidly search and sift through large meeting collections. To date, the impact of domain terminology on the performance of meeting summarization remains understudied, despite the fact that meetings are rich with domain knowledge. In this paper, we create gold-standard annotations for domain terminology on a sizable meeting corpus; they are known as jargon terms. We then analyze the performance of a meeting summarization system with and without jargon terms. Our findings reveal that domain terminology can have a substantial impact on summarization performance. We publicly release all domain terminology to advance research in meeting summarization.

pdf bib
On the Faithfulness for E-commerce Product Summarization
Peng Yuan | Haoran Li | Song Xu | Youzheng Wu | Xiaodong He | Bowen Zhou

In this work, we present a model to generate e-commerce product summaries. The consistency between the generated summary and the product attributes is an essential criterion for the e-commerce product summarization task. To enhance the consistency, first, we encode the product attribute table to guide the process of summary generation. Second, we identify the attribute words from the vocabulary, and we constrain these attribute words so that they can appear in the summaries only through copying from the source, i.e., attribute words not in the source cannot be generated. We construct a Chinese e-commerce product summarization dataset, and the experimental results on this dataset demonstrate that our models significantly improve the faithfulness.

pdf bib
Variation in Coreference Strategies across Genres and Production Media
Berfin Aktaş | Manfred Stede

In response to (i) inconclusive results in the literature as to the properties of coreference chains in written versus spoken language, and (ii) a general lack of work on automatic coreference resolution on both spoken language and social media, we undertake a corpus study involving the various genre sections of OntoNotes, the Switchboard corpus, and a corpus of Twitter conversations. Using a set of measures that previously have been applied individually to different data sets, we find fairly clear patterns of behavior for the different genres/media. Besides their role for psycholinguistic investigation (why do we employ different coreference strategies when we write or speak?) and for the placement of Twitter in the spoken-written continuum, we see our results as a contribution to approaching genre-/media-specific coreference resolution.

pdf bib
Using Eye-tracking Data to Predict the Readability of Brazilian Portuguese Sentences in Single-task, Multi-task and Sequential Transfer Learning Approaches
Sidney Evaldo Leal | João Marcos Munguba Vieira | Erica dos Santos Rodrigues | Elisângela Nogueira Teixeira | Sandra Aluísio

Sentence complexity assessment is a relatively new task in Natural Language Processing. One of its aims is to highlight in a text which sentences are more complex to support the simplification of contents for a target audience (e.g., children, cognitively impaired users, non-native speakers and low-literacy readers (Scarton and Specia, 2018)). This task is evaluated using datasets of pairs of aligned sentences including the complex and simple version of the same sentence. For Brazilian Portuguese, the task was addressed by Leal et al. (2018), who set up the first dataset to evaluate the task in this language, reaching 87.8% accuracy with linguistic features. The present work advances these results, using models inspired by Gonzalez-Garduno and Søgaard (2018), which hold the state of the art for the English language, with multi-task learning and eye-tracking measures. First-Pass Duration, Total Regression Duration and Total Fixation Duration were used at two points: first to select a subset of linguistic features and then as an auxiliary task in the multi-task and sequential learning models. The best model proposed here reaches a new state of the art for Portuguese with 97.5% accuracy, an increase of almost 10 points compared to the best previous results, in addition to proposing improvements in the public dataset after analysing the errors of our best model.

pdf bib
Retrieving Skills from Job Descriptions: A Language Model Based Extreme Multi-label Classification Framework
Akshay Bhola | Kishaloy Halder | Animesh Prasad | Min-Yen Kan

We introduce a deep learning model to learn the set of enumerated job skills associated with a job description. In our analysis of a large-scale government job portal, we observe that as much as 65% of job descriptions miss describing a significant number of relevant skills. Our model addresses this task from the perspective of an extreme multi-label classification (XMLC) problem, where descriptions are the evidence for the binary relevance of thousands of individual skills. Building upon the current state-of-the-art language modeling approaches such as BERT, we show our XMLC method improves on an existing baseline solution by over 9% and 7% absolute in terms of recall and normalized discounted cumulative gain, respectively. We further show that our approach, which ensures that the BERT-XMLC model accounts for the structured semantic representation of skills and their co-occurrences through a Correlation Aware Bootstrapping process, effectively addresses the missing skills problem and helps in recovering relevant skills that were missed out in the job postings. To facilitate future research and replication of our work, we have made the dataset and the implementation of our model publicly available.

pdf bib
An Analysis of Dataset Overlap on Winograd-Style Tasks
Ali Emami | Kaheer Suleman | Adam Trischler | Jackie Chi Kit Cheung

The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR). Model performance on the WSC has quickly progressed from chance-level to near-human using neural language models trained on massive corpora. In this paper, we analyze the effects of varying degrees of overlaps that occur between these corpora and the test instances in WSC-style tasks. We find that a large number of test instances overlap considerably with the pretraining corpora on which state-of-the-art models are trained, and that a significant drop in classification accuracy occurs when models are evaluated on instances with minimal overlap. Based on these results, we provide the WSC-Web dataset, consisting of over 60k pronoun disambiguation problems scraped from web data, being both the largest corpus to date, and having a significantly lower proportion of overlaps with current pretraining corpora.

pdf bib
Don’t Patronize Me! An Annotated Dataset with Patronizing and Condescending Language towards Vulnerable Communities
Carla Perez Almendros | Luis Espinosa Anke | Steven Schockaert

In this paper, we introduce a new annotated dataset which is aimed at supporting the development of NLP models to identify and categorize language that is patronizing or condescending towards vulnerable communities (e.g. refugees, homeless people, poor families). While the prevalence of such language in the general media has long been shown to have harmful effects, it differs from other types of harmful language, in that it is generally used unconsciously and with good intentions. We furthermore believe that the often subtle nature of patronizing and condescending language (PCL) presents an interesting technical challenge for the NLP community. Our analysis of the proposed dataset shows that identifying PCL is hard for standard NLP models, with language models such as BERT achieving the best results.

pdf bib
WikiUMLS: Aligning UMLS to Wikipedia via Cross-lingual Neural Ranking
Afshin Rahimi | Timothy Baldwin | Karin Verspoor

We present our work on aligning the Unified Medical Language System (UMLS) to Wikipedia, to facilitate manual alignment of the two resources. We propose a cross-lingual neural reranking model to match a UMLS concept with a Wikipedia page, which achieves a recall@1 of 72%, a substantial improvement of 20% over word- and char-level BM25, enabling manual alignment with minimal effort. We release our resources, including ranked Wikipedia pages for 700k UMLS concepts, and WikiUMLS, a dataset for training and evaluation of alignment models between UMLS and Wikipedia collected from Wikidata. This will provide easier access to Wikipedia for health professionals, patients, and NLP systems, including in multilingual settings.
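
For readers unfamiliar with the recall@1 metric reported here, a minimal sketch of how recall@k is typically computed for a ranking task with one gold target per query follows. The concept and page identifiers are placeholders, not entries from UMLS or the WikiUMLS dataset, and `recall_at_k` is a hypothetical helper name.

```python
def recall_at_k(rankings, gold, k=1):
    """Fraction of queries whose gold target appears in the top-k ranked candidates.

    rankings: {query_id: [candidate ids, best first]}
    gold: {query_id: correct candidate id}
    """
    hits = sum(1 for query, ranked in rankings.items() if gold[query] in ranked[:k])
    return hits / len(rankings)


# Placeholder data: three concepts, each with a ranked list of candidate pages.
rankings = {
    "concept_a": ["page_1", "page_2", "page_3"],
    "concept_b": ["page_5", "page_4", "page_6"],
    "concept_c": ["page_9", "page_8", "page_7"],
}
gold = {"concept_a": "page_1", "concept_b": "page_4", "concept_c": "page_7"}

print(recall_at_k(rankings, gold, k=1))
print(recall_at_k(rankings, gold, k=3))
```

With one gold page per concept, recall@1 is simply the share of concepts whose top-ranked page is correct, which is why it indicates how little manual checking an annotator would need to do.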

pdf bib
The Transference Architecture for Automatic Post-Editing
Santanu Pal | Hongfei Xu | Nico Herbig | Sudip Kumar Naskar | Antonio Krüger | Josef van Genabith

In automatic post-editing (APE) it makes sense to condition post-editing (pe) decisions on both the source (src) and the machine translated text (mt) as input. This has led to multi-encoder based neural APE approaches. A research challenge now is the search for architectures that best support the capture, preparation and provision of src and mt information and its integration with pe decisions. In this paper we present an efficient multi-encoder based APE model, called transference. Unlike previous approaches, it (i) uses a transformer encoder block for src, (ii) followed by a decoder block, but without masking for self-attention on mt, which effectively acts as a second encoder combining src and mt, and (iii) feeds this representation into a final decoder block generating pe. Our model outperforms the best performing systems by 1 BLEU point on the WMT 2016, 2017, and 2018 English-German APE shared tasks (PBSMT and NMT). Furthermore, the results of our model on the WMT 2019 APE task using NMT data show a comparable performance to the state-of-the-art system. The inference time of our model is similar to the vanilla transformer-based NMT system although our model deals with two separate encoders. We further investigate the importance of our newly introduced second encoder and find that too few layers hurt the performance, while reducing the number of layers of the decoder does not matter much.

pdf bib
A Simple and Effective Approach to Robust Unsupervised Bilingual Dictionary Induction
Yanyang Li | Yingfeng Luo | Ye Lin | Quan Du | Huizhen Wang | Shujian Huang | Tong Xiao | Jingbo Zhu

Unsupervised Bilingual Dictionary Induction methods based on initialization and self-learning have achieved great success in similar language pairs, e.g., English-Spanish. But they still fail and have an accuracy of 0% in many distant language pairs, e.g., English-Japanese. In this work, we show that this failure results from the gap between the actual initialization performance and the minimum initialization performance for the self-learning to succeed. We propose Iterative Dimension Reduction to bridge this gap. Our experiments show that this simple method does not hamper the performance of similar language pairs and achieves an accuracy of 13.64% to 55.53% between English and four distant languages, i.e., Chinese, Japanese, Vietnamese and Thai.

pdf bib
Data Selection for Bilingual Lexicon Induction from Specialized Comparable Corpora
Martin Laville | Amir Hazem | Emmanuel Morin | Phillippe Langlais

Narrow specialized comparable corpora are often small in size. This particularity makes it difficult to build efficient models to acquire translation equivalents, especially for less frequent and rare words. One way to overcome this issue is to enrich the specialized corpora with out-of-domain resources. Although some recent studies have shown improvements using data augmentation, the enrichment method was roughly conducted by adding out-of-domain data with no particular attention given to how to enrich words and how to do it optimally. In this paper, we contrast several data selection techniques to improve bilingual lexicon induction from specialized comparable corpora. We first apply two well-established data selection techniques often used in machine translation, namely Tf-Idf and cross entropy. Then, we propose to exploit BERT for data selection. Overall, all the proposed techniques improve the quality of the extracted bilingual lexicons by a large margin. The best performing model is the cross entropy, obtaining a gain of about 4 points in MAP while decreasing computation time by a factor of 10.

pdf bib
A Locally Linear Procedure for Word Translation
Soham Dan | Hagai Taitelbaum | Jacob Goldberger

Learning a mapping between word embeddings of two languages given a dictionary is an important problem with several applications. A common mapping approach is using an orthogonal matrix. The Orthogonal Procrustes Analysis (PA) algorithm can be applied to find the optimal orthogonal matrix. This solution restricts the expressiveness of the translation model which may result in sub-optimal translations. We propose a natural extension of the PA algorithm that uses multiple orthogonal translation matrices to model the mapping and derive an algorithm to learn these multiple matrices. We achieve better performance in a bilingual word translation task and a cross-lingual word similarity task compared to the single matrix baseline. We also show how multiple matrices can model multiple senses of a word.
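The single-matrix baseline mentioned above can be sketched as follows. This is only the standard Orthogonal Procrustes solution via SVD, assuming NumPy is available; it is not the authors' multi-matrix extension, and the variable names and random toy embeddings are illustrative assumptions.

```python
import numpy as np


def orthogonal_procrustes(X, Y):
    """Return the orthogonal matrix W that minimizes ||X @ W - Y||_F.

    X and Y hold source and target embeddings of dictionary pairs, one row per pair.
    """
    # The optimum is U @ Vt, where U, S, Vt is the SVD of X^T Y.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


# Toy check with synthetic data: recover a known orthogonal map.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                       # hypothetical source embeddings
Q_true, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # a random orthogonal matrix
Y = X @ Q_true                                     # hypothetical target embeddings

W = orthogonal_procrustes(X, Y)
print(np.allclose(W, Q_true))
```

A single orthogonal matrix applies the same rotation to every word, which is the restriction on expressiveness that the abstract refers to.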

pdf bib
The SADID Evaluation Datasets for Low-Resource Spoken Language Machine Translation of Arabic Dialects
Wael Abid

Low-resource Machine Translation recently gained a lot of popularity, and for certain languages, it has made great strides. However, it is still difficult to track progress in other languages for which there is no publicly available evaluation data. In this paper, we introduce benchmark datasets for Arabic and its dialects. We describe our design process and motivations and analyze the datasets to understand their resulting properties. Numerous successful attempts use large monolingual corpora to augment low-resource pairs. We try to approach augmentation differently and investigate whether it is possible to improve MT models without any external sources of data. We accomplish this by bootstrapping existing parallel sentences and complement this with multilingual training to achieve strong baselines.

pdf bib
Understanding Translationese in Multi-view Embedding Spaces
Koel Dutta Chowdhury | Cristina España-Bonet | Josef van Genabith

Recent studies use a combination of lexical and syntactic features to show that footprints of the source language remain visible in translations, to the extent that it is possible to predict the original source language from the translation. In this paper, we focus on embedding-based semantic spaces, exploiting departures from isomorphism between spaces built from original target language and translations into this target language to predict relations between languages in an unsupervised way. We use different views of the data (words, parts of speech, semantic tags and synsets) to track translationese. Our analysis shows that (i) semantic distances between original target language and translations into this target language can be detected using the notion of isomorphism, (ii) language family ties with characteristics similar to linguistically motivated phylogenetic trees can be inferred from the distances and (iii) with delexicalised embeddings exhibiting source-language interference most significantly, other levels of abstraction display the same tendency, indicating the lexicalised results to be not just due to possible topic differences between original and translated texts. To the best of our knowledge, this is the first time departures from isomorphism between embedding spaces are used to track translationese.

pdf bib
Building The First English-Brazilian Portuguese Corpus for Automatic Post-Editing
Felipe Almeida Costa | Thiago Castro Ferreira | Adriana Pagano | Wagner Meira

This paper introduces the first corpus for Automatic Post-Editing of English and a low-resource language, Brazilian Portuguese. The source English texts were extracted from the WebNLG corpus and automatically translated into Portuguese using a state-of-the-art industrial neural machine translator. Post-edits were then obtained in an experiment with native speakers of Brazilian Portuguese. To assess the quality of the corpus, we performed error analysis and computed complexity indicators measuring how difficult the APE task would be. We report preliminary results of Phrase-Based and Neural Machine Translation Models on this new corpus. Data and code are publicly available in our repository.

pdf bib
Analysing cross-lingual transfer in lemmatisation for Indian languages
Kumar Saurav | Kumar Saunack | Pushpak Bhattacharyya

Lemmatization aims to reduce the sparse data problem by relating the inflected forms of a word to its dictionary form. However, most of the prior work on this topic has focused on high resource languages. In this paper, we evaluate cross-lingual approaches for low resource languages, especially in the context of morphologically rich Indian languages. We test our model on six languages from two different families and develop linguistic insights into each model’s performance.

pdf bib
Neural Automated Essay Scoring Incorporating Handcrafted Features
Masaki Uto | Yikuan Xie | Maomi Ueno

Automated essay scoring (AES) is the task of automatically assigning scores to essays as an alternative to grading by human raters. Conventional AES typically relies on handcrafted features, whereas recent studies have proposed AES models based on deep neural networks (DNNs) to obviate the need for feature engineering. Furthermore, hybrid methods that integrate handcrafted features in a DNN-AES model have been recently developed and have achieved state-of-the-art accuracy. One of the most popular hybrid methods is formulated as a DNN-AES model with an additional recurrent neural network (RNN) that processes a sequence of handcrafted sentence-level features. However, this method has the following problems: 1) It cannot incorporate effective essay-level features developed in previous AES research. 2) It greatly increases the numbers of model parameters and tuning parameters, increasing the difficulty of model training. 3) It has an additional RNN to process sentence-level features, making extension to various DNN-AES models complex. To resolve these problems, we propose a new hybrid method that integrates handcrafted essay-level features into a DNN-AES model. Specifically, our method concatenates handcrafted essay-level features to a distributed essay representation vector, which is obtained from an intermediate layer of a DNN-AES model. Our method is a simple DNN-AES extension, but significantly improves scoring accuracy.

pdf bib
A Straightforward Approach to Narratologically Grounded Character Identification
Labiba Jahan | Rahul Mittal | W. Victor Yarlott | Mark Finlayson

One of the most fundamental elements of narrative is character: if we are to understand a narrative, we must be able to identify the characters of that narrative. Therefore, character identification is a critical task in narrative natural language understanding. Most prior work has lacked a narratologically grounded definition of character, instead relying on simplified or implicit definitions that do not capture essential distinctions between characters and other referents in narratives. In prior work we proposed a preliminary definition of character that was based in clear narratological principles: a character is an animate entity that is important to the plot. Here we flesh out this concept, demonstrate that it can be reliably annotated (0.78 Cohen’s κ), and provide annotations of 170 narrative texts, drawn from 3 different corpora, containing 1,347 character co-reference chains and 21,999 non-character chains that include 3,937 animate chains. Furthermore, we have shown that a supervised classifier using a simple set of easily computable features can effectively identify these characters (overall F1 of 0.90). A detailed error analysis shows that character identification is first and foremost affected by co-reference quality, and further, that the shorter a chain is the harder it is to effectively identify as a character. We release our code and data for the benefit of other researchers.
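For readers unfamiliar with the agreement statistic cited above, the following is a minimal sketch of Cohen's kappa for two annotators. The label names and the four toy annotations are hypothetical and are not drawn from the released corpus.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labeled independently at their own rates.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy annotations of four co-reference chains.
annotator_1 = ["character", "character", "other", "other"]
annotator_2 = ["character", "other", "other", "other"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(kappa)  # observed 0.75, expected 0.5, so kappa is 0.5
```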

pdf bib
Fine-grained Information Status Classification Using Discourse Context-Aware BERT
Yufang Hou

Previous work on bridging anaphora recognition (Hou et al., 2013) casts the problem as a subtask of learning fine-grained information status (IS). However, these systems heavily depend on many hand-crafted linguistic features. In this paper, we propose a simple discourse context-aware BERT model for fine-grained IS classification. On the ISNotes corpus (Markert et al., 2012), our model achieves new state-of-the-art performances on fine-grained IS classification, obtaining a 4.8 absolute overall accuracy improvement compared to Hou et al. More importantly, we also show an improvement of 10.5 F1 points for bridging anaphora recognition without using any complex hand-crafted semantic features designed for capturing the bridging phenomenon. We further analyze the trained model and find that the most attended signals for each IS category correspond well to linguistic notions of information status.

pdf bib
Text Classification by Contrastive Learning and Cross-lingual Data Augmentation for Alzheimer’s Disease Detection
Zhiqiang Guo | Zhaoci Liu | Zhenhua Ling | Shijin Wang | Lingjing Jin | Yunxia Li

Data scarcity is always a constraint on analyzing speech transcriptions for automatic Alzheimer’s disease (AD) detection, especially when the subjects are non-English speakers. To deal with this issue, this paper first proposes a contrastive learning method to obtain effective representations for text classification based on monolingual embeddings of BERT. Furthermore, a cross-lingual data augmentation method is designed by building autoencoders to learn the text representations shared by both languages. Experiments on a Mandarin AD corpus show that the contrastive learning method can achieve better detection accuracy than conventional CNN-based and BERT-based methods. Our cross-lingual data augmentation method also outperforms other compared methods when using another English AD corpus for augmentation. Finally, a best detection accuracy of 81.6% is obtained by our proposed methods on the Mandarin AD corpus.

pdf bib
Hierarchical Text Segmentation for Medieval Manuscripts
Amir Hazem | Beatrice Daille | Dominique Stutzmann | Christopher Kermorvant | Louis Chevalier

In this paper, we address the segmentation of books of hours, Latin devotional manuscripts of the late Middle Ages, that exhibit challenging issues: a complex hierarchical entangled structure, variable content, noisy transcriptions with no sentence markers, and strong correlations between sections for which topical information is no longer sufficient to draw segmentation boundaries. We show that the main state-of-the-art segmentation methods are either inefficient or inapplicable for books of hours and propose a bottom-up greedy approach that considerably enhances the segmentation results. We stress the importance of such hierarchical segmentation of books of hours for historians to explore their overarching differences underlying conception about Church.

pdf bib
Are We Ready for this Disaster? Towards Location Mention Recognition from Crisis Tweets
Reem Suwaileh | Muhammad Imran | Tamer Elsayed | Hassan Sajjad

The widespread usage of Twitter during emergencies has provided a new opportunity and timely resource to crisis responders for various disaster management tasks. Geolocation information of pertinent tweets is crucial for gaining situational awareness and delivering aid. However, the majority of tweets do not come with geoinformation. In this work, we focus on the task of location mention recognition from crisis-related tweets. Specifically, we investigate the influence of different types of labeled training data on the performance of a BERT-based classification model. We explore several training settings such as combining in- and out-domain data from news articles and general-purpose and crisis-related tweets. Furthermore, we investigate the effect of geospatial proximity while training on near or far-away events from the target event. Using five different datasets, our extensive experiments provide answers to several critical research questions that are useful for the research community to foster research in this important direction. For example, results show that, for training a location mention recognition model, Twitter-based data is preferred over general-purpose data; and crisis-related data is preferred over general-purpose Twitter data. Furthermore, training on data from disaster events geographically near the target event boosts performance compared to training on distant events.

pdf bib
Regularized Attentive Capsule Network for Overlapped Relation Extraction
Tianyi Liu | Xiangyu Lin | Weijia Jia | Mingliang Zhou | Wei Zhao

Distantly supervised relation extraction has been widely applied in knowledge base construction because it requires less human effort. However, the automatically established training datasets in distant supervision contain low-quality instances with noisy words and overlapped relations, introducing great challenges to the accurate extraction of relations. To address this problem, we propose a novel Regularized Attentive Capsule Network (RA-CapNet) to better identify highly overlapped relations in each informal sentence. To discover multiple relation features in an instance, we embed multi-head attention into the capsule network as the low-level capsules, where the subtraction of two entities acts as a new form of relation query to select salient features regardless of their positions. To further discriminate overlapped relation features, we devise disagreement regularization to explicitly encourage the diversity among both multiple attention heads and low-level capsules. Extensive experiments conducted on widely used datasets show that our model achieves significant improvements in relation extraction.

pdf bib
Graph Convolution over Multiple Dependency Sub-graphs for Relation Extraction
Angrosh Mandya | Danushka Bollegala | Frans Coenen

We propose a contextualised graph convolution network over multiple dependency-based sub-graphs for relation extraction. A novel method to construct multiple sub-graphs using words in the shortest dependency path and words linked to entities in the dependency parse is proposed. A graph convolution operation is performed over the resulting multiple sub-graphs to obtain more informative features useful for relation extraction. Our experimental results show that the proposed method achieves superior performance over the existing GCN-based models, achieving state-of-the-art performance on a cross-sentence n-ary relation extraction dataset and the SemEval 2010 Task 8 sentence-level relation extraction dataset. Our model also achieves a comparable performance to the SoTA on the TACRED dataset.

pdf bib
NYTWIT: A Dataset of Novel Words in the New York Times
Yuval Pinter | Cassandra L. Jacobs | Max Bittker

We present the New York Times Word Innovation Types dataset, or NYTWIT, a collection of over 2,500 novel English words published in the New York Times between November 2017 and March 2019, manually annotated for their class of novelty (such as lexical derivation, dialectal variation, blending, or compounding). We present baseline results for both uncontextual and contextual prediction of novelty class, showing that there is room for improvement even for state-of-the-art NLP systems. We hope this resource will prove useful for linguists and NLP practitioners by providing a real-world environment of novel word appearance.

pdf bib
XED: A Multilingual Dataset for Sentiment Analysis and Emotion Detection
Emily Öhman | Marc Pàmies | Kaisla Kajava | Jörg Tiedemann

We introduce XED, a multilingual fine-grained emotion dataset. The dataset consists of human-annotated Finnish (25k) and English sentences (30k), as well as projected annotations for 30 additional languages, providing new resources for many low-resource languages. We use Plutchik’s core emotions to annotate the dataset with the addition of neutral to create a multilabel multiclass dataset. The dataset is carefully evaluated using language-specific BERT models and SVMs to show that XED performs on par with other similar datasets and is therefore a useful tool for sentiment analysis and emotion detection.

pdf bib
Human or Neural Translation?
Shivendra Bhardwaj | David Alfonso Hermelo | Phillippe Langlais | Gabriel Bernier-Colborne | Cyril Goutte | Michel Simard

Deep neural models tremendously improved machine translation. In this context, we investigate whether distinguishing machine from human translations is still feasible. We trained and applied 18 classifiers under two settings: a monolingual task, in which the classifier only looks at the translation; and a bilingual task, in which the source text is also taken into consideration. We report on extensive experiments involving 4 neural MT systems (Google Translate, DeepL, as well as two systems we trained) and varying the domain of texts. We show that the bilingual task is the easiest one and that transfer-based deep-learning classifiers perform best, with mean accuracies around 85% in-domain and 75% out-of-domain.

pdf bib
Constructing A Multi-hop QA Dataset for Comprehensive Evaluation of Reasoning Steps
Xanh Ho | Anh-Khoa Duong Nguyen | Saku Sugawara | Akiko Aizawa

A multi-hop question answering (QA) dataset aims to test reasoning and inference skills by requiring a model to read multiple paragraphs to answer a given question. However, current datasets do not provide a complete explanation for the reasoning process from the question to the answer. Further, previous studies revealed that many examples in existing multi-hop datasets do not require multi-hop reasoning to answer a question. In this study, we present a new multi-hop QA dataset, called 2WikiMultiHopQA, which uses structured and unstructured data. In our dataset, we introduce the evidence information containing a reasoning path for multi-hop questions. The evidence information has two benefits: (i) providing a comprehensive explanation for predictions and (ii) evaluating the reasoning skills of a model. We carefully design a pipeline and a set of templates when generating a question-answer pair that guarantees the multi-hop steps and the quality of the questions. We also exploit the structured format in Wikidata and use logical rules to create questions that are natural but still require multi-hop reasoning. Through experiments, we demonstrate that our dataset is challenging for multi-hop models and it ensures that multi-hop reasoning is required.

pdf bib
Exploring the Language of Data
Gábor Bella | Linda Gremes | Fausto Giunchiglia

We set out to uncover the unique grammatical properties of an important yet so far under-researched type of natural language text: that of short labels typically found within structured datasets. We show that such labels obey a specific type of abbreviated grammar that we call the Language of Data, with properties significantly different from the kinds of text typically addressed in computational linguistics and NLP, such as ‘standard’ written language or social media messages. We analyse orthography, parts of speech, and syntax over a large, bilingual, hand-annotated corpus of data labels collected from a variety of domains. We perform experiments on tokenisation, part-of-speech tagging, and named entity recognition over real-world structured data, demonstrating that models adapted to the Language of Data outperform those trained on standard text. These observations point in a new direction to be explored as future research, in order to develop new NLP tools and models dedicated to the Language of Data.

pdf bib
Creation of Corpus and analysis in Code-Mixed Kannada-English Twitter data for Emotion Prediction
Abhinav Reddy Appidi | Vamshi Krishna Srirangam | Darsi Suhas | Manish Shrivastava

Emotion prediction is a critical task in the field of Natural Language Processing (NLP). A significant amount of work has been done on emotion prediction for resource-rich languages. There has also been work on code-mixed social media corpora, but not on emotion prediction for Kannada-English code-mixed Twitter data. In this paper, we analyze the problem of emotion prediction on a corpus of code-mixed Kannada-English tweets, each annotated with its respective ‘Emotion’. We experimented with machine learning prediction models using features like Character N-Grams, Word N-Grams, Repetitive characters, and others with SVM and LSTM on our corpus, which resulted in accuracies of 30% and 32% respectively.

pdf bib
Fair Evaluation in Concept Normalization: a Large-scale Comparative Analysis for BERT-based Models
Elena Tutubalina | Artur Kadurin | Zulfat Miftahutdinov

Linking of biomedical entity mentions to various terminologies of chemicals, diseases, genes, adverse drug reactions is a challenging task, often requiring non-syntactic interpretation. A large number of biomedical corpora and state-of-the-art models have been introduced in the past five years. However, there are no general guidelines regarding the evaluation of models on these corpora in single- and cross-terminology settings. In this work, we perform a comparative evaluation of various benchmarks and study the efficiency of state-of-the-art neural architectures based on Bidirectional Encoder Representations from Transformers (BERT) for linking of three entity types across three domains: research abstracts, drug labels, and user-generated texts on drug therapy in English. We have made the source code and results available at

pdf bib
Multilingual Neural RST Discourse Parsing
Zhengyuan Liu | Ke Shi | Nancy Chen

Text discourse parsing plays an important role in understanding information flow and argumentative structure in natural language. Previous research under the Rhetorical Structure Theory (RST) has mostly focused on inducing and evaluating models from the English treebank. However, the parsing tasks for other languages such as German, Dutch, and Portuguese are still challenging due to the shortage of annotated data. In this work, we investigate two approaches to establish a neural, cross-lingual discourse parser via: (1) utilizing multilingual vector representations; and (2) adopting segment-level translation of the source content. Experiment results show that both methods are effective even with limited training data, and achieve state-of-the-art performance on cross-lingual, document-level discourse parsing on all sub-tasks.

pdf bib
Tree Representations in Transition System for RST Parsing
Jinfen Li | Lu Xiao

The transition-based systems in past studies propose a series of actions to build a right-heavy binarized tree for RST parsing. However, the nodes of the binary-nuclear relations (e.g., Contrast) have the same nuclear type as those of the multi-nuclear relations (e.g., Joint) in the binary tree structure. In addition, the reduce action only constructs binary trees instead of multi-branch trees, which is the original RST tree structure. In our paper, we design a new nuclear type for the multi-nuclear relations, and a new action to construct a multi-branch tree. We enrich the feature set by extracting additional refined dependency features of texts from the Bi-Affine model. We also compare the performance of two approaches for RST parsing in the transition-based system: a joint action of reduce-shift and nuclear type (i.e., Reduce-SN) vs a separate one that applies the Reduce action first and then assigns the nuclear type. We find that the newly devised nuclear type and action are more capable of capturing the multi-nuclear relation and the joint action is more suitable than the separate one. Our multi-branch tree structure obtains state-of-the-art performance for all the 18 coarse relations.

pdf bib
Resource Constrained Dialog Policy Learning Via Differentiable Inductive Logic Programming
Zhenpeng Zhou | Ahmad Beirami | Paul Crook | Pararth Shah | Rajen Subba | Alborz Geramifard

Motivated by the needs of resource constrained dialog policy learning, we introduce dialog policy via differentiable inductive logic (DILOG). We explore the tasks of one-shot learning and zero-shot domain transfer with DILOG on SimDial and MultiWoZ. Using a single representative dialog from the restaurant domain, we train DILOG on the SimDial dataset and obtain 99+% in-domain test accuracy. We also show that the trained DILOG zero-shot transfers to all other domains with 99+% accuracy, proving the suitability of DILOG to slot-filling dialogs. We further extend our study to the MultiWoZ dataset achieving 90+% inform and success metrics. We also observe that these metrics are not capturing some of the shortcomings of DILOG in terms of false positives, prompting us to measure an auxiliary Action F1 score. We show that DILOG is 100x more data efficient than state-of-the-art neural approaches on MultiWoZ while achieving similar performance metrics. We conclude with a discussion on the strengths and weaknesses of DILOG.

pdf bib
German’s Next Language Model
Branden Chan | Stefan Schweter | Timo Möller

In this work we present the experiments that led to the creation of our BERT and ELECTRA based German language models, GBERT and GELECTRA. By varying the input training data, model size, and the presence of Whole Word Masking (WWM) we were able to attain SoTA performance across a set of document classification and named entity recognition (NER) tasks for both models of base and large size. We adopt an evaluation driven approach in training these models and our results indicate that both adding more data and utilizing WWM improve model performance. By benchmarking against existing German models, we show that these models are the best German models to date. All trained models will be made publicly available to the research community.

pdf bib
Don’t Invite BERT to Drink a Bottle: Modeling the Interpretation of Metonymies Using BERT and Distributional Representations
Paolo Pedinotti | Alessandro Lenci

In this work, we carry out two experiments in order to assess the ability of BERT to capture the meaning shift associated with metonymic expressions. We test the model on a new dataset that is representative of the most common types of metonymy. We compare BERT with the Structured Distributional Model (SDM), a model for the representation of words in context which is based on the notion of Generalized Event Knowledge. The results reveal that, while BERT’s ability to deal with metonymy is quite limited, SDM is good at predicting the meaning of metonymic expressions, providing support for an account of metonymy based on event knowledge.

pdf bib
Interpretable Multi-headed Attention for Abstractive Summarization at Controllable Lengths
Ritesh Sarkhel | Moniba Keymanesh | Arnab Nandi | Srinivasan Parthasarathy

Abstractive summarization at controllable lengths is a challenging task in natural language processing. It is even more challenging for domains where limited training data is available or scenarios in which the length of the summary is not known beforehand. At the same time, when it comes to trusting machine-generated summaries, explaining how a summary was constructed in human-understandable terms may be critical. We propose Multi-level Summarizer (MLS), a supervised method to construct abstractive summaries of a text document at controllable lengths. The key enabler of our method is an interpretable multi-headed attention mechanism that computes attention distribution over an input document using an array of timestep independent semantic kernels. Each kernel optimizes a human-interpretable syntactic or semantic property. Exhaustive experiments on two low-resource datasets in English show that MLS outperforms strong baselines by up to 14.70% in the METEOR score. Human evaluation of the summaries also suggests that they capture the key concepts of the document at various length-budgets.

pdf bib
CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters
Hicham El Boukkouri | Olivier Ferret | Thomas Lavergne | Hiroshi Noji | Pierre Zweigenbaum | Jun’ichi Tsujii

Due to the compelling improvements brought by BERT, many recent representation models adopted the Transformer architecture as their main building block, consequently inheriting the wordpiece tokenization system despite it not being intrinsically linked to the notion of Transformers. While this system is thought to achieve a good balance between the flexibility of characters and the efficiency of full words, using predefined wordpiece vocabularies from the general domain is not always suitable, especially when building models for specialized domains (e.g., the medical domain). Moreover, adopting a wordpiece tokenization shifts the focus from the word level to the subword level, making the models conceptually more complex and arguably less convenient in practice. For these reasons, we propose CharacterBERT, a new variant of BERT that drops the wordpiece system altogether and uses a Character-CNN module instead to represent entire words by consulting their characters. We show that this new model improves the performance of BERT on a variety of medical domain tasks while at the same time producing robust, word-level, and open-vocabulary representations.

pdf bib
Autoregressive Reasoning over Chains of Facts with Transformers
Ruben Cartuyvels | Graham Spinks | Marie-Francine Moens

This paper proposes an iterative inference algorithm for multi-hop explanation regeneration that retrieves relevant factual evidence in the form of text snippets, given a natural language question and its answer. Combining multiple sources of evidence or facts for multi-hop reasoning becomes increasingly hard when the number of sources needed to make an inference grows. Our algorithm copes with this by decomposing the selection of facts from a corpus autoregressively, conditioning the next iteration on previously selected facts. This allows us to use a pairwise learning-to-rank loss. We validate our method on datasets of the TextGraphs 2019 and 2020 Shared Tasks for explanation regeneration. Existing work on this task either evaluates facts in isolation or artificially limits the possible chains of facts, thus limiting multi-hop inference. We demonstrate that our algorithm, when used with a pre-trained transformer model, outperforms the previous state-of-the-art in terms of precision, training time and inference efficiency.
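
The pairwise learning-to-rank loss mentioned in the abstract above can be illustrated with a minimal sketch. This is a generic hinge-style pairwise loss, not the authors' implementation; the function names, the margin value, and the plain-float scores are assumptions made for illustration only.

```python
from typing import Sequence


def pairwise_hinge_loss(pos_score: float, neg_score: float, margin: float = 1.0) -> float:
    """Penalize a (relevant, irrelevant) pair unless the relevant fact
    outscores the irrelevant one by at least `margin` (hypothetical default)."""
    return max(0.0, margin - (pos_score - neg_score))


def ranking_loss(scores: Sequence[float], relevant: Sequence[bool], margin: float = 1.0) -> float:
    """Average the pairwise hinge loss over all (relevant, irrelevant) pairs."""
    pos = [s for s, r in zip(scores, relevant) if r]
    neg = [s for s, r in zip(scores, relevant) if not r]
    pairs = [(p, n) for p in pos for n in neg]
    if not pairs:
        # No comparison is possible when one of the two groups is empty.
        return 0.0
    return sum(pairwise_hinge_loss(p, n, margin) for p, n in pairs) / len(pairs)
```

In an autoregressive setting such as the one described, the scores would presumably be recomputed at each iteration, conditioned on the facts already selected; that conditioning is outside the scope of this sketch.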

pdf bib
Augmenting NLP models using Latent Feature Interpolations
Amit Jindal | Arijit Ghosh Chowdhury | Aniket Didolkar | Di Jin | Ramit Sawhney | Rajiv Ratn Shah

Models with a large number of parameters are prone to over-fitting and often fail to capture the underlying input distribution. We introduce Emix, a data augmentation method that uses interpolations of word embeddings and hidden layer representations to construct virtual examples. We show that Emix yields significant improvements over previously used interpolation-based regularizers and data augmentation techniques. We also demonstrate that our proposed method is more robust to sparsification. We highlight the merits of our proposed methodology by performing thorough quantitative and qualitative assessments.
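
As a rough illustration of how interpolation can produce virtual examples, the sketch below linearly mixes two embedding vectors and their one-hot labels. It shows generic mixup-style interpolation only and should not be read as the exact Emix formulation; the function names and the mixing coefficient `lam` are assumptions.

```python
from typing import List, Sequence, Tuple


def interpolate(a: Sequence[float], b: Sequence[float], lam: float) -> List[float]:
    """Return lam * a + (1 - lam) * b, element by element."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def virtual_example(
    emb_a: Sequence[float], label_a: Sequence[float],
    emb_b: Sequence[float], label_b: Sequence[float],
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Mix two (embedding, one-hot label) pairs with the same coefficient."""
    return interpolate(emb_a, emb_b, lam), interpolate(label_a, label_b, lam)
```

The same interpolation could in principle be applied to hidden layer representations instead of input embeddings, as the abstract suggests.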


bib (full) Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations

pdf bib
Proceedings of the 28th International Conference on Computational Linguistics: System Demonstrations
Michal Ptaszynski | Bartosz Ziolko

pdf bib
Fast Word Predictor for On-Device Application
Huy Tien Nguyen | Khoi Tuan Nguyen | Anh Tuan Nguyen | Thanh Lac Thi Tran

Learning on large text corpora, deep neural networks achieve promising results in the next word prediction task. However, deploying these huge models on devices has to deal with constraints of low latency and a small binary size. To address these challenges, we propose a fast word predictor performing efficiently on mobile devices. Compared with a standard neural network which has a similar word prediction rate, the proposed model obtains a 60% reduction in memory size and 100x faster inference time on a middle-end mobile device. The method is developed as a feature for a chat application which serves more than 100 million users.

pdf bib
Discussion Tracker: Supporting Teacher Learning about Students’ Collaborative Argumentation in High School Classrooms
Luca Lugini | Christopher Olshefski | Ravneet Singh | Diane Litman | Amanda Godley

Teaching collaborative argumentation is an advanced skill that many K-12 teachers struggle to develop. To address this, we have developed Discussion Tracker, a classroom discussion analytics system based on novel algorithms for classifying argument moves, specificity, and collaboration. Results from a classroom deployment indicate that teachers found the analytics useful, and that the underlying classifiers perform with moderate to substantial agreement with humans.

pdf bib
An Online Readability Leveled Arabic Thesaurus
Zhengyang Jiang | Nizar Habash | Muhamed Al Khalil

This demo paper introduces the online Readability Leveled Arabic Thesaurus interface. For a given user input word, this interface provides the word’s possible lemmas, roots, English glosses, related Arabic words and phrases, and readability on a five-level readability scale. This interface builds on and connects multiple existing Arabic resources and processing tools. This one-of-a-kind system enables Arabic speakers and learners to benefit from advances in Arabic computational linguistics technologies. Feedback from users of the system will help the developers to identify lexical coverage gaps and errors. A live link to the demo is available at :

pdf bib
TrainX – Named Entity Linking with Active Sampling and Bi-Encoders
Tom Oberhauser | Tim Bischoff | Karl Brendel | Maluna Menke | Tobias Klatt | Amy Siu | Felix Alexander Gers | Alexander Löser

We demonstrate TrainX, a system for Named Entity Linking for medical experts. It combines state-of-the-art entity recognition and linking architectures, such as Flair and fine-tuned Bi-Encoders based on BERT, with an easy-to-use interface for healthcare professionals. We support medical experts in annotating training data by using active sampling strategies to forward informative samples to the annotator. We demonstrate that our model is capable of linking against large knowledge bases, such as UMLS (3.6 million entities), and supporting zero-shot cases, where the linker has never seen the entity before. Those zero-shot capabilities help to mitigate the problem of rare and expensive training data that is a common issue in the medical domain.

pdf bib
Epistolary Education in 21st Century: A System to Support Composition of E-mails by Students to Superiors in Japanese
Kenji Ryu | Michal Ptaszynski

E-mail is a communication tool widely used by people of all ages on the Internet today, often in business and formal situations, especially in Japan. Moreover, Japanese E-mail communication has a set of specific rules taught using specialized guidebooks. E-mail literacy education for many Japanese students is typically provided in a traditional, yet inefficient lecture-based way. We propose a system to support Japanese students in writing E-mails to superiors (teachers, job hunting representatives, etc.). We first investigate the importance of formal E-mails in Japan, and what is needed to successfully write a formal E-mail. Next, we develop the system in accordance with those rules. Finally, we evaluate the system in two ways. The results, although obtained on a small number of samples, were generally positive, and clearly indicated additional ways to improve the system.


bib (full) Proceedings of the 28th International Conference on Computational Linguistics: Industry Track

pdf bib
Proceedings of the 28th International Conference on Computational Linguistics: Industry Track
Ann Clifton | Courtney Napoles

pdf bib
Query Distillation: BERT-based Distillation for Ensemble Ranking
Wangshu Zhang | Junhong Liu | Zujie Wen | Yafang Wang | Gerard de Melo

Recent years have witnessed substantial progress in the development of neural ranking networks, but also an increasingly heavy computational burden due to growing numbers of parameters and the adoption of model ensembles. Knowledge Distillation (KD) is a common solution to balance the effectiveness and efficiency. However, it is not straightforward to apply KD to ranking problems. Ranking Distillation (RD) has been proposed to address this issue, but only shows effectiveness on recommendation tasks. We present a novel two-stage distillation method for ranking problems that allows a smaller student model to be trained while benefitting from the better performance of the teacher model, providing better control of the inference latency and computational burden. We design a novel BERT-based ranking model structure for list-wise ranking to serve as our student model. All ranking candidates are fed to the BERT model simultaneously, such that the self-attention mechanism can enable joint inference to rank the document list. Our experiments confirm the advantages of our method, not just with regard to the inference latency but also in terms of higher-quality rankings compared to the original teacher model.

pdf bib
Interactive Question Clarification in Dialogue via Reinforcement Learning
Xiang Hu | Zujie Wen | Yafang Wang | Xiaolong Li | Gerard de Melo

Coping with ambiguous questions has been a perennial problem in real-world dialogue systems. Although clarification by asking questions is a common form of human interaction, it is hard to define appropriate questions to elicit more specific intents from a user. In this work, we propose a reinforcement model to clarify ambiguous questions by suggesting refinements of the original query. We first formulate a collection partitioning problem to select a set of labels enabling us to distinguish potential unambiguous intents. We list the chosen labels as intent phrases to the user for further confirmation. The selected label along with the original user query then serves as a refined query, for which a suitable response can more easily be identified. The model is trained using reinforcement learning with a deep policy network. We evaluate our model based on real-world user clicks and demonstrate significant improvements across several different experiments.

pdf bib
Towards building a Robust Industry-scale Question Answering System
Rishav Chakravarti | Anthony Ferritto | Bhavani Iyer | Lin Pan | Radu Florian | Salim Roukos | Avi Sil

Industry-scale NLP systems necessitate two features. 1. Robustness: zero-shot transfer learning (ZSTL) performance has to be commendable and 2. Efficiency: systems have to train efficiently and respond instantaneously. In this paper, we introduce the development of a production model called GAAMA (Go Ahead Ask Me Anything) which possesses the above two characteristics. For robustness, it trains on the recently introduced Natural Questions (NQ) dataset. NQ poses additional challenges over older datasets like SQuAD: (a) QA systems need to read and comprehend an entire Wikipedia article rather than a small passage, and (b) NQ does not suffer from observation bias during construction, resulting in less lexical overlap between the question and the article. GAAMA consists of Attention-over-Attention, diversity among attention heads, hierarchical transfer learning, and synthetic data augmentation while being computationally inexpensive. Building on top of the powerful BERTQA model, GAAMA provides a 2.0% absolute boost in F1 over the industry-scale state-of-the-art (SOTA) system on NQ. Further, we show that GAAMA transfers zero-shot to unseen real life and important domains as it yields respectable performance on two benchmarks: the BioASQ and the newly introduced CovidQA datasets.

pdf bib
Learning Domain Terms - Empirical Methods to Enhance Enterprise Text Analytics Performance
Gargi Roy | Lipika Dey | Mohammad Shakir | Tirthankar Dasgupta

Performance of standard text analytics algorithms is known to be substantially degraded on consumer generated data, which are often very noisy. These algorithms also do not work well on enterprise data, which has a very different nature from news repositories, storybooks or Wikipedia data. Text cleaning is a mandatory step which aims at noise removal and correction to improve performance. However, enterprise data needs special cleaning methods since it contains many domain terms which appear to be noise against a standard dictionary, but in reality are not. In this work we present a detailed analysis of the characteristics of enterprise data and suggest unsupervised methods for cleaning these repositories after domain terms have been automatically segregated from true noise terms. Noise terms are thereafter corrected in a contextual fashion. The effectiveness of the method is established through careful manual evaluation of error corrections over several standard data sets, including those available for hate speech detection, where there is deliberate distortion to avoid detection. We also share results to show enhancement in classification accuracy after noise correction.

pdf bib
ScopeIt: Scoping Task Relevant Sentences in Documents
Barun Patra | Vishwas Suryanarayanan | Chala Fufa | Pamela Bhattacharya | Charles Lee

A prominent problem faced by conversational agents working with large documents (e.g., email-based assistants) is the frequent presence of information in the document that is irrelevant to the assistant. This in turn makes it harder for the agent to accurately detect intents, extract entities relevant to those intents and perform the desired action. To address this issue we present a neural model for scoping relevant information for the agent from a large document. We show that when used as the first step in a popularly used email-based assistant for helping users schedule meetings, our proposed model helps improve the performance of the intent detection and entity extraction tasks required by the agent for correctly scheduling meetings: across a suite of 6 downstream tasks, by using our proposed method, we observe an average gain of 35% in precision without any drop in recall. Additionally, we demonstrate that the same approach can be used for component level analysis in large documents, such as signature block identification.


bib (full) Proceedings of the 7th Workshop on Argument Mining

pdf bib
Proceedings of the 7th Workshop on Argument Mining
Elena Cabrio | Serena Villata

pdf bib
DebateSum: A large-scale argument mining and summarization dataset
Allen Roush | Arvind Balaji

Prior work in Argument Mining frequently alludes to its potential applications in automatic debating systems. Despite this focus, almost no datasets or models exist which apply natural language processing techniques to problems found within competitive formal debate. To remedy this, we present the DebateSum dataset. DebateSum consists of 187,386 unique pieces of evidence with corresponding argument and extractive summaries. DebateSum was made using data compiled by competitors within the National Speech and Debate Association over a 7-year period. We train several transformer summarization models to benchmark summarization performance on DebateSum. We also introduce a set of fasttext word-vectors trained on DebateSum called debate2vec. Finally, we present a search engine for this dataset which is utilized extensively by members of the National Speech and Debate Association today. The DebateSum search engine is available to the public here:

pdf bib
Annotating Topics, Stance, Argumentativeness and Claims in Dutch Social Media Comments: A Pilot Study
Nina Bauwelinck | Els Lefever

One of the major challenges currently facing the field of argumentation mining is the lack of consensus on how to analyse argumentative user-generated texts such as online comments. The theoretical motivations underlying the annotation guidelines used to generate labelled corpora rarely include motivation for the use of a particular theoretical basis. This pilot study reports on the annotation of a corpus of 100 Dutch user comments made in response to politically-themed news articles on Facebook. The annotation covers topic and aspect labelling, stance labelling, argumentativeness detection and claim identification. Our IAA study reports substantial agreement scores for argumentativeness detection (0.76 Fleiss’ kappa) and moderate agreement for claim labelling (0.45 Fleiss’ kappa). We provide a clear justification of the theories and definitions underlying the design of our guidelines. Our analysis of the annotations signals the importance of adjusting our guidelines to include allowances for missing context information and defining the concept of argumentativeness in connection with stance. Our annotated corpus and associated guidelines are made publicly available.
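
For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of the standard Fleiss' kappa computation. The input layout (one row per item, one column per category, each cell holding the number of raters who chose that category) and the function name are assumptions for illustration; the toy data in the usage note is invented and unrelated to the corpus.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa for an items x categories matrix of rating counts.

    Every row must sum to the same number of raters n (n >= 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_items == 0 or n_raters < 2:
        raise ValueError("need at least one item and two raters")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    # Per-item observed agreement P_i, then its mean P_bar.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Chance agreement P_e from the overall category proportions.
    n_categories = len(counts[0])
    proportions = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in proportions)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings fall in one category")
    return (p_bar - p_e) / (1.0 - p_e)
```

For example, `fleiss_kappa([[3, 0], [0, 3]])` describes three raters who agree perfectly on two items that belong to different categories.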

pdf bib
Aspect-Based Argument Mining
Dietrich Trautmann

Computational Argumentation in general and Argument Mining in particular are important research fields. In previous works, many of the challenges to automatically extract and to some degree reason over natural language arguments were addressed. The tools to extract argument units are increasingly available and further open problems can be addressed. In this work, we are presenting the task of Aspect-Based Argument Mining (ABAM), with the essential subtasks of Aspect Term Extraction (ATE) and Nested Segmentation (NS). At the first instance, we create and release an annotated corpus with aspect information on the token-level. We consider aspects as the main point(s) argument units are addressing. This information is important for further downstream tasks such as argument ranking, argument summarization and generation, as well as the search for counter-arguments on the aspect-level. We present several experiments using state-of-the-art supervised architectures and demonstrate their performance for both of the subtasks. The annotated benchmark is available at

pdf bib
Annotation and Detection of Arguments in Tweets
Robin Schaefer | Manfred Stede

Notwithstanding the increasing role Twitter plays in modern political and social discourse, resources built for conducting argument mining on tweets remain limited. In this paper, we present a new corpus of German tweets annotated for argument components. To the best of our knowledge, this is the first corpus containing not only annotated full tweets but also argumentative spans within tweets. We further report first promising results using supervised classification (F1: 0.82) and sequence labeling (F1: 0.72) approaches.

pdf bib
ECHR: Legal Corpus for Argument Mining
Prakash Poudyal | Jaromir Savelka | Aagje Ieven | Marie Francine Moens | Teresa Goncalves | Paulo Quaresma

In this paper, we publicly release an annotated corpus of 42 decisions of the European Court of Human Rights (ECHR). The corpus is annotated in terms of three types of clauses useful in argument mining: premise, conclusion, and non-argument parts of the text. Furthermore, relationships among the premises and conclusions are mapped. We present baselines for three tasks that lead from unstructured texts to structured arguments. The tasks are argument clause recognition, clause relation prediction, and premise/conclusion recognition. Despite a straightforward application of the bidirectional encoders from Transformers (BERT), we obtained very promising results (F1 of 0.765 on argument recognition, 0.511 on relation prediction, and 0.859/0.628 on premise/conclusion recognition). The results suggest the usefulness of pre-trained language models based on deep neural network architectures in argument mining. Because of the simplicity of the baselines, there is ample space for improvement in future work based on the released corpus.

pdf bib
Annotating argumentation in Swedish social media
Anna Lindahl

This paper presents a small study of annotating argumentation in Swedish social media. Annotators were asked to annotate spans of argumentation in 9 threads from two discussion forums. At the post level, Cohen’s κ and Krippendorff’s alpha of 0.48 were achieved. When manually inspecting the annotations, the annotators seemed to agree when conditions in the guidelines were explicitly met, but implicit argumentation and opinions, which required annotators to interpret what is missing in the text, caused disagreements.


bib (full) Proceedings of the Workshop on the Cognitive Aspects of the Lexicon

pdf bib
Proceedings of the Workshop on the Cognitive Aspects of the Lexicon
Michael Zock | Emmanuele Chersoni | Alessandro Lenci | Enrico Santus

pdf bib
Individual corpora predict fast memory retrieval during reading
Markus J. Hofmann | Lara Müller | Andre Rölke | Ralph Radach | Chris Biemann

The corpus from which a predictive language model is trained can be considered the experience of a semantic system. We recorded everyday reading of two participants for two months on a tablet, generating individual corpus samples of 300K/500K tokens. Then we trained word2vec models from individual corpora and a 70 million-sentence newspaper corpus to obtain individual and norm-based long-term memory structure. To test whether individual corpora can make better predictions for a cognitive task of long-term memory retrieval, we generated stimulus materials consisting of 134 sentences with uncorrelated individual and norm-based word probabilities. For the subsequent eye tracking study 1-2 months later, our regression analyses revealed that individual, but not norm-corpus-based word probabilities can account for first-fixation duration and first-pass gaze duration. Word length additionally affected gaze duration and total viewing duration. The results suggest that corpora representative of an individual’s long-term memory structure can better explain reading performance than a norm corpus, and that recently acquired information is lexically accessed rapidly.

pdf bib
Less is Better: A cognitively inspired unsupervised model for language segmentation
Jinbiao Yang | Stefan L. Frank | Antal van den Bosch

Language users process utterances by segmenting them into many cognitive units, which vary in their sizes and linguistic levels. Although we can do such unitization/segmentation easily, its cognitive mechanism is still not clear. This paper proposes an unsupervised model, Less-is-Better (LiB), to simulate the human cognitive process with respect to language unitization/segmentation. LiB follows the principle of least effort and aims to build a lexicon which minimizes the number of unit tokens (alleviating the effort of analysis) and number of unit types (alleviating the effort of storage) at the same time on any given corpus. LiB’s workflow is inspired by empirical cognitive phenomena. The design makes the mechanism of LiB cognitively plausible and the computational requirement light-weight. The lexicon generated by LiB performs the best among different types of lexicons (e.g. ground-truth words) both from an information-theoretical view and a cognitive view, which suggests that the LiB lexicon may be a plausible proxy of the mental lexicon.

pdf bib
The CogALex Shared Task on Monolingual and Multilingual Identification of Semantic Relations
Rong Xiang | Emmanuele Chersoni | Luca Iacoponi | Enrico Santus

The shared task of the CogALex-VI workshop focuses on the monolingual and multilingual identification of semantic relations. We provided training and validation data for the following languages: English, German and Chinese. Given a word pair, systems had to be trained to identify which relation holds between them, with possible choices being synonymy, antonymy, hypernymy and no relation at all. Two test sets were released for evaluating the participating systems: one containing pairs for each of the training languages (systems were evaluated in a monolingual fashion) and the other proposing a surprise language to test the crosslingual transfer capabilities of the systems. Among the submitted systems, top performance was achieved by a transformer-based model in both the monolingual and the multilingual setting, for all the tested languages, demonstrating the potential of this recently introduced neural architecture. The shared task description and the results are available at

pdf bib
CogALex-VI Shared Task: Transrelation - A Robust Multilingual Language Model for Multilingual Relation Identification
Lennart Wachowiak | Christian Lang | Barbara Heinisch | Dagmar Gromann

We describe our submission to the CogALex-VI shared task on the identification of multilingual paradigmatic relations building on XLM-RoBERTa (XLM-R), a robustly optimized and multilingual BERT model. In spite of several experiments with data augmentation, data addition and ensemble methods with a Siamese Triple Net, Transrelation, the XLM-R model with a linear classifier adapted to this specific task, performed best in testing and achieved the best results in the final evaluation of the shared task, even for a previously unseen language.

pdf bib
Translating Collocations: The Need for Task-driven Word Associations
Oi Yee Kwong

Existing dictionaries may help collocation translation by suggesting associated words in the form of collocations, thesaurus, and example sentences. We propose to enhance them with task-driven word associations, illustrating the need by a few scenarios and outlining a possible approach based on word embedding. An example is given, using pre-trained word embedding, while more extensive investigation with more refined methods and resources is underway.

pdf bib
Characterizing Dynamic Word Meaning Representations in the Brain
Nora Aguirre-Celis | Risto Miikkulainen

During sentence comprehension, humans adjust word meanings according to the combination of the concepts that occur in the sentence. This paper presents a neural network model called CEREBRA (Context-dEpendent meaning REpresentation in the BRAin) that demonstrates this process based on fMRI sentence patterns and the Concept Attribute Representation (CAR) theory. In several experiments, CEREBRA is used to quantify the conceptual combination effect and demonstrate that it matters to humans. Such context-based representations could be used in future natural language processing systems, allowing them to mirror human performance more accurately.

pdf bib
Contextualized Word Embeddings Encode Aspects of Human-Like Word Sense Knowledge
Sathvik Nair | Mahesh Srinivasan | Stephan Meylan

Understanding context-dependent variation in word meanings is a key aspect of human language comprehension supported by the lexicon. Lexicographic resources (e.g., WordNet) capture only some of this context-dependent variation; for example, they often do not encode how closely senses, or discretized word meanings, are related to one another. Our work investigates whether recent advances in NLP, specifically contextualized word embeddings, capture human-like distinctions between English word senses, such as polysemy and homonymy. We collect data from a behavioral, web-based experiment, in which participants provide judgments of the relatedness of multiple WordNet senses of a word in a two-dimensional spatial arrangement task. We find that participants’ judgments of the relatedness between senses are correlated with distances between senses in the BERT embedding space. Specifically, homonymous senses (e.g., bat as mammal vs. bat as sports equipment) are reliably more distant from one another in the embedding space than polysemous ones (e.g., chicken as animal vs. chicken as meat). Our findings point towards the potential utility of continuous-space representations of sense meanings.


bib (full) Proceedings of the Third Workshop on Computational Models of Reference, Anaphora and Coreference

pdf bib
Proceedings of the Third Workshop on Computational Models of Reference, Anaphora and Coreference
Maciej Ogrodniczuk | Vincent Ng | Yulia Grishina | Sameer Pradhan

pdf bib
It’s absolutely divine! Can fine-grained sentiment analysis benefit from coreference resolution?
Orphee De Clercq | Veronique Hoste

While it has been claimed that anaphora or coreference resolution plays an important role in opinion mining, it is not clear to what extent coreference resolution actually boosts performance, if at all. In this paper, we investigate the potential added value of coreference resolution for the aspect-based sentiment analysis of restaurant reviews in two languages, English and Dutch. We focus on the task of aspect category classification and investigate whether including coreference information prior to classification to resolve implicit aspect mentions is beneficial. Because coreference resolution is not a solved task in NLP, we rely on both automatically-derived and gold-standard coreference relations, allowing us to investigate the true upper bound. By training a classifier on a combination of lexical and semantic features, we show that resolving the coreferential relations prior to classification is beneficial in a joint optimization setup. However, this is only the case when relying on gold-standard relations and the result is more pronounced for English than for Dutch. When validating the optimal models, however, we found that only the Dutch pipeline is able to achieve a satisfying performance on a held-out test set and does so regardless of whether coreference information was included.

pdf bib
Anaphoric Zero Pronoun Identification: A Multilingual Approach
Abdulrahman Aloraini | Massimo Poesio

Pro-drop languages such as Arabic, Chinese, Italian or Japanese allow morphologically null but referential arguments in certain syntactic positions, called anaphoric zero-pronouns. Much NLP work on anaphoric zero-pronouns (AZP) is based on gold mentions, but models for their identification are a fundamental prerequisite for their resolution in real-life applications. Such identification requires complex language understanding and knowledge of real-world entities. Transfer learning models, such as BERT, have recently been shown to learn surface, syntactic, and semantic information, which can be very useful in recognizing AZPs. We propose a BERT-based multilingual model for AZP identification from predicted zero pronoun positions, and evaluate it on the Arabic and Chinese portions of OntoNotes 5.0. As far as we know, this is the first neural network model of AZP identification for Arabic, and our approach outperforms the state-of-the-art for Chinese. Experiment results suggest that BERT implicitly encodes information about AZPs through their surrounding context.

pdf bib
Predicting Coreference in Abstract Meaning Representations
Tatiana Anikina | Alexander Koller | Michael Roth

This work addresses coreference resolution in Abstract Meaning Representation (AMR) graphs, a popular formalism for semantic parsing. We evaluate several current coreference resolution techniques on a recently published AMR coreference corpus, establishing baselines for future work. We also demonstrate that coreference resolution can improve the accuracy of a state-of-the-art semantic parser on this corpus.

pdf bib
TwiConv: A Coreference-annotated Corpus of Twitter Conversations
Berfin Aktaş | Annalena Kohnert

This article introduces TwiConv, an English coreference-annotated corpus of microblog conversations from Twitter. We describe the corpus compilation process and the annotation scheme, and release the corpus publicly, along with this paper. We manually annotated nominal coreference in 1756 tweets arranged in 185 conversation threads. The annotation achieves satisfactory annotation agreement results. We also present a new method for mapping the tweet contents with distributed stand-off annotations, which can easily be adapted to different annotation tasks.

pdf bib
Reference to Discourse Topics: Introducing Global Shell Nouns
Fabian Simonjetz

Shell nouns (SNs) are abstract nouns like fact, issue, and decision, which are capable of referring to non-nominal antecedents, much like anaphoric pronouns. As an extension of classical anaphora resolution, the automatic detection of SNs alongside their respective antecedents has received a growing research interest in recent years but proved to be a challenging task. This paper critically examines the assumption prevalent in previous research that SNs are typically accompanied by a specific antecedent, arguing that SNs like issue and decision are frequently used to refer, not to specific antecedents, but to global discourse topics, in which case they are out of reach of previously proposed resolution strategies that are tailored to SNs with explicit antecedents. The contribution of this work is three-fold. First, the notion of global SNs is defined; second, their qualitative and quantitative impact on previous SN research is investigated; and third, implications for previous and future approaches to SN resolution are discussed.

pdf bib
Partially-supervised Mention Detection
Lesly Miculicich | James Henderson

Learning to detect entity mentions without using syntactic information can be useful for integration and joint optimization with other tasks. However, it is common to have partially annotated data for this problem. Here, we investigate two approaches to deal with partial annotation of mentions: weighted loss and soft-target classification. We also propose two neural mention detection approaches: sequence tagging and exhaustive search. We evaluate our methods with coreference resolution as a downstream task, using multitask learning. The results show that the recall and F1 score improve for all methods.
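
One way to read the weighted-loss idea above: with partial annotation, a token without a mention label may still be a mention, so its negative label deserves less trust. The following is a minimal sketch under that assumption, not the authors' implementation; the function name `weighted_bce` and the default weight of 0.3 are hypothetical.

```python
import math

def weighted_bce(probs, labels, neg_weight=0.3):
    """Mean binary cross-entropy with down-weighted negative labels.

    probs      -- predicted probability that each token belongs to a mention
    labels     -- 1 for annotated mentions, 0 for tokens without annotation
    neg_weight -- hypothetical weight (< 1) for the less reliable negatives
    """
    eps = 1e-12  # guard against log(0)
    total = 0.0
    for p, y in zip(probs, labels):
        if y == 1:
            # Annotated mentions are trusted and contribute with full weight.
            total += -math.log(max(p, eps))
        else:
            # Missing annotation is weak evidence, so the penalty is scaled down.
            total += -neg_weight * math.log(max(1.0 - p, eps))
    return total / len(probs)
```

With `neg_weight=1.0` this reduces to ordinary binary cross-entropy, which makes the effect of the weighting easy to compare.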

pdf bib
Enhanced Labelling in Active Learning for Coreference Resolution
Vebjørn Espeland | Beatrice Alex | Benjamin Bach

In this paper we describe our attempt to increase the amount of information that can be retrieved through active learning sessions compared to previous approaches. We optimise the annotator’s labelling process using active learning in the context of coreference resolution. Using simulated active learning experiments, we suggest three adjustments to ensure the labelling time is spent as efficiently as possible. All three adjustments provide more information to the machine learner than the baseline, though a large impact on the F1 score over time is not observed. Compared to previous models, we report a marginal F1 improvement on the final coreference models trained using two out of the three approaches tested when applied to the English OntoNotes 2012 Coreference Resolution data. Our best-performing model achieves 58.01 F1, an increase of 0.93 F1 over the baseline model.

pdf bib
Reference in Team Communication for Robot-Assisted Disaster Response: An Initial Analysis
Natalia Skachkova | Ivana Kruijff-Korbayova

We analyze reference phenomena in a corpus of robot-assisted disaster response team communication. The annotation scheme we designed for this purpose distinguishes different types of entities, roles, reference units and relations. We focus particularly on mission-relevant objects, locations and actors and also annotate a rich set of reference links, including co-reference and various other kinds of relations. We explain the categories used in our annotation, present their distribution in the corpus and discuss challenging cases.

pdf bib
Resolving Pronouns in Twitter Streams: Context can Help!
Anietie Andy | Chris Callison-Burch | Derry Tanti Wijaya

Many people live-tweet televised events like Presidential debates and popular TV shows and discuss people or characters in the event. Naturally, many tweets make pronominal reference to these people/characters. We propose an algorithm for resolving personal pronouns that make reference to people involved in an event, in tweet streams collected during the event.

pdf bib
Coreference Strategies in English-German Translation
Ekaterina Lapshinova-Koltunski | Marie-Pauline Krielke | Christian Hardmeier

We present a study focusing on variation of coreferential devices in English original TED talks and news texts and their German translations. Using exploratory techniques we contemplate a diverse set of coreference devices as features which we assume indicate language-specific and register-based variation as well as potential translation strategies. Our findings reflect differences on both dimensions with stronger variation along the lines of register than between languages. By exposing interactions between text type and cross-linguistic variation, they can also inform multilingual NLP applications, especially machine translation.


bib (full) Proceedings of the Second International Workshop on Designing Meaning Representations

pdf bib
Proceedings of the Second International Workshop on Designing Meaning Representations
Nianwen Xue | Johan Bos | William Croft | Jan Hajič | Chu-Ren Huang | Stephan Oepen | Martha Palmer | James Pustejovsky

pdf bib
Cross-lingual annotation: a road map for low- and no-resource languages
Meagan Vigus | Jens E. L. Van Gysel | Tim O’Gorman | Andrew Cowell | Rosa Vallejos | William Croft

This paper presents a road map for the annotation of semantic categories in typologically diverse languages, with potentially few linguistic resources, and often no existing computational resources. Past semantic annotation efforts have focused largely on high-resource languages, or relatively low-resource languages with a large number of native speakers. However, there are certain typological traits, namely the synthesis of multiple concepts into a single word, that are more common in languages with a smaller speech community. For example, what is expressed as a sentence in a more analytic language like English, may be expressed as a single word in a more synthetic language like Arapaho. This paper proposes solutions for annotating analytic and synthetic languages in a comparable way based on existing typological research, and introduces a road map for the annotation of languages with a dearth of resources.

pdf bib
K-SNACS: Annotating Korean Adposition Semantics
Jena D. Hwang | Hanwool Choe | Na-Rae Han | Nathan Schneider

While many languages use adpositions to encode semantic relationships between content words in a sentence (e.g., agentivity or temporality), the details of how adpositions work vary widely across languages with respect to both form and meaning. In this paper, we empirically adapt the SNACS framework (Schneider et al., 2018) to Korean, a language that is typologically distant from English, the language SNACS was based on. We apply the SNACS framework to annotate the highly popular novella The Little Prince with semantic supersense labels over all Korean postpositions. Thus, we introduce the first broad-coverage corpus annotated with Korean postposition semantics and provide a detailed analysis of the corpus with an apples-to-apples comparison between Korean and English annotations.


bib (full) Proceedings of Workshop on Natural Language Processing in E-Commerce

pdf bib
Proceedings of Workshop on Natural Language Processing in E-Commerce
Huasha Zhao | Parikshit Sondhi | Nguyen Bach | Sanjika Hewavitharana | Yifan He | Luo Si | Heng Ji

pdf bib
BERT-based similarity learning for product matching
Janusz Tracz | Piotr Iwo Wójcik | Kalina Jasinska-Kobus | Riccardo Belluzzo | Robert Mroczkowski | Ireneusz Gawlik

Product matching, i.e., being able to infer the product being sold for a merchant-created offer, is crucial for any e-commerce marketplace, enabling product-based navigation, price comparisons, product reviews, etc. This problem proves a challenging task, mostly due to the extent of the product catalog, data heterogeneity, missing product representants, and varying levels of data quality. Moreover, new products are being introduced every day, making it difficult to cast the problem as a classification task. In this work, we apply BERT-based models in a similarity learning setup to solve the product matching problem. We provide a thorough ablation study, showing the impact of architecture and training objective choices. Application of transformer-based architectures and proper sampling techniques significantly boosts performance for a range of e-commerce domains, allowing for production deployment.


bib (full) Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation

pdf bib
Proceedings of the 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation
Dr Mahmoud El-Haj | Dr Vasiliki Athanasakou | Dr Sira Ferradans | Dr Catherine Salzedo | Dr Ans Elhag | Dr Houda Bouamor | Dr Marina Litvak | Dr Paul Rayson | Dr George Giannakopoulos | Nikiforos Pittaras

pdf bib
The Financial Document Causality Detection Shared Task (FinCausal 2020)
Dominique Mariko | Hanna Abi-Akl | Estelle Labidurie | Stephane Durfort | Hugues De Mazancourt | Mahmoud El-Haj

We present the FinCausal 2020 Shared Task on Causality Detection in Financial Documents and the associated FinCausal dataset, and discuss the participating systems and results. Two sub-tasks are proposed: a binary classification task (Task 1) and a relation extraction task (Task 2). A total of 16 teams submitted runs across the two tasks, and 13 of them contributed a system description paper. This workshop is associated with the Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation (FNP-FNS 2020), held at The 28th International Conference on Computational Linguistics (COLING’2020), Barcelona, Spain on September 12, 2020.

pdf bib
JDD @ FinCausal 2020, Task 2: Financial Document Causality Detection
Toshiya Imoto | Tomoki Ito

This paper describes the approach we built for the Financial Document Causality Detection Shared Task (FinCausal-2020) Task 2: Cause and Effect Detection. Our approach is based on a multi-class classifier using BiLSTM with Graph Convolutional Neural Network (GCN) trained by minimizing the binary cross entropy loss. In our approach, we have not used any extra data source apart from combining the trial and practice dataset. We achieve a weighted F1 score of 75.61 percent and are ranked in 7th place.

pdf bib
NITK NLP at FinCausal-2020 Task 1 Using BERT and Linear models.
Hariharan R L | Anand Kumar M

FinCausal-2020 is a shared task that focuses on causality detection in factual data for financial analysis. The financial data facts don’t provide much explanation of the variability of these data. This paper proposes an efficient method to classify whether the data contains a financial cause or not. Many models were used to classify the data: the SVM model gave an F-score of 0.9435, and BERT with specific fine-tuning achieved the best result with an F-score of 0.9677.

pdf bib
Fraunhofer IAIS at FinCausal 2020, Tasks 1 & 2: Using Ensemble Methods and Sequence Tagging to Detect Causality in Financial Documents
Maren Pielka | Rajkumar Ramamurthy | Anna Ladi | Eduardo Brito | Clayton Chapman | Paul Mayer | Rafet Sifa

The FinCausal 2020 shared task aims to detect causality on financial news and identify those parts of the causal sentences related to the underlying cause and effect. We apply ensemble-based and sequence tagging methods for identifying causality, and extracting causal subsequences. Our models yield promising results on both sub-tasks, with the prospect of further improvement given more time and computing resources. With respect to task 1, we achieved an F1 score of 0.9429 on the evaluation data, and a corresponding ranking of 12/14. For task 2, we were ranked 6/10, with an F1 score of 0.76 and an ExactMatch score of 0.1912.

pdf bib
NTUNLPL at FinCausal 2020, Task 2: Improving Causality Detection Using Viterbi Decoder
Pei-Wei Kao | Chung-Chi Chen | Hen-Hsen Huang | Hsin-Hsi Chen

In order to provide an explanation of machine learning models, causality detection attracts a lot of attention in the artificial intelligence research community. In this paper, we explore cause-effect detection in financial news and propose an approach that combines the BIO scheme with the Viterbi decoder to address this challenge. Our approach is ranked first in the official run of cause-effect detection (Task 2) of the FinCausal-2020 shared task. We not only report the implementation details and ablation analysis in this paper, but also publish our code for academic usage.
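
To illustrate how a Viterbi decoder can enforce a BIO tagging scheme, here is a minimal sketch and not the authors' released code. It assumes per-token log-scores are already available from some tagger; the tag names (`C` for cause, `E` for effect) and the function names are hypothetical. The only constraint encoded is that an `I-X` tag must follow `B-X` or `I-X`.

```python
NEG_INF = float("-inf")
TAGS = ["O", "B-C", "I-C", "B-E", "I-E"]  # hypothetical cause/effect tag set

def allowed(prev, cur):
    """An I-X tag may only continue a span opened by B-X or I-X."""
    if cur.startswith("I-"):
        span = cur[2:]
        return prev in ("B-" + span, "I-" + span)
    return True

def viterbi_bio(emissions):
    """Return the best-scoring tag sequence that respects the BIO constraint.

    emissions -- one dict per token, mapping every tag to a log-score
    """
    # A sequence cannot start inside a span, so I-* tags are ruled out at position 0.
    best = {t: (NEG_INF if t.startswith("I-") else emissions[0][t]) for t in TAGS}
    back = []
    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in TAGS:
            candidates = [(best[prev], prev) for prev in TAGS if allowed(prev, cur)]
            prev_score, prev_tag = max(candidates)
            new_best[cur] = prev_score + scores[cur]
            pointers[cur] = prev_tag
        best = new_best
        back.append(pointers)
    # Follow the back-pointers from the best final tag.
    tag = max(best, key=best.get)
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]
```

Choosing the highest-scoring tag for each token independently can produce an `I-C` with no preceding `B-C`; decoding over whole sequences avoids such invalid outputs.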

pdf bib
FiNLP at FinCausal 2020 Task 1: Mixture of BERTs for Causal Sentence Identification in Financial Texts
Sarthak Gupta

This paper describes our system developed for sub-task 1 of the FinCausal shared task in the FNP-FNS workshop held in conjunction with COLING-2020. The system classifies whether a financial news text segment contains causality or not. To address this task, we fine-tune and ensemble the generic and domain-specific BERT language models pre-trained on financial text corpora. The task data is highly imbalanced with the majority non-causal class; therefore, we train the models using strategies such as under-sampling, cost-sensitive learning, and data augmentation. Our best system achieves a weighted F1-score of 96.98, securing 4th position on the evaluation leaderboard. The code is available at

pdf bib
Domino at FinCausal 2020, Task 1 and 2: Causal Extraction System
Sharanya Chakravarthy | Tushar Kanakagiri | Karthik Radhakrishnan | Anjana Umapathy

Automatic identification of cause-effect relationships from data is a challenging but important problem in artificial intelligence. Identifying semantic relationships has become increasingly important for multiple downstream applications like Question Answering, Information Retrieval and Event Prediction. In this work, we tackle the problem of causal relationship extraction from financial news using the FinCausal 2020 dataset. We tackle two tasks: 1) detecting the presence of causal relationships and 2) extracting segments corresponding to cause and effect from news snippets. We propose Transformer-based sequence and token classification models with post-processing rules, which achieve F1 scores of 96.12 and 79.60 on Tasks 1 and 2 respectively.

pdf bib
IITkgp at FinCausal 2020, Shared Task 1: Causality Detection using Sentence Embeddings in Financial Reports
Arka Mitra | Harshvardhan Srivastava | Yugam Tiwari

The paper describes the work that the team submitted to the FinCausal 2020 Shared Task. This work is associated with the first sub-task of identifying causality in sentences. The various models used in the experiments tried to obtain a latent space representation for each of the sentences. Linear regression was performed on these representations to classify whether the sentence is causal or not. The experiments have shown that BERT (Large) performed the best, giving an F1 score of 0.958, in the task of detecting the causality of sentences in financial texts and reports. The class imbalance was handled with a modified loss function to give a better metric score for the evaluation.

pdf bib
Extractive Financial Narrative Summarisation based on DPPs
Lei Li | Yafei Jiang | Yinan Liu

We participate in the FNS-Summarisation 2020 shared task held at the FNP 2020 workshop at COLING 2020. Based on Determinantal Point Processes (DPPs), we build an extractive automatic financial summarisation system for the specific task. In this system, we first analyze the long report data to select the important narrative parts and generate an intermediate document. Next, we build the kernel matrix L for the intermediate document, which represents the quality of its sentences. On the basis of L, we can then use the DPPs sampling algorithm to choose sentences with high quality and diversity as the final summary sentences.
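
The trade-off between quality and diversity in a DPP can be sketched as follows. This is an illustration under stated assumptions and not the authors' system: it uses the common decomposition L[i][j] = q_i * s_ij * q_j (quality scores q, pairwise similarity s) and swaps the sampling step for a simple greedy selection that maximises the determinant of the selected submatrix. All function names are hypothetical.

```python
def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # singular (or numerically singular) matrix
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            result = -result  # a row swap flips the sign
        result *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return result

def build_kernel(quality, similarity):
    """Assumed kernel form: L[i][j] = q_i * s_ij * q_j."""
    n = len(quality)
    return [[quality[i] * similarity[i][j] * quality[j] for j in range(n)]
            for i in range(n)]

def greedy_dpp(kernel, k):
    """Greedily pick up to k items, each time maximising det of the chosen submatrix."""
    selected = []
    for _ in range(k):
        best_item, best_det = None, 0.0
        for i in range(len(kernel)):
            if i in selected:
                continue
            subset = selected + [i]
            d = det([[kernel[a][b] for b in subset] for a in subset])
            if d > best_det:
                best_item, best_det = i, d
        if best_item is None:
            break  # no remaining item adds positive volume
        selected.append(best_item)
    return selected
```

In this sketch a near-duplicate of an already selected sentence adds little to the determinant, so a lower-quality but dissimilar sentence can be preferred over it.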

pdf bib
End-to-end Training For Financial Report Summarization
Moreno La Quatra | Luca Cagliero

Quoted companies are requested to periodically publish financial reports in textual form. The annual financial reports typically include detailed financial and business information, thus giving relevant insights into company outlooks. However, a manual exploration of these financial reports could be very time consuming since most of the available information can be deemed as non-informative or redundant by expert readers. Hence, an increasing research interest has been devoted to automatically extracting domain-specific summaries, which include only the most relevant information. This paper describes the SumTO system architecture, which addresses the Shared Task of the Financial Narrative Summarisation (FNS) 2020 contest. The main task objective is to automatically extract the most informative, domain-specific textual content from financial, English-written documents. The aim is to create a summary of each company report covering all the business-relevant key points. To address the above-mentioned goal, we propose an end-to-end training method relying on Deep NLP techniques. The idea behind the system is to exploit the syntactic overlap between input sentences and ground-truth summaries to fine-tune pre-trained BERT embedding models, thus making such models tailored to the specific context. The achieved results confirm the effectiveness of the proposed method, especially when the goal is to select relatively long text snippets.

pdf bib
AMEX AI-Labs: An Investigative Study on Extractive Summarization of Financial Documents
Piyush Arora | Priya Radhakrishnan

We describe the work carried out by AMEX AI-LABS on an extractive summarization benchmark task focused on Financial Narratives Summarization (FNS). This task focuses on summarizing annual financial reports, which poses two main challenges compared to typical news document summarization tasks: i) annual reports are lengthier (average length about 80 pages) than typical news documents, and ii) annual reports are more loosely structured, e.g. comprising tables, charts, textual data and images, which makes them challenging to summarize effectively. To address this summarization task we investigate a range of unsupervised, supervised and ensemble-based techniques. We find that ensemble-based techniques perform better than the unsupervised or supervised techniques alone. Our ensemble-based model achieved our highest rank, 9 out of 31 systems submitted for the benchmark task, based on the Rouge-L evaluation metric.

pdf bib
FinTOC-2020: Multilingual Document Structure Extraction using Transfer Learning
Frederic Haase | Steffen Kirchhoff

In this paper we describe our system submitted to the FinTOC-2020 shared task on financial document structure extraction. We propose a two-step approach to identify titles in financial documents and to extract their table of contents (TOC). First, we identify text blocks as candidates for titles using unsupervised learning based on character-level information of each document. Then, we apply supervised learning on a self-constructed regression task to predict the depth of each text block in the document structure hierarchy using transfer learning combined with document features and layout features. It is noteworthy that our single multilingual model performs well on both tasks and on different languages, which indicates the usefulness of transfer learning for title detection and TOC generation. Moreover, our approach is independent of the presence of actual TOC pages in the documents. It is also one of the few submissions to the FinTOC-2020 shared task addressing both subtasks in both languages, English and French, with one single model.

pdf bib
A Computational Analysis of Financial and Environmental Narratives within Financial Reports and its Value for Investors
Felix Armbrust | Henry Schäfer | Roman Klinger

Public companies are obliged to include financial and non-financial information within their corporate filings under Regulation S-K, in the United States (SEC, 2010). However, the requirements still allow for manager’s discretion. This raises the question to what extent the information is actually included and whether this information is at all relevant for investors. We answer this question by training and evaluating an end-to-end deep learning approach (based on BERT and GloVe embeddings) to predict the financial and environmental performance of the company from the Management’s Discussion and Analysis of Financial Conditions and Results of Operations (MD&A) section of 10-K (yearly) and 10-Q (quarterly) filings. We further analyse the mediating effect of the environmental performance on the relationship between the company’s disclosures and financial performance. Hereby, we address the results of previous studies regarding environmental performance. We find that the textual information contained within the MD&A section does not allow for conclusions about the future (corporate) financial performance. However, there is evidence that the environmental performance can be extracted by natural language processing methods.

pdf bib
Mitigating Silence in Compliance Terminology during Parsing of Utterances
Esme Manandise | Conrad de Peuter

This paper reports on an approach to increase multi-token-term recall in a parsing task. We use a compliance-domain parser to extract, during the process of parsing raw text, terms that are unlisted in the terminology. The parser uses a similarity measure (Generalized Dice Coefficient) between listed terms and unlisted term candidates to (i) determine term status, (ii) serve putative terms to the parser, (iii) decrease parsing complexity by glomming multi-tokens as lexical singletons, and (iv) automatically augment the terminology after parsing of an utterance completes. We illustrate a small experiment with examples from the tax-and-regulations domain. Bootstrapping the parsing process to detect out-of-vocabulary terms at runtime increases parsing accuracy in addition to producing other benefits to a natural-language-processing pipeline, which translates arithmetic calculations written in English into computer-executable operations.
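
As a rough illustration of comparing an unlisted term candidate with listed terms, the sketch below uses the plain token-set Dice coefficient, 2|A ∩ B| / (|A| + |B|). This is a simplification: the generalized variant used in the paper may weight tokens differently, and the function names, example terms and the 0.5 threshold are hypothetical.

```python
def dice(term_a, term_b):
    """Token-set Dice coefficient between two multi-token terms."""
    a, b = set(term_a.lower().split()), set(term_b.lower().split())
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))

def closest_listed_term(candidate, terminology, threshold=0.5):
    """Return the most similar listed term, or None if nothing clears the threshold."""
    best = max(terminology, key=lambda t: dice(candidate, t), default=None)
    if best is not None and dice(candidate, best) >= threshold:
        return best
    return None
```

A candidate such as "adjusted gross income" shares two of its three tokens with a listed "gross income", so under this sketch it would be treated as a plausible term.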


bib (full) Proceedings of the Second Workshop on Gender Bias in Natural Language Processing

pdf bib
Proceedings of the Second Workshop on Gender Bias in Natural Language Processing
Marta R. Costa-jussà | Christian Hardmeier | Will Radford | Kellie Webster

pdf bib
Interdependencies of Gender and Race in Contextualized Word Embeddings
May Jiang | Christiane Fellbaum

Recent years have seen a surge in research on the biases in word embeddings with respect to gender and, to a lesser extent, race. Few of these studies, however, have given attention to the critical intersection of race and gender. In this case study, we analyze the dimensions of gender and race in contextualized word embeddings of given names, taken from BERT, and investigate the nature and nuance of their interaction. We find that these demographic axes, though typically treated as physically and conceptually separate, are in fact interdependent and thus inadvisable to consider in isolation. Further, we show that demographic dimensions predicated on default settings in language, such as in pronouns, may risk rendering groups with multiple marginalized identities invisible. We conclude by discussing the importance and implications of intersectionality for future studies on bias and debiasing in NLP.

pdf bib
Fine-tuning Neural Machine Translation on Gender-Balanced Datasets
Marta R. Costa-jussà | Adrià de Jorge

Misrepresentation of certain communities in datasets is causing big disruptions in artificial intelligence applications. In this paper, we propose using an automatically extracted gender-balanced parallel corpus from Wikipedia. This balanced set is used to fine-tune a bigger model trained on unbalanced datasets in order to mitigate gender biases in neural machine translation.

pdf bib
Conversational Assistants and Gender Stereotypes: Public Perceptions and Desiderata for Voice Personas
Amanda Cercas Curry | Judy Robertson | Verena Rieser

Conversational voice assistants are rapidly developing from purely transactional systems to social companions with personality. UNESCO recently stated that the female and submissive personality of current digital assistants gives cause for concern as it reinforces gender stereotypes. In this work, we present results from a participatory design workshop, where we invite people to submit their preferences for what their ideal persona might look like, both in drawings and in a multiple choice questionnaire. We find no clear consensus, which suggests that one possible solution is to let people configure/personalise their assistants. We then outline a multi-disciplinary project of how we plan to address the complex question of gender and stereotyping in digital assistants.


bib (full) Proceedings of the Second Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

pdf bib
Proceedings of the Second Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)
Aditya Mogadala | Sandro Pezzelle | Dietrich Klakow | Marie-Francine Moens | Zeynep Akata

pdf bib
Leveraging Visual Question Answering to Improve Text-to-Image Synthesis
Stanislav Frolov | Shailza Jolly | Jörn Hees | Andreas Dengel

Generating images from textual descriptions has recently attracted a lot of interest. While current models can generate photo-realistic images of individual objects such as birds and human faces, synthesising images with multiple objects is still very difficult. In this paper, we propose an effective way to combine Text-to-Image (T2I) synthesis with Visual Question Answering (VQA) to improve the image quality and image-text alignment of generated images by leveraging the VQA 2.0 dataset. We create additional training samples by concatenating question and answer (QA) pairs and employ a standard VQA model to provide the T2I model with an auxiliary learning signal. We encourage images generated from QA pairs to look realistic and additionally minimize an external VQA loss. Our method lowers the FID from 27.84 to 25.38 and increases the R-prec. from 83.82% to 84.79% when compared to the baseline, which indicates that T2I synthesis can successfully be improved using a standard VQA model.


bib (full) Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

pdf bib
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Stefania DeGaetano | Anna Kazantseva | Nils Reiter | Stan Szpakowicz

pdf bib
Automatic Topological Field Identification in (Historical) German Texts
Katrin Ortmann

For the study of certain linguistic phenomena and their development over time, large amounts of textual data must be enriched with relevant annotations. Since the manual creation of such annotations requires a lot of effort, automating the process with NLP methods would be convenient. But the required amounts of training data are usually not available for non-standard or historical language. The present study investigates whether models trained on modern newspaper text can be used to automatically identify topological fields, i.e. syntactic structures, in different modern and historical German texts. The evaluation shows that, in general, it is possible to transfer a parser model to other registers or time periods with overall F1-scores above 92%. However, an error analysis makes clear that additional rules and domain-specific training data would be beneficial if sentence structures differ significantly from the training data, e.g. in the case of Early New High German.

pdf bib
Neural Machine Translation of Artwork Titles Using Iconclass Codes
Nikolay Banar | Walter Daelemans | Mike Kestemont

We investigate the use of Iconclass in the context of neural machine translation for NL-EN artwork titles. Iconclass is a widely used iconographic classification system used in the cultural heritage domain to describe and retrieve subjects represented in the visual arts. The resource contains keywords and definitions to encode the presence of objects, people, events and ideas depicted in artworks, such as paintings. We propose a simple concatenation approach that improves the quality of automatically generated title translations for artworks, by leveraging textual information extracted from Iconclass. Our results demonstrate that a neural machine translation system is able to exploit this metadata to boost the translation performance of artwork titles. This technology enables interesting applications of machine learning in resource-scarce domains in the cultural sector.

pdf bib
Vital Records: Uncover the past from historical handwritten records
Herve Dejean | Jean-Luc Meunier

We present Vital Records, a demonstrator based on deep-learning approaches to handwritten-text recognition, table processing and information extraction, which enables data from century-old documents to be parsed and analysed, making it possible to explore death records in space and time. This demonstrator provides a user interface for browsing and visualising data extracted from 80,000 handwritten pages of tabular data.

pdf bib
Life still goes on: Analysing Australian WW1 Diaries through Distant Reading
Ashley Dennis-Henderson | Matthew Roughan | Lewis Mitchell | Jonathan Tuke

An increasing amount of historic data is now available in digital (text) formats. This gives quantitative researchers an opportunity to use distant reading techniques, as opposed to traditional close reading, in order to analyse larger quantities of historic data. Distant reading allows researchers to view overall patterns within the data and reduce researcher bias. One such data set that has recently been transcribed is a collection of over 500 Australian World War I (WW1) diaries held by the State Library of New South Wales. Here we apply distant reading techniques to this corpus to understand what soldiers wrote about and how they felt over the course of the war. Extracting dates accurately is important as it allows us to perform our analysis over time; however, it is very challenging due to the variety of date formats and abbreviations diarists use. With that data, topic modelling and sentiment analysis can then be applied to show trends, for instance, that despite the horrors of war, Australians in WW1 primarily wrote about their everyday routines and experiences. Our results detail some of the challenges likely to be encountered by quantitative researchers intending to analyse historical texts, and provide some approaches to these issues.

pdf bib
Results of a Single Blind Literary Taste Test with Short Anonymized Novel Fragments
Andreas van Cranenburgh | Corina Koolen

It is an open question to what extent perceptions of literary quality are derived from text-intrinsic versus social factors. While supervised models can predict literary quality ratings from textual factors quite successfully, as shown in the Riddle of Literary Quality project (Koolen et al., 2020), this does not prove that social factors are not important, nor can we assume that readers make judgments on literary quality in the same way and based on the same information as machine learning models. We report the results of a pilot study to gauge the effect of textual features on literary ratings of Dutch-language novels by participants in a controlled experiment with 48 participants. In an exploratory analysis, we compare the ratings to those from the large reader survey of the Riddle in which social factors were not excluded, and to machine learning predictions of those literary ratings. We find moderate to strong correlations of questionnaire ratings with the survey ratings, but the predictions are closer to the survey ratings. Code and data :

pdf bib
Interpretation of Sentiment Analysis in Aeschylus’s Greek Tragedy
Vijaya Kumari Yeruva | Mayanka ChandraShekar | Yugyung Lee | Jeff Rydberg-Cox | Virginia Blanton | Nathan A Oyler

Recent advancements in NLP and machine learning have created unique challenges and opportunities for digital humanities research. In particular, there are ample opportunities for NLP and machine learning researchers to analyze data from literary texts and to broaden our understanding of human sentiment in classical Greek tragedy. In this paper, we will explore the challenges and benefits from the human and machine collaboration for sentiment analysis in Greek tragedy and address some open questions related to the collaborative annotation for the sentiments in literary texts. We focus primarily on (i) an analysis of the challenges in sentiment analysis tasks for humans and machines, and (ii) whether consistent annotation results are generated from the multiple human annotators and multiple machine annotators. For human annotators, we have used a survey-based approach with about 60 college students. We have selected three popular sentiment analysis tools for machine annotators, including VADER, CoreNLP’s sentiment annotator, and TextBlob. We have conducted a qualitative and quantitative evaluation and confirmed our observations on sentiments in Greek tragedy.

pdf bib
Finding and Generating a Missing Part for Story Completion
Yusuke Mori | Hiroaki Yamane | Yusuke Mukuta | Tatsuya Harada

Creating a story is difficult. Professional writers often experience a writer’s block. Thus, providing automatic support to writers is crucial but also challenging. Recently, in the field of generating and understanding stories, story completion (SC) has been proposed as a method for generating missing parts of an incomplete story. Despite this method’s usefulness in providing creative support, its applicability is currently limited because it requires the user to have prior knowledge of the missing part of a story. Writers do not always know which part of their writing is flawed. To overcome this problem, we propose a novel approach called missing position prediction (MPP). Given an incomplete story, we aim to predict the position of the missing part. We also propose a novel method for MPP and SC. We first conduct an experiment focusing on MPP, and our analysis shows that highly accurate predictions can be obtained when the missing part of a story is the beginning or the end. This suggests that if a story has a specific beginning or end, they play significant roles. We conduct an experiment on SC using MPP, and our proposed method demonstrates promising results.


bib (full) Proceedings of the 14th Linguistic Annotation Workshop

pdf bib
Proceedings of the 14th Linguistic Annotation Workshop
Stefanie Dipper | Amir Zeldes

pdf bib
Cookpad Parsed Corpus: Linguistic Annotations of Japanese Recipes
Jun Harashima | Makoto Hiramatsu

It has become increasingly common for people to share cooking recipes on the Internet. Along with the increase in the number of shared recipes, there have been corresponding increases in recipe-related studies and datasets. However, there are still few datasets that provide linguistic annotations for the recipe-related studies even though such annotations should form the basis of the studies. This paper introduces a novel recipe-related dataset, named Cookpad Parsed Corpus, which contains linguistic annotations for Japanese recipes. We randomly extracted 500 recipes from the largest recipe-related dataset, the Cookpad Recipe Dataset, and annotated 4,738 sentences in the recipes with morphemes, named entities, and dependency relations. This paper also reports benchmark results on our corpus for Japanese morphological analysis, named entity recognition, and dependency parsing. We show that there is still room for improvement in the analyses of recipes.

pdf bib
PASTRIE: A Corpus of Prepositions Annotated with Supersense Tags in Reddit International English
Michael Kranzlein | Emma Manning | Siyao Peng | Shira Wein | Aryaman Arora | Nathan Schneider

We present the Prepositions Annotated with Supersense Tags in Reddit International English (PASTRIE) corpus, a new dataset containing manually annotated preposition supersenses of English data from presumed speakers of four L1s: English, French, German, and Spanish. The annotations are comprehensive, covering all preposition types and tokens in the sample. Along with the corpus, we provide analysis of distributional patterns across the included L1s and a discussion of the influence of L1s on L2 preposition choice.

pdf bib
Querent Intent in Multi-Sentence Questions
Laurie Burchell | Jie Chi | Tom Hosking | Nina Markl | Bonnie Webber

Multi-sentence questions (MSQs) are sequences of questions connected by relations which, unlike sequences of standalone questions, need to be answered as a unit. Following Rhetorical Structure Theory (RST), we recognise that different question discourse relations between the subparts of MSQs reflect different speaker intents, and consequently elicit different answering strategies. Correctly identifying these relations is therefore a crucial step in automatically answering MSQs. We identify five different types of MSQs in English, and define five novel relations to describe them. We extract over 162,000 MSQs from Stack Exchange to enable future research. Finally, we implement a high-precision baseline classifier based on surface features.

pdf bib
Annotating Coherence Relations for Studying Topic Transitions in Social Talk
Alex Luu | Sophia A. Malamud

This study develops the strand of research on topic transitions in social talk which aims to gain a better understanding of interlocutors’ conversational goals. Luu and Malamud (2020) proposed that one way to identify such transitions is to annotate coherence relations, and then to identify utterances potentially expressing new topics as those that fail to participate in these relations. This work validates and refines their suggested annotation methodology, focusing on annotating the most prominent coherence relations in face-to-face social dialogue. The result is a publicly accessible gold standard corpus with efficient and reliable annotation, whose broad coverage provides a foundation for future steps of identifying and classifying new topic utterances.


bib (full) Proceedings of the Third Workshop on Multilingual Surface Realisation

pdf bib
Proceedings of the Third Workshop on Multilingual Surface Realisation
Anya Belz | Bernd Bohnet | Thiago Castro Ferreira | Yvette Graham | Simon Mille | Leo Wanner

pdf bib
The Third Multilingual Surface Realisation Shared Task (SR’20): Overview and Evaluation Results
Simon Mille | Anya Belz | Bernd Bohnet | Thiago Castro Ferreira | Yvette Graham | Leo Wanner

This paper presents results from the Third Shared Task on Multilingual Surface Realisation (SR’20) which was organised as part of the COLING’20 Workshop on Multilingual Surface Realisation. As in SR’18 and SR’19, the shared task comprised two tracks: (1) a Shallow Track where the inputs were full UD structures with word order information removed and tokens lemmatised; and (2) a Deep Track where additionally, functional words and morphological information were removed. Moreover, each track had two subtracks: (a) restricted-resource, where only the data provided or approved as part of a track could be used for training models, and (b) open-resource, where any data could be used. The Shallow Track was offered in 11 languages, whereas the Deep Track was offered in 3. Systems were evaluated using both automatic metrics and direct assessment by human evaluators in terms of Readability and Meaning Similarity to reference outputs. We present the evaluation results, along with descriptions of the SR’19 tracks, data and evaluation methods, as well as brief summaries of the participating systems. For full descriptions of the participating systems, please see the separate system reports elsewhere in this volume.

pdf bib
BME-TUW at SR’20: Lexical grammar induction for surface realization
Gábor Recski | Ádám Kovács | Kinga Gémes | Judit Ács | Andras Kornai

We present a system for mapping Universal Dependency structures to raw text which learns to restore word order by training an Interpreted Regular Tree Grammar (IRTG) that establishes a mapping between string and graph operations. The reinflection step is handled by a standard sequence-to-sequence architecture with a biLSTM encoder and an LSTM decoder with attention. We modify our 2019 system (Kovács et al., 2019) with a new grammar induction mechanism that allows IRTG rules to operate on lemmata in addition to part-of-speech tags and ensures that each word and its dependents are reordered using the most specific set of learned patterns. We also introduce a hierarchical approach to word order restoration that independently determines the word order of each clause in a sentence before arranging them with respect to the main clause, thereby improving overall readability and also making the IRTG parsing task tractable. We participated in the 2020 Surface Realization Shared task, subtrack T1a (shallow, closed). Human evaluation shows we achieve significant improvements on two of the three out-of-domain datasets compared to the 2019 system we modified. Both components of our system are available on GitHub under an MIT license.

pdf bib
NILC at SR’20: Exploring Pre-Trained Models in Surface Realisation
Marco Antonio Sobrevilla Cabezudo | Thiago Pardo

This paper describes the submission by the NILC Computational Linguistics research group of the University of São Paulo, Brazil, to the English Track 2 (closed sub-track) at the Surface Realisation Shared Task 2020. The success of current pre-trained models like BERT or GPT-2 in several tasks is well known; however, this is not the case for data-to-text generation tasks, and only recently have some initiatives focused on them. We therefore explore how a pre-trained model (GPT-2) performs on the UD-to-text generation task. In general, the achieved results were poor, but there are some interesting ideas to explore. Among the lessons learned, we note that it is necessary to study strategies to represent UD inputs and to introduce structural knowledge into these pre-trained models.


bib (full) Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons

pdf bib
Proceedings of the Joint Workshop on Multiword Expressions and Electronic Lexicons
Stella Markantonatou | John McCrae | Jelena Mitrović | Carole Tiberius | Carlos Ramisch | Ashwini Vaidya | Petya Osenova | Agata Savary

pdf bib
CollFrEn: Rich Bilingual English–French Collocation Resource
Beatriz Fisas | Luis Espinosa Anke | Joan Codina-Filbá | Leo Wanner

Collocations in the sense of idiosyncratic lexical co-occurrences of two syntactically bound words traditionally pose a challenge to language learners and many Natural Language Processing (NLP) applications alike. Reliable ground truth (i.e., ideally manually compiled) resources are thus of high value. We present a manually compiled bilingual English–French collocation resource with 7,480 collocations in English and 6,733 in French. Each collocation is enriched with information that facilitates its downstream exploitation in NLP tasks such as machine translation, word sense disambiguation, natural language generation, relation classification, and so forth. Our proposed enrichment covers: the semantic category of the collocation (its lexical function), its vector space representation (for each individual word as well as their joint collocation embedding), a subcategorization pattern of both its elements, as well as their corresponding BabelNet id, and finally, indices of their occurrences in large scale reference corpora.

pdf bib
Hierarchy-aware Learning of Sequential Tool Usage via Semi-automatically Constructed Taxonomies
Nima Nabizadeh | Martin Heckmann | Dorothea Kolossa

When repairing a device, humans employ a series of tools that corresponds to the arrangement of the device components. Such sequences of tool usage can be learned from repair manuals, so that at each step, having observed the previously applied tools, a sequential model can predict the next required tool. In this paper, we improve the tool prediction performance of such methods by additionally taking the hierarchical relationships among the tools into account. To this aim, we build a taxonomy of tools with hyponymy and hypernymy relations from the data by decomposing all multi-word expressions of tool names. We then develop a sequential model that performs a binary prediction for each node in the taxonomy. The evaluation of the method on a dataset of repair manuals shows that encoding the tools with the constructed taxonomy and using a top-down beam search for decoding increases the prediction accuracy and yields an interpretable taxonomy as a potentially valuable byproduct.

pdf bib
AlphaMWE: Construction of Multilingual Parallel Corpora with MWE Annotations
Lifeng Han | Gareth Jones | Alan Smeaton

In this work, we present the construction of multilingual parallel corpora with annotation of multiword expressions (MWEs). MWEs include verbal MWEs (vMWEs) defined in the PARSEME shared task that have a verb as the head of the studied terms. The annotated vMWEs are also bilingually and multilingually aligned manually. The languages covered include English, Chinese, Polish, and German. Our original English corpus is taken from the PARSEME shared task in 2018. We performed machine translation of this source corpus followed by human post-editing and annotation of target MWEs. Strict quality control was applied to limit errors, i.e., each MT output sentence first received manual post-editing and annotation, followed by a second manual quality check. One of our findings during corpora preparation is that accurate translation of MWEs presents challenges to MT systems. To facilitate further MT research, we present a categorisation of the error types encountered by MT systems in performing MWE-related translation. To acquire a broader view of MT issues, we selected four popular state-of-the-art MT models for comparison, namely: Microsoft Bing Translator, GoogleMT, Baidu Fanyi and DeepL MT. Because of the noise removal, translation post-editing and MWE annotation by human professionals, we believe our AlphaMWE dataset will be an asset for cross-lingual and multilingual research, such as MT and information extraction. Our multilingual corpora are available as open access at

pdf bib
Annotating Verbal MWEs in Irish for the PARSEME Shared Task 1.2
Abigail Walsh | Teresa Lynn | Jennifer Foster

This paper describes the creation of two Irish corpora (labelled and unlabelled) for verbal MWEs for inclusion in the PARSEME Shared Task 1.2 on automatic identification of verbal MWEs, and the process of developing verbal MWE categories for Irish. A qualitative analysis on the two corpora is presented, along with discussion of Irish verbal MWEs.

pdf bib
Multi-word Expressions for Abusive Speech Detection in Serbian
Ranka Stanković | Jelena Mitrović | Danka Jokić | Cvetana Krstev

This paper presents our work on the refinement and improvement of the Serbian language part of Hurtlex, a multilingual lexicon of words to hurt. We pay special attention to adding Multi-word expressions that can be seen as abusive, as such lexical entries are very important in obtaining good results in a plethora of abusive language detection tasks. We use Serbian morphological dictionaries as a basis for data cleaning and MWE dictionary creation. A connection to other lexical and semantic resources in Serbian is outlined and building of abusive language detection systems based on that connection is foreseen.

pdf bib
Comparing word2vec and GloVe for Automatic Measurement of MWE Compositionality
Thomas Pickard

This paper explores the use of word2vec and GloVe embeddings for unsupervised measurement of the semantic compositionality of MWE candidates. Through comparison with several human-annotated reference sets, we find word2vec to be substantively superior to GloVe for this task. We also find Simple English Wikipedia to be a poor-quality resource for compositionality assessment, but demonstrate that a sample of 10% of sentences in the English Wikipedia can provide a conveniently tractable corpus with only moderate reduction in the quality of outputs.
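One common way to score compositionality from embeddings, sketched here under assumptions, is to compare the vector of the MWE treated as a single token with the sum of its component word vectors. The abstract does not state which measure the paper uses, so the cosine-against-sum measure, the function names, and the toy three-dimensional vectors below are all illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def compositionality(mwe_vec, component_vecs):
    """Cosine between the MWE vector and the element-wise sum of its components."""
    composed = [sum(dims) for dims in zip(*component_vecs)]
    return cosine(mwe_vec, composed)

# Toy vectors invented for illustration; real ones would come from a trained model.
vectors = {
    "swimming": [1.0, 0.0, 0.0],
    "pool": [0.0, 1.0, 0.0],
    "swimming_pool": [1.0, 1.0, 0.0],
    "red": [1.0, 0.0, 0.0],
    "herring": [0.0, 1.0, 0.0],
    "red_herring": [0.0, 0.0, 1.0],
}
```

With these toy vectors, `swimming_pool` lies in the same direction as the sum of its parts and scores close to 1, while `red_herring` is orthogonal to its parts and scores 0, mirroring the compositional versus idiomatic contrast.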

pdf bib
MultiVitaminBooster at PARSEME Shared Task 2020: Combining Window- and Dependency-Based Features with Multilingual Contextualised Word Embeddings for VMWE Detection
Sebastian Gombert | Sabine Bartsch

In this paper, we present MultiVitaminBooster, a system implemented for the PARSEME shared task on semi-supervised identification of verbal multiword expressions, edition 1.2. For our approach, we interpret detecting verbal multiword expressions as a token classification task aiming to decide whether a token is part of a verbal multiword expression or not. For this purpose, we train gradient boosting-based models. We encode tokens as feature vectors combining multilingual contextualized word embeddings provided by the XLM-RoBERTa language model with a more traditional linguistic feature set relying on context windows and dependency relations. Our system was ranked 7th in the official open track ranking of the shared task evaluations, with an encoding-related bug distorting the results. For this reason, we carry out further unofficial evaluations. Unofficial versions of our systems would have achieved higher ranks.


bib (full) Proceedings of the 3rd NLP4IF Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda

pdf bib
Proceedings of the 3rd NLP4IF Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda
Giovanni Da San Martino | Chris Brew | Giovanni Luca Ciampaglia | Anna Feldman | Chris Leberknight | Preslav Nakov

pdf bib
Two Stage Transformer Model for COVID-19 Fake News Detection and Fact Checking
Rutvik Vijjali | Prathyush Potluri | Siddharth Kumar | Sundeep Teki

The rapid advancement of technology in online communication via social media platforms has led to a prolific rise in the spread of misinformation and fake news. Fake news is especially rampant in the current COVID-19 pandemic, leading to people believing in false and potentially harmful claims and stories. Detecting fake news quickly can alleviate the spread of panic, chaos and potential health hazards. We developed a two stage automated pipeline for COVID-19 fake news detection using state of the art machine learning models for natural language processing. The first model leverages a novel fact checking algorithm that retrieves the most relevant facts concerning user queries about particular COVID-19 claims. The second model verifies the level of truth in the queried claim by computing the textual entailment between the claim and the true facts retrieved from a manually curated COVID-19 dataset. The dataset is based on a publicly available knowledge source consisting of more than 5000 COVID-19 false claims and verified explanations, a subset of which was internally annotated and cross-validated to train and evaluate our models. We evaluate a series of models based on classical text-based features to more contextual Transformer based models and observe that a model pipeline based on BERT and ALBERT for the two stages respectively yields the best results.

pdf bib
Measuring Alignment to Authoritarian State Media as Framing Bias
Timothy Niven | Hung-Yu Kao

We introduce what is, to the best of our knowledge, a new task in natural language processing: measuring alignment to authoritarian state media. We operationalize alignment in terms of sociological definitions of media bias. We take as a case study the alignment of four Taiwanese media outlets to the Chinese Communist Party state media. We present the results of an initial investigation using the frequency of words in psychologically meaningful categories. Our findings suggest that the chosen word categories correlate with framing choices. We develop a calculation method that yields reasonable results for measuring alignment, agreeing well with the known labels. We confirm that our method does capture event selection bias, but whether it captures framing bias requires further investigation.


bib (full) Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media

pdf bib
Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media
Malvina Nissim | Viviana Patti | Barbara Plank | Esin Durmus

pdf bib
Persuasiveness of News Editorials depending on Ideology and Personality
Roxanne El Baff | Khalid Al Khatib | Benno Stein | Henning Wachsmuth

News editorials aim to shape the opinions of their readership and the general public on timely controversial issues. The impact of an editorial on the reader’s opinion does not only depend on its content and style, but also on the reader’s profile. Previous work has studied the effect of editorial style depending on general political ideologies (liberals vs. conservatives). In our work, we dig deeper into the persuasiveness of both content and style, exploring the role of the intensity of an ideology (lean vs. extreme) and the reader’s personality traits (agreeableness, conscientiousness, extraversion, neuroticism, and openness). Concretely, we train content- and style-based models on New York Times editorials for different ideology- and personality-specific groups. Our results suggest that particularly readers with extreme ideology and non role model personalities are impacted by style. We further analyze the importance of various text features with respect to the editorials’ impact, the readers’ profile, and the editorials’ geographical scope.

pdf bib
KanCMD: Kannada CodeMixed Dataset for Sentiment Analysis and Offensive Language Detection
Adeep Hande | Ruba Priyadharshini | Bharathi Raja Chakravarthi

We introduce the Kannada CodeMixed Dataset (KanCMD), a multi-task learning dataset for sentiment analysis and offensive language identification. The KanCMD dataset highlights two real-world issues in social media text. First, it contains actual comments in code-mixed text posted by users on YouTube, rather than monolingual text from textbooks. Second, it has been annotated for two tasks, namely sentiment analysis and offensive language detection, for the under-resourced Kannada language. Hence, KanCMD is meant to stimulate research on the under-resourced Kannada language, on real-world code-mixed social media text, and on multi-task learning. KanCMD was obtained by crawling YouTube, and each comment was annotated by a minimum of three annotators. We release KanCMD, comprising 7,671 comments, for multi-task learning research purposes.

pdf bib
Contextual Augmentation of Pretrained Language Models for Emotion Recognition in Conversations
Jonggu Kim | Hyeonmok Ko | Seoha Song | Saebom Jang | Jiyeon Hong

Since language model pretraining for learning contextualized word representations was proposed, pretrained language models have achieved success in many natural language processing tasks. This is because it is helpful to use individual contextualized representations of self-attention layers to initialize parameters for downstream tasks. Unfortunately, the use of pretrained language models for emotion recognition in conversations has not been studied enough. We first use ELECTRA, a state-of-the-art pretrained language model, and validate its performance on emotion recognition in conversations. Furthermore, we propose contextual augmentation of pretrained language models for emotion recognition in conversations, which considers not only previous utterances but also conversation-related information such as speakers, speech acts and topics. We classify information based on what the information is related to, and propose positions for the words corresponding to the information in the entire input sequence. To validate the proposed method, we conduct experiments on the DailyDialog dataset, which contains abundant annotated information about conversations. The experiments show that the proposed method achieves state-of-the-art F1 scores on the dataset and significantly improves the performance.

pdf bib
Multilingual Emoticon Prediction of Tweets about COVID-19
Stefanos Stoikos | Mike Izbicki

Emojis are a widely used tool for encoding emotional content in informal messages such as tweets, and predicting which emoji corresponds to a piece of text can be used as a proxy for measuring the emotional content in the text. This paper presents the first model for predicting emojis in highly multilingual text. Our BERTmoticon model is a fine-tuned version of the BERT model, and it can predict emojis for text written in 102 different languages. We trained our BERTmoticon model on 54.2 million geolocated tweets sent in the first 6 months of 2020, and we apply the model to a case study analyzing the emotional reaction of Twitter users to news about the coronavirus. Example findings include a spike in sadness when the World Health Organization (WHO) declared that coronavirus was a global pandemic, and a spike in anger and disgust when the number of COVID-19 related deaths in the United States surpassed one hundred thousand. We provide an easy-to-use and open source Python library for predicting emojis with BERTmoticon so that the model can easily be applied to other data mining tasks.

pdf bib
Experiencers, Stimuli, or Targets: Which Semantic Roles Enable Machine Learning to Infer the Emotions?
Laura Ana Maria Oberländer | Kevin Reich | Roman Klinger

Emotion recognition is predominantly formulated as text classification in which textual units are assigned to an emotion from a predefined inventory (e.g., fear, joy, anger, disgust, sadness, surprise, trust, anticipation). More recently, semantic role labeling approaches have been developed to extract structures from the text to answer questions like: who is described to feel the emotion? (experiencer), what causes this emotion? (stimulus), and at which entity is it directed? (target). Though it has been shown that jointly modeling stimulus and emotion category prediction is beneficial for both subtasks, it remains unclear which of these semantic roles enables a classifier to infer the emotion. Is it the experiencer, because the identity of a person is biased towards a particular emotion (X is always happy)? Is it a particular target (everybody loves X) or a stimulus (doing X makes everybody sad)? We answer these questions by training emotion classification models on five available datasets annotated with at least one semantic role, masking the fillers of these roles in the text in a controlled manner. We find that across multiple corpora, stimuli and targets carry emotion information, while the experiencer might be considered a confounder. Further, we analyze whether informing the model about the position of the role improves the classification decision. Particularly on literature corpora, we find that the role information improves the emotion classification.


bib (full) Proceedings of the 3rd International Workshop on Rumours and Deception in Social Media (RDSM)

pdf bib
Proceedings of the 3rd International Workshop on Rumours and Deception in Social Media (RDSM)
Ahmet Aker | Arkaitz Zubiaga

pdf bib
Covid or not Covid? Topic Shift in Information Cascades on Twitter
Liana Ermakova | Diana Nurbakova | Irina Ovchinnikova

Social media have become a valuable source of information. However, its power to shape public opinion can be dangerous, especially in the case of misinformation. The existing studies on misinformation detection hypothesise that the initial message is fake. In contrast, we focus on information distortion occurring in cascades as the initial message is quoted or receives a reply. We show a significant topic shift in information cascades on Twitter during the Covid-19 pandemic providing valuable insights for the automatic analysis of information distortion.

pdf bib
Automatic Detection of Hungarian Clickbait and Entertaining Fake News
Veronika Vincze | Martina Katalin Szabó

Online news does not always come from reliable sources and is not always even realistic. The constantly growing amount of online textual data has recently raised the need for detecting deception and bias in texts from different domains. In this paper, we identify different types of unrealistic news (clickbait and fake news written for entertainment purposes) written in Hungarian on the basis of a rich feature set and with the help of machine learning methods. Our tool achieves competitive scores: it is able to classify clickbait, fake news written for entertainment purposes and real news with an accuracy of over 80%. We also find that morphological features perform the best in this classification task.

pdf bib
Fake or Real? A Study of Arabic Satirical Fake News
Hadeel Saadany | Constantin Orasan | Emad Mohamed

One very common type of fake news is satire, which comes in the form of a news website or an online platform that parodies reputable real news agencies to create a sarcastic version of reality. This type of fake news is often disseminated by individuals on their online platforms as it has a much stronger effect in delivering criticism than a straightforward message. However, when the satirical text is disseminated via social media without mention of its source, it can be mistaken for real news. This study conducts several exploratory analyses to identify the linguistic properties of Arabic fake news with satirical content. It shows that although it parodies real news, Arabic satirical news has distinguishing features on the lexico-grammatical level. We exploit these features to build a number of machine learning models capable of identifying satirical fake news with an accuracy of up to 98.6%. The study introduces a new dataset (3185 articles) scraped from two Arabic satirical news websites (‘Al-Hudood’ and ‘Al-Ahram Al-Mexici’) which consists of fake news. The real news dataset consists of 3710 articles collected from three official news sites: the ‘BBC-Arabic’, the ‘CNN-Arabic’ and ‘Al-Jazeera news’. Both datasets are concerned with political issues related to the Middle East.


bib (full) Proceedings of the Fourteenth Workshop on Semantic Evaluation

pdf bib
Proceedings of the Fourteenth Workshop on Semantic Evaluation
Aurelie Herbelot | Xiaodan Zhu | Alexis Palmer | Nathan Schneider | Jonathan May | Ekaterina Shutova

pdf bib
Discovery Team at SemEval-2020 Task 1: Context-sensitive Embeddings Not Always Better than Static for Semantic Change Detection
Matej Martinc | Syrielle Montariol | Elaine Zosa | Lidia Pivovarova

This paper describes the approaches used by the Discovery Team to solve SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. The proposed method is based on clustering of BERT contextual embeddings, followed by a comparison of cluster distributions across time. The best results were obtained by an ensemble of this method and static Word2Vec embeddings. According to the official results, our approach proved the best for Latin in Subtask 2.

pdf bib
GM-CTSC at SemEval-2020 Task 1: Gaussian Mixtures Cross Temporal Similarity Clustering
Pierluigi Cassotti | Annalina Caputo | Marco Polignano | Pierpaolo Basile

This paper describes the system proposed by the Random team for SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. We focus our approach on the detection problem. Given the semantics of words captured by temporal word embeddings in different time periods, we investigate the use of unsupervised methods to detect when the target word has gained or lost senses. To this end, we define a new algorithm based on Gaussian Mixture Models to cluster the target similarities computed over the two periods. We compare the proposed approach with a number of similarity-based thresholds. We found that, although the performance of the detection methods varies across the word embedding algorithms, the combination of Gaussian Mixture with Temporal Referencing resulted in our best system.

pdf bib
RIJP at SemEval-2020 Task 1: Gaussian-based Embeddings for Semantic Change Detection
Ran Iwamoto | Masahiro Yukawa

This paper describes the model proposed and submitted by our RIJP team to SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection. In the model, words are represented by Gaussian distributions. For Subtask 1, the model achieved average scores of 0.51 and 0.70 in the evaluation and post-evaluation processes, respectively. The higher score in the post-evaluation process was achieved owing to appropriate parameter tuning. The results indicate that the proposed Gaussian-based embedding model is able to express semantic shifts while having a low computational

pdf bib
UiO-UvA at SemEval-2020 Task 1: Contextualised Embeddings for Lexical Semantic Change Detection
Andrey Kutuzov | Mario Giulianelli

We apply contextualised word embeddings to lexical semantic change detection in the SemEval-2020 Shared Task 1. This paper focuses on Subtask 2, ranking words by the degree of their semantic drift over time. We analyse the performance of two contextualising architectures (BERT and ELMo) and three change detection algorithms. We find that the most effective algorithms rely on the cosine similarity between averaged token embeddings and the pairwise distances between token embeddings. They outperform strong baselines by a large margin (in the post-evaluation phase, we have the best Subtask 2 submission for SemEval-2020 Task 1), but interestingly, the choice of a particular algorithm depends on the distribution of gold scores in the test set.
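
The two kinds of change score named in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, embeddings are plain lists of floats, and cosine distance (1 minus cosine similarity) is assumed as the distance in both scores.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length token embeddings component-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def prototype_distance(emb_t1, emb_t2):
    """Cosine distance between the averaged token embeddings of two periods."""
    return cosine_distance(mean_vector(emb_t1), mean_vector(emb_t2))

def average_pairwise_distance(emb_t1, emb_t2):
    """Mean cosine distance over all cross-period pairs of token embeddings."""
    total = sum(cosine_distance(u, v) for u in emb_t1 for v in emb_t2)
    return total / (len(emb_t1) * len(emb_t2))

# Toy 2-dimensional "token embeddings" of one target word in two periods
# (hypothetical values, for illustration only).
period_1 = [[1.0, 0.0], [0.9, 0.1]]
period_2 = [[0.0, 1.0], [0.1, 0.9]]
change_by_prototype = prototype_distance(period_1, period_2)
change_by_pairs = average_pairwise_distance(period_1, period_2)
```

Words would then be ranked by either score, with larger values read as stronger semantic drift.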

pdf bib
BMEAUT at SemEval-2020 Task 2: Lexical Entailment with Semantic Graphs
Ádám Kovács | Kinga Gémes | Andras Kornai | Gábor Recski

In this paper we present a novel rule-based, language-independent method for determining lexical entailment relations using semantic representations built from Wiktionary definitions. Combined with a simple WordNet-based method, our system achieves top scores on the English and Italian datasets of the SemEval-2020 task Predicting Multilingual and Cross-lingual (graded) Lexical Entailment (Glavaš et al., 2020). A detailed error analysis of our output uncovers future directions for improving both the semantic parsing method and the inference process on semantic graphs.

pdf bib
BRUMS at SemEval-2020 Task 3: Contextualised Embeddings for Predicting the (Graded) Effect of Context in Word Similarity
Hansi Hettiarachchi | Tharindu Ranasinghe

This paper presents the team BRUMS submission to SemEval-2020 Task 3: Graded Word Similarity in Context. The system utilises state-of-the-art contextualised word embeddings, which have some task-specific adaptations, including stacked embeddings and average embeddings. Overall, the approach achieves good evaluation scores across all the languages, while maintaining simplicity. In the final rankings, our approach is within the top 5 solutions for each language and takes 1st position in Finnish subtask 2.

pdf bib
UZH at SemEval-2020 Task 3: Combining BERT with WordNet Sense Embeddings to Predict Graded Word Similarity Changes
Li Tang

CoSimLex is a dataset that can be used to evaluate the ability of context-dependent word embeddings to model subtle, graded changes of meaning, as perceived by humans during reading. SemEval-2020 Task 3, Subtask 1 is about predicting the (graded) effect of context in word similarity, using CoSimLex to quantify such a change of similarity for a pair of words, from one context to another. Here, a meaning shift is composed of two aspects: a) discrete changes observed between different word senses, and b) more subtle changes of meaning representation that are not captured in those discrete changes. Therefore, this SemEval task was designed to allow the evaluation of systems that can deal with a mix of both situations of semantic shift, as they occur in the human perception of meaning. The described system was developed to improve the BERT baseline provided with the task, by reducing distortions in the BERT semantic space, compared to the human semantic space. To this end, complementarity between 768- and 1024-dimensional BERT embeddings, and average word sense vectors were used. With this system, after some fine-tuning, the baseline performance of 0.705 (uncentered Pearson correlation with human semantic shift data from 27 annotators) was enhanced by more than 6%, to 0.7645. We hope that this work can make a contribution to further our understanding of the semantic vector space of human perception, as it can be modeled with context-dependent word embeddings in natural language processing systems.

pdf bib
DCC-Uchile at SemEval-2020 Task 1: Temporal Referencing Word Embeddings
Frank D. Zamora-Reina | Felipe Bravo-Marquez

We present a system for the task of unsupervised lexical change detection: given a target word and two corpora spanning different periods of time, automatically detect whether the word has lost or gained senses from one corpus to another. Our system employs the temporal referencing method to obtain compatible representations of target words in different periods of time. This is done by concatenating corpora of different periods and performing a temporal referencing of target words, i.e., treating occurrences of target words in different periods as two independent tokens. Afterwards, we train word embeddings on the joint corpus and compare the referenced vectors of each target word using cosine similarity. Our submission was ranked 7th among 34 teams for subtask 1, obtaining an average accuracy of 0.637, only 0.050 points behind the first ranked system.
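
The corpus preparation step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names and the `_t1`/`_t2` suffix format are hypothetical, corpora are lists of tokenised sentences, and the later steps (training embeddings on the joint corpus and comparing the two referenced vectors by cosine similarity) are omitted.

```python
def temporal_reference(tokens, targets, period):
    """Tag each target word with its period; leave all other tokens unchanged."""
    return [f"{tok}_{period}" if tok in targets else tok for tok in tokens]

def build_joint_corpus(corpus_1, corpus_2, targets):
    """Concatenate two corpora, treating target words as period-specific tokens."""
    joint = [temporal_reference(sent, targets, "t1") for sent in corpus_1]
    joint += [temporal_reference(sent, targets, "t2") for sent in corpus_2]
    return joint

# Toy corpora (hypothetical sentences): "plane" is the only target word.
older = [["the", "plane", "of", "the", "table"]]
newer = [["the", "plane", "lands", "soon"]]
joint = build_joint_corpus(older, newer, {"plane"})
```

Because only target words are tagged, all context words share one vocabulary, so the vectors learned for `plane_t1` and `plane_t2` live in the same space and need no post-hoc alignment.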

pdf bib
SST-BERT at SemEval-2020 Task 1: Semantic Shift Tracing by Clustering in BERT-based Embedding Spaces
Vani Kanjirangat | Sandra Mitrovic | Alessandro Antonucci | Fabio Rinaldi

Lexical semantic change detection (also known as semantic shift tracing) is the task of identifying words that have changed their meaning over time. Unsupervised semantic shift tracing, the focal point of SemEval-2020, is particularly challenging. Given the unsupervised setup, in this work we propose to identify clusters among different occurrences of each target word, considering these as representatives of different word meanings. As such, disagreements in the obtained clusters naturally allow us to quantify the level of semantic shift for each target word in four target languages. To leverage this idea, clustering is performed on contextualized (BERT-based) embeddings of word occurrences. The obtained results show that our approach performs well both measured separately (per language) and overall, where we surpass all provided SemEval baselines.

pdf bib
TemporalTeller at SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection with Temporal Referencing
Jinan Zhou | Jiaxin Li

This paper describes our TemporalTeller system for SemEval Task 1: Unsupervised Lexical Semantic Change Detection. We develop a unified framework for the common semantic change detection pipelines, including preprocessing, learning word embeddings, calculating vector distances and determining the threshold. We also propose Gamma Quantile Threshold to distinguish between changed and stable words. Based on our system, we conduct a comprehensive comparison among BERT, Skip-gram, Temporal Referencing and alignment-based methods. Evaluation results show that Skip-gram with Temporal Referencing achieves the best performance of 66.5% classification accuracy and 51.8% Spearman’s Ranking Correlation.

pdf bib
Ferryman at SemEval-2020 Task 3: Bert with TFIDF-Weighting for Predicting the Effect of Context in Word Similarity
Weilong Chen | Xin Yuan | Sai Zhang | Jiehui Wu | Yanru Zhang | Yan Wang

Word similarity is widely used in machine learning applications such as search engines and recommendation. Measuring the changing meaning of the same word between two different sentences is not only a way to handle complex features in word usage (such as sentence syntax and semantics), but also an important method for modeling word polysemy. In this paper, we present the methodology proposed by team Ferryman. Our system is based on the Bidirectional Encoder Representations from Transformers (BERT) model combined with term frequency-inverse document frequency (TF-IDF), applied to the provided dataset, CoSimLex, which covers four languages: English, Croatian, Slovene, and Finnish. Our team Ferryman won first position for the English task and second position for Finnish in subtask 1.

pdf bib
JUSTMasters at SemEval-2020 Task 3: Multilingual Deep Learning Model to Predict the Effect of Context in Word Similarity
Nour Al-khdour | Mutaz Bni Younes | Malak Abdullah | Mohammad AL-Smadi

There is a growing research interest in studying word similarity. Two words that are similar in one context may be considered different in another context. Therefore, this paper investigates the effect of context on word similarity. The SemEval-2020 workshop has provided a shared task (Task 3: Predicting the (Graded) Effect of Context in Word Similarity). In this task, the organizers provided unlabeled datasets for four languages: English, Croatian, Finnish and Slovenian. Our team, JUSTMasters, participated in this competition in the two subtasks: A and B. Our approach used a weighted average ensembling method over different pretrained embedding techniques for each of the four languages. Our proposed model outperformed the baseline models in both subtasks and achieved the best result for subtask 2 in English and Finnish, with scores of 0.725 and 0.68 respectively. We were ranked sixth for subtask 1, with scores for English, Croatian, Finnish, and Slovenian as follows: 0.738, 0.44, 0.546, 0.512.

pdf bib
Will_Go at SemEval-2020 Task 3: An Accurate Model for Predicting the (Graded) Effect of Context in Word Similarity Based on BERT
Wei Bao | Hongshu Che | Jiandong Zhang

Natural Language Processing (NLP) has been widely used in semantic analysis in recent years. Our paper mainly discusses a methodology to analyze the effect that context has on human perception of similar words, which is the third task of SemEval 2020. We apply several methods for calculating the distance between two embedding vectors generated by Bidirectional Encoder Representations from Transformers (BERT). Our team, Will_Go, won 1st place in the Finnish language track of subtask 1 and second place in the English track of subtask 1.

pdf bib
SemEval-2020 Task 6: Definition Extraction from Free Text with the DEFT Corpus
Sasha Spala | Nicholas Miller | Franck Dernoncourt | Carl Dockhorn

Research on definition extraction has been conducted for well over a decade, largely with significant constraints on the type of definitions considered. In this work, we present DeftEval, a SemEval shared task in which participants must extract definitions from free text using a term-definition pair corpus that reflects the complex reality of definitions in natural language. Definitions and glosses in free text often appear without explicit indicators, across sentence boundaries, or in an otherwise complex linguistic manner. DeftEval involved 3 distinct subtasks: 1) sentence classification, 2) sequence labeling, and 3) relation extraction.

pdf bib
IIE-NLP-NUT at SemEval-2020 Task 4: Guiding PLM with Prompt Template Reconstruction Strategy for ComVE
Luxi Xing | Yuqiang Xie | Yue Hu | Wei Peng

This paper introduces our systems for the first two subtasks of SemEval-2020 Task 4: Commonsense Validation and Explanation. To clarify the intention for judgment and inject contrastive information for selection, we propose an input reconstruction strategy with prompt templates. Specifically, we formalize the subtasks into the multiple-choice question answering format and construct the input with the prompt templates; the final prediction of question answering is then taken as the result of the subtasks. Experimental results show that our approaches achieve significant performance gains compared with the baseline systems. Our approaches secure the third rank on both official test sets of the first two subtasks, with accuracies of 96.4 and 94.3 respectively.

pdf bib
BUT-FIT at SemEval-2020 Task 4: Multilingual Commonsense
Josef Jon | Martin Fajcik | Martin Docekal | Pavel Smrz

We participated in all three subtasks. In subtasks A and B, our submissions are based on pretrained language representation models (namely ALBERT) and data augmentation. We experimented with solving the task for another language, Czech, by means of multilingual models and a machine-translated dataset, or translated model inputs. We show that with a strong machine translation system, our system can be used in another language with a small accuracy loss. In subtask C, our submission, which is based on a pretrained sequence-to-sequence model (BART), ranked 1st in the BLEU score ranking. However, we show that the correlation between BLEU and human evaluation, in which our submission ended up 4th, is low. We analyse the metrics used in the evaluation and propose an additional score based on the model from subtask B, which correlates well with our manual ranking, as well as a reranking method based on the same principle. We performed an error and dataset analysis for all subtasks and we present our findings.

pdf bib
Masked Reasoner at SemEval-2020 Task 4: Fine-Tuning RoBERTa for Commonsense Reasoning
Daming Lu

This paper describes the masked reasoner system that participated in SemEval-2020 Task 4: Commonsense Validation and Explanation. The system participated in subtask B. We propose a novel method to fine-tune RoBERTa by masking the most important word in the statement. We believe that the confidence of the system in recovering that word is positively correlated to the score the masked language model gives to the current statement-explanation pair. We evaluate the importance of each word using InferSent and do the masked fine-tuning on RoBERTa. Then we use the fine-tuned model to predict the most plausible explanation. Our system is fast in training and achieved 73.5% accuracy.

pdf bib
UoR at SemEval-2020 Task 4: Pre-trained Sentence Transformer Models for Commonsense Validation and Explanation
Thanet Markchom | Bhuvana Dhruva | Chandresh Pravin | Huizhi Liang

The SemEval Task 4 Commonsense Validation and Explanation Challenge tests whether a system can differentiate natural language statements that make sense from those that do not. This work focuses on two subtasks, A and B, i.e., detecting against-common-sense statements and selecting explanations of why they are false from the given options. Intuitively, commonsense validation requires additional knowledge beyond the given statements. Therefore, we propose a system utilising pre-trained sentence transformer models based on BERT, RoBERTa and DistilBERT architectures to embed the statements before classification. According to the results, these embeddings can improve the performance of the typical MLP and LSTM classifiers as downstream models of both subtasks compared to regular tokenised statements. These embedded statements are shown to comprise additional information from external resources which helps validate common sense in natural language.

pdf bib
BUT-FIT at SemEval-2020 Task 5: Automatic Detection of Counterfactual Statements with Deep Pre-trained Language Representation Models
Martin Fajcik | Josef Jon | Martin Docekal | Pavel Smrz

This paper describes BUT-FIT’s submission at SemEval-2020 Task 5: Modelling Causal Reasoning in Language: Detecting Counterfactuals. The challenge focused on detecting whether a given statement contains a counterfactual (Subtask 1) and extracting both antecedent and consequent parts of the counterfactual from the text (Subtask 2). We experimented with various state-of-the-art language representation models (LRMs). We found the RoBERTa LRM to perform the best in both subtasks. We achieved first place in both exact match and F1 for Subtask 2 and ranked second for Subtask 1.

pdf bib
ACNLP at SemEval-2020 Task 6: A Supervised Approach for Definition Extraction
Fabien Caspani | Pirashanth Ratnamogan | Mathis Linger | Mhamed Hajaiej

We describe our contribution to two of the subtasks of SemEval 2020 Task 6, DeftEval: Extracting term-definition pairs in free text. The system for Subtask 1: Sentence Classification is based on a transformer architecture where we use transfer learning to fine-tune a pretrained model on the downstream task, and the one for Subtask 3: Relation Classification uses a Random Forest classifier with handcrafted dedicated features. Our systems respectively achieve 0.830 and 0.994 F1-scores on the official test set, and we believe that the insights derived from our study are potentially relevant to help advance the research on definition extraction.

pdf bib
CN-HIT-IT.NLP at SemEval-2020 Task 4: Enhanced Language Representation with Multiple Knowledge Triples
Yice Zhang | Jiaxuan Lin | Yang Fan | Peng Jin | Yuanchao Liu | Bingquan Liu

This paper describes our system that participated in SemEval-2020 Task 4: Commonsense Validation and Explanation. For this task, external knowledge, such as a knowledge graph, can clearly help the model understand commonsense in natural language statements. However, how to select the right triples for statements remains unsolved, so reducing the interference of irrelevant triples on model performance is a research focus. This paper adopts a modified K-BERT as the language encoder to enhance language representation through triples from knowledge graphs. Experiments show that our method is better than models without external knowledge, and is slightly better than the original K-BERT. We obtained an accuracy score of 0.97 in subtask A, ranking 1/45, and an accuracy score of 0.948, ranking 2/35.

pdf bib
CS-NLP Team at SemEval-2020 Task 4: Evaluation of State-of-the-art NLP Deep Learning Architectures on Commonsense Reasoning Task
Sirwe Saeedi | Aliakbar Panahi | Seyran Saeedi | Alvis C Fong

In this paper, we investigate a commonsense inference task that unifies natural language understanding and commonsense reasoning. We describe our attempt at the SemEval-2020 Task 4 competition: the Commonsense Validation and Explanation (ComVE) challenge. We discuss several state-of-the-art deep learning architectures for this challenge. Our system uses prepared labeled textual datasets that were manually curated for three different natural language inference subtasks. The goal of the first subtask is to test whether a model can distinguish between natural language statements that make sense and those that do not. We compare the performance of several language models and fine-tuned classifiers. Then, we propose a method inspired by question answering tasks that treats a classification problem as a multiple-choice question task to boost the performance of our experimental results (96.06%), which is significantly better than the baseline. For the second subtask, which is to select the reason why a statement does not make sense, we rank within the first six teams (93.7%) among 27 participants, with very competitive results. Our result for the last subtask, generating a reason against the nonsensical statement, shows considerable potential for future research: applying the most powerful generative language model (GPT-2), we obtained a BLEU score of 6.1732, placing among the first four teams.

pdf bib
JBNU at SemEval-2020 Task 4: BERT and UniLM for Commonsense Validation and Explanation
Seung-Hoon Na | Jong-Hyeon Lee

This paper presents our contributions to the SemEval-2020 Task 4 Commonsense Validation and Explanation (ComVE) and includes the experimental results of the two Subtasks B and C of the SemEval-2020 Task 4. Our systems rely on pre-trained language models, i.e., BERT (including its variants) and UniLM, and rank 10th and 7th among 27 and 17 systems on Subtasks B and C, respectively. We analyze the commonsense ability of the existing pretrained language models by testing them on the SemEval-2020 Task 4 ComVE dataset, specifically for Subtasks B and C, the explanation subtasks with multi-choice and sentence generation, respectively.

pdf bib
KaLM at SemEval-2020 Task 4: Knowledge-aware Language Models for Comprehension and Generation
Jiajing Wan | Xinting Huang

This paper presents our strategies in SemEval 2020 Task 4: Commonsense Validation and Explanation. We propose a novel way to search for evidence and choose different large-scale pre-trained models as the backbone for the three subtasks. The results show that our evidence-searching approach improves model performance on the commonsense explanation task. Our team ranks 2nd in subtask C according to human evaluation score.

pdf bib
LMVE at SemEval-2020 Task 4: Commonsense Validation and Explanation Using Pretraining Language Model
Shilei Liu | Yu Guo | BoChao Li | Feiliang Ren

This paper introduces our system for commonsense validation and explanation. For the Sen-Making task, we use a novel architecture based on a pretrained language model to pick out the one of the two given statements that is against common sense. For the Explanation task, we use a hint sentence mechanism to improve the performance greatly. In addition, we propose subtask-level transfer learning to share information between subtasks.

pdf bib
SSN-NLP at SemEval-2020 Task 4: Text Classification and Generation on Common Sense Context Using Neural Networks
Rishivardhan K. | Kayalvizhi S | Thenmozhi D. | Raghav R. | Kshitij Sharma

Common sense validation deals with testing whether a system can differentiate natural language statements that make sense from those that do not. This paper describes our approach to solving this challenge. For common sense validation with multiple choice, we propose a stacking-based approach to classify sentences that are more favourable in terms of common sense to the particular statement. We used a majority voting classifier methodology over three models: Bidirectional Encoder Representations from Transformers (BERT), Micro Text Classification (Micro TC) and XLNet. For sentence generation, we used a Neural Machine Translation (NMT) model to generate explanatory sentences.
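
Majority voting over three classifiers, as mentioned above, can be sketched as follows. This is a generic illustration rather than the authors' pipeline: the function names, the option labels and the prediction lists are hypothetical stand-ins for the outputs of the three models.

```python
from collections import Counter

def majority_vote(votes):
    """Return the most frequent label; on a tie, the label seen first wins."""
    return Counter(votes).most_common(1)[0][0]

def ensemble_predict(per_model_labels):
    """Combine per-model label lists (one list per model) example by example."""
    return [majority_vote(votes) for votes in zip(*per_model_labels)]

# Hypothetical predictions of three models on three multiple-choice examples.
model_1 = ["A", "B", "C"]
model_2 = ["A", "C", "C"]
model_3 = ["B", "B", "C"]
final = ensemble_predict([model_1, model_2, model_3])
```

With three voters and three answer options, a three-way tie is possible; this sketch resolves it in favour of the first model's label, which is one of several reasonable tie-breaking choices.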

pdf bib
UAICS at SemEval-2020 Task 4: Using a Bidirectional Transformer for Task a
Ciprian-Gabriel Cusmuliuc | Lucia-Georgiana Coca | Adrian Iftene

Commonsense Validation and Explanation has been a difficult task for machines since the dawn of computing. Although trivial for humans, it poses high complexity for machines due to the necessity of inference over a pre-existing knowledge base. To address this problem, SemEval 2020 Task 4: Commonsense Validation and Explanation (ComVE) aims to evaluate systems capable of multiple stages of ComVE. The challenge includes 3 tasks (A, B and C), each with its own requirements. Our team participated only in task A, which required selecting the statement that made the least sense. We chose to use a bidirectional transformer to solve the challenge; this paper presents the details of our method, runs and results.

pdf bib
Warren at SemEval-2020 Task 4: ALBERT and Multi-Task Learning for Commonsense Validation
Yuhang Wu | Hao Wu

This paper describes our system in subtask A of SemEval 2020 Shared Task 4. We propose a reinforcement learning model based on MTL (Multi-Task Learning) to enhance the prediction ability of commonsense validation. The experimental results demonstrate that our system outperforms the single-task text classification model. We combine MTL with the pretrained ALBERT model to achieve an accuracy of 0.904, and our model is ranked 16th on the final leaderboard of the competition among the 45 teams.

pdf bib
ETHAN at SemEval-2020 Task 5: Modelling Causal Reasoning in Language Using Neuro-symbolic Cloud Computing
Len Yabloko

I present ETHAN: Experimental Testing of Hybrid AI Node, implemented entirely on free cloud computing infrastructure. The ultimate goal of this research is to create a modular, reusable hybrid neuro-symbolic architecture for Artificial Intelligence. As a test case, I model natural language comprehension of causal relations from an open-domain text corpus, combining a semi-supervised language model (Huggingface Transformers) with constituency and dependency parsers (Allen Institute for Artificial Intelligence).

pdf bib
Ferryman as SemEval-2020 Task 5: Optimized BERT for Detecting Counterfactuals
Weilong Chen | Yan Zhuang | Peng Wang | Feng Hong | Yan Wang | Yanru Zhang

The main purpose of this article is to state the effect of using different methods and models for counterfactual determination and detection of causal knowledge. Nowadays, counterfactual reasoning has been widely used in various fields. In the realm of natural language processing (NLP), counterfactual reasoning has huge potential to improve the correctness of a sentence. In the shared Task 5 of SemEval 2020 on detecting counterfactuals, we pre-process the officially provided dataset with case conversion, stemming and abbreviation replacement. We use last-5 Bidirectional Encoder Representations from Transformers (BERT) and a term frequency-inverse document frequency (TF-IDF) vectorizer for counterfactual detection. Meanwhile, multi-sample dropout and cross-validation are used to improve versatility and prevent problems such as poor generalization caused by overfitting. Finally, our team Ferryman ranked 8th in sub-task 1 of this competition.

pdf bib
Lee at SemEval-2020 Task 5: ALBERT Model Based on the Maximum Ensemble Strategy and Different Data Sampling Methods for Detecting Counterfactual Statements
Junyi Li | Yuhang Wu | Bin Wang | Haiyan Ding

This article describes the system submitted to SemEval 2020 Task 5: Modelling Causal Reasoning in Language: Detecting Counterfactuals. In this task, we participate only in subtask A, which is detecting counterfactual statements. To solve this subtask, we first address the problem of data imbalance by processing the dataset with undersampling and oversampling methods. Second, we use the ALBERT model and the maximum ensemble method based on the ALBERT model. Our methods achieved an F1 score of 0.85 in subtask A.
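
The two resampling strategies named above can be sketched as follows. This is a generic illustration of random undersampling and oversampling for a binary task, not the authors' procedure: the function names and toy data are hypothetical, examples are `(text, label)` pairs with labels 0 and 1, and both classes are assumed to be non-empty.

```python
import random

def split_by_size(examples):
    """Return (minority, majority) lists of (text, label) pairs."""
    positives = [ex for ex in examples if ex[1] == 1]
    negatives = [ex for ex in examples if ex[1] == 0]
    minority, majority = sorted((positives, negatives), key=len)
    return minority, majority

def undersample(examples, seed=0):
    """Randomly drop majority-class examples until both classes are equal."""
    rng = random.Random(seed)
    minority, majority = split_by_size(examples)
    return minority + rng.sample(majority, len(minority))

def oversample(examples, seed=0):
    """Randomly repeat minority-class examples until both classes are equal."""
    rng = random.Random(seed)
    minority, majority = split_by_size(examples)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority + minority + extra

# Toy imbalanced dataset: 8 non-counterfactual vs. 2 counterfactual sentences.
data = [(f"statement {i}", 0) for i in range(8)] + [("if only a", 1), ("if only b", 1)]
smaller = undersample(data)
larger = oversample(data)
```

Undersampling discards training signal from the majority class, while oversampling repeats minority examples and can encourage overfitting to them, which is one reason to try both.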

pdf bib
NLU-Co at SemEval-2020 Task 5: NLU/SVM Based Model Apply Tocharacterise and Extract Counterfactual Items on Raw Data
Elvis Mboning Tchiaze | Damien Nouvel

In this article, we try to solve the problem of classifying counterfactual statements and extracting antecedents/consequences in raw data, by mobilizing on the one hand support vector machines (SVMs) and on the other hand Natural Language Understanding (NLU) infrastructures available on the market for conversational agents. Our experiments allowed us to test different pipelines of two known platforms (Snips NLU and Rasa NLU). The results obtained show that a Rasa NLU pipeline, built with a well-preprocessed dataset and tuned algorithms, can accurately model the structure of a counterfactual event, in order to facilitate the identification and the extraction of its components.

pdf bib
YNU-oxz at SemEval-2020 Task 5: Detecting Counterfactuals Based on Ordered Neurons LSTM and Hierarchical Attention Network
Xiaozhi Ou | Shengyan Liu | Hongling Li

This paper describes the system and results of our team’s participation in SemEval-2020 Task 5: Modelling Causal Reasoning in Language: Detecting Counterfactuals, which aims to simulate counterfactual semantics and reasoning in natural language. The task contains two subtasks: Subtask 1, detecting counterfactual statements, and Subtask 2, detecting antecedent and consequence. We participated only in Subtask 1, which aims to determine whether a given sentence is counterfactual. We proposed a system based on Ordered Neurons LSTM (ON-LSTM) with a Hierarchical Attention Network (HAN), and used a pooling operation for dimensionality reduction. Finally, we used the K-fold approach as the ensemble method. Our model achieved an F1 score of 0.7040 in Subtask 1 (ranked 16/27).
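The K-fold ensembling mentioned in this abstract can be sketched as follows: one model is trained per fold, and the per-fold predictions on the test set are combined. Averaging the predictions is an assumption (the abstract does not say how the fold models are combined), and `train_fn` is a hypothetical callback that returns a model mapping one example to a probability.

```python
def kfold_indices(n, k):
    """Yield (train_indices, validation_indices) for k contiguous folds over range(n)."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

def kfold_ensemble(train_fn, data, labels, test_data, k=5):
    """Train one model per fold and average their test-set probabilities.

    train_fn(examples, labels) is a hypothetical callback that returns a
    callable model: model(example) -> probability of the positive class.
    """
    sums = [0.0] * len(test_data)
    for train_idx, _val_idx in kfold_indices(len(data), k):
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for j, example in enumerate(test_data):
            sums[j] += model(example)
    return [s / k for s in sums]
```

In practice the held-out fold would also be used for early stopping or model selection; the sketch ignores it to keep the ensembling logic visible.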

pdf bib
BERTatDE at SemEval-2020 Task 6: Extracting Term-definition Pairs in Free Text Using Pre-trained Model
Huihui Zhang | Feiliang Ren

Definition extraction is an important task in Natural Language Processing; it identifies terms and their definitions. The task comprises a sentence classification task (i.e., classifying whether a sentence contains a definition) and a sequence labeling task (i.e., finding the boundaries of terms and definitions). This paper describes our system BERTatDE for the sentence classification task (subtask 1) and the sequence labeling task (subtask 2) of definition extraction (SemEval-2020 Task 6). We use BERT to address the multi-domain problems, including the uncertainty of term boundaries, that is, different areas have different ways of defining domain-related terms. For subtask 1, we use BERT, BiLSTM and attention; our best result achieved 79.71% F1 and eighteenth place. For subtask 2, we use BERT, BiLSTM and CRF for sequence labeling and achieve 40.73% macro-averaged F1.

pdf bib
Defx at SemEval-2020 Task 6: Joint Extraction of Concepts and Relations for Definition Extraction
Marc Hübner | Christoph Alt | Robert Schwarzenberg | Leonhard Hennig

Definition Extraction systems are a valuable knowledge source for both humans and algorithms. In this paper, we describe our submissions to the DeftEval shared task (SemEval-2020 Task 6), which is evaluated on an English textbook corpus. We provide a detailed explanation of our system for the joint extraction of definition concepts and the relations among them. Furthermore, we provide an ablation study of our model variations and describe the results of an error analysis.

pdf bib
UPB at SemEval-2020 Task 6: Pretrained Language Models for Definition Extraction
Andrei-Marius Avram | Dumitru-Clementin Cercel | Costin Chiru

This work presents our contribution in the context of the 6th task of SemEval-2020: Extracting Definitions from Free Text in Textbooks (DeftEval). This competition consists of three subtasks with different levels of granularity: (1) classification of sentences as definitional or non-definitional, (2) labeling of definitional sentences, and (3) relation classification. We use various pretrained language models (i.e., BERT, XLNet, RoBERTa, SciBERT, and ALBERT) to solve each of the three subtasks of the competition. Specifically, for each language model variant, we experiment with both freezing its weights and fine-tuning them. We also explore a multi-task architecture that was trained to jointly predict the outputs for the second and the third subtasks. Our best performing model evaluated on the DeftEval dataset obtains the 32nd place for the first subtask and the 37th place for the second subtask. The code is available for further research.

pdf bib
Buhscitu at SemEval-2020 Task 7: Assessing Humour in Edited News Headlines Using Hand-Crafted Features and Online Knowledge Bases
Kristian Nørgaard Jensen | Nicolaj Filrup Rasmussen | Thai Wang | Marco Placenti | Barbara Plank

This paper describes a system that aims at assessing humour intensity in edited news headlines as part of the 7th task of SemEval-2020 on Humor, Emphasis and Sentiment. Various factors need to be accounted for in order to assess the funniness of an edited headline. We propose an architecture that uses hand-crafted features, knowledge bases and a language model to understand humour, and combines them in a regression model. Our system outperforms two baselines. In general, automatic humour assessment remains a difficult task.

pdf bib
Hasyarasa at SemEval-2020 Task 7: Quantifying Humor as Departure from Expectedness
Ravi Theja Desetty | Ranit Chatterjee | Smita Ghaisas

This paper describes our system submission Hasyarasa for SemEval-2020 Task 7: Assessing Humor in Edited News Headlines. This task has two subtasks. The goal of Subtask 1 is to predict the mean funniness of the edited headline given the original and the edited headline. In Subtask 2, given two edits of the original headline, the goal is to predict the funnier of the two. We observed that departure from the expected states or actions of situations or individuals is the cause of humor in the edited headlines. We propose two novel features, Contextual Semantic Distance and Contextual Neighborhood Distance, to estimate this departure and thus capture the contextual absurdity, and hence the humor, in the edited headlines. We used these features together with a Bi-LSTM attention-based model and achieved 0.53310 RMSE for Subtask 1 and 60.19% accuracy for Subtask 2.

pdf bib
YNU-HPCC at SemEval-2020 Task 7: Using an Ensemble BiGRU Model to Evaluate the Humor of Edited News Titles
Joseph Tomasulo | Jin Wang | Xuejie Zhang

This paper describes an ensemble model designed for SemEval-2020 Task 7. The task is based on the Humicroedit dataset, which comprises news titles and one-word substitutions designed to make them humorous. We use BERT, FastText, ELMo, and Word2Vec to encode these titles, then pass them to a bidirectional gated recurrent unit (BiGRU) with attention. Finally, we use XGBoost on the concatenation of the results of the different models to make predictions.

pdf bib
NLP_UIOWA at SemEval-2020 Task 8: You’re Not the Only One Cursed with Knowledge - Multi Branch Model Memotion Analysis
Ingroj Shrestha | Jonathan Rusert

We propose hybrid models (HybridE and HybridW) for meme analysis (SemEval 2020 Task 8), which involves sentiment classification (Subtask A), humor classification (Subtask B), and scale of semantic classes (Subtask C). The hybrid model consists of a BLSTM for text processing and a CNN for image processing. HybridE gives equal weight to the BLSTM and the CNN, while HybridW weights them according to their performance on a validation set. The performances (macro F1) of our hybrid models are 0.329 (HybridE) and 0.328 (HybridW) on Subtask A, 0.507 (HybridE) and 0.512 (HybridW) on Subtask B, and 0.309 (HybridE) and 0.311 (HybridW) on Subtask C.
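The difference between an equal-weight and a validation-weighted combination of two branches can be sketched as below. The function `combine`, its arguments, and the normalisation of validation scores into weights are illustrative assumptions; the abstract does not give the exact weighting formula used for HybridW.

```python
def combine(text_probs, image_probs, text_score=None, image_score=None):
    """Mix two class-probability vectors and return (predicted_class, mixed_probs).

    With no validation scores the branches get equal weight (HybridE-style).
    With scores, weights are the scores normalised to sum to 1 (an assumed
    reading of the HybridW idea, not the authors' stated formula).
    """
    if text_score is None or image_score is None:
        w_text = w_image = 0.5
    else:
        total = text_score + image_score
        w_text, w_image = text_score / total, image_score / total
    mixed = [w_text * t + w_image * i for t, i in zip(text_probs, image_probs)]
    predicted = max(range(len(mixed)), key=mixed.__getitem__)
    return predicted, mixed
```

For example, with text probabilities `[0.6, 0.4]` and image probabilities `[0.3, 0.7]`, equal weighting favours class 1, whereas strongly trusting the text branch (say validation scores 0.9 versus 0.1) flips the decision to class 0.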

pdf bib
CS-Embed at SemEval-2020 Task 9: The Effectiveness of Code-switched Word Embeddings for Sentiment Analysis
Frances Adriana Laureano De Leon | Florimond Guéniat | Harish Tayyar Madabushi

The growing popularity and applications of sentiment analysis of social media posts have naturally led to sentiment analysis of posts written in multiple languages, a practice known as code-switching. While recent research into code-switched posts has focused on the use of multilingual word embeddings, these embeddings were not trained on code-switched data. In this work, we present word embeddings trained on code-switched tweets, specifically those that make use of Spanish and English, known as Spanglish. We explore the embedding space to discover how they capture the meanings of words in both languages. We test the effectiveness of these embeddings by participating in SemEval 2020 Task 9: Sentiment Analysis on Code-Mixed Social Media Text. We utilised them to train a sentiment classifier that achieves an F-1 score of 0.722. This is higher than the competition baseline of 0.656, with our team (CodaLab username francesita) ranking 14 out of 29 participating teams.

pdf bib
FII-UAIC at SemEval-2020 Task 9: Sentiment Analysis for Code-Mixed Social Media Text Using CNN
Lavinia Aparaschivei | Andrei Palihovici | Daniela Gîfu

The Sentiment Analysis for Code-Mixed Social Media Text task at the SemEval 2020 competition focuses on sentiment analysis in code-mixed social media text, specifically on the combination of English with Spanish (Spanglish) and Hindi (Hinglish). In this paper, we present a system that classifies Spanish-English tweets as positive, negative or neutral. First, we built a classifier that provides the corresponding sentiment labels. Besides the sentiment labels, we provide language labels at the word level. Second, we generate a word-level representation using a Convolutional Neural Network (CNN) architecture. Our solution shows promising results for the Sentimix Spanglish-English task (0.744), where our team, Lavinia_Ap, took 9th place. However, the results for the Sentimix Hindi-English task (0.324) have to be improved.

pdf bib
NLP-CIC at SemEval-2020 Task 9: Analysing Sentiment in Code-switching Language Using a Simple Deep-learning Classifier
Jason Angel | Segun Taofeek Aroyehun | Antonio Tamayo | Alexander Gelbukh

Code-switching is a phenomenon in which two or more languages are used in the same message. Nowadays, it is quite common to find messages with mixed languages in social media. This phenomenon presents a challenge for sentiment analysis. In this paper, we use a standard convolutional neural network model to predict the sentiment of tweets written in a blend of Spanish and English. Our simple approach achieved an F1-score of 0.71 on the test set of the competition. We analyze the capabilities of our best model and perform an error analysis to expose important difficulties in classifying sentiment in a code-switching setting.

pdf bib
Palomino-Ochoa at SemEval-2020 Task 9: Robust System Based on Transformer for Code-Mixed Sentiment Classification
Daniel Palomino | José Ochoa-Luna

We present a transfer learning system to perform a mixed Spanish-English sentiment classification task. Our proposal uses the state-of-the-art language model BERT and embeds it within a ULMFiT transfer learning pipeline. This combination allows us to predict the polarity of code-mixed (English-Spanish) tweets. Among 29 submitted systems, our approach (referred to as dplominop) is ranked 4th on the Sentimix Spanglish test set of SemEval 2020 Task 9. Our system yields a weighted F1 score of 0.755, which can be easily reproduced, as the source code and implementation details are made available.

pdf bib
ULD@NUIG at SemEval-2020 Task 9: Generative Morphemes with an Attention Model for Sentiment Analysis in Code-Mixed Text
Koustava Goswami | Priya Rani | Bharathi Raja Chakravarthi | Theodorus Fransen | John P. McCrae

Code mixing is a common phenomenon in multilingual societies where people switch from one language to another for various reasons. Recent advances in public communication over different social media sites have led to an increase in the frequency of code-mixed usage in written language. In this paper, we present the Generative Morphemes with Attention (GenMA) model, a sentiment analysis system contributed to SemEval 2020 Task 9 SentiMix. The system aims to predict the sentiments of the given English-Hindi code-mixed tweets without using word-level language tags, instead inferring this automatically using a morphological model. The system is based on a novel deep neural network (DNN) architecture, which outperformed the baseline F1-score on the test data set as well as the validation data set. Our results can be found under the user name koustava on the Sentimix Hindi English page.

pdf bib
ECNU at SemEval-2020 Task 7: Assessing Humor in Edited News Headlines Using BiLSTM with Attention
Tiantian Zhang | Zhixuan Chen | Man Lan

In this paper, we describe our system submitted to SemEval 2020 Task 7: Assessing Humor in Edited News Headlines. We participated in all subtasks, in which the main goal is to predict the mean funniness of the edited headline given the original and the edited headline. Our system involves two similar sub-networks, which generate vector representations for the original and edited headlines respectively. We then subtract the outputs of the two sub-networks and use the result to predict the funniness of the edited headline.

pdf bib
ELMo-NB at SemEval-2020 Task 7: Assessing Sense of Humor in Edited News Headlines Using ELMo and NB
Enas Khwaileh | Muntaha A. Al-As’ad

Our approach aims to improve on several aspects: preprocessing with an emphasis on humor sense detection, using embeddings from a state-of-the-art language model (ELMo), and ensembling the results obtained with a machine learning model, Naïve Bayes (NB), with those of deep learning pre-trained models. ELMo-NB scored 0.5642 on the competition leaderboard, where results were measured by Root Mean Squared Error (RMSE).

pdf bib
Ferryman at SemEval-2020 Task 7: Ensemble Model for Assessing Humor in Edited News Headlines
Weilong Chen | Jipeng Li | Chenghao Huang | Wei Bai | Yanru Zhang | Yan Wang

Natural language processing (NLP) has been applied to various fields including text classification and sentiment analysis. In the shared task of assessing the funniness of edited news headlines, which is part of the SemEval 2020 competition, we preprocess the datasets by replacing abbreviations and stemming words, and then merge three models, Light Gradient Boosting Machine (LightGBM), Long Short-Term Memory (LSTM), and Bidirectional Encoder Representations from Transformers (BERT), by averaging their predictions to obtain the best performance. Our team Ferryman took 9th place in Sub-task 1 (Regression) of Task 7.

pdf bib
Funny3 at SemEval-2020 Task 7: Humor Detection of Edited Headlines with LSTM and TFIDF Neural Network System
Xuefeng Luo | Kuan Tang

This paper presents the neural network system with which we participated in the first subtask of SemEval-2020 shared task 7, Assessing the Funniness of Edited News Headlines. Our goal is to create a neural network model that can predict the funniness of edited headlines. We build our model using a combination of LSTM and TF-IDF, followed by a feed-forward neural network. The system manages to slightly improve RMSE scores over our mean score baseline.
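For readers unfamiliar with the evaluation used in the humor regression entries, the sketch below computes the root mean squared error and a mean score baseline that always predicts the average training label. The function names are illustrative and not taken from any of the described systems.

```python
import math

def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(targets))

def mean_baseline(train_targets, n):
    """Predict the mean training label for each of n test items."""
    mean = sum(train_targets) / len(train_targets)
    return [mean] * n
```

A system is only useful on this metric if `rmse(model_predictions, test_targets)` is lower than `rmse(mean_baseline(train_targets, len(test_targets)), test_targets)`.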

pdf bib
HumorAAC at SemEval-2020 Task 7: Assessing the Funniness of Edited News Headlines through Regression and Trump Mentions
Anna-Katharina Dick | Charlotte Weirich | Alla Kutkina

In this paper, we describe our contribution to the SemEval-2020 Humor Assessment task. We use three different kinds of features that are passed into a ridge regression to determine a funniness score for an edited news headline: statistical, count-based features, semantic features, and contextual information. For deciding which of two given edited headlines is funnier, we additionally use scoring information and logistic regression. Our work concentrated mostly on investigating features, rather than on improving prediction based on pre-trained language models. The resulting system is task-specific, lightweight and performs above the majority baseline. Our experiments indicate that features related to socio-cultural context, in our case mentions of Donald Trump, generally perform better than context-independent features like headline length.

pdf bib
MLEngineer at SemEval-2020 Task 7: BERT-Flair Based Humor Detection Model (BFHumor)
Fara Shatnawi | Malak Abdullah | Mahmoud Hammad

Task 7, Assessing the Funniness of Edited News Headlines, in the International Workshop SemEval-2020 introduces two sub-tasks to predict the funniness values of edited news headlines from the Reddit website. This paper proposes the BFHumor model of the MLEngineer team, which participated in both sub-tasks of the competition. BFHumor is a BERT-Flair based humor detection model that combines different pre-trained models with various Natural Language Processing (NLP) techniques. The Bidirectional Encoder Representations from Transformers (BERT) regressor is the primary pre-trained model in our approach, whereas Flair is the main NLP library. The BFHumor model ranked 4th in sub-task 1 with a root mean square error (RMSE) of 0.51966, 0.02 away from the first-ranked model. The team also ranked 12th in sub-task 2 with an accuracy of 0.62291, 0.05 away from the top-ranked model. Our results indicate that the BFHumor model is one of the top models for detecting humor in text.

pdf bib
UTFPR at SemEval-2020 Task 7: Using Co-occurrence Frequencies to Capture Unexpectedness
Gustavo Henrique Paetzold

We describe the UTFPR system for SemEval-2020’s Task 7: Assessing Humor in Edited News Headlines. Ours is a minimalist unsupervised system that uses word co-occurrence frequencies from large corpora to capture unexpectedness as a means of capturing funniness. Our system placed 22nd on the shared task’s Task 2. We found that our approach requires more text than we used to perform reliably, and that unexpectedness alone is not sufficient to gauge funniness for humorous content that targets a diverse audience.

pdf bib
WUY at SemEval-2020 Task 7: Combining BERT and Naive Bayes-SVM for Humor Assessment in Edited News Headlines
Cheng Zhang | Hayato Yamana

This paper describes our participation in SemEval 2020 Task 7 on the assessment of humor in edited news headlines, which includes two subtasks: estimating the humor of micro-edited news headlines (subtask A) and predicting the more humorous of two edited headlines (subtask B). To address these tasks, we propose two systems. The first system adopts a regression-based fine-tuned single-sequence bidirectional encoder representations from transformers (BERT) model with easy data augmentation (EDA), called BERT+EDA. The second system adopts a hybrid of a regression-based fine-tuned sequence-pair BERT model and a combined Naive Bayes and support vector machine (SVM) model estimated on term frequency-inverse document frequency (TF-IDF) features, called BERT+NB-SVM. In this case, no additional training datasets were used, and the BERT+NB-SVM model outperformed BERT+EDA. The official root-mean-square error (RMSE) score for subtask A is 0.57369, ranking 31st out of 48, whereas the best RMSE of BERT+NB-SVM is 0.52429, ranking 7th. For subtask B, we simply use a sequence-pair BERT model, whose official accuracy is 0.53196, ranking 25th out of 32.
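Several entries in this listing rely on TF-IDF features. The sketch below computes one common variant (relative term frequency multiplied by the natural log of the inverse document frequency); this particular variant and the function name `tfidf` are assumptions, since the abstracts do not state which weighting or library was used.

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for tokenised documents.

    docs: list of token lists. Returns one {term: weight} dict per document,
    using tf = count / document length and idf = ln(N / document frequency).
    This is one common variant, assumed here for illustration.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

Terms that occur in every document receive weight zero under this variant, so a word shared by all headlines contributes nothing, while a rare substituted word stands out.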

pdf bib
BERT at SemEval-2020 Task 8: Using BERT to Analyse Meme Emotions
Adithya Avvaru | Sanath Vobilisetty

Sentiment analysis is one of the most sought-after research problems among Natural Language Processing (NLP) researchers, and the range of problems it addresses is increasing. Until now, most research has focused on predicting sentiment, or sentiment categories like sarcasm, humor, offense and motivation, on text data. However, very little research focuses on predicting or analyzing the sentiment of internet memes. We try to address this problem as part of Task 8 of SemEval 2020: Memotion Analysis. We participated in all three tasks under Memotion Analysis. Our system, built using the state-of-the-art Transformer-based pre-trained Bidirectional Encoder Representations from Transformers (BERT), performed better than the baseline models on tasks A and C and close to the baseline model on task B. In this paper, we present the data used, our steps for data cleaning and preparation, the fine-tuning process for the BERT-based model, and finally the prediction of sentiment or sentiment categories. We found that sequence models like Long Short-Term Memory (LSTM) and its variants performed below par in predicting the sentiments. We also performed a comparative analysis with other Transformer-based models like DistilBERT and XLNet.

pdf bib
CSECU_KDE_MA at SemEval-2020 Task 8: A Neural Attention Model for Memotion Analysis
Abu Nowshed Chy | Umme Aymun Siddiqua | Masaki Aono

A meme is a pictorial representation of an idea or theme. With the growing number of social media platforms, memes are spreading rapidly from person to person and becoming a trending way of expressing opinions. However, due to the multimodal characteristics of meme content, detecting and analyzing the underlying emotion of a meme is a formidable task. In this paper, we present our approach for detecting the emotion of a meme as defined in SemEval-2020 Task 8. Our team CSECU_KDE_MA employs an attention-based neural network model to tackle the problem. After extracting the text content from a meme using an optical character reader (OCR), we represent it using distributed representations of words. Next, we perform convolution with multiple kernel sizes to obtain higher-level feature sequences. The feature sequences are then fed into an attentive time-distributed bidirectional LSTM model to learn long-term dependencies effectively. Experimental results show that our proposed neural model obtained competitive performance among the participants’ systems.

pdf bib
Hitachi at SemEval-2020 Task 8: Simple but Effective Modality Ensemble for Meme Emotion Recognition
Terufumi Morishita | Gaku Morio | Shota Horiguchi | Hiroaki Ozaki | Toshinori Miyoshi

Users of social networking services often share their emotions via multi-modal content, usually images paired with text embedded in them. SemEval-2020 Task 8, Memotion Analysis, aims at automatically recognizing these emotions of so-called internet memes. In this paper, we propose a simple but effective Modality Ensemble that incorporates visual and textual deep-learning models, which are independently trained, rather than providing a single multi-modal joint network. To this end, we first fine-tune four pre-trained visual models (i.e., Inception-ResNet, PolyNet, SENet, and PNASNet) and four textual models (i.e., BERT, GPT-2, Transformer-XL, and XLNet). Then, we fuse their predictions with ensemble methods to effectively capture cross-modal correlations. The experiments performed on the dev set show that visual and textual features aided each other, especially in subtask C, and consequently, our system ranked 2nd on subtask C.

pdf bib
Memebusters at SemEval-2020 Task 8: Feature Fusion Model for Sentiment Analysis on Memes Using Transfer Learning
Mayukh Sharma | Ilanthenral Kandasamy | W.b. Vasantha

In this paper, we describe our deep learning system used for SemEval 2020 Task 8: Memotion analysis. We participated in all the subtasks, i.e., Subtask A: Sentiment classification, Subtask B: Humor classification, and Subtask C: Scales of semantic classes. A similar multimodal architecture was used for each subtask. The proposed architecture makes use of transfer learning for image and text feature extraction. The extracted features are then fused using a stacked bidirectional Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) model with an attention mechanism for final predictions. We also propose a single model for predicting semantic classes (Subtask B) as well as their scales (Subtask C) by branching the final output of the post-LSTM dense layers. Our model ranked 5th in Subtask B and 8th in Subtask C, and performed well in Subtask A on the leaderboard. Our system makes use of transfer learning for feature extraction and fusion of image and text features for predictions.

pdf bib
SIS@IIITH at SemEval-2020 Task 8: An Overview of Simple Text Classification Methods for Meme Analysis
Sravani Boinepelli | Manish Shrivastava | Vasudeva Varma

Memes are steadily taking over the feeds of the public on social media. There is always the threat of malicious users on the internet posting offensive content, even through memes. Hence, the automatic detection of offensive images and memes is imperative, along with the detection of offensive text. However, this is a much more complex task, as it involves visual cues as well as language understanding and cultural and contextual knowledge. This paper describes our approach to SemEval-2020 Task 8: Memotion Analysis. We chose to participate only in Task A, which dealt with Sentiment Classification and which we formulated as a text classification problem. Through our experiments, we explored multiple training models to evaluate the performance of simple text classification algorithms on the raw text obtained after running OCR on meme images. Our submitted model achieved an accuracy of 72.69% and exceeded the existing baseline’s Macro F1 score by 8% on the official test dataset. Apart from describing our official submission, we shall elucidate how different classification models respond to this task.

pdf bib
UoR at SemEval-2020 Task 8: Gaussian Mixture Modelling (GMM) Based Sampling Approach for Multi-modal Memotion Analysis
Zehao Liu | Emmanuel Osei-Brefo | Siyuan Chen | Huizhi Liang

Memes are widely used on social media. They usually contain multi-modal information such as images and texts, serving as valuable data sources for analysing the opinions and sentiment orientations of online communities. The provided meme data often suffer from class imbalance, that is, some classes or labelled sentiment categories significantly outnumber other classes. This often makes it difficult to apply machine learning techniques that require balanced labelled input data. In this paper, a Gaussian Mixture Model sampling method is proposed to tackle the problem of class imbalance for the meme sentiment classification task. To utilise both text and image data, a multi-modal CNN-LSTM model is proposed to jointly learn latent features for positive, negative and neutral category predictions. The experiments show that the re-sampling model can slightly improve the accuracy on the trial data of sub-task A of Task 8. The multi-modal CNN-LSTM model achieves a macro F1 score of 0.329 on the test set.
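Macro F1, the metric reported by several Task 8 entries, averages the per-class F1 scores without weighting by class frequency, which is why it is sensitive to the class imbalance discussed in this abstract. A minimal sketch, with an illustrative function name and the convention (assumed here) that a class with no true or predicted instances scores 0:

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

A classifier that always predicts the majority class can reach high accuracy on imbalanced data yet a low macro F1, because every minority class contributes an F1 of zero to the average.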

pdf bib
BAKSA at SemEval-2020 Task 9: Bolstering CNN with Self-Attention for Sentiment Analysis of Code Mixed Text
Ayush Kumar | Harsh Agarwal | Keshav Bansal | Ashutosh Modi

Sentiment analysis of code-mixed text has diverse applications in opinion mining, ranging from tagging user reviews to identifying social or political sentiments of a sub-population. In this paper, we present an ensemble architecture of a convolutional neural net (CNN) and a self-attention based LSTM for sentiment analysis of code-mixed tweets. While the CNN component helps in the classification of positive and negative tweets, the self-attention based LSTM helps in the classification of neutral tweets, because of its ability to identify the correct sentiment among multiple sentiment-bearing units. We achieved F1 scores of 0.707 (ranked 5th) and 0.725 (ranked 13th) on the Hindi-English (Hinglish) and Spanish-English (Spanglish) datasets, respectively. The submissions for the Hinglish and Spanglish tasks were made under the usernames ayushk and harsh_6 respectively.

pdf bib
Deep Learning Brasil-NLP at SemEval-2020 Task 9: Sentiment Analysis of Code-Mixed Tweets Using Ensemble of Language Models
Manoel Veríssimo dos Santos Neto | Ayrton Amaral | Nádia Silva | Anderson da Silva Soares

In this paper, we describe a methodology to predict sentiment in code-mixed tweets (Hindi-English). Our team, called verissimo.manoel in CodaLab, developed an approach based on an ensemble of four models (MultiFiT, BERT, ALBERT, and XLNet). The final classification algorithm was an ensemble over predictions derived from the softmax values of these four models. This architecture was used and evaluated in the context of the SemEval 2020 challenge (Task 9), and our system obtained an F1 score of 72.7%.

pdf bib
IUST at SemEval-2020 Task 9: Sentiment Analysis for Code-Mixed Social Media Text Using Deep Neural Networks and Linear Baselines
Soroush Javdan | Taha Shangipour ataei | Behrouz Minaei-Bidgoli

Sentiment Analysis is a well-studied field of Natural Language Processing. However, the rapid growth of social media and the noisy content within it pose significant challenges in addressing this problem with well-established methods and tools. One of these challenges is code-mixing, which means using different languages to convey thoughts in social media texts. Our group, named IUST (username: TAHA), participated in the SemEval-2020 shared task 9 on Sentiment Analysis for Code-Mixed Social Media Text, and we attempted to develop a system to predict the sentiment of a given code-mixed tweet. We used different preprocessing techniques and proposed methods that range from NBSVM to more complicated deep neural network models. Our best performing method obtains an F1 score of 0.751 on the Spanish-English sub-task and 0.706 on the Hindi-English sub-task.

pdf bib
MeisterMorxrc at SemEval-2020 Task 9: Fine-Tune Bert and Multitask Learning for Sentiment Analysis of Code-Mixed Tweets
Qi Wu | Peng Wang | Chenghao Huang

Natural language processing (NLP) has been applied to various fields, including text classification and sentiment analysis. In the shared task on sentiment analysis of code-mixed tweets, which is part of the SemEval-2020 competition, we preprocess the datasets by, among other steps, replacing emoji and deleting uncommon characters, and then fine-tune Bidirectional Encoder Representations from Transformers (BERT) for best performance. After exhausting our top 3 submissions, our team MeisterMorxrc achieves an average F1 score of 0.730 in this task. Our CodaLab username is MeisterMorxrc.

pdf bib
WESSA at SemEval-2020 Task 9: Code-Mixed Sentiment Analysis Using Transformers
Ahmed Sultan | Mahmoud Salim | Amina Gaber | Islam El Hosary

In this paper, we describe our system submitted for SemEval 2020 Task 9, Sentiment Analysis for Code-Mixed Social Media Text, alongside other experiments. Our best-performing system is a transfer learning-based model that fine-tunes XLM-RoBERTa, a transformer-based multilingual masked language model, on monolingual English and Spanish data and Spanish-English code-mixed data. Our system outperforms the official task baseline by achieving a 70.1% average F1-Score on the official leaderboard using the test set. For later submissions, our system manages to achieve a 75.9% average F1-Score on the test set using CodaLab username ahmed0sultan.

pdf bib
Zyy1510 Team at SemEval-2020 Task 9: Sentiment Analysis for Code-Mixed Social Media Text with Sub-word Level Representations
Yueying Zhu | Xiaobing Zhou | Hongling Li | Kunjie Dong

This paper reports the zyy1510 team’s work in the International Workshop on Semantic Evaluation (SemEval-2020) shared task on Sentiment Analysis for Code-Mixed (Hindi-English, English-Spanish) Social Media Text. The purpose of this task is to determine the polarity of the text, assigning it one of three labels: positive, negative, or neutral. To achieve this goal, we propose an ensemble model of word n-gram-based Multinomial Naive Bayes (MNB) and sub-word level representations in LSTM (Sub-word LSTM) to identify the sentiments of Hindi-English and English-Spanish code-mixed data. This ensemble model combines the rich sequential patterns and the intermediate features after convolution from the LSTM model with the polarity of keywords from the MNB model to obtain the final sentiment score. We have tested our system on the Hindi-English and English-Spanish code-mixed social media data sets released for the task. Our model achieves an F1 score of 0.647 in the Hindi-English task and 0.682 in the English-Spanish task.

pdf bib
SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)
Marcos Zampieri | Preslav Nakov | Sara Rosenthal | Pepa Atanasova | Georgi Karadzhov | Hamdy Mubarak | Leon Derczynski | Zeses Pitenis | Çağrı Çöltekin

We present the results and the main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval-2020). The task included three subtasks corresponding to the hierarchical taxonomy of the OLID schema from OffensEval-2019, and it was offered in five languages: Arabic, Danish, English, Greek, and Turkish. OffensEval-2020 was one of the most popular tasks at SemEval-2020, attracting a large number of participants across all subtasks and languages: a total of 528 teams signed up to participate in the task, 145 teams submitted official runs on the test data, and 70 teams submitted system description papers.

pdf bib
Galileo at SemEval-2020 Task 12: Multi-lingual Learning for Offensive Language Identification Using Pre-trained Language Models
Shuohuan Wang | Jiaxiang Liu | Xuan Ouyang | Yu Sun

This paper describes Galileo’s performance in SemEval-2020 Task 12 on detecting and categorizing offensive language in social media. For offensive language identification, we proposed a multi-lingual method using the pre-trained language models ERNIE and XLM-R. For offensive language categorization, we proposed a knowledge distillation method trained on soft labels generated by several supervised models. Our team participated in all three sub-tasks. In Sub-task A (Offensive Language Identification), we ranked first in terms of average F1 score across all languages. We are also the only team that ranked among the top three in every language. We also took first place in Sub-task B (Automatic Categorization of Offense Types) and Sub-task C (Offence Target Identification).
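
To make the idea of training on soft labels concrete, here is a minimal sketch of a distillation-style loss. The abstract does not specify how the soft labels are formed or which loss is used, so averaging the teacher distributions and using cross-entropy against them are assumptions, and all names and numbers below are hypothetical.

```python
import math
from typing import List, Sequence

def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_labels(teacher_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the class distributions predicted by several teacher models."""
    n = len(teacher_probs)
    n_classes = len(teacher_probs[0])
    return [sum(p[c] for p in teacher_probs) / n for c in range(n_classes)]

def soft_cross_entropy(student_logits: Sequence[float], target: Sequence[float]) -> float:
    """Cross-entropy between the soft target and the student's distribution."""
    probs = softmax(student_logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)

# Hypothetical teacher outputs for one example over three classes.
teachers = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1], [0.6, 0.3, 0.1]]
target = soft_labels(teachers)

# A student that matches the target has lower loss than one that does not.
loss_matched = soft_cross_entropy([math.log(t) for t in target], target)
loss_mismatched = soft_cross_entropy([0.0, 0.0, 3.0], target)
```

Unlike a one-hot label, the soft target tells the student how plausible the non-winning classes are, which is the extra signal distillation relies on.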

pdf bib
Aschern at SemEval-2020 Task 11: It Takes Three to Tango: RoBERTa, CRF, and Transfer Learning
Anton Chernyavskiy | Dmitry Ilvovsky | Preslav Nakov

We describe our system for SemEval-2020 Task 11 on Detection of Propaganda Techniques in News Articles. We developed ensemble models using RoBERTa-based neural architectures, additional CRF layers, transfer learning between the two subtasks, and advanced post-processing to handle the multi-label nature of the task, the consistency between nested spans, repetitions, and labels from similar spans in training. We achieved sizable improvements over baseline fine-tuned RoBERTa models, and the official evaluation ranked our system 3rd (almost tied with the 2nd) out of 36 teams on the span identification subtask with an F1 score of 0.491, and 2nd (almost tied with the 1st) out of 31 teams on the technique classification subtask with an F1 score of 0.62.

pdf bib
AdelaideCyC at SemEval-2020 Task 12: Ensemble of Classifiers for Offensive Language Detection in Social Media
Mahen Herath | Thushari Atapattu | Hoang Anh Dung | Christoph Treude | Katrina Falkner

This paper describes the systems our team (AdelaideCyC) has developed for SemEval Task 12 (OffensEval 2020) to detect offensive language in social media. The challenge focuses on three subtasks: offensive language identification (subtask A), offense type identification (subtask B), and offense target identification (subtask C). Our team participated in all three subtasks. We have developed machine learning and deep learning-based ensembles of models. We achieved F1-scores of 0.906, 0.552, and 0.623 in subtasks A, B, and C, respectively. While our performance scores are promising for subtask A, the results demonstrate that subtasks B and C remain challenging to classify.

pdf bib
GruPaTo at SemEval-2020 Task 12: Retraining mBERT on Social Media and Fine-tuned Offensive Language Models
Davide Colla | Tommaso Caselli | Valerio Basile | Jelena Mitrović | Michael Granitzer

We introduce an approach to multilingual Offensive Language Detection based on the mBERT transformer model. We download extra training data from Twitter in English, Danish, and Turkish, and use it to re-train the model. We then fine-tune the model on the provided training data and, in some configurations, implement a transfer learning approach exploiting the typological relatedness between English and Danish. Our systems obtained good results across the three languages (.9036 for EN, .7619 for DA, and .7789 for TR).

pdf bib
GUIR at SemEval-2020 Task 12: Domain-Tuned Contextualized Models for Offensive Language Detection
Sajad Sotudeh | Tong Xiang | Hao-Ren Yao | Sean MacAvaney | Eugene Yang | Nazli Goharian | Ophir Frieder

Offensive language detection is an important and challenging task in natural language processing. We present our submissions to the OffensEval 2020 shared task, which includes three English sub-tasks: identifying the presence of offensive language (Sub-task A), identifying the presence of a target in offensive language (Sub-task B), and identifying the categories of the target (Sub-task C). Our experiments explore using a domain-tuned contextualized language model (namely, BERT) for this task. We also experiment with different components and configurations (e.g., a multi-view SVM) stacked upon BERT models for specific sub-tasks. Our submissions achieve F1 scores of 91.7% in Sub-task A, 66.5% in Sub-task B, and 63.2% in Sub-task C. We perform an ablation study which reveals that domain tuning considerably improves the classification performance. Furthermore, error analysis shows common misclassification errors made by our model and outlines directions for future research.

pdf bib
IIITG-ADBU at SemEval-2020 Task 12: Comparison of BERT and BiLSTM in Detecting Offensive Language
Arup Baruah | Kaushik Das | Ferdous Barbhuiya | Kuntal Dey

Task 12 of SemEval 2020 consisted of 3 subtasks, namely offensive language identification (Subtask A), categorization of offense type (Subtask B), and offense target identification (Subtask C). This paper presents the results our classifiers obtained for the English language in the 3 subtasks. The classifiers we used were BERT and BiLSTM. On the test set, our BERT classifier obtained a macro F1 score of 0.90707 for Subtask A and 0.65279 for Subtask B. The BiLSTM classifier obtained a macro F1 score of 0.57565 for Subtask C. The paper also analyzes the errors made by our classifiers. We conjecture that the presence of a few misleading instances in the dataset is affecting the performance of the classifiers. Our analysis also discusses the need for temporal context and world knowledge to determine the offensiveness of a few comments.
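
For readers unfamiliar with the metric, the following is a minimal sketch of how a macro F1 score is computed: F1 is calculated per class and then averaged with equal weight, so minority classes count as much as majority ones. The function name and the toy labels are hypothetical and are not taken from the task data.

```python
from typing import Hashable, Sequence

def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1 over all labels seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example with an imbalanced label distribution.
y_true = ["OFF", "OFF", "NOT", "NOT", "NOT", "NOT"]
y_pred = ["OFF", "NOT", "NOT", "NOT", "NOT", "OFF"]
score = macro_f1(y_true, y_pred)  # per-class F1: OFF = 0.5, NOT = 0.75
```

In this toy case accuracy is 4/6, while the macro F1 of 0.625 is pulled down by the weaker minority-class score, which is why the metric suits imbalanced offensive-language data.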

pdf bib
NUIG at SemEval-2020 Task 12 : Pseudo Labelling