European Chapter of the Association for Computational Linguistics (2021)



bib (full) Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

pdf bib
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume
Paola Merlo | Jorg Tiedemann | Reut Tsarfaty

pdf bib
Unsupervised Sentence-embeddings by Manifold Approximation and Projection
Subhradeep Kayal

The concept of unsupervised universal sentence encoders has gained traction recently, wherein pre-trained models generate effective task-agnostic fixed-dimensional representations for phrases, sentences and paragraphs. Such methods are of varying complexity, from simple weighted-averages of word vectors to complex language-models based on bidirectional transformers. In this work we propose a novel technique to generate sentence-embeddings in an unsupervised fashion by projecting the sentences onto a fixed-dimensional manifold with the objective of preserving local neighbourhoods in the original space. To delineate such neighbourhoods we experiment with several set-distance metrics, including the recently proposed Word Mover’s distance, while the fixed-dimensional projection is achieved by employing a scalable and efficient manifold approximation method rooted in topological data analysis. We test our approach, which we term EMAP or Embeddings by Manifold Approximation and Projection, on six publicly available text-classification datasets of varying size and complexity. Empirical results show that our method consistently performs similar to or better than several alternative state-of-the-art approaches.

pdf bib
Disambiguatory Signals are Stronger in Word-initial Positions
Tiago Pimentel | Ryan Cotterell | Brian Roark

Psycholinguistic studies of human word processing and lexical access provide ample evidence of the preferred nature of word-initial versus word-final segments, e.g., in terms of attention paid by listeners (greater) or the likelihood of reduction by speakers (lower). This has led to the conjectureas in Wedel et al. (2019b), but common elsewherethat languages have evolved to provide more information earlier in words than later. Information-theoretic methods to establish such tendencies in lexicons have suffered from several methodological shortcomings that leave open the question of whether this high word-initial informativeness is actually a property of the lexicon or simply an artefact of the incremental nature of recognition. In this paper, we point out the confounds in existing methods for comparing the informativeness of segments early in the word versus later in the word, and present several new measures that avoid these confounds. When controlling for these confounds, we still find evidence across hundreds of languages that indeed there is a cross-linguistic tendency to front-load information in words.

pdf bib
If you’ve got it, flaunt it : Making the most of fine-grained sentiment annotations
Jeremy Barnes | Lilja Øvrelid | Erik Velldal

Fine-grained sentiment analysis attempts to extract sentiment holders, targets and polar expressions and resolve the relationship between them, but progress has been hampered by the difficulty of annotation. Targeted sentiment analysis, on the other hand, is a more narrow task, focusing on extracting sentiment targets and classifying their polarity. In this paper, we explore whether incorporating holder and expression information can improve target extraction and classification and perform experiments on eight English datasets. We conclude that jointly predicting target and polarity BIO labels improves target extraction, and that augmenting the input text with gold expressions generally improves targeted polarity classification. This highlights the potential importance of annotating expressions for fine-grained sentiment datasets. At the same time, our results show that performance of current models for predicting polar expressions is poor, hampering the benefit of this information in practice.

pdf bib
Telling BERT’s Full Story : from Local Attention to Global AggregationBERT’s Full Story: from Local Attention to Global Aggregation
Damian Pascual | Gino Brunner | Roger Wattenhofer

We take a deep look into the behaviour of self-attention heads in the transformer architecture. In light of recent work discouraging the use of attention distributions for explaining a model’s behaviour, we show that attention distributions can nevertheless provide insights into the local behaviour of attention heads. This way, we propose a distinction between local patterns revealed by attention and global patterns that refer back to the input, and analyze BERT from both angles. We use gradient attribution to analyze how the output of an attention head depends on the input tokens, effectively extending the local attention-based analysis to account for the mixing of information throughout the transformer layers. We find that there is a significant mismatch between attention and attribution distributions, caused by the mixing of context inside the model. We quantify this discrepancy and observe that interestingly, there are some patterns that persist across all layers despite the mixing.

pdf bib
Maximal Multiverse Learning for Promoting Cross-Task Generalization of Fine-Tuned Language Models
Itzik Malkiel | Lior Wolf

Language modeling with BERT consists of two phases of (i) unsupervised pre-training on unlabeled text, and (ii) fine-tuning for a specific supervised task. We present a method that leverages the second phase to its fullest, by applying an extensive number of parallel classifier heads, which are enforced to be orthogonal, while adaptively eliminating the weaker heads during training. We conduct an extensive inter- and intra-dataset evaluation, showing that our method improves the generalization ability of BERT, sometimes leading to a +9 % gain in accuracy. These results highlight the importance of a proper fine-tuning procedure, especially for relatively smaller-sized datasets. Our code is attached as supplementary.

pdf bib
Dictionary-based Debiasing of Pre-trained Word Embeddings
Masahiro Kaneko | Danushka Bollegala

Word embeddings trained on large corpora have shown to encode high levels of unfair discriminatory gender, racial, religious and ethnic biases. In contrast, human-written dictionaries describe the meanings of words in a concise, objective and an unbiased manner. We propose a method for debiasing pre-trained word embeddings using dictionaries, without requiring access to the original training resources or any knowledge regarding the word embedding algorithms used. Unlike prior work, our proposed method does not require the types of biases to be pre-defined in the form of word lists, and learns the constraints that must be satisfied by unbiased word embeddings automatically from dictionary definitions of the words. Specifically, we learn an encoder to generate a debiased version of an input word embedding such that it (a) retains the semantics of the pre-trained word embedding, (b) agrees with the unbiased definition of the word according to the dictionary, and (c) remains orthogonal to the vector space spanned by any biased basis vectors in the pre-trained word embedding space. Experimental results on standard benchmark datasets show that the proposed method can accurately remove unfair biases encoded in pre-trained word embeddings, while preserving useful semantics.

pdf bib
Non-Autoregressive Text Generation with Pre-trained Language Models
Yixuan Su | Deng Cai | Yan Wang | David Vandyke | Simon Baker | Piji Li | Nigel Collier

Non-autoregressive generation (NAG) has recently attracted great attention due to its fast inference speed. However, the generation quality of existing NAG models still lags behind their autoregressive counterparts. In this work, we show that BERT can be employed as the backbone of a NAG model for a greatly improved performance. Additionally, we devise two mechanisms to alleviate the two common problems of vanilla NAG models : the inflexibility of prefixed output length and the conditional independence of individual token predictions. To further strengthen the speed advantage of the proposed model, we propose a new decoding strategy, ratio-first, for applications where the output lengths can be approximately estimated beforehand. For a comprehensive evaluation, we test the proposed model on three text generation tasks, including text summarization, sentence compression and machine translation. Experimental results show that our model significantly outperforms existing non-autoregressive baselines and achieves competitive performance with many strong autoregressive models. In addition, we also conduct extensive analysis experiments to reveal the effect of each proposed component.

pdf bib
CD2CR : Co-reference resolution across documents and domainsCDˆ2CR: Co-reference resolution across documents and domains
James Ravenscroft | Amanda Clare | Arie Cattan | Ido Dagan | Maria Liakata

Cross-document co-reference resolution (CDCR) is the task of identifying and linking mentions to entities and concepts across many text documents. Current state-of-the-art models for this task assume that all documents are of the same type (e.g. news articles) or fall under the same theme. However, it is also desirable to perform CDCR across different domains (type or theme). A particular use case we focus on in this paper is the resolution of entities mentioned across scientific work and newspaper articles that discuss them. Identifying the same entities and corresponding concepts in both scientific articles and news can help scientists understand how their work is represented in mainstream media. We propose a new task and English language dataset for cross-document cross-domain co-reference resolution (CD2CR). The task aims to identify links between entities across heterogeneous document types. We show that in this cross-domain, cross-document setting, existing CDCR models do not perform well and we provide a baseline model that outperforms current state-of-the-art CDCR models on CD2CR. Our data set, annotation tool and guidelines as well as our model for cross-document cross-domain co-reference are all supplied as open access open source resources.

pdf bib
Recipes for Building an Open-Domain Chatbot
Stephen Roller | Emily Dinan | Naman Goyal | Da Ju | Mary Williamson | Yinhan Liu | Jing Xu | Myle Ott | Eric Michael Smith | Y-Lan Boureau | Jason Weston

Building open-domain chatbots is a challenging area for machine learning research. While prior work has shown that scaling neural models in the number of parameters and the size of the data they are trained on gives improved results, we highlight other ingredients. Good conversation requires blended skills : providing engaging talking points, and displaying knowledge, empathy and personality appropriately, while maintaining a consistent persona. We show that large scale models can learn these skills when given appropriate training data and choice of generation strategy. We build variants of these recipes with 90 M, 2.7B and 9.4B parameter models, and make our models and code publicly available. Human evaluations show our best models outperform existing approaches in multi-turn dialogue on engagingness and humanness measurements. We then discuss the limitations of this work by analyzing failure cases of our models.

pdf bib
Evaluating the Evaluation of Diversity in Natural Language Generation
Guy Tevet | Jonathan Berant

Despite growing interest in natural language generation (NLG) models that produce diverse outputs, there is currently no principled method for evaluating the diversity of an NLG system. In this work, we propose a framework for evaluating diversity metrics. The framework measures the correlation between a proposed diversity metric and a diversity parameter, a single parameter that controls some aspect of diversity in generated text. For example, a diversity parameter might be a binary variable used to instruct crowdsourcing workers to generate text with either low or high content diversity. We demonstrate the utility of our framework by : (a) establishing best practices for eliciting diversity judgments from humans, (b) showing that humans substantially outperform automatic metrics in estimating content diversity, and (c) demonstrating that existing methods for controlling diversity by tuning a decoding parameter mostly affect form but not meaning. Our framework can advance the understanding of different diversity metrics, an essential step on the road towards better NLG systems.metrics. The framework measures the correlation between a proposed diversity metric and a diversity parameter, a single parameter that controls some aspect of diversity in generated text. For example, a diversity parameter might be a binary variable used to instruct crowdsourcing workers to generate text with either low or high content diversity. We demonstrate the utility of our framework by: (a) establishing best practices for eliciting diversity judgments from humans, (b) showing that humans substantially outperform automatic metrics in estimating content diversity, and (c) demonstrating that existing methods for controlling diversity by tuning a “decoding parameter” mostly affect form but not meaning. Our framework can advance the understanding of different diversity metrics, an essential step on the road towards better NLG systems.

pdf bib
Retrieval, Re-ranking and Multi-task Learning for Knowledge-Base Question Answering
Zhiguo Wang | Patrick Ng | Ramesh Nallapati | Bing Xiang

Question answering over knowledge bases (KBQA) usually involves three sub-tasks, namely topic entity detection, entity linking and relation detection. Due to the large number of entities and relations inside knowledge bases (KB), previous work usually utilized sophisticated rules to narrow down the search space and managed only a subset of KBs in memory. In this work, we leverage a retrieve-and-rerank framework to access KBs via traditional information retrieval (IR) method, and re-rank retrieved candidates with more powerful neural networks such as the pre-trained BERT model. Considering the fact that directly assigning a different BERT model for each sub-task may incur prohibitive costs, we propose to share a BERT encoder across all three sub-tasks and define task-specific layers on top of the shared layer. The unified model is then trained under a multi-task learning framework. Experiments show that : (1) Our IR-based retrieval method is able to collect high-quality candidates efficiently, thus enables our method adapt to large-scale KBs easily ; (2) the BERT model improves the accuracy across all three sub-tasks ; and (3) benefiting from multi-task learning, the unified model obtains further improvements with only 1/3 of the original parameters. Our final model achieves competitive results on the SimpleQuestions dataset and superior performance on the FreebaseQA dataset.retrieve-and-rerank framework to access KBs via traditional information retrieval (IR) method, and re-rank retrieved candidates with more powerful neural networks such as the pre-trained BERT model. Considering the fact that directly assigning a different BERT model for each sub-task may incur prohibitive costs, we propose to share a BERT encoder across all three sub-tasks and define task-specific layers on top of the shared layer. The unified model is then trained under a multi-task learning framework. Experiments show that: (1) Our IR-based retrieval method is able to collect high-quality candidates efficiently, thus enables our method adapt to large-scale KBs easily; (2) the BERT model improves the accuracy across all three sub-tasks; and (3) benefiting from multi-task learning, the unified model obtains further improvements with only 1/3 of the original parameters. Our final model achieves competitive results on the SimpleQuestions dataset and superior performance on the FreebaseQA dataset.

pdf bib
Implicitly Abusive Comparisons A New Dataset and Linguistic Analysis
Michael Wiegand | Maja Geulig | Josef Ruppenhofer

We examine the task of detecting implicitly abusive comparisons (e.g. Your hair looks like you have been electrocuted). Implicitly abusive comparisons are abusive comparisons in which abusive words (e.g. dumbass or scum) are absent. We detail the process of creating a novel dataset for this task via crowdsourcing that includes several measures to obtain a sufficiently representative and unbiased set of comparisons. We also present classification experiments that include a range of linguistic features that help us better understand the mechanisms underlying abusive comparisons.

pdf bib
A Systematic Review of Reproducibility Research in Natural Language Processing
Anya Belz | Shubham Agarwal | Anastasia Shimorina | Ehud Reiter

Against the background of what has been termed a reproducibility crisis in science, the NLP field is becoming increasingly interested in, and conscientious about, the reproducibility of its results. The past few years have seen an impressive range of new initiatives, events and active research in the area. However, the field is far from reaching a consensus about how reproducibility should be defined, measured and addressed, with diversity of views currently increasing rather than converging. With this focused contribution, we aim to provide a wide-angle, and as near as possible complete, snapshot of current work on reproducibility in NLP,

pdf bib
Semantic Oppositeness Assisted Deep Contextual Modeling for Automatic Rumor Detection in Social Networks
Nisansa de Silva | Dejing Dou

Social networks face a major challenge in the form of rumors and fake news, due to their intrinsic nature of connecting users to millions of others, and of giving any individual the power to post anything. Given the rapid, widespread dissemination of information in social networks, manually detecting suspicious news is sub-optimal. Thus, research on automatic rumor detection has become a necessity. Previous works in the domain have utilized the reply relations between posts, as well as the semantic similarity between the main post and its context, consisting of replies, in order to obtain state-of-the-art performance. In this work, we demonstrate that semantic oppositeness can improve the performance on the task of rumor detection. We show that semantic oppositeness captures elements of discord, which are not properly covered by previous efforts, which only utilize semantic similarity or reply structure. We show, with extensive experiments on recent data sets for this problem, that our proposed model achieves state-of-the-art performance. Further, we show that our model is more resistant to the variances in performance introduced by randomness.

pdf bib
Polarized-VAE : Proximity Based Disentangled Representation Learning for Text GenerationVAE: Proximity Based Disentangled Representation Learning for Text Generation
Vikash Balasubramanian | Ivan Kobyzev | Hareesh Bahuleyan | Ilya Shapiro | Olga Vechtomova

Learning disentangled representations of realworld data is a challenging open problem. Most previous methods have focused on either supervised approaches which use attribute labels or unsupervised approaches that manipulate the factorization in the latent space of models such as the variational autoencoder (VAE) by training with task-specific losses. In this work, we propose polarized-VAE, an approach that disentangles select attributes in the latent space based on proximity measures reflecting the similarity between data points with respect to these attributes. We apply our method to disentangle the semantics and syntax of sentences and carry out transfer experiments. Polarized-VAE outperforms the VAE baseline and is competitive with state-of-the-art approaches, while being more a general framework that is applicable to other attribute disentanglement tasks.

pdf bib
FEWS : Large-Scale, Low-Shot Word Sense Disambiguation with the DictionaryFEWS: Large-Scale, Low-Shot Word Sense Disambiguation with the Dictionary
Terra Blevins | Mandar Joshi | Luke Zettlemoyer

Current models for Word Sense Disambiguation (WSD) struggle to disambiguate rare senses, despite reaching human performance on global WSD metrics. This stems from a lack of data for both modeling and evaluating rare senses in existing WSD datasets. In this paper, we introduce FEWS (Few-shot Examples of Word Senses), a new low-shot WSD dataset automatically extracted from example sentences in Wiktionary. FEWS has high sense coverage across different natural language domains and provides : (1) a large training set that covers many more senses than previous datasets and (2) a comprehensive evaluation set containing few- and zero-shot examples of a wide variety of senses. We establish baselines on FEWS with knowledge-based and neural WSD approaches and present transfer learning experiments demonstrating that models additionally trained with FEWS better capture rare senses in existing WSD datasets. Finally, we find humans outperform the best baseline models on FEWS, indicating that FEWS will support significant future work on low-shot WSD.

pdf bib
MONAH : Multi-Modal Narratives for Humans to analyze conversationsMONAH: Multi-Modal Narratives for Humans to analyze conversations
Joshua Y. Kim | Kalina Yacef | Greyson Kim | Chunfeng Liu | Rafael Calvo | Silas Taylor

In conversational analyses, humans manually weave multimodal information into the transcripts, which is significantly time-consuming. We introduce a system that automatically expands the verbatim transcripts of video-recorded conversations using multimodal data streams. This system uses a set of preprocessing rules to weave multimodal annotations into the verbatim transcripts and promote interpretability. Our feature engineering contributions are two-fold : firstly, we identify the range of multimodal features relevant to detect rapport-building ; secondly, we expand the range of multimodal annotations and show that the expansion leads to statistically significant improvements in detecting rapport-building.

pdf bib
Does Typological Blinding Impede Cross-Lingual Sharing?
Johannes Bjerva | Isabelle Augenstein

Bridging the performance gap between high- and low-resource languages has been the focus of much previous work. Typological features from databases such as the World Atlas of Language Structures (WALS) are a prime candidate for this, as such data exists even for very low-resource languages. However, previous work has only found minor benefits from using typological information. Our hypothesis is that a model trained in a cross-lingual setting will pick up on typological cues from the input data, thus overshadowing the utility of explicitly using such features. We verify this hypothesis by blinding a model to typological information, and investigate how cross-lingual sharing and performance is impacted. Our model is based on a cross-lingual architecture in which the latent weights governing the sharing between languages is learnt during training. We show that (i) preventing this model from exploiting typology severely reduces performance, while a control experiment reaffirms that (ii) encouraging sharing according to typology somewhat improves performance.

pdf bib
Improving Factual Consistency Between a Response and Persona Facts
Mohsen Mesgar | Edwin Simpson | Iryna Gurevych

Neural models for response generation produce responses that are semantically plausible but not necessarily factually consistent with facts describing the speaker’s persona. These models are trained with fully supervised learning where the objective function barely captures factual consistency. We propose to fine-tune these models by reinforcement learning and an efficient reward function that explicitly captures the consistency between a response and persona facts as well as semantic plausibility. Our automatic and human evaluations on the PersonaChat corpus confirm that our approach increases the rate of responses that are factually consistent with persona facts over its supervised counterpart while retains the language quality of responses.

pdf bib
PolyLM : Learning about Polysemy through Language ModelingPolyLM: Learning about Polysemy through Language Modeling
Alan Ansell | Felipe Bravo-Marquez | Bernhard Pfahringer

To avoid the meaning conflation deficiency of word embeddings, a number of models have aimed to embed individual word senses. These methods at one time performed well on tasks such as word sense induction (WSI), but they have since been overtaken by task-specific techniques which exploit contextualized embeddings. However, sense embeddings and contextualization need not be mutually exclusive. We introduce PolyLM, a method which formulates the task of learning sense embeddings as a language modeling problem, allowing contextualization techniques to be applied. PolyLM is based on two underlying assumptions about word senses : firstly, that the probability of a word occurring in a given context is equal to the sum of the probabilities of its individual senses occurring ; and secondly, that for a given occurrence of a word, one of its senses tends to be much more plausible in the context than the others. We evaluate PolyLM on WSI, showing that it performs considerably better than previous sense embedding techniques, and matches the current state-of-the-art specialized WSI method despite having six times fewer parameters. Code and pre-trained models are available at

pdf bib
Cross-lingual Entity Alignment with Incidental Supervision
Muhao Chen | Weijia Shi | Ben Zhou | Dan Roth

Much research effort has been put to multilingual knowledge graph (KG) embedding methods to address the entity alignment task, which seeks to match entities in different languagespecific KGs that refer to the same real-world object. Such methods are often hindered by the insufficiency of seed alignment provided between KGs. Therefore, we propose a new model, JEANS, which jointly represents multilingual KGs and text corpora in a shared embedding scheme, and seeks to improve entity alignment with incidental supervision signals from text. JEANS first deploys an entity grounding process to combine each KG with the monolingual text corpus. Then, two learning processes are conducted : (i) an embedding learning process to encode the KG and text of each language in one embedding space, and (ii) a self-learning based alignment learning process to iteratively induce the correspondence of entities and that of lexemes between embeddings. Experiments on benchmark datasets show that JEANS leads to promising improvement on entity alignment with incidental supervision, and significantly outperforms state-of-the-art methods that solely rely on internal information of KGs.

pdf bib
Query Generation for Multimodal Documents
Kyungho Kim | Kyungjae Lee | Seung-won Hwang | Young-In Song | Seungwook Lee

This paper studies the problem of generatinglikely queries for multimodal documents withimages. Our application scenario is enablingefficient first-stage retrieval of relevant doc-uments, by attaching generated queries to doc-uments before indexing. We can then indexthis expanded text to efficiently narrow downto candidate matches using inverted index, sothat expensive reranking can follow. Our eval-uation results show that our proposed multi-modal representation meaningfully improvesrelevance ranking. More importantly, ourframework can achieve the state of the art inthe first stage retrieval scenarios

pdf bib
End-to-End Argument Mining as Biaffine Dependency Parsing
Yuxiao Ye | Simone Teufel

Non-neural approaches to argument mining (AM) are often pipelined and require heavy feature-engineering. In this paper, we propose a neural end-to-end approach to AM which is based on dependency parsing, in contrast to the current state-of-the-art which relies on relation extraction. Our biaffine AM dependency parser significantly outperforms the state-of-the-art, performing at F1 = 73.5 % for component identification and F1 = 46.4 % for relation identification. One of the advantages of treating AM as biaffine dependency parsing is the simple neural architecture that results. The idea of treating AM as dependency parsing is not new, but has previously been abandoned as it was lagging far behind the state-of-the-art. In a thorough analysis, we investigate the factors that contribute to the success of our model : the biaffine model itself, our representation for the dependency structure of arguments, different encoders in the biaffine model, and syntactic information additionally fed to the model. Our work demonstrates that dependency parsing for AM, an overlooked idea from the past, deserves more attention in the future.

pdf bib
CTC-based Compression for Direct Speech TranslationCTC-based Compression for Direct Speech Translation
Marco Gaido | Mauro Cettolo | Matteo Negri | Marco Turchi

Previous studies demonstrated that a dynamic phone-informed compression of the input audio is beneficial for speech translation (ST). However, they required a dedicated model for phone recognition and did not test this solution for direct ST, in which a single model translates the input audio into the target language without intermediate representations. In this work, we propose the first method able to perform a dynamic compression of the input in direct ST models. In particular, we exploit the Connectionist Temporal Classification (CTC) to compress the input sequence according to its phonetic characteristics. Our experiments demonstrate that our solution brings a 1.3-1.5 BLEU improvement over a strong baseline on two language pairs (English-Italian and English-German), contextually reducing the memory footprint by more than 10 %.

pdf bib
Top-down Discourse Parsing via Sequence Labelling
Fajri Koto | Jey Han Lau | Timothy Baldwin

We introduce a top-down approach to discourse parsing that is conceptually simpler than its predecessors (Kobayashi et al., 2020 ; Zhang et al., 2020). By framing the task as a sequence labelling problem where the goal is to iteratively segment a document into individual discourse units, we are able to eliminate the decoder and reduce the search space for splitting points. We explore both traditional recurrent models and modern pre-trained transformer models for the task, and additionally introduce a novel dynamic oracle for top-down parsing. Based on the Full metric, our proposed LSTM model sets a new state-of-the-art for RST parsing.

pdf bib
Neural Data-to-Text Generation with LM-based Text AugmentationLM-based Text Augmentation
Ernie Chang | Xiaoyu Shen | Dawei Zhu | Vera Demberg | Hui Su

For many new application domains for data-to-text generation, the main obstacle in training neural models consists of a lack of training data. While usually large numbers of instances are available on the data side, often only very few text samples are available. To address this problem, we here propose a novel few-shot approach for this setting. Our approach automatically augments the data available for training by (i) generating new text samples based on replacing specific values by alternative ones from the same category, (ii) generating new text samples based on GPT-2, and (iii) proposing an automatic method for pairing the new text samples with data samples. As the text augmentation can introduce noise to the training data, we use cycle consistency as an objective, in order to make sure that a given data sample can be correctly reconstructed after having been formulated as text (and that text samples can be reconstructed from data). On both the E2E and WebNLG benchmarks, we show that this weakly supervised training paradigm is able to outperform fully supervised sequence-to-sequence models with less than 10 % of the training set. By utilizing all annotated data, our model can boost the performance of a standard sequence-to-sequence model by over 5 BLEU points, establishing a new state-of-the-art on both datasets.

pdf bib
Self-Training Pre-Trained Language Models for Zero- and Few-Shot Multi-Dialectal Arabic Sequence LabelingArabic Sequence Labeling
Muhammad Khalifa | Muhammad Abdul-Mageed | Khaled Shaalan

A sufficient amount of annotated data is usually required to fine-tune pre-trained language models for downstream tasks. Unfortunately, attaining labeled data can be costly, especially for multiple language varieties and dialects. We propose to self-train pre-trained language models in zero- and few-shot scenarios to improve performance on data-scarce varieties using only resources from data-rich ones. We demonstrate the utility of our approach in the context of Arabic sequence labeling by using a language model fine-tuned on Modern Standard Arabic (MSA) only to predict named entities (NE) and part-of-speech (POS) tags on several dialectal Arabic (DA) varieties. We show that self-training is indeed powerful, improving zero-shot MSA-to-DA transfer by as large as 10 % F_1 (NER) and 2 % accuracy (POS tagging). We acquire even better performance in few-shot scenarios with limited amounts of labeled data. We conduct an ablation study and show that the performance boost observed directly results from training data augmentation possible with DA examples via self-training. This opens up opportunities for developing DA models exploiting only MSA resources. Our approach can also be extended to other languages and tasks._1 (NER) and 2% accuracy (POS tagging). We acquire even better performance in few-shot scenarios with limited amounts of labeled data. We conduct an ablation study and show that the performance boost observed directly results from training data augmentation possible with DA examples via self-training. This opens up opportunities for developing DA models exploiting only MSA resources. Our approach can also be extended to other languages and tasks.

pdf bib
Coordinate Constructions in English Enhanced Universal Dependencies : Analysis and Computational ModelingEnglish Enhanced Universal Dependencies: Analysis and Computational Modeling
Stefan Grünewald | Prisca Piccirilli | Annemarie Friedrich

In this paper, we address the representation of coordinate constructions in Enhanced Universal Dependencies (UD), where relevant dependency links are propagated from conjunction heads to other conjuncts. English treebanks for enhanced UD have been created from gold basic dependencies using a heuristic rule-based converter, which propagates only core arguments. With the aim of determining which set of links should be propagated from a semantic perspective, we create a large-scale dataset of manually edited syntax graphs. We identify several systematic errors in the original data, and propose to also propagate adjuncts. We observe high inter-annotator agreement for this semantic annotation task. Using our new manually verified dataset, we perform the first principled comparison of rule-based and (partially novel) machine-learning based methods for conjunction propagation for English. We show that learning propagation rules is more effective than hand-designing heuristic rules. When using automatic parses, our neural graph-parser based edge predictor outperforms the currently predominant pipelines using a basic-layer tree parser plus converters.

pdf bib
Continuous Learning in Neural Machine Translation using Bilingual Dictionaries
Jan Niehues

While recent advances in deep learning led to significant improvements in machine translation, neural machine translation is often still not able to continuously adapt to the environment. For humans, as well as for machine translation, bilingual dictionaries are a promising knowledge source to continuously integrate new knowledge. However, their exploitation poses several challenges : The system needs to be able to perform one-shot learning as well as model the morphology of source and target language. In this work, we proposed an evaluation framework to assess the ability of neural machine translation to continuously learn new phrases. We integrate one-shot learning methods for neural machine translation with different word representations and show that it is important to address both in order to successfully make use of bilingual dictionaries. By addressing both challenges we are able to improve the ability to translate new, rare words and phrases from 30 % to up to 70 %. The correct lemma is even generated by more than 90 %.

pdf bib
Adv-OLM : Generating Textual Adversaries via OLMOLM: Generating Textual Adversaries via OLM
Vijit Malik | Ashwani Bhat | Ashutosh Modi

Deep learning models are susceptible to adversarial examples that have imperceptible perturbations in the original input, resulting in adversarial attacks against these models. Analysis of these attacks on the state of the art transformers in NLP can help improve the robustness of these models against such adversarial inputs. In this paper, we present Adv-OLM, a black-box attack method that adapts the idea of Occlusion and Language Models (OLM) to the current state of the art attack methods. OLM is used to rank words of a sentence, which are later substituted using word replacement strategies. We experimentally show that our approach outperforms other attack methods for several text classification tasks.

pdf bib
Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering
Gautier Izacard | Edouard Grave

Generative models for open domain question answering have proven to be competitive, without resorting to external knowledge. While promising, this approach requires to use models with billions of parameters, which are expensive to train and query. In this paper, we investigate how much these models can benefit from retrieving text passages, potentially containing evidence. We obtain state-of-the-art results on the Natural Questions and TriviaQA open benchmarks. Interestingly, we observe that the performance of this method significantly improves when increasing the number of retrieved passages. This is evidence that sequence-to-sequence models offers a flexible framework to efficiently aggregate and combine evidence from multiple passages.

pdf bib
A Neural Few-Shot Text Classification Reality Check
Thomas Dopierre | Christophe Gravier | Wilfried Logerais

Modern classification models tend to struggle when the amount of annotated data is scarce. To overcome this issue, several neural few-shot classification models have emerged, yielding significant progress over time, both in Computer Vision and Natural Language Processing. In the latter, such models used to rely on fixed word embeddings, before the advent of transformers. Additionally, some models used in Computer Vision are yet to be tested in NLP applications. In this paper, we compare all these models, first adapting those made in the field of image processing to NLP, and second providing them access to transformers. We then test these models equipped with the same transformer-based encoder on the intent detection task, known for having a large amount of classes. Our results reveal that while methods perform almost equally on the ARSC dataset, this is not the case for the Intent Detection task, where most recent and supposedly best competitors perform worse than older and simpler ones (while all are are given access to transformers). We also show that a simple baseline is surprisingly strong. All the new developed models as well as the evaluation framework are made publicly available.

pdf bib
Multilingual Machine Translation : Closing the Gap between Shared and Language-specific Encoder-Decoders
Carlos Escolano | Marta R. Costa-jussà | José A. R. Fonollosa | Mikel Artetxe

State-of-the-art multilingual machine translation relies on a universal encoder-decoder, which requires retraining the entire system to add new languages. In this paper, we propose an alternative approach that is based on language-specific encoder-decoders, and can thus be more easily extended to new languages by learning their corresponding modules. So as to encourage a common interlingua representation, we simultaneously train the N initial languages. Our experiments show that the proposed approach outperforms the universal encoder-decoder by 3.28 BLEU points on average, while allowing to add new languages without the need to retrain the rest of the modules. All in all, our work closes the gap between shared and language-specific encoderdecoders, advancing toward modular multilingual machine translation systems that can be flexibly extended in lifelong learning settings.

pdf bib
Identifying Named Entities as they are Typed
Ravneet Arora | Chen-Tse Tsai | Daniel Preotiuc-Pietro

Identifying named entities in written text is an essential component of the text processing pipeline used in applications such as text editors to gain a better understanding of the semantics of the text. However, the typical experimental setup for evaluating Named Entity Recognition (NER) systems is not directly applicable to systems that process text in real time as the text is being typed. Evaluation is performed on a sentence level assuming the end-user is willing to wait until the entire sentence is typed for entities to be identified and further linked to identifiers or co-referenced. We introduce a novel experimental setup for NER systems for applications where decisions about named entity boundaries need to be performed in an online fashion. We study how state-of-the-art methods perform under this setup in multiple languages and propose adaptations to these models to suit this new experimental setup. Experimental results show that the best systems that are evaluated on each token after its typed, reach performance within 15 F1 points of systems that are evaluated at the end of the sentence. These show that entity recognition can be performed in this setup and open up the development of other NLP tools in a similar setup.

pdf bib
SANDI : Story-and-Images AlignmentSANDI: Story-and-Images Alignment
Sreyasi Nag Chowdhury | Simon Razniewski | Gerhard Weikum

The Internet contains a multitude of social media posts and other of stories where text is interspersed with images. In these contexts, images are not simply used for general illustration, but are judiciously placed in certain spots of a story for multimodal descriptions and narration. In this work we analyze the problem of text-image alignment, and present SANDI, a methodology for automatically selecting images from an image collection and aligning them with text paragraphs of a story. SANDI combines visual tags, user-provided tags and background knowledge, and uses an Integer Linear Program to compute alignments that are semantically meaningful. Experiments show that SANDI can select and align images with texts with high quality of semantic fit.

pdf bib
El Volumen Louder Por Favor : Code-switching in Task-oriented Semantic Parsing
Arash Einolghozati | Abhinav Arora | Lorena Sainz-Maza Lecanda | Anuj Kumar | Sonal Gupta

Being able to parse code-switched (CS) utterances, such as Spanish+English or Hindi+English, is essential to democratize task-oriented semantic parsing systems for certain locales. In this work, we focus on Spanglish (Spanish+English) and release a dataset, CSTOP, containing 5800 CS utterances alongside their semantic parses. We examine the CS generalizability of various Cross-lingual (XL) models and exhibit the advantage of pre-trained XL language models when data for only one language is present. As such, we focus on improving the pre-trained models for the case when only English corpus alongside either zero or a few CS training instances are available. We propose two data augmentation methods for the zero-shot and the few-shot settings : fine-tune using translate-and-align and augment using a generation model followed by match-and-filter. Combining the few-shot setting with the above improvements decreases the initial 30-point accuracy gap between the zero-shot and the full-data settings by two thirds.

pdf bib
Generating Syntactically Controlled Paraphrases without Using Annotated Parallel Pairs
Kuan-Hao Huang | Kai-Wei Chang

Paraphrase generation plays an essential role in natural language process (NLP), and it has many downstream applications. However, training supervised paraphrase models requires many annotated paraphrase pairs, which are usually costly to obtain. On the other hand, the paraphrases generated by existing unsupervised approaches are usually syntactically similar to the source sentences and are limited in diversity. In this paper, we demonstrate that it is possible to generate syntactically various paraphrases without the need for annotated paraphrase pairs. We propose Syntactically controlled Paraphrase Generator (SynPG), an encoder-decoder based model that learns to disentangle the semantics and the syntax of a sentence from a collection of unannotated texts. The disentanglement enables SynPG to control the syntax of output paraphrases by manipulating the embedding in the syntactic space. Extensive experiments using automatic metrics and human evaluation show that SynPG performs better syntactic control than unsupervised baselines, while the quality of the generated paraphrases is competitive. We also demonstrate that the performance of SynPG is competitive or even better than supervised models when the unannotated data is large. Finally, we show that the syntactically controlled paraphrases generated by SynPG can be utilized for data augmentation to improve the robustness of NLP models.

pdf bib
Data Augmentation for Hypernymy Detection
Thomas Kober | Julie Weeds | Lorenzo Bertolini | David Weir

The automatic detection of hypernymy relationships represents a challenging problem in NLP. The successful application of state-of-the-art supervised approaches using distributed representations has generally been impeded by the limited availability of high quality training data. We have developed two novel data augmentation techniques which generate new training examples from existing ones. First, we combine the linguistic principles of hypernym transitivity and intersective modifier-noun composition to generate additional pairs of vectors, such as small dog-dog or small dog-animal, for which a hypernymy relationship can be assumed. Second, we use generative adversarial networks (GANs) to generate pairs of vectors for which the hypernymy relation can also be assumed. We furthermore present two complementary strategies for extending an existing dataset by leveraging linguistic resources such as WordNet. Using an evaluation across 3 different datasets for hypernymy detection and 2 different vector spaces, we demonstrate that both of the proposed automatic data augmentation and dataset extension strategies substantially improve classifier performance.

pdf bib
Few-shot learning through contextual data augmentation
Farid Arthaud | Rachel Bawden | Alexandra Birch

Machine translation (MT) models used in industries with constantly changing topics, such as translation or news agencies, need to adapt to new data to maintain their performance over time. Our aim is to teach a pre-trained MT model to translate previously unseen words accurately, based on very few examples. We propose (i) an experimental setup allowing us to simulate novel vocabulary appearing in human-submitted translations, and (ii) corresponding evaluation metrics to compare our approaches. We extend a data augmentation approach using a pretrained language model to create training examples with similar contexts for novel words. We compare different fine-tuning and data augmentation approaches and show that adaptation on the scale of one to five examples is possible. Combining data augmentation with randomly selected training sentences leads to the highest BLEU score and accuracy improvements. Impressively, with only 1 to 5 examples, our model reports better accuracy scores than a reference system trained with on average 313 parallel examples.

pdf bib
Zero-shot Generalization in Dialog State Tracking through Generative Question Answering
Shuyang Li | Jin Cao | Mukund Sridhar | Henghui Zhu | Shang-Wen Li | Wael Hamza | Julian McAuley

Dialog State Tracking (DST), an integral part of modern dialog systems, aims to track user preferences and constraints (slots) in task-oriented dialogs. In real-world settings with constantly changing services, DST systems must generalize to new domains and unseen slot types. Existing methods for DST do not generalize well to new slot names and many require known ontologies of slot types and values for inference. We introduce a novel ontology-free framework that supports natural language queries for unseen constraints and slots in multi-domain task-oriented dialogs. Our approach is based on generative question-answering using a conditional language model pre-trained on substantive English sentences. Our model improves joint goal accuracy in zero-shot domain adaptation settings by up to 9 % (absolute) over the previous state-of-the-art on the MultiWOZ 2.1 dataset.

pdf bib
MIDAS : A Dialog Act Annotation Scheme for Open Domain HumanMachine Spoken ConversationsMIDAS: A Dialog Act Annotation Scheme for Open Domain HumanMachine Spoken Conversations
Dian Yu | Zhou Yu

Dialog act prediction in open-domain conversations is an essential language comprehension task for both dialog system building and discourse analysis. Previous dialog act schemes, such as SWBD-DAMSL, are designed mainly for discourse analysis in human-human conversations. In this paper, we present a dialog act annotation scheme, MIDAS (Machine Interaction Dialog Act Scheme), targeted at open-domain human-machine conversations. MIDAS is designed to assist machines to improve their ability to understand human partners. MIDAS has a hierarchical structure and supports multi-label annotations. We collected and annotated a large open-domain human-machine spoken conversation dataset (consisting of 24 K utterances). To validate our scheme, we leveraged transfer learning methods to train a multi-label dialog act prediction model and reached an F1 score of 0.79.

pdf bib
Detecting Extraneous Content in Podcasts
Sravana Reddy | Yongze Yu | Aasish Pappu | Aswin Sivaraman | Rezvaneh Rezapour | Rosie Jones

Podcast episodes often contain material extraneous to the main content, such as advertisements, interleaved within the audio and the written descriptions. We present classifiers that leverage both textual and listening patterns in order to detect such content in podcast descriptions and audio transcripts. We demonstrate that our models are effective by evaluating them on the downstream task of podcast summarization and show that we can substantively improve ROUGE scores and reduce the extraneous content generated in the summaries.

pdf bib
Joint Learning of Representations for Web-tables, Entities and Types using Graph Convolutional Network
Aniket Pramanick | Indrajit Bhattacharya

Existing approaches for table annotation with entities and types either capture the structure of table using graphical models, or learn embeddings of table entries without accounting for the complete syntactic structure. We propose TabGCN, that uses Graph Convolutional Networks to capture the complete structure of tables, knowledge graph and the training annotations, and jointly learns embeddings for table elements as well as the entities and types. To account for knowledge incompleteness, TabGCN’s embeddings can be used to discover new entities and types. Using experiments on 5 benchmark datasets, we show that TabGCN significantly outperforms multiple state-of-the-art baselines for table annotation, while showing promising performance on downstream table-related applications.

pdf bib
ECOL-R : Encouraging Copying in Novel Object Captioning with Reinforcement LearningECOL-R: Encouraging Copying in Novel Object Captioning with Reinforcement Learning
Yufei Wang | Ian Wood | Stephen Wan | Mark Johnson

Novel Object Captioning is a zero-shot Image Captioning task requiring describing objects not seen in the training captions, but for which information is available from external object detectors. The key challenge is to select and describe all salient detected novel objects in the input images. In this paper, we focus on this challenge and propose the ECOL-R model (Encouraging Copying of Object Labels with Reinforced Learning), a copy-augmented transformer model that is encouraged to accurately describe the novel object labels. This is achieved via a specialised reward function in the SCST reinforcement learning framework (Rennie et al., 2017) that encourages novel object mentions while maintaining the caption quality. We further restrict the SCST training to the images where detected objects are mentioned in reference captions to train the ECOL-R model. We additionally improve our copy mechanism via Abstract Labels, which transfer knowledge from known to novel object types, and a Morphological Selector, which determines the appropriate inflected forms of novel object labels. The resulting model sets new state-of-the-art on the nocaps (Agrawal et al., 2019) and held-out COCO (Hendricks et al., 2016) benchmarks.

pdf bib
Debiasing Pre-trained Contextualised Embeddings
Masahiro Kaneko | Danushka Bollegala

In comparison to the numerous debiasing methods proposed for the static non-contextualised word embeddings, the discriminative biases in contextualised embeddings have received relatively little attention. We propose a fine-tuning method that can be applied at token- or sentence-levels to debias pre-trained contextualised embeddings. Our proposed method can be applied to any pre-trained contextualised embedding model, without requiring to retrain those models. Using gender bias as an illustrative example, we then conduct a systematic study using several state-of-the-art (SoTA) contextualised representations on multiple benchmark datasets to evaluate the level of biases encoded in different contextualised embeddings before and after debiasing using the proposed method. We find that applying token-level debiasing for all tokens and across all layers of a contextualised embedding model produces the best performance. Interestingly, we observe that there is a trade-off between creating an accurate vs. unbiased contextualised embedding model, and different contextualised embedding models respond differently to this trade-off.

pdf bib
Cross-lingual Visual Pre-training for Multimodal Machine Translation
Ozan Caglayan | Menekse Kuyu | Mustafa Sercan Amac | Pranava Madhyastha | Erkut Erdem | Aykut Erdem | Lucia Specia

Pre-trained language models have been shown to improve performance in many natural language tasks substantially. Although the early focus of such models was single language pre-training, recent advances have resulted in cross-lingual and visual pre-training methods. In this paper, we combine these two approaches to learn visually-grounded cross-lingual representations. Specifically, we extend the translation language modelling (Lample and Conneau, 2019) with masked region classification and perform pre-training with three-way parallel vision & language corpora. We show that when fine-tuned for multimodal machine translation, these models obtain state-of-the-art performance. We also provide qualitative insights into the usefulness of the learned grounded representations.

pdf bib
An Expert Annotated Dataset for the Detection of Online Misogyny
Ella Guest | Bertie Vidgen | Alexandros Mittos | Nishanth Sastry | Gareth Tyson | Helen Margetts

Online misogyny is a pernicious social problem that risks making online platforms toxic and unwelcoming to women. We present a new hierarchical taxonomy for online misogyny, as well as an expert labelled dataset to enable automatic classification of misogynistic content. The dataset consists of 6567 labels for Reddit posts and comments. As previous research has found untrained crowdsourced annotators struggle with identifying misogyny, we hired and trained annotators and provided them with robust annotation guidelines. We report baseline classification performance on the binary classification task, achieving accuracy of 0.93 and F1 of 0.43. The codebook and datasets are made freely available for future researchers.

pdf bib
WikiMatrix : Mining 135 M Parallel Sentences in 1620 Language Pairs from WikipediaWikiMatrix: Mining 135M Parallel Sentences in 1620 Language Pairs from Wikipedia
Holger Schwenk | Vishrav Chaudhary | Shuo Sun | Hongyu Gong | Francisco Guzmán

We present an approach based on multilingual sentence embeddings to automatically extract parallel sentences from the content of Wikipedia articles in 96 languages, including several dialects or low-resource languages. We do not limit the extraction process to alignments with English, but we systematically consider all possible language pairs. In total, we are able to extract 135 M parallel sentences for 16720 different language pairs, out of which only 34 M are aligned with English. This corpus is freely available. To get an indication on the quality of the extracted bitexts, we train neural MT baseline systems on the mined data only for 1886 languages pairs, and evaluate them on the TED corpus, achieving strong BLEU scores for many language pairs. The WikiMatrix bitexts seem to be particularly interesting to train MT systems between distant languages without the need to pivot through English.

pdf bib
ChEMU-Ref : A Corpus for Modeling Anaphora Resolution in the Chemical DomainChEMU-Ref: A Corpus for Modeling Anaphora Resolution in the Chemical Domain
Biaoyan Fang | Christian Druckenbrodt | Saber A Akhondi | Jiayuan He | Timothy Baldwin | Karin Verspoor

Chemical patents contain rich coreference and bridging links, which are the target of this research. Specially, we introduce a novel annotation scheme, based on which we create the ChEMU-Ref dataset from reaction description snippets in English-language chemical patents. We propose a neural approach to anaphora resolution, which we show to achieve strong results, especially when jointly trained over coreference and bridging links.

pdf bib
Searching for Search Errors in Neural Morphological Inflection
Martina Forster | Clara Meister | Ryan Cotterell

Neural sequence-to-sequence models are currently the predominant choice for language generation tasks. Yet, on word-level tasks, exact inference of these models reveals the empty string is often the global optimum. Prior works have speculated this phenomenon is a result of the inadequacy of neural models for language generation. However, in the case of morphological inflection, we find that the empty string is almost never the most probable solution under the model. Further, greedy search often finds the global optimum. These observations suggest that the poor calibration of many neural models may stem from characteristics of a specific subset of tasks rather than general ill-suitedness of such models for language generation.

pdf bib
Quantifying Appropriateness of Summarization Data for Curriculum Learning
Ryuji Kano | Takumi Takahashi | Toru Nishino | Motoki Taniguchi | Tomoki Taniguchi | Tomoko Ohkuma

Much research has reported the training data of summarization models are noisy ; summaries often do not reflect what is written in the source texts. We propose an effective method of curriculum learning to train summarization models from such noisy data. Curriculum learning is used to train sequence-to-sequence models with noisy data. In translation tasks, previous research quantified noise of the training data using two models trained with noisy and clean corpora. Because such corpora do not exist in summarization fields, we propose a model that can quantify noise from a single noisy corpus. We conduct experiments on three summarization models ; one pretrained model and two non-pretrained models, and verify our method improves the performance. Furthermore, we analyze how different curricula affect the performance of pretrained and non-pretrained summarization models. Our result on human evaluation also shows our method improves the performance of summarization models.

pdf bib
Civil Rephrases Of Toxic Texts With Self-Supervised Transformers
Léo Laugier | John Pavlopoulos | Jeffrey Sorensen | Lucas Dixon

Platforms that support online commentary, from social networks to news sites, are increasingly leveraging machine learning to assist their moderation efforts. But this process does not typically provide feedback to the author that would help them contribute according to the community guidelines. This is prohibitively time-consuming for human moderators to do, and computational approaches are still nascent. This work focuses on models that can help suggest rephrasings of toxic comments in a more civil manner. Inspired by recent progress in unpaired sequence-to-sequence tasks, a self-supervised learning model is introduced, called CAE-T5. CAE-T5 employs a pre-trained text-to-text transformer, which is fine tuned with a denoising and cyclic auto-encoder loss. Experimenting with the largest toxicity detection dataset to date (Civil Comments) our model generates sentences that are more fluent and better at preserving the initial content compared to earlier text style transfer systems which we compare with using several scoring systems and human evaluation.

pdf bib
Generating Weather Comments from Meteorological Simulations
Soichiro Murakami | Sora Tanaka | Masatsugu Hangyo | Hidetaka Kamigaito | Kotaro Funakoshi | Hiroya Takamura | Manabu Okumura

The task of generating weather-forecast comments from meteorological simulations has the following requirements : (i) the changes in numerical values for various physical quantities need to be considered, (ii) the weather comments should be dependent on delivery time and area information, and (iii) the comments should provide useful information for users. To meet these requirements, we propose a data-to-text model that incorporates three types of encoders for numerical forecast maps, observation data, and meta-data. We also introduce weather labels representing weather information, such as sunny and rain, for our model to explicitly describe useful information. We conducted automatic and human evaluations. The results indicate that our model performed best against baselines in terms of informativeness. We make our code and data publicly available.

pdf bib
A phonetic model of non-native spoken word processing
Yevgen Matusevych | Herman Kamper | Thomas Schatz | Naomi Feldman | Sharon Goldwater

Non-native speakers show difficulties with spoken word processing. Many studies attribute these difficulties to imprecise phonological encoding of words in the lexical memory. We test an alternative hypothesis : that some of these difficulties can arise from the non-native speakers’ phonetic perception. We train a computational model of phonetic learning, which has no access to phonology, on either one or two languages. We first show that the model exhibits predictable behaviors on phone-level and word-level discrimination tasks. We then test the model on a spoken word processing task, showing that phonology may not be necessary to explain some of the word processing effects observed in non-native speakers. We run an additional analysis of the model’s lexical representation space, showing that the two training languages are not fully separated in that space, similarly to the languages of a bilingual human speaker.

pdf bib
Bootstrapping Relation Extractors using Syntactic Search by Examples
Matan Eyal | Asaf Amrami | Hillel Taub-Tabib | Yoav Goldberg

The advent of neural-networks in NLP brought with it substantial improvements in supervised relation extraction. However, obtaining a sufficient quantity of training data remains a key challenge. In this work we propose a process for bootstrapping training datasets which can be performed quickly by non-NLP-experts. We take advantage of search engines over syntactic-graphs (Such as Shlain et al. (2020)) which expose a friendly by-example syntax. We use these to obtain positive examples by searching for sentences that are syntactically similar to user input examples. We apply this technique to relations from TACRED and DocRED and show that the resulting models are competitive with models trained on manually annotated data and on data obtained from distant supervision. The models also outperform models trained using NLG data augmentation techniques. Extending the search-based approach with the NLG method further improves the results.

pdf bib
Towards a Decomposable Metric for Explainable Evaluation of Text Generation from AMRAMR
Juri Opitz | Anette Frank

Systems that generate natural language text from abstract meaning representations such as AMR are typically evaluated using automatic surface matching metrics that compare the generated texts to reference texts from which the input meaning representations were constructed. We show that besides well-known issues from which such metrics suffer, an additional problem arises when applying these metrics for AMR-to-text evaluation, since an abstract meaning representation allows for numerous surface realizations. In this work we aim to alleviate these issues by proposing _, a decomposable metric that builds on two pillars. The first is the principle of meaning preservation : it measures to what extent a given AMR can be reconstructed from the generated sentence using SOTA AMR parsers and applying (fine-grained) AMR evaluation metrics to measure the distance between the original and the reconstructed AMR. The second pillar builds on a principle of (grammatical) form that measures the linguistic quality of the generated text, which we implement using SOTA language models. In two extensive pilot studies we show that fulfillment of both principles offers benefits for AMR-to-text evaluation, including explainability of scores. Since _ does not necessarily rely on gold AMRs, it may extend to other text generation tasks.\\mathcal{M}\\mathcal{F}_\\beta, a decomposable metric that builds on two pillars. The first is the principle of meaning preservation \\mathcal{M}\n : it measures to what extent a given AMR can be reconstructed from the generated sentence using SOTA AMR parsers and applying (fine-grained) AMR evaluation metrics to measure the distance between the original and the reconstructed AMR. The second pillar builds on a principle of (grammatical) form \\mathcal{F}\n that measures the linguistic quality of the generated text, which we implement using SOTA language models. In two extensive pilot studies we show that fulfillment of both principles offers benefits for AMR-to-text evaluation, including explainability of scores. Since \\mathcal{M}\\mathcal{F}_\\beta does not necessarily rely on gold AMRs, it may extend to other text generation tasks.

pdf bib
The Source-Target Domain Mismatch Problem in Machine Translation
Jiajun Shen | Peng-Jen Chen | Matthew Le | Junxian He | Jiatao Gu | Myle Ott | Michael Auli | Marc’Aurelio Ranzato

While we live in an increasingly interconnected world, different places still exhibit strikingly different cultures and many events we experience in our every day life pertain only to the specific place we live in. As a result, people often talk about different things in different parts of the world. In this work we study the effect of local context in machine translation and postulate that this causes the domains of the source and target language to greatly mismatch. We first formalize the concept of source-target domain mismatch, propose a metric to quantify it, and provide empirical evidence for its existence. We conclude with an empirical study of how source-target domain mismatch affects training of machine translation systems on low resource languages. While this may severely affect back-translation, the degradation can be alleviated by combining back-translation with self-training and by increasing the amount of target side monolingual data.

pdf bib
Understanding Pre-Editing for Black-Box Neural Machine Translation
Rei Miyata | Atsushi Fujita

Pre-editing is the process of modifying the source text (ST) so that it can be translated by machine translation (MT) in a better quality. Despite the unpredictability of black-box neural MT (NMT), pre-editing has been deployed in various practical MT use cases. Although many studies have demonstrated the effectiveness of pre-editing methods for particular settings, thus far, a deep understanding of what pre-editing is and how it works for black-box NMT is lacking. To elicit such understanding, we extensively investigated human pre-editing practices. We first implemented a protocol to incrementally record the minimum edits for each ST and collected 6,652 instances of pre-editing across three translation directions, two MT systems, and four text domains. We then analysed the instances from three perspectives : the characteristics of the pre-edited ST, the diversity of pre-editing operations, and the impact of the pre-editing operations on NMT outputs. Our findings include the following : (1) enhancing the explicitness of the meaning of an ST and its syntactic structure is more important for obtaining better translations than making the ST shorter and simpler, and (2) although the impact of pre-editing on NMT is generally unpredictable, there are some tendencies of changes in the NMT outputs depending on the editing operation types.

pdf bib
WiC-TSV : An Evaluation Benchmark for Target Sense Verification of Words in ContextWiC-TSV: An Evaluation Benchmark for Target Sense Verification of Words in Context
Anna Breit | Artem Revenko | Kiamehr Rezaee | Mohammad Taher Pilehvar | Jose Camacho-Collados

We present WiC-TSV, a new multi-domain evaluation benchmark for Word Sense Disambiguation. More specifically, we introduce a framework for Target Sense Verification of Words in Context which grounds its uniqueness in the formulation as binary classification task thus being independent of external sense inventories, and the coverage of various domains. This makes the dataset highly flexible for the evaluation of a diverse set of models and systems in and across domains. WiC-TSV provides three different evaluation settings, depending on the input signals provided to the model. We set baseline performance on the dataset using state-of-the-art language models. Experimental results show that even though these models can perform decently on the task, there remains a gap between machine and human performance, especially in out-of-domain settings. WiC-TSV data is available at

pdf bib
Self-Supervised and Controlled Multi-Document Opinion Summarization
Hady Elsahar | Maximin Coavoux | Jos Rozen | Matthias Gallé

We address the problem of unsupervised abstractive summarization of collections of user generated reviews through self-supervision and control. We propose a self-supervised setup that considers an individual document as a target summary for a set of similar documents. This setting makes training simpler than previous approaches by relying only on standard log-likelihood loss and mainstream models. We address the problem of hallucinations through the use of control codes, to steer the generation towards more coherent and relevant summaries.

pdf bib
Active Learning for Sequence Tagging with Deep Pre-trained Models and Bayesian Uncertainty EstimatesBayesian Uncertainty Estimates
Artem Shelmanov | Dmitri Puzyrev | Lyubov Kupriyanova | Denis Belyakov | Daniil Larionov | Nikita Khromov | Olga Kozlova | Ekaterina Artemova | Dmitry V. Dylov | Alexander Panchenko

Annotating training data for sequence tagging of texts is usually very time-consuming. Recent advances in transfer learning for natural language processing in conjunction with active learning open the possibility to significantly reduce the necessary annotation budget. We are the first to thoroughly investigate this powerful combination for the sequence tagging task. We conduct an extensive empirical study of various Bayesian uncertainty estimation methods and Monte Carlo dropout options for deep pre-trained models in the active learning framework and find the best combinations for different types of models. Besides, we also demonstrate that to acquire instances during active learning, a full-size Transformer can be substituted with a distilled version, which yields better computational performance and reduces obstacles for applying deep active learning in practice.

pdf bib
BERT Prescriptions to Avoid Unwanted Headaches : A Comparison of Transformer Architectures for Adverse Drug Event DetectionBERT Prescriptions to Avoid Unwanted Headaches: A Comparison of Transformer Architectures for Adverse Drug Event Detection
Beatrice Portelli | Edoardo Lenzi | Emmanuele Chersoni | Giuseppe Serra | Enrico Santus

Pretrained transformer-based models, such as BERT and its variants, have become a common choice to obtain state-of-the-art performances in NLP tasks. In the identification of Adverse Drug Events (ADE) from social media texts, for example, BERT architectures rank first in the leaderboard. However, a systematic comparison between these models has not yet been done. In this paper, we aim at shedding light on the differences between their performance analyzing the results of 12 models, tested on two standard benchmarks. SpanBERT and PubMedBERT emerged as the best models in our evaluation : this result clearly shows that span-based pretraining gives a decisive advantage in the precise recognition of ADEs, and that in-domain language pretraining is particularly useful when the transformer model is trained just on biomedical text from scratch.

pdf bib
Semantic Parsing of Disfluent Speech
Priyanka Sen | Isabel Groves

Speech disfluencies are prevalent in spontaneous speech. The rising popularity of voice assistants presents a growing need to handle naturally occurring disfluencies. Semantic parsing is a key component for understanding user utterances in voice assistants, yet most semantic parsing research to date focuses on written text. In this paper, we investigate semantic parsing of disfluent speech with the ATIS dataset. We find that a state-of-the-art semantic parser does not seamlessly handle disfluencies. We experiment with adding real and synthetic disfluencies at training time and find that adding synthetic disfluencies not only improves model performance by up to 39 % but can also outperform adding real disfluencies in the ATIS dataset.

pdf bib
We Need To Talk About Random Splits
Anders Søgaard | Sebastian Ebert | Jasmijn Bastings | Katja Filippova

(CITATION) argued for using random splits rather than standard splits in NLP experiments. We argue that random splits, like standard splits, lead to overly optimistic performance estimates. We can also split data in biased or adversarial ways, e.g., training on short sentences and evaluating on long ones. Biased sampling has been used in domain adaptation to simulate real-world drift ; this is known as the covariate shift assumption. In NLP, however, even worst-case splits, maximizing bias, often under-estimate the error observed on new samples of in-domain data, i.e., the data that models should minimally generalize to at test time. This invalidates the covariate shift assumption. Instead of using multiple random splits, future benchmarks should ideally include multiple, independent test sets instead ; if infeasible, we argue that multiple biased splits leads to more realistic performance estimates than multiple random splits.

pdf bib
Alignment verification to improve NMT translation towards highly inflectional languages with limited resourcesNMT translation towards highly inflectional languages with limited resources
George Tambouratzis

The present article discusses how to improve translation quality when using limited training data to translate towards morphologically rich languages. The starting point is a neural MT system, used to train translation models, using solely publicly available parallel data. An initial analysis of the translation output has shown that quality is sub-optimal, due mainly to an insufficient amount of training data. To improve translation quality, a hybridized solution is proposed, using an ensemble of relatively simple NMT systems trained with different metrics, combined with an open source module, designed for a low-resource MT system. Experimental results of the proposed hybridized method with multiple independent test sets achieve improvements over (i) both the best individual NMT and (ii) the standard ensemble system provided in the Marian-NMT system. Improvements over Marian-NMT are in many cases statistically significant. Finally, a qualitative analysis of translation results indicates a greater robustness for the hybridized method.

pdf bib
Data Augmentation for Voice-Assistant NLU using BERT-based Interchangeable RephraseNLU using BERT-based Interchangeable Rephrase
Akhila Yerukola | Mason Bretan | Hongxia Jin

We introduce a data augmentation technique based on byte pair encoding and a BERT-like self-attention model to boost performance on spoken language understanding tasks. We compare and evaluate this method with a range of augmentation techniques encompassing generative models such as VAEs and performance-boosting techniques such as synonym replacement and back-translation. We show our method performs strongly on domain and intent classification tasks for a voice assistant and in a user-study focused on utterance naturalness and semantic similarity.

pdf bib
How to Evaluate a Summarizer : Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation
Julius Steen | Katja Markert

Manual evaluation is essential to judge progress on automatic text summarization. However, we conduct a survey on recent summarization system papers that reveals little agreement on how to perform such evaluation studies. We conduct two evaluation experiments on two aspects of summaries’ linguistic quality (coherence and repetitiveness) to compare Likert-type and ranking annotations and show that best choice of evaluation method can vary from one aspect to another. In our survey, we also find that study parameters such as the overall number of annotators and distribution of annotators to annotation items are often not fully reported and that subsequent statistical analysis ignores grouping factors arising from one annotator judging multiple summaries. Using our evaluation experiments, we show that the total number of annotators can have a strong impact on study power and that current statistical analysis methods can inflate type I error rates up to eight-fold. In addition, we highlight that for the purpose of system comparison the current practice of eliciting multiple judgements per summary leads to less powerful and reliable annotations given a fixed study budget.

pdf bib
Error Analysis and the Role of Morphology
Marcel Bollmann | Anders Søgaard

We evaluate two common conjectures in error analysis of NLP models : (i) Morphology is predictive of errors ; and (ii) the importance of morphology increases with the morphological complexity of a language. We show across four different tasks and up to 57 languages that of these conjectures, somewhat surprisingly, only (i) is true. Using morphological features does improve error prediction across tasks ; however, this effect is less pronounced with morphologically complex languages. We speculate this is because morphology is more discriminative in morphologically simple languages. Across all four tasks, case and gender are the morphological features most predictive of error.

pdf bib
Attention-based Relational Graph Convolutional Network for Target-Oriented Opinion Words Extraction
Junfeng Jiang | An Wang | Akiko Aizawa

Target-oriented opinion words extraction (TOWE) is a subtask of aspect-based sentiment analysis (ABSA). It aims to extract the corresponding opinion words for a given opinion target in a review sentence. Intuitively, the relation between an opinion target and an opinion word mostly relies on syntactics. In this study, we design a directed syntactic dependency graph based on a dependency tree to establish a path from the target to candidate opinions. Subsequently, we propose a novel attention-based relational graph convolutional neural network (ARGCN) to exploit syntactic information over dependency graphs. Moreover, to explicitly extract the corresponding opinion words toward the given opinion target, we effectively encode target information in our model with the target-aware representation. Empirical results demonstrate that our model significantly outperforms all of the existing models on four benchmark datasets. Extensive analysis also demonstrates the effectiveness of each component of our models. Our code is available at

pdf bib
Acquiring a Formality-Informed Lexical Resource for Style Analysis
Elisabeth Eder | Ulrike Krieg-Holz | Udo Hahn

To track different levels of formality in written discourse, we introduce a novel type of lexicon for the German language, with entries ordered by their degree of (in)formality. We start with a set of words extracted from traditional lexicographic resources, extend it by sentence-based similarity computations, and let crowdworkers assess the enlarged set of lexical items on a continuous informal-formal scale as a gold standard for evaluation. We submit this lexicon to an intrinsic evaluation related to the best regression models and their effect on predicting formality scores and complement our investigation by an extrinsic evaluation of formality on a German-language email corpus.

pdf bib
Probing into the Root : A Dataset for Reason Extraction of Structural Events from Financial Documents
Pei Chen | Kang Liu | Yubo Chen | Taifeng Wang | Jun Zhao

This paper proposes a new task regarding event reason extraction from document-level texts. Unlike the previous causality detection task, we do not assign target events in the text, but only provide structural event descriptions, and such settings accord more with practice scenarios. Moreover, we annotate a large dataset FinReason for evaluation, which provides Reasons annotation for Financial events in company announcements. This task is challenging because the cases of multiple-events, multiple-reasons, and implicit-reasons are included. In total, FinReason contains 8,794 documents, 12,861 financial events and 11,006 reason spans. We also provide the performance of existing canonical methods in event extraction and machine reading comprehension on this task. The results show a 7 percentage point F1 score gap between the best model and human performance, and existing methods are far from resolving this problem.

pdf bib
Language Modelling as a Multi-Task Problem
Lucas Weber | Jaap Jumelet | Elia Bruni | Dieuwke Hupkes

In this paper, we propose to study language modelling as a multi-task problem, bringing together three strands of research : multi-task learning, linguistics, and interpretability. Based on hypotheses derived from linguistic theory, we investigate whether language models adhere to learning principles of multi-task learning during training. To showcase the idea, we analyse the generalisation behaviour of language models as they learn the linguistic concept of Negative Polarity Items (NPIs). Our experiments demonstrate that a multi-task setting naturally emerges within the objective of the more general task of language modelling. We argue that this insight is valuable for multi-task learning, linguistics and interpretability research and can lead to exciting new findings in all three domains.

pdf bib
ChainCQG : Flow-Aware Conversational Question GenerationChainCQG: Flow-Aware Conversational Question Generation
Jing Gu | Mostafa Mirshekari | Zhou Yu | Aaron Sisto

Conversational systems enable numerous valuable applications, and question-answering is an important component underlying many of these. However, conversational question-answering remains challenging due to the lack of realistic, domain-specific training data. Inspired by this bottleneck, we focus on conversational question generation as a means to generate synthetic conversations for training and evaluation purposes. We present a number of novel strategies to improve conversational flow and accommodate varying question types and overall fluidity. Specifically, we design ChainCQG as a two-stage architecture that learns question-answer representations across multiple dialogue turns using a flow propagation training strategy. ChainCQG significantly outperforms both answer-aware and answer-unaware SOTA baselines (e.g., up to 48 % BLEU-1 improvement). Additionally, our model is able to generate different types of questions, with improved fluidity and coreference alignment.

pdf bib
The Interplay of Task Success and Dialogue Quality : An in-depth Evaluation in Task-Oriented Visual Dialogues
Alberto Testoni | Raffaella Bernardi

When training a model on referential dialogue guessing games, the best model is usually chosen based on its task success. We show that in the popular end-to-end approach, this choice prevents the model from learning to generate linguistically richer dialogues, since the acquisition of language proficiency takes longer than learning the guessing task. By comparing models playing different games (GuessWhat, GuessWhich, and Mutual Friends), we show that this discrepancy is model- and task-agnostic. We investigate whether and when better language quality could lead to higher task success. We show that in GuessWhat, models could increase their accuracy if they learn to ground, encode, and decode also words that do not occur frequently in the training set.

pdf bib
Are you kidding me? : Detecting Unpalatable Questions on RedditReddit
Sunyam Bagga | Andrew Piper | Derek Ruths

Abusive language in online discourse negatively affects a large number of social media users. Many computational methods have been proposed to address this issue of online abuse. The existing work, however, tends to focus on detecting the more explicit forms of abuse leaving the subtler forms of abuse largely untouched. Our work addresses this gap by making three core contributions. First, inspired by the theory of impoliteness, we propose a novel task of detecting a subtler form of abuse, namely unpalatable questions. Second, we publish a context-aware dataset for the task using data from a diverse set of Reddit communities. Third, we implement a wide array of learning models and also investigate the benefits of incorporating conversational context into computational models. Our results show that modeling subtle abuse is feasible but difficult due to the language involved being highly nuanced and context-sensitive. We hope that future research in the field will address such subtle forms of abuse since their harm currently passes unnoticed through existing detection systems.

pdf bib
Neural-Driven Search-Based Paraphrase Generation
Betty Fabre | Tanguy Urvoy | Jonathan Chevelu | Damien Lolive

We study a search-based paraphrase generation scheme where candidate paraphrases are generated by iterated transformations from the original sentence and evaluated in terms of syntax quality, semantic distance, and lexical distance. The semantic distance is derived from BERT, and the lexical quality is based on GPT2 perplexity. To solve this multi-objective search problem, we propose two algorithms : Monte-Carlo Tree Search For Paraphrase Generation (MCPG) and Pareto Tree Search (PTS). We provide an extensive set of experiments on 5 datasets with a rigorous reproduction and validation for several state-of-the-art paraphrase generation algorithms. These experiments show that, although being non explicitly supervised, our algorithms perform well against these baselines.

pdf bib
FAST : Financial News and Tweet Based Time Aware Network for Stock TradingFAST: Financial News and Tweet Based Time Aware Network for Stock Trading
Ramit Sawhney | Arnav Wadhwa | Shivam Agarwal | Rajiv Ratn Shah

Designing profitable trading strategies is complex as stock movements are highly stochastic ; the market is influenced by large volumes of noisy data across diverse information sources like news and social media. Prior work mostly treats stock movement prediction as a regression or classification task and is not directly optimized towards profit-making. Further, they do not model the fine-grain temporal irregularities in the release of vast volumes of text that the market responds to quickly. Building on these limitations, we propose a novel hierarchical, learning to rank approach that uses textual data to make time-aware predictions for ranking stocks based on expected profit. Our approach outperforms state-of-the-art methods by over 8 % in terms of cumulative profit and risk-adjusted returns in trading simulations on two benchmarks : English tweets and Chinese financial news spanning two major stock indexes and four global markets. Through ablative and qualitative analyses, we build the case for our method as a tool for daily stock trading.

pdf bib
Building Representative Corpora from Illiterate Communities : A Reviewof Challenges and Mitigation Strategies for Developing Countries
Stephanie Hirmer | Alycia Leonard | Josephine Tumwesige | Costanza Conforti

Most well-established data collection methods currently adopted in NLP depend on the as- sumption of speaker literacy. Consequently, the collected corpora largely fail to represent swathes of the global population, which tend to be some of the most vulnerable and marginalised people in society, and often live in rural developing areas. Such underrepresented groups are thus not only ignored when making modeling and system design decisions, but also prevented from benefiting from development outcomes achieved through data-driven NLP. This paper aims to address the under-representation of illiterate communities in NLP corpora : we identify potential biases and ethical issues that might arise when collecting data from rural communities with high illiteracy rates in Low-Income Countries, and propose a set of practical mitigation strategies to help future work.

pdf bib
Content-based Models of Quotation
Ansel MacLaughlin | David Smith

We explore the task of quotability identification, in which, given a document, we aim to identify which of its passages are the most quotable, i.e. the most likely to be directly quoted by later derived documents. We approach quotability identification as a passage ranking problem and evaluate how well both feature-based and BERT-based (Devlin et al., 2019) models rank the passages in a given document by their predicted quotability. We explore this problem through evaluations on five datasets that span multiple languages (English, Latin) and genres of literature (e.g. poetry, plays, novels) and whose corresponding derived documents are of multiple types (news, journal articles). Our experiments confirm the relatively strong performance of BERT-based models on this task, with the best model, a RoBERTA sequential sentence tagger, achieving an average rho of 0.35 and NDCG@1, 5, 50 of 0.26, 0.31 and 0.40, respectively, across all five datasets.

pdf bib
Lexical Normalization for Code-switched Data and its Effect on POS TaggingPOS Tagging
Rob van der Goot | Özlem Çetinoğlu

Lexical normalization, the translation of non-canonical data to standard language, has shown to improve the performance of many natural language processing tasks on social media. Yet, using multiple languages in one utterance, also called code-switching (CS), is frequently overlooked by these normalization systems, despite its common use in social media. In this paper, we propose three normalization models specifically designed to handle code-switched data which we evaluate for two language pairs : Indonesian-English and Turkish-German. For the latter, we introduce novel normalization layers and their corresponding language ID and POS tags for the dataset, and evaluate the downstream effect of normalization on POS tagging. Results show that our CS-tailored normalization models significantly outperform monolingual ones, and lead to 5.4 % relative performance increase for POS tagging as compared to unnormalized input.

pdf bib
Structural Encoding and Pre-training Matter : Adapting BERT for Table-Based Fact VerificationBERT for Table-Based Fact Verification
Rui Dong | David Smith

Growing concern with online misinformation has encouraged NLP research on fact verification. Since writers often base their assertions on structured data, we focus here on verifying textual statements given evidence in tables. Starting from the Table Parsing (TAPAS) model developed for question answering (Herzig et al., 2020), we find that modeling table structure improves a language model pre-trained on unstructured text. Pre-training language models on English Wikipedia table data further improves performance. Pre-training on a question answering task with column-level cell rank information achieves the best performance. With improved pre-training and cell embeddings, this approach outperforms the state-of-the-art Numerically-aware Graph Neural Network table fact verification model (GNN-TabFact), increasing statement classification accuracy from 72.2 % to 73.9 % even without modeling numerical information. Incorporating numerical information with cell rankings and pre-training on a question-answering task increases accuracy to 76 %. We further analyze accuracy on statements implicating single rows or multiple rows and columns of tables, on different numerical reasoning subtasks, and on generalizing to detecting errors in statements derived from the ToTTo table-to-text generation dataset.

pdf bib
Cross-Cultural Similarity Features for Cross-Lingual Transfer Learning of Pragmatically Motivated Tasks
Jimin Sun | Hwijeen Ahn | Chan Young Park | Yulia Tsvetkov | David R. Mortensen

Much work in cross-lingual transfer learning explored how to select better transfer languages for multilingual tasks, primarily focusing on typological and genealogical similarities between languages. We hypothesize that these measures of linguistic proximity are not enough when working with pragmatically-motivated tasks, such as sentiment analysis. As an alternative, we introduce three linguistic features that capture cross-cultural similarities that manifest in linguistic patterns and quantify distinct aspects of language pragmatics : language context-level, figurative language, and the lexification of emotion concepts. Our analyses show that the proposed pragmatic features do capture cross-cultural similarities and align well with existing work in sociolinguistics and linguistic anthropology. We further corroborate the effectiveness of pragmatically-driven transfer in the downstream task of choosing transfer languages for cross-lingual sentiment analysis.

pdf bib
PHASE : Learning Emotional Phase-aware Representations for Suicide Ideation Detection on Social MediaPHASE: Learning Emotional Phase-aware Representations for Suicide Ideation Detection on Social Media
Ramit Sawhney | Harshit Joshi | Lucie Flek | Rajiv Ratn Shah

Recent psychological studies indicate that individuals exhibiting suicidal ideation increasingly turn to social media rather than mental health practitioners. Contextualizing the build-up of such ideation is critical for the identification of users at risk. In this work, we focus on identifying suicidal intent in tweets by augmenting linguistic models with emotional phases modeled from users’ historical context. We propose PHASE, a time-and phase-aware framework that adaptively learns features from a user’s historical emotional spectrum on Twitter for preliminary screening of suicidal risk. Building on clinical studies, PHASE learns phase-like progressions in users’ historical Plutchik-wheel-based emotions to contextualize suicidal intent. While outperforming state-of-the-art methods, we show the utility of temporal and phase-based emotional contextual cues for suicide ideation detection. We further discuss practical and ethical considerations.

pdf bib
Exploiting Definitions for Frame Identification
Tianyu Jiang | Ellen Riloff

Frame identification is one of the key challenges for frame-semantic parsing. The goal of this task is to determine which frame best captures the meaning of a target word or phrase in a sentence. We present a new model for frame identification that uses a pre-trained transformer model to generate representations for frames and lexical units (senses) using their formal definitions in FrameNet. Our frame identification model assesses the suitability of a frame for a target word in a sentence based on the semantic coherence of their meanings. We evaluate our model on three data sets and show that it consistently achieves better performance than previous systems.

pdf bib
ADePT : Auto-encoder based Differentially Private Text TransformationADePT: Auto-encoder based Differentially Private Text Transformation
Satyapriya Krishna | Rahul Gupta | Christophe Dupuy

Privacy is an important concern when building statistical models on data containing personal information. Differential privacy offers a strong definition of privacy and can be used to solve several privacy concerns. Multiple solutions have been proposed for the differentially-private transformation of datasets containing sensitive information. However, such transformation algorithms offer poor utility in Natural Language Processing (NLP) tasks due to noise added in the process. This paper addresses this issue by providing a utility-preserving differentially private text transformation algorithm using auto-encoders. Our algorithm transforms text to offer robustness against attacks and produces transformations with high semantic quality that perform well on downstream NLP tasks. We prove our algorithm’s theoretical privacy guarantee and assess its privacy leakage under Membership Inference Attacks (MIA) on models trained with transformed data. Our results show that the proposed model performs better against MIA attacks while offering lower to no degradation in the utility of the underlying transformation process compared to existing baselines.

pdf bib
Evaluating Neural Model Robustness for Machine Comprehension
Winston Wu | Dustin Arendt | Svitlana Volkova

We evaluate neural model robustness to adversarial attacks using different types of linguistic unit perturbations character and word, and propose a new method for strategic sentence-level perturbations. We experiment with different amounts of perturbations to examine model confidence and misclassification rate, and contrast model performance with different embeddings BERT and ELMo on two benchmark datasets SQuAD and TriviaQA. We demonstrate how to improve model performance during an adversarial attack by using ensembles. Finally, we analyze factors that effect model behavior under adversarial attack, and develop a new model to predict errors during attacks. Our novel findings reveal that (a) unlike BERT, models that use ELMo embeddings are more susceptible to adversarial attacks, (b) unlike word and paraphrase, character perturbations affect the model the most but are most easily compensated for by adversarial training, (c) word perturbations lead to more high-confidence misclassifications compared to sentence- and character-level perturbations, (d) the type of question and model answer length (the longer the answer the more likely it is to be incorrect) is the most predictive of model errors in adversarial setting, and (e) conclusions about model behavior are dataset-specific.

pdf bib
Hidden Biases in Unreliable News Detection Datasets
Xiang Zhou | Heba Elfardy | Christos Christodoulopoulos | Thomas Butler | Mohit Bansal

Automatic unreliable news detection is a research problem with great potential impact. Recently, several papers have shown promising results on large-scale news datasets with models that only use the article itself without resorting to any fact-checking mechanism or retrieving any supporting evidence. In this work, we take a closer look at these datasets. While they all provide valuable resources for future research, we observe a number of problems that may lead to results that do not generalize in more realistic settings. Specifically, we show that selection bias during data collection leads to undesired artifacts in the datasets. In addition, while most systems train and predict at the level of individual articles, overlapping article sources in the training and evaluation data can provide a strong confounding factor that models can exploit. In the presence of this confounding factor, the models can achieve good performance by directly memorizing the site-label mapping instead of modeling the real task of unreliable news detection. We observed a significant drop (10 %) in accuracy for all models tested in a clean split with no train / test source overlap. Using the observations and experimental results, we provide practical suggestions on how to create more reliable datasets for the unreliable news detection task. We suggest future dataset creation include a simple model as a difficulty / bias probe and future model development use a clean non-overlapping site and date split.

pdf bib
Unsupervised Extractive Summarization using Pointwise Mutual Information
Vishakh Padmakumar | He He

Unsupervised approaches to extractive summarization usually rely on a notion of sentence importance defined by the semantic similarity between a sentence and the document. We propose new metrics of relevance and redundancy using pointwise mutual information (PMI) between sentences, which can be easily computed by a pre-trained language model. Intuitively, a relevant sentence allows readers to infer the document content (high PMI with the document), and a redundant sentence can be inferred from the summary (high PMI with the summary). We then develop a greedy sentence selection algorithm to maximize relevance and minimize redundancy of extracted sentences. We show that our method outperforms similarity-based methods on datasets in a range of domains including news, medical journal articles, and personal anecdotes.

pdf bib
Deep Subjecthood : Higher-Order Grammatical Features in Multilingual BERTBERT
Isabel Papadimitriou | Ethan A. Chi | Richard Futrell | Kyle Mahowald

We investigate how Multilingual BERT (mBERT) encodes grammar by examining how the high-order grammatical feature of morphosyntactic alignment (how different languages define what counts as a subject) is manifested across the embedding spaces of different languages. To understand if and how morphosyntactic alignment affects contextual embedding spaces, we train classifiers to recover the subjecthood of mBERT embeddings in transitive sentences (which do not contain overt information about morphosyntactic alignment) and then evaluate them zero-shot on intransitive sentences (where subjecthood classification depends on alignment), within and across languages. We find that the resulting classifier distributions reflect the morphosyntactic alignment of their training languages. Our results demonstrate that mBERT representations are influenced by high-level grammatical features that are not manifested in any one input sentence, and that this is robust across languages. Further examining the characteristics that our classifiers rely on, we find that features such as passive voice, animacy and case strongly correlate with classification decisions, suggesting that mBERT does not encode subjecthood purely syntactically, but that subjecthood embedding is continuous and dependent on semantic and discourse factors, as is proposed in much of the functional linguistics literature. Together, these results provide insight into how grammatical features manifest in contextual embedding spaces, at a level of abstraction not covered by previous work.

pdf bib
DOCENT : Learning Self-Supervised Entity Representations from Large Document CollectionsDOCENT: Learning Self-Supervised Entity Representations from Large Document Collections
Yury Zemlyanskiy | Sudeep Gandhe | Ruining He | Bhargav Kanagal | Anirudh Ravula | Juraj Gottweis | Fei Sha | Ilya Eckstein

This paper explores learning rich self-supervised entity representations from large amounts of associated text. Once pre-trained, these models become applicable to multiple entity-centric tasks such as ranked retrieval, knowledge base completion, question answering, and more. Unlike other methods that harvest self-supervision signals based merely on a local context within a sentence, we radically expand the notion of context to include any available text related to an entity. This enables a new class of powerful, high-capacity representations that can ultimately distill much of the useful information about an entity from multiple text sources, without any human supervision. We present several training strategies that, unlike prior approaches, learn to jointly predict words and entities strategies we compare experimentally on downstream tasks in the TV-Movies domain, such as MovieLens tag prediction from user reviews and natural language movie search. As evidenced by results, our models match or outperform competitive baselines, sometimes with little or no fine-tuning, and are also able to scale to very large corpora. Finally, we make our datasets and pre-trained models publicly available. This includes Reviews2Movielens, mapping the ~1B word corpus of Amazon movie reviews (He and McAuley, 2016) to MovieLens tags (Harper and Konstan, 2016), as well as Reddit Movie Suggestions with natural language queries and corresponding community recommendations.

pdf bib
Scientific Discourse Tagging for Evidence Extraction
Xiangci Li | Gully Burns | Nanyun Peng

Evidence plays a crucial role in any biomedical research narrative, providing justification for some claims and refutation for others. We seek to build models of scientific argument using information extraction methods from full-text papers. We present the capability of automatically extracting text fragments from primary research papers that describe the evidence presented in that paper’s figures, which arguably provides the raw material of any scientific argument made within the paper. We apply richly contextualized deep representation learning pre-trained on biomedical domain corpus to the analysis of scientific discourse structures and the extraction of evidence fragments (i.e., the text in the results section describing data presented in a specified subfigure) from a set of biomedical experimental research articles. We first demonstrate our state-of-the-art scientific discourse tagger on two scientific discourse tagging datasets and its transferability to new datasets. We then show the benefit of leveraging scientific discourse tags for downstream tasks such as claim-extraction and evidence fragment detection. Our work demonstrates the potential of using evidence fragments derived from figure spans for improving the quality of scientific claims by cataloging, indexing and reusing evidence fragments as independent documents.

pdf bib
StructSum : Summarization via Structured RepresentationsStructSum: Summarization via Structured Representations
Vidhisha Balachandran | Artidoro Pagnoni | Jay Yoon Lee | Dheeraj Rajagopal | Jaime Carbonell | Yulia Tsvetkov

Abstractive text summarization aims at compressing the information of a long source document into a rephrased, condensed summary. Despite advances in modeling techniques, abstractive summarization models still suffer from several key challenges : (i) layout bias : they overfit to the style of training corpora ; (ii) limited abstractiveness : they are optimized to copying n-grams from the source rather than generating novel abstractive summaries ; (iii) lack of transparency : they are not interpretable. In this work, we propose a framework based on document-level structure induction for summarization to address these challenges. To this end, we propose incorporating latent and explicit dependencies across sentences in the source document into end-to-end single-document summarization models. Our framework complements standard encoder-decoder summarization models by augmenting them with rich structure-aware document representations based on implicitly learned (latent) structures and externally-derived linguistic (explicit) structures. We show that our summarization framework, trained on the CNN / DM dataset, improves the coverage of content in the source documents, generates more abstractive summaries by generating more novel n-grams, and incorporates interpretable sentence-level structures, while performing on par with standard baselines.

pdf bib
LSOIE : A Large-Scale Dataset for Supervised Open Information ExtractionLSOIE: A Large-Scale Dataset for Supervised Open Information Extraction
Jacob Solawetz | Stefan Larson

Open Information Extraction (OIE) systems seek to compress the factual propositions of a sentence into a series of n-ary tuples. These tuples are useful for downstream tasks in natural language processing like knowledge base creation, textual entailment, and natural language understanding. However, current OIE datasets are limited in both size and diversity. We introduce a new dataset by converting the QA-SRL 2.0 dataset to a large-scale OIE dataset LSOIE. Our LSOIE dataset is 20 times larger than the next largest human-annotated OIE dataset. We construct and evaluate several benchmark OIE models on LSOIE, providing baselines for future improvements on the task. Our LSOIE data, models, and code are made publicly available.

pdf bib
Unsupervised Abstractive Summarization of Bengali Text DocumentsBengali Text Documents
Radia Rayan Chowdhury | Mir Tafseer Nayeem | Tahsin Tasnim Mim | Md. Saifur Rahman Chowdhury | Taufiqul Jannat

Abstractive summarization systems generally rely on large collections of document-summary pairs. However, the performance of abstractive systems remains a challenge due to the unavailability of the parallel data for low-resource languages like Bengali. To overcome this problem, we propose a graph-based unsupervised abstractive summarization system in the single-document setting for Bengali text documents, which requires only a Part-Of-Speech (POS) tagger and a pre-trained language model trained on Bengali texts. We also provide a human-annotated dataset with document-summary pairs to evaluate our abstractive model and to support the comparison of future abstractive summarization systems of the Bengali Language. We conduct experiments on this dataset and compare our system with several well-established unsupervised extractive summarization systems. Our unsupervised abstractive summarization model outperforms the baselines without being exposed to any human-annotated reference summaries.

pdf bib
On the Computational Modelling of Michif Verbal MorphologyMichif Verbal Morphology
Fineen Davis | Eddie Antonio Santos | Heather Souter

This paper presents a finite-state computational model of the verbal morphology of Michif. Michif, the official language of the Mtis peoples, is a uniquely mixed language with Algonquian and French origins. It is spoken across the Mtis homelands in what is now called Canada and the United States, but it is highly endangered with less than 100 speakers. The verbal morphology is remarkably complex, as the already polysynthetic Algonquian patterns are combined with French elements and unique morpho-phonological interactions. The model presented in this paper, LI VERB KAA-OOSHITAHK DI MICHIF handles this complexity by using a series of composed finite-state transducers to model the concatenative morphology and phonological rule alternations that are unique to Michif. Such a rule-based approach is necessary as there is insufficient language data for an approach that uses machine learning. A language model such as LI VERB KAA-OOSHITAHK DI MICHIF furthers the goals of Indigenous computational linguistics in Canada while also supporting the creation of tools for documentation, education, and revitalization that are desired by the Mtis community.

pdf bib
A Few Topical Tweets are Enough for Effective User Stance Detection
Younes Samih | Kareem Darwish

User stance detection entails ascertaining the position of a user towards a target, such as an entity, topic, or claim. Recent work that employs unsupervised classification has shown that performing stance detection on vocal Twitter users, who have many tweets on a target, can be highly accurate (+98 %). However, such methods perform poorly or fail completely for less vocal users, who may have authored only a few tweets about a target. In this paper, we tackle stance detection for such users using two approaches. In the first approach, we improve user-level stance detection by representing tweets using contextualized embeddings, which capture latent meanings of words in context. We show that this approach outperforms two strong baselines and achieves 89.6 % accuracy and 91.3 % macro F-measure on eight controversial topics. In the second approach, we expand the tweets of a given user using their Twitter timeline tweets, which may not be topically relevant, and then we perform unsupervised classification of the user, which entails clustering a user with other users in the training set. This approach achieves 95.6 % accuracy and 93.1 % macro F-measure.

pdf bib
Do Syntax Trees Help Pre-trained Transformers Extract Information?
Devendra Sachan | Yuhao Zhang | Peng Qi | William L. Hamilton

Much recent work suggests that incorporating syntax information from dependency trees can improve task-specific transformer models. However, the effect of incorporating dependency tree information into pre-trained transformer models (e.g., BERT) remains unclear, especially given recent studies highlighting how these models implicitly encode syntax. In this work, we systematically study the utility of incorporating dependency trees into pre-trained transformers on three representative information extraction tasks : semantic role labeling (SRL), named entity recognition, and relation extraction. We propose and investigate two distinct strategies for incorporating dependency structure : a late fusion approach, which applies a graph neural network on the output of a transformer, and a joint fusion approach, which infuses syntax structure into the transformer attention layers. These strategies are representative of prior work, but we introduce additional model design elements that are necessary for obtaining improved performance. Our empirical analysis demonstrates that these syntax-infused transformers obtain state-of-the-art results on SRL and relation extraction tasks. However, our analysis also reveals a critical shortcoming of these models : we find that their performance gains are highly contingent on the availability of human-annotated dependency parses, which raises important questions regarding the viability of syntax-augmented transformers in real-world applications.

pdf bib
Entity-level Factual Consistency of Abstractive Text Summarization
Feng Nan | Ramesh Nallapati | Zhiguo Wang | Cicero Nogueira dos Santos | Henghui Zhu | Dejiao Zhang | Kathleen McKeown | Bing Xiang

A key challenge for abstractive summarization is ensuring factual consistency of the generated summary with respect to the original document. For example, state-of-the-art models trained on existing datasets exhibit entity hallucination, generating names of entities that are not present in the source document. We propose a set of new metrics to quantify the entity-level factual consistency of generated summaries and we show that the entity hallucination problem can be alleviated by simply filtering the training data. In addition, we propose a summary-worthy entity classification task to the training process as well as a joint entity and summary generation approach, which yield further improvements in entity level metrics.

pdf bib
Diverse Adversaries for Mitigating Bias in Training
Xudong Han | Timothy Baldwin | Trevor Cohn

Adversarial learning can learn fairer and less biased models of language processing than standard training. However, current adversarial techniques only partially mitigate the problem of model bias, added to which their training procedures are often unstable. In this paper, we propose a novel approach to adversarial learning based on the use of multiple diverse discriminators, whereby discriminators are encouraged to learn orthogonal hidden representations from one another. Experimental results show that our method substantially improves over standard adversarial removal methods, in terms of reducing bias and stability of training.

pdf bib
‘Just because you are right, does n’t mean I am wrong’ : Overcoming a bottleneck in development and evaluation of Open-Ended VQA tasksI am wrong’: Overcoming a bottleneck in development and evaluation of Open-Ended VQA tasks
Man Luo | Shailaja Keyur Sampat | Riley Tallman | Yankai Zeng | Manuha Vancha | Akarshan Sajja | Chitta Baral

GQA (CITATION) is a dataset for real-world visual reasoning and compositional question answering. We found that many answers predicted by the best vision-language models on the GQA dataset do not match the ground-truth answer but still are semantically meaningful and correct in the given context. In fact, this is the case with most existing visual question answering (VQA) datasets where they assume only one ground-truth answer for each question. We propose Alternative Answer Sets (AAS) of ground-truth answers to address this limitation, which is created automatically using off-the-shelf NLP tools. We introduce a semantic metric based on AAS and modify top VQA solvers to support multiple plausible answers for a question. We implement this approach on the GQA dataset and show the performance improvements.

pdf bib
Better Neural Machine Translation by Extracting Linguistic Information from BERTBERT
Hassan S. Shavarani | Anoop Sarkar

Adding linguistic information (syntax or semantics) to neural machine translation (NMT) have mostly focused on using point estimates from pre-trained models. Directly using the capacity of massive pre-trained contextual word embedding models such as BERT(Devlin et al., 2019) has been marginally useful in NMT because effective fine-tuning is difficult to obtain for NMT without making training brittle and unreliable. We augment NMT by extracting dense fine-tuned vector-based linguistic information from BERT instead of using point estimates. Experimental results show that our method of incorporating linguistic information helps NMT to generalize better in a variety of training contexts and is no more difficult to train than conventional Transformer-based NMT.

pdf bib
CLiMP : A Benchmark for Chinese Language Model EvaluationCLiMP: A Benchmark for Chinese Language Model Evaluation
Beilei Xiang | Changbing Yang | Yu Li | Alex Warstadt | Katharina Kann

Linguistically informed analyses of language models (LMs) contribute to the understanding and improvement of such models. Here, we introduce the corpus of Chinese linguistic minimal pairs (CLiMP) to investigate what knowledge Chinese LMs acquire. CLiMP consists of sets of 1000 minimal pairs (MPs) for 16 syntactic contrasts in Chinese, covering 9 major Chinese linguistic phenomena. The MPs are semi-automatically generated, and human agreement with the labels in CLiMP is 95.8 %. We evaluate 11 different LMs on CLiMP, covering n-grams, LSTMs, and Chinese BERT. We find that classifiernoun agreement and verb complement selection are the phenomena that models generally perform best at. However, models struggle the most with the ba construction, binding, and filler-gap dependencies. Overall, Chinese BERT achieves an 81.8 % average accuracy, while the performances of LSTMs and 5-grams are only moderately above chance level.

pdf bib
Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering
Wenhan Xiong | Hong Wang | William Yang Wang

Commonly used information retrieval methods such as TF-IDF in open-domain question answering (QA) systems are insufficient to capture deep semantic matching that goes beyond lexical overlaps. Some recent studies consider the retrieval process as maximum inner product search (MIPS) using dense question and paragraph representations, achieving promising results on several information-seeking QA datasets. However, the pretraining of the dense vector representations is highly resource-demanding, e.g., requires a very large batch size and lots of training steps. In this work, we propose a sample-efficient method to pretrain the paragraph encoder. First, instead of using heuristically created pseudo question-paragraph pairs for pretraining, we use an existing pretrained sequence-to-sequence model to build a strong question generator that creates high-quality pretraining data. Second, we propose a simple progressive pretraining algorithm to ensure the existence of effective negative samples in each batch. Across three open-domain QA datasets, our method consistently outperforms a strong dense retrieval baseline that uses 6 times more computation for training. On two of the datasets, our method achieves more than 4-point absolute improvement in terms of answer exact match.e.g., requires a very large batch size and lots of training steps. In this work, we propose a sample-efficient method to pretrain the paragraph encoder. First, instead of using heuristically created pseudo question-paragraph pairs for pretraining, we use an existing pretrained sequence-to-sequence model to build a strong question generator that creates high-quality pretraining data. Second, we propose a simple progressive pretraining algorithm to ensure the existence of effective negative samples in each batch. Across three open-domain QA datasets, our method consistently outperforms a strong dense retrieval baseline that uses 6 times more computation for training. On two of the datasets, our method achieves more than 4-point absolute improvement in terms of answer exact match.

pdf bib
Exploring the Limits of Few-Shot Link Prediction in Knowledge Graphs
Dora Jambor | Komal Teru | Joelle Pineau | William L. Hamilton

Real-world knowledge graphs are often characterized by low-frequency relationsa challenge that has prompted an increasing interest in few-shot link prediction methods. These methods perform link prediction for a set of new relations, unseen during training, given only a few example facts of each relation at test time. In this work, we perform a systematic study on a spectrum of models derived by generalizing the current state of the art for few-shot link prediction, with the goal of probing the limits of learning in this few-shot setting. We find that a simple, zero-shot baseline which ignores any relation-specific information achieves surprisingly strong performance. Moreover, experiments on carefully crafted synthetic datasets show that having only a few examples of a relation fundamentally limits models from using fine-grained structural information and only allows for exploiting the coarse-grained positional information of entities. Together, our findings challenge the implicit assumptions and inductive biases of prior work and highlight new directions for research in this area.

pdf bib
ProFormer : Towards On-Device LSH Projection Based TransformersProFormer: Towards On-Device LSH Projection Based Transformers
Chinnadhurai Sankar | Sujith Ravi | Zornitsa Kozareva

At the heart of text based neural models lay word representations, which are powerful but occupy a lot of memory making it challenging to deploy to devices with memory constraints such as mobile phones, watches and IoT. To surmount these challenges, we introduce ProFormer a projection based transformer architecture that is faster and lighter making it suitable to deploy to memory constraint devices and preserve user privacy. We use LSH projection layer to dynamically generate word representations on-the-fly without embedding lookup tables leading to significant memory footprint reduction from O(V.d) to O(T), where V is the vocabulary size, d is the embedding dimension size and T is the dimension of the LSH projection representation. We also propose a local projection attention (LPA) layer, which uses self-attention to transform the input sequence of N LSH word projections into a sequence of N / K representations reducing the computations quadratically by O(K2). We evaluate ProFormer on multiple text classification tasks and observed improvements over prior state-of-the-art on-device approaches for short text classification and comparable performance for long text classification tasks. ProFormer is also competitive with other popular but highly resource-intensive approaches like BERT and even outperforms small-sized BERT variants with significant resource savings reduces the embedding memory footprint from 92.16 MB to 1.7 KB and requires 16x less computation overhead, which is very impressive making it the fastest and smallest on-device model.

pdf bib
Crisscrossed Captions : Extended Intramodal and Intermodal Semantic Similarity Judgments for MS-COCOMS-COCO
Zarana Parekh | Jason Baldridge | Daniel Cer | Austin Waters | Yinfei Yang

By supporting multi-modal retrieval training and evaluation, image captioning datasets have spurred remarkable progress on representation learning. Unfortunately, datasets have limited cross-modal associations : images are not paired with other images, captions are only paired with other captions of the same image, there are no negative associations and there are missing positive cross-modal associations. This undermines research into how inter-modality learning impacts intra-modality tasks. We address this gap with Crisscrossed Captions (CxC), an extension of the MS-COCO dataset with human semantic similarity judgments for 267,095 intra- and inter-modality pairs. We report baseline results on CxC for strong existing unimodal and multimodal models. We also evaluate a multitask dual encoder trained on both image-caption and caption-caption pairs that crucially demonstrates CxC’s value for measuring the influence of intra- and inter-modality learning.

pdf bib
ENPAR : Enhancing Entity and Entity Pair Representations for Joint Entity Relation ExtractionENPAR:Enhancing Entity and Entity Pair Representations for Joint Entity Relation Extraction
Yijun Wang | Changzhi Sun | Yuanbin Wu | Hao Zhou | Lei Li | Junchi Yan

Current state-of-the-art systems for joint entity relation extraction (Luan et al., 2019 ; Wad-den et al., 2019) usually adopt the multi-task learning framework. However, annotations for these additional tasks such as coreference resolution and event extraction are always equally hard (or even harder) to obtain. In this work, we propose a pre-training method ENPAR to improve the joint extraction performance. ENPAR requires only the additional entity annotations that are much easier to collect. Unlike most existing works that only consider incorporating entity information into the sentence encoder, we further utilize the entity pair information. Specifically, we devise four novel objectives, i.e., masked entity typing, masked entity prediction, adversarial context discrimination, and permutation prediction, to pre-train an entity encoder and an entity pair encoder. Comprehensive experiments show that the proposed pre-training method achieves significant improvement over BERT on ACE05, SciERC, and NYT, and outperforms current state-of-the-art on ACE05.

pdf bib
Modelling Context Emotions using Multi-task Learning for Emotion Controlled Dialog Generation
Deeksha Varshney | Asif Ekbal | Pushpak Bhattacharyya

A recent topic of research in natural language generation has been the development of automatic response generation modules that can automatically respond to a user’s utterance in an empathetic manner. Previous research has tackled this task using neural generative methods by augmenting emotion classes with the input sequences. However, the outputs by these models may be inconsistent. We employ multi-task learning to predict the emotion label and to generate a viable response for a given utterance using a common encoder with multiple decoders. Our proposed encoder-decoder model consists of a self-attention based encoder and a decoder with dot product attention mechanism to generate response with a specified emotion. We use the focal loss to handle imbalanced data distribution, and utilize the consistency loss to allow coherent decoding by the decoders. Human evaluation reveals that our model produces more emotionally pertinent responses. In addition, our model outperforms multiple strong baselines on automatic evaluation measures such as F1 and BLEU scores, thus resulting in more fluent and adequate responses.

pdf bib
Modeling Context in Answer Sentence Selection Systems on a Latency Budget
Rujun Han | Luca Soldaini | Alessandro Moschitti

Answer Sentence Selection (AS2) is an efficient approach for the design of open-domain Question Answering (QA) systems. In order to achieve low latency, traditional AS2 models score question-answer pairs individually, ignoring any information from the document each potential answer was extracted from. In contrast, more computationally expensive models designed for machine reading comprehension tasks typically receive one or more passages as input, which often results in better accuracy. In this work, we present an approach to efficiently incorporate contextual information in AS2 models. For each answer candidate, we first use unsupervised similarity techniques to extract relevant sentences from its source document, which we then feed into an efficient transformer architecture fine-tuned for AS2. Our best approach, which leverages a multi-way attention architecture to efficiently encode context, improves 6 % to 11 % over non-contextual state of the art in AS2 with minimal impact on system latency. All experiments in this work were conducted in English.

pdf bib
DISK-CSV : Distilling Interpretable Semantic Knowledge with a Class Semantic VectorDISK-CSV: Distilling Interpretable Semantic Knowledge with a Class Semantic Vector
Housam Khalifa Bashier | Mi-Young Kim | Randy Goebel

Neural networks (NN) applied to natural language processing (NLP) are becoming deeper and more complex, making them increasingly difficult to understand and interpret. Even in applications of limited scope on fixed data, the creation of these complex black-boxes creates substantial challenges for debugging, understanding, and generalization. But rapid development in this field has now lead to building more straightforward and interpretable models. We propose a new technique (DISK-CSV) to distill knowledge concurrently from any neural network architecture for text classification, captured as a lightweight interpretable / explainable classifier. Across multiple datasets, our approach achieves better performance than the target black-box. In addition, our approach provides better explanations than existing techniques.

pdf bib
Attention Can Reflect Syntactic Structure (If You Let It)
Vinit Ravishankar | Artur Kulmizev | Mostafa Abdou | Anders Søgaard | Joakim Nivre

Since the popularization of the Transformer as a general-purpose feature encoder for NLP, many studies have attempted to decode linguistic structure from its novel multi-head attention mechanism. However, much of such work focused almost exclusively on English a language with rigid word order and a lack of inflectional morphology. In this study, we present decoding experiments for multilingual BERT across 18 languages in order to test the generalizability of the claim that dependency syntax is reflected in attention patterns. We show that full trees can be decoded above baseline accuracy from single attention heads, and that individual relations are often tracked by the same heads across languages. Furthermore, in an attempt to address recent debates about the status of attention as an explanatory mechanism, we experiment with fine-tuning mBERT on a supervised parsing objective while freezing different series of parameters. Interestingly, in steering the objective to learn explicit linguistic structure, we find much of the same structure represented in the resulting attention patterns, with interesting differences with respect to which parameters are frozen.

pdf bib
CDA : a Cost Efficient Content-based Multilingual Web Document AlignerCDA: a Cost Efficient Content-based Multilingual Web Document Aligner
Thuy Vu | Alessandro Moschitti

We introduce a Content-based Document Alignment approach (CDA), an efficient method to align multilingual web documents based on content in creating parallel training data for machine translation (MT) systems operating at the industrial level. CDA works in two steps : (i) projecting documents of a web domain to a shared multilingual space ; then (ii) aligning them based on the similarity of their representations in such space. We leverage lexical translation models to build vector representations using TFIDF. CDA achieves performance comparable with state-of-the-art systems in the WMT-16 Bilingual Document Alignment Shared Task benchmark while operating in multilingual space. Besides, we created two web-scale datasets to examine the robustness of CDA in an industrial setting involving up to 28 languages and millions of documents. The experiments show that CDA is robust, cost-effective, and is significantly superior in (i) processing large and noisy web data and (ii) scaling to new and low-resourced languages.

pdf bib
Is Supervised Syntactic Parsing Beneficial for Language Understanding Tasks? An Empirical Investigation
Goran Glavaš | Ivan Vulić

Traditional NLP has long held (supervised) syntactic parsing necessary for successful higher-level semantic language understanding (LU). The recent advent of end-to-end neural models, self-supervised via language modeling (LM), and their success on a wide range of LU tasks, however, questions this belief. In this work, we empirically investigate the usefulness of supervised parsing for semantic LU in the context of LM-pretrained transformer networks. Relying on the established fine-tuning paradigm, we first couple a pretrained transformer with a biaffine parsing head, aiming to infuse explicit syntactic knowledge from Universal Dependencies treebanks into the transformer. We then fine-tune the model for LU tasks and measure the effect of the intermediate parsing training (IPT) on downstream LU task performance. Results from both monolingual English and zero-shot language transfer experiments (with intermediate target-language parsing) show that explicit formalized syntax, injected into transformers through IPT, has very limited and inconsistent effect on downstream LU performance. Our results, coupled with our analysis of transformers’ representation spaces before and after intermediate parsing, make a significant step towards providing answers to an essential question : how (un)availing is supervised parsing for high-level semantic natural language understanding in the era of large neural models?

pdf bib
Facilitating Terminology Translation with Target Lemma Annotations
Toms Bergmanis | Mārcis Pinnis

Most of the recent work on terminology integration in machine translation has assumed that terminology translations are given already inflected in forms that are suitable for the target language sentence. In day-to-day work of professional translators, however, it is seldom the case as translators work with bilingual glossaries where terms are given in their dictionary forms ; finding the right target language form is part of the translation process. We argue that the requirement for apriori specified target language forms is unrealistic and impedes the practical applicability of previous work. In this work, we propose to train machine translation systems using a source-side data augmentation method that annotates randomly selected source language words with their target language lemmas. We show that systems trained on such augmented data are readily usable for terminology integration in real-life translation scenarios. Our experiments on terminology translation into the morphologically complex Baltic and Uralic languages show an improvement of up to 7 BLEU points over baseline systems with no means for terminology integration and an average improvement of 4 BLEU points over the previous work. Results of the human evaluation indicate a 47.7 % absolute improvement over the previous work in term translation accuracy when translating into Latvian.

pdf bib
Summarising Historical Text in Modern Languages
Xutan Peng | Yi Zheng | Chenghua Lin | Advaith Siddharthan

We introduce the task of historical text summarisation, where documents in historical forms of a language are summarised in the corresponding modern language. This is a fundamentally important routine to historians and digital humanities researchers but has never been automated. We compile a high-quality gold-standard text summarisation dataset, which consists of historical German and Chinese news from hundreds of years ago summarised in modern German or Chinese. Based on cross-lingual transfer learning techniques, we propose a summarisation model that can be trained even with no cross-lingual (historical to modern) parallel data, and further benchmark it against state-of-the-art algorithms. We report automatic and human evaluations that distinguish the historic to modern language summarisation task from standard cross-lingual summarisation (i.e., modern to modern language), highlight the distinctness and value of our dataset, and demonstrate that our transfer learning approach outperforms standard cross-lingual benchmarks on this task.

pdf bib
Challenges in Automated Debiasing for Toxic Language Detection
Xuhui Zhou | Maarten Sap | Swabha Swayamdipta | Yejin Choi | Noah Smith

Biased associations have been a challenge in the development of classifiers for detecting toxic language, hindering both fairness and accuracy. As potential solutions, we investigate recently introduced debiasing methods for text classification datasets and models, as applied to toxic language detection. Our focus is on lexical (e.g., swear words, slurs, identity mentions) and dialectal markers (specifically African American English). Our comprehensive experiments establish that existing methods are limited in their ability to prevent biased behavior in current toxicity detectors. We then propose an automatic, dialect-aware data correction method, as a proof-of-concept. Despite the use of synthetic labels, this method reduces dialectal associations with toxicity. Overall, our findings show that debiasing a model trained on biased toxic language data is not as effective as simply relabeling the data to remove existing biases.

pdf bib
Detecting Scenes in Fiction : A new Segmentation Task
Albin Zehe | Leonard Konle | Lea Katharina Dümpelmann | Evelyn Gius | Andreas Hotho | Fotis Jannidis | Lucas Kaufmann | Markus Krug | Frank Puppe | Nils Reiter | Annekea Schreiber | Nathalie Wiedmer

This paper introduces the novel task of scene segmentation on narrative texts and provides an annotated corpus, a discussion of the linguistic and narrative properties of the task and baseline experiments towards automatic solutions. A scene here is a segment of the text where time and discourse time are more or less equal, the narration focuses on one action and location and character constellations stay the same. The corpus we describe consists of German-language dime novels (550k tokens) that have been annotated in parallel, achieving an inter-annotator agreement of gamma = 0.7. Baseline experiments using BERT achieve an F1 score of 24 %, showing that the task is very challenging. An automatic scene segmentation paves the way towards processing longer narrative texts like tales or novels by breaking them down into smaller, coherent and meaningful parts, which is an important stepping stone towards the reconstruction of plot in Computational Literary Studies but also can serve to improve tasks like coreference resolution.

pdf bib
Expanding, Retrieving and Infilling : Diversifying Cross-Domain Question Generation with Flexible Templates
Xiaojing Yu | Anxiao Jiang

Sequence-to-sequence based models have recently shown promising results in generating high-quality questions. However, these models are also known to have main drawbacks such as lack of diversity and bad sentence structures. In this paper, we focus on question generation over SQL database and propose a novel framework by expanding, retrieving, and infilling that first incorporates flexible templates with a neural-based model to generate diverse expressions of questions with sentence structure guidance. Furthermore, a new activation / deactivation mechanism is proposed for template-based sequence-to-sequence generation, which learns to discriminate template patterns and content patterns, thus further improves generation quality. We conduct experiments on two large-scale cross-domain datasets. The experiments show that the superiority of our question generation method in producing more diverse questions while maintaining high quality and consistency under both automatic evaluation and human evaluation.

pdf bib
Handling Out-Of-Vocabulary Problem in Hangeul Word Embeddings
Ohjoon Kwon | Dohyun Kim | Soo-Ryeon Lee | Junyoung Choi | SangKeun Lee

Word embedding is considered an essential factor in improving the performance of various Natural Language Processing (NLP) models. However, it is hardly applicable in real-world datasets as word embedding is generally studied with a well-refined corpus. Notably, in Hangeul (Korean writing system), which has a unique writing system, various kinds of Out-Of-Vocabulary (OOV) appear from typos. In this paper, we propose a robust Hangeul word embedding model against typos, while maintaining high performance. The proposed model utilizes a Convolutional Neural Network (CNN) architecture with a channel attention mechanism that learns to infer the original word embeddings. The model train with a dataset that consists of a mix of typos and correct words. To demonstrate the effectiveness of the proposed model, we conduct three kinds of intrinsic and extrinsic tasks. While the existing embedding models fail to maintain stable performance as the noise level increases, the proposed model shows stable performance.

pdf bib
Exploiting Multimodal Reinforcement Learning for Simultaneous Machine Translation
Julia Ive | Andy Mingren Li | Yishu Miao | Ozan Caglayan | Pranava Madhyastha | Lucia Specia

This paper addresses the problem of simultaneous machine translation (SiMT) by exploring two main concepts : (a) adaptive policies to learn a good trade-off between high translation quality and low latency ; and (b) visual information to support this process by providing additional (visual) contextual information which may be available before the textual input is produced. For that, we propose a multimodal approach to simultaneous machine translation using reinforcement learning, with strategies to integrate visual and textual information in both the agent and the environment. We provide an exploration on how different types of visual information and integration strategies affect the quality and latency of simultaneous translation models, and demonstrate that visual cues lead to higher quality while keeping the latency low.

pdf bib
Do Multi-Hop Question Answering Systems Know How to Answer the Single-Hop Sub-Questions?
Yixuan Tang | Hwee Tou Ng | Anthony Tung

Multi-hop question answering (QA) requires a model to retrieve and integrate information from multiple passages to answer a question. Rapid progress has been made on multi-hop QA systems with regard to standard evaluation metrics, including EM and F1. However, by simply evaluating the correctness of the answers, it is unclear to what extent these systems have learned the ability to perform multi-hop reasoning. In this paper, we propose an additional sub-question evaluation for the multi-hop QA dataset HotpotQA, in order to shed some light on explaining the reasoning process of QA systems in answering complex questions. We adopt a neural decomposition model to generate sub-questions for a multi-hop question, followed by extracting the corresponding sub-answers. Contrary to our expectation, multiple state-of-the-art multi-hop QA models fail to answer a large portion of sub-questions, although the corresponding multi-hop questions are correctly answered. Our work takes a step forward towards building a more explainable multi-hop QA system.

pdf bib
Variational Weakly Supervised Sentiment Analysis with Posterior Regularization
Ziqian Zeng | Yangqiu Song

Sentiment analysis is an important task in natural language processing (NLP). Most of existing state-of-the-art methods are under the supervised learning paradigm. However, human annotations can be scarce. Thus, we should leverage more weak supervision for sentiment analysis. In this paper, we propose a posterior regularization framework for the variational approach to the weakly supervised sentiment analysis to better control the posterior distribution of the label assignment. The intuition behind the posterior regularization is that if extracted opinion words from two documents are semantically similar, the posterior distributions of two documents should be similar. Our experimental results show that the posterior regularization can improve the original variational approach to the weakly supervised sentiment analysis and the performance is more stable with smaller prediction variance.

pdf bib
Cognition-aware Cognate Detection
Diptesh Kanojia | Prashant Sharma | Sayali Ghodekar | Pushpak Bhattacharyya | Gholamreza Haffari | Malhar Kulkarni

Automatic detection of cognates helps downstream NLP tasks of Machine Translation, Cross-lingual Information Retrieval, Computational Phylogenetics and Cross-lingual Named Entity Recognition. Previous approaches for the task of cognate detection use orthographic, phonetic and semantic similarity based features sets. In this paper, we propose a novel method for enriching the feature sets, with cognitive features extracted from human readers’ gaze behaviour. We collect gaze behaviour data for a small sample of cognates and show that extracted cognitive features help the task of cognate detection. However, gaze data collection and annotation is a costly task. We use the collected gaze behaviour data to predict cognitive features for a larger sample and show that predicted cognitive features, also, significantly improve the task performance. We report improvements of 10 % with the collected gaze features, and 12 % using the predicted gaze features, over the previously proposed approaches. Furthermore, we release the collected gaze behaviour data along with our code and cross-lingual models.

pdf bib
Probing the Probing Paradigm : Does Probing Accuracy Entail Task Relevance?
Abhilasha Ravichander | Yonatan Belinkov | Eduard Hovy

Although neural models have achieved impressive results on several NLP benchmarks, little is understood about the mechanisms they use to perform language tasks. Thus, much recent attention has been devoted to analyzing the sentence representations learned by neural encoders, through the lens of ‘probing’ tasks. However, to what extent was the information encoded in sentence representations, as discovered through a probe, actually used by the model to perform its task? In this work, we examine this probing paradigm through a case study in Natural Language Inference, showing that models can learn to encode linguistic properties even if they are not needed for the task on which the model was trained. We further identify that pretrained word embeddings play a considerable role in encoding these properties rather than the training task itself, highlighting the importance of careful controls when designing probing experiments. Finally, through a set of controlled synthetic tasks, we demonstrate models can encode these properties considerably above chance-level, even when distributed in the data as random noise, calling into question the interpretation of absolute claims on probing tasks.

pdf bib
One-class Text Classification with Multi-modal Deep Support Vector Data Description
Chenlong Hu | Yukun Feng | Hidetaka Kamigaito | Hiroya Takamura | Manabu Okumura

This work presents multi-modal deep SVDD (mSVDD) for one-class text classification. By extending the uni-modal SVDD to a multiple modal one, we build mSVDD with multiple hyperspheres, that enable us to build a much better description for target one-class data. Additionally, the end-to-end architecture of mSVDD can jointly handle neural feature learning and one-class text learning. We also introduce a mechanism for incorporating negative supervision in the absence of real negative data, which can be beneficial to the mSVDD model. We conduct experiments on Reuters and 20 Newsgroup datasets, and the experimental results demonstrate that mSVDD outperforms uni-modal SVDD and mSVDD can get further improvements when negative supervision is incorporated.

pdf bib
Unsupervised Word Polysemy Quantification with Multiresolution Grids of Contextual Embeddings
Christos Xypolopoulos | Antoine Tixier | Michalis Vazirgiannis

The number of senses of a given word, or polysemy, is a very subjective notion, which varies widely across annotators and resources. We propose a novel method to estimate polysemy based on simple geometry in the contextual embedding space. Our approach is fully unsupervised and purely data-driven. Through rigorous experiments, we show that our rankings are well correlated, with strong statistical significance, with 6 different rankings derived from famous human-constructed resources such as WordNet, OntoNotes, Oxford, Wikipedia, etc., for 6 different standard metrics. We also visualize and analyze the correlation between the human rankings and make interesting observations. A valuable by-product of our method is the ability to sample, at no extra cost, sentences containing different senses of a given word. Finally, the fully unsupervised nature of our approach makes it applicable to any language. Code and data are publicly available

pdf bib
Disfluency Correction using Unsupervised and Semi-supervised Learning
Nikhil Saini | Drumil Trivedi | Shreya Khare | Tejas Dhamecha | Preethi Jyothi | Samarth Bharadwaj | Pushpak Bhattacharyya

Spoken language is different from the written language in its style and structure. Disfluencies that appear in transcriptions from speech recognition systems generally hamper the performance of downstream NLP tasks. Thus, a disfluency correction system that converts disfluent to fluent text is of great value. This paper introduces a disfluency correction model that translates disfluent to fluent text by drawing inspiration from recent encoder-decoder unsupervised style-transfer models for text. We also show considerable benefits in performance when utilizing a small sample of 500 parallel disfluent-fluent sentences in a semi-supervised way. Our unsupervised approach achieves a BLEU score of 79.39 on the Switchboard corpus test set, with further improvement to a BLEU score of 85.28 with semi-supervision. Both are comparable to two competitive fully-supervised models.

pdf bib
Complex Question Answering on knowledge graphs using machine translation and multi-task learning
Saurabh Srivastava | Mayur Patidar | Sudip Chowdhury | Puneet Agarwal | Indrajit Bhattacharya | Gautam Shroff

Question answering (QA) over a knowledge graph (KG) is a task of answering a natural language (NL) query using the information stored in KG. In a real-world industrial setting, this involves addressing multiple challenges including entity linking, multi-hop reasoning over KG, etc. Traditional approaches handle these challenges in a modularized sequential manner where errors in one module lead to the accumulation of errors in downstream modules. Often these challenges are inter-related and the solutions to them can reinforce each other when handled simultaneously in an end-to-end learning setup. To this end, we propose a multi-task BERT based Neural Machine Translation (NMT) model to address these challenges. Through experimental analysis, we demonstrate the efficacy of our proposed approach on one publicly available and one proprietary dataset.

pdf bib
Communicative-Function-Based Sentence Classification for Construction of an Academic Formulaic Expression Database
Kenichi Iwatsuki | Akiko Aizawa

Formulaic expressions (FEs), such as ‘in this paper, we propose’ are frequently used in scientific papers. FEs convey a communicative function (CF), i.e. ‘showing the aim of the paper’ in the above-mentioned example. Although CF-labelled FEs are helpful in assisting academic writing, the construction of FE databases requires manual labour for assigning CF labels. In this study, we considered a fully automated construction of a CF-labelled FE database using the topdown approach, in which the CF labels are first assigned to sentences, and then the FEs are extracted. For the CF-label assignment, we created a CF-labelled sentence dataset, on which we trained a SciBERT classifier. We show that the classifier and dataset can be used to construct FE databases of disciplines that are different from the training data. The accuracy of in-disciplinary classification was more than 80 %, while cross-disciplinary classification also worked well. We also propose an FE extraction method, which was applied to the CF-labelled sentences. Finally, we constructed and published a new, large CF-labelled FE database. The evaluation of the final CF-labelled FE database showed that approximately 65 % of the FEs are correct and useful, which is sufficiently high considering practical use.

pdf bib
From the Stage to the Audience : Propaganda on RedditReddit
Oana Balalau | Roxana Horincar

Political discussions revolve around ideological conflicts that often split the audience into two opposing parties. Both parties try to win the argument by bringing forward information. However, often this information is misleading, and its dissemination employs propaganda techniques. In this work, we analyze the impact of propaganda on six major political forums on Reddit that target a diverse audience in two countries, the US and the UK. We focus on three research questions : who is posting propaganda? how does propaganda differ across the political spectrum? and how is propaganda received on political forums?

pdf bib
Probing for idiomaticity in vector space models
Marcos Garcia | Tiago Kramer Vieira | Carolina Scarton | Marco Idiart | Aline Villavicencio

Contextualised word representation models have been successfully used for capturing different word usages and they may be an attractive alternative for representing idiomaticity in language. In this paper, we propose probing measures to assess if some of the expected linguistic properties of noun compounds, especially those related to idiomatic meanings, and their dependence on context and sensitivity to lexical choice, are readily available in some standard and widely used representations. For that, we constructed the Noun Compound Senses Dataset, which contains noun compounds and their paraphrases, in context neutral and context informative naturalistic sentences, in two languages : English and Portuguese. Results obtained using four types of probing measures with models like ELMo, BERT and some of its variants, indicate that idiomaticity is not yet accurately represented by contextualised models

pdf bib
Is the Understanding of Explicit Discourse Relations Required in Machine Reading Comprehension?
Yulong Wu | Viktor Schlegel | Riza Batista-Navarro

An in-depth analysis of the level of language understanding required by existing Machine Reading Comprehension (MRC) benchmarks can provide insight into the reading capabilities of machines. In this paper, we propose an ablation-based methodology to assess the extent to which MRC datasets evaluate the understanding of explicit discourse relations. We define seven MRC skills which require the understanding of different discourse relations. We then introduce ablation methods that verify whether these skills are required to succeed on a dataset. By observing the drop in performance of neural MRC models evaluated on the original and the modified dataset, we can measure to what degree the dataset requires these skills, in order to be understood correctly. Experiments on three large-scale datasets with the BERT-base and ALBERT-xxlarge model show that the relative changes for all skills are small (less than 6 %). These results imply that most of the answered questions in the examined datasets do not require understanding the discourse structure of the text. To specifically probe for natural language understanding, there is a need to design more challenging benchmarks that can correctly evaluate the intended skills.

pdf bib
Meta-Learning for Effective Multi-task and Multilingual Modelling
Ishan Tarunesh | Sushil Khyalia | Vishwajeet Kumar | Ganesh Ramakrishnan | Preethi Jyothi

Natural language processing (NLP) tasks (e.g. question-answering in English) benefit from knowledge of other tasks (e.g., named entity recognition in English) and knowledge of other languages (e.g., question-answering in Spanish). Such shared representations are typically learned in isolation, either across tasks or across languages. In this work, we propose a meta-learning approach to learn the interactions between both tasks and languages. We also investigate the role of different sampling strategies used during meta-learning. We present experiments on five different tasks and six different languages from the XTREME multilingual benchmark dataset. Our meta-learned model clearly improves in performance compared to competitive baseline models that also include multi-task baselines. We also present zero-shot evaluations on unseen target languages to demonstrate the utility of our proposed model.

pdf bib
Two Training Strategies for Improving Relation Extraction over Universal Graph
Qin Dai | Naoya Inoue | Ryo Takahashi | Kentaro Inui

This paper explores how the Distantly Supervised Relation Extraction (DS-RE) can benefit from the use of a Universal Graph (UG), the combination of a Knowledge Graph (KG) and a large-scale text collection. A straightforward extension of a current state-of-the-art neural model for DS-RE with a UG may lead to degradation in performance. We first report that this degradation is associated with the difficulty in learning a UG and then propose two training strategies : (1) Path Type Adaptive Pretraining, which sequentially trains the model with different types of UG paths so as to prevent the reliance on a single type of UG path ; and (2) Complexity Ranking Guided Attention mechanism, which restricts the attention span according to the complexity of a UG path so as to force the model to extract features not only from simple UG paths but also from complex ones. Experimental results on both biomedical and NYT10 datasets prove the robustness of our methods and achieve a new state-of-the-art result on the NYT10 dataset. The code and datasets used in this paper are available at


bib (full) Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

pdf bib
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations
Dimitra Gkatzia | Djamé Seddah

pdf bib
Using and comparing Rhetorical Structure Theory parsers with rst-workbenchRhetorical Structure Theory parsers with rst-workbench
Arne Neumann

I present rst-workbench, a software package that simplifies the installation and usage of numerous end-to-end Rhetorical Structure Theory (RST) parsers. The tool offers a web-based interface that allows users to enter text and let multiple RST parsers generate analyses concurrently. The resulting RST trees can be compared visually, manually post-edited (in the browser) and stored for later usage.

pdf bib
MATILDA-Multi-AnnoTator multi-language InteractiveLight-weight Dialogue AnnotatorMATILDA - Multi-AnnoTator multi-language InteractiveLight-weight Dialogue Annotator
Davide Cucurnia | Nikolai Rozanov | Irene Sucameli | Augusto Ciuffoletti | Maria Simi

Dialogue Systems are becoming ubiquitous in various forms and shapes-virtual assistants(Siri, Alexa, etc.), chat-bots, customer sup-port, chit-chat systems just to name a few. The advances in language models and their publication have democratised advanced NLP.However, data remains a crucial bottleneck. Our contribution to this essential pillar isMATILDA, to the best of our knowledge the first multi-annotator, multi-language dialogue annotation tool. MATILDA allows the creation of corpora, the management of users, the annotation of dialogues, the quick adaptation of the user interface to any language and the resolution of inter-annotator disagreement. We evaluate the tool on ease of use, annotation speed and interannotation resolution for both experts and novices and conclude that this tool not only supports the full pipeline for dialogue annotation, but also allows non-technical people to easily use it. We are completely open-sourcing the tool at and provide a tutorial video1.

pdf bib
Forum 4.0 : An Open-Source User Comment Analysis Framework
Marlo Haering | Jakob Smedegaard Andersen | Chris Biemann | Wiebke Loosen | Benjamin Milde | Tim Pietz | Christian Stöcker | Gregor Wiedemann | Olaf Zukunft | Walid Maalej

With the increasing number of user comments in diverse domains, including comments on online journalism and e-commerce websites, the manual content analysis of these comments becomes time-consuming and challenging. However, research showed that user comments contain useful information for different domain experts, which is thus worth finding and utilizing. This paper introduces Forum 4.0, an open-source framework to semi-automatically analyze, aggregate, and visualize user comments based on labels defined by domain experts. We demonstrate the applicability of Forum 4.0 with comments analytics scenarios within the domains of online journalism and app stores. We outline the underlying container architecture, including the web-based user interface, the machine learning component, and the task manager for time-consuming tasks. We finally conduct machine learning experiments with simulated annotations and different sampling strategies on existing datasets from both domains to evaluate Forum 4.0’s performance. Forum 4.0 achieves promising classification results (ROC-AUC 0.9 with 100 annotated samples), utilizing transformer-based embeddings with a lightweight logistic regression model. We explain how Forum 4.0’s architecture is applicable for millions of user comments in real-time, yet at feasible training and classification costs.

pdf bib
SLTEV : Comprehensive Evaluation of Spoken Language TranslationSLTEV: Comprehensive Evaluation of Spoken Language Translation
Ebrahim Ansari | Ondřej Bojar | Barry Haddow | Mohammad Mahmoudi

Automatic evaluation of Machine Translation (MT) quality has been investigated over several decades. Spoken Language Translation (SLT), esp. when simultaneous, needs to consider additional criteria and does not have a standard evaluation procedure and a widely used toolkit. To fill the gap, we develop SLTev, an open-source tool for assessing SLT in a comprehensive way. SLTev reports the quality, latency, and stability of an SLT candidate output based on the time-stamped transcript and reference translation into a target language. For quality, we rely on sacreBLEU which provides MT evaluation measures such as chrF or BLEU. For latency, we propose two new scoring techniques. For stability, we extend the previously defined measures with a normalized Flicker in our work. We also propose a new averaging of older measures. A preliminary version of SLTev was used in the IWSLT 2020 shared task. Moreover, a growing collection of test datasets directly accessible by SLTev are provided for system evaluation comparable across papers.

pdf bib
A Dashboard for Mitigating the COVID-19 MisinfodemicCOVID-19 Misinfodemic
Zhengyuan Zhu | Kevin Meng | Josue Caraballo | Israa Jaradat | Xiao Shi | Zeyu Zhang | Farahnaz Akrami | Haojin Liao | Fatma Arslan | Damian Jimenez | Mohanmmed Samiul Saeef | Paras Pathak | Chengkai Li

This paper describes the current milestones achieved in our ongoing project that aims to understand the surveillance of, impact of and intervention on COVID-19 misinfodemic on Twitter. Specifically, it introduces a public dashboard which, in addition to displaying case counts in an interactive map and a navigational panel, also provides some unique features not found in other places. Particularly, the dashboard uses a curated catalog of COVID-19 related facts and debunks of misinformation, and it displays the most prevalent information from the catalog among Twitter users in user-selected U.S. geographic regions. The paper explains how to use BERT models to match tweets with the facts and misinformation and to detect their stance towards such information. The paper also discusses the results of preliminary experiments on analyzing the spatio-temporal spread of misinformation.

pdf bib
A description and demonstration of SAFAR frameworkSAFAR framework
Karim Bouzoubaa | Younes Jaafar | Driss Namly | Ridouane Tachicart | Rachida Tajmout | Hakima Khamar | Hamid Jaafar | Lhoussain Aouragh | Abdellah Yousfi

Several tools and resources have been developed to deal with Arabic NLP. However, a homogenous and flexible Arabic environment that gathers these components is rarely available. In this perspective, we introduce SAFAR which is a monolingual framework developed in accordance with software engineering requirements and dedicated to Arabic language, especially, the modern standard Arabic and Moroccan dialect. After one decade of integration and development, SAFAR possesses today more than 50 tools and resources that can be exploited either using its API or using its web interface.

pdf bib
LOME : Large Ontology Multilingual ExtractionLOME: Large Ontology Multilingual Extraction
Patrick Xia | Guanghui Qin | Siddharth Vashishtha | Yunmo Chen | Tongfei Chen | Chandler May | Craig Harman | Kyle Rawlins | Aaron Steven White | Benjamin Van Durme

We present LOME, a system for performing multilingual information extraction. Given a text document as input, our core system identifies spans of textual entity and event mentions with a FrameNet (Baker et al., 1998) parser. It subsequently performs coreference resolution, fine-grained entity typing, and temporal relation prediction between events. By doing so, the system constructs an event and entity focused knowledge graph. We can further apply third-party modules for other types of annotation, like relation extraction. Our (multilingual) first-party modules either outperform or are competitive with the (monolingual) state-of-the-art. We achieve this through the use of multilingual encoders like XLM-R (Conneau et al., 2020) and leveraging multilingual training data. LOME is available as a Docker container on Docker Hub. In addition, a lightweight version of the system is accessible as a web demo.

pdf bib
Graph Matching and Graph Rewriting : GREW tools for corpus exploration, maintenance and conversionGREW tools for corpus exploration, maintenance and conversion
Bruno Guillaume

This article presents a set of tools built around the Graph Rewriting computational framework which can be used to compute complex rule-based transformations on linguistic structures. Application of the graph matching mechanism for corpus exploration, error mining or quantitative typology are also given.

pdf bib
Massive Choice, Ample Tasks (MaChAmp): A Toolkit for Multi-task Learning in NLPMaChAmp): A Toolkit for Multi-task Learning in NLP
Rob van der Goot | Ahmet Üstün | Alan Ramponi | Ibrahim Sharaf | Barbara Plank

Transfer learning, particularly approaches that combine multi-task learning with pre-trained contextualized embeddings and fine-tuning, have advanced the field of Natural Language Processing tremendously in recent years. In this paper we present MaChAmp, a toolkit for easy fine-tuning of contextualized embeddings in multi-task settings. The benefits of MaChAmp are its flexible configuration options, and the support of a variety of natural language processing tasks in a uniform toolkit, from text classification and sequence labeling to dependency parsing, masked language modeling, and text generation.

pdf bib
European Language Grid : A Joint Platform for the European Language Technology CommunityEuropean Language Grid: A Joint Platform for the European Language Technology Community
Georg Rehm | Stelios Piperidis | Kalina Bontcheva | Jan Hajic | Victoria Arranz | Andrejs Vasiļjevs | Gerhard Backfried | Jose Manuel Gomez-Perez | Ulrich Germann | Rémi Calizzano | Nils Feldhus | Stefanie Hegele | Florian Kintzel | Katrin Marheinecke | Julian Moreno-Schneider | Dimitris Galanis | Penny Labropoulou | Miltos Deligiannis | Katerina Gkirtzou | Athanasia Kolovou | Dimitris Gkoumas | Leon Voukoutis | Ian Roberts | Jana Hamrlova | Dusan Varis | Lukas Kacena | Khalid Choukri | Valérie Mapelli | Mickaël Rigault | Julija Melnika | Miro Janosik | Katja Prinz | Andres Garcia-Silva | Cristian Berrio | Ondrej Klejch | Steve Renals

Europe is a multilingual society, in which dozens of languages are spoken. The only option to enable and to benefit from multilingualism is through Language Technologies (LT), i.e., Natural Language Processing and Speech Technologies. We describe the European Language Grid (ELG), which is targeted to evolve into the primary platform and marketplace for LT in Europe by providing one umbrella platform for the European LT landscape, including research and industry, enabling all stakeholders to upload, share and distribute their services, products and resources. At the end of our EU project, which will establish a legal entity in 2022, the ELG will provide access to approx. 1300 services for all European languages as well as thousands of data sets.

pdf bib
A New Surprise Measure for Extracting Interesting Relationships between Persons
Hidetaka Kamigaito | Jingun Kwon | Young-In Song | Manabu Okumura

One way to enhance user engagement in search engines is to suggest interesting facts to the user. Although relationships between persons are important as a target for text mining, there are few effective approaches for extracting the interesting relationships between persons. We therefore propose a method for extracting interesting relationships between persons from natural language texts by focusing on their surprisingness. Our method first extracts all personal relationships from dependency trees for the texts and then calculates surprise scores for distributed representations of the extracted relationships in an unsupervised manner. The unique point of our method is that it does not require any labeled dataset with annotation for the surprising personal relationships. The results of the human evaluation show that the proposed method could extract more interesting relationships between persons from Japanese Wikipedia articles than a popularity-based baseline method. We demonstrate our proposed method as a chrome plugin on google search.

pdf bib
Paladin : an annotation tool based on active and proactive learning
Minh-Quoc Nghiem | Paul Baylis | Sophia Ananiadou

In this paper, we present Paladin, an open-source web-based annotation tool for creating high-quality multi-label document-level datasets. By integrating active learning and proactive learning to the annotation task, Paladin makes the task less time-consuming and requiring less human effort. Although Paladin is designed for multi-label settings, the system is flexible and can be adapted to other tasks in single-label settings.

pdf bib
Story Centaur : Large Language Model Few Shot Learning as a Creative Writing Tool
Ben Swanson | Kory Mathewson | Ben Pietrzak | Sherol Chen | Monica Dinalescu

Few shot learning with large language models has the potential to give individuals without formal machine learning training the access to a wide range of text to text models. We consider how this applies to creative writers and present Story Centaur, a user interface for prototyping few shot models and a set of recombinable web components that deploy them. Story Centaur’s goal is to expose creative writers to few shot learning with a simple but powerful interface that lets them compose their own co-creation tools that further their own unique artistic directions. We build out several examples of such tools, and in the process probe the boundaries and issues surrounding generation with large language models.

pdf bib
OCTIS : Comparing and Optimizing Topic models is Simple !OCTIS: Comparing and Optimizing Topic models is Simple!
Silvia Terragni | Elisabetta Fersini | Bruno Giovanni Galuzzi | Pietro Tropeano | Antonio Candelieri

In this paper, we present OCTIS, a framework for training, analyzing, and comparing Topic Models, whose optimal hyper-parameters are estimated using a Bayesian Optimization approach. The proposed solution integrates several state-of-the-art topic models and evaluation metrics. These metrics can be targeted as objective by the underlying optimization procedure to determine the best hyper-parameter configuration. OCTIS allows researchers and practitioners to have a fair comparison between topic models of interest, using several benchmark datasets and well-known evaluation metrics, to integrate novel algorithms, and to have an interactive visualization of the results for understanding the behavior of each model. The code is available at the following link :

pdf bib
Breaking Writer’s Block : Low-cost Fine-tuning of Natural Language Generation Models
Alexandre Duval | Thomas Lamson | Gaël de Léséleuc de Kérouara | Matthias Gallé

It is standard procedure these days to solve Information Extraction task by fine-tuning large pre-trained language models. This is not the case for generation task, which relies on a variety of techniques for controlled language generation. In this paper, we describe a system that fine-tunes a natural language generation model for the problem of solving writer’s block. The fine-tuning changes the conditioning to also include the right context in addition to the left context, as well as an optional list of entities, the size, the genre and a summary of the paragraph that the human author wishes to generate. Our proposed fine-tuning obtains excellent results, even with a small number of epochs and a total cost of USD 150. The system can be accessed as a web-service and all the code is released. A video showcasing the interface and the model is also available.

pdf bib
Domain Expert Platform for Goal-Oriented Dialog Collection
Didzis Goško | Arturs Znotins | Inguna Skadina | Normunds Gruzitis | Gunta Nešpore-Bērzkalne

Today, most dialogue systems are fully or partly built using neural network architectures. A crucial prerequisite for the creation of a goal-oriented neural network dialogue system is a dataset that represents typical dialogue scenarios and includes various semantic annotations, e.g. intents, slots and dialogue actions, that are necessary for training a particular neural network architecture. In this demonstration paper, we present an easy to use interface and its back-end which is oriented to domain experts for the collection of goal-oriented dialogue samples. The platform not only allows to collect or write sample dialogues in a structured way, but also provides a means for simple annotation and interpretation of the dialogues. The platform itself is language-independent ; it depends only on the availability of particular language processing components for a specific language. It is currently being used to collect dialogue samples in Latvian (a highly inflected language) which represent typical communication between students and the student service.

pdf bib
Which is Better for Deep Learning : Python or MATLAB? Answering Comparative Questions in Natural LanguageMATLAB? Answering Comparative Questions in Natural Language
Viktoriia Chekalina | Alexander Bondarenko | Chris Biemann | Meriem Beloucif | Varvara Logacheva | Alexander Panchenko

We present a system for answering comparative questions (Is X better than Y with respect to Z?) in natural language. Answering such questions is important for assisting humans in making informed decisions. The key component of our system is a natural language interface for comparative QA that can be used in personal assistants, chatbots, and similar NLP devices. Comparative QA is a challenging NLP task, since it requires collecting support evidence from many different sources, and direct comparisons of rare objects may be not available even on the entire Web. We take the first step towards a solution for such a task offering a testbed for comparative QA in natural language by probing several methods, making the three best ones available as an online demo.


bib (full) Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

pdf bib
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop
Ionut-Teodor Sorodoc | Madhumita Sushil | Ece Takmaz | Eneko Agirre

pdf bib
Computationally Efficient Wasserstein Loss for Structured LabelsWasserstein Loss for Structured Labels
Ayato Toyokuni | Sho Yokoi | Hisashi Kashima | Makoto Yamada

The problem of estimating the probability distribution of labels has been widely studied as a label distribution learning (LDL) problem, whose applications include age estimation, emotion analysis, and semantic segmentation. We propose a tree-Wasserstein distance regularized LDL algorithm, focusing on hierarchical text classification tasks. We propose predicting the entire label hierarchy using neural networks, where the similarity between predicted and true labels is measured using the tree-Wasserstein distance. Through experiments using synthetic and real-world datasets, we demonstrate that the proposed method successfully considers the structure of labels during training, and it compares favorably with the Sinkhorn algorithm in terms of computation time and memory usage.

pdf bib
Have Attention Heads in BERT Learned Constituency Grammar?BERT Learned Constituency Grammar?
Ziyang Luo

With the success of pre-trained language models in recent years, more and more researchers focus on opening the black box of these models. Following this interest, we carry out a qualitative and quantitative analysis of constituency grammar in attention heads of BERT and RoBERTa. We employ the syntactic distance method to extract implicit constituency grammar from the attention weights of each head. Our results show that there exist heads that can induce some grammar types much better than baselines, suggesting that some heads act as a proxy for constituency grammar. We also analyze how attention heads’ constituency grammar inducing (CGI) ability changes after fine-tuning with two kinds of tasks, including sentence meaning similarity (SMS) tasks and natural language inference (NLI) tasks. Our results suggest that SMS tasks decrease the average CGI ability of upper layers, while NLI tasks increase it. Lastly, we investigate the connections between CGI ability and natural language understanding ability on QQP and MNLI tasks.

pdf bib
Do we read what we hear? Modeling orthographic influences on spoken word recognition
Nicole Macher | Badr M. Abdullah | Harm Brouwer | Dietrich Klakow

Theories and models of spoken word recognition aim to explain the process of accessing lexical knowledge given an acoustic realization of a word form. There is consensus that phonological and semantic information is crucial for this process. However, there is accumulating evidence that orthographic information could also have an impact on auditory word recognition. This paper presents two models of spoken word recognition that instantiate different hypotheses regarding the influence of orthography on this process. We show that these models reproduce human-like behavior in different ways and provide testable hypotheses for future research on the source of orthographic effects in spoken word recognition.

pdf bib
Automatically Cataloging Scholarly Articles using Library of Congress Subject Headings
Nazmul Kazi | Nathaniel Lane | Indika Kahanda

Institutes are required to catalog their articles with proper subject headings so that the users can easily retrieve relevant articles from the institutional repositories. However, due to the rate of proliferation of the number of articles in these repositories, it is becoming a challenge to manually catalog the newly added articles at the same pace. To address this challenge, we explore the feasibility of automatically annotating articles with Library of Congress Subject Headings (LCSH). We first use web scraping to extract keywords for a collection of articles from the Repository Analytics and Metrics Portal (RAMP). Then, we map these keywords to LCSH names for developing a gold-standard dataset. As a case study, using the subset of Biology-related LCSH concepts, we develop predictive models by formulating this task as a multi-label classification problem. Our experimental results demonstrate the viability of this approach for predicting LCSH for scholarly articles.

pdf bib
Contrasting distinct structured views to learn sentence embeddings
Antoine Simoulin | Benoit Crabbé

We propose a self-supervised method that builds sentence embeddings from the combination of diverse explicit syntactic structures of a sentence. We assume structure is crucial to building consistent representations as we expect sentence meaning to be a function of both syntax and semantic aspects. In this perspective, we hypothesize that some linguistic representations might be better adapted given the considered task or sentence. We, therefore, propose to learn individual representation functions for different syntactic frameworks jointly. Again, by hypothesis, all such functions should encode similar semantic information differently and consequently, be complementary for building better sentential semantic embeddings. To assess such hypothesis, we propose an original contrastive multi-view framework that induces an explicit interaction between models during the training phase. We make experiments combining various structures such as dependency, constituency, or sequential schemes. Our results outperform comparable methods on several tasks from standard sentence embedding benchmarks.

pdf bib
Discrete Reasoning Templates for Natural Language Understanding
Hadeel Al-Negheimish | Pranava Madhyastha | Alessandra Russo

Reasoning about information from multiple parts of a passage to derive an answer is an open challenge for reading-comprehension models. In this paper, we present an approach that reasons about complex questions by decomposing them to simpler subquestions that can take advantage of single-span extraction reading-comprehension models, and derives the final answer according to instructions in a predefined reasoning template. We focus on subtraction based arithmetic questions and evaluate our approach on a subset of the DROP dataset. We show that our approach is competitive with the state of the art while being interpretable and requires little supervision.

pdf bib
Development of Conversational AI for Sleep Coaching ProgrammeAI for Sleep Coaching Programme
Heereen Shim

Almost 30 % of the adult population in the world is experiencing or has experience insomnia. Cognitive Behaviour Therapy for insomnia (CBT-I) is one of the most effective treatment, but it has limitations on accessibility and availability. Utilising technology is one of the possible solutions, but existing methods neglect conversational aspects, which plays a critical role in sleep therapy. To address this issue, we propose a PhD project exploring potentials of developing conversational artificial intelligence (AI) for a sleep coaching programme, which is motivated by CBT-I treatment. This PhD project aims to develop natural language processing (NLP) algorithms to allow the system to interact naturally with a user and provide automated analytic system to support human experts. In this paper, we introduce research questions lying under three phases of the sleep coaching programme : triaging, monitoring the progress, and providing coaching. We expect this research project’s outcomes could contribute to the research domains of NLP and AI but also the healthcare field by providing a more accessible and affordable sleep treatment solution and an automated analytic system to lessen the burden of human experts.

pdf bib
Relating Relations : Meta-Relation Extraction from Online Health Forum Posts
Daniel Stickley

Relation extraction is a key task in knowledge extraction, and is commonly defined as the task of identifying relations that hold between entities in text. This thesis proposal addresses the specific task of identifying meta-relations, a higher order family of relations naturally construed as holding between other relations which includes temporal, comparative, and causal relations. More specifically, we aim to develop theoretical underpinnings and practical solutions for the challenges of (1) incorporating meta-relations into conceptualisations and annotation schemes for (lower-order) relations and named entities, (2) obtaining annotations for them with tolerable cognitive load on annotators, (3) creating models capable of reliably extracting meta-relations, and related to that (4) addressing the limited-data problem exacerbated by the introduction of meta-relations into the learning task. We explore recent works in relation extraction and discuss our plans to formally conceptualise meta-relations for the domain of user-generated health texts, and create a new dataset, annotation scheme and models for meta-relation extraction.

pdf bib
Beyond the English Web : Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of RegistersEnglish Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers
Liina Repo | Valtteri Skantsi | Samuel Rönnqvist | Saara Hellström | Miika Oinonen | Anna Salmela | Douglas Biber | Jesse Egbert | Sampo Pyysalo | Veronika Laippala

We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news are one of the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new register-annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perform strongly in these languages and outperform previous state-of-the-art in English and Finnish. Specifically, we show 1) that zero-shot cross-lingual transfer from the large English CORE corpus can match or surpass previously published monolingual models, and 2) that lightweight monolingual classification requiring very little training data can reach or surpass our zero-shot performance. We further analyse classification results finding that certain registers continue to pose challenges in particular for cross-lingual transfer.

pdf bib
Why Find the Right One?
Payal Khullar

The present paper investigates the impact of the anaphoric one words in English on the Neural Machine Translation (NMT) process using English-Hindi as source and target language pair. As expected, the experimental results show that the state-of-the-art Google English-Hindi NMT system achieves significantly poorly on sentences containing anaphoric ones as compared to the sentences containing regular, non-anaphoric ones. But, more importantly, we note that amongst the anaphoric words, the noun class is clearly much harder for NMT than the determinatives. This reaffirms the linguistic disparity of the two phenomenon in recent theoretical syntactic literature, despite the obvious surface similarities.


bib (full) Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

pdf bib
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts
Isabelle Augenstein | Ivan Habernal

pdf bib
Unsupervised Natural Language Parsing (Introductory Tutorial)
Kewei Tu | Yong Jiang | Wenjuan Han | Yanpeng Zhao

Unsupervised parsing learns a syntactic parser from training sentences without parse tree annotations. Recently, there has been a resurgence of interest in unsupervised parsing, which can be attributed to the combination of two trends in the NLP community : a general trend towards unsupervised training or pre-training, and an emerging trend towards finding or modeling linguistic structures in neural models. In this tutorial, we will introduce to the general audience what unsupervised parsing does and how it can be useful for and beyond syntactic parsing. We will then provide a systematic overview of major classes of approaches to unsupervised parsing, namely generative and discriminative approaches, and analyze their relative strengths and weaknesses. We will cover both decade-old statistical approaches and more recent neural approaches to give the audience a sense of the historical and recent development of the field. We will also discuss emerging research topics such as BERT-based approaches and visually grounded learning.

pdf bib
Aggregating and Learning from Multiple Annotators
Silviu Paun | Edwin Simpson

The success of NLP research is founded on high-quality annotated datasets, which are usually obtained from multiple expert annotators or crowd workers. The standard practice to training machine learning models is to first adjudicate the disagreements and then perform the training. To this end, there has been a lot of work on aggregating annotations, particularly for classification tasks. However, many other tasks, particularly in NLP, have unique characteristics not considered by standard models of annotation, e.g., label interdependencies in sequence labelling tasks, unrestricted labels for anaphoric annotation, or preference labels for ranking texts. In recent years, researchers have picked up on this and are covering the gap. A first objective of this tutorial is to connect NLP researchers with state-of-the-art aggregation models for a diverse set of canonical language annotation tasks. There is also a growing body of recent work arguing that following the convention and training with adjudicated labels ignores any uncertainty the labellers had in their classifications, which results in models with poorer generalisation capabilities. Therefore, a second objective of this tutorial is to teach NLP workers how they can augment their (deep) neural models to learn from data with multiple interpretations.

pdf bib
Tutorial Proposal : End-to-End Speech Translation
Jan Niehues | Elizabeth Salesky | Marco Turchi | Matteo Negri

Speech translation is the translation of speech in one language typically to text in another, traditionally accomplished through a combination of automatic speech recognition and machine translation. Speech translation has attracted interest for many years, but the recent successful applications of deep learning to both individual tasks have enabled new opportunities through joint modeling, in what we today call ‘end-to-end speech translation.’ In this tutorial we will introduce the techniques used in cutting-edge research on speech translation. Starting from the traditional cascaded approach, we will given an overview on data sources and model architectures to achieve state-of-the art performance with end-to-end speech translation for both high- and low-resource languages. In addition, we will discuss methods to evaluate analyze the proposed solutions, as well as the challenges faced when applying speech translation models for real-world applications.


bib (full) Proceedings of the Second Workshop on Domain Adaptation for NLP

pdf bib
Proceedings of the Second Workshop on Domain Adaptation for NLP
Eyal Ben-David | Shay Cohen | Ryan McDonald | Barbara Plank | Roi Reichart | Guy Rotman | Yftah Ziser

pdf bib
Challenges in Annotating and Parsing Spoken, Code-switched, Frisian-Dutch DataFrisian-Dutch Data
Anouck Braggaar | Rob van der Goot

While high performance have been obtained for high-resource languages, performance on low-resource languages lags behind. In this paper we focus on the parsing of the low-resource language Frisian. We use a sample of code-switched, spontaneously spoken data, which proves to be a challenging setup. We propose to train a parser specifically tailored towards the target domain, by selecting instances from multiple treebanks. Specifically, we use Latent Dirichlet Allocation (LDA), with word and character N-grams. We use a deep biaffine parser initialized with mBERT. The best single source treebank (nl_alpino) resulted in an LAS of 54.7 whereas our data selection outperformed the single best transfer treebank and led to 55.6 LAS on the test data. Additional experiments consisted of removing diacritics from our Frisian data, creating more similar training data by cropping sentences and running our best model using XLM-R. These experiments did not lead to a better performance.

pdf bib
Addressing Zero-Resource Domains Using Document-Level Context in Neural Machine Translation
Dario Stojanovski | Alexander Fraser

Achieving satisfying performance in machine translation on domains for which there is no training data is challenging. Traditional supervised domain adaptation is not suitable for addressing such zero-resource domains because it relies on in-domain parallel data. We show that when in-domain parallel data is not available, access to document-level context enables better capturing of domain generalities compared to only having access to a single sentence. Having access to more information provides a more reliable domain estimation. We present two document-level Transformer models which are capable of using large context sizes and we compare these models against strong Transformer baselines. We obtain improvements for the two zero-resource domains we study. We additionally provide an analysis where we vary the amount of context and look at the case where in-domain data is available.

pdf bib
BERTologiCoMix : How does Code-Mixing interact with Multilingual BERT?BERTologiCoMix: How does Code-Mixing interact with Multilingual BERT?
Sebastin Santy | Anirudh Srinivasan | Monojit Choudhury

Models such as mBERT and XLMR have shown success in solving Code-Mixed NLP tasks even though they were not exposed to such text during pretraining. Code-Mixed NLP models have relied on using synthetically generated data along with naturally occurring data to improve their performance. Finetuning mBERT on such data improves it’s code-mixed performance, but the benefits of using the different types of Code-Mixed data are n’t clear. In this paper, we study the impact of finetuning with different types of code-mixed data and outline the changes that occur to the model during such finetuning. Our findings suggest that using naturally occurring code-mixed data brings in the best performance improvement after finetuning and that finetuning with any type of code-mixed text improves the responsivity of it’s attention heads to code-mixed text inputs.

pdf bib
Locality Preserving Loss : Neighbors that Live together, Align together
Ashwinkumar Ganesan | Francis Ferraro | Tim Oates

We present a locality preserving loss (LPL) that improves the alignment between vector space embeddings while separating uncorrelated representations. Given two pretrained embedding manifolds, LPL optimizes a model to project an embedding and maintain its local neighborhood while aligning one manifold to another. This reduces the overall size of the dataset required to align the two in tasks such as crosslingual word alignment. We show that the LPL-based alignment between input vector spaces acts as a regularizer, leading to better and consistent accuracy than the baseline, especially when the size of the training set is small. We demonstrate the effectiveness of LPL-optimized alignment on semantic text similarity (STS), natural language inference (SNLI), multi-genre language inference (MNLI) and cross-lingual word alignment (CLA) showing consistent improvements, finding up to 16 % improvement over our baseline in lower resource settings.

pdf bib
Trajectory-Based Meta-Learning for Out-Of-Vocabulary Word Embedding Learning
Gordon Buck | Andreas Vlachos

Word embedding learning methods require a large number of occurrences of a word to accurately learn its embedding. However, out-of-vocabulary (OOV) words which do not appear in the training corpus emerge frequently in the smaller downstream data. Recent work formulated OOV embedding learning as a few-shot regression problem and demonstrated that meta-learning can improve results obtained. However, the algorithm used, model-agnostic meta-learning (MAML) is known to be unstable and perform worse when a large number of gradient steps are used for parameter updates. In this work, we propose the use of Leap, a meta-learning algorithm which leverages the entire trajectory of the learning process instead of just the beginning and the end points, and thus ameliorates these two issues. In our experiments on a benchmark OOV embedding learning dataset and in an extrinsic evaluation, Leap performs comparably or better than MAML. We go on to examine which contexts are most beneficial to learn an OOV embedding from, and propose that the choice of contexts may matter more than the meta-learning employed.

pdf bib
An Empirical Study of Compound PCFGsPCFGs
Yanpeng Zhao | Ivan Titov

Compound probabilistic context-free grammars (C-PCFGs) have recently established a new state of the art for phrase-structure grammar induction. However, due to the high time-complexity of chart-based representation and inference, it is difficult to investigate them comprehensively. In this work, we rely on a fast implementation of C-PCFGs to conduct evaluation complementary to that of (CITATION). We highlight three key findings : (1) C-PCFGs are data-efficient, (2) C-PCFGs make the best use of global sentence-level information in preterminal rule probabilities, and (3) the best configurations of C-PCFGs on English do not always generalize to morphology-rich languages.

pdf bib
Effective Distant Supervision for Temporal Relation Extraction
Xinyu Zhao | Shih-Ting Lin | Greg Durrett

A principal barrier to training temporal relation extraction models in new domains is the lack of varied, high quality examples and the challenge of collecting more. We present a method of automatically collecting distantly-supervised examples of temporal relations. We scrape and automatically label event pairs where the temporal relations are made explicit in text, then mask out those explicit cues, forcing a model trained on this data to learn other signals. We demonstrate that a pre-trained Transformer model is able to transfer from the weakly labeled examples to human-annotated benchmarks in both zero-shot and few-shot settings, and that the masking scheme is important in improving generalization.

pdf bib
Zero-Shot Cross-Lingual Dependency Parsing through Contextual Embedding Transformation
Haoran Xu | Philipp Koehn

Linear embedding transformation has been shown to be effective for zero-shot cross-lingual transfer tasks and achieve surprisingly promising results. However, cross-lingual embedding space mapping is usually studied in static word-level embeddings, where a space transformation is derived by aligning representations of translation pairs that are referred from dictionaries. We move further from this line and investigate a contextual embedding alignment approach which is sense-level and dictionary-free. To enhance the quality of the mapping, we also provide a deep view of properties of contextual embeddings, i.e., the anisotropy problem and its solution. Experiments on zero-shot dependency parsing through the concept-shared space built by our embedding transformation substantially outperform state-of-the-art methods using multilingual embeddings.

pdf bib
Gradual Fine-Tuning for Low-Resource Domain Adaptation
Haoran Xu | Seth Ebner | Mahsa Yarmohammadi | Aaron Steven White | Benjamin Van Durme | Kenton Murray

Fine-tuning is known to improve NLP models by adapting an initial model trained on more plentiful but less domain-salient examples to data in a target domain. Such domain adaptation is typically done using one stage of fine-tuning. We demonstrate that gradually fine-tuning in a multi-step process can yield substantial further gains and can be applied without modifying the model or learning objective.

pdf bib
Semantic Parsing of Brief and Multi-Intent Natural Language Utterances
Logan Lebanoff | Charles Newton | Victor Hung | Beth Atkinson | John Killilea | Fei Liu

Many military communication domains involve rapidly conveying situation awareness with few words. Converting natural language utterances to logical forms in these domains is challenging, as these utterances are brief and contain multiple intents. In this paper, we present a first effort toward building a weakly-supervised semantic parser to transform brief, multi-intent natural utterances into logical forms. Our findings suggest a new projection and reduction method that iteratively performs projection from natural to canonical utterances followed by reduction of natural utterances is the most effective. We conduct extensive experiments on two military and a general-domain dataset and provide a new baseline for future research toward accurate parsing of multi-intent utterances.


bib (full) Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications

pdf bib
Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications
Jill Burstein | Andrea Horbach | Ekaterina Kochmar | Ronja Laarmann-Quante | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Helen Yannakoudakis | Torsten Zesch

pdf bib
Text Simplification by TaggingText Simplification by Tagging
Kostiantyn Omelianchuk | Vipul Raheja | Oleksandr Skurzhanskyi

Edit-based approaches have recently shown promising results on multiple monolingual sequence transduction tasks. In contrast to conventional sequence-to-sequence (Seq2Seq) models, which learn to generate text from scratch as they are trained on parallel corpora, these methods have proven to be much more effective since they are able to learn to make fast and accurate transformations while leveraging powerful pre-trained language models. Inspired by these ideas, we present TST, a simple and efficient Text Simplification system based on sequence Tagging, leveraging pre-trained Transformer-based encoders. Our system makes simplistic data augmentations and tweaks in training and inference on a pre-existing system, which makes it less reliant on large amounts of parallel training data, provides more control over the outputs and enables faster inference speeds. Our best model achieves near state-of-the-art performance on benchmark test datasets for the task. Since it is fully non-autoregressive, it achieves faster inference speeds by over 11 times than the current state-of-the-art text simplification system.

pdf bib
Broad Linguistic Complexity Analysis for Greek Readability ClassificationGreek Readability Classification
Savvas Chatzipanagiotidis | Maria Giagkou | Detmar Meurers

This paper explores the linguistic complexity of Greek textbooks as a readability classification task. We analyze textbook corpora for different school subjects and textbooks for Greek as a Second Language, covering a very wide spectrum of school age groups and proficiency levels. A broad range of quantifiable linguistic complexity features (lexical, morphological and syntactic) are extracted and calculated. Conducting experiments with different feature subsets, we show that the different linguistic dimensions contribute orthogonal information, each contributing towards the highest result achieved using all linguistic feature subsets. A readability classifier trained on this basis reaches a classification accuracy of 88.16 % for the Greek as a Second Language corpus. To investigate the generalizability of the classification models, we also perform cross-corpus evaluations. We show that the model trained on the most varied text collection (for Greek as a school subject) generalizes best. In addition to advancing the state of the art for Greek readability analysis, the paper also contributes insights on the role of different feature sets and training setups for generalizable readability classification.

pdf bib
Parsing Argumentative Structure in English-as-Foreign-Language EssaysEnglish-as-Foreign-Language Essays
Jan Wira Gotama Putra | Simone Teufel | Takenobu Tokunaga

This paper presents a study on parsing the argumentative structure in English-as-foreign-language (EFL) essays, which are inherently noisy. The parsing process consists of two steps, linking related sentences and then labelling their relations. We experiment with several deep learning architectures to address each task independently. In the sentence linking task, a biaffine model performed the best. In the relation labelling task, a fine-tuned BERT model performed the best. Two sentence encoders are employed, and we observed that non-fine-tuning models generally performed better when using Sentence-BERT as opposed to BERT encoder. We trained our models using two types of parallel texts : original noisy EFL essays and those improved by annotators, then evaluate them on the original essays. The experiment shows that an end-to-end in-domain system achieved an accuracy of.341. On the other hand, the cross-domain system achieved 94 % performance of the in-domain system. This signals that well-written texts can also be useful to train argument mining system for noisy texts.

pdf bib
Training and Domain Adaptation for Supervised Text Segmentation
Goran Glavaš | Ananya Ganesh | Swapna Somasundaran

Unlike traditional unsupervised text segmentation methods, recent supervised segmentation models rely on Wikipedia as the source of large-scale segmentation supervision. These models have, however, predominantly been evaluated on the in-domain (Wikipedia-based) test sets, preventing conclusions about their general segmentation efficacy. In this work, we focus on the domain transfer performance of supervised neural text segmentation in the educational domain. To this end, we first introduce K12Seg, a new dataset for evaluation of supervised segmentation, created from educational reading material for grade-1 to college-level students. We then benchmark a hierarchical text segmentation model (HITS), based on RoBERTa, in both in-domain and domain-transfer segmentation experiments. While HITS produces state-of-the-art in-domain performance (on three Wikipedia-based test sets), we show that, subject to the standard full-blown fine-tuning, it is susceptible to domain overfitting. We identify adapter-based fine-tuning as a remedy that substantially improves transfer performance.

pdf bib
C-Test Collector : A Proficiency Testing Application to Collect Training Data for C-TestsC-Test Collector: A Proficiency Testing Application to Collect Training Data for C-Tests
Christian Haring | Rene Lehmann | Andrea Horbach | Torsten Zesch

We present the C-Test Collector, a web-based tool that allows language learners to test their proficiency level using c-tests. Our tool collects anonymized data on test performance, which allows teachers to gain insights into common error patterns. At the same time, it allows NLP researchers to collect training data for being able to generate c-test variants at the desired difficulty level.

pdf bib
Sharks are not the threat humans are : Argument Component Segmentation in School Student Essays
Tariq Alhindi | Debanjan Ghosh

Argument mining is often addressed by a pipeline method where segmentation of text into argumentative units is conducted first and proceeded by an argument component identification task. In this research, we apply a token-level classification to identify claim and premise tokens from a new corpus of argumentative essays written by middle school students. To this end, we compare a variety of state-of-the-art models such as discrete features and deep learning architectures (e.g., BiLSTM networks and BERT-based architectures) to identify the argument components. We demonstrate that a BERT-based multi-task learning architecture (i.e., token and sentence level classification) adaptively pretrained on a relevant unlabeled dataset obtains the best results.


bib (full) Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing

pdf bib
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing
Bogdan Babych | Olga Kanishcheva | Preslav Nakov | Jakub Piskorski | Lidia Pivovarova | Vasyl Starko | Josef Steinberger | Roman Yangarber | Michał Marcińczuk | Senja Pollak | Pavel Přibáň | Marko Robnik-Šikonja

pdf bib
HerBERT : Efficiently Pretrained Transformer-based Language Model for PolishHerBERT: Efficiently Pretrained Transformer-based Language Model for Polish
Robert Mroczkowski | Piotr Rybak | Alina Wróblewska | Ireneusz Gawlik

BERT-based models are currently used for solving nearly all Natural Language Processing (NLP) tasks and most often achieve state-of-the-art results. Therefore, the NLP community conducts extensive research on understanding these models, but above all on designing effective and efficient training procedures. Several ablation studies investigating how to train BERT-like models have been carried out, but the vast majority of them concerned only the English language. A training procedure designed for English does not have to be universal and applicable to other especially typologically different languages. Therefore, this paper presents the first ablation study focused on Polish, which, unlike the isolating English language, is a fusional language. We design and thoroughly evaluate a pretraining procedure of transferring knowledge from multilingual to monolingual BERT-based models. In addition to multilingual model initialization, other factors that possibly influence pretraining are also explored, i.e. training objective, corpus size, BPE-Dropout, and pretraining length. Based on the proposed procedure, a Polish BERT-based language model HerBERT is trained. This model achieves state-of-the-art results on multiple downstream tasks.

pdf bib
Detecting Inappropriate Messages on Sensitive Topics that Could Harm a Company’s Reputation
Nikolay Babakov | Varvara Logacheva | Olga Kozlova | Nikita Semenov | Alexander Panchenko

Not all topics are equally flammable in terms of toxicity : a calm discussion of turtles or fishing less often fuels inappropriate toxic dialogues than a discussion of politics or sexual minorities. We define a set of sensitive topics that can yield inappropriate and toxic messages and describe the methodology of collecting and labelling a dataset for appropriateness. While toxicity in user-generated data is well-studied, we aim at defining a more fine-grained notion of inappropriateness. The core of inappropriateness is that it can harm the reputation of a speaker. This is different from toxicity in two respects : (i) inappropriateness is topic-related, and (ii) inappropriate message is not toxic but still unacceptable. We collect and release two datasets for Russian : a topic-labelled dataset and an appropriateness-labelled dataset. We also release pre-trained classification models trained on this data.

pdf bib
RuSentEval : Linguistic Source, Encoder Force !RuSentEval: Linguistic Source, Encoder Force!
Vladislav Mikhailov | Ekaterina Taktasheva | Elina Sigdel | Ekaterina Artemova

The success of pre-trained transformer language models has brought a great deal of interest on how these models work, and what they learn about language. However, prior research in the field is mainly devoted to English, and little is known regarding other languages. To this end, we introduce RuSentEval, an enhanced set of 14 probing tasks for Russian, including ones that have not been explored yet. We apply a combination of complementary probing methods to explore the distribution of various linguistic properties in five multilingual transformers for two typologically contrasting languages Russian and English. Our results provide intriguing findings that contradict the common understanding of how linguistic knowledge is represented, and demonstrate that some properties are learned in a similar manner despite the language differences.

pdf bib
Exploratory Analysis of News Sentiment Using Subgroup Discovery
Anita Valmarska | Luis Adrián Cabrera-Diego | Elvys Linhares Pontes | Senja Pollak

In this study, we present an exploratory analysis of a Slovenian news corpus, in which we investigate the association between named entities and sentiment in the news. We propose a methodology that combines Named Entity Recognition and Subgroup Discovery-a descriptive rule learning technique for identifying groups of examples that share the same class label (sentiment) and pattern (features-Named Entities). The approach is used to induce the positive and negative sentiment class rules that reveal interesting patterns related to different Slovenian and international politicians, organizations, and locations.

pdf bib
Creating an Aligned Russian Text Simplification Dataset from Language Learner DataRussian Text Simplification Dataset from Language Learner Data
Anna Dmitrieva | Jörg Tiedemann

Parallel language corpora where regular texts are aligned with their simplified versions can be used in both natural language processing and theoretical linguistic studies. They are essential for the task of automatic text simplification, but can also provide valuable insights into the characteristics that make texts more accessible and reveal strategies that human experts use to simplify texts. Today, there exist a few parallel datasets for English and Simple English, but many other languages lack such data. In this paper we describe our work on creating an aligned Russian-Simple Russian dataset composed of Russian literature texts adapted for learners of Russian as a foreign language. This will be the first parallel dataset in this domain, and one of the first Simple Russian datasets in general.

pdf bib
Multilingual Named Entity Recognition and Matching Using BERT and Dedupe for Slavic LanguagesBERT and Dedupe for Slavic Languages
Marko Prelevikj | Slavko Zitnik

This paper describes the University of Ljubljana (UL FRI) Group’s submissions to the shared task at the Balto-Slavic Natural Language Processing (BSNLP) 2021 Workshop. We experiment with multiple BERT-based models, pre-trained in multi-lingual, Croatian-Slovene-English and Slovene-only data. We perform training iteratively and on the concatenated data of previously available NER datasets. For the normalization task we use Stanza lemmatizer, while for entity matching we implemented a baseline using the Dedupe library. The performance of evaluations suggests that multi-source settings outperform less-resourced approaches. The best NER models achieve 0.91 F-score on Slovene training data splits while the best official submission achieved F-scores of 0.84 and 0.78 for relaxed partial matching and strict settings, respectively. In multi-lingual NER setting we achieve F-scores of 0.82 and 0.74.

pdf bib
Benchmarking Pre-trained Language Models for Multilingual NER : TraSpaS at the BSNLP2021 Shared TaskNER: TraSpaS at the BSNLP2021 Shared Task
Marek Suppa | Ondrej Jariabka

In this paper we describe TraSpaS, a submission to the third shared task on named entity recognition hosted as part of the Balto-Slavic Natural Language Processing (BSNLP) Workshop. In it we evaluate various pre-trained language models on the NER task using three open-source NLP toolkits : character level language model with Stanza, language-specific BERT-style models with SpaCy and Adapter-enabled XLM-R with Trankit. Our results show that the Trankit-based models outperformed those based on the other two toolkits, even when trained on smaller amounts of data. Our code is available at.

pdf bib
Named Entity Recognition and Linking Augmented with Large-Scale Structured Data
Paweł Rychlikowski | Bartłomiej Najdecki | Adrian Lancucki | Adam Kaczmarek

In this paper we describe our submissions to the 2nd and 3rd SlavNER Shared Tasks held at BSNLP 2019 and BSNLP 2021, respectively. The tasks focused on the analysis of Named Entities in multilingual Web documents in Slavic languages with rich inflection. Our solution takes advantage of large collections of both unstructured and structured documents. The former serve as data for unsupervised training of language models and embeddings of lexical units. The latter refers to Wikipedia and its structured counterpart-Wikidata, our source of lemmatization rules, and real-world entities. With the aid of those resources, our system could recognize, normalize and link entities, while being trained with only small amounts of labeled data.

pdf bib
Slav-NER : the 3rd Cross-lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic LanguagesNER: the 3rd Cross-lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages
Jakub Piskorski | Bogdan Babych | Zara Kancheva | Olga Kanishcheva | Maria Lebedeva | Michał Marcińczuk | Preslav Nakov | Petya Osenova | Lidia Pivovarova | Senja Pollak | Pavel Přibáň | Ivaylo Radev | Marko Robnik-Sikonja | Vasyl Starko | Josef Steinberger | Roman Yangarber

This paper describes Slav-NER : the 3rd Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. The Challenge covers six languages and five entity types, and is organized as part of the 8th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2021 Conference. Ten teams participated in the competition. Performance for the named entity recognition task reached 90 % F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all six languages, and five teams participated in the cross-lingual entity linking task. Detailed valuation information is available on the shared task web page.


bib (full) Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

pdf bib
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages
Bharathi Raja Chakravarthi | Ruba Priyadharshini | Anand Kumar M | Parameswari Krishnamurthy | Elizabeth Sherly

pdf bib
Towards Offensive Language Identification for Dravidian LanguagesDravidian Languages
Siva Sai | Yashvardhan Sharma

Offensive speech identification in countries like India poses several challenges due to the usage of code-mixed and romanized variants of multiple languages by the users in their posts on social media. The challenge of offensive language identification on social media for Dravidian languages is harder, considering the low resources available for the same. In this paper, we explored the zero-shot learning and few-shot learning paradigms based on multilingual language models for offensive speech detection in code-mixed and romanized variants of three Dravidian languages-Malayalam, Tamil, and Kannada. We propose a novel and flexible approach of selective translation and transliteration to reap better results from fine-tuning and ensembling multilingual transformer networks like XLMRoBERTa and mBERT. We implemented pretrained, fine-tuned, and ensembled versions of XLM-RoBERTa for offensive speech classification. Further, we experimented with interlanguage, inter-task, and multi-task transfer learning techniques to leverage the rich resources available for offensive speech identification in the English language and to enrich the models with knowledge transfer from related tasks. The proposed models yielded good results and are promising for effective offensive speech identification in low resource settings.

pdf bib
Sentiment Classification of Code-Mixed Tweets using Bi-Directional RNN and Language TagsRNN and Language Tags
Sainik Mahata | Dipankar Das | Sivaji Bandyopadhyay

Sentiment analysis tools and models have been developed extensively throughout the years, for European languages. In contrast, similar tools for Indian Languages are scarce. This is because, state-of-the-art pre-processing tools like POS tagger, shallow parsers, etc., are not readily available for Indian languages. Although, such working tools for Indian languages, like Hindi and Bengali, that are spoken by the majority of the population, are available, finding the same for less spoken languages like, Tamil, Telugu, and Malayalam, is difficult. Moreover, due to the advent of social media, the multi-lingual population of India, who are comfortable with both English ad their regional language, prefer to communicate by mixing both languages. This gives rise to massive code-mixed content and automatically annotating them with their respective sentiment labels becomes a challenging task. In this work, we take up a similar challenge of developing a sentiment analysis model that can work with English-Tamil code-mixed data. The proposed work tries to solve this by using bi-directional LSTMs along with language tagging. Other traditional methods, based on classical machine learning algorithms have also been discussed in the literature, and they also act as the baseline systems to which we will compare our Neural Network based model. The performance of the developed algorithm, based on Neural Network architecture, garnered precision, recall, and F1 scores of 0.59, 0.66, and 0.58 respectively.

pdf bib
Task-Oriented Dialog Systems for Dravidian LanguagesDravidian Languages
Tushar Kanakagiri | Karthik Radhakrishnan

Task-oriented dialog systems help a user achieve a particular goal by parsing user requests to execute a particular action. These systems typically require copious amounts of training data to effectively understand the user intent and its corresponding slots. Acquiring large training corpora requires significant manual effort in annotation, rendering its construction infeasible for low-resource languages. In this paper, we present a two-step approach for automatically constructing task-oriented dialogue data in such languages by making use of annotated data from high resource languages. First, we use a machine translation (MT) system to translate the utterance and slot information to the target language. Second, we use token prefix matching and mBERT based semantic matching to align the slot tokens to the corresponding tokens in the utterance. We hand-curate a new test dataset in two low-resource Dravidian languages and show the significance and impact of our training dataset construction using a state-of-the-art mBERT model-achieving a Slot F1 of 81.51 (Kannada) and 78.82 (Tamil) on our test sets.

pdf bib
A Survey on Paralinguistics in Tamil Speech ProcessingTamil Speech Processing
Anosha Ignatius | Uthayasanker Thayasivam

Speech carries not only the semantic content but also the paralinguistic information which captures the speaking style. Speaker traits and emotional states affect how words are being spoken. The research on paralinguistic information is an emerging field in speech and language processing and it has many potential applications including speech recognition, speaker identification and verification, emotion recognition and accent recognition. Among them, there is a significant interest in emotion recognition from speech. A detailed study of paralinguistic information present in speech signal and an overview of research work related to speech emotion for Tamil Language is presented in this paper.

pdf bib
Is this Enough?-Evaluation of Malayalam WordnetMalayalam Wordnet
Nandu Chandran Nair | Maria-chiara Giangregorio | Fausto Giunchiglia

Quality of a product is the degree to which a product meets the customer’s expectation, which must also be valid for the case of lexical semantic resources. Conducting a periodic evaluation of resources is essential to ensure if the resources meet a native speaker’s expectations and free from errors. This paper defines the possible mistakes in a lexical semantic resource and explains the steps applied to quantify Malayalam wordnet quality. Malayalam is one of the classical languages of India. We hope to subset the less quality part of the wordnet and perform crowdsourcing to make it better.

pdf bib
LA-SACo : A Study of Learning Approaches for Sentiments Analysis inCode-Mixing TextsLA-SACo: A Study of Learning Approaches for Sentiments Analysis inCode-Mixing Texts
Fazlourrahman Balouchzahi | H L Shashirekha

Substantial amount of text data which is increasingly being generated and shared on the internet and social media every second affect the society positively or negatively almost in any aspect of online world and also business and industries. Sentiments / opinions / reviews’ of users posted on social media are the valuable information that have motivated researchers to analyze them to get better insight and feedbacks about any product such as a video in Instagram, a movie in Netflix, or even new brand car introduced by BMW. Sentiments are usually written using a combination of languages such as English which is resource rich and regional languages such as Tamil, Kannada, Malayalam, etc. which are resource poor. However, due to technical constraints, many users prefer to pen their opinions in Roman script. These kinds of texts written in two or more languages using a common language script or different language scripts are called code-mixing texts. Code-mixed texts are increasing day-by-day with the increase in the number of users depending on various online platforms. Analyzing such texts pose a real challenge for the researchers. In view of the challenges posed by the code-mixed texts, this paper describes three proposed models namely, SACo-Ensemble, SACo-Keras, and SACo-ULMFiT using Machine Learning (ML), Deep Learning (DL), and Transfer Learning (TL) approaches respectively for the task of Sentiments Analysis in Tamil-English and Malayalam-English code-mixed texts.

pdf bib
Findings of the Shared Task on Troll Meme Classification in TamilTamil
Shardul Suryawanshi | Bharathi Raja Chakravarthi

The internet has facilitated its user-base with a platform to communicate and express their views without any censorship. On the other hand, this freedom of expression or free speech can be abused by its user or a troll to demean an individual or a group. Demeaning people based on their gender, sexual orientation, religious believes or any other characteristics trolling could cause great distress in the online community. Hence, the content posted by a troll needs to be identified and dealt with before causing any more damage. Amongst all the forms of troll content, memes are most prevalent due to their popularity and ability to propagate across cultures. A troll uses a meme to demean, attack or offend its targetted audience. In this shared task, we provide a resource (TamilMemes) that could be used to train a system capable of identifying a troll meme in the Tamil language. In our TamilMemes dataset, each meme has been categorized into either a troll or a not_troll class. Along with the meme images, we also provided the Latin transcripted text from memes. We received 10 system submissions from the participants which were evaluated using the weighted average F1-score. The system with the weighted average F1-score of 0.55 secured the first rank.

pdf bib
Findings of the Shared Task on Offensive Language Identification in Tamil, Malayalam, and KannadaTamil, Malayalam, and Kannada
Bharathi Raja Chakravarthi | Ruba Priyadharshini | Navya Jose | Anand Kumar M | Thomas Mandl | Prasanna Kumar Kumaresan | Rahul Ponnusamy | Hariharan R L | John P. McCrae | Elizabeth Sherly

Detecting offensive language in social media in local languages is critical for moderating user-generated content. Thus, the field of offensive language identification in under-resourced Tamil, Malayalam and Kannada languages are essential. As the user-generated content is more code-mixed and not well studied for under-resourced languages, it is imperative to create resources and conduct benchmarking studies to encourage research in under-resourced Dravidian languages. We created a shared task on offensive language detection in Dravidian languages. We summarize here the dataset for this challenge which are openly available at, and present an overview of the methods and the results of the competing systems.

pdf bib
GX@DravidianLangTech-EACL2021 : Multilingual Neural Machine Translation and Back-translationGX@DravidianLangTech-EACL2021: Multilingual Neural Machine Translation and Back-translation
Wanying Xie

In this paper, we describe the GX system in the EACL2021 shared task on machine translation in Dravidian languages. Given the low amount of parallel training data, We adopt two methods to improve the overall performance : (1) multilingual translation, we use a shared encoder-decoder multilingual translation model handling multiple languages simultaneously to facilitate the translation performance of these languages ; (2) back-translation, we collected other open-source parallel and monolingual data and apply back-translation to benefit from the monolingual data. The experimental results show that we can achieve satisfactory translation results in these Dravidian languages and rank first in English-Telugu and Tamil-Telugu translation.

pdf bib
Simon @ DravidianLangTech-EACL2021 : Detecting Offensive Content in Kannada LanguageDravidianLangTech-EACL2021: Detecting Offensive Content in Kannada Language
Qinyu Que

This article introduces the system for the shared task of Offensive Language Identification in Dravidian Languages-EACL 2021. The world’s information technology develops at a high speed. People are used to expressing their views and opinions on social media. This leads to a lot of offensive language on social media. As people become more dependent on social media, the detection of offensive language becomes more and more necessary. This shared task is in three languages : Tamil, Malayalam, and Kannada. Our team takes part in the Kannada language task. To accomplish the task, we use the XLM-Roberta model for pre-training. But the capabilities of the XLM-Roberta model do not satisfy us in terms of statement information collection. So we made some tweaks to the output of this model. In this paper, we describe the models and experiments for accomplishing the task of the Kannada language.

pdf bib
Hypers@DravidianLangTech-EACL2021 : Offensive language identification in Dravidian code-mixed YouTube Comments and PostsDravidianLangTech-EACL2021: Offensive language identification in Dravidian code-mixed YouTube Comments and Posts
Charangan Vasantharajan | Uthayasanker Thayasivam

Code-Mixed Offensive contents are used pervasively in social media posts in the last few years. Consequently, gained the significant attraction of the research community for identifying the different forms of such content (e.g., hate speech, and sentiments) and contributed to the creation of datasets. Most of the recent studies deal with high-resource languages (e.g., English) due to many publicly available datasets, and by the lack of dataset in low-resource anguages, those studies are slightly involved in these languages. Therefore, this study has the focus on offensive language identification on code-mixed low-resourced Dravidian languages such as Tamil, Kannada, and Malayalam using the bidirectional approach and fine-tuning strategies. According to the leaderboard, the proposed model got a 0.96 F1-score for Malayalam, 0.73 F1-score for Tamil, and 0.70 F1-score for Kannada in the bench-mark. Moreover, in the view of multilingual models, this modal ranked 3rd and achieved favorable results and confirmed the model as the best among all systems submitted to these shared tasks in these three languages.

pdf bib
ZYJ123@DravidianLangTech-EACL2021 : Offensive Language Identification based on XLM-RoBERTa with DPCNNZYJ123@DravidianLangTech-EACL2021: Offensive Language Identification based on XLM-RoBERTa with DPCNN
Yingjia Zhao | Xin Tao

The development of online media platforms has given users more opportunities to post and comment freely, but the negative impact of offensive language has become increasingly apparent. It is very necessary for the automatic identification system of offensive language. This paper describes our work on the task of Offensive Language Identification in Dravidian language-EACL 2021. To complete this task, we propose a system based on the multilingual model XLM-Roberta and DPCNN. The test results on the official test data set confirm the effectiveness of our system. The weighted average F1-score of Kannada, Malayalam, and Tami language are 0.69, 0.92, and 0.76 respectively, ranked 6th, 6th, and 3rd

pdf bib
CUSATNLP@DravidianLangTech-EACL2021 : Language Agnostic Classification of Offensive Content in TweetsCUSATNLP@DravidianLangTech-EACL2021:Language Agnostic Classification of Offensive Content in Tweets
Sara Renjit | Sumam Mary Idicula

Identifying offensive information from tweets is a vital language processing task. This task concentrated more on English and other foreign languages these days. In this shared task on Offensive Language Identification in Dravidian Languages, in the First Workshop of Speech and Language Technologies for Dravidian Languages in EACL 2021, the aim is to identify offensive content from code mixed Dravidian Languages Kannada, Malayalam, and Tamil. Our team used language agnostic BERT (Bidirectional Encoder Representation from Transformers) for sentence embedding and a Softmax classifier. The language-agnostic representation based classification helped obtain good performance for all the three languages, out of which results for the Malayalam language are good enough to obtain a third position among the participating teams.

pdf bib
Maoqin @ DravidianLangTech-EACL2021 : The Application of Transformer-Based ModelDravidianLangTech-EACL2021: The Application of Transformer-Based Model
Maoqin Yang

This paper describes the result of team-Maoqin at DravidianLangTech-EACL2021. The provided task consists of three languages(Tamil, Malayalam, and Kannada), I only participate in one of the language task-Malayalam. The goal of this task is to identify offensive language content of the code-mixed dataset of comments / posts in Dravidian Languages (Tamil-English, Malayalam-English, and Kannada-English) collected from social media. This is a classification task at the comment / post level. Given a Youtube comment, systems have to classify it into Not-offensive, Offensive-untargeted, Offensive-targeted-individual, Offensive-targeted-group, Offensive-targeted-other, or Not-in-indented-language. I use the transformer-based language model with BiGRU-Attention to complete this task. To prove the validity of the model, I also use some other neural network models for comparison. And finally, the team ranks 5th in this task with a weighted average F1 score of 0.93 on the private leader board.

pdf bib
Simon @ DravidianLangTech-EACL2021 : Meme Classification for Tamil with BERTDravidianLangTech-EACL2021: Meme Classification for Tamil with BERT
Qinyu Que

In this paper, we introduce the system for the task of meme classification for Tamil, submitted by our team. In today’s society, social media has become an important platform for people to communicate. We use social media to share information about ourselves and express our views on things. It has gradually developed a unique form of emotional expression on social media meme. The meme is an expression that is often ironic. This also gives the meme a unique sense of humor. But it’s not just positive content on social media. There’s also a lot of offensive content. Meme’s unique expression makes it often used by some users to post offensive content. Therefore, it is very urgent to detect the offensive content of the meme. Our team uses the natural language processing method to classify the offensive content of the meme. Our team combines the BERT model with the CNN to improve the model’s ability to collect statement information. Finally, the F1-score of our team in the official test set is 0.49, and our method ranks 5th.

pdf bib
SSNCSE_NLP@DravidianLangTech-EACL2021 : Offensive Language Identification on Multilingual Code Mixing TextSSNCSE_NLP@DravidianLangTech-EACL2021: Offensive Language Identification on Multilingual Code Mixing Text
Bharathi B | Agnusimmaculate Silvia A

Social networks made a huge impact in almost all fields in recent years. Text messaging through the Internet or cellular phones has become a major medium of personal and commercial communication. Everyday we have to deal with texts, emails or different types of messages in which there are a variety of attacks and abusive phrases. It is the moderator’s decision which comments to remove from the platform because of violations and which ones to keep but an automatic software for detecting abusive languages would be useful in recent days. In this paper we describe an automatic offensive language identification from Dravidian languages with various machine learning algorithms. This is work is shared task in DravidanLangTech-EACL2021. The goal of this task is to identify offensive language content of the code-mixed dataset of comments / posts in Dravidian Languages ((Tamil-English, Malayalam-English, and Kannada-English)) collected from social media. This work explains the submissions made by SSNCSE_NLP in DravidanLangTech-EACL2021 Code-mix tasks for Offensive language detection. We achieve F1 scores of 0.95 for Malayalam, 0.7 for Kannada and 0.73 for task2-Tamil on the test-set.

pdf bib
MUCS@DravidianLangTech-EACL2021 : COOLI-Code-Mixing Offensive Language IdentificationMUCS@DravidianLangTech-EACL2021:COOLI-Code-Mixing Offensive Language Identification
Fazlourrahman Balouchzahi | Aparna B K | H L Shashirekha

This paper describes the models submitted by the team MUCS for Offensive Language Identification in Dravidian Languages-EACL 2021 shared task that aims at identifying and classifying code-mixed texts of three language pairs namely, Kannada-English (Kn-En), Malayalam-English (Ma-En), and Tamil-English (Ta-En) into six predefined categories (5 categories in Ma-En language pair). Two models, namely, COOLI-Ensemble and COOLI-Keras are trained with the char sequences extracted from the sentences combined with words as features. Out of the two proposed models, COOLI-Ensemble model (best among our models) obtained first rank for Ma-En language pair with 0.97 weighted F1-score and fourth and sixth ranks with 0.75 and 0.69 weighted F1-score for Ta-En and Kn-En language pairs respectively.

pdf bib
OffTamil@DravideanLangTech-EASL2021 : Offensive Language Identification in Tamil TextOffTamil@DravideanLangTech-EASL2021: Offensive Language Identification in Tamil Text
Disne Sivalingam | Sajeetha Thavareesan

In the last few decades, Code-Mixed Offensive texts are used penetratingly in social media posts. Social media platforms and online communities showed much interest on offensive text identification in recent years. Consequently, research community is also interested in identifying such content and also contributed to the development of corpora. Many publicly available corpora are there for research on identifying offensive text written in English language but rare for low resourced languages like Tamil. The first code-mixed offensive text for Dravidian languages are developed by shared task organizers which is used for this study. This study focused on offensive language identification on code-mixed low-resourced Dravidian language Tamil using four classifiers (Support Vector Machine, random forest, k- Nearest Neighbour and Naive Bayes) using chi2 feature selection technique along with BoW and TF-IDF feature representation techniques using different combinations of n-grams. This proposed model achieved an accuracy of 76.96 % while using linear SVM with TF-IDF feature representation technique.


pdf (full)
bib (full)
Proceedings of the 11th Global Wordnet Conference

pdf bib
Proceedings of the 11th Global Wordnet Conference
Piek Vossen | Christiane Fellbaum

pdf bib
On Universal Colexifications
Hongchang Bao | Bradley Hauer | Grzegorz Kondrak

Colexification occurs when two distinct concepts are lexified by the same word. The term covers both polysemy and homonymy. We posit and investigate the hypothesis that no pair of concepts are colexified in every language. We test our hypothesis by analyzing colexification data from BabelNet, Open Multilingual WordNet, and CLICS. The results show that our hypothesis is supported by over 99.9 % of colexified concept pairs in these three lexical resources.

pdf bib
Practical Approach on Implementation of WordNets for South African LanguagesWordNets for South African Languages
Tshephisho Joseph Sefara | Tumisho Billson Mokgonyane | Vukosi Marivate

This paper proposes the implementation of WordNets for five South African languages, namely, Sepedi, Setswana, Tshivenda, isiZulu and isiXhosa to be added to open multilingual WordNets (OMW) on natural language toolkit (NLTK). The African WordNets are converted from Princeton WordNet (PWN) 2.0 to 3.0 to match the synsets in PWN 3.0. After conversion, there were 7157, 11972, 1288, 6380, and 9460 lemmas for Sepedi, Setswana, Tshivenda, isiZulu and isiX- hosa respectively. Setswana, isiXhosa, Sepedi contains more lemmas compared to 8 languages in OMW and isiZulu contains more lemmas compared to 7 languages in OMW. A library has been published for continuous development of African WordNets in OMW using NLTK.

pdf bib
Ask2Transformers : Zero-Shot Domain labelling with Pretrained Language ModelsAsk2Transformers: Zero-Shot Domain labelling with Pretrained Language Models
Oscar Sainz | German Rigau

In this paper we present a system that exploits different pre-trained Language Models for assigning domain labels to WordNet synsets without any kind of supervision. Furthermore, the system is not restricted to use a particular set of domain labels. We exploit the knowledge encoded within different off-the-shelf pre-trained Language Models and task formulations to infer the domain label of a particular WordNet definition. The proposed zero-shot system achieves a new state-of-the-art on the English dataset used in the evaluation.

pdf bib
Monolingual Word Sense Alignment as a Classification Problem
Sina Ahmadi | John P. McCrae

Words are defined based on their meanings in various ways in different resources. Aligning word senses across monolingual lexicographic resources increases domain coverage and enables integration and incorporation of data. In this paper, we explore the application of classification methods using manually-extracted features along with representation learning techniques in the task of word sense alignment and semantic relationship detection. We demonstrate that the performance of classification methods dramatically varies based on the type of semantic relationships due to the nature of the task but outperforms the previous experiments.

pdf bib
The GlobalWordNet Formats : Updates for 2020GlobalWordNet Formats: Updates for 2020
John P. McCrae | Michael Wayne Goodman | Francis Bond | Alexandre Rademaker | Ewa Rudnicka | Luis Morgado Da Costa

The Global Wordnet Formats have been introduced to enable wordnets to have a common representation that can be integrated through the Global WordNet Grid. As a result of their adoption, a number of shortcomings of the format were identified, and in this paper we describe the extensions to the formats that address these issues. These include : ordering of senses, dependencies between wordnets, pronunciation, syntactic modelling, relations, sense keys, metadata and RDF support. Furthermore, we provide some perspectives on how these changes help in the integration of wordnets.

pdf bib
Semantic Analysis of Verb-Noun Derivation in Princeton WordNetPrinceton WordNet
Verginica Mititelu | Svetlozara Leseva | Ivelina Stoyanova

We present here the results of a morphosemantic analysis of the verb-noun pairs in the Princeton WordNet as reflected in the standoff file containing pairs annotated with a set of 14 semantic relations. We have automatically distinguished between zero-derivation and affixal derivation in the data and identified the affixes and manually checked the results. The data show that for each semantic relation an affix prevails in creating new words, although we can not talk about their specificity with respect to such a relation. Moreover, certain pairs of verb-noun semantic primes are better represented for each semantic relation, and some semantic clusters (in the form of WordNet subtrees) take shape as a result. We thus employ a large-scale data-driven linguistically motivated analysis afforded by the rich derivational and morphosemantic description in WordNet to the end of capturing finer regularities in the process of derivation as represented in the semantic properties of the words involved and as reflected in the structure of the lexicon.

pdf bib
OdeNet : Compiling a GermanWordNet from other ResourcesOdeNet: Compiling a GermanWordNet from other Resources
Melanie Siegel | Francis Bond

The Princeton WordNet for the English language has been used worldwide in NLP projects for many years. With the OMW initiative, wordnets for different languages of the world are being linked via identifiers. The parallel development and linking allows new multilingual application perspectives. The development of a wordnet for the German language is also in this context. To save development time, existing resources were combined and recompiled. The result was then evaluated and improved. In a relatively short time a resource was created that can be used in projects and continuously improved and extended.

pdf bib
Text Document Clustering : Wordnet vs. TF-IDF vs. Word EmbeddingsWordnet vs. TF-IDF vs. Word Embeddings
Michał Marcińczuk | Mateusz Gniewkowski | Tomasz Walkowiak | Marcin Będkowski

In the paper, we deal with the problem of unsupervised text document clustering for the Polish language. Our goal is to compare the modern approaches based on language modeling (doc2vec and BERT) with the classical ones, i.e., TF-IDF and wordnet-based. The experiments are conducted on three datasets containing qualification descriptions. The experiments’ results showed that wordnet-based similarity measures could compete and even outperform modern embedding-based approaches.

pdf bib
Neural Language Models vs Wordnet-based Semantically Enriched Representation in CST Relation RecognitionWordnet-based Semantically Enriched Representation in CST Relation Recognition
Arkadiusz Janz | Maciej Piasecki | Piotr Wątorski

Neural language models, including transformer-based models, that are pre-trained on very large corpora became a common way to represent text in various tasks, including recognition of textual semantic relations, e.g. Cross-document Structure Theory. Pre-trained models are usually fine tuned to downstream tasks and the obtained vectors are used as an input for deep neural classifiers. No linguistic knowledge obtained from resources and tools is utilised. In this paper we compare such universal approaches with a combination of rich graph-based linguistically motivated sentence representation and a typical neural network classifier applied to a task of recognition of CST relation in Polish. The representation describes selected levels of the sentence structure including description of lexical meanings on the basis of the wordnet (plWordNet) synsets and connected SUMO concepts. The obtained results show that in the case of difficult relations and medium size training corpus semantically enriched text representation leads to significantly better results.

pdf bib
What is on Social Media that is not in WordNet? A Preliminary Analysis on the TwitterAAE CorpusWordNet? A Preliminary Analysis on the TwitterAAE Corpus
Cecilia Domingo | Tatiana Gonzalez-Ferrero | Itziar Gonzalez-Dios

Natural Language Processing tools and resources have been so far mainly created and trained for standard varieties of language. Nowadays, with the use of large amounts of data gathered from social media, other varieties and registers need to be processed, which may present other challenges and difficulties. In this work, we focus on English and we present a preliminary analysis by comparing the TwitterAAE corpus, which is annotated for ethnicity, and WordNet by quantifying and explaining the online language that WordNet misses.

pdf bib
Toward the creation of WordNets for ancient Indo-European languagesWordNets for ancient Indo-European languages
Erica Biagetti | Chiara Zanchi | William Michael Short

This paper presents the work in progress toward the creation of a family of WordNets for Sanskrit, Ancient Greek, and Latin. Building on previous attempts in the field, we elaborate these efforts bridging together WordNet relational semantics with theories of meaning from Cognitive Linguistics. We discuss some of the innovations we have introduced to the WordNet architecture, to better capture the polysemy of words, as well as Indo-European language family-specific features. We conclude the paper framing our work within the larger picture of resources available for ancient languages and showing that WordNet-backed search tools have the potential to re-define the kinds of questions that can be asked of ancient language corpora.

pdf bib
Teaching Through Tagging Interactive Lexical Semantics
Francis Bond | Andrew Devadason | Melissa Rui Lin Teo | Luís Morgado da Costa

In this paper we discuss an ongoing effort to enrich students’ learning by involving them in sense tagging. The main goal is to lead students to discover how we can represent meaning and where the limits of our current theories lie. A subsidiary goal is to create sense tagged corpora and an accompanying linked lexicon (in our case wordnets). We present the results of tagging several texts and suggest some ways in which the tagging process could be improved. Two authors of this paper present their own experience as students. Overall, students reported that they found the tagging an enriching experience. The annotated corpora and changes to the wordnet are made available through the NTU multilingual corpus and associated wordnets (NTU-MC).


bib (full) Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

pdf bib
Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation
Hannu Toivonen | Michele Boggia

pdf bib
Adversarial Training for News Stance Detection : Leveraging Signals from a Multi-Genre Corpus.
Costanza Conforti | Jakob Berndt | Marco Basaldella | Mohammad Taher Pilehvar | Chryssi Giannitsarou | Flavio Toxvaerd | Nigel Collier

Cross-target generalization constitutes an important issue for news Stance Detection (SD). In this short paper, we investigate adversarial cross-genre SD, where knowledge from annotated user-generated data is leveraged to improve news SD on targets unseen during training. We implement a BERT-based adversarial network and show experimental performance improvements over a set of strong baselines. Given the abundance of user-generated data, which are considerably less expensive to retrieve and annotate than news articles, this constitutes a promising research direction.

pdf bib
Aligning Estonian and Russian news industry keywords with the help of subtitle translations and an environmental thesaurusEstonian and Russian news industry keywords with the help of subtitle translations and an environmental thesaurus
Andraž Repar | Andrej Shumakov

This paper presents the implementation of a bilingual term alignment approach developed by Repar et al. (2019) to a dataset of unaligned Estonian and Russian keywords which were manually assigned by journalists to describe the article topic. We started by separating the dataset into Estonian and Russian tags based on whether they are written in the Latin or Cyrillic script. Then we selected the available language-specific resources necessary for the alignment system to work. Despite the domains of the language-specific resources (subtitles and environment) not matching the domain of the dataset (news articles), we were able to achieve respectable results with manual evaluation indicating that almost 3/4 of the aligned keyword pairs are at least partial matches.

pdf bib
Comment Section Personalization : Algorithmic, Interface, and Interaction Design
Yixue Wang

Comment sections allow users to share their personal experiences, discuss and form different opinions, and build communities out of organic conversations. However, many comment sections present chronological ranking to all users. In this paper, I discuss personalization approaches in comment sections based on different objectives for newsrooms and researchers to consider. I propose algorithmic and interface designs when personalizing the presentation of comments based on different objectives including relevance, diversity, and education / background information. I further explain how transparency, user control, and comment type diversity could help users most benefit from the personalized interacting experience.

pdf bib
EMBEDDIA Tools, Datasets and Challenges : Resources and Hackathon ContributionsEMBEDDIA Tools, Datasets and Challenges: Resources and Hackathon Contributions
Senja Pollak | Marko Robnik-Šikonja | Matthew Purver | Michele Boggia | Ravi Shekhar | Marko Pranjić | Salla Salmela | Ivar Krustok | Tarmo Paju | Carl-Gustav Linden | Leo Leppänen | Elaine Zosa | Matej Ulčar | Linda Freienthal | Silver Traat | Luis Adrián Cabrera-Diego | Matej Martinc | Nada Lavrač | Blaž Škrlj | Martin Žnidaršič | Andraž Pelicon | Boshko Koloski | Vid Podpečan | Janez Kranjc | Shane Sheehan | Emanuela Boros | Jose G. Moreno | Antoine Doucet | Hannu Toivonen

This paper presents tools and data sources collected and released by the EMBEDDIA project, supported by the European Union’s Horizon 2020 research and innovation program. The collected resources were offered to participants of a hackathon organized as part of the EACL Hackashop on News Media Content Analysis and Automated Report Generation in February 2021. The hackathon had six participating teams who addressed different challenges, either from the list of proposed challenges or their own news-industry-related tasks. This paper goes beyond the scope of the hackathon, as it brings together in a coherent and compact form most of the resources developed, collected and released by the EMBEDDIA project. Moreover, it constitutes a handy source for news media industry and researchers in the fields of Natural Language Processing and Social Science.

pdf bib
A COVID-19 news coverage mood map of EuropeCOVID-19 news coverage mood map of Europe
Frankie Robertson | Jarkko Lagus | Kaisla Kajava

We present a COVID-19 news dashboard which visualizes sentiment in pandemic news coverage in different languages across Europe. The dashboard shows analyses for positive / neutral / negative sentiment and moral sentiment for news articles across countries and languages. First we extract news articles from news-crawl. Then we use a pre-trained multilingual BERT model for sentiment analysis of news article headlines and a dictionary and word vectors -based method for moral sentiment analysis of news articles. The resulting dashboard gives a unified overview of news events on COVID-19 news overall sentiment, and the region and language of publication from the period starting from the beginning of January 2020 to the end of January 2021.

pdf bib
Interesting cross-border news discovery using cross-lingual article linking and document similarity
Boshko Koloski | Elaine Zosa | Timen Stepišnik-Perdih | Blaž Škrlj | Tarmo Paju | Senja Pollak

Team Name : team-8 Embeddia Tool : Cross-Lingual Document Retrieval Zosa et al. Dataset : Estonian and Latvian news datasets abstract : Contemporary news media face increasing amounts of available data that can be of use when prioritizing, selecting and discovering new news. In this work we propose a methodology for retrieving interesting articles in a cross-border news discovery setting. More specifically, we explore how a set of seed documents in Estonian can be projected in Latvian document space and serve as a basis for discovery of novel interesting pieces of Latvian news that would interest Estonian readers. The proposed methodology was evaluated by Estonian journalist who confirmed that in the best setting, from top 10 retrieved Latvian documents, half of them represent news that are potentially interesting to be taken by the Estonian media house and presented to Estonian readers.


bib (full) Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing

pdf bib
Proceedings of the First Workshop on Bridging Human–Computer Interaction and Natural Language Processing
Su Lin Blodgett | Michael Madaio | Brendan O'Connor | Hanna Wallach | Qian Yang

pdf bib
Towards Human-Centered Summarization : A Case Study on Financial News
Tatiana Passali | Alexios Gidiotis | Efstathios Chatzikyriakidis | Grigorios Tsoumakas

Recent Deep Learning (DL) summarization models greatly outperform traditional summarization methodologies, generating high-quality summaries. Despite their success, there are still important open issues, such as the limited engagement and trust of users in the whole process. In order to overcome these issues, we reconsider the task of summarization from a human-centered perspective. We propose to integrate a user interface with an underlying DL model, instead of tackling summarization as an isolated task from the end user. We present a novel system, where the user can actively participate in the whole summarization process. We also enable the user to gather insights into the causative factors that drive the model’s behavior, exploiting the self-attention mechanism. We focus on the financial domain, in order to demonstrate the efficiency of generic DL models for domain-specific applications. Our work takes a first step towards a model-interface co-design approach, where DL models evolve along user needs, paving the way towards human-computer text summarization interfaces.

pdf bib
Methods for the Design and Evaluation of HCI+NLP SystemsHCI+NLP Systems
Hendrik Heuer | Daniel Buschek

HCI and NLP traditionally focus on different evaluation methods. While HCI involves a small number of people directly and deeply, NLP traditionally relies on standardized benchmark evaluations that involve a larger number of people indirectly. We present five methodological proposals at the intersection of HCI and NLP and situate them in the context of ML-based NLP models. Our goal is to foster interdisciplinary collaboration and progress in both fields by emphasizing what the fields can learn from each other.

pdf bib
Challenges in Designing Natural Language Interfaces for Complex Visual Models
Henrik Voigt | Monique Meuschke | Kai Lawonn | Sina Zarrieß

Intuitive interaction with visual models becomes an increasingly important task in the field of Visualization (VIS) and verbal interaction represents a significant aspect of it. Vice versa, modeling verbal interaction in visual environments is a major trend in ongoing research in NLP. To date, research on Language & Vision, however, mostly happens at the intersection of NLP and Computer Vision (CV), and much less at the intersection of NLP and Visualization, which is an important area in Human-Computer Interaction (HCI). This paper presents a brief survey of recent work on interactive tasks and set-ups in NLP and Visualization. We discuss the respective methods, show interesting gaps, and conclude by suggesting neural, visually grounded dialogue modeling as a promising potential for NLIs for visual models.

pdf bib
RE-AIMing Predictive TextRE-AIMing Predictive Text
Matthew Higgs | Claire McCallum | Selina Sutton | Mark Warner

Our increasing reliance on mobile applications means much of our communication is mediated with the support of predictive text systems. How do these systems impact interpersonal communication and broader society? In what ways are predictive text systems harmful, to whom, and why? In this paper, we focus on predictive text systems on mobile devices and attempt to answer these questions. We introduce the concept of a ‘text entry intervention’ as a way to evaluate predictive text systems through an interventional lens, and consider the Reach, Effectiveness, Adoption, Implementation, and Maintenance (RE-AIM) of predictive text systems. We finish with a discussion of opportunities for NLP.

pdf bib
How do people interact with biased text prediction models while writing?
Advait Bhat | Saaket Agashe | Anirudha Joshi

Recent studies have shown that a bias in thetext suggestions system can percolate in theuser’s writing. In this pilot study, we ask thequestion : How do people interact with text pre-diction models, in an inline next phrase sugges-tion interface and how does introducing senti-ment bias in the text prediction model affecttheir writing? We present a pilot study as afirst step to answer this question.


pdf (full)
bib (full)
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

pdf bib
Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)
Anya Belz | Shubham Agarwal | Yvette Graham | Ehud Reiter | Anastasia Shimorina

pdf bib
Trading Off Diversity and Quality in Natural Language Generation
Hugh Zhang | Daniel Duckworth | Daphne Ippolito | Arvind Neelakantan

For open-ended language generation tasks such as storytelling or dialogue, choosing the right decoding algorithm is vital for controlling the tradeoff between generation quality and diversity. However, there presently exists no consensus on which decoding procedure is best or even the criteria by which to compare them. In this paper, we cast decoding as a tradeoff between response quality and diversity, and we perform the first large-scale evaluation of decoding methods along the entire quality-diversity spectrum. Our experiments confirm the existence of the likelihood trap : the counter-intuitive observation that high likelihood sequences are often surprisingly low quality. We also find that when diversity is a priority, all methods perform similarly, but when quality is viewed as more important, nucleus sampling (Holtzman et al., 2019) outperforms all other evaluated decoding algorithms.quality and diversity. However, there presently exists no consensus on which decoding procedure is best or even the criteria by which to compare them. In this paper, we cast decoding as a tradeoff between response quality and diversity, and we perform the first large-scale evaluation of decoding methods along the entire quality-diversity spectrum. Our experiments confirm the existence of the likelihood trap: the counter-intuitive observation that high likelihood sequences are often surprisingly low quality. We also find that when diversity is a priority, all methods perform similarly, but when quality is viewed as more important, nucleus sampling (Holtzman et al., 2019) outperforms all other evaluated decoding algorithms.

pdf bib
Is This Translation Error Critical? : Classification-Based Human and Automatic Machine Translation Evaluation Focusing on Critical Errors
Katsuhito Sudoh | Kosuke Takahashi | Satoshi Nakamura

This paper discusses a classification-based approach to machine translation evaluation, as opposed to a common regression-based approach in the WMT Metrics task. Recent machine translation usually works well but sometimes makes critical errors due to just a few wrong word choices. Our classification-based approach focuses on such errors using several error type labels, for practical machine translation evaluation in an age of neural machine translation. We made additional annotations on the WMT 2015-2017 Metrics datasets with fluency and adequacy labels to distinguish different types of translation errors from syntactic and semantic viewpoints. We present our human evaluation criteria for the corpus development and automatic evaluation experiments using the corpus. The human evaluation corpus will be publicly available upon publication.

pdf bib
Towards Objectively Evaluating the Quality of Generated Medical Summaries
Francesco Moramarco | Damir Juric | Aleksandar Savkov | Ehud Reiter

We propose a method for evaluating the quality of generated text by asking evaluators to count facts, and computing precision, recall, f-score, and accuracy from the raw counts. We believe this approach leads to a more objective and easier to reproduce evaluation. We apply this to the task of medical report summarisation, where measuring objective quality and accuracy is of paramount importance.

pdf bib
Reliability of Human Evaluation for Text Summarization : Lessons Learned and Challenges Ahead
Neslihan Iskender | Tim Polzehl | Sebastian Möller

Only a small portion of research papers with human evaluation for text summarization provide information about the participant demographics, task design, and experiment protocol. Additionally, many researchers use human evaluation as gold standard without questioning the reliability or investigating the factors that might affect the reliability of the human evaluation. As a result, there is a lack of best practices for reliable human summarization evaluation grounded by empirical evidence. To investigate human evaluation reliability, we conduct a series of human evaluation experiments, provide an overview of participant demographics, task design, experimental set-up and compare the results from different experiments. Based on our empirical analysis, we provide guidelines to ensure the reliability of expert and non-expert evaluations, and we determine the factors that might affect the reliability of the human evaluation.

pdf bib
Interrater Disagreement Resolution : A Systematic Procedure to Reach Consensus in Annotation Tasks
Yvette Oortwijn | Thijs Ossenkoppele | Arianna Betti

We present a systematic procedure for interrater disagreement resolution. The procedure is general, but of particular use in multiple-annotator tasks geared towards ground truth construction. We motivate our proposal by arguing that, barring cases in which the researchers’ goal is to elicit different viewpoints, interrater disagreement is a sign of poor quality in the design or the description of a task. Consensus among annotators, we maintain, should be striven for, through a systematic procedure for disagreement resolution such as the one we describe.


bib (full) Proceedings of the Third Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

pdf bib
Proceedings of the Third Workshop on Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)
Marius Mosbach | Michael A. Hedderich | Sandro Pezzelle | Aditya Mogadala | Dietrich Klakow | Marie-Francine Moens | Zeynep Akata

pdf bib
Visual Grounding Strategies for Text-Only Natural Language Processing
Damien Sileo

Visual grounding is a promising path toward more robust and accurate Natural Language Processing (NLP) models. Many multimodal extensions of BERT (e.g., VideoBERT, LXMERT, VL-BERT) allow a joint modeling of texts and images that lead to state-of-the-art results on multimodal tasks such as Visual Question Answering. Here, we leverage multimodal modeling for purely textual tasks (language modeling and classification) with the expectation that the multimodal pretraining provides a grounding that can improve text processing accuracy. We propose possible strategies in this respect. A first type of strategy, referred to as transferred grounding consists in applying multimodal models to text-only tasks using a placeholder to replace image input. The second one, which we call associative grounding, harnesses image retrieval to match texts with related images during both pretraining and text-only downstream tasks. We draw further distinctions into both strategies and then compare them according to their impact on language modeling and commonsense-related downstream tasks, showing improvement over text-only baselines.transferred grounding consists in applying multimodal models to text-only tasks using a placeholder to replace image input. The second one, which we call associative grounding, harnesses image retrieval to match texts with related images during both pretraining and text-only downstream tasks. We draw further distinctions into both strategies and then compare them according to their impact on language modeling and commonsense-related downstream tasks, showing improvement over text-only baselines.

pdf bib
What Did This Castle Look like before? Exploring Referential Relations in Naturally Occurring Multimodal Texts
Ronja Utescher | Sina Zarrieß

Multi-modal texts are abundant and diverse in structure, yet Language & Vision research of these naturally occurring texts has mostly focused on genres that are comparatively light on text, like tweets. In this paper, we discuss the challenges and potential benefits of a L&V framework that explicitly models referential relations, taking Wikipedia articles about buildings as an example. We briefly survey existing related tasks in L&V and propose multi-modal information extraction as a general direction for future research.


bib (full) Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis

pdf bib
Proceedings of the 12th International Workshop on Health Text Mining and Information Analysis
Eben Holderness | Antonio Jimeno Yepes | Alberto Lavelli | Anne-Lyse Minard | James Pustejovsky | Fabio Rinaldi

pdf bib
Understanding Social Support Expressed in a COVID-19 Online ForumCOVID-19 Online Forum
Anietie Andy | Brian Chu | Ramie Fathy | Barrington Bennett | Daniel Stokes | Sharath Chandra Guntuku

In online forums focused on health and wellbeing, individuals tend to seek and give the following social support : emotional and informational support. Understanding the expressions of these social supports in an online COVID- 19 forum is important for : (a) the forum and its members to provide the right type of support to individuals and (b) determining the long term effects of the COVID-19 pandemic on the well-being of the public, thereby informing interventions. In this work, we build four machine learning models to measure the extent of the following social supports expressed in each post in a COVID-19 online forum : (a) emotional support given (b) emotional support sought (c) informational support given, and (d) informational support sought. Using these models, we aim to : (i) determine if there is a correlation between the different social supports expressed in posts e.g. when members of the forum give emotional support in posts, do they also tend to give or seek informational support in the same post? (ii) determine how these social supports sought and given changes over time in published posts. We find that (i) there is a positive correlation between the informational support given in posts and the emotional support given and emotional support sought, respectively, in these posts and (ii) over time, users tended to seek more emotional support and give less emotional support.

pdf bib
Integrating Higher-Level Semantics into Robust Biomedical Name Representations
Pieter Fivez | Simon Suster | Walter Daelemans

Neural encoders of biomedical names are typically considered robust if representations can be effectively exploited for various downstream NLP tasks. To achieve this, encoders need to model domain-specific biomedical semantics while rivaling the universal applicability of pretrained self-supervised representations. Previous work on robust representations has focused on learning low-level distinctions between names of fine-grained biomedical concepts. These fine-grained concepts can also be clustered together to reflect higher-level, more general semantic distinctions, such as grouping the names nettle sting and tick-borne fever together under the description puncture wound of skin. It has not yet been empirically confirmed that training biomedical name encoders on fine-grained distinctions automatically leads to bottom-up encoding of such higher-level semantics. In this paper, we show that this bottom-up effect exists, but that it is still relatively limited. As a solution, we propose a scalable multi-task training regime for biomedical name encoders which can also learn robust representations using only higher-level semantic classes. These representations can generalise both bottom-up as well as top-down among various semantic hierarchies. Moreover, we show how they can be used out-of-the-box for improved unsupervised detection of hypernyms, while retaining robust performance on various semantic relatedness benchmarks.

pdf bib
FuzzyBIO : A Proposal for Fuzzy Representation of Discontinuous EntitiesFuzzyBIO: A Proposal for Fuzzy Representation of Discontinuous Entities
Anne Dirkson | Suzan Verberne | Wessel Kraaij

Discontinuous entities pose a challenge to named entity recognition (NER). These phenomena occur commonly in the biomedical domain. As a solution, expansions of the BIO representation scheme that can handle these entity types are commonly used (i.e. BIOHD). However, the extra tag types make the NER task more difficult to learn. In this paper we propose an alternative ; a fuzzy continuous BIO scheme (FuzzyBIO). We focus on the task of Adverse Drug Response extraction and normalization to compare FuzzyBIO to BIOHD. We find that FuzzyBIO improves recall of NER for two of three data sets and results in a higher percentage of correctly identified disjoint and composite entities for all data sets. Using FuzzyBIO also improves end-to-end performance for continuous and composite entities in two of three data sets. Since FuzzyBIO improves performance for some data sets and the conversion from BIOHD to FuzzyBIO is straightforward, we recommend investigating which is more effective for any data set containing discontinuous entities.

pdf bib
Scientific Claim Verification with VerT5eriniVerT5erini
Ronak Pradeep | Xueguang Ma | Rodrigo Nogueira | Jimmy Lin

This work describes the adaptation of a pretrained sequence-to-sequence model to the task of scientific claim verification in the biomedical domain. We propose a system called VerT5erini that exploits T5 for abstract retrieval, sentence selection, and label prediction, which are three critical sub-tasks of claim verification. We evaluate our pipeline on SciFACT, a newly curated dataset that requires models to not just predict the veracity of claims but also provide relevant sentences from a corpus of scientific literature that support the prediction. Empirically, our system outperforms a strong baseline in each of the three sub-tasks. We further show VerT5erini’s ability to generalize to two new datasets of COVID-19 claims using evidence from the CORD-19 corpus.


bib (full) Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion

pdf bib
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion
Bharathi Raja Chakravarthi | John P. McCrae | Manel Zarrouk | Kalika Bali | Paul Buitelaar

pdf bib
Impact of COVID-19 in Natural Language Processing Publications : a Disaggregated Study in Gender, Contribution and ExperienceCOVID-19 in Natural Language Processing Publications: a Disaggregated Study in Gender, Contribution and Experience
Christine Basta | Marta R. Costa-jussa

This study sheds light on the effects of COVID-19 in the particular field of Computational Linguistics and Natural Language Processing within Artificial Intelligence. We provide an inter-sectional study on gender, contribution, and experience that considers one school year (from August 2019 to August 2020) as a pandemic year. August is included twice for the purpose of an inter-annual comparison. While the trend in publications increased with the crisis, the results show that the ratio between female and male publications decreased. This only helps to reduce the importance of the female role in the scientific contributions of computational linguistics (it is now far below its peak of 0.24). The pandemic has a particularly negative effect on the production of female senior researchers in the first position of authors (maximum work), followed by the female junior researchers in the last position of authors (supervision or collaborative work).

pdf bib
hBERT + BiasCorp-Fighting Racism on the WebBERT + BiasCorp - Fighting Racism on the Web
Olawale Onabola | Zhuang Ma | Xie Yang | Benjamin Akera | Ibraheem Abdulrahman | Jia Xue | Dianbo Liu | Yoshua Bengio

Subtle and overt racism is still present both in physical and online communities today and has impacted many lives in different segments of the society. In this short piece of work, we present how we’re tackling this societal issue with Natural Language Processing. We are releasing BiasCorp, a dataset containing 139,090 comments and news segment from three specific sources-Fox News, BreitbartNews and YouTube. The first batch (45,000 manually annotated) is ready for publication. We are currently in the final phase of manually labeling the remaining dataset using Amazon Mechanical Turk. BERT has been used widely in several downstream tasks. In this work, we present hBERT, where we modify certain layers of the pretrained BERT model with the new Hopfield Layer. hBert generalizes well across different distributions with the added advantage of a reduced model complexity. We are also releasing a JavaScript library 3 and a Chrome Extension Application, to help developers make use of our trained model in web applications (say chat application) and for users to identify and report racially biased contents on the web respectively

pdf bib
An Overview of Fairness in Data Illuminating the Bias in Data Pipeline
Senthil Kumar B | Aravindan Chandrabose | Bharathi Raja Chakravarthi

Data in general encodes human biases by default ; being aware of this is a good start, and the research around how to handle it is ongoing. The term ‘bias’ is extensively used in various contexts in NLP systems. In our research the focus is specific to biases such as gender, racism, religion, demographic and other intersectional views on biases that prevail in text processing systems responsible for systematically discriminating specific population, which is not ethical in NLP. These biases exacerbate the lack of equality, diversity and inclusion of specific population while utilizing the NLP applications. The tools and technology at the intermediate level utilize biased data, and transfer or amplify this bias to the downstream applications. However, it is not enough to be colourblind, gender-neutral alone when designing a unbiased technology instead, we should take a conscious effort by designing a unified framework to measure and benchmark the bias. In this paper, we recommend six measures and one augment measure based on the observations of the bias in data, annotations, text representations and debiasing techniques.

pdf bib
GEPSA, a tool for monitoring social challenges in digital pressGEPSA, a tool for monitoring social challenges in digital press
Iñaki San Vicente | Xabier Saralegi | Nerea Zubia

This papers presents a platform for monitoring press narratives with respect to several social challenges, including gender equality, migrations and minority languages. As narratives are encoded in natural language, we have to use natural processing techniques to automate their analysis. Thus, crawled news are processed by means of several NLP modules, including named entity recognition, keyword extraction, document classification for social challenge detection, and sentiment analysis. A Flask powered interface provides data visualization for a user-based analysis of the data. This paper presents the architecture of the system and describes in detail its different components. Evaluation is provided for the modules related to extraction and classification of information regarding social challenges.

pdf bib
Finding Spoiler Bias in Tweets by Zero-shot Learning and Knowledge Distilling from Neural Text Simplification
Avi Bleiweiss

Automatic detection of critical plot information in reviews of media items poses unique challenges to both social computing and computational linguistics. In this paper we propose to cast the problem of discovering spoiler bias in online discourse as a text simplification task. We conjecture that for an item-user pair, the simpler the user review we learn from an item summary the higher its likelihood to present a spoiler. Our neural model incorporates the advanced transformer network to rank the severity of a spoiler in user tweets. We constructed a sustainable high-quality movie dataset scraped from unsolicited review tweets and paired with a title summary and meta-data extracted from a movie specific domain. To a large extent, our quantitative and qualitative results weigh in on the performance impact of named entity presence in plot summaries. Pretrained on a split-and-rephrase corpus with knowledge distilled from English Wikipedia and fine-tuned on our movie dataset, our neural model shows to outperform both a language modeler and monolingual translation baselines.

pdf bib
IIITT@LT-EDI-EACL2021-Hope Speech Detection : There is always hope in TransformersIIITT@LT-EDI-EACL2021-Hope Speech Detection: There is always hope in Transformers
Karthik Puranik | Adeep Hande | Ruba Priyadharshini | Sajeetha Thavareesan | Bharathi Raja Chakravarthi

In a world with serious challenges like climate change, religious and political conflicts, global pandemics, terrorism, and racial discrimination, an internet full of hate speech, abusive and offensive content is the last thing we desire for. In this paper, we work to identify and promote positive and supportive content on these platforms. We work with several transformer-based models to classify social media comments as hope speech or not hope speech in English, Malayalam, and Tamil languages. This paper portrays our work for the Shared Task on Hope Speech Detection for Equality, Diversity, and Inclusion at LT-EDI 2021- EACL 2021. The codes for our best submission can be viewed.

pdf bib
ZYJ@LT-EDI-EACL2021 : XLM-RoBERTa-Based Model with Attention for Hope Speech DetectionZYJ@LT-EDI-EACL2021:XLM-RoBERTa-Based Model with Attention for Hope Speech Detection
Yingjia Zhao | Xin Tao

Due to the development of modern computer technology and the increase in the number of online media users, we can see all kinds of posts and comments everywhere on the internet. Hope speech can not only inspire the creators but also make other viewers pleasant. It is necessary to effectively and automatically detect hope speech. This paper describes the approach of our team in the task of hope speech detection. We use the attention mechanism to adjust the weight of all the output layers of XLM-RoBERTa to make full use of the information extracted from each layer, and use the weighted sum of all the output layers to complete the classification task. And we use the Stratified-K-Fold method to enhance the training data set. We achieve a weighted average F1-score of 0.59, 0.84, and 0.92 for Tamil, Malayalam, and English language, ranked 3rd, 2nd, and 2nd.

pdf bib
TeamUNCC@LT-EDI-EACL2021 : Hope Speech Detection using Transfer Learning with TransformersTeamUNCC@LT-EDI-EACL2021: Hope Speech Detection using Transfer Learning with Transformers
Khyati Mahajan | Erfan Al-Hossami | Samira Shaikh

In this paper, we describe our approach towards utilizing pre-trained models for the task of hope speech detection. We participated in Task 2 : Hope Speech Detection for Equality, Diversity and Inclusion at LT-EDI-2021 @ EACL2021. The goal of this task is to predict the presence of hope speech, along with the presence of samples that do not belong to the same language in the dataset. We describe our approach to fine-tuning RoBERTa for Hope Speech detection in English and our approach to fine-tuning XLM-RoBERTa for Hope Speech detection in Tamil and Malayalam, two low resource Indic languages. We demonstrate the performance of our approach on classifying text into hope-speech, non-hope and not-language. Our approach ranked 1st in English (F1 = 0.93), 1st in Tamil (F1 = 0.61) and 3rd in Malayalam (F1 = 0.83).

pdf bib
Autobots@LT-EDI-EACL2021 : One World, One Family : Hope Speech Detection with BERT Transformer ModelLT-EDI-EACL2021: One World, One Family: Hope Speech Detection with BERT Transformer Model
Sunil Gundapu | Radhika Mamidi

The rapid rise of online social networks like YouTube, Facebook, Twitter allows people to express their views more widely online. However, at the same time, it can lead to an increase in conflict and hatred among consumers in the form of freedom of speech. Therefore, it is essential to take a positive strengthening method to research on encouraging, positive, helping, and supportive social media content. In this paper, we describe a Transformer-based BERT model for Hope speech detection for equality, diversity, and inclusion, submitted for LT-EDI-2021 Task 2. Our model achieves a weighted averaged f1-score of 0.93 on the test set.

pdf bib
Hopeful NLP@LT-EDI-EACL2021 : Finding Hope in YouTube Comment SectionNLP@LT-EDI-EACL2021: Finding Hope in YouTube Comment Section
Vasudev Awatramani

The proliferation of Hate Speech and misinformation in social media is fast becoming a menace to society. In compliment, the dissemination of hate-diffusing, promising and anti-oppressive messages become a unique alternative. Unfortunately, due to its complex nature as well as the relatively limited manifestation in comparison to hostile and neutral content, the identification of Hope Speech becomes a challenge. This work revolves around the detection of Hope Speech in Youtube comments, for the Shared Task on Hope Speech Detection for Equality, Diversity, and Inclusion. We achieve an f-score of 0.93, ranking 1st on the leaderboard for English comments.

pdf bib
NLP-CUET@LT-EDI-EACL2021 : Multilingual Code-Mixed Hope Speech Detection using Cross-lingual Representation LearnerNLP-CUET@LT-EDI-EACL2021: Multilingual Code-Mixed Hope Speech Detection using Cross-lingual Representation Learner
Eftekhar Hossain | Omar Sharif | Mohammed Moshiul Hoque

In recent years, several systems have been developed to regulate the spread of negativity and eliminate aggressive, offensive or abusive contents from the online platforms. Nevertheless, a limited number of researches carried out to identify positive, encouraging and supportive contents. In this work, our goal is to identify whether a social media post / comment contains hope speech or not. We propose three distinct models to identify hope speech in English, Tamil and Malayalam language to serve this purpose. To attain this goal, we employed various machine learning (SVM, LR, ensemble), deep learning (CNN+BiLSTM) and transformer (m-BERT, Indic-BERT, XLNet, XLM-R) based methods. Results indicate that XLM-R outdoes all other techniques by gaining a weighted f_1-score of 0.93, 0.60 and 0.85 respectively for English, Tamil and Malayalam language. Our team has achieved 1st, 2nd and 1st rank in these three tasks respectively.

pdf bib
Spartans@LT-EDI-EACL2021 : Inclusive Speech Detection using Pretrained Language ModelsLT-EDI-EACL2021: Inclusive Speech Detection using Pretrained Language Models
Megha Sharma | Gaurav Arora

We describe our system that ranked first in Hope Speech Detection (HSD) shared task and fourth in Offensive Language Identification (OLI) shared task, both in Tamil language. The goal of HSD and OLI is to identify if a code-mixed comment or post contains hope speech or offensive content respectively. We pre-train a transformer-based model RoBERTa using synthetically generated code-mixed data and use it in an ensemble along with their pre-trained ULMFiT model available from iNLTK.


bib (full) Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

pdf bib
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Yves Scherrer | Tommi Jauhiainen

pdf bib
Hierarchical Transformer for Multilingual Machine Translation
Albina Khusainova | Adil Khan | Adín Ramírez Rivera | Vitaly Romanov

The choice of parameter sharing strategy in multilingual machine translation models determines how optimally parameter space is used and hence, directly influences ultimate translation quality. Inspired by linguistic trees that show the degree of relatedness between different languages, the new general approach to parameter sharing in multilingual machine translation was suggested recently. The main idea is to use these expert language hierarchies as a basis for multilingual architecture : the closer two languages are, the more parameters they share. In this work, we test this idea using the Transformer architecture and show that despite the success in previous work there are problems inherent to training such hierarchical models. We demonstrate that in case of carefully chosen training strategy the hierarchical architecture can outperform bilingual models and multilingual models with full parameter sharing.

pdf bib
Representations of Language Varieties Are Reliable Given Corpus Similarity Measures
Jonathan Dunn

This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modelling linguistic variation. The basic idea is that, if each source adequately represents a single underlying language variety, then the similarity between these sources should be stable across all languages and countries. The paper shows that there is a consistent agreement between these sources using frequency-based corpus similarity measures. This provides further evidence that digital geo-referenced corpora consistently represent local language varieties.

pdf bib
Whit’s the Richt Pairt o Speech : PoS tagging for ScotsPoS tagging for Scots
Harm Lameris | Sara Stymne

In this paper we explore PoS tagging for the Scots language. Scots is spoken in Scotland and Northern Ireland, and is closely related to English. As no linguistically annotated Scots data were available, we manually PoS tagged a small set that is used for evaluation and training. We use English as a transfer language to examine zero-shot transfer and transfer learning methods. We find that training on a very small amount of Scots data was superior to zero-shot transfer from English. Combining the Scots and English data led to further improvements, with a concatenation method giving the best results. We also compared the use of two different English treebanks and found that a treebank containing web data was superior in the zero-shot setting, while it was outperformed by a treebank containing a mix of genres when combined with Scots data.

pdf bib
Discriminating Between Similar Nordic Languages
René Haas | Leon Derczynski

Automatic language identification is a challenging problem. Discriminating between closely related languages is especially difficult. This paper presents a machine learning approach for automatic language identification for the Nordic languages, which often suffer miscategorisation by existing state-of-the-art tools. Concretely we will focus on discrimination between six Nordic languages : Danish, Swedish, Norwegian (Nynorsk), Norwegian (Bokml), Faroese and Icelandic.

pdf bib
Optimizing a Supervised Classifier for a Difficult Language Identification Problem
Yves Bestgen

This paper describes the system developed by the Laboratoire d’analyse statistique des textes for the Dravidian Language Identification (DLI) shared task of VarDial 2021. This task is particularly difficult because the materials consists of short YouTube comments, written in Roman script, from three closely related Dravidian languages, and a fourth category consisting of several other languages in varying proportions, all mixed with English. The proposed system is made up of a logistic regression model which uses as only features n-grams of characters with a maximum length of 5. After its optimization both in terms of the feature weighting and the classifier parameters, it ranked first in the challenge. The additional analyses carried out underline the importance of optimization, especially when the measure of effectiveness is the Macro-F1.

pdf bib
Comparing Approaches to Dravidian Language IdentificationDravidian Language Identification
Tommi Jauhiainen | Tharindu Ranasinghe | Marcos Zampieri

This paper describes the submissions by team HWR to the Dravidian Language Identification (DLI) shared task organized at VarDial 2021 workshop. The DLI training set includes 16,674 YouTube comments written in Roman script containing code-mixed text with English and one of the three South Dravidian languages : Kannada, Malayalam, and Tamil. We submitted results generated using two models, a Naive Bayes classifier with adaptive language models, which has shown to obtain competitive performance in many language and dialect identification tasks, and a transformer-based model which is widely regarded as the state-of-the-art in a number of NLP tasks. Our first submission was sent in the closed submission track using only the training set provided by the shared task organisers, whereas the second submission is considered to be open as it used a pretrained model trained with external data. Our team attained shared second position in the shared task with the submission based on Naive Bayes. Our results reinforce the idea that deep learning methods are not as competitive in language identification related tasks as they are in many other text classification tasks.


bib (full) Proceedings of the Sixth Arabic Natural Language Processing Workshop

pdf bib
Proceedings of the Sixth Arabic Natural Language Processing Workshop
Nizar Habash | Houda Bouamor | Hazem Hajj | Walid Magdy | Wajdi Zaghouani | Fethi Bougares | Nadi Tomeh | Ibrahim Abu Farha | Samia Touileb

pdf bib
Benchmarking Transformer-based Language Models for Arabic Sentiment and Sarcasm DetectionArabic Sentiment and Sarcasm Detection
Ibrahim Abu Farha | Walid Magdy

The introduction of transformer-based language models has been a revolutionary step for natural language processing (NLP) research. These models, such as BERT, GPT and ELECTRA, led to state-of-the-art performance in many NLP tasks. Most of these models were initially developed for English and other languages followed later. Recently, several Arabic-specific models started emerging. However, there are limited direct comparisons between these models. In this paper, we evaluate the performance of 24 of these models on Arabic sentiment and sarcasm detection. Our results show that the models achieving the best performance are those that are trained on only Arabic data, including dialectal Arabic, and use a larger number of parameters, such as the recently released MARBERT. However, we noticed that AraELECTRA is one of the top performing models while being much more efficient in its computational cost. Finally, the experiments on AraGPT2 variants showed low performance compared to BERT models, which indicates that it might not be suitable for classification tasks.

pdf bib
Kawarith : an Arabic Twitter Corpus for Crisis EventsArabic Twitter Corpus for Crisis Events
Alaa Alharbi | Mark Lee

Social media (SM) platforms such as Twitter provide large quantities of real-time data that can be leveraged during mass emergencies. Developing tools to support crisis-affected communities requires available datasets, which often do not exist for low resource languages. This paper introduces Kawarith a multi-dialect Arabic Twitter corpus for crisis events, comprising more than a million Arabic tweets collected during 22 crises that occurred between 2018 and 2020 and involved several types of hazard. Exploration of this content revealed the most discussed topics and information types, and the paper presents a labelled dataset from seven emergency events that serves as a gold standard for several tasks in crisis informatics research. Using annotated data from the same event, a BERT model is fine-tuned to classify tweets into different categories in the multi- label setting. Results show that BERT-based models yield good performance on this task even with small amounts of task-specific training data.

pdf bib
ArCOV-19 : The First Arabic COVID-19 Twitter Dataset with Propagation NetworksArCOV-19: The First Arabic COVID-19 Twitter Dataset with Propagation Networks
Fatima Haouari | Maram Hasanain | Reem Suwaileh | Tamer Elsayed

In this paper, we present ArCOV-19, an Arabic COVID-19 Twitter dataset that spans one year, covering the period from 27th of January 2020 till 31st of January 2021. ArCOV-19 is the first publicly-available Arabic Twitter dataset covering COVID-19 pandemic that includes about 2.7 M tweets alongside the propagation networks of the most-popular subset of them (i.e., most-retweeted and -liked). The propagation networks include both retweetsand conversational threads (i.e., threads of replies). ArCOV-19 is designed to enable research under several domains including natural language processing, information retrieval, and social computing. Preliminary analysis shows that ArCOV-19 captures rising discussions associated with the first reported cases of the disease as they appeared in the Arab world. In addition to the source tweets and the propagation networks, we also release the search queries and the language-independent crawler used to collect the tweets to encourage the curation of similar datasets.

pdf bib
ALUE : Arabic Language Understanding EvaluationALUE: Arabic Language Understanding Evaluation
Haitham Seelawi | Ibraheem Tuffaha | Mahmoud Gzawi | Wael Farhan | Bashar Talafha | Riham Badawi | Zyad Sober | Oday Al-Dweik | Abed Alhakim Freihat | Hussein Al-Natsheh

The emergence of Multi-task learning (MTL)models in recent years has helped push thestate of the art in Natural Language Un-derstanding (NLU). We strongly believe thatmany NLU problems in Arabic are especiallypoised to reap the benefits of such models. Tothis end we propose the Arabic Language Un-derstanding Evaluation Benchmark (ALUE),based on 8 carefully selected and previouslypublished tasks. For five of these, we providenew privately held evaluation datasets to en-sure the fairness and validity of our benchmark. We also provide a diagnostic dataset to helpresearchers probe the inner workings of theirmodels. Our initial experiments show thatMTL models outperform their singly trainedcounterparts on most tasks. But in order to en-tice participation from the wider community, we stick to publishing singly trained baselinesonly. Nonetheless, our analysis reveals thatthere is plenty of room for improvement inArabic NLU. We hope that ALUE will playa part in helping our community realize someof these improvements. Interested researchersare invited to submit their results to our online, and publicly accessible leaderboard.

pdf bib
Quranic Verses Semantic Relatedness Using AraBERTAraBERT
Abdullah Alsaleh | Eric Atwell | Abdulrahman Altahhan

Bidirectional Encoder Representations from Transformers (BERT) has gained popularity in recent years producing state-of-the-art performances across Natural Language Processing tasks. In this paper, we used AraBERT language model to classify pairs of verses provided by the QurSim dataset to either be semantically related or not. We have pre-processed The QurSim dataset and formed three datasets for comparisons. Also, we have used both versions of AraBERT, which are AraBERTv02 and AraBERTv2, to recognise which version performs the best with the given datasets. The best results was AraBERTv02 with 92 % accuracy score using a dataset comprised of label ‘2’ and label’ -1’, the latter was generated outside of QurSim dataset.

pdf bib
AraELECTRA : Pre-Training Text Discriminators for Arabic Language UnderstandingAraELECTRA: Pre-Training Text Discriminators for Arabic Language Understanding
Wissam Antoun | Fady Baly | Hazem Hajj

Advances in English language representation enabled a more sample-efficient pre-training task by Efficiently Learning an Encoder that Classifies Token Replacements Accurately (ELECTRA). Which, instead of training a model to recover masked tokens, it trains a discriminator model to distinguish true input tokens from corrupted tokens that were replaced by a generator network. On the other hand, current Arabic language representation approaches rely only on pretraining via masked language modeling. In this paper, we develop an Arabic language representation model, which we name AraELECTRA. Our model is pretrained using the replaced token detection objective on large Arabic text corpora. We evaluate our model on multiple Arabic NLP tasks, including reading comprehension, sentiment analysis, and named-entity recognition and we show that AraELECTRA outperforms current state-of-the-art Arabic language representation models, given the same pretraining data and with even a smaller model size.

pdf bib
AraGPT2 : Pre-Trained Transformer for Arabic Language GenerationAraGPT2: Pre-Trained Transformer for Arabic Language Generation
Wissam Antoun | Fady Baly | Hazem Hajj

Recently, pre-trained transformer-based architectures have proven to be very efficient at language modeling and understanding, given that they are trained on a large enough corpus. Applications in language generation for Arabic are still lagging in comparison to other NLP advances primarily due to the lack of advanced Arabic language generation models. In this paper, we develop the first advanced Arabic language generation model, AraGPT2, trained from scratch on a large Arabic corpus of internet text and news articles. Our largest model, AraGPT2-mega, has 1.46 billion parameters, which makes it the largest Arabic language model available. The mega model was evaluated and showed success on different tasks including synthetic news generation, and zero-shot question answering. For text generation, our best model achieves a perplexity of 29.8 on held-out Wikipedia articles. A study conducted with human evaluators showed the significant success of AraGPT2-mega in generating news articles that are difficult to distinguish from articles written by humans. We thus develop and release an automatic discriminator model with a 98 % percent accuracy in detecting model-generated text. The models are also publicly available, hoping to encourage new research directions and applications for Arabic NLP.

pdf bib
SERAG : Semantic Entity Retrieval from Arabic Knowledge GraphsSERAG: Semantic Entity Retrieval from Arabic Knowledge Graphs
Saher Esmeir

Knowledge graphs (KGs) are widely used to store and access information about entities and their relationships. Given a query, the task of entity retrieval from a KG aims at presenting a ranked list of entities relevant to the query. Lately, an increasing number of models for entity retrieval have shown a significant improvement over traditional methods. These models, however, were developed for English KGs. In this work, we build on one such system, named KEWER, to propose SERAG (Semantic Entity Retrieval from Arabic knowledge Graphs). Like KEWER, SERAG uses random walks to generate entity embeddings. DBpedia-Entity v2 is considered the standard test collection for entity retrieval. We discuss the challenges of using it for non-English languages in general and Arabic in particular. We provide an Arabic version of this standard collection, and use it to evaluate SERAG. SERAG is shown to significantly outperform the popular BM25 model thanks to its multi-hop reasoning.

pdf bib
Introducing A large Tunisian Arabizi Dialectal Dataset for Sentiment AnalysisTunisian Arabizi Dialectal Dataset for Sentiment Analysis
Chayma Fourati | Hatem Haddad | Abir Messaoudi | Moez BenHajhmida | Aymen Ben Elhaj Mabrouk | Malek Naski

On various Social Media platforms, people, tend to use the informal way to communicate, or write posts and comments : their local dialects. In Africa, more than 1500 dialects and languages exist. Particularly, Tunisians talk and write informally using Latin letters and numbers rather than Arabic ones. In this paper, we introduce a large common-crawl-based Tunisian Arabizi dialectal dataset dedicated for Sentiment Analysis. The dataset consists of a total of 100k comments (about movies, politic, sport, etc.) annotated manually by Tunisian native speakers as Positive, negative and Neutral. We evaluate our dataset on sentiment analysis task using the Bidirectional Encoder Representations from Transformers (BERT) as a contextual language model in its multilingual version (mBERT) as an embedding technique then combining mBERT with Convolutional Neural Network (CNN) as classifier. The dataset is publicly available.

pdf bib
Improving Cross-Lingual Transfer for Event Argument Extraction with Language-Universal Sentence Structures
Minh Van Nguyen | Thien Huu Nguyen

We study the problem of Cross-lingual Event Argument Extraction (CEAE). The task aims to predict argument roles of entity mentions for events in text, whose language is different from the language that a predictive model has been trained on. Previous work on CEAE has shown the cross-lingual benefits of universal dependency trees in capturing shared syntactic structures of sentences across languages. In particular, this work exploits the existence of the syntactic connections between the words in the dependency trees as the anchor knowledge to transfer the representation learning across languages for CEAE models (i.e., via graph convolutional neural networks GCNs). In this paper, we introduce two novel sources of language-independent information for CEAE models based on the semantic similarity and the universal dependency relations of the word pairs in different languages. We propose to use the two sources of information to produce shared sentence structures to bridge the gap between languages and improve the cross-lingual performance of the CEAE models. Extensive experiments are conducted with Arabic, Chinese, and English to demonstrate the effectiveness of the proposed method for CEAE.

pdf bib
BERT-based Multi-Task Model for Country and Province Level MSA and Dialectal Arabic IdentificationBERT-based Multi-Task Model for Country and Province Level MSA and Dialectal Arabic Identification
Abdellah El Mekki | Abdelkader El Mahdaouy | Kabil Essefar | Nabil El Mamoun | Ismail Berrada | Ahmed Khoumsi

Dialect and standard language identification are crucial tasks for many Arabic natural language processing applications. In this paper, we present our deep learning-based system, submitted to the second NADI shared task for country-level and province-level identification of Modern Standard Arabic (MSA) and Dialectal Arabic (DA). The system is based on an end-to-end deep Multi-Task Learning (MTL) model to tackle both country-level and province-level MSA / DA identification. The latter MTL model consists of a shared Bidirectional Encoder Representation Transformers (BERT) encoder, two task-specific attention layers, and two classifiers. Our key idea is to leverage both the task-discriminative and the inter-task shared features for country and province MSA / DA identification. The obtained results show that our MTL model outperforms single-task models on most subtasks.

pdf bib
Country-level Arabic Dialect Identification using RNNs with and without Linguistic FeaturesArabic Dialect Identification using RNNs with and without Linguistic Features
Elsayed Issa | Mohammed AlShakhori1 | Reda Al-Bahrani | Gus Hahn-Powell

This work investigates the value of augmenting recurrent neural networks with feature engineering for the Second Nuanced Arabic Dialect Identification (NADI) Subtask 1.2 : Country-level DA identification. We compare the performance of a simple word-level LSTM using pretrained embeddings with one enhanced using feature embeddings for engineered linguistic features. Our results show that the addition of explicit features to the LSTM is detrimental to performance. We attribute this performance loss to the bivalency of some linguistic items in some text, ubiquity of topics, and participant mobility.

pdf bib
Arabic Dialect Identification based on a Weighted Concatenation of TF-IDF FeaturesArabic Dialect Identification based on a Weighted Concatenation of TF-IDF Features
Mohamed Lichouri | Mourad Abbas | Khaled Lounnas | Besma Benaziz | Aicha Zitouni

In this paper, we analyze the impact of the weighted concatenation of TF-IDF features for the Arabic Dialect Identification task while we participated in the NADI2021 shared task. This study is performed for two subtasks : subtask 1.1 (country-level MSA) and subtask 1.2 (country-level DA) identification. The classifiers supporting our comparative study are Linear Support Vector Classification (LSVC), Linear Regression (LR), Perceptron, Stochastic Gradient Descent (SGD), Passive Aggressive (PA), Complement Naive Bayes (CNB), MutliLayer Perceptron (MLP), and RidgeClassifier. In the evaluation phase, our system gives F1 scores of 14.87 % and 21.49 %, for country-level MSA and DA identification respectively, which is very close to the average F1 scores achieved by the submitted systems and recorded for both subtasks (18.70 % and 24.23 %).

pdf bib
Machine Learning-Based Approach for Arabic Dialect IdentificationArabic Dialect Identification
Hamada Nayel | Ahmed Hassan | Mahmoud Sobhi | Ahmed El-Sawy

This paper describes our systems submitted to the Second Nuanced Arabic Dialect Identification Shared Task (NADI 2021). Dialect identification is the task of automatically detecting the source variety of a given text or speech segment. There are four subtasks, two subtasks for country-level identification and the other two subtasks for province-level identification. The data in this task covers a total of 100 provinces from all 21 Arab countries and come from the Twitter domain. The proposed systems depend on five machine-learning approaches namely Complement Nave Bayes, Support Vector Machine, Decision Tree, Logistic Regression and Random Forest Classifiers. F1 macro-averaged score of Nave Bayes classifier outperformed all other classifiers for development and test data.

pdf bib
Overview of the WANLP 2021 Shared Task on Sarcasm and Sentiment Detection in ArabicWANLP 2021 Shared Task on Sarcasm and Sentiment Detection in Arabic
Ibrahim Abu Farha | Wajdi Zaghouani | Walid Magdy

This paper provides an overview of the WANLP 2021 shared task on sarcasm and sentiment detection in Arabic. The shared task has two subtasks : sarcasm detection (subtask 1) and sentiment analysis (subtask 2). This shared task aims to promote and bring attention to Arabic sarcasm detection, which is crucial to improve the performance in other tasks such as sentiment analysis. The dataset used in this shared task, namely ArSarcasm-v2, consists of 15,548 tweets labelled for sarcasm, sentiment and dialect. We received 27 and 22 submissions for subtasks 1 and 2 respectively. Most of the approaches relied on using and fine-tuning pre-trained language models such as AraBERT and MARBERT. The top achieved results for the sarcasm detection and sentiment analysis tasks were 0.6225 F1-score and 0.748 F1-PN respectively.

pdf bib
Sarcasm and Sentiment Detection In Arabic Tweets Using BERT-based Models and Data AugmentationArabic Tweets Using BERT-based Models and Data Augmentation
Abeer Abuzayed | Hend Al-Khalifa

In this paper, we describe our efforts on the shared task of sarcasm and sentiment detection in Arabic (Abu Farha et al., 2021). The shared task consists of two sub-tasks : Sarcasm Detection (Subtask 1) and Sentiment Analysis (Subtask 2). Our experiments were based on fine-tuning seven BERT-based models with data augmentation to solve the imbalanced data problem. For both tasks, the MARBERT BERT-based model with data augmentation outperformed other models with an increase of the F-score by 15 % for both tasks which shows the effectiveness of our approach.

pdf bib
Multi-task Learning Using a Combination of Contextualised and Static Word Embeddings for Arabic Sarcasm Detection and Sentiment AnalysisArabic Sarcasm Detection and Sentiment Analysis
Abdullah I. Alharbi | Mark Lee

Sarcasm detection and sentiment analysis are important tasks in Natural Language Understanding. Sarcasm is a type of expression where the sentiment polarity is flipped by an interfering factor. In this study, we exploited this relationship to enhance both tasks by proposing a multi-task learning approach using a combination of static and contextualised embeddings. Our proposed system achieved the best result in the sarcasm detection subtask.

pdf bib
Sarcasm and Sentiment Detection in Arabic : investigating the interest of character-level featuresArabic: investigating the interest of character-level features
Dhaou Ghoul | Gaël Lejeune

We present three methods developed for the Shared Task on Sarcasm and Sentiment Detection in Arabic. We present a baseline that uses character n-gram features. We also propose two more sophisticated methods : a recurrent neural network with a word level representation and an ensemble classifier relying on word and character-level features. We chose to present results from an ensemble classifier but it was not very successful as compared to the best systems : 22th/37 on sarcasm detection and 15th/22 on sentiment detection. It finally appeared that our baseline could have been improved and beat those results.

pdf bib
SarcasmDet at Sarcasm Detection Task 2021 in Arabic using AraBERT Pretrained ModelSarcasmDet at Sarcasm Detection Task 2021 in Arabic using AraBERT Pretrained Model
Dalya Faraj | Dalya Faraj | Malak Abdullah

This paper presents one of the top five winning solutions for the Shared Task on Sarcasm and Sentiment Detection in Arabic (Subtask-1 Sarcasm Detection). The goal of the task is to identify whether a tweet is sarcastic or not. Our solution has been developed using ensemble technique with AraBERT pre-trained model. We describe the architecture of the submitted solution in the shared task. We also provide the experiments and the hyperparameter tuning that lead to this result. Besides, we discuss and analyze the results by comparing all the models that we trained or tested to achieve a better score in a table design. Our model is ranked fifth out of 27 teams with an F1 score of 0.5985. It is worth mentioning that our model achieved the highest accuracy score of 0.7830

pdf bib
Sarcasm and Sentiment Detection in Arabic language A Hybrid Approach Combining Embeddings and Rule-based FeaturesArabic language A Hybrid Approach Combining Embeddings and Rule-based Features
Kamel Gaanoun | Imade Benelallam

This paper presents the ArabicProcessors team’s system designed for sarcasm (subtask 1) and sentiment (subtask 2) detection shared task. We created a hybrid system by combining rule-based features and both static and dynamic embeddings using transformers and deep learning. The system’s architecture is an ensemble of Naive bayes, MarBERT and Mazajak embedding. This process scored an F1-score of 51 % on sarcasm and 71 % for sentiment detection.

pdf bib
iCompass at Shared Task on Sarcasm and Sentiment Detection in ArabicCompass at Shared Task on Sarcasm and Sentiment Detection in Arabic
Malek Naski | Abir Messaoudi | Hatem Haddad | Moez BenHajhmida | Chayma Fourati | Aymen Ben Elhaj Mabrouk

We describe our submitted system to the 2021 Shared Task on Sarcasm and Sentiment Detection in Arabic (Abu Farha et al., 2021). We tackled both subtasks, namely Sarcasm Detection (Subtask 1) and Sentiment Analysis (Subtask 2). We used state-of-the-art pretrained contextualized text representation models and fine-tuned them according to the downstream task in hand. As a first approach, we used Google’s multilingual BERT and then other Arabic variants : AraBERT, ARBERT and MARBERT. The results found show that MARBERT outperforms all of the previously mentioned models overall, either on Subtask 1 or Subtask 2.

pdf bib
AraBERT and Farasa Segmentation Based Approach For Sarcasm and Sentiment Detection in Arabic TweetsAraBERT and Farasa Segmentation Based Approach For Sarcasm and Sentiment Detection in Arabic Tweets
Anshul Wadhawan

This paper presents our strategy to tackle the EACL WANLP-2021 Shared Task 2 : Sarcasm and Sentiment Detection. One of the subtasks aims at developing a system that identifies whether a given Arabic tweet is sarcastic in nature or not, while the other aims to identify the sentiment of the Arabic tweet. We approach the task in two steps. The first step involves pre processing the provided dataset by performing insertions, deletions and segmentation operations on various parts of the text. The second step involves experimenting with multiple variants of two transformer based models, AraELECTRA and AraBERT. Our final approach was ranked seventh and fourth in the Sarcasm and Sentiment Detection subtasks respectively.


bib (full) Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis

pdf bib
Proceedings of the Eleventh Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
Orphee De Clercq | Alexandra Balahur | Joao Sedoc | Valentin Barriere | Shabnam Tafreshi | Sven Buechel | Veronique Hoste

pdf bib
Emotion Ratings : How Intensity, Annotation Confidence and Agreements are Entangled
Enrica Troiano | Sebastian Padó | Roman Klinger

When humans judge the affective content of texts, they also implicitly assess the correctness of such judgment, that is, their confidence. We hypothesize that people’s (in)confidence that they performed well in an annotation task leads to (dis)agreements among each other. If this is true, confidence may serve as a diagnostic tool for systematic differences in annotations. To probe our assumption, we conduct a study on a subset of the Corpus of Contemporary American English, in which we ask raters to distinguish neutral sentences from emotion-bearing ones, while scoring the confidence of their answers. Confidence turns out to approximate inter-annotator disagreements. Further, we find that confidence is correlated to emotion intensity : perceiving stronger affect in text prompts annotators to more certain classification performances. This insight is relevant for modelling studies of intensity, as it opens the question wether automatic regressors or classifiers actually predict intensity, or rather human’s self-perceived confidence.

pdf bib
Disentangling Document Topic and Author Gender in Multiple Languages : Lessons for Adversarial Debiasing
Erenay Dayanik | Sebastian Padó

Text classification is a central tool in NLP. However, when the target classes are strongly correlated with other textual attributes, text classification models can pick up wrong features, leading to bad generalization and biases. In social media analysis, this problem surfaces for demographic user classes such as language, topic, or gender, which influence the generate text to a substantial extent. Adversarial training has been claimed to mitigate this problem, but thorough evaluation is missing. In this paper, we experiment with text classification of the correlated attributes of document topic and author gender, using a novel multilingual parallel corpus of TED talk transcripts. Our findings are : (a) individual classifiers for topic and author gender are indeed biased ; (b) debiasing with adversarial training works for topic, but breaks down for author gender ; (c) gender debiasing results differ across languages. We interpret the result in terms of feature space overlap, highlighting the role of linguistic surface realization of the target classes.

pdf bib
Universal Joy A Data Set and Results for Classifying Emotions Across Languages
Sotiris Lamprinidis | Federico Bianchi | Daniel Hardt | Dirk Hovy

While emotions are universal aspects of human psychology, they are expressed differently across different languages and cultures. We introduce a new data set of over 530k anonymized public Facebook posts across 18 languages, labeled with five different emotions. Using multilingual BERT embeddings, we show that emotions can be reliably inferred both within and across languages. Zero-shot learning produces promising results for low-resource languages. Following established theories of basic emotions, we provide a detailed analysis of the possibilities and limits of cross-lingual emotion classification. We find that structural and typological similarity between languages facilitates cross-lingual learning, as well as linguistic diversity of training data. Our results suggest that there are commonalities underlying the expression of emotion in different languages. We publicly release the anonymized data for future research.

pdf bib
An End-to-End Network for Emotion-Cause Pair Extraction
Aaditya Singh | Shreeshail Hingane | Saim Wani | Ashutosh Modi

The task of Emotion-Cause Pair Extraction (ECPE) aims to extract all potential clause-pairs of emotions and their corresponding causes in a document. Unlike the more well-studied task of Emotion Cause Extraction (ECE), ECPE does not require the emotion clauses to be provided as annotations. Previous works on ECPE have either followed a multi-stage approach where emotion extraction, cause extraction, and pairing are done independently or use complex architectures to resolve its limitations. In this paper, we propose an end-to-end model for the ECPE task. Due to the unavailability of an English language ECPE corpus, we adapt the NTCIR-13 ECE corpus and establish a baseline for the ECPE task on this dataset. On this dataset, the proposed method produces significant performance improvements (6.5 % increase in F1 score) over the multi-stage approach and achieves comparable performance to the state-of-the-art methods.

pdf bib
PVG at WASSA 2021 : A Multi-Input, Multi-Task, Transformer-Based Architecture for Empathy and Distress PredictionPVG at WASSA 2021: A Multi-Input, Multi-Task, Transformer-Based Architecture for Empathy and Distress Prediction
Atharva Kulkarni | Sunanda Somwase | Shivam Rajput | Manisha Marathe

Active research pertaining to the affective phenomenon of empathy and distress is invaluable for improving human-machine interaction. Predicting intensities of such complex emotions from textual data is difficult, as these constructs are deeply rooted in the psychological theory. Consequently, for better prediction, it becomes imperative to take into account ancillary factors such as the psychological test scores, demographic features, underlying latent primitive emotions, along with the text’s undertone and its psychological complexity. This paper proffers team PVG’s solution to the WASSA 2021 Shared Task on Predicting Empathy and Emotion in Reaction to News Stories. Leveraging the textual data, demographic features, psychological test score, and the intrinsic interdependencies of primitive emotions and empathy, we propose a multi-input, multi-task framework for the task of empathy score prediction. Here, the empathy score prediction is considered the primary task, while emotion and empathy classification are considered secondary auxiliary tasks. For the distress score prediction task, the system is further boosted by the addition of lexical features. Our submission ranked 1st based on the average correlation (0.545) as well as the distress correlation (0.574), and 2nd for the empathy Pearson correlation (0.517).

pdf bib
WASSA@IITK at WASSA 2021 : Multi-task Learning and Transformer Finetuning for Emotion Classification and Empathy PredictionWASSA@IITK at WASSA 2021: Multi-task Learning and Transformer Finetuning for Emotion Classification and Empathy Prediction
Jay Mundra | Rohan Gupta | Sagnik Mukherjee

This paper describes our contribution to the WASSA 2021 shared task on Empathy Prediction and Emotion Classification. The broad goal of this task was to model an empathy score, a distress score and the overall level of emotion of an essay written in response to a newspaper article associated with harm to someone. We have used the ELECTRA model abundantly and also advanced deep learning approaches like multi-task learning. Additionally, we also leveraged standard machine learning techniques like ensembling. Our system achieves a Pearson Correlation Coefficient of 0.533 on sub-task I and a macro F1 score of 0.5528 on sub-task II. We ranked 1st in Emotion Classification sub-task and 3rd in Empathy Prediction sub-task.

pdf bib
Exploring Stylometric and Emotion-Based Features for Multilingual Cross-Domain Hate Speech Detection
Ilia Markov | Nikola Ljubešić | Darja Fišer | Walter Daelemans

In this paper, we describe experiments designed to evaluate the impact of stylometric and emotion-based features on hate speech detection : the task of classifying textual content into hate or non-hate speech classes. Our experiments are conducted for three languages English, Slovene, and Dutch both in in-domain and cross-domain setups, and aim to investigate hate speech using features that model two linguistic phenomena : the writing style of hateful social media content operationalized as function word usage on the one hand, and emotion expression in hateful messages on the other hand. The results of experiments with features that model different combinations of these phenomena support our hypothesis that stylometric and emotion-based features are robust indicators of hate speech. Their contribution remains persistent with respect to domain and language variation. We show that the combination of features that model the targeted phenomena outperforms words and character n-gram features under cross-domain conditions, and provides a significant boost to deep learning models, which currently obtain the best results, when combined with them in an ensemble.

pdf bib
Creating and Evaluating Resources for Sentiment Analysis in the Low-resource Language : SindhiSindhi
Wazir Ali | Naveed Ali | Yong Dai | Jay Kumar | Saifullah Tumrani | Zenglin Xu

In this paper, we develop Sindhi subjective lexicon using a merger of existing English resources : NRC lexicon, list of opinion words, SentiWordNet, Sindhi-English bilingual dictionary, and collection of Sindhi modifiers. The positive or negative sentiment score is assigned to each Sindhi opinion word. Afterwards, we determine the coverage of the proposed lexicon with subjectivity analysis. Moreover, we crawl multi-domain tweet corpus of news, sports, and finance. The crawled corpus is annotated by experienced annotators using the Doccano text annotation tool. The sentiment annotated corpus is evaluated by employing support vector machine (SVM), recurrent neural network (RNN) variants, and convolutional neural network (CNN).

pdf bib
L3CubeMahaSent : A Marathi Tweet-based Sentiment Analysis DatasetL3CubeMahaSent: A Marathi Tweet-based Sentiment Analysis Dataset
Atharva Kulkarni | Meet Mandhane | Manali Likhitkar | Gayatri Kshirsagar | Raviraj Joshi

Sentiment analysis is one of the most fundamental tasks in Natural Language Processing. Popular languages like English, Arabic, Russian, Mandarin, and also Indian languages such as Hindi, Bengali, Tamil have seen a significant amount of work in this area. However, the Marathi language which is the third most popular language in India still lags behind due to the absence of proper datasets. In this paper, we present the first major publicly available Marathi Sentiment Analysis Dataset-L3CubeMahaSent. It is curated using tweets extracted from various Maharashtrian personalities’ Twitter accounts. Our dataset consists of ~16,000 distinct tweets classified in three broad classes viz. positive, negative, and neutral. We also present the guidelines using which we annotated the tweets. Finally, we present the statistics of our dataset and baseline classification results using CNN, LSTM, ULMFiT, and BERT based models.

pdf bib
Me, myself, and ire : Effects of automatic transcription quality on emotion, sarcasm, and personality detection
John Culnan | Seongjin Park | Meghavarshini Krishnaswamy | Rebecca Sharp

In deployment, systems that use speech as input must make use of automated transcriptions. Yet, typically when these systems are evaluated, gold transcriptions are assumed. We explicitly examine the impact of transcription errors on the downstream performance of a multi-modal system on three related tasks from three datasets : emotion, sarcasm, and personality detection. We include three separate transcription tools and show that while all automated transcriptions propagate errors that substantially impact downstream performance, the open-source tools fair worse than the paid tool, though not always straightforwardly, and word error rates do not correlate well with downstream performance. We further find that the inclusion of audio features partially mitigates transcription errors, but that a naive usage of a multi-task setup does not.

pdf bib
Emotional RobBERT and Insensitive BERTje : Combining Transformers and Affect Lexica for Dutch Emotion DetectionRobBERT and Insensitive BERTje: Combining Transformers and Affect Lexica for Dutch Emotion Detection
Luna De Bruyne | Orphee De Clercq | Veronique Hoste

In a first step towards improving Dutch emotion detection, we try to combine the Dutch transformer models BERTje and RobBERT with lexicon-based methods. We propose two architectures : one in which lexicon information is directly injected into the transformer model and a meta-learning approach where predictions from transformers are combined with lexicon features. The models are tested on 1,000 Dutch tweets and 1,000 captions from TV-shows which have been manually annotated with emotion categories and dimensions. We find that RobBERT clearly outperforms BERTje, but that directly adding lexicon information to transformers does not improve performance. In the meta-learning approach, lexicon information does have a positive effect on BERTje, but not on RobBERT. This suggests that more emotional information is already contained within this latter language model.

pdf bib
EmpNa at WASSA 2021 : A Lightweight Model for the Prediction of Empathy, Distress and Emotions from Reactions to News StoriesEmpNa at WASSA 2021: A Lightweight Model for the Prediction of Empathy, Distress and Emotions from Reactions to News Stories
Giuseppe Vettigli | Antonio Sorgente

This paper describes our submission for the WASSA 2021 shared task regarding the prediction of empathy, distress and emotions from news stories. The solution is based on combining the frequency of words, lexicon-based information, demographics of the annotators and personality of the annotators into a linear model. The prediction of empathy and distress is performed using Linear Regression while the prediction of emotions is performed using Logistic Regression. Both tasks are performed using the same features. Our models rank 4th for the prediction of emotions and 2nd for the prediction of empathy and distress. These results are particularly interesting when considered that the computational requirements of the solution are minimal.