Other Workshops and Events (2020)


Contents

Proceedings of the Workshop on Automated Extraction of Socio-political Events from News 2020
Ali Hürriyetoğlu | Erdem Yörük | Vanni Zavarella | Hristo Tanev

Automated Extraction of Socio-political Events from News (AESPEN): Workshop and Shared Task Report
Ali Hürriyetoğlu | Vanni Zavarella | Hristo Tanev | Erdem Yörük | Ali Safaya | Osman Mutlu

We describe our effort on automated extraction of socio-political events from news in the scope of a workshop and a shared task we organized at the Language Resources and Evaluation Conference (LREC 2020). We believe that event extraction studies in computational linguistics and in the social and political sciences should further support each other in order to enable large-scale socio-political event information collection across sources, countries, and languages. The event consists of two tracks: regular research papers and a shared task on event sentence coreference identification (ESCI). All submissions were reviewed by five members of the program committee. The workshop attracted research papers related to the evaluation of machine learning methodologies, language resources, material conflict forecasting, and a shared task participation report in the scope of socio-political event information collection. It has shown us the volume and variety of both the data sources and the event information collection approaches related to socio-political events, and the need to fill the gap between automated text processing techniques and the requirements of the social and political sciences.

Text Categorization for Conflict Event Annotation
Fredrik Olsson | Magnus Sahlgren | Fehmi ben Abdesslem | Ariel Ekgren | Kristine Eck

We cast the problem of event annotation as one of text categorization, and compare state-of-the-art text categorization techniques on event data produced within the Uppsala Conflict Data Program (UCDP). Annotating a single text involves assigning labels pertaining to at least 17 distinct categorization tasks, e.g., which organization was the attacker, who was attacked, and where the event took place. The text categorization techniques under scrutiny are a classical Bag-of-Words approach; character-based contextualized embeddings produced by ELMo; embeddings produced by the BERT base model, and a version of BERT base fine-tuned on UCDP data; and a pre-trained and fine-tuned classifier based on ULMFiT. The categorization tasks are very diverse in terms of the number of classes to predict as well as the skewness of the class distributions. The categorization results exhibit a large variability across tasks, ranging from 30.3% to 99.8% F-score.

Seeing the Forest and the Trees: Detection and Cross-Document Coreference Resolution of Militarized Interstate Disputes
Benjamin Radford

Previous efforts to automate the detection of social and political events in text have primarily focused on identifying events described within single sentences or documents. Within a corpus of documents, these automated systems are unable to link event references and thereby recognize singular events described across multiple sentences or documents. A separate literature in computational linguistics on event coreference resolution attempts to link known events to one another within (and across) documents. I provide a data set for evaluating methods to identify certain political events in text and to link related texts to one another based on shared events. The data set, Headlines of War, is built on the Militarized Interstate Disputes data set and offers headlines classified by dispute status and headline pairs labeled with coreference indicators. Additionally, I introduce a model capable of accomplishing both tasks. The multi-task convolutional neural network is shown to be capable of recognizing events and event coreferences given the headlines’ texts and publication dates.

Supervised Event Coding from Text Written in Arabic: Introducing Hadath
Javier Osorio | Alejandro Reyes | Alejandro Beltrán | Atal Ahmadzai

This article introduces Hadath, a supervised protocol for coding event data from text written in Arabic. Hadath contributes to recent efforts in advancing multi-language event coding using computer-based solutions. In this application, we focus on extracting event data about the conflict in Afghanistan from 2008 to 2018 using Arabic information sources. The implementation relies first on a Machine Learning algorithm to classify news stories relevant to the Afghan conflict. Then, using Hadath, we implement the Natural Language Processing component for event coding from Arabic script. The output database contains daily geo-referenced information at the district level on who did what to whom, when and where in the Afghan conflict. The data helps to identify trends in the dynamics of violence, the provision of governance, and traditional conflict resolution in Afghanistan for different actors over time and across space.

Protest Event Analysis: A Longitudinal Analysis for Greece
Konstantina Papanikolaou | Haris Papageorgiou

The advent of Big Data has shifted social science research towards computational methods. The volume of data now available has brought a radical change to traditional approaches, due to the cost and effort needed for processing. Knowledge extraction from heterogeneous and ample data is not an easy task to tackle. Thus, interdisciplinary approaches are necessary, combining experts from both social and computer science. This paper presents work in the context of protest analysis, which falls into the scope of Computational Social Science. More specifically, the contribution of this work is to describe a Computational Social Science methodology for Event Analysis. The presented methodology is generic in the sense that it can be applied to any event typology; moreover, it is innovative and suitable for interdisciplinary tasks as it incorporates the human-in-the-loop. Additionally, a case study is presented concerning Protest Analysis in Greece over the last two decades. The conceptual foundation lies mainly upon claims analysis, and newspaper data were used to map, document and discuss protests in Greece from a longitudinal perspective.

Proceedings of the 1st International Workshop on Artificial Intelligence for Historical Image Enrichment and Access
Yalemisew Abgaz | Amelie Dorn | Jose Luis Preza Diaz | Gerda Koch

Toward the Automatic Retrieval and Annotation of Outsider Art images: A Preliminary Statement
John Roberto | Diego Ortego | Brian Davis

The aim of this position paper is to establish an initial approach to the automatic classification of digital images in the Outsider Art style of painting. Specifically, we explore whether it is possible to classify non-traditional artistic styles using the same features that are used for classifying traditional styles. Our research question is motivated by two facts. First, art historians state that non-traditional styles are influenced by factors outside of the world of art. Second, some studies have shown that several artistic styles confound certain classification techniques. Following current approaches to style prediction, this paper utilises Deep Learning methods to encode image features. Our preliminary experiments provide motivation to think that, as is the case with traditional styles, Outsider Art can be computationally modelled by objective means using training datasets and CNN models. Nevertheless, our results are not conclusive due to the lack of a large available dataset of Outsider Art. Therefore, at the end of the paper, we map out future lines of action, which include the compilation of a large dataset of Outsider Art images and the creation of an ontology of Outsider Art.

Towards a Comprehensive Assessment of the Quality and Richness of the Europeana Metadata of food-related Images
Yalemisew Abgaz | Amelie Dorn | Jose Luis Preza Diaz | Gerda Koch

Semantic enrichment of historical images to build interactive AI systems for the Digital Humanities domain has recently gained significant attention. However, before implementing any semantic enrichment tool for building AI systems, it is crucial to analyse the quality and richness of the existing datasets and understand the areas where semantic enrichment is most required. Here, we propose an approach to conducting a preliminary analysis of selected historical images from the Europeana platform using existing linked data quality assessment tools. The analysis targets food images by collecting metadata provided by curators such as Galleries, Libraries, Archives and Museums (GLAMs) and cultural aggregators such as Europeana. We identified metrics to evaluate the quality of the metadata associated with food-related images harvested from the Europeana platform. In this paper, we present the food-image dataset, the associated metadata and our proposed method for the assessment. The results of our assessment will be used to guide the current effort to semantically enrich the images and build high-quality metadata using Computer Vision.

Proceedings of the First Workshop on Advances in Language and Vision Research
Xin Wang | Jesse Thomason | Ronghang Hu | Xinlei Chen | Peter Anderson | Qi Wu | Asli Celikyilmaz | Jason Baldridge | William Yang Wang

Visual Question Generation from Radiology Images
Mourad Sarrouti | Asma Ben Abacha | Dina Demner-Fushman

Visual Question Generation (VQG), the task of generating a question based on image contents, is an increasingly important area that combines natural language processing and computer vision. Although some recent works have attempted to generate questions from images in the open domain, the task of VQG in the medical domain has not been explored so far. In this paper, we introduce an approach to the generation of visual questions about radiology images called VQGR, i.e. an algorithm that is able to ask a question when shown an image. VQGR first generates new training data from the existing examples, based on contextual word embeddings and image augmentation techniques. It then uses a variational auto-encoder model to encode images into a latent space and decode natural language questions. Automatic evaluations performed on the VQA-RAD dataset of clinical visual questions show that VQGR achieves good performance compared with the baseline system. The source code is available at https://github.com/sarrouti/vqgr.

On the role of effective and referring questions in GuessWhat?!
Mauricio Mazuecos | Alberto Testoni | Raffaella Bernardi | Luciana Benotti

Task success is the standard metric used to evaluate referential visual dialogue systems. In this paper we propose two new metrics that evaluate how each question contributes to the goal. First, we measure how effective each question is by evaluating whether the question discards objects that are not the referent. Second, we define referring questions as those that univocally identify one object in the image. We report the new metrics for human dialogues and for state-of-the-art publicly available models on GuessWhat?!. Regarding our first metric, we find that successful dialogues do not have a higher percentage of effective questions for most models. With respect to the second metric, humans ask referring questions at the end of the dialogue, confirming their guess before guessing. Human dialogues that use this strategy have a higher task success rate, but models do not seem to learn it.

Latent Alignment of Procedural Concepts in Multimodal Recipes
Hossein Rajaby Faghihi | Roshanak Mirzaee | Sudarshan Paliwal | Parisa Kordjamshidi

We propose a novel alignment mechanism for procedural reasoning on a newly released multimodal QA dataset named RecipeQA. Our model solves the textual cloze task, a reading comprehension task over a recipe containing images and instructions. We exploit the power of attention networks, cross-modal representations, and a latent alignment space between instructions and candidate answers to solve the problem. We introduce constrained max-pooling, which refines the max-pooling operation on the alignment matrix to impose disjoint constraints among the outputs of the model. Our evaluation results indicate a 19% improvement over the baselines.

Proceedings of the First Workshop on Automatic Simultaneous Translation
Hua Wu | Colin Cherry | Liang Huang | Zhongjun He | Mark Liberman | James Cross | Yang Liu

Modeling Discourse Structure for Document-level Neural Machine Translation
Junxuan Chen | Xiang Li | Jiarui Zhang | Chulun Zhou | Jianwei Cui | Bin Wang | Jinsong Su

Recently, document-level neural machine translation (NMT) has become a hot topic in the machine translation community. Despite its success, most existing studies ignore the discourse structure information of the input document to be translated, which has been shown to be effective in other tasks. In this paper, we propose to improve document-level NMT with the aid of discourse structure information. Our encoder is based on a hierarchical attention network (HAN) (Miculicich et al., 2018). Specifically, we first parse the input document to obtain its discourse structure. Then, we introduce a Transformer-based path encoder to embed the discourse structure information of each word. Finally, we combine the discourse structure information with the word embedding before it is fed into the encoder. Experimental results on the English-to-German dataset show that our model can significantly outperform both Transformer and Transformer+HAN.

Proceedings of the Fifteenth Workshop on Innovative Use of NLP for Building Educational Applications
Jill Burstein | Ekaterina Kochmar | Claudia Leacock | Nitin Madnani | Ildikó Pilán | Helen Yannakoudakis | Torsten Zesch

Complementary Systems for Off-Topic Spoken Response Detection
Vatsal Raina | Mark Gales | Kate Knill

Increased demand to learn English for business and education has led to growing interest in automatic spoken language assessment and teaching systems. With this shift to automated approaches it is important that systems reliably assess all aspects of a candidate’s responses. This paper examines one form of spoken language assessment: whether the response from the candidate is relevant to the prompt provided. This will be referred to as off-topic spoken response detection. Two forms of previously proposed approaches are examined in this work: the hierarchical attention-based topic model (HATM) and the similarity grid model (SGM). The work focuses on the scenario where the prompt, and associated responses, have not been seen in the training data, enabling the system to be applied to new test scripts without the need to collect data or retrain the model. To improve the performance of the systems on unseen prompts, data augmentation based on easy data augmentation (EDA) and translation-based approaches is applied. Additionally, for the HATM, a form of prompt dropout is described. The systems were evaluated on both seen and unseen prompts from Linguaskill Business and General English tests. For unseen data the performance of the HATM was improved using data augmentation, in contrast to the SGM where no gains were obtained. The two approaches were found to be complementary to one another, yielding a combined F0.5 score of 0.814 for off-topic response detection where the prompts have not been seen in training.

CIMA: A Large Open Access Dialogue Dataset for Tutoring
Katherine Stasaski | Kimberly Kao | Marti A. Hearst

One-to-one tutoring is often an effective means to help students learn, and recent experiments with neural conversation systems are promising. However, large open datasets of tutoring conversations are lacking. To remedy this, we propose a novel asynchronous method for collecting tutoring dialogue via crowdworkers that is both amenable to the needs of deep learning algorithms and reflective of pedagogical concerns. In this approach, extended conversations are obtained between crowdworkers role-playing as both students and tutors. The CIMA collection, which we make publicly available, is novel in that students are exposed to overlapping grounded concepts between exercises and multiple relevant tutoring responses are collected for the same input. CIMA contains several compelling properties from an educational perspective: student role-players complete exercises in fewer turns during the course of the conversation, and tutor players adopt strategies that conform with some educational conversational norms, such as providing hints versus asking questions in appropriate contexts. The dataset enables a model to be trained to generate the next tutoring utterance in a conversation, conditioned on a provided action strategy.

Becoming Linguistically Mature: Modeling English and German Children’s Writing Development Across School Grades
Elma Kerz | Yu Qiao | Daniel Wiechmann | Marcus Ströbel

In this paper we employ a novel approach to advancing our understanding of the development of writing in English and German children across school grades using classification tasks. The data used come from two recently compiled corpora: the English data come from the GiC corpus (983 school children in second-, sixth-, ninth- and eleventh-grade) and the German data are from the FD-LEX corpus (930 school children in fifth- and ninth-grade). The key to this paper is the combined use of what we refer to as ‘complexity contours’, i.e. series of measurements that capture the progression of linguistic complexity within a text, and Recurrent Neural Network (RNN) classifiers that adequately capture the sequential information in those contours. Our experiments demonstrate that RNN classifiers trained on complexity contours achieve higher classification accuracy than those trained on text-average complexity scores. In a second step, we determine the relative importance of the features from four distinct categories through a Sensitivity-Based Pruning approach.

Can Neural Networks Automatically Score Essay Traits?
Sandeep Mathias | Pushpak Bhattacharyya

Essay traits are attributes of an essay that can help explain how well written (or badly written) the essay is. Examples of traits include Content, Organization, Language, Sentence Fluency, Word Choice, etc. A lot of research in the last decade has dealt with automatic holistic essay scoring, where a machine rates an essay and gives it a score. However, writers need feedback, especially if they want to improve their writing, which is why trait scoring is important. In this paper, we show how a deep-learning based system can outperform feature-based machine learning systems, as well as a string kernel system, in scoring essay traits.

Applications of Natural Language Processing in Bilingual Language Teaching: An Indonesian-English Case Study
Zara Maxwell-Smith | Simón González Ochoa | Ben Foley | Hanna Suominen

Multilingual corpora are difficult to compile and a classroom setting adds pedagogy to the mix of factors which make this data so rich and problematic to classify. In this paper, we set out methodological considerations of using automated speech recognition to build a corpus of teacher speech in an Indonesian language classroom. Our preliminary results (64% word error rate) suggest these tools have the potential to speed data collection in this context. We provide practical examples of our data structure, details of our piloted computer-assisted processes, and fine-grained error analysis. Our study is informed and directed by genuine research questions and discussion in both the education and computational linguistics fields. We highlight some of the benefits and risks of using these emerging technologies to analyze the complex work of language teachers and in education more generally.

An empirical investigation of neural methods for content scoring of science explanations
Brian Riordan | Sarah Bichler | Allison Bradford | Jennifer King Chen | Korah Wiley | Libby Gerard | Marcia C. Linn

With the widespread adoption of the Next Generation Science Standards (NGSS), science teachers and online learning environments face the challenge of evaluating students’ integration of different dimensions of science learning. Recent advances in representation learning in natural language processing have proven effective across many natural language processing tasks, but a rigorous evaluation of the relative merits of these methods for scoring complex constructed response formative assessments has not previously been carried out. We present a detailed empirical investigation of feature-based, recurrent neural network, and pre-trained transformer models on scoring content in real-world formative assessment data. We demonstrate that recent neural methods can rival or exceed the performance of feature-based methods. We also provide evidence that different classes of neural models take advantage of different learning cues, and pre-trained transformer models may be more robust to spurious, dataset-specific learning cues, better reflecting scoring rubrics.

GECToR – Grammatical Error Correction: Tag, Not Rewrite
Kostiantyn Omelianchuk | Vitaliy Atrasevych | Artem Chernodub | Oleksandr Skurzhanskyi

In this paper, we present a simple and efficient GEC sequence tagger using a Transformer encoder. Our system is pre-trained on synthetic data and then fine-tuned in two stages: first on errorful corpora, and second on a combination of errorful and error-free parallel corpora. We design custom token-level transformations to map input tokens to target corrections. Our best single-model / ensemble GEC tagger achieves an F_0.5 of 65.3/66.5 on CONLL-2014 (test) and F_0.5 of 72.4/73.6 on BEA-2019 (test). Its inference speed is up to 10 times as fast as a Transformer-based seq2seq GEC system.

Interpreting Neural CWI Classifiers’ Weights as Vocabulary Size
Yo Ehara

Complex Word Identification (CWI) is a task for the identification of words that are challenging for second-language learners to read. Even though the use of neural classifiers is now common in CWI, the interpretation of their parameters remains difficult. This paper analyzes neural CWI classifiers and shows that some of their parameters can be interpreted as vocabulary size. We present a novel formalization of vocabulary size measurement methods that are practiced in the applied linguistics field as a kind of neural classifier. We also contribute to building a novel dataset for validating vocabulary testing and readability via crowdsourcing.

Predicting the Difficulty and Response Time of Multiple Choice Questions Using Transfer Learning
Kang Xue | Victoria Yaneva | Christopher Runyon | Peter Baldwin

This paper investigates whether transfer learning can improve the prediction of the difficulty and response time parameters for 18,000 multiple-choice questions from a high-stakes medical exam. The type of signal that best predicts difficulty and response time is also explored, both in terms of representation abstraction and item component used as input (e.g., whole item, answer options only, etc.). The results indicate that, for our sample, transfer learning can improve the prediction of item difficulty when response time is used as an auxiliary task but not the other way around. In addition, difficulty was best predicted using signal from the item stem (the description of the clinical case), while all parts of the item were important for predicting the response time.

Proceedings of the 19th SIGBioMed Workshop on Biomedical Language Processing
Dina Demner-Fushman | Kevin Bretonnel Cohen | Sophia Ananiadou | Junichi Tsujii

Interactive Extractive Search over Biomedical Corpora
Hillel Taub Tabib | Micah Shlain | Shoval Sadde | Dan Lahav | Matan Eyal | Yaara Cohen | Yoav Goldberg

We present a system that allows life-science researchers to search a linguistically annotated corpus of scientific texts using patterns over dependency graphs, as well as patterns over token sequences and a powerful variant of boolean keyword queries. In contrast to previous attempts at dependency-based search, we introduce a light-weight query language that does not require the user to know the details of the underlying linguistic representations, and instead allows querying the corpus by providing an example sentence coupled with simple markup. Search is performed at interactive speed due to an efficient linguistic graph-indexing and retrieval engine. This allows for rapid exploration, development and refinement of user queries. We demonstrate the system using example workflows over two corpora: the PubMed corpus, including 14,446,243 PubMed abstracts, and the CORD-19 dataset, a collection of over 45,000 research papers focused on COVID-19 research. The system is publicly available at https://allenai.github.io/spike

Improving Biomedical Analogical Retrieval with Embedding of Structural Dependencies
Amandalynne Paullada | Bethany Percha | Trevor Cohen

Inferring the nature of the relationships between biomedical entities from text is an important problem due to the difficulty of maintaining human-curated knowledge bases in rapidly evolving fields. Neural word embeddings have earned attention for an apparent ability to encode relational information. However, word embedding models that disregard syntax during training are limited in their ability to encode the structural relationships fundamental to cognitive theories of analogy. In this paper, we demonstrate the utility of encoding dependency structure in word embeddings in a model we call Embedding of Structural Dependencies (ESD) as a way to represent biomedical relationships in two analogical retrieval tasks: a relationship retrieval (RR) task, and a literature-based discovery (LBD) task meant to hypothesize plausible relationships between pairs of entities unseen in training. We compare our model to skip-gram with negative sampling (SGNS), using 19 databases of biomedical relationships as our evaluation data, with improvements in performance on 17 (LBD) and 18 (RR) of these sets. These results suggest embeddings encoding dependency path information are of value for biomedical analogy retrieval.

A BERT-based One-Pass Multi-Task Model for Clinical Temporal Relation Extraction
Chen Lin | Timothy Miller | Dmitriy Dligach | Farig Sadeque | Steven Bethard | Guergana Savova

Recently, BERT has achieved state-of-the-art performance in temporal relation extraction from clinical Electronic Medical Records text. However, the current approach is inefficient as it requires multiple passes through each input sequence. We extend a recently-proposed one-pass model for relation classification to a one-pass model for relation extraction. We augment this framework by introducing global embeddings to help with long-distance relation inference, and by multi-task learning to increase model performance and generalizability. Our proposed model produces results on par with the state-of-the-art in temporal relation extraction on the THYME corpus and is much greener in computational cost.

Neural Transduction of Letter Position Dyslexia using an Anagram Matrix Representation
Avi Bleiweiss

Research on analyzing reading patterns of dyslectic children has mainly been driven by classifying dyslexia types offline. We contend that a framework to remedy reading errors inline is more far-reaching and will help to further advance our understanding of this impairment. In this paper, we propose a simple and intuitive neural model to reinstate migrating words that transpire in letter position dyslexia, a visual analysis deficit to the encoding of character order within a word. Introduced by the anagram matrix representation of an input verse, the novelty of our work lies in the expansion from one to a two dimensional context window for training. This warrants words that only differ in the disposition of letters to remain interpreted semantically similar in the embedding space. Subject to the apparent constraints of the self-attention transformer architecture, our model achieved a unigram BLEU score of 40.6 on our reconstructed dataset of the Shakespeare sonnets.

Extensive Error Analysis and a Learning-Based Evaluation of Medical Entity Recognition Systems to Approximate User Experience
Isar Nejadgholi | Kathleen C. Fraser | Berry de Bruijn

When comparing entities extracted by a medical entity recognition system with gold standard annotations over a test set, two types of mismatches might occur: label mismatch or span mismatch. Here we focus on span mismatch and show that its severity can vary from a serious error to a fully acceptable entity extraction, due to the subjectivity of span annotations. For a domain-specific BERT-based NER system, we show that 25% of the errors have the same labels and overlapping spans with the gold standard entities. We collected expert judgements which show that more than 90% of these mismatches are accepted or partially accepted by the user. Using the training set of the NER system, we built a fast and lightweight entity classifier to approximate the user experience of such mismatches by accepting or rejecting them. The decisions made by this classifier are used to calculate a learning-based F-score, which is shown to be a better approximation of a forgiving user’s experience than the relaxed F-score. We demonstrate the results of applying the proposed evaluation metric to a variety of deep learning medical entity recognition models trained on two datasets.

Global Locality in Biomedical Relation and Event Extraction
Elaheh ShafieiBavani | Antonio Jimeno Yepes | Xu Zhong | David Martinez Iraola

Due to the exponential growth of the biomedical literature, event and relation extraction are important tasks in biomedical text mining. Most work focuses only on relation extraction, detecting a single entity-pair mention in a short span of text, which is not ideal given the long sentences that appear in biomedical contexts. We propose an approach to both relation and event extraction that simultaneously predicts relationships between all mention pairs in a text. We also perform an empirical study of different network setups for this purpose. The best performing model includes a set of multi-head attentions and convolutions, an adaptation of the transformer architecture, which offers self-attention the ability to strengthen dependencies among related elements and models the interaction between features extracted by multiple attention heads. Experimental results demonstrate that our approach outperforms the state of the art on a set of benchmark biomedical corpora, including the BioNLP 2009, 2011 and 2013 and BioCreative 2017 shared tasks.

up

bib (full) Proceedings of the 13th Workshop on Building and Using Comparable Corpora

pdf bib
Proceedings of the 13th Workshop on Building and Using Comparable Corpora
Reinhard Rapp | Pierre Zweigenbaum | Serge Sharoff

pdf bib
Constructing a Bilingual Corpus of Parallel Tweets
Hamdy Mubarak | Sabit Hassan | Ahmed Abdelali

In a bid to reach a larger and more diverse audience, Twitter users often post parallel tweets: tweets that contain the same content but are written in different languages. Parallel tweets can be an important resource for developing machine translation (MT) systems, among other natural language processing (NLP) tasks. In this paper, we introduce a generic method for collecting parallel tweets. Using this method, we collect a bilingual corpus of English-Arabic parallel tweets and a list of Twitter accounts that post English-Arabic tweets regularly. Since our method is generic, it can also be used for collecting parallel tweets that cover less-resourced languages such as Serbian and Urdu. Additionally, we annotate a subset of Twitter accounts with their countries of origin and topics of interest, which provides insights about the population who post parallel tweets. This latter information can also be useful for author profiling tasks.

pdf bib
Automatic Creation of Correspondence Table of Meaning Tags from Two Dictionaries in One Language Using Bilingual Word Embedding
Teruo Hirabayashi | Kanako Komiya | Masayuki Asahara | Hiroyuki Shinnou

In this paper, we show how to use bilingual word embeddings (BWE) to automatically create a correspondence table of meaning tags from two dictionaries in one language, and we examine the effectiveness of the method. One obstacle is that the meaning tags do not always correspond one-to-one, because the granularities of the word senses and the concepts differ. We therefore regarded the concept tag that corresponds most closely to a word sense as the correct concept tag for that word sense. We used two BWE methods, a linear transformation matrix and VecMap, and evaluated the most frequent sense (MFS) method and the corpus concatenation method for comparison. The accuracies of the proposed methods were higher than that of the random baseline but lower than those of the MFS and corpus concatenation methods. However, because our method uses the embedding vectors of the word senses, the relations between sense tags and concept tags can be examined by mapping the sense embeddings into the vector space of the concept tags. Moreover, our methods can be applied when only concept or word sense embeddings are available, whereas the MFS method requires a parallel corpus and the corpus concatenation method needs two tagged corpora.
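The linear-transformation flavour of such a mapping can be sketched as follows (toy random data and our own function names; this is a generic Mikolov-style least-squares mapping, not the paper's implementation): learn a matrix W minimising ||XW − Y||² over seed sense/concept pairs, then map a new sense embedding into the concept-tag space and pick the nearest concept tag by cosine similarity.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))   # sense embeddings for seed pairs (toy data)
Y = rng.normal(size=(50, 8))   # corresponding concept-tag embeddings (toy data)

# Closed-form least-squares solution of X @ W ~= Y.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

def nearest_concept(sense_vec, concept_vecs):
    """Map a sense embedding into concept space and return the index of the
    concept-tag vector with the highest cosine similarity."""
    mapped = sense_vec @ W
    sims = concept_vecs @ mapped / (
        np.linalg.norm(concept_vecs, axis=1) * np.linalg.norm(mapped))
    return int(np.argmax(sims))
```

With real sense and concept embeddings in place of the random arrays, the returned index would give the candidate concept tag for the correspondence table.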

pdf bib
Benchmarking Multidomain English-Indonesian Machine Translation
Tri Wahyu Guntara | Alham Fikri Aji | Radityo Eko Prasojo

In the context of Machine Translation (MT) from-and-to English, Bahasa Indonesia has been considered a low-resource language, and therefore applying Neural Machine Translation (NMT), which typically requires a large training dataset, proves to be problematic. In this paper, we show otherwise by collecting large, publicly available datasets from the Web, which we split into several domains: news, religion, general, and conversation, to train and benchmark some variants of transformer-based NMT models across the domains. We show using BLEU that our models perform well across domains, outperform the baseline Statistical Machine Translation (SMT) models, and perform comparably with Google Translate. Our datasets (with the standard split for training, validation, and testing), code, and models are available at https://github.com/gunnxx/indonesian-mt-data

pdf bib
Reducing the Search Space for Parallel Sentences in Comparable Corpora
Rémi Cardon | Natalia Grabar

This paper describes and evaluates simple techniques for reducing the search space for parallel sentences in monolingual comparable corpora. Initially, when searching for parallel sentences between two comparable documents, all possible sentence pairs between the documents have to be considered, which introduces a great degree of imbalance between parallel and non-parallel pairs. This is a problem because, even with a high-performing algorithm, a lot of noise will be present in the extracted results, creating the need for an extensive and costly manual checking phase. We work on a manually annotated subset obtained from a French comparable corpus and show how we can drastically reduce the number of sentence pairs that have to be fed to a classifier, so that the results can be handled manually.

pdf bib
TALN/LS2N Participation at the BUCC Shared Task: Bilingual Dictionary Induction from Comparable Corpora
Martin Laville | Amir Hazem | Emmanuel Morin

This paper describes the TALN/LS2N system participation in the Building and Using Comparable Corpora (BUCC) shared task. We first introduce three strategies: (i) a word embedding approach based on fastText embeddings; (ii) a concatenation approach using both character Skip-Gram and character CBOW models; and (iii) a cognates matching approach based on exact string match similarity. We then present the strategy applied for the shared task, which combines the embeddings concatenation and cognates matching approaches. The covered languages are French, English, German, Russian, and Spanish. Overall, our system mixing embeddings concatenation and perfect cognates matching obtained the best results compared to the individual strategies, except for the English-Russian and Russian-English language pairs, for which the concatenation approach was preferred.

pdf bib
BUCC2020: Bilingual Dictionary Induction using Cross-lingual Embedding
Sanjanasri JP | Vijay Krishna Menon | Soman KP

This paper presents a deep learning system for the BUCC 2020 shared task: bilingual dictionary induction from comparable corpora. We submitted two runs for this shared task: the German (de)-English (en) language pair for the closed track and the Tamil (ta)-English (en) pair for the open track. Our core approach focuses on quantifying the semantics of the language pairs, so that the semantics of two different language pairs can be compared or transfer-learned. With the advent of word embeddings, it is possible to quantify this. In this paper, we propose a deep learning approach that makes use of the supplied training data to generate cross-lingual embeddings, which are later used for inducing a bilingual dictionary from comparable corpora.

up

bib (full) Proceedings of the 4th Workshop on Computational Approaches to Code Switching

pdf bib
Proceedings of the 4th Workshop on Computational Approaches to Code Switching
Thamar Solorio | Monojit Choudhury | Kalika Bali | Sunayana Sitaram | Amitava Das | Mona Diab

pdf bib
An Annotated Corpus of Emerging Anglicisms in Spanish Newspaper Headlines
Elena Alvarez-Mellado

The extraction of anglicisms (lexical borrowings from English) is relevant both for lexicographic purposes and for downstream NLP tasks. In this paper we present: (1) a corpus of 21,570 newspaper headlines written in European Spanish annotated with emerging anglicisms and (2) a conditional random field (CRF) baseline model with handcrafted features for anglicism extraction. We describe the newspaper headlines corpus and the annotation tagset and guidelines, and introduce a CRF model that can serve as a baseline for the task of detecting anglicisms. The presented work is a first step towards the creation of an anglicism extractor for Spanish newswire.

up

pdf (full)
bib (full)
Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML)

pdf bib
Second Grand-Challenge and Workshop on Multimodal Language (Challenge-HML)
Amir Zadeh | Louis-Philippe Morency | Paul Pu Liang | Soujanya Poria

pdf bib
A Multi-modal Approach to Fine-grained Opinion Mining on Video Reviews
Edison Marrese-Taylor | Cristian Rodriguez | Jorge Balazs | Stephen Gould | Yutaka Matsuo

Despite the recent advances in opinion mining for written reviews, few works have tackled the problem on other sources of reviews. In light of this issue, we propose a multi-modal approach for mining fine-grained opinions from video reviews that is able to determine the aspects of the item under review that are being discussed and the sentiment orientation towards them. Our approach works at the sentence level without the need for time annotations and uses features derived from the audio, video and language transcriptions of its contents. We evaluate our approach on two datasets and show that leveraging the video and audio modalities consistently provides increased performance over text-only baselines, providing evidence these extra modalities are key in better understanding video reviews.

up

bib (full) Proceedings of the LREC 2020 Workshop on "Citizen Linguistics in Language Resource Development"

pdf bib
Proceedings of the LREC 2020 Workshop on "Citizen Linguistics in Language Resource Development"
James Fiumara | Christopher Cieri | Mark Liberman | Chris Callison-Burch

pdf bib
Speaking Outside the Box: Exploring the Benefits of Unconstrained Input in Crowdsourcing and Citizen Science Platforms
Jon Chamberlain | Udo Kruschwitz | Massimo Poesio

Crowdsourcing approaches present a difficult design challenge for developers: there is a trade-off between the efficiency of the task to be done and the reward given to the user for participating, whether it be altruism, social enhancement, entertainment, or money. This paper explores how crowdsourcing and citizen science systems collect data and complete tasks, illustrated by a case study of the online language game-with-a-purpose Phrase Detectives. The game was originally developed with a constrained interface to prevent player collusion, but subsequently benefited from post-hoc analysis of over 76k unconstrained inputs from users. Understanding interface design and task deconstruction is critical for enabling users to participate in such systems, and the paper concludes with a discussion of the idea that social networks can be viewed as a form of citizen science platform, with both constrained and unconstrained inputs making for a highly complex dataset.

up

bib (full) Proceedings of the workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS2020)

pdf bib
Proceedings of the workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS2020)
Kathy McKeown | Douglas W. Oard | Elizabeth | Richard Schwartz

pdf bib
SEARCHER: Shared Embedding Architecture for Effective Retrieval
Joel Barry | Elizabeth Boschee | Marjorie Freedman | Scott Miller

We describe an approach to cross-lingual information retrieval that does not rely on explicit translation of either document or query terms. Instead, both queries and documents are mapped into a shared embedding space where retrieval is performed. We discuss potential advantages of the approach in handling polysemy and synonymy. We present a method for training the model and give details of the model implementation. We present experimental results for two cases: Somali-English and Bulgarian-English CLIR.

pdf bib
Cross-lingual Information Retrieval with BERT
Zhuolin Jiang | Amro El-Jaroudi | William Hartmann | Damianos Karakos | Lingjun Zhao

Multiple neural language models have been developed recently, e.g., BERT and XLNet, and have achieved impressive results in various NLP tasks, including sentence classification, question answering, and document ranking. In this paper, we explore the use of the popular bidirectional language model BERT to model and learn the relevance between English queries and foreign-language documents in the task of cross-lingual information retrieval. A deep relevance matching model based on BERT is introduced and trained by fine-tuning a pre-trained multilingual BERT model with weak supervision, using home-made CLIR training data derived from parallel corpora. Experimental results on the retrieval of Lithuanian documents against short English queries show that our model is effective and outperforms competitive baseline approaches.

pdf bib
A Comparison of Unsupervised Methods for Ad hoc Cross-Lingual Document Retrieval
Elaine Zosa | Mark Granroth-Wilding | Lidia Pivovarova

We address the problem of linking related documents across languages in a multilingual collection. We evaluate three diverse unsupervised methods to represent and compare documents: (1) multilingual topic models; (2) cross-lingual document embeddings; and (3) Wasserstein distance. We test the performance of these methods in retrieving news articles in Swedish that are known to be related to a given Finnish article. The results show that ensembles of the methods outperform the stand-alone methods, suggesting that they capture complementary characteristics of the documents.
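One simple way to ensemble heterogeneous similarity methods like these (a generic score-fusion sketch under our own naming, not necessarily the authors' exact scheme) is to min-max normalise each method's candidate scores and average them before ranking:

```python
import numpy as np

def ensemble_rank(score_lists):
    """Fuse per-method similarity scores for a set of candidate documents.

    score_lists: list of 1-D sequences, one per method, aligned by candidate.
    Each method's scores are min-max normalised to [0, 1] so that methods
    with different score scales (topic similarity, embedding cosine,
    negated Wasserstein distance, ...) contribute equally, then averaged.
    Returns candidate indices ordered best-first.
    """
    fused = np.zeros(len(score_lists[0]), dtype=float)
    for scores in score_lists:
        s = np.asarray(scores, dtype=float)
        lo, hi = s.min(), s.max()
        fused += (s - lo) / (hi - lo) if hi > lo else np.zeros_like(s)
    fused /= len(score_lists)
    return np.argsort(-fused)
```

Distance-based methods would be negated before fusion, since the function assumes higher scores mean more related.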

up

bib (full) Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora

pdf bib
Proceedings of the 8th Workshop on Challenges in the Management of Large Corpora
Piotr Bański | Adrien Barbaresi | Simon Clematide | Marc Kupietz | Harald Lüngen | Ines Pisetta

pdf bib
Geoparsing the historical Gazetteers of Scotland: accurately computing location in mass digitised texts
Rosa Filgueira | Claire Grover | Melissa Terras | Beatrice Alex

This paper describes work in progress on devising automatic and parallel methods for geoparsing large digital historical textual data by combining the strengths of three natural language processing (NLP) tools, the Edinburgh Geoparser, spaCy and defoe, and employing different tokenisation and named entity recognition (NER) techniques. We apply these tools to a large collection of nineteenth century Scottish geographical dictionaries, and describe preliminary results obtained when processing this data.

pdf bib
The Corpus Query Middleware of Tomorrow: A Proposal for a Hybrid Corpus Query Architecture
Markus Gärtner

The development of dozens of specialized corpus query systems and languages over the past decades has led to a diverse but also fragmented landscape. Today we are faced with a plethora of query tools that each provide unique features, but which are not interoperable and often rely on very specific database back-ends or storage formats. This severely hampers usability, both for end users who want to query different corpora and for corpus designers who wish to provide users with an interface for querying and exploration. We propose a hybrid corpus query architecture as a first step towards overcoming this issue. It takes the form of a middleware system between user front-ends and optional database or text-indexing back-ends. At its core is a custom query evaluation engine for index-less processing of corpus queries. With a flexible JSON-LD query protocol, the approach allows communication with back-end systems to partially solve queries and offset some of the performance penalties imposed by the custom evaluation engine. This paper outlines the details of a first draft of the aforementioned architecture.

pdf bib
Using full text indices for querying spoken language data
Elena Frick | Thomas Schmidt

As part of the ZuMult project, we are currently modelling a backend architecture that should provide query access to corpora from the Archive of Spoken German (AGD) at the Leibniz-Institute for the German Language (IDS). We are exploring how to reuse existing search engine frameworks that provide full-text indices and allow corpora to be queried with one of the corpus query languages (QLs) established and actively used in the corpus research community. For this purpose, we tested MTAS, an open-source Lucene-based search engine for querying text with multilevel annotations. We applied MTAS to three oral corpora stored in the TEI-based ISO standard for transcriptions of spoken language (ISO 24624:2016). These corpora differ from the corpus data that MTAS was developed for, because they include interactions with two or more speakers and are enriched, inter alia, with timeline-based annotations. In this contribution, we report our test results and address issues that arise when search frameworks originally developed for querying written corpora are transferred to the field of spoken language.

pdf bib
Czech National Corpus in 2020: Recent Developments and Future Outlook
Michal Kren

The paper overviews the state of implementation of the Czech National Corpus (CNC) in all the main areas of its operation: corpus compilation, annotation, application development, and user services. As the focus is on recent developments, some areas are described in more detail than others. Close attention is paid to data collection and, in particular, to the description of web application development. This is not only because the CNC has recently seen significant progress in this area, but also because we believe that end-user web applications shape the way linguists and other scholars think about language data and about the range of possibilities they offer. This consideration is even more important given the variability of the CNC corpora.

up

bib (full) Proceedings of the 6th International Workshop on Computational Terminology

pdf bib
Proceedings of the 6th International Workshop on Computational Terminology
Béatrice Daille | Kyo Kageura | Ayla Rigouts Terryn

pdf bib
A study of semantic projection from single word terms to multi-word terms in the environment domain
Yizhe Wang | Beatrice Daille | Nabil Hathout

The semantic projection method is often used in terminology structuring to infer semantic relations between terms. Semantic projection relies upon the assumption of semantic compositionality: the relation that links simple term pairs remains valid in pairs of complex terms built from these simple terms. This paper investigates whether this assumption, commonly adopted in natural language processing, is actually valid. First, we describe the process of constructing a list of semantically linked multi-word terms (MWTs) related to the environmental field through the extraction of semantic variants. Second, we present our analysis of the results of the semantic projection. We find that contexts play an essential role in defining the relations between MWTs.

pdf bib
TermEval 2020: RACAI’s automatic term extraction system
Vasile Pais | Radu Ion

This paper describes RACAI’s automatic term extraction system, which participated in the TermEval 2020 shared task on English monolingual term extraction. We discuss the system architecture and some of the challenges that we faced, and present our results in the English competition.

up

pdf (full)
bib (full)
Proceedings of The 3rd Workshop on e-Commerce and NLP

pdf bib
Proceedings of The 3rd Workshop on e-Commerce and NLP
Shervin Malmasi | Surya Kallumadi | Nicola Ueffing | Oleg Rokhlenko | Eugene Agichtein | Ido Guy

pdf bib
Bootstrapping Named Entity Recognition in E-Commerce with Positive Unlabeled Learning
Hanchu Zhang | Leonhard Hennig | Christoph Alt | Changjian Hu | Yao Meng | Chao Wang

In this work, we introduce a bootstrapped, iterative NER model that integrates a PU learning algorithm for recognizing named entities in a low-resource setting. Our approach combines dictionary-based labeling with syntactically-informed label expansion to efficiently enrich the seed dictionaries. Experimental results on a dataset of manually annotated e-commerce product descriptions demonstrate the effectiveness of the proposed framework.

pdf bib
A Deep Learning System for Sentiment Analysis of Service Calls
Yanan Jia

Sentiment analysis is crucial for the advancement of artificial intelligence (AI). Sentiment understanding can help AI replicate human language and discourse. Studying the formation of, and response to, sentiment states in well-trained Customer Service Representatives (CSRs) can help make the interaction between humans and AI more intelligent. In this paper, a sentiment analysis pipeline is first carried out on real-world multi-party conversations, namely service calls. Based on the acoustic and linguistic features extracted from the source information, a novel aggregated framework for voice sentiment recognition is built. Each party’s sentiment pattern during the communication is investigated, along with the interaction sentiment pattern between all parties.

pdf bib
SimsterQ: A Similarity based Clustering Approach to Opinion Question Answering
Aishwarya Ashok | Ganapathy Natarajan | Ramez Elmasri | Laurel Smith-Stvan

In recent years, the increase in online shopping has resulted in a growing number of online reviews. Customers cannot delve into this huge amount of data when they are looking for specific aspects of a product, some of which can be extracted from the product reviews. In this paper we introduce SimsterQ, a clustering-based system for answering questions that makes use of word vectors. Clustering is performed using cosine similarity scores between sentence vectors of reviews and questions. Two variants (Sim and Median), with and without stopwords, were evaluated against traditional methods that use term frequency. We also used an n-gram approach to study the effect of noise. We used the reviews in the Amazon Reviews dataset to pick the answers. Evaluation was performed both at the individual sentence level, using the top sentence from Okapi BM25 as the gold standard, and at the whole-answer level, using review snippets as the gold standard. At the sentence level our system performed slightly better than a more complicated deep learning method. Our system returned answers similar to the review snippets from the Amazon QA Dataset, as measured by cosine similarity. We also analysed the quality of the clusters generated by our system.
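The cosine-similarity selection step at the core of such a system can be sketched roughly as follows (illustrative only; the function and variable names are ours, and building the sentence vectors, e.g. from averaged word embeddings, is left to the caller):

```python
import numpy as np

def best_answer_sentences(question_vec, sentence_vecs, k=2):
    """Rank candidate review sentences by cosine similarity to the question
    vector and return the indices of the top-k candidates.

    question_vec: 1-D array for the question.
    sentence_vecs: 2-D array, one row per review sentence.
    """
    q = question_vec / np.linalg.norm(question_vec)
    S = sentence_vecs / np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
    sims = S @ q                       # cosine similarity per sentence
    return list(np.argsort(-sims)[:k])  # best-first indices
```

The clustering variants in the abstract would group the review sentences first and compare the question against cluster representatives rather than every sentence.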

pdf bib
On Application of Bayesian Parametric and Non-parametric Methods for User Cohorting in Product Search
Shashank Gupta

In this paper, we study the applicability of Bayesian parametric and non-parametric methods for user clustering in an e-commerce search setting. To the best of our knowledge, this is the first work to present a comparative study of various Bayesian clustering methods in the context of product search. Specifically, we cluster users based on the topical patterns of their product search queries. To evaluate the quality of the clusters formed, we perform a collaborative query recommendation task. Our findings indicate that a simple parametric model like Latent Dirichlet Allocation (LDA) outperforms more sophisticated non-parametric methods like the Distance Dependent Chinese Restaurant Process and Dirichlet Process-based clustering on both tasks.

up

pdf (full)
bib (full)
Proceedings of the Third Workshop on Fact Extraction and VERification (FEVER)

pdf bib
Proceedings of the Third Workshop on Fact Extraction and VERification (FEVER)
Christos Christodoulopoulos | James Thorne | Andreas Vlachos | Oana Cocarascu | Arpit Mittal

pdf bib
Simple Compounded-Label Training for Fact Extraction and Verification
Yixin Nie | Lisa Bauer | Mohit Bansal

Automatic fact checking is an important task motivated by the need to detect and prevent the spread of misinformation across the web. The recently released FEVER challenge provides a benchmark task that assesses systems’ capability for both the retrieval of required evidence and the identification of authentic claims. Previous approaches share a similar pipeline training paradigm that decomposes the task into three subtasks, with each component built and trained separately. Although achieving acceptable scores, these methods make practical application development difficult due to unnecessary complexity and expensive computation. In this paper, we explore the potential of simplifying the system design and reducing training computation by proposing a joint training setup in which a single sequence matching model is trained with compounded labels that give supervision for both the sentence selection and claim verification subtasks, eliminating the duplicate computation that occurs when models are designed and trained separately. Empirical results on FEVER indicate that our method: (1) outperforms the typical multi-task learning approach, and (2) achieves results comparable to top-performing systems with a much simpler training setup and less training computation (in terms of the amount of data consumed and the number of model parameters), facilitating future work on the automatic fact checking task and its practical usage.

pdf bib
Language Models as Fact Checkers?
Nayeon Lee | Belinda Z. Li | Sinong Wang | Wen-tau Yih | Hao Ma | Madian Khabsa

Recent work has suggested that language models (LMs) store both common-sense and factual knowledge learned from pre-training data. In this paper, we leverage this implicit knowledge to create an effective end-to-end fact checker using solely a language model, without any external knowledge or explicit retrieval components. While previous work on extracting knowledge from LMs has focused on the task of open-domain question answering, to the best of our knowledge, this is the first work to examine the use of language models as fact checkers. In a closed-book setting, we show that our zero-shot LM approach outperforms a random baseline on the standard FEVER task, and that our fine-tuned LM compares favorably with standard baselines. Though we do not ultimately outperform methods that use explicit knowledge bases, we believe our exploration shows that this method is viable and has much room for exploration.

up

pdf (full)
bib (full)
Proceedings of the Second Workshop on Figurative Language Processing

pdf bib
Proceedings of the Second Workshop on Figurative Language Processing
Beata Beigman Klebanov | Ekaterina Shutova | Patricia Lichtenstein | Smaranda Muresan | Chee Wee | Anna Feldman | Debanjan Ghosh

pdf bib
Sarcasm Detection in Tweets with BERT and GloVe Embeddings
Akshay Khatri | Pranav P

Sarcasm is a form of communication in which a person states the opposite of what they actually mean. In this paper, we propose using machine learning techniques with BERT and GloVe embeddings to detect sarcasm in tweets. The dataset is preprocessed before extracting the embeddings. The proposed model also uses all of the context provided in the dataset to which the user is reacting, along with the actual response.

pdf bib
C-Net: Contextual Network for Sarcasm Detection
Amit Kumar Jena | Aman Sinha | Rohit Agarwal

Automatic sarcasm detection in conversations is a difficult and tricky task. Classifying an utterance as sarcastic or not in isolation can be futile, since the sarcastic nature of a sentence often relies heavily on its context. This paper presents our proposed model, C-Net, which takes the contextual information of a sentence in a sequential manner to classify it as sarcastic or non-sarcastic. Our model showcases competitive performance in the Sarcasm Detection shared task organised on CodaLab, achieving a 75.0% F1-score on the Twitter dataset and a 66.3% F1-score on the Reddit dataset.

pdf bib
Sarcasm Identification and Detection in Conversion Context using BERT
Kalaivani A. | Thenmozhi D.

Sarcasm analysis in user conversation text is the automatic detection of any irony, insult, hurtful, painful, caustic, humorous, or vulgar content that degrades an individual. It is helpful in the fields of sentiment analysis and cyberbullying detection. With the immense growth of social media, sarcasm analysis helps to prevent insults, hurt, and humour from affecting someone. In this paper, we present traditional machine learning approaches, a deep learning approach (LSTM-RNN), and BERT (Bidirectional Encoder Representations from Transformers) for identifying sarcasm. We used these approaches to build models, to identify how much conversation context or response is needed for sarcasm detection, and evaluated them on two social media forums: a Twitter conversation dataset and a Reddit conversation dataset. We compare the performance of the approaches and obtained best F1 scores of 0.722 and 0.679 for the Twitter and Reddit forums, respectively.

pdf bib
Neural Sarcasm Detection using Conversation Context
Nikhil Jaiswal

Social media platforms and discussion forums such as Reddit, Twitter, etc. are filled with figurative language. Sarcasm is one such category of figurative language whose presence in a conversation makes language understanding a challenging task. In this paper, we present a deep neural architecture for sarcasm detection. We investigate various pre-trained language representation models (PLRMs) like BERT, RoBERTa, etc. and fine-tune them on the Twitter dataset. We experiment with a variety of PLRMs, either on the Twitter utterance in isolation or utilizing the contextual information along with the utterance. Our findings indicate that by taking into consideration the three most recent previous utterances, the model can more accurately classify a conversation as sarcastic or not. Our best performing ensemble model achieves an overall F1 score of 0.790, which ranks us second on the leaderboard of the Sarcasm Shared Task 2020.

pdf bib
A Novel Hierarchical BERT Architecture for Sarcasm Detection
Himani Srivastava | Vaibhav Varshney | Surabhi Kumari | Saurabh Srivastava

Online discussion platforms are often flooded with opinions from users across the world on a variety of topics. Many such posts, comments, or utterances are sarcastic in nature, i.e., the actual intent is hidden in the sentence and differs from its literal meaning, making the detection of such utterances challenging without additional context. In this paper, we propose a novel deep learning-based approach to detect whether an utterance is sarcastic or non-sarcastic by utilizing the given contexts in a hierarchical manner. We used datasets from two online discussion platforms, Twitter and Reddit, for our experiments. Experimental and error analysis shows that the hierarchical models can make full use of history to obtain a better representation of contexts and thus, in turn, outperform their sequential counterparts.

pdf bib
Detecting Sarcasm in Conversation Context Using Transformer-Based Models
Adithya Avvaru | Sanath Vobilisetty | Radhika Mamidi

Sarcasm detection, regarded as one of the sub-problems of sentiment analysis, is a particularly tricky task because the introduction of sarcastic words can flip the sentiment of the sentence itself. To date, most research works revolve around detecting sarcasm in a single sentence, and there is very limited research on detecting sarcasm arising from multiple sentences. Current models use Long Short-Term Memory (LSTM) variants, with or without attention, to detect sarcasm in conversations. We show that models using the state-of-the-art Bidirectional Encoder Representations from Transformers (BERT), which capture syntactic and semantic information across conversation sentences, perform better than the current models. Based on data analysis, we estimated the number of sentences in the conversation that can contribute to the sarcasm, and the results agree with this estimation. We also perform a comparative study of different versions of our BERT-based model against other variants of the LSTM model and XLNet (both using the estimated number of conversation sentences) and find that the BERT-based models outperform them.

pdf bib
Using Conceptual Norms for Metaphor Detection
Mingyu Wan | Kathleen Ahrens | Emmanuele Chersoni | Menghan Jiang | Qi Su | Rong Xiang | Chu-Ren Huang

This paper reports a linguistically-enriched method of detecting token-level metaphors for the Second Shared Task on Metaphor Detection. We participate in all four phases of the competition with both datasets, i.e., Verbs and AllPOS on the VUA and TOEFL datasets. We use the modality exclusivity and embodiment norms to construct a conceptual representation of the nodes and the context. Our system obtains an F-score of 0.652 for the VUA Verbs track, which is 5% higher than the strong baselines. The experimental results across models and datasets indicate the salient contribution of using modality exclusivity and modality shift information for predicting metaphoricity.

pdf bib
Character aware models with similarity learning for metaphor detection
Tarun Kumar | Yashvardhan Sharma

Recent work on automatic sequential metaphor detection has involved recurrent neural networks initialized with different pre-trained word embeddings, sometimes combined with hand-engineered features. To capture lexical and orthographic information automatically, in this paper we propose to add character-based word representations. Also, to contrast the difference between literal and contextual meaning, we utilize a similarity network. We explore these components via two different architectures: a BiLSTM model and a Transformer Encoder model similar to BERT, to perform metaphor identification. We participate in the Second Shared Task on Metaphor Detection on both the VUA and TOEFL datasets with the above models. The experimental results demonstrate the effectiveness of our method, as it outperforms all the systems that participated in the previous shared task.

pdf bib
Recognizing Euphemisms and Dysphemisms Using Sentiment Analysis
Christian Felt | Ellen Riloff

This paper presents the first research aimed at recognizing euphemistic and dysphemistic phrases with natural language processing. Euphemisms soften references to topics that are sensitive, disagreeable, or taboo. Conversely, dysphemisms refer to sensitive topics in a harsh or rude way. For example, passed away and departed are euphemisms for death, while croaked and six feet under are dysphemisms for death. Our work explores the use of sentiment analysis to recognize euphemistic and dysphemistic language. First, we identify near-synonym phrases for three topics (firing, lying, and stealing) using a bootstrapping algorithm for semantic lexicon induction. Next, we classify phrases as euphemistic, dysphemistic, or neutral using lexical sentiment cues and contextual sentiment analysis. We introduce a new gold standard data set and present our experimental results for this task.

pdf bib
Generating Ethnographic Models from Communities’ Online Data
Tomek Strzalkowski | Anna Newheiser | Nathan Kemper | Ning Sa | Bharvee Acharya | Gregorios Katsios

In this paper we describe a computational ethnography study to demonstrate how machine learning techniques can be utilized to exploit bias resident in language data produced by communities with an online presence. Specifically, we leverage the use of figurative language (i.e., the choice of metaphors) in online text (e.g., news media, blogs) produced by distinct communities to obtain models of community worldviews that can be shown to be distinctly biased and thus different from other communities’ models. We automatically construct metaphor-based community models for two distinct scenarios: gun rights and marriage equality. We then conduct a series of experiments to validate the hypothesis that the metaphors found in each community’s online language convey the bias in the community’s worldview.

pdf bib
Augmenting Neural Metaphor Detection with Concreteness
Ghadi Alnafesah | Harish Tayyar Madabushi | Mark Lee

The idea that a shift in concreteness within a sentence indicates the presence of a metaphor has been around for a while. However, recent methods of detecting metaphor that rely on deep neural models have ignored concreteness and related psycholinguistic information. We hypothesize that this information is not available to these models and that its addition will boost their performance in detecting metaphor. We test this hypothesis on the Metaphor Detection Shared Task 2020 and find that the addition of concreteness information does in fact boost deep neural models. We also run tests on data from a previous shared task and show similar results.

pdf bib
Metaphor Detection using Ensembles of Bidirectional Recurrent Neural Networks
Jennifer Brooks | Abdou Youssef

In this paper we present our results from the Second Shared Task on Metaphor Detection, hosted by the Second Workshop on Figurative Language Processing. We use an ensemble of RNN models with bidirectional LSTMs and bidirectional attention mechanisms. Some of the models were trained on all parts of speech. Each of the other models was trained on one of four part-of-speech categories: nouns, verbs, adverbs/adjectives, or other. The models were combined into voting pools, and the voting pools were combined using the logical OR operator.
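The pooling-then-OR combination described in the abstract can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the within-pool rule (majority vote) and the binary labels are assumptions.

```python
# Illustrative sketch: combining per-POS voting pools of binary metaphor
# detectors with a logical OR, as the abstract describes.

def pool_vote(predictions):
    """Majority vote within one pool of binary classifiers (1 = metaphor)."""
    return int(sum(predictions) >= len(predictions) / 2)

def combine_pools(pool_votes):
    """A token is flagged as metaphorical if ANY pool votes 1 (logical OR)."""
    return int(any(pool_votes))

# Example: three pools (e.g., all-POS, verbs, nouns), each with three models.
pools = [[1, 0, 1], [0, 0, 1], [0, 0, 0]]
per_pool = [pool_vote(votes) for votes in pools]  # -> [1, 0, 0]
final = combine_pools(per_pool)                   # -> 1 (metaphor)
```

The OR combination trades precision for recall: a single confident pool is enough to flag a token.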

pdf bib
Testing the role of metadata in metaphor identification
Egon Stemle | Alexander Onysko

This paper describes the adaptation and application of a neural network system for the automatic detection of metaphors. The LSTM BiRNN system participated in the shared task of metaphor identification that was part of the Second Workshop of Figurative Language Processing (FigLang2020) held at the Annual Conference of the Association for Computational Linguistics (ACL2020). The particular focus of our approach is on the potential influence that the metadata given in the ETS Corpus of Non-Native Written English might have on the automatic detection of metaphors in this dataset. The article first discusses the annotated ETS learner data, highlighting some of its peculiarities and inherent biases of metaphor use. A series of evaluations follow in order to test whether specific metadata influence the system performance in the task of automatic metaphor identification. The system is available under the APLv2 open-source license.

up

bib (full) Proceedings of the International FrameNet Workshop 2020: Towards a Global, Multilingual FrameNet

pdf bib
Proceedings of the International FrameNet Workshop 2020: Towards a Global, Multilingual FrameNet
Tiago T. Torrent | Collin F. Baker | Oliver Czulo | Kyoko Ohara | Miriam R. L. Petruck

pdf bib
Finding Corresponding Constructions in English and Japanese in a TED Talk Parallel Corpus using Frames-and-Constructions Analysis
Kyoko Ohara

This paper reports on an effort to search for corresponding constructions in English and Japanese in a TED Talk parallel corpus, using frames-and-constructions analysis (Ohara, 2019; Ohara and Okubo, 2020; cf. Czulo, 2013, 2017). The purpose of the paper is two-fold: (1) to demonstrate the validity of frames-and-constructions analysis for searching for corresponding constructions in typologically unrelated languages; and (2) to assess whether the Do schools kill creativity? TED Talk parallel corpus, annotated in various languages for Multilingual FrameNet, is a good starting place for building a multilingual constructicon. The analysis showed that, similar to our previous findings involving texts in a Japanese-to-English bilingual children’s book, the TED Talk bilingual transcripts include pairs of constructions that share similar pragmatic functions. While the TED Talk parallel corpus constitutes a good resource for frame semantic annotation in multiple languages, it may not be the ideal place to start aligning constructions among typologically unrelated languages. Finally, this work shows that the proposed method, which focuses on heads of sentences, seems valid for searching for corresponding constructions in transcripts of spoken data, as well as in written data, of typologically unrelated languages.

pdf bib
Greek within the Global FrameNet Initiative: Challenges and Conclusions so far
Voula Giouli | Vera Pilitsidou | Hephaestion Christopoulos

Large-coverage lexical resources that bear deep linguistic information have always been considered useful for many natural language processing (NLP) applications, including Machine Translation (MT). In this respect, frame-based resources have been developed for many languages following Frame Semantics and the Berkeley FrameNet project. To a great extent, however, all those efforts have been kept fragmented. Consequently, the Global FrameNet initiative has been conceived as a joint effort to bring together FrameNets in different languages. This paper describes ongoing work towards developing the Greek (EL) counterpart of the Global FrameNet and our efforts to contribute to the Shared Annotation Task. In the paper, we elaborate on the annotation methodology employed, the current status and progress made so far, as well as the problems raised during annotation.

pdf bib
Exploring Crosslinguistic Frame Alignment
Collin F. Baker | Arthur Lorenzi

The FrameNet (FN) project at the International Computer Science Institute in Berkeley (ICSI), which documents the core vocabulary of contemporary English, was the first lexical resource based on Fillmore’s theory of Frame Semantics. Berkeley FrameNet has inspired related projects in roughly a dozen other languages, which have evolved somewhat independently; the current Multilingual FrameNet project (MLFN) is an attempt to find alignments between all of them. The alignment problem is complicated by the fact that these projects have adhered to the Berkeley FrameNet model to varying degrees, and they were also founded at different times, when different versions of the Berkeley FrameNet data were available. We describe several new methods for finding relations of similarity between semantic frames across languages. We will demonstrate ViToXF, a new tool which provides interactive visualizations of these cross-lingual relations between frames, lexical units, and frame elements, based on resources such as multilingual dictionaries and on shared distributional vector spaces, making clear the strengths and weaknesses of different alignment methods.

up

bib (full) Workshop on Games and Natural Language Processing

pdf bib
Workshop on Games and Natural Language Processing
Stephanie M. Lukin

pdf bib
Creating a Sentiment Lexicon with Game-Specific Words for Analyzing NPC Dialogue in The Elder Scrolls V: Skyrim
Thérèse Bergsma | Judith van Stegeren | Mariët Theune

A weak point of rule-based sentiment analysis systems is that the underlying sentiment lexicons are often not adapted to the domain of the text we want to analyze. We created a game-specific sentiment lexicon for the video game Skyrim based on the E-ANEW word list and a dataset of Skyrim’s in-game documents. We calculated sentiment ratings for NPC dialogue using both our lexicon and E-ANEW and compared the resulting sentiment ratings to those of human raters. Both lexicons perform comparably well on our evaluation dialogues, but the game-specific extension performs slightly better on the dominance dimension for dialogue segments and the arousal dimension for full dialogues. To our knowledge, this is the first time that a sentiment analysis lexicon has been adapted to the video game domain.

pdf bib
ClueMeIn: Obtaining More Specific Image Labels Through a Game
Christopher Harris

The ESP Game (also known as the Google Image Labeler) demonstrated how the crowd could perform a task that is straightforward for humans but challenging for computers: providing labels for images. The game facilitated basic image labeling; however, the labels generated were non-specific and limited the ability to distinguish similar images from one another, which restricted their usefulness in search tasks, in annotating images for the visually impaired, and in training computer vision algorithms. In this paper, we describe ClueMeIn, an entertaining web-based game with a purpose that generates more detailed image labels than the ESP Game. We conduct experiments to generate specific image labels and show how the results can lead to improvements in the accuracy of image searches over image labels generated by the ESP Game when using the same public dataset.

pdf bib
Cipher: A Prototype Game-with-a-Purpose for Detecting Errors in Text
Liang Xu | Jon Chamberlain

Errors commonly exist in machine-generated documents and publication materials; however, some correction algorithms do not perform well for complex errors, and it is costly to employ humans to do the task. To address this problem, we developed a prototype computer game called Cipher that encourages people to identify errors in text. Gamification is achieved by introducing the idea of steganography as the entertaining game element. People play the game for entertainment while making valuable annotations that locate text errors. The prototype was tested by 35 players in an evaluation experiment, creating 4,764 annotations. After filtering the data, the system detected manually introduced text errors as well as genuine errors in the texts that were not noticed when they were introduced into the game.

pdf bib
Game Design Evaluation of GWAPs for Collecting Word Associations
Mathieu Lafourcade | Le Brun Nathalie

A GWAP’s design can have a tremendous effect not only on its popularity but also on the quality of the data collected. In this paper, a comparison is undertaken between two GWAPs for building term association lists, namely JeuxDeMots and Quicky Goose. After comparing both game designs, Cohen’s kappa of the associative lists is computed in various configurations in order to assess the similarities and differences of the data they provide.
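As a reminder of the agreement measure the abstract relies on, Cohen's kappa for two annotators can be computed as below. This is a generic textbook sketch, not the paper's evaluation code; how the two games' association lists are paired into comparable decisions is an assumption left out here.

```python
# Illustrative sketch: Cohen's kappa between two aligned label sequences,
# kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and
# p_e is the agreement expected by chance from each rater's label frequencies.

def cohens_kappa(a, b):
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    labels = set(a) | set(b)
    p_e = sum((a.count(l) / n) * (b.count(l) / n)      # chance agreement
              for l in labels)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)
```

With identical sequences kappa is 1.0; with agreement at chance level it drops to 0.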

pdf bib
The Challenge of the TV game La Ghigliottina to NLP
Federico Sangati | Antonio Pascucci | Johanna Monti

In this paper, we describe a Telegram bot, Mago della Ghigliottina (Ghigliottina Wizard), able to solve La Ghigliottina (The Guillotine), the final game of the Italian TV quiz show L’Eredità. Our system relies on linguistic resources and artificial intelligence and achieves better results than human players (and contestants of L’Eredità too). In addition to solving the game, Mago della Ghigliottina can also generate new game instances and challenge users to match the solution.

pdf bib
A 3D Role-Playing Game for Abusive Language Annotation
Federico Bonetti | Sara Tonelli

Gamification has been applied to many linguistic annotation tasks as an alternative to crowdsourcing platforms for collecting annotated data in an inexpensive way. However, we think that much still remains to be explored. Games with a Purpose (GWAPs) tend to lack important elements that we commonly see in commercial games, such as 2D and 3D worlds or a story. Making GWAPs more similar to full-fledged video games, in order to involve users more easily and increase dissemination, is a demanding yet interesting ground to explore. In this paper we present a 3D role-playing game for abusive language annotation that is currently under development.

up

bib (full) Proceedings of the 2020 Globalex Workshop on Linked Lexicography

pdf bib
Proceedings of the 2020 Globalex Workshop on Linked Lexicography
Ilan Kernerman | Simon Krek | John P. McCrae | Jorge Gracia | Sina Ahmadi | Besim Kabashi

pdf bib
Modelling Frequency and Attestations for OntoLex-Lemon
Christian Chiarcos | Maxim Ionov | Jesse de Does | Katrien Depuydt | Anas Fahad Khan | Sander Stolk | Thierry Declerck | John Philip McCrae

The OntoLex vocabulary enjoys increasing popularity as a means of publishing lexical resources with RDF and as Linked Data. The recent publication of a new OntoLex module for lexicography, lexicog, reflects its increasing importance for digital lexicography. However, not all aspects of digital lexicography have been covered to the same extent. In particular, supplementary information drawn from corpora, such as frequency information, links to attestations, and collocation data, was considered to be beyond the scope of lexicog. The OntoLex community has therefore put forward a proposal for a novel module for frequency, attestation and corpus information (FrAC) that not only covers the requirements of digital lexicography but also accommodates essential data structures for lexical information in natural language processing. This paper introduces the current state of the OntoLex-FrAC vocabulary and describes its structure, some selected use cases, elementary concepts and fundamental definitions, with a focus on frequency and attestations.

pdf bib
Towards an Extension of the Linking of the Open Dutch WordNet with Dutch Lexicographic Resources
Thierry Declerck

This extended abstract presents ongoing work on interlinking and merging the Open Dutch WordNet and generic lexicographic resources for Dutch, focusing for now on the Dutch and English versions of Wiktionary and using the Algemeen Nederlands Woordenboek as a quality-checking instance. As the Open Dutch WordNet is already equipped with a relevant number of complex lexical units, we aim at expanding it and proposing a new representational framework for encoding the interlinked and integrated data. The longer-term goal of the work is to investigate whether and how senses can be restricted to particular morphological variations of Dutch lexical entries, and how to represent this information in a Linguistic Linked Open Data compliant format.

pdf bib
Widening the Discussion on “False Friends” in Multilingual Wordnets
Hugo Gonçalo Oliveira | Ana Luís

There are wordnets in many languages, many aligned with Princeton WordNet, some through a (semi-)automatic process, but we rarely see actual discussions on the role of false friends in this process. Having in mind known issues related to such words in language translation, and further motivated by false-friend-related issues in the alignment of a Portuguese wordnet with Princeton WordNet, we aim to widen this discussion, while suggesting preliminary ideas on how wordnets could benefit from this kind of research.

pdf bib
Building Sense Representations in Danish by Combining Word Embeddings with Lexical Resources
Ida Rørmann Olsen | Bolette Pedersen | Asad Sayeed

Our aim is to identify suitable sense representations for NLP in Danish. We investigate sense inventories that correlate with human interpretations of word meaning and ambiguity, as typically described in dictionaries and wordnets, and that are well reflected distributionally, as expressed in word embeddings. To this end, we study a number of highly ambiguous Danish nouns and examine the effectiveness of sense representations constructed by combining vectors from a distributional model with information from a wordnet. We establish representations based on centroids obtained from wordnet synsets and example sentences, and test them in a word sense disambiguation task. We conclude that the more information extracted from the wordnet entries (example sentence, definition, semantic relations), the more successful the sense representation vector.
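The centroid-based sense representations the abstract describes can be sketched roughly as follows. This is an illustrative reconstruction under assumptions (toy embedding lookup, cosine nearest-centroid disambiguation), not the paper's actual pipeline.

```python
# Illustrative sketch: build a sense vector as the centroid of embeddings of
# tokens from a sense's example sentences, then disambiguate a context by
# picking the sense with the closest (cosine) centroid.
import numpy as np

def sense_centroid(example_sentences, embed):
    """Average the embeddings of all known tokens in a sense's examples."""
    vecs = [embed[tok] for sent in example_sentences
            for tok in sent if tok in embed]
    return np.mean(vecs, axis=0)

def disambiguate(context_tokens, sense_centroids, embed):
    """Return the sense whose centroid is most cosine-similar to the context."""
    ctx = np.mean([embed[t] for t in context_tokens if t in embed], axis=0)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(sense_centroids, key=lambda s: cos(sense_centroids[s], ctx))
```

Adding more wordnet information (definitions, related lemmas) to the example sentences would, per the abstract's conclusion, enrich the centroid further.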

pdf bib
Translation Inference by Concept Propagation
Christian Chiarcos | Niko Schenk | Christian Fäth

This paper describes our contribution to the Third Shared Task on Translation Inference across Dictionaries (TIAD-2020). We describe an approach to translation inference based on symbolic methods: the propagation of concepts over a graph of interconnected dictionaries. Given a mapping from source language words to lexical concepts (e.g., synsets) as a seed, we use bilingual dictionaries to extrapolate a mapping of pivot and target language words to these lexical concepts. Translation inference is then performed by looking up the lexical concept(s) of a source language word and returning the target language word(s) for which these lexical concepts have the respective highest score. We present two instantiations of this system: one using WordNet synsets as concepts, and one using lexical entries (translations) as concepts. With a threshold of 0, the latter configuration ranks second among participating systems in terms of F1 score. We also describe additional evaluation experiments on Apertium data, a comparison with an earlier approach based on embedding projection, and an approach for constrained projection that outperforms the TIAD-2020 vanilla system by a large margin.
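The concept-propagation idea can be sketched in a few lines. This is an illustrative simplification, not the system described in the paper: the additive scoring and the single propagation hop are assumptions, whereas the real system chains several pivot dictionaries and applies a score threshold.

```python
# Illustrative sketch: seed source words with weighted concept sets, push
# the concepts through bilingual dictionary pairs to the next language, and
# translate by returning target words sharing concepts, best score first.
from collections import defaultdict

def propagate(word_concepts, bilingual_pairs):
    """Carry concept scores across (word_a, word_b) translation pairs."""
    out = defaultdict(lambda: defaultdict(float))
    for a, b in bilingual_pairs:
        for concept, score in word_concepts.get(a, {}).items():
            out[b][concept] += score
    return out

def translate(word, src_concepts, tgt_concepts):
    """Target words ranked by total score of the source word's concepts."""
    concepts = src_concepts.get(word, {})
    scored = {t: sum(cs.get(c, 0.0) for c in concepts)
              for t, cs in tgt_concepts.items()}
    return sorted((t for t, s in scored.items() if s > 0),
                  key=lambda t: -scored[t])
```

Chaining `propagate` over successive pivot dictionaries extends the same mechanism to indirect translation pairs.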

up

bib (full) 16th Joint ACL - ISO Workshop on Interoperable Semantic Annotation PROCEEDINGS

pdf bib
16th Joint ACL - ISO Workshop on Interoperable Semantic Annotation PROCEEDINGS
Harry Bunt

pdf bib
Annotation-based Semantics
Kiyong Lee

This paper proposes a semantics ABS for the model-theoretic interpretation of annotation structures. It provides a language ABSr that represents semantic forms in a (possibly λ-free) type-theoretic first-order logic. For semantic compositionality, the representation language introduces two operators, ⊕ and ⊘, with subtypes for the conjunctive or distributive composition of semantic forms. ABS also introduces a small set of logical predicates to represent semantic forms in a simplified format. The use of ABSr is illustrated with some annotation structures that conform to ISO 24617 standards on semantic annotation such as ISO-TimeML and ISO-Space.

up

bib (full) Proceedings of the 1st International Workshop on Language Technology Platforms

pdf bib
Proceedings of the 1st International Workshop on Language Technology Platforms
Georg Rehm | Kalina Bontcheva | Khalid Choukri | Jan Hajič | Stelios Piperidis | Andrejs Vasiļjevs

pdf bib
CLARIN: Distributed Language Resources and Technology in a European Infrastructure
Maria Eskevich | Franciska de Jong | Alexander König | Darja Fišer | Dieter Van Uytvanck | Tero Aalto | Lars Borin | Olga Gerassimenko | Jan Hajic | Henk van den Heuvel | Neeme Kahusk | Krista Liin | Martin Matthiesen | Stelios Piperidis | Kadri Vider

CLARIN is a European Research Infrastructure providing access to digital language resources and tools from across Europe and beyond to researchers in the humanities and social sciences. This paper focuses on CLARIN as a platform for the sharing of language resources. It zooms in on the service offer for the aggregation of language repositories and the value proposition for a number of communities that benefit from the enhanced visibility of their data and services as a result of integration in CLARIN. The enhanced findability of language resources is serving the social sciences and humanities (SSH) community at large and supports research communities that aim to collaborate based on virtual collections for a specific domain. The paper also addresses the wider landscape of service platforms based on language technologies which has the potential of becoming a powerful set of interoperable facilities to a variety of communities of use.

pdf bib
Removing European Language Barriers with Innovative Machine Translation Technology
Dario Franceschini | Chiara Canton | Ivan Simonini | Armin Schweinfurth | Adelheid Glott | Sebastian Stüker | Thai-Son Nguyen | Felix Schneider | Thanh-Le Ha | Alex Waibel | Barry Haddow | Philip Williams | Rico Sennrich | Ondřej Bojar | Sangeet Sagar | Dominik Macháček | Otakar Smrž

This paper presents our progress towards deploying a versatile communication platform for highly multilingual live speech translation for conferences and live subtitling of remote meetings. The platform has been designed with a focus on very low latency and high flexibility, while allowing research prototypes of speech and text processing tools to be easily connected, regardless of where they physically run. We outline our architectural solution and also briefly compare it with the ELG platform. Technical details are provided on the most important components, and we summarize the test deployment events we have run so far.

pdf bib
The Kairntech Sherpa – An ML Platform and API for the Enrichment of (not only) Scientific Content
Stefan Geißler

We present a software platform and API that combines various ML and NLP approaches for the analysis and enrichment of textual content. The platform’s design and implementation are guided by the goal of allowing non-technical users to conduct their own experiments and training runs on their respective data, letting them test, tune and deploy analysis models for production. Dedicated packages for subtasks such as document structure processing, document categorization, annotation with existing thesauri, disambiguation and linking, annotation with newly created entity recognizers, and summarization, available as open-source components in isolation, are combined into an end-user-facing, collaborative, scalable platform to support large-scale industrial document analysis. We see the Sherpa’s setup as an answer to the observation that ML has reached a level of maturity that allows useful results to be attained in many analysis scenarios today, but that in-depth technical competencies in the required fields of NLP and AI are often scarce; a setup that focusses on non-technical domain-expert end-users can help to bring the required analysis functionalities closer to the day-to-day reality in business contexts.

pdf bib
NTeALan Dictionaries Platforms: An Example Of Collaboration-Based Model
Elvis Mboning | Daniel Baleba | Jean Marc Bassahak | Ornella Wandji | Jules Assoumou

Nowadays, the scarcity and dispersion of open-source NLP resources and tools in and for African languages make it difficult for researchers to truly fit these languages into current artificial intelligence algorithms, resulting in the technological stagnation of these numerous languages. Created in 2017 with the aim of building communities of voluntary contributors around African native and/or national languages, cultures, NLP technologies and artificial intelligence, the NTeALan association has set up a series of collaborative web platforms intended to allow the aforementioned communities to create and manage their own lexicographic and linguistic resources. This paper presents the first versions of three lexicographic platforms that we developed in and for African languages: the REST/GraphQL API for saving lexicographic resources, the dictionary management platform, and the collaborative dictionary platform. We also describe the data representation format used for these resources. After experimenting with a few dictionaries and looking at user feedback, we are convinced that only collaboration-based approaches and platforms can effectively respond to the challenges of producing quality resources in and for African native and/or national languages.

pdf bib
A Workflow Manager for Complex NLP and Content Curation Workflows
Julian Moreno-Schneider | Peter Bourgonje | Florian Kintzel | Georg Rehm

We present a workflow manager for the flexible creation and customisation of NLP processing pipelines. The workflow manager addresses challenges in interoperability across various NLP tasks and in hardware-based resource usage. Based on the four key principles of generality, flexibility, scalability and efficiency, we present the first version of the workflow manager, providing details on its custom definition language, explaining the communication components, and describing the general system architecture and setup. The system, which is currently being implemented, is grounded in and motivated by real-world industry use cases in several innovation and transfer projects.

up

pdf (full)
bib (full)
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

pdf bib
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies
Gosse Bouma | Yuji Matsumoto | Stephan Oepen | Kenji Sagae | Djamé Seddah | Weiwei Sun | Anders Søgaard | Reut Tsarfaty | Dan Zeman

pdf bib
Syntactic Parsing in Humans and Machines
Paola Merlo

To process the syntactic structures of a language in ways that are compatible with human expectations, we need computational representations of the lexical and syntactic properties that form the basis of human knowledge of words and sentences. Recent neural-network-based and distributed semantics techniques have produced systems of considerable practical success and impressive performance. As has been advocated by many, however, such systems still lack human-like properties. In particular, linguistic, psycholinguistic and neuroscientific investigations have shown that human processing of sentences is sensitive to structure and unbounded relations. In the spirit of better understanding the structure-building and long-distance properties of neural networks, I will present an overview of recent results on agreement and island effects in syntax in several languages. While certain sets of results in the literature indicate that neural language models exhibit long-distance agreement abilities, finer-grained investigation of how these effects are calculated indicates that the similarity spaces these models define do not correlate with human experimental results on intervention similarity in long-distance dependencies. This opens the way to reflections on how to better match the syntactic properties of natural languages in the representations of neural models.

pdf bib
Semi-supervised Parsing with a Variational Autoencoding Parser
Xiao Zhang | Dan Goldwasser

We propose an end-to-end variational autoencoding parsing (VAP) model for semi-supervised graph-based projective dependency parsing. It encodes the input using continuous latent variables in a sequential manner by deep neural networks (DNN) that can utilize the contextual information, and reconstructs the input using a generative model. The VAP model admits a unified structure with different loss functions for labeled and unlabeled data with shared parameters. We conducted experiments on the WSJ data sets, showing that the proposed model can use the unlabeled data to increase performance given a limited amount of labeled data, on a par with a recently proposed semi-supervised parser, while offering faster inference.

pdf bib
Obfuscation for Privacy-preserving Syntactic Parsing
Zhifeng Hu | Serhii Havrylov | Ivan Titov | Shay B. Cohen

The goal of homomorphic encryption is to encrypt data such that another party can operate on it without being explicitly exposed to the content of the original data. We introduce an idea for a privacy-preserving transformation on natural language data, inspired by homomorphic encryption. Our primary tool is obfuscation, relying on the properties of natural language. Specifically, a given English text is obfuscated using a neural model that aims to preserve the syntactic relationships of the original sentence so that the obfuscated sentence can be parsed instead of the original one. The model works at the word level, and learns to obfuscate each word separately by changing it into a new word that has a similar syntactic role. The text obfuscated by our model leads to better performance on three syntactic parsers (two dependency and one constituency parser) in comparison to an upper-bound random substitution baseline. More specifically, the results demonstrate that as more terms are obfuscated (by their part of speech), the substitution upper bound significantly degrades, while the neural model maintains a relatively high-performing parser. All of this is done without much sacrifice of privacy compared to the random substitution upper bound. We also further analyze the results, and discover that the substituted words have similar syntactic properties, but different semantic content, compared to the original words.

pdf bib
Tensors over Semirings for Latent-Variable Weighted Logic Programs
Esma Balkir | Daniel Gildea | Shay B. Cohen

Semiring parsing is an elegant framework for describing parsers by using semiring weighted logic programs. In this paper we present a generalization of this concept: latent-variable semiring parsing. With our framework, any semiring weighted logic program can be latentified by transforming weights from scalar values of a semiring to rank-n arrays, or tensors, of semiring values, allowing the modelling of latent-variable models within the semiring parsing framework. Semiring is too strong a notion when dealing with tensors, and we have to resort to a weaker structure: a partial semiring. We prove that this generalization preserves all the desired properties of the original semiring framework while strictly increasing its expressiveness.
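To make the scalar-weight baseline concrete, here is a minimal, hypothetical sketch of the semiring abstraction that the paper generalizes: the same path-summing recurrence run under two different semirings. This is not the authors' code or formalism; the `Semiring` class, `total_weight` function, and toy graph are all invented for illustration.

```python
# Minimal sketch of a semiring-weighted computation: one dynamic-programming
# recurrence, parameterized by the semiring it runs in.
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Semiring:
    plus: Callable[[Any, Any], Any]   # combines alternative derivations
    times: Callable[[Any, Any], Any]  # combines parts of one derivation
    zero: Any                         # identity of plus
    one: Any                          # identity of times

BOOLEAN = Semiring(lambda a, b: a or b, lambda a, b: a and b, False, True)
PROB = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)

def total_weight(edges, n, sr):
    """Semiring sum over all paths from node 0 to node n-1 in a DAG.
    edges: dict mapping (i, j) with i < j to an edge weight."""
    best = [sr.zero] * n
    best[0] = sr.one
    for j in range(1, n):
        for (i, k), w in edges.items():
            if k == j:
                best[j] = sr.plus(best[j], sr.times(best[i], w))
    return best[n - 1]

# Two paths from 0 to 2: direct (0.1) and via node 1 (0.5 * 0.4 = 0.2).
edges = {(0, 2): 0.1, (0, 1): 0.5, (1, 2): 0.4}
print(total_weight(edges, 3, PROB))                        # ≈ 0.3 (total probability)
print(total_weight({k: True for k in edges}, 3, BOOLEAN))  # True (reachability)
```

The paper's contribution replaces the scalar weights `w` in such a program with tensors of semiring values, which is where the plain semiring laws break down and a partial semiring is needed.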

pdf bib
Self-Training for Unsupervised Parsing with PRPN
Anhad Mohananey | Katharina Kann | Samuel R. Bowman

Neural unsupervised parsing (UP) models learn to parse without access to syntactic annotations, while being optimized for another task like language modeling. In this work, we propose self-training for neural UP models: we leverage aggregated annotations predicted by copies of our model as supervision for future copies. To be able to use our model's predictions during training, we extend a recent neural UP architecture, the PRPN (Shen et al., 2018a), such that it can be trained in a semi-supervised fashion. We then add examples with parses predicted by our model to our unlabeled UP training data. Our self-trained model outperforms the PRPN by 8.1% F1 and the previous state of the art by 1.6% F1. In addition, we show that our architecture can also be helpful for semi-supervised parsing in ultra-low-resource settings.

pdf bib
Adaptation of Multilingual Transformer Encoder for Robust Enhanced Universal Dependency Parsing
Han He | Jinho D. Choi

This paper presents our enhanced dependency parsing approach using transformer encoders, coupled with a simple yet powerful ensemble algorithm that takes advantage of both tree and graph dependency parsing. Two types of transformer encoders are compared: a multilingual encoder and language-specific encoders. Our dependency tree parsing (DTP) approach generates only primary dependencies to form trees, whereas our dependency graph parsing (DGP) approach handles both primary and secondary dependencies to form graphs. Since DGP does not guarantee that the generated graphs are acyclic, the ensemble algorithm is designed to add secondary arcs predicted by DGP to primary arcs predicted by DTP. Our results show that models using the multilingual encoder outperform ones using the language-specific encoders for most languages. The ensemble models generally show a higher labeled attachment score on enhanced dependencies (ELAS) than the DTP and DGP models. As a result, our best models rank third on the macro-average ELAS over 17 languages.

pdf bib
Linear Neural Parsing and Hybrid Enhancement for Enhanced Universal Dependencies
Giuseppe Attardi | Daniele Sartiano | Maria Simi

To accomplish the shared task on dependency parsing, we explore the use of a linear transition-based neural dependency parser as well as a combination of three of them by means of a linear tree combination algorithm. We train separate models for each language on the shared task data. We compare our base parser with two biaffine parsers and also present an ensemble combination of all five parsers, which achieves an average UAS 1.88 points lower than the top official submission. For producing the enhanced dependencies, we exploit a hybrid approach, coupling an algorithmic graph transformation of the dependency tree with predictions made by a multitask machine learning model.

pdf bib
How Much of Enhanced UD Is Contained in UD?
Adam Ek | Jean-Philippe Bernardy

In this paper, we present the submission of team CLASP to the IWPT 2020 Shared Task on parsing enhanced universal dependencies. We develop a tree-to-graph transformation algorithm based on dependency patterns. This algorithm can transform gold UD trees to EUD graphs with an ELAS score of 81.55 and a EULAS score of 96.70. These results show that much of the information needed to construct EUD graphs from UD trees is present in the UD trees. Coupled with a standard UD parser, the method applies to the official test data and yields an ELAS score of 67.85 and a EULAS score of 80.18.
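To illustrate the kind of tree-to-graph dependency pattern such a transformation uses, here is one simplified, hypothetical rule: a conjunct inherits the incoming relation of its governor (a standard EUD propagation pattern). This is not team CLASP's algorithm; the data structure and function names are invented, and the real system applies many such patterns.

```python
# One illustrative UD-tree -> EUD-graph rule: propagate the governor's relation
# to conjuncts, so "ate" in "Mary bought and ate apples" also gets a root-like arc.
def enhance_conj(tree):
    """tree: dict token_id -> (head_id, deprel). Returns extra (head, dep, rel) arcs."""
    extra = []
    for tid, (head, rel) in tree.items():
        if rel == "conj" and head in tree:
            gov_head, gov_rel = tree[head]  # the first conjunct's own attachment
            extra.append((gov_head, tid, gov_rel))
    return extra

# "Mary bought and ate apples": token 4 ("ate") is conj of token 2 ("bought").
tree = {1: (2, "nsubj"), 2: (0, "root"), 3: (4, "cc"), 4: (2, "conj"), 5: (2, "obj")}
print(enhance_conj(tree))  # [(0, 4, 'root')]
```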

bib (full) Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020)

pdf bib
Proceedings of the 7th Workshop on Linked Data in Linguistics (LDL-2020)
Maxim Ionov | John P. McCrae | Christian Chiarcos | Thierry Declerck | Julia Bosque-Gil | Jorge Gracia

pdf bib
Transforming the Cologne Digital Sanskrit Dictionaries into OntoLex-Lemon
Francisco Mondaca | Felix Rau

The Cologne Digital Sanskrit Dictionaries (CDSD) is a large collection of complex digitized Sanskrit dictionaries, consisting of over thirty-five works, and is the most prominent collection of Sanskrit dictionaries worldwide. In this paper we evaluate two methods for transforming the CDSD into OntoLex-Lemon based on a modelling exercise. The first method that we evaluate consists of applying RDFa to the existent TEI-P5 files. The second method consists of transforming the TEI-encoded dictionaries into new files containing RDF triples modelled in OntoLex-Lemon. As a result of the modelling exercise we choose the second method: to transform TEI-encoded lexical data into OntoLex-Lemon by creating new files containing exclusively RDF triples.

pdf bib
Challenges of Word Sense Alignment: Portuguese Language Resources
Ana Salgado | Sina Ahmadi | Alberto Simões | John Philip McCrae | Rute Costa

This paper reports on an ongoing task of monolingual word sense alignment in which a comparative study between the Portuguese Academy of Sciences Dictionary and the Dicionário Aberto is carried out in the context of the ELEXIS (European Lexicographic Infrastructure) project. Word sense alignment involves searching for matching senses within dictionary entries of different lexical resources and linking them, which poses significant challenges. The lexicographic criteria are not always entirely consistent within individual dictionaries, and even less so across different projects, where different options may have been assumed in terms of structure and especially the wording techniques of lexicographic glosses. This hinders the task of matching senses. We aim to present our annotation workflow in Portuguese using Semantic Web technologies. The results obtained are useful for the discussion within the community.

pdf bib
Lexemes in Wikidata: 2020 status
Finn Nielsen

Wikidata now records data about lexemes, senses and lexical forms and exposes them as Linguistic Linked Open Data. Since lexemes were first established in Wikidata in 2018, this data has grown considerably in size. Links between lexemes in different languages can be made, e.g., through a derivation property or senses. We present some descriptive statistics about the lexemes of Wikidata, focusing on the multilingual aspects, and show that there are still relatively few multilingual links.

bib (full) Proceedings of the Second Workshop on Linguistic and Neurocognitive Resources

pdf bib
Proceedings of the Second Workshop on Linguistic and Neurocognitive Resources
Emmanuele Chersoni | Barry Devereux | Chu-Ren Huang

pdf bib
Extrapolating Binder Style Word Embeddings to New Words
Jacob Turton | David Vinson | Robert Smith

Word embeddings such as Word2Vec not only uniquely identify words but also encode important semantic information about them. However, as single entities they are difficult to interpret and their individual dimensions do not have obvious meanings. A more intuitive and interpretable feature space based on neural representations of words was presented by Binder and colleagues (2016) but is only available for a very limited vocabulary. Previous research (Utsumi, 2018) indicates that Binder features can be predicted for words from their embedding vectors (such as Word2Vec), but that work only looked at the original Binder vocabulary. This paper aimed to demonstrate that Binder features can effectively be predicted for a large number of new words and that the predicted values are sensible. The results supported this, showing that correlations between predicted feature values were consistent with those in the original Binder dataset. Additionally, vectors of predicted values performed comparably to established embedding models in tests of word-pair semantic similarity. Being able to predict Binder feature space vectors for any number of new words opens up many uses not possible with the original vocabulary size.
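As a rough illustration of the regression setup described (embedding in, interpretable feature rating out), the following sketch fits a linear map on a few invented 2-dimensional "embeddings" and applies it to a held-out vector. The words, vectors, ratings, and training details are all assumptions for illustration, not the paper's actual models or data.

```python
# Toy sketch: learn a linear map from embedding dimensions to one
# interpretable feature rating, then apply it to an unseen embedding.
def fit_linear(X, y, lr=0.3, steps=5000):
    """Plain least-squares linear regression via batch gradient descent."""
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) + b - yi
            for j in range(d):
                gw[j] += err * xi[j] / n
            gb += err / n
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

# Invented 2-d "embeddings" and an invented "visual" rating per training word.
emb = {"apple": [0.9, 0.1], "thunder": [0.1, 0.9], "sunset": [0.8, 0.3], "echo": [0.2, 0.8]}
visual = {"apple": 5.6, "thunder": 1.2, "sunset": 5.1, "echo": 1.8}
words = list(emb)
w, b = fit_linear([emb[t] for t in words], [visual[t] for t in words])

def predict(vec):
    return sum(wj * xj for wj, xj in zip(w, vec)) + b

print(round(predict([0.85, 0.2]), 2))  # predicted "visual" rating for an unseen vector
```

The point of the sketch is only the extrapolation step: once the map is fit on the original vocabulary, any word with an embedding gets a predicted feature value.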

pdf bib
Towards the First Dyslexic Font in Russian
Svetlana Alexeeva | Aleksandra Dobrego | Vladislav Zubov

Texts comprise a large part of the visual information that we process every day, so one of the tasks of language science is to make them more accessible. However, the text design process often focuses on font size rather than typeface, which might be crucial especially for people with reading disabilities. The current paper presents a study on text accessibility and the first attempt to create a research-based accessible font for Cyrillic letters. This resulted in a dyslexic-specific font, LexiaD. Its design rests on the reduction of inter-letter similarity of the Russian alphabet. In the evaluation stage, dyslexic and non-dyslexic children were asked to read sentences from the Children version of the Russian Sentence Corpus. We tested the readability of LexiaD compared to the PT Sans and PT Serif fonts. The results showed that all children had some advantage in letter feature extraction and information integration while reading in LexiaD, but lexical access was improved when sentences were rendered in PT Sans or PT Serif. Therefore, in several aspects, LexiaD proved to be faster to read and can be recommended for dyslexic readers who have a visual deficiency or who struggle with text understanding that results in re-reading.

pdf bib
The Little Prince in 26 Languages: Towards a Multilingual Neuro-Cognitive Corpus
Sabrina Stehwien | Lena Henke | John Hale | Jonathan Brennan | Lars Meyer

We present the Le Petit Prince Corpus (LPPC), a multi-lingual resource for research in (computational) psycho- and neurolinguistics. The corpus consists of the children’s story The Little Prince in 26 languages. The dataset is in the process of being built using state-of-the-art methods for speech and language processing and electroencephalography (EEG). The planned release of LPPC dataset will include raw text annotated with dependency graphs in the Universal Dependencies standard, a near-natural-sounding synthetic spoken subset as well as EEG recordings. We will use this corpus for conducting neurolinguistic studies that generalize across a wide range of languages, overcoming typological constraints to traditional approaches. The planned release of the LPPC combines linguistic and EEG data for many languages using fully automatic methods, and thus constitutes a readily extendable resource that supports cross-linguistic and cross-disciplinary research.

pdf bib
Sensorimotor Norms for 506 Russian Nouns
Alex Miklashevsky

Embodied cognitive science has suggested a number of variables describing our sensorimotor experience associated with different concepts: modality experience rating (i.e., the relationship between words and images of a particular perceptive modality: visual, auditory, haptic, etc.), manipulability (the necessity for an object to interact with human hands in order to perform its function), and vertical spatial localization. According to embodied cognition theory, these semantic variables capture our mental representations and thus should influence word learning, processing and production. However, it is not clear how these new variables are related to such traditional variables as imageability, age of acquisition (AoA) and word frequency. In the presented database, normative data on the modality (visual, auditory, haptic, olfactory, and gustatory) ratings, vertical spatial localization of the object, manipulability, imageability, age of acquisition, and subjective frequency for 506 Russian nouns are collected. Factor analysis revealed four factors: (1) visual and haptic modality ratings combined with imageability, manipulability and AoA; (2) word length, frequency and AoA; (3) olfactory modality united with gustatory; (4) spatial localization alone. The database is available online together with a publication describing the method of data collection and the data parameters (Miklashevsky, 2018).

bib (full) Proceedings of the Workshop about Language Resources for the SSH Cloud

pdf bib
Proceedings of the Workshop about Language Resources for the SSH Cloud
Daan Broeder | Maria Eskevich | Monica Monachini

pdf bib
Mining Wages in Nineteenth-Century Job Advertisements. The Application of Language Resources and Language Technology to study Economic and Social Inequality
Ruben Ros | Marieke van Erp | Auke Rijpma | Richard Zijdeman

For the analysis of historical wage development, no structured data is available. Job advertisements, as found in newspapers, can provide insights into what different types of jobs paid, but require language technology to structure them in a format conducive to quantitative analysis. In this paper, we report on our experiments to mine wages from 19th-century newspaper advertisements and detail the challenges that need to be overcome to perform a socio-economic analysis of textual data sources.

pdf bib
EOSC as a game-changer in the Social Sciences and Humanities research activities
Donatella Castelli

This paper aims to give some insights on how the European Open Science Cloud (EOSC) will be able to influence the Social Sciences and Humanities (SSH) sector, thus paving the way towards innovation. Points of discussion are provided on how the LRs and RIs community can contribute to this transformation of research practice.

pdf bib
Crossing the SSH Bridge with Interview Data
Henk van den Heuvel

Spoken audio data, such as interview data, is a scientific instrument used by researchers in various disciplines crossing the boundaries of social sciences and humanities. In this paper, we take a closer look at a portal designed to perform speech-to-text conversion on audio recordings through Automatic Speech Recognition (ASR) in the CLARIN infrastructure. Within the cross-domain cluster EU project SSHOC, the potential value of such a linguistic tool kit for processing spoken language recordings has found uptake in a webinar about the topic, and in a task addressing audio analysis of panel survey data. The objective of this contribution is to show that the processing of interviews as a research instrument has opened up a fascinating and fruitful area of collaboration between Social Sciences and Humanities (SSH).

bib (full) Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov)

pdf bib
Proceedings of the 1st Workshop on Language Technologies for Government and Public Administration (LT4Gov)
Doaa Samy | David Pérez-Fernández | Jerónimo Arenas-García

pdf bib
Development of Natural Language Processing Tools to Support Determination of Federal Disability Benefits in the U.S.
Bart Desmet | Julia Porcino | Ayah Zirikly | Denis Newman-Griffis | Guy Divita | Elizabeth Rasch

The disability benefits programs administered by the US Social Security Administration (SSA) receive between 2 and 3 million new applications each year. Adjudicators manually review hundreds of evidence pages per case to determine eligibility based on financial, medical, and functional criteria. Natural Language Processing (NLP) technology is uniquely suited to support this adjudication work and is a critical component of an ongoing inter-agency collaboration between SSA and the National Institutes of Health. This NLP work provides resources and models for document ranking, named entity recognition, and terminology extraction in order to automatically identify documents and reports pertinent to a case, and to allow adjudicators to search for and locate desired information quickly. In this paper, we describe our vision for how NLP can impact SSA’s adjudication process, present the resources and models that have been developed, and discuss some of the benefits and challenges in working with large-scale government data, and its specific properties in the functional domain.

bib (full) Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

pdf bib
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages
Rachele Sprugnoli | Marco Passarotti

pdf bib
Using LatInfLexi for an Entropy-Based Assessment of Predictability in Latin Inflection
Matteo Pellegrini

This paper presents LatInfLexi, a large inflected lexicon of Latin providing information on all the inflected wordforms of 3,348 verbs and 1,038 nouns. After a description of the structure of the resource and some data on its size, the procedure followed to obtain the lexicon from the database of the Lemlat 3.0 morphological analyzer is detailed, as well as the choices made regarding overabundant and defective cells. The way in which the data of LatInfLexi can be exploited in order to perform a quantitative assessment of predictability in Latin verb inflection is then illustrated: results obtained by computing the conditional entropy of guessing the content of a paradigm cell assuming knowledge of one wordform or multiple wordforms are presented in turn, highlighting the descriptive and theoretical relevance of the analysis. Lastly, the paper envisages the advantages of an inclusion of LatInfLexi into the LiLa knowledge base, both for the presented resource and for the knowledge base itself.
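The core quantity described here, the conditional entropy of guessing one paradigm cell given another, can be sketched on a toy paradigm table. The endings and counts below are invented for illustration and do not come from LatInfLexi; only the formula H(B|A) is standard.

```python
# Conditional entropy H(B|A): the remaining uncertainty about the form in
# paradigm cell B once the form in cell A is known, estimated from counts.
import math
from collections import Counter

def conditional_entropy(pairs):
    """H(B|A) in bits, from (pattern_in_A, pattern_in_B) observations."""
    joint = Counter(pairs)
    marg_a = Counter(a for a, _ in pairs)
    n = len(pairs)
    h = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        p_b_given_a = count / marg_a[a]
        h -= p_ab * math.log2(p_b_given_a)
    return h

# Toy data: each lexeme's (infinitive ending, 1sg ending). Only "-ire"
# infinitives are ambiguous (two possible 1sg endings), so H(B|A) > 0.
lexemes = [("-are", "-o"), ("-are", "-o"), ("-ere", "-o"),
           ("-ire", "-isco"), ("-ire", "-o")]
print(round(conditional_entropy(lexemes), 3))  # 0.4
```

A cell that fully determines another yields H(B|A) = 0; higher values mean the known wordform is a poorer predictor of that cell.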

pdf bib
A Tool for Facilitating OCR Postediting in Historical Documents
Alberto Poncelas | Mohammad Aboomar | Jan Buts | James Hadley | Andy Way

Optical character recognition (OCR) for historical documents is a complex procedure subject to a unique set of material issues, including inconsistencies in typefaces and low-quality scanning. Consequently, even the most sophisticated OCR engines produce errors. This paper reports on a tool built for postediting the output of Tesseract, more specifically for correcting common errors in digitized historical documents. The proposed tool suggests alternatives for word forms not found in a specified vocabulary. The assumed error is replaced by a presumably correct alternative in the post-edition based on the scores of a Language Model (LM). The tool is tested on a chapter of the book An Essay Towards Regulating the Trade and Employing the Poor of this Kingdom (Cary, 1719). As demonstrated below, the tool is successful in correcting a number of common errors. Though sometimes unreliable, it is also transparent and subject to human intervention.

pdf bib
A Thesaurus for Biblical Hebrew
Miriam Azar | Aliza Pahmer | Joshua Waxman

We built a thesaurus for Biblical Hebrew, with connections between roots based on phonetic, semantic, and distributional similarity. To this end, we apply established algorithms to find connections between headwords based on existing lexicons and other digital resources. For semantic similarity, we utilize the cosine similarity of tf-idf vectors of the English gloss text of Hebrew headwords from Ernest Klein's A Comprehensive Etymological Dictionary of the Hebrew Language for Readers of English as well as the Brown-Driver-Briggs Hebrew Lexicon. For phonetic similarity, we digitize part of Matityahu Clark's Etymological Dictionary of Biblical Hebrew, grouping Hebrew roots into phonemic classes, and establish phonetic relationships between headwords in Klein's Dictionary. For distributional similarity, we consider the cosine similarity of PPMI vectors of Hebrew roots and also, in a somewhat novel approach, apply Word2Vec to a Biblical corpus reduced to its lexemes. The resulting resource is helpful to those trying to understand Biblical Hebrew, and also stands as a good basis for programs trying to process the Biblical text.
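The gloss-based semantic-similarity step can be sketched with a small self-contained tf-idf/cosine computation. The glosses below and the idf smoothing choice (idf = ln(n/df) + 1) are illustrative assumptions, not the authors' exact preprocessing or weighting.

```python
# tf-idf vectors over English gloss text, compared by cosine similarity:
# headwords with overlapping, distinctive gloss terms score higher.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Sparse tf-idf vectors with smoothed idf (idf = ln(n/df) + 1)."""
    n = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    return [{w: tf * (math.log(n / df[w]) + 1) for w, tf in Counter(toks).items()}
            for toks in tokenized]

def cosine(u, v):
    dot = sum(x * v[w] for w, x in u.items() if w in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Invented glosses: the first two share distinctive terms, the third does not.
glosses = ["to guard watch and protect", "to watch over and guard", "to eat bread or food"]
vecs = tfidf_vectors(glosses)
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```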

pdf bib
Comparing Statistical and Neural Models for Learning Sound Correspondences
Clémentine Fourrier | Benoît Sagot

Cognate prediction and proto-form reconstruction are key tasks in computational historical linguistics that rely on the study of sound change regularity. Solving these tasks appears to be very similar to machine translation, though methods from that field have barely been applied to historical linguistics. Therefore, in this paper, we investigate the learnability of sound correspondences between a proto-language and daughter languages for two machine-translation-inspired models, one statistical, the other neural. We first carry out our experiments on plausible artificial languages, without noise, in order to study the role of each parameter on the algorithms' respective performance under almost perfect conditions. We then study real languages, namely Latin, Italian and Spanish, to see if those performances generalise well. We show that both model types manage to learn sound changes despite data scarcity, although the best-performing model type depends on several parameters such as the size of the training data, the ambiguity, and the prediction direction.

pdf bib
Latin-Spanish Neural Machine Translation: from the Bible to Saint Augustine
Eva Martínez Garcia | Álvaro García Tejedor

Although there are several sources where historical texts can be found, they are usually available only in the original language, which makes them generally inaccessible. This paper presents the development of state-of-the-art Neural Machine Translation systems for the low-resourced Latin-Spanish language pair. First, we build a Transformer-based Machine Translation system on the Bible parallel corpus. Then, we build a comparable corpus from Saint Augustine texts and their translations. We use this corpus to study the domain adaptation case from the Bible texts to Saint Augustine's works. Results show the difficulties of handling a low-resourced language such as Latin. First, we noticed the importance of having enough data, since the systems do not achieve high BLEU scores. Regarding domain adaptation, results show how using in-domain data helps systems achieve a better quality translation. We also observed that a larger amount of data is needed to perform an effective vocabulary extension that includes in-domain vocabulary.

pdf bib
A Gradient Boosting-Seq2Seq System for Latin POS Tagging and Lemmatization
Giuseppe G. A. Celano

The paper presents the system used in the EvaLatin shared task to POS tag and lemmatize Latin. It consists of two components. A gradient boosting machine (LightGBM) is used for POS tagging, mainly fed with pre-computed word embeddings of a window of seven contiguous tokens (the token at hand plus the three preceding and the three following ones) per target feature value. Word embeddings are trained on the texts of the Perseus Digital Library, Patrologia Latina, and Biblioteca Digitale di Testi Tardo Antichi, which together comprise a large number of texts of different genres from the Classical Age to Late Antiquity. Word forms plus the predicted POS labels are used to feed a seq2seq algorithm implemented in Keras to predict lemmas. The final shared-task accuracies measured for Classical Latin texts are in line with state-of-the-art POS taggers (0.96) and lemmatizers (0.95).

bib (full) Proceedings of the LREC 2020 Workshop on Multimodal Wordnets (MMW2020)

pdf bib
Proceedings of the LREC 2020 Workshop on Multimodal Wordnets (MMW2020)
Thierry Declerck | Itziar Gonzalez-Dios | German Rigau

pdf bib
English WordNet 2020: Improving and Extending a WordNet for English using an Open-Source Methodology
John Philip McCrae | Alexandre Rademaker | Ewa Rudnicka | Francis Bond

WordNet, while one of the most widely used resources for NLP, has not been updated for a long time, and as such a new project, English WordNet, has arisen to continue the development of the model under an open-source paradigm. In this paper, we detail the second release of this resource, entitled English WordNet 2020. The work has focused firstly on the introduction of new synsets and senses and the development of guidelines for this, and secondly on the integration of contributions from other projects. We present the changes in this edition, which total over 15,000 changes over the previous release.

pdf bib
Adding Pronunciation Information to Wordnets
Thierry Declerck | Lenka Bajcetic | Melanie Siegel

We describe ongoing work consisting in adding pronunciation information to wordnets, as such information can indicate specific senses of a word. Many wordnets associate with their senses only a lemma form and a part-of-speech tag. At the same time, we are aware that additional linguistic information can be useful for identifying a specific sense of a wordnet lemma when encountered in a corpus. While previous work already deals with the addition of grammatical number or grammatical gender information to wordnet lemmas, we are investigating the linking of wordnet lemmas to pronunciation information, thus adding a speech-related modality to wordnets.

bib (full) Proceedings of the LREC 2020 Workshop on Multilingual Biomedical Text Processing (MultilingualBIO 2020)

pdf bib
Proceedings of the LREC 2020 Workshop on Multilingual Biomedical Text Processing (MultilingualBIO 2020)
Maite Melero

pdf bib
Detecting Adverse Drug Events from Swedish Electronic Health Records using Text Mining
Maria Bampa | Hercules Dalianis

Electronic Health Records (EHRs) are a valuable source of patient information which can be leveraged to detect Adverse Drug Events (ADEs) and aid post-market drug surveillance. The overall aim of this study is to scrutinize text written by clinicians in EHRs and build a model for ADE detection that produces medically relevant predictions. Natural Language Processing techniques are exploited to create important predictors and incorporate them into the learning process. The study focuses on the 5 most frequent ADE cases found in a Swedish electronic patient record corpus. The results indicate that considering textual features, rather than only the structured ones, can improve classification performance by 15% in some ADE cases. Additionally, variable patient history lengths are incorporated in the models, demonstrating the importance of this decision over using an arbitrary history length. The experimental findings suggest that the clinical text in EHRs includes information that can capture data beyond what is found in a structured format.

pdf bib
Localising the Clinical Terminology SNOMED CT by Semi-automated Creation of a German Interface Vocabulary
Stefan Schulz | Larissa Hammer | David Hashemian-Nik | Markus Kreuzthaler

Medical language exhibits great variations regarding users, institutions and language registers. With large parts of clinical documents in free text, NLP is playing a more and more important role in unlocking re-usable and interoperable meaning from medical records. This study describes the architectural principles and the evolution of a German interface vocabulary, combining machine translation with human annotation and rule-based term generation, yielding a resource with 7.7 million raw entries, each of which linked to the reference terminology SNOMED CT, an international standard with about 350 thousand concepts. The purpose is to offer a high coverage of medical jargon in order to optimise terminology grounding of clinical texts by text mining systems. The core resource is a manually curated table of English-to-German word and chunk translations, supported by a set of language generation rules. The work describes a workflow consisting the enrichment and modification of this table with human and machine efforts, manually enriched by grammarspecific tags. Top-down and bottom-up methods for terminology population used in parallel. The final interface terms are produced by a term generator, which creates one-to-many German variants per SNOMED CT English description. Filtering against a large collection of domain terminologies and corpora drastically reduces the size of the vocabulary in favour of more realistic terms or terms that can reasonably be expected to match clinical text passages within a text-mining pipeline. An evaluation was performed by a comparison between the current version of the German interface vocabulary and the English description table of the SNOMED CT International release. An exact term matching was performed with a small parallel corpus constituted by text snippets from different clinical documents. 
With overall low retrieval parameters (F-values around 30 %), the performance of the German language scenario reaches 80-90 % of the English one. Interestingly, annotations are slightly better with machine-translated (German-to-English) texts, using the International SNOMED CT resource only.
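The generate-then-filter pipeline described above can be sketched as follows. This is a toy illustration, not the authors' code: the translation table, the space-joined variant form, and the attested-term set are all hypothetical.

```python
from itertools import product

def generate_variants(description, translation_table):
    """Toy one-to-many term generation: expand each English chunk of a
    description into its German translation variants (hypothetical table)
    and emit every combination."""
    options = [translation_table.get(chunk, [chunk]) for chunk in description.split()]
    return [" ".join(combo) for combo in product(*options)]

def filter_variants(variants, attested_terms):
    """Keep only variants attested in domain terminologies or corpora,
    mirroring the corpus-based filtering step described above."""
    return [v for v in variants if v in attested_terms]
```

Filtering against attested terms is what keeps the cartesian expansion from flooding the vocabulary with implausible variants.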

pdf bib
Multilingual enrichment of disease biomedical ontologies
Léo Bouscarrat | Antoine Bonnefoy | Cécile Capponi | Carlos Ramisch

Translating biomedical ontologies is an important challenge, but doing it manually requires much time and money. We study the possibility of using open-source knowledge bases to translate biomedical ontologies. We focus on two aspects: coverage and quality. We look at the coverage of two biomedical ontologies focusing on diseases with respect to Wikidata for 9 European languages (Czech, Dutch, English, French, German, Italian, Polish, Portuguese and Spanish) for both ontologies, plus Arabic, Chinese and Russian for the second. We first use direct links between Wikidata and the studied ontologies, and then second-order links obtained by going through other intermediate ontologies. We then compare the quality of the translations obtained from Wikidata with those of a commercial machine translation tool, Google Cloud Translation.
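The two coverage measures (direct links and second-order links through an intermediate ontology) can be sketched as follows; every identifier shown is illustrative, not real Wikidata or ontology data.

```python
def coverage(ontology_ids, direct_links, second_order_links=None):
    """Fraction of ontology concepts reachable in Wikidata.
    direct_links maps ontology ids straight to Wikidata items;
    second_order_links is an optional (bridge, wikidata) pair, where bridge
    maps ontology ids to an intermediate ontology and wikidata maps
    intermediate ids to Wikidata items."""
    reachable = {c for c in ontology_ids if c in direct_links}
    if second_order_links:
        bridge, wikidata = second_order_links
        reachable |= {c for c in ontology_ids if bridge.get(c) in wikidata}
    return len(reachable) / len(ontology_ids)
```

Second-order links can only increase coverage, which is why the paper measures them separately from the direct mappings.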

pdf bib
Automated Processing of Multilingual Online News for the Monitoring of Animal Infectious Diseases
Sarah Valentin | Renaud Lancelot | Mathieu Roche

The Platform for Automated extraction of animal Disease Information from the web (PADI-web) is an automated system that monitors the web to detect emerging animal infectious diseases. The tool automatically collects news via customised multilingual queries, classifies the articles and extracts epidemiological information. We detail the processing of multilingual online sources by PADI-web and analyse the translated outputs in a case study.

up

pdf (full)
bib (full)
Proceedings of the Fourth Workshop on Neural Generation and Translation

pdf bib
Proceedings of the Fourth Workshop on Neural Generation and Translation
Alexandra Birch | Andrew Finch | Hiroaki Hayashi | Kenneth Heafield | Marcin Junczys-Dowmunt | Ioannis Konstas | Xian Li | Graham Neubig | Yusuke Oda

pdf bib
Learning to Generate Multiple Style Transfer Outputs for an Input Sentence
Kevin Lin | Ming-Yu Liu | Ming-Ting Sun | Jan Kautz

Text style transfer refers to the task of rephrasing a given text in a different style. While various methods have been proposed to advance the state of the art, they often assume the transfer output follows a delta distribution, and thus their models cannot generate different style transfer results for a given input text. To address this limitation, we propose a one-to-many text style transfer framework. In contrast to prior works that learn a one-to-one mapping that converts an input sentence to one output sentence, our approach learns a one-to-many mapping that can convert an input sentence to multiple different output sentences, while preserving the input content. This is achieved by applying adversarial training with a latent decomposition scheme. Specifically, we decompose the latent representation of the input sentence into a style code that captures the language style variation and a content code that encodes the language style-independent content. We then combine the content code with the style code to generate a style transfer output. By combining the same content code with a different style code, we generate a different style transfer output. Extensive experimental results, with comparisons to several text style transfer approaches on multiple public datasets using a diverse set of performance metrics, validate the effectiveness of the proposed approach.

pdf bib
Automatically Ranked Russian Paraphrase Corpus for Text Generation
Vadim Gudkov | Olga Mitrofanova | Elizaveta Filippskikh

The article focuses on the automatic development and ranking of a large corpus for Russian paraphrase generation, which proves to be the first corpus of its type in Russian computational linguistics. Existing manually annotated paraphrase datasets for Russian are limited to the small-sized ParaPhraser corpus and ParaPlag, which are suitable for a set of NLP tasks such as paraphrase and plagiarism detection, sentence similarity and relatedness estimation, etc. Due to size restrictions, these datasets can hardly be applied in end-to-end text generation solutions. Meanwhile, paraphrase generation requires a large amount of training data. In our study we propose a solution to the problem: we collect, rank and evaluate a new publicly available headline paraphrase corpus (ParaPhraser Plus), and then perform text generation experiments with manual evaluation on automatically ranked corpora using the Universal Transformer architecture.

pdf bib
Distill, Adapt, Distill: Training Small, In-Domain Models for Neural Machine Translation
Mitchell Gordon | Kevin Duh

We explore best practices for training small, memory-efficient machine translation models with sequence-level knowledge distillation in the domain adaptation setting. While both domain adaptation and knowledge distillation are widely used, their interaction remains little understood. Our large-scale empirical results in machine translation (on three language pairs with three domains each) suggest distilling twice for best performance: once using general-domain data and again using in-domain data with an adapted teacher.

pdf bib
The ADAPT System Description for the STAPLE 2020 English-to-Portuguese Translation Task
Rejwanul Haque | Yasmin Moslem | Andy Way

This paper describes the ADAPT Centre’s submission to STAPLE (Simultaneous Translation and Paraphrase for Language Education) 2020, a shared task of the 4th Workshop on Neural Generation and Translation (WNGT), for the English-to-Portuguese translation task. In this shared task, the participants were asked to produce high-coverage sets of plausible translations given English prompts (input source sentences). We present our English-to-Portuguese machine translation (MT) models, which were built by applying various strategies, e.g., data and sentence selection, monolingual MT for generating alternative translations, and combining multiple n-best translations. Our experiments show that adding the aforementioned techniques to the baseline yields excellent performance in the English-to-Portuguese translation task.

pdf bib
Efficient and High-Quality Neural Machine Translation with OpenNMT
Guillaume Klein | Dakun Zhang | Clément Chouteau | Josep Crego | Jean Senellart

This paper describes the OpenNMT submissions to the WNGT 2020 efficiency shared task. We explore training and acceleration of Transformer models with various sizes that are trained in a teacher-student setup. We also present a custom and optimized C++ inference engine that enables fast CPU and GPU decoding with few dependencies. By combining additional optimizations and parallelization techniques, we create small, efficient, and high-quality neural machine translation models.

pdf bib
Edinburgh’s Submissions to the 2020 Machine Translation Efficiency Task
Nikolay Bogoychev | Roman Grundkiewicz | Alham Fikri Aji | Maximiliana Behnke | Kenneth Heafield | Sidharth Kashyap | Emmanouil-Ioannis Farsarakis | Mateusz Chudyk

We participated in all tracks of the Workshop on Neural Generation and Translation 2020 Efficiency Shared Task: single-core CPU, multi-core CPU, and GPU. At the model level, we use teacher-student training with a variety of student sizes, tie embeddings and sometimes layers, use the Simpler Simple Recurrent Unit, and introduce head pruning. On GPUs, we used 16-bit floating-point tensor cores. On CPUs, we customized 8-bit quantization and multiple processes with affinity for the multi-core setting. To reduce model size, we experimented with 4-bit log quantization but use floats at runtime. In the shared task, most of our submissions were Pareto optimal with respect to the trade-off between time and quality.
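The 4-bit log quantization mentioned above can be sketched as follows. This is a toy illustration under assumed details (weights rounded to signed powers of two, with the exponent stored as an offset from the largest magnitude), not the submission's actual implementation.

```python
import math

def log_quantize(weights, bits=4):
    """Toy log quantization: each nonzero weight becomes a sign plus an
    exponent offset from the largest magnitude, i.e. it is rounded to the
    nearest signed power of two representable in `bits` bits."""
    max_abs = max(abs(w) for w in weights if w != 0.0)
    max_exp = math.floor(math.log2(max_abs))
    max_offset = 2 ** (bits - 1) - 1  # offsets representable per sign
    codes = []
    for w in weights:
        if w == 0.0:
            codes.append(None)  # zeros kept out of band in this toy version
        else:
            offset = max(0, min(max_exp - round(math.log2(abs(w))), max_offset))
            codes.append((1 if w > 0 else -1, offset))
    return codes, max_exp

def dequantize(codes, max_exp):
    """Reconstruct float weights, matching the 'floats at runtime' setup."""
    return [0.0 if c is None else c[0] * 2.0 ** (max_exp - c[1]) for c in codes]
```

Weights that already are powers of two survive the round trip exactly; everything else is snapped to the nearest power of two within the representable range.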

pdf bib
Improving Document-Level Neural Machine Translation with Domain Adaptation
Sami Ul Haq | Sadaf Abdul Rauf | Arslan Shoukat | Noor-e-Hira

Recent studies have shown that the translation quality of NMT systems can be improved by providing document-level contextual information. In general, sentence-based NMT models are extended to capture contextual information from large-scale document-level corpora, which are difficult to acquire. Domain adaptation, on the other hand, promises adapting components of already developed systems by exploiting limited in-domain data. This paper presents FJWU’s system submission at WNGT; we specifically participated in the document-level MT task for German-English translation. Our system is based on a context-aware Transformer model developed on top of the original NMT architecture by integrating contextual information using attention networks. Our experimental results show that providing previous sentences as context significantly improves the BLEU score as compared to a strong NMT baseline. We also studied the impact of domain adaptation on document-level translation and were able to improve results by adapting the systems according to the testing domain.
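The context-aware setup described above pairs each source sentence with its preceding sentence(s). A minimal preprocessing sketch, where the separator token and window size are assumptions rather than the paper's configuration:

```python
def add_context(document, context_size=1, sep=" <ctx> "):
    """Pair each source sentence with its previous sentence(s) so a
    context-aware encoder can attend over document context.
    Returns (context, sentence) training pairs; the <ctx> separator is a
    hypothetical choice."""
    examples = []
    for i, sentence in enumerate(document):
        context = document[max(0, i - context_size):i]
        examples.append((sep.join(context), sentence))
    return examples
```

The first sentence of a document gets an empty context, which is why document boundaries must be preserved during corpus preparation.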

up

pdf (full)
bib (full)
Proceedings of the First Workshop on Natural Language Interfaces

pdf bib
Proceedings of the First Workshop on Natural Language Interfaces
Ahmed Hassan Awadallah | Yu Su | Huan Sun | Scott Wen-tau Yih

pdf bib
Answering Complex Questions by Combining Information from Curated and Extracted Knowledge Bases
Nikita Bhutani | Xinyi Zheng | Kun Qian | Yunyao Li | H. Jagadish

Knowledge-based question answering (KB-QA) has long focused on simple questions that can be answered from a single knowledge source, either a manually curated or an automatically extracted KB. In this work, we look at answering complex questions, which often require combining information from multiple sources. We present a novel KB-QA system, Multique, which can map a complex question to a complex query pattern using a sequence of simple queries, each targeted at a specific KB. It finds simple queries using a neural-network based model capable of collective inference over textual relations in the extracted KB and ontological relations in the curated KB. Experiments show that our proposed system outperforms previous KB-QA systems on the benchmark datasets ComplexWebQuestions and WebQuestionsSP.

pdf bib
Towards Reversal-Based Textual Data Augmentation for NLI Problems with Opposable Classes
Alexey Tarasov

Data augmentation methods are commonly used in computer vision and speech. However, in domains dealing with textual data, such techniques are not that common. Most of the existing methods rely on rephrasing, i.e., new sentences are generated by changing a source sentence while preserving its meaning. We argue that in tasks with opposable classes (such as Positive and Negative in sentiment analysis), it might be beneficial to also invert the source sentence, reversing its meaning, to generate examples of the opposing class. Methods that use somewhat similar intuition exist in the space of adversarial learning, but are not always applicable to text classification (in our experiments, some of them were even detrimental to the resulting classifier accuracy). We propose and evaluate two reversal-based methods on an NLI task of recognising the type of a simple logical expression from its description in plain-text form. After gathering a dataset on MTurk, we show that a simple heuristic based on negating the main verb has potential not only on its own, but also in boosting existing state-of-the-art rephrasing-based approaches.
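A crude sketch of a negate-the-main-verb heuristic in the spirit of the one described above; the auxiliary list and insertion rule are simplifying assumptions for illustration only, since a real system would need parsing and do-support.

```python
AUXILIARIES = {"is", "are", "was", "were", "can", "will", "does", "do"}

def negate(sentence):
    """Toy reversal heuristic: insert 'not' after the first auxiliary verb,
    producing an example of the opposing class. Returns None when the
    heuristic does not apply (no auxiliary found)."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in AUXILIARIES:
            return " ".join(tokens[:i + 1] + ["not"] + tokens[i + 1:])
    return None
```

Returning None for unhandled sentences matters: a reversal heuristic that silently produces a non-negated sentence would inject mislabeled examples into the opposing class.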

up

pdf (full)
bib (full)
Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI

pdf bib
Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI
Tsung-Hsien Wen | Asli Celikyilmaz | Zhou Yu | Alexandros Papangelis | Mihail Eric | Anuj Kumar | Iñigo Casanueva | Rushin Shah

pdf bib
How to Tame Your Data: Data Augmentation for Dialog State Tracking
Adam Summerville | Jordan Hashemi | James Ryan | William Ferguson

Dialog State Tracking (DST) is a problem space in which the effective vocabulary is practically limitless. For example, the domain of possible movie titles or restaurant names is bound only by the limits of language. As such, DST systems often encounter out-of-vocabulary words at inference time that were never encountered during training. To combat this issue, we present a targeted data augmentation process, by which a practitioner observes the types of errors made on held-out evaluation data, and then modifies the training data with additional corpora to increase the vocabulary size at training time. Using this with a RoBERTa-based Transformer architecture, we achieve state-of-the-art results in comparison to systems that only mask trouble slots with special tokens. Additionally, we present a data-representation scheme for seamlessly retargeting DST architectures to new domains.

pdf bib
Efficient Intent Detection with Dual Sentence Encoders
Iñigo Casanueva | Tadas Temčinas | Daniela Gerz | Matthew Henderson | Ivan Vulić

Building conversational systems in new domains and with added functionality requires resource-efficient models that work under low-data regimes (i.e., in few-shot setups). Motivated by these requirements, we introduce intent detection methods backed by pretrained dual sentence encoders such as USE and ConveRT. We demonstrate the usefulness and wide applicability of the proposed intent detectors, showing that: 1) they outperform intent detectors based on fine-tuning the full BERT-Large model or using BERT as a fixed black-box encoder on three diverse intent detection data sets; 2) the gains are especially pronounced in few-shot setups (i.e., with only 10 or 30 annotated examples per intent); 3) our intent detectors can be trained in a matter of minutes on a single CPU; and 4) they are stable across different hyperparameter settings. In the hope of facilitating and democratizing research focused on intent detection, we release our code, as well as a new challenging single-domain intent detection dataset comprising 13,083 annotated examples over 77 intents.
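One way to picture few-shot intent detection on top of fixed sentence encoders is the sketch below. The bag-of-words encoder is a stand-in for a pretrained encoder like USE or ConveRT, and the nearest-centroid head is an assumed simple classifier, not the paper's model.

```python
import math

class CentroidIntentDetector:
    """Few-shot intent detection on top of fixed embeddings: embed each
    intent's example sentences, average them into a centroid, and classify a
    new sentence by the highest dot product with a centroid."""

    def _encode(self, sentence):
        # Stand-in for a pretrained sentence encoder: L2-normalised
        # bag-of-words over the training vocabulary.
        vec = [0.0] * len(self.vocab)
        for token in sentence.lower().split():
            if token in self.vocab:
                vec[self.vocab[token]] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]

    def fit(self, examples_by_intent):
        tokens = {t for sents in examples_by_intent.values()
                  for s in sents for t in s.lower().split()}
        self.vocab = {t: i for i, t in enumerate(sorted(tokens))}
        self.centroids = {}
        for intent, sents in examples_by_intent.items():
            vecs = [self._encode(s) for s in sents]
            self.centroids[intent] = [sum(col) / len(vecs) for col in zip(*vecs)]
        return self

    def predict(self, sentence):
        v = self._encode(sentence)
        return max(self.centroids,
                   key=lambda i: sum(a * b for a, b in zip(v, self.centroids[i])))
```

Because the encoder is frozen, "training" reduces to averaging a handful of embeddings per intent, which is why such setups fit on a single CPU in minutes.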

pdf bib
Accelerating Natural Language Understanding in Task-Oriented Dialog
Ojas Ahuja | Shrey Desai

Task-oriented dialog models typically leverage complex neural architectures and large-scale, pre-trained Transformers to achieve state-of-the-art performance on popular natural language understanding benchmarks. However, these models frequently have in excess of tens of millions of parameters, making them impossible to deploy on-device, where resource-efficiency is a major concern. In this work, we show that a simple convolutional model compressed with structured pruning achieves largely comparable results to BERT on ATIS and Snips, with under 100K parameters. Moreover, we perform acceleration experiments on CPUs, where we observe that our multi-task model predicts intents and slots nearly 63x faster than even DistilBERT.
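Structured pruning removes whole filters (channels) rather than individual weights, which is what makes the compressed model genuinely smaller and faster at inference. A toy sketch using an assumed L1-magnitude criterion; the paper's exact pruning criterion may differ.

```python
def prune_filters(filters, keep_ratio=0.5):
    """Toy structured pruning: rank whole convolution filters by L1 norm and
    keep only the strongest fraction, preserving their original order.
    Each filter is represented here as a flat list of weights."""
    norms = [sum(abs(w) for w in f) for f in filters]
    k = max(1, int(len(filters) * keep_ratio))
    keep = sorted(sorted(range(len(filters)), key=lambda i: -norms[i])[:k])
    return [filters[i] for i in keep]
```

Dropping entire filters shrinks the following layer's input dimension too, so the saving compounds across the network in a way unstructured (per-weight) sparsity does not.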

pdf bib
Automating Template Creation for Ranking-Based Dialogue Models
Jingxiang Chen | Heba Elfardy | Simi Wang | Andrea Kahn | Jared Kramer

Dialogue response generation models that use template ranking rather than direct sequence generation allow model developers to limit generated responses to pre-approved messages. However, manually creating templates is time-consuming and requires domain expertise. To alleviate this problem, we explore automating the process of creating dialogue templates by using unsupervised methods to cluster historical utterances and selecting representative utterances from each cluster. Specifically, we propose an end-to-end model called Deep Sentence Encoder Clustering (DSEC) that uses an auto-encoder structure to jointly learn the utterance representation and construct template clusters. We compare this method to a random baseline that randomly assigns templates to clusters as well as a strong baseline that performs the sentence encoding and the utterance clustering sequentially. To evaluate the performance of the proposed method, we perform an automatic evaluation with two annotated customer service datasets to assess clustering effectiveness, and a human-in-the-loop experiment using a live customer service application to measure the acceptance rate of the generated templates. DSEC performs best in the automatic evaluation, beats both the sequential and random baselines on most metrics in the human-in-the-loop experiment, and shows promising results when compared to gold/manually created templates.

pdf bib
MultiWOZ 2.2: A Dialogue Dataset with Additional Annotation Corrections and State Tracking Baselines
Xiaoxue Zang | Abhinav Rastogi | Srinivas Sunkara | Raghav Gupta | Jianguo Zhang | Jindong Chen

MultiWOZ is a well-known task-oriented dialogue dataset containing over 10,000 annotated dialogues spanning 8 domains. It is extensively used as a benchmark for dialogue state tracking. However, recent works have reported the presence of substantial noise in the dialogue state annotations. MultiWOZ 2.1 identified and fixed many of these erroneous annotations and user utterances, resulting in an improved version of this dataset. This work introduces MultiWOZ 2.2, which is yet another improved version of this dataset. Firstly, we identify and fix dialogue state annotation errors across 17.3 % of the utterances on top of MultiWOZ 2.1. Secondly, we redefine the ontology by disallowing vocabularies of slots with a large number of possible values (e.g., restaurant name, time of booking). In addition, we introduce slot span annotations for these slots to standardize them across recent models, which previously used custom string matching heuristics to generate them. We also benchmark a few state-of-the-art dialogue state tracking models on the corrected dataset to facilitate comparison for future work. In the end, we discuss best practices for dialogue data collection that can help avoid annotation errors.
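The slot span annotations described above amount to locating each open-vocabulary slot value inside the utterance, replacing per-model string-matching heuristics with one shared rule. A minimal sketch; the case-insensitive matching rule is an assumption.

```python
def slot_span(utterance, value):
    """Return the (start, end) character offsets of a slot value inside an
    utterance, or None if the value does not occur verbatim. Matching is
    case-insensitive; offsets index into the original utterance."""
    start = utterance.lower().find(value.lower())
    return None if start < 0 else (start, start + len(value))
```

Storing character offsets instead of raw strings means every model extracts exactly the same span, which is what makes results comparable across trackers.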

pdf bib
Probing Neural Dialog Models for Conversational Understanding
Abdelrhman Saleh | Tovly Deutsch | Stephen Casper | Yonatan Belinkov | Stuart Shieber

The predominant approach to open-domain dialog generation relies on end-to-end training of neural models on chat datasets. However, this approach provides little insight as to what these models learn (or do not learn) about engaging in dialog. In this study, we analyze the internal representations learned by neural open-domain dialog systems and evaluate the quality of these representations for learning basic conversational skills. Our results suggest that standard open-domain dialog systems struggle with answering questions, inferring contradiction, and determining the topic of conversation, among other tasks. We also find that the dyadic, turn-taking nature of dialog is not fully leveraged by these models. By exploring these limitations, we highlight the need for additional research into architectures and training methods that can better capture high-level information about dialog.

up

pdf (full)
bib (full)
Proceedings of the First Workshop on Natural Language Processing for Medical Conversations

pdf bib
Proceedings of the First Workshop on Natural Language Processing for Medical Conversations
Parminder Bhatia | Steven Lin | Rashmi Gangadharaiah | Byron Wallace | Izhak Shafran | Chaitanya Shivade | Nan Du | Mona Diab

pdf bib
Methods for Extracting Information from Messages from Primary Care Providers to Specialists
Xiyu Ding | Michael Barnett | Ateev Mehrotra | Timothy Miller

Electronic consult (eConsult) systems allow specialists more flexibility to respond to referrals more efficiently, thereby increasing access in under-resourced healthcare settings like safety net systems. Understanding the usage patterns of an eConsult system is an important part of improving specialist efficiency. In this work, we develop and apply classifiers to a dataset of eConsult questions from primary care providers to specialists, classifying the messages by how they were triaged by the specialist office and by the underlying type of clinical question posed by the primary care provider. We show that pre-trained transformer models are strong baselines, with performance improving further from domain-specific training and shared representations.

pdf bib
Towards Understanding ASR Error Correction for Medical Conversations
Anirudh Mani | Shruti Palaskar | Sandeep Konam

Domain adaptation for Automatic Speech Recognition (ASR) error correction via machine translation is a useful technique for improving out-of-domain outputs of pre-trained ASR systems to obtain optimal results for specific in-domain tasks. We use this technique on our dataset of doctor-patient conversations using two off-the-shelf ASR systems: Google ASR (commercial) and the ASPIRE model (open-source). We train a sequence-to-sequence machine translation model and evaluate it on seven specific UMLS Semantic types, including Pharmacological Substance, Sign or Symptom, and Diagnostic Procedure. Lastly, we break down, analyze and discuss the 7 % overall improvement in word error rate for each Semantic type.
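The word error rate behind the reported 7 % improvement is the standard Levenshtein-distance metric over word tokens; the implementation below is the textbook definition, not the authors' tooling.

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: minimum number of word substitutions, deletions, and
    insertions needed to turn the hypothesis into the reference, divided by
    the number of reference words. Computed by dynamic programming."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why relative (percentage) improvements are usually reported.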

pdf bib
On the Utility of Audiovisual Dialog Technologies and Signal Analytics for Real-time Remote Monitoring of Depression Biomarkers
Michael Neumann | Oliver Roessler | David Suendermann-Oeft | Vikram Ramanarayanan

We investigate the utility of audiovisual dialog systems combined with speech and video analytics for real-time remote monitoring of depression at scale in uncontrolled environment settings. We collected audiovisual conversational data from participants who interacted with a cloud-based multimodal dialog system, and automatically extracted a large set of speech and vision metrics based on the rich existing literature of laboratory studies. We report on the efficacy of various audio and video metrics in differentiating people with mild, moderate and severe depression, and discuss the implications of these results for the deployment of such technologies in real-world neurological diagnosis and monitoring applications.

up

pdf (full)
bib (full)
Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events

pdf bib
Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events
Claire Bonial | Tommaso Caselli | Snigdha Chaturvedi | Elizabeth Clark | Ruihong Huang | Mohit Iyyer | Alejandro Jaimes | Heng Ji | Lara J. Martin | Ben Miller | Teruko Mitamura | Nanyun Peng | Joel Tetreault

pdf bib
Improving the Identification of the Discourse Function of News Article Paragraphs
Deya Banisakher | W. Victor Yarlott | Mohammed Aldawsari | Naphtali Rishe | Mark Finlayson

Identifying the discourse structure of documents is an important task in understanding written text. Building on prior work, we demonstrate an improved approach to automatically identifying the discourse function of paragraphs in news articles. We start with the hierarchical theory of news discourse developed by van Dijk (1988), which proposes how paragraphs function within news articles. This discourse information is at a level intermediate between phrase- or sentence-sized discourse segments and document genre, characterizing how individual paragraphs convey information about the events in the storyline of the article. Specifically, the theory categorizes the relationships between narrated events and (1) the overall storyline (such as Main Events, Background, or Consequences) as well as (2) commentary (such as Verbal Reactions and Evaluations). We trained and tested a linear-chain conditional random field (CRF) with new features to model van Dijk’s labels and compared it against several machine learning models presented in previous work. Our model significantly outperformed all baselines and prior approaches, achieving an average F1 score of 0.71, which represents a 31.5 % improvement over the previously best-performing support vector machine model.

pdf bib
Extensively Matching for Few-shot Learning Event Detection
Viet Dac Lai | Thien Huu Nguyen | Franck Dernoncourt

Current event detection models under supervised learning settings fail to transfer to new event types. Few-shot learning has not been explored in event detection, even though it allows a model to perform well with high generalization on new event types. In this work, we formulate event detection as a few-shot learning problem to enable extending event detection to new event types. We propose two novel loss factors that match examples in the support set to provide more training signals to the model. Moreover, these training signals can be applied in many metric-based few-shot learning models. Our extensive experiments on the ACE-2005 dataset (under a few-shot learning setting) show that the proposed method can improve the performance of few-shot learning.

pdf bib
Annotating and quantifying narrative time disruptions in modernist and hypertext fiction
Edward Kearns

This paper outlines work in progress on a new method of annotating and quantitatively discussing narrative techniques related to time in fiction. Specifically, those techniques are analepsis, prolepsis, narrative level changes, and stream-of-consciousness and free-indirect-discourse narration. By counting the frequency and extent of the usage of these techniques, the narrative characteristics of works from different time periods and genres can be compared. This project uses modernist fiction and hypertext fiction as its case studies.

pdf bib
Exploring aspects of similarity between spoken personal narratives by disentangling them into narrative clause types
Belen Saldias | Deb Roy

Sharing personal narratives is a fundamental aspect of human social behavior as it helps us share our life experiences. We can tell stories and rely on our background to understand their context, similarities, and differences. A substantial effort has been made towards developing storytelling machines or inferring characters’ features. However, we don’t usually find models that compare narratives. This task is remarkably challenging for machines since they, as sometimes we do, lack an understanding of what similarity means. To address this challenge, we first introduce a corpus of real-world spoken personal narratives comprising 10,296 narrative clauses from 594 video transcripts. Second, we ask non-narrative experts to annotate those clauses under Labov’s sociolinguistic model of personal narratives (i.e., action, orientation, and evaluation clause types) and train a classifier that reaches 84.7 % F-score for the highest-agreed clauses. Finally, we match stories and explore whether people implicitly rely on Labov’s framework to compare narratives. We show that actions followed by the narrator’s evaluation of these actions are the aspects non-experts consider the most. Our approach is intended to help inform machine learning methods aimed at studying or representing personal narratives.

pdf bib
On-The-Fly Information Retrieval Augmentation for Language Models
Hai Wang | David McAllester

Here we experiment with the use of information retrieval as an augmentation for pre-trained language models. The text corpus used in information retrieval can be viewed as a form of episodic memory which grows over time. By augmenting GPT 2.0 with information retrieval we achieve a zero-shot 15 % relative reduction in perplexity on the Gigaword corpus without any re-training. We also validate our IR augmentation on an event co-reference task.
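A minimal sketch of the retrieval-augmentation idea: fetch a relevant document from the growing text corpus and prepend it to the language model's context. Word-overlap retrieval stands in for a real IR system here, and the corpus, separator, and query are all hypothetical.

```python
def retrieve_and_prepend(query, corpus, sep=" "):
    """Toy IR augmentation for a language model: pick the corpus document
    with the highest word overlap with the query and prepend it to the LM
    context. A real system would use a proper retriever and score the
    continuation with the language model."""
    q = set(query.lower().split())
    best = max(corpus, key=lambda doc: len(q & set(doc.lower().split())))
    return best + sep + query
```

Because the retrieved text is only prepended as context, the language model itself needs no re-training, matching the zero-shot setup described above.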

up

bib (full) Proceedings of LREC2020 Workshop "People in language, vision and the mind" (ONION2020)

pdf bib
Proceedings of LREC2020 Workshop "People in language, vision and the mind" (ONION2020)
Patrizia Paggio | Albert Gatt | Roman Klinger

pdf bib
Analysis of Body Behaviours in Human-Human and Human-Robot Interactions
Taiga Mori | Kristiina Jokinen | Yasuharu Den

We conducted a preliminary comparison of human-robot (HR) and human-human (HH) interactions conducted in English and in Japanese. As a result, body gestures increased in HR, while hand and head gestures decreased in HR. Concerning hand gestures, they were composed of more diverse and complex forms, trajectories and functions in HH than in HR. Moreover, English speakers produced 6 times more hand gestures than Japanese speakers in HH. Regarding head gestures, even though there was no difference in the frequency of head gestures between English speakers and Japanese speakers in HH, Japanese speakers produced slightly more nodding during the robot’s speech than English speakers in HR. Furthermore, the positions of nods differed depending on the language. Concerning body gestures, participants produced them mostly to regulate an appropriate distance from the robot in HR. Additionally, English speakers produced slightly more body gestures than Japanese speakers.

pdf bib
Improving Sentiment Analysis with Biofeedback Data
Daniel Schlör | Albin Zehe | Konstantin Kobs | Blerta Veseli | Franziska Westermeier | Larissa Brübach | Daniel Roth | Marc Erich Latoschik | Andreas Hotho

Humans are frequently able to read and interpret the emotions of others by directly taking verbal and non-verbal signals in human-to-human communication into account, or to infer or even experience emotions from mediated stories. For computers, however, emotion recognition is a complex problem: thoughts and feelings are the roots of many behavioural responses, and they are deeply entangled with neurophysiological changes within humans. As such, emotions are very subjective, often expressed in a subtle manner, and highly dependent on context. For example, machine learning approaches for text-based sentiment analysis often rely on incorporating sentiment lexicons or language models to capture contextual meaning. This paper explores if and how we can further enhance sentiment analysis using biofeedback from humans who are experiencing emotions while reading texts. Specifically, we record the heart rate and brain waves of readers who are presented with short texts annotated with the emotions they induce. We use these physiological signals to improve the performance of a lexicon-based sentiment classifier. We find that the combination of several biosignals can improve the ability of a text-based classifier to detect the presence of a sentiment in a text on a per-sentence level.

up

bib (full) Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection

pdf bib
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection
Hend Al-Khalifa | Walid Magdy | Kareem Darwish | Tamer Elsayed | Hamdy Mubarak

pdf bib
AraBERT: Transformer-based Model for Arabic Language Understanding
Wissam Antoun | Fady Baly | Hazem Hajj

The Arabic language is a morphologically rich language with relatively few resources and a less explored syntax compared to English. Given these limitations, Arabic Natural Language Processing (NLP) tasks like Sentiment Analysis (SA), Named Entity Recognition (NER), and Question Answering (QA) have proven to be very challenging to tackle. Recently, with the surge of transformer-based models, language-specific BERT models have proven to be very efficient at language understanding, provided they are pre-trained on a very large corpus. Such models were able to set new standards and achieve state-of-the-art results for most NLP tasks. In this paper, we pre-trained BERT specifically for the Arabic language in the pursuit of achieving the same success that BERT did for the English language. The performance of AraBERT is compared to multilingual BERT from Google and other state-of-the-art approaches. The results showed that the newly developed AraBERT achieved state-of-the-art performance on most tested Arabic NLP tasks. The pretrained AraBERT models are publicly available on https://github.com/aub-mind/araBERT in the hope of encouraging research and applications for Arabic NLP.

pdf bib
From Arabic Sentiment Analysis to Sarcasm Detection: The ArSarcasm Dataset
Ibrahim Abu Farha | Walid Magdy

Sarcasm is one of the main challenges for sentiment analysis systems. Its complexity comes from the expression of opinion through implicit, indirect phrasing. In this paper, we present ArSarcasm, an Arabic sarcasm detection dataset, which was created through the reannotation of available Arabic sentiment analysis datasets. The dataset contains 10,547 tweets, 16% of which are sarcastic. In addition to sarcasm, the data was annotated for sentiment and dialects. Our analysis shows the highly subjective nature of these tasks, which is demonstrated by the shift in sentiment labels based on annotators’ biases. Experiments show the degradation of state-of-the-art sentiment analysers when faced with sarcastic content. Finally, we train a deep learning model for sarcasm detection using a BiLSTM. The model achieves an F1 score of 0.46, which shows the challenging nature of the task and should act as a basic baseline for future research on our dataset.

pdf bib
ALT Submission for OSACT Shared Task on Offensive Language Detection
Sabit Hassan | Younes Samih | Hamdy Mubarak | Ahmed Abdelali | Ammar Rashed | Shammur Absar Chowdhury

In this paper, we describe our efforts at the OSACT Shared Task on Offensive Language Detection. The shared task consists of two subtasks: offensive language detection (Subtask A) and hate speech detection (Subtask B). For offensive language detection, a system combination of Support Vector Machines (SVMs) and Deep Neural Networks (DNNs) achieved the best results on the development set and ranked 1st in the official results for Subtask A, with an F1 score of 90.51% on the test set. For hate speech detection, DNNs were less effective, and a system combination of multiple SVMs with different parameters achieved the best results on the development set and ranked 4th in the official results for Subtask B, with an F1-macro score of 80.63% on the test set.

pdf bib
ASU_OPTO at OSACT4 - Offensive Language Detection for Arabic text
Amr Keleg | Samhaa R. El-Beltagy | Mahmoud Khalil

In past years, toxic comments and offensive speech have been polluting the internet, and manual inspection of these comments is becoming a tiresome task to manage. A machine learning based model that can filter offensive Arabic content is therefore in high demand. In this paper, we describe the model that we submitted to the Shared Task on Offensive Language Detection organized by the 4th Workshop on Open-Source Arabic Corpora and Processing Tools. Our model makes use of a transformer-based model (BERT) to detect offensive content. We came in fourth place in subtask A (detecting offensive speech) and in third place in subtask B (detecting hate speech).

pdf bib
Multi-Task Learning using AraBert for Offensive Language Detection
Marc Djandji | Fady Baly | Wissam Antoun | Hazem Hajj

The use of social media platforms has become more prevalent, which has provided tremendous opportunities for people to connect but has also opened the door to misuse through the spread of hate speech and offensive language. This phenomenon has been driving more and more people to extreme reactions and online aggression, sometimes causing physical harm to individuals or groups of people. There is a need to control and prevent such misuse of online social media through the automatic detection of profane language. The shared task on Offensive Language Detection at OSACT4 aimed at achieving state-of-the-art profane language detection methods for Arabic social media. Our team, BERTologists, tackled this problem by leveraging the state-of-the-art pretrained Arabic language model AraBERT, which we augmented with multi-task learning to enable our model to learn efficiently from little data. Our multi-task AraBERT approach achieved second place in both subtasks A and B, which shows that the model performs consistently across different tasks.

up

bib (full) Proceedings of the Second ParlaCLARIN Workshop

pdf bib
Proceedings of the Second ParlaCLARIN Workshop
Darja Fišer | Maria Eskevich | Franciska de Jong

pdf bib
Compiling Czech Parliamentary Stenographic Protocols into a Corpus
Barbora Hladka | Matyáš Kopp | Pavel Straňák

The Parliament of the Czech Republic consists of two chambers: the Chamber of Deputies (Lower House) and the Senate (Upper House). In our work, we focus on the agenda and documents that relate exclusively to the Chamber of Deputies. We pay particular attention to the stenographic protocols that record the Chamber of Deputies’ meetings. Our overall goal is to (1) compile the protocols into a ParlaCLARIN TEI encoded corpus, (2) make this corpus accessible and searchable in the TEITOK web-based platform, (3) annotate the corpus using the modules available in TEITOK, e.g. detect and recognize named entities, and (4) highlight the annotations in TEITOK. In addition, we add two more goals that we consider innovative: (5) update the corpus every time a new stenographic protocol is published online by the Chamber of Deputies, and (6) expose the annotations as linked open data in order to improve the protocols’ interoperability with other existing linked open data. This paper is devoted to goals (1) and (5).

pdf bib
Who mentions whom? Recognizing political actors in proceedings
Lennart Kerkvliet | Jaap Kamps | Maarten Marx

We show that it is straightforward to train a state-of-the-art named entity tagger (spaCy) to recognize political actors in Dutch parliamentary proceedings with high accuracy. The tagger was trained on 3.4K manually labeled examples, which were created in a modest 2.5 days of work. This resource is made available on GitHub. Besides proper nouns of persons and political parties, the tagger can recognize quite complex definite descriptions referring to cabinet ministers, ministries, and parliamentary committees. We also provide a demo search engine that employs the tagged entities in its SERP and result summaries.

pdf bib
Querying a large annotated corpus of parliamentary debates
Sascha Diwersy | Giancarlo Luxardo

The TAPS corpus makes it possible to share a large volume of French parliamentary data. The TEI-compliant approach behind its design choices facilitates the publishing and interoperability of the data, but also the implementation of exploratory data analysis techniques for processing institutional or political discourse. We demonstrate its application to the debates that occurred in the context of a specific legislative process, which generated strong opposition.

up

bib (full) Proceedings of the first workshop on Resources for African Indigenous Languages

pdf bib
Proceedings of the first workshop on Resources for African Indigenous Languages
Rooweither Mabuya | Phathutshedzo Ramukhadi | Mmasibidi Setaka | Valencia Wagner | Menno van Zaanen

pdf bib
Endangered African Languages Featured in a Digital Collection: The Case of the ǂKhomani San, Hugh Brody Collection
Kerry Jones | Sanjin Muftic

The ǂKhomani San, Hugh Brody Collection features the voices and history of indigenous hunter-gatherer descendants in three endangered languages, namely N|uu, Kora and Khoekhoe, as well as a regional dialect of Afrikaans. A large component of this collection consists of audio-visual (legacy media) recordings of interviews conducted with members of the community by Hugh Brody and his colleagues between 1997 and 2012, referring as far back as the 1800s. The Digital Library Services team at the University of Cape Town aims to showcase the collection digitally on the UCT-wide Digital Collections platform, Ibali, which runs on Omeka-S. In this paper we highlight the importance of such a collection in the context of South Africa, and the ethical steps that were taken to ensure respect for the ǂKhomani San as their stories are uploaded onto a repository and become accessible to all. We also feature some of the completed collection on Ibali and guide the reader through the organisation of the collection on the Omeka-S backend. Finally, we outline our development process, from digitisation to repository publishing, and present some of the challenges in data clean-up, the curation of legacy media, multi-lingual support, and site organisation.

pdf bib
Complex Setswana Parts of Speech Tagging
Gabofetswe Malema | Boago Okgetheng | Bopaki Tebalo | Moffat Motlhanka | Goaletsa Rammidi

Setswana is one of the Bantu languages written disjunctively. Some of its parts of speech, such as qualificatives and some adverbs, are made up of multiple words; that is, the part of speech is formed by a group of words. This disjunctive style of writing poses a challenge when a sentence is tokenized or tagged. A few studies have been done on the identification of multi-word parts of speech. In this study we go further and tokenize complex parts of speech, which are formed by extending basic forms of multi-word parts of speech. The parts of speech are extended by recursively concatenating more parts of speech onto a basic form. We developed rules for building complex relative parts of speech. A morphological analyzer and the Python NLTK are used to tag individual words and basic forms of multi-word parts of speech. The developed rules are then used to identify complex parts of speech. Results on 300-sentence text files give a performance of 74%. The tagger fails when it encounters expansion rules that are not implemented and when the morphological analyzer’s tagging is incorrect.

pdf bib
Comparing Neural Network Parsers for a Less-resourced and Morphologically-rich Language: Amharic Dependency Parser
Binyam Ephrem Seyoum | Yusuke Miyao | Baye Yimam Mekonnen

In this paper, we compare four state-of-the-art neural network dependency parsers for the Semitic language Amharic. As Amharic is a morphologically-rich and less-resourced language, the out-of-vocabulary (OOV) problem is higher when developing data-driven models. This limits researchers’ ability to develop neural network parsers, because neural networks require large quantities of data to train a model. We empirically evaluate neural network parsers when a small Amharic treebank is used for training. In our experiments, we obtain an 83.79 LAS score using the UDPipe system. Better accuracy is achieved when the neural parsing system uses external resources such as word embeddings; using such resources, the LAS score for UDPipe improves to 85.26. Our experiments show that neural networks can learn dependency relations well from limited data, while segmentation and POS tagging require much more data.

pdf bib
Mobilizing Metadata: Open Data Kit (ODK) for Language Resource Development in East Africa
Richard Griscom

Linguistic fieldworkers collect and archive metadata as part of the language resources (LRs) that they create, but they often work in resource-constrained environments that prevent them from using computers for data entry. In such situations, linguists must complete time-consuming and error-prone digitization tasks that limit the quantity and quality of the resources and metadata that they produce (Thieberger & Berez 2012; Margetts & Margetts 2012). This paper describes a method for entering linguistic metadata into mobile devices using the Open Data Kit (ODK) platform, a suite of open source tools designed for mobile data collection.

pdf bib
A Computational Grammar of Ga
Lars Hellan

The paper describes aspects of an HPSG-style computational grammar of the West African language Ga (a Kwa language spoken in the Accra area of Ghana). As a Volta Basin Kwa language, Ga features many types of multiverb expressions and other distinctive constructional patterns in the verbal and nominal domains. The paper highlights theoretical and formal features of the grammar motivated by these phenomena, some of them possibly innovative to the formal framework. As a so-called deep grammar of the language, it hosts a rich lexical structure, and we describe ways in which the grammar builds on previously available lexical resources. We outline the environment of current resources of which the grammar is part, and lines of research and development in which it and its environment can be used.

up

bib (full) Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)

pdf bib
Proceedings of the 1st Workshop on Tools and Resources to Empower People with REAding DIfficulties (READI)
Núria Gala | Rodrigo Wilkens

pdf bib
Automatically Assess Children’s Reading Skills
Ornella Mich | Nadia Mana | Roberto Gretter | Marco Matassoni | Daniele Falavigna

Assessing reading skills is an important task that teachers have to perform at the beginning of a new school year to evaluate the starting level of the class and properly plan the next learning activities. Digital tools based on automatic speech recognition (ASR) may be very useful in supporting teachers in this task, which is currently very time-consuming and prone to human error. This paper presents a web application for automatically assessing the fluency and accuracy of oral reading in children attending Italian primary and lower secondary schools. Our system, based on ASR technology, implements Cornoldi’s MT battery, a well-known Italian test for assessing reading skills. The front-end of the system was designed following the participatory design approach, involving end users from the beginning of the creation process. Teachers may use our system both to test students’ reading skills and to monitor their performance over time. In fact, the system offers an effective graphical visualization of the assessment results for both individual students and the entire class. The paper also presents the results of a pilot study evaluating the system’s usability with teachers.

pdf bib
Text Simplification to Help Individuals with Low Vision Read More Fluently
Lauren Sauvan | Natacha Stolowy | Carlos Aguilar | Thomas François | Núria Gala | Frédéric Matonti | Eric Castet | Aurélie Calabrèse

The objective of this work is to introduce text simplification as a potential reading aid to help improve the poor reading performance experienced by visually impaired individuals. As a first step, we explore what makes a text especially complex when read with low vision, by assessing the individual effect of three word properties (frequency, orthographic similarity and length) on reading speed in the presence of Central visual Field Loss (CFL). Individuals with bilateral CFL induced by macular diseases read pairs of French sentences displayed with the self-paced reading method. For each sentence pair, sentence n contained a target word matched with a synonym of the same length included in sentence n+1. Reading time was recorded for each target word. Given the corpus we used, our results show that (1) word frequency has a significant effect on reading time (the more frequent, the faster the reading speed), with larger amplitude (in the range of seconds) compared to normal vision; (2) word neighborhood size has a significant effect on reading time (the more neighbors, the slower the reading speed), this effect being rather small in amplitude but interestingly reversed compared to normal vision; and (3) word length has no significant effect on reading time. Supporting the development of new and more effective assistive technology to help low-vision readers is an important and timely issue, with massive potential implications for social and rehabilitation practices. The end goal of this project is to use our findings to customize text simplification for this specific population and use it as an optimal and efficient reading aid.

pdf bib
LagunTest: A NLP Based Application to Enhance Reading Comprehension
Kepa Bengoetxea | Itziar Gonzalez-Dios | Amaia Aguirregoitia

The ability to read and understand written texts plays an important role in education, above all in the last years of primary education. This is especially pertinent in language immersion educational programmes, where some students have low linguistic competence in the languages of instruction. In this context, adapting texts to the individual needs of each student requires a considerable effort from education professionals. However, language technologies can facilitate the laborious adaptation of materials in order to enhance reading comprehension. In this paper, we present LagunTest, an NLP-based application that takes as input a text in Basque or English, offers synonyms, definitions, and examples of the words in different contexts, and presents some linguistic characteristics as well as visualizations. LagunTest is based on reusable and open multilingual and multimodal tools, and it is distributed with an open license. LagunTest is intended to ease the burden on education professionals in the task of adapting materials, and its output should always be supervised by them.

pdf bib
A Lexical Simplification Tool for Promoting Health Literacy
Leonardo Zilio | Liana Braga Paraguassu | Luis Antonio Leiva Hercules | Gabriel Ponomarenko | Laura Berwanger | Maria José Bocorny Finatto

This paper presents MedSimples, an authoring tool that combines Natural Language Processing, Corpus Linguistics and Terminology to help writers to convert health-related information into a more accessible version for people with low literacy skills. MedSimples applies parsing methods associated with lexical resources to automatically evaluate a text and present simplification suggestions that are more suitable for the target audience. Using the suggestions provided by the tool, the author can adapt the original text and make it more accessible. The focus of MedSimples lies on texts for special purposes, so that it not only deals with general vocabulary, but also with specialized terms. The tool is currently under development, but an online working prototype exists and can be tested freely. An assessment of MedSimples was carried out aiming at evaluating its current performance with some promising results, especially for informing the future developments that are planned for the tool.

pdf bib
A multi-lingual and cross-domain analysis of features for text simplification
Regina Stodden | Laura Kallmeyer

In text simplification and readability research, several features have been proposed to estimate the complexity of a text or to simplify it, e.g., readability scores, sentence length, or the proportion of POS tags. These features, however, were mainly developed for English. In this paper, we investigate their relevance for Czech, German, English, Spanish, and Italian text simplification corpora. Our multi-lingual and multi-domain corpus analysis shows that the relevance of different features for text simplification differs per corpus, language, and domain. For example, the relevance of lexical complexity differs across all languages, that of the BLEU score across all domains, and that of 14 features within the web domain corpora. Overall, the negative statistical tests regarding the other features across and within domains and languages lead to the assumption that text simplification models may be transferable between different domains or languages.

up

pdf (full)
bib (full)
Proceedings of the 5th Workshop on Representation Learning for NLP

pdf bib
Proceedings of the 5th Workshop on Representation Learning for NLP
Spandana Gella | Johannes Welbl | Marek Rei | Fabio Petroni | Patrick Lewis | Emma Strubell | Minjoon Seo | Hannaneh Hajishirzi

pdf bib
Zero-Resource Cross-Domain Named Entity Recognition
Zihan Liu | Genta Indra Winata | Pascale Fung

Existing models for cross-domain named entity recognition (NER) rely on numerous unlabeled corpora or on labeled NER training data in target domains. However, collecting data for low-resource target domains is not only expensive but also time-consuming. Hence, we propose a cross-domain NER model that does not use any external resources. We first introduce Multi-Task Learning (MTL) by adding a new objective function to detect whether tokens are named entities or not. We then introduce a framework called Mixture of Entity Experts (MoEE) to improve robustness for zero-resource domain adaptation. Finally, experimental results show that our model outperforms strong unsupervised cross-domain sequence labeling models, and that its performance is close to that of the state-of-the-art model, which leverages extensive resources.

pdf bib
Encodings of Source Syntax: Similarities in NMT Representations Across Target Languages
Tyler A. Chang | Anna Rafferty

We train neural machine translation (NMT) models from English to six target languages, using NMT encoder representations to predict ancestor constituent labels of source language words. We find that NMT encoders learn similar source syntax regardless of NMT target language, relying on explicit morphosyntactic cues to extract syntactic features from source sentences. Furthermore, the NMT encoders outperform RNNs trained directly on several of the constituent label prediction tasks, suggesting that NMT encoder representations can be used effectively for natural language tasks involving syntax. However, both the NMT encoders and the directly-trained RNNs learn substantially different syntactic information from a probabilistic context-free grammar (PCFG) parser. Despite lower overall accuracy scores, the PCFG often performs well on sentences for which the RNN-based models perform poorly, suggesting that RNN architectures are constrained in the types of syntax they can learn.

pdf bib
Learning Probabilistic Sentence Representations from Paraphrases
Mingda Chen | Kevin Gimpel

Probabilistic word embeddings have shown effectiveness in capturing notions of generality and entailment, but there is very little work on doing the analogous type of investigation for sentences. In this paper we define probabilistic models that produce distributions for sentences. Our best-performing model treats each word as a linear transformation operator applied to a multivariate Gaussian distribution. We train our models on paraphrases and demonstrate that they naturally capture sentence specificity. While our proposed model achieves the best performance overall, we also show that specificity is represented by simpler architectures via the norm of the sentence vectors. Qualitative analysis shows that our probabilistic model captures sentential entailment and provides ways to analyze the specificity and preciseness of individual words.
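The composition step this abstract describes, where each word acts as a linear transformation operator applied to a multivariate Gaussian, can be sketched in a few lines. This is only an illustrative sketch with made-up operators; the authors' parameterization and their training objective on paraphrases are not reproduced here.

```python
import numpy as np

def sentence_gaussian(word_matrices, mean, cov):
    """Compose a sentence distribution by applying each word's linear
    operator in turn to a prior multivariate Gaussian.  A linear map A
    sends N(mean, cov) to N(A @ mean, A @ cov @ A.T)."""
    for A in word_matrices:
        mean = A @ mean
        cov = A @ cov @ A.T
    return mean, cov

# Toy example: three random 4x4 "word operators" (hypothetical data).
rng = np.random.default_rng(0)
words = [rng.normal(scale=0.5, size=(4, 4)) for _ in range(3)]
mu, sigma = sentence_gaussian(words, np.ones(4), np.eye(4))
```

The resulting covariance could then serve as a proxy for specificity: vaguer sentences would correspond to wider distributions.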

pdf bib
Word Embeddings as Tuples of Feature Probabilities
Siddharth Bhat | Alok Debnath | Souvik Banerjee | Manish Shrivastava

In this paper, we provide an alternate perspective on word representations by reinterpreting the dimensions of the vector space of a word embedding as a collection of features. In this reinterpretation, every component of the word vector is normalized against all the word vectors in the vocabulary. This allows us to view each vector as an n-tuple (akin to a fuzzy set), where n is the dimensionality of the word representation and each element represents the probability of the word possessing a feature. Indeed, this representation enables the use of fuzzy set theoretic operations, such as union, intersection and difference. Unlike previous attempts, we show that this representation of words provides a notion of similarity which is inherently asymmetric and hence closer to human similarity judgements. We compare the performance of this representation against various benchmarks, and explore some of its unique properties, including function word detection, detection of polysemous words, and some insight into the interpretability provided by set theoretic operations.
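A rough sketch of the reinterpretation described in this abstract: normalize each dimension across the vocabulary so components read as feature probabilities, then apply standard fuzzy set operators. The min-max normalization and containment-style similarity below are simplifying assumptions for illustration; the paper's exact formulations may differ.

```python
import numpy as np

def to_feature_probs(embeddings):
    """Normalize each dimension against all word vectors (rows) so
    every component lies in [0, 1] and can be read as the probability
    of the word possessing that feature (min-max per dimension; an
    assumed normalization)."""
    lo, hi = embeddings.min(axis=0), embeddings.max(axis=0)
    return (embeddings - lo) / (hi - lo)

# Standard fuzzy set operations on feature-probability tuples.
def fuzzy_union(a, b):        return np.maximum(a, b)
def fuzzy_intersection(a, b): return np.minimum(a, b)
def fuzzy_difference(a, b):   return np.minimum(a, 1.0 - b)

def asymmetric_similarity(a, b):
    """Degree to which a's features are contained in b: |a ∧ b| / |a|.
    In general sim(a, b) != sim(b, a), giving the asymmetry the
    abstract highlights."""
    return fuzzy_intersection(a, b).sum() / a.sum()
```

For example, with a three-word, two-dimensional toy vocabulary, the containment of one word's features in another's generally differs from the reverse direction.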

pdf bib
Compositionality and Capacity in Emergent Languages
Abhinav Gupta | Cinjon Resnick | Jakob Foerster | Andrew Dai | Kyunghyun Cho

Recent works have discussed the extent to which emergent languages can exhibit properties of natural languages, particularly learning compositionality. In this paper, we investigate the learning biases that affect the efficacy and compositionality of multi-agent communication, in addition to the communicative bandwidth. Our foremost contribution is to explore how the capacity of a neural network impacts its ability to learn a compositional language. We additionally introduce a set of evaluation metrics with which we analyze the learned languages. Our hypothesis is that there is a specific range of model capacity and channel bandwidth that induces compositional structure in the resulting language and consequently encourages systematic generalization. While we empirically see evidence for the bottom of this range, we curiously do not find evidence for the top of the range, and believe that this is an open question for the community.

pdf bib
Learning Geometric Word Meta-Embeddings
Pratik Jawanpuria | Satya Dev N T V | Anoop Kunchukuttan | Bamdev Mishra

We propose a geometric framework for learning meta-embeddings of words from different embedding sources. Our framework transforms the embeddings into a common latent space, where, for example, simple averaging or concatenation of different embeddings (of a given word) is more amenable. The proposed latent space arises from two particular geometric transformations: source-embedding-specific orthogonal rotations and a common Mahalanobis metric scaling. Empirical results on several word similarity and word analogy benchmarks illustrate the efficacy of the proposed framework.
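The first of the two transformations, the source-specific orthogonal rotation, can be sketched with a classic orthogonal Procrustes alignment. This toy version rotates every source into the first source's space and averages there; the common Mahalanobis metric scaling from the abstract is omitted, and the function names are illustrative, not the authors' API.

```python
import numpy as np

def orthogonal_align(src, ref):
    """Orthogonal (Procrustes) rotation W minimizing ||src @ W - ref||_F:
    W = U @ Vt, where U, S, Vt is the SVD of src.T @ ref."""
    u, _, vt = np.linalg.svd(src.T @ ref)
    return u @ vt

def meta_embed(sources):
    """Rotate each embedding matrix (rows = words, shared vocabulary)
    into the first source's space, then average the aligned matrices."""
    ref = sources[0]
    aligned = [s @ orthogonal_align(s, ref) for s in sources]
    return np.mean(aligned, axis=0)
```

Because the alignment is orthogonal, a source that is merely a rotated copy of another contributes no new information: averaging after alignment recovers the original matrix.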

pdf bib
Exploring the Limits of Simple Learners in Knowledge Distillation for Document Classification with DocBERT
Ashutosh Adhikari | Achyudh Ram | Raphael Tang | William L. Hamilton | Jimmy Lin

Fine-tuned variants of BERT are able to achieve state-of-the-art accuracy on many natural language processing tasks, although at significant computational cost. In this paper, we verify BERT’s effectiveness for document classification and investigate the extent to which BERT-level effectiveness can be obtained by different baselines, combined with knowledge distillation, a popular model compression method. The results show that BERT-level effectiveness can be achieved by a single-layer LSTM with at least 40× fewer FLOPs and only ~3% of the parameters. More importantly, this study analyzes the limits of knowledge distillation as we distill BERT’s knowledge all the way down to linear models, a relevant baseline for the task. We report substantial improvement in effectiveness for even the simplest models, as they capture the knowledge learnt by BERT.

pdf bib
Are All Languages Created Equal in Multilingual BERT?
Shijie Wu | Mark Dredze

Multilingual BERT (mBERT), trained on 104 languages, has shown surprisingly good cross-lingual performance on several NLP tasks, even without explicit cross-lingual signals. However, these evaluations have focused on cross-lingual transfer with high-resource languages, covering only a third of the languages covered by mBERT. We explore how mBERT performs on a much wider set of languages, focusing on the quality of representation for low-resource languages, measured by within-language performance. We consider three tasks: Named Entity Recognition (99 languages), Part-of-Speech Tagging, and Dependency Parsing (54 languages each). mBERT performs better than or comparably to baselines on high-resource languages but does much worse for low-resource languages. Furthermore, monolingual BERT models for these languages do even worse. Paired with similar languages, the performance gap between monolingual BERT and mBERT can be narrowed. We find that better models for low-resource languages require more efficient pretraining techniques or more data.

pdf bib
Evaluating Compositionality of Sentence Representation Models
Hanoz Bhathena | Angelica Willis | Nathan Dass

We evaluate the compositionality of general-purpose sentence encoders by proposing two different metrics to quantify their compositional understanding capability. We introduce a novel metric, Polarity Sensitivity Scoring (PSS), which utilizes sentiment perturbations as a proxy for measuring compositionality. We then compare results from PSS with those obtained via our proposed extension of a metric called Tree Reconstruction Error (TRE) (CITATION), where compositionality is evaluated by measuring how well a true representation-producing model can be approximated by a model that explicitly combines representations of its primitives.

up

bib (full) Proceedings of the Workshop on Resources and Techniques for User and Author Profiling in Abusive Language

pdf bib
Proceedings of the Workshop on Resources and Techniques for User and Author Profiling in Abusive Language
Johanna Monti | Valerio Basile | Maria Pia Di Buono | Raffaele Manna | Antonio Pascucci | Sara Tonelli

pdf bib
Profiling Bots, Fake News Spreaders and Haters
Paolo Rosso

Author profiling studies how language is shared by people. Stylometry techniques help in identifying aspects such as gender, age, native language, or even personality. Author profiling is a problem of growing importance, not only in marketing and forensics, but also in cybersecurity. The aim is not only to identify users whose messages are potential threats from a terrorism viewpoint, but also those whose messages are a threat from a social exclusion perspective because they contain hate speech, cyberbullying, etc. Bots often play a key role in spreading hate speech, as well as fake news, with the purpose of polarizing public opinion with respect to controversial issues like Brexit or the Catalan referendum. For instance, the authors of a recent study about the 1 Oct 2017 Catalan referendum showed that in a dataset of 3.6 million tweets, about 23.6% of the tweets were produced by bots. The targets of these bots were pro-independence influencers, who were sent negative, emotional and aggressive hateful tweets with hashtags such as #sonunesbesties (i.e. #theyareanimals). Since 2013, at the PAN Lab at CLEF (https://pan.webis.de/) we have addressed several aspects of author profiling in social media. In 2019 we investigated the feasibility of distinguishing whether the author of a Twitter feed is a bot, while this year we are addressing the problem of profiling those authors that are more likely to spread fake news on Twitter because they did so in the past. We aim at identifying possible fake news spreaders as a first step towards preventing fake news from being propagated among online users (fake news aim to polarize public opinion and may contain hate speech). In 2021 we specifically aim at addressing the challenging problem of profiling haters in social media in order to monitor abusive language and prevent cases of social exclusion, combatting, for instance, racism, xenophobia and misogyny.
Although we already started addressing the problem of detecting hate speech when the targets are immigrants or women at the HatEval shared task in SemEval-2019, and when the targets are women also in the Automatic Misogyny Identification tasks at IberEval-2018, Evalita-2018 and Evalita-2020, it was not done from an author profiling perspective. At the end of the keynote, I will present some insights in order to stress the importance of monitoring abusive language in social media, for instance, in foreseeing sexual crimes. In fact, previous studies confirmed that a correlation might lie between the yearly per capita rate of rape and the misogynistic language used on Twitter.

pdf bib
An Indian Language Social Media Collection for Hate and Offensive Speech
Anita Saroj | Sukomal Pal

In social media, people express themselves every day on issues that affect their lives. During parliamentary elections, people’s interactions with the candidates in social media posts reflect a lot of social trends in a charged atmosphere. People’s likes and dislikes of leaders, political parties and their stands often become the subject of hate and offensive posts. We collected social media posts in Hindi and English from Facebook and Twitter during the run-up to the 2019 parliamentary election of India (PEI data-2019). We created a dataset for sentiment analysis into three categories: hate speech, offensive, and not hate or not offensive. We report here the initial results of sentiment classification for the dataset using different classifiers.

up

pdf (full)
bib (full)
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

pdf bib
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
Garrett Nicolai | Kyle Gorman | Ryan Cotterell

pdf bib
The CMU-LTI submission to the SIGMORPHON 2020 Shared Task 0: Language-Specific Cross-Lingual Transfer
Nikitha Murikinati | Antonios Anastasopoulos

This paper describes the CMU-LTI submission to the SIGMORPHON 2020 Shared Task 0 on typologically diverse morphological inflection. The (unrestricted) submission uses the cross-lingual approach of our last year’s winning submission (Anastasopoulos and Neubig, 2019), but adapted to use specific transfer languages for each test language. Our system, with fixed non-tuned hyperparameters, achieved a macro-averaged accuracy of 80.65, ranking 20th among 31 systems, but it was still tied for best system in 25 of the 90 total languages.
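The abstract does not detail how the transfer language for each test language was chosen, so the following is only a hypothetical sketch of one common heuristic: pick the candidate whose typological or character-distribution feature vector is most similar to the target language's, measured by cosine similarity. The language codes and vectors below are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def pick_transfer_language(target_vec, candidates):
    """Pick the candidate language whose feature vector is most similar to the target."""
    return max(candidates, key=lambda lang: cosine(target_vec, candidates[lang]))
```

In practice such a selection might also weigh genealogical relatedness or the amount of available training data for each candidate.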

pdf bib
Grapheme-to-Phoneme Conversion with a Multilingual Transformer Model
Omnia ElSaadany | Benjamin Suter

In this paper, we describe our three submissions to the SIGMORPHON 2020 shared task 1 on grapheme-to-phoneme conversion for 15 languages. We experimented with a single multilingual transformer model. We observed that the multilingual model achieves results on par with our separately trained monolingual models and is even able to avoid a few of the errors made by the monolingual models.

pdf bib
The IMSCUBoulder System for the SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion
Manuel Mager | Katharina Kann

In this paper, we present the systems of the University of Stuttgart IMS and the University of Colorado Boulder (IMSCUBoulder) for SIGMORPHON 2020 Task 2 on unsupervised morphological paradigm completion (Kann et al., 2020). The task consists of generating the morphological paradigms of a set of lemmas, given only the lemmas themselves and unlabeled text. Our proposed system is a modified version of the baseline introduced together with the task. In particular, we experiment with substituting the inflection generation component with an LSTM sequence-to-sequence model and an LSTM pointer-generator network. Our pointer-generator system obtains the best score of all seven submitted systems on average over all languages, and outperforms the official baseline, which was best overall, on Bulgarian and Kannada.

pdf bib
Exploring Neural Architectures And Techniques For Typologically Diverse Morphological Inflection
Pratik Jayarao | Siddhanth Pillay | Pranav Thombre | Aditi Chaudhary

Morphological inflection is critical to augmenting existing corpora in low-resource languages, which can help develop several applications in these languages with very good social impact. We describe our attention-based encoder-decoder approach, which we implement using LSTMs and Transformers as the base units. We also describe the ancillary techniques that we experimented with, such as hallucination, language vector injection, sparsemax loss and an adversarial language network, alongside our approach to selecting the related language(s) for training. We present the results we generated on the constrained as well as the unconstrained SIGMORPHON 2020 dataset (CITATION). One of the primary goals of our paper was to study the contribution of the varied components described above to the performance of our system and to perform an analysis of the same.

pdf bib
Leveraging Principal Parts for Morphological Inflection
Ling Liu | Mans Hulden

This paper presents the submission by the CU Ling team from the University of Colorado to SIGMORPHON 2020 shared task 0 on morphological inflection. The task is to generate the target inflected word form given a lemma form and a target morphosyntactic description. Our system uses the Transformer architecture. Our overall approach is to treat the morphological inflection task as a paradigm cell filling problem and to design the system to leverage principal parts information for better morphological inflection when the training data is limited. We train one model for each language separately without external data. The overall average performance of our submission ranks first in both average accuracy and Levenshtein distance from the gold inflection among all submissions, including those using external resources.

pdf bib
Multi-Tiered Strictly Local Functions
Phillip Burness | Kevin McMullin

Tier-based Strictly Local functions, as they have so far been defined, are equipped with just a single tier. In light of this fact, they are currently incapable of modelling simultaneous phonological processes that would require different tiers. In this paper we consider whether and how we can allow a single function to operate over more than one tier. We conclude that multiple tiers can and should be permitted, but that the relationships between them must be restricted in some way to avoid overgeneration. The particular restriction that we propose comes in two parts. First, each input element is associated with a set of tiers that on their own can fully determine what the element is mapped to. Second, the set of tiers associated to a given input element must form a strict superset-subset hierarchy. In this way, we can track multiple, related sources of information when deciding how to process a particular input element. We demonstrate that doing so enables simple and intuitive analyses of otherwise challenging phonological phenomena.
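The tier idea can be illustrated with a small sketch. This shows tier-based evaluation of phonotactic constraints over several tiers at once, not the full multi-tiered function formalism the paper develops, and the alphabet and constraints below are invented for illustration.

```python
def project(string, tier):
    """Project a string onto a tier, keeping only the symbols on that tier."""
    return [s for s in string if s in tier]

def violates(string, tier, banned_bigrams):
    """True if the tier projection contains a banned adjacent pair of symbols."""
    proj = project(string, tier)
    return any((a, b) in banned_bigrams for a, b in zip(proj, proj[1:]))

def well_formed(string, constraints):
    """Evaluate a string against constraints on several tiers simultaneously."""
    return not any(violates(string, tier, banned) for tier, banned in constraints)
```

For example, a toy vowel-harmony constraint projects only the vowels and bans disagreeing adjacent pairs on that tier, while a second tier could simultaneously track nasals; `well_formed` checks all tiers at once.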

up

bib (full)
Proceedings of the LREC2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives

pdf bib
Proceedings of the LREC2020 9th Workshop on the Representation and Processing of Sign Languages: Sign Language Resources in the Service of the Language Community, Technological Challenges and Application Perspectives
Eleni Efthimiou | Stavroula-Evita Fotinea | Thomas Hanke | Julie A. Hochgesang | Jette Kristoffersen | Johanna Mesch

pdf bib
Back and Forth between Theory and Application: Shared Phonological Coding Between ASL Signbank and ASL-LEX
Amelia Becker | Donovan Catt | Julie A. Hochgesang

The development of signed language lexical databases, digital organizations that describe different phonological features of signs and attempt to establish relationships between them, has resulted in a renewed interest in the phonological descriptions used to uniquely identify and organize the lexicons of respective sign languages (van der Kooij, 2002; Fenlon et al., 2016; Brentari et al., 2018). Throughout the mutually shared coding process involved in organizing two lexical databases, ASL Signbank (Hochgesang, Crasborn and Lillo-Martin, 2020) and ASL-LEX (Caselli et al., 2016), issues have arisen that require revisiting how phonological features and categories are to be applied, and even decided upon, so as to adequately distinguish lexical contrast for the respective sign languages. The paper concludes by exploring the inverse of the theory-to-database relationship. Examples are given of theoretical implications and research questions that arise as consequences of language resource building. These are presented as evidence that not only does theory impact the organization of databases, but the process of database creation can also inform our theories.

pdf bib
Measuring Lexical Similarity across Sign Languages in Global Signbank
Carl Börstell | Onno Crasborn | Lori Whynot

Lexicostatistics is the main method used in previous work measuring linguistic distances between sign languages. As a method, it disregards any possible structural/grammatical similarity, focusing exclusively on lexical items, but it is time-consuming as it requires some comparable phonological coding (i.e. form description) as well as concept matching (i.e. meaning description) of signs across the sign languages to be compared. In this paper, we present a novel approach for measuring lexical similarity across any two sign languages using the Global Signbank platform, a lexical database of uniformly coded signs. The method involves a feature-by-feature comparison of all matched phonological features. This method can be used in two distinct ways: 1) automatically comparing the amount of lexical overlap between two sign languages (with a more detailed feature description than previous lexicostatistical methods); 2) finding exact form-matches across languages that are either matched or mismatched in meaning (i.e. true or false friends). We show the feasibility of this method by comparing three languages (datasets) in Global Signbank, and are currently expanding both the size of these three datasets and the total number of datasets.
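A feature-by-feature comparison of concept-matched signs could look roughly like the sketch below. The feature names, values and lexicons are invented; Global Signbank's actual coding scheme is far richer.

```python
def sign_similarity(sign_a, sign_b):
    """Fraction of shared phonological features whose values agree."""
    shared = set(sign_a) & set(sign_b)
    if not shared:
        return 0.0
    return sum(sign_a[f] == sign_b[f] for f in shared) / len(shared)

def lexical_overlap(lexicon_a, lexicon_b, threshold=1.0):
    """Count concept-matched sign pairs whose similarity reaches the threshold."""
    matches = 0
    for concept, sign_a in lexicon_a.items():
        sign_b = lexicon_b.get(concept)  # match on shared concept/gloss
        if sign_b is not None and sign_similarity(sign_a, sign_b) >= threshold:
            matches += 1
    return matches
```

Lowering `threshold` below 1.0 approximates the "more detailed feature-description" comparison, while a form-match search with mismatched concepts would find potential false friends.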

pdf bib
PE2LGP Animator: A Tool To Animate A Portuguese Sign Language Avatar
Pedro Cabral | Matilde Gonçalves | Hugo Nicolau | Luísa Coheur | Ruben Santos

Software for the production of sign languages is much less common than for spoken languages. Such software usually relies on 3D humanoid avatars to produce signs, which, inevitably, necessitates the use of animation. One barrier to the use of popular animation tools is their complexity and steep learning curve, which can be hard to master for inexperienced users. Here, we present PE2LGP, an authoring system featuring a 3D avatar that signs Portuguese Sign Language. Our Animator is designed specifically to craft sign language animations using a keyframe method, and is meant to be easy for users without animation skills to use and learn. We conducted a preliminary evaluation of the Animator, in which we animated seven Portuguese Sign Language sentences and asked four sign language users to evaluate their quality. This evaluation revealed that the system, in spite of its simplicity, is indeed capable of producing comprehensible messages.

pdf bib
Translating an Aesop’s Fable to Filipino Sign Language through 3D Animation
Mark Cueto | Winnie He | Rei Untiveros | Josh Zuñiga | Joanna Pauline Rivera

According to the National Statistics Office (2003) in the 2000 Population Census, the deaf community in the Philippines numbered about 121,000 deaf and hard-of-hearing Filipinos. Deaf and hard-of-hearing Filipinos in these communities use Filipino Sign Language (FSL) as the main method of manual communication. Deaf and hard-of-hearing children experience difficulty in developing reading and writing skills through traditional methods of teaching used primarily for hearing children. This study aims to translate an Aesop’s fable to Filipino Sign Language with the use of 3D animation, resulting in a video output. The video created contains a 3D animated avatar performing the sign translations to FSL (mainly focusing on hand gestures, which include hand shape, palm orientation, location, and movement) on screen beside their English text equivalent and related images. The final output was then evaluated by deaf FSL signers. Evaluation results showed that the final output can potentially be used as a learning material. In order to make it more effective as a learning material, it is very important to consider the animation’s appearance, speed, naturalness, and accuracy. In this paper, the common action units were also listed for easier construction of animations of the signs.

pdf bib
Elicitation and Corpus of Spontaneous Sign Language Discourse Representation Diagrams
Michael Filhol

While Sign Languages have no standard written form, many signers do capture their language in some form of spontaneous graphical form. We list a few use cases (discourse preparation, deverbalising for translation, etc.) and give examples of diagrams. After hypothesising that they contain regular patterns of significant value, we propose to build a corpus of such productions. The main contribution of this paper is the specification of the elicitation protocol, explaining the variables that are likely to affect the diagrams collected. We conclude with a report on the current state of a collection following this protocol, and a few observations on the collected contents. A first prospect is the standardisation of a scheme to represent SL discourse in a way that would make them sharable. A subsequent longer-term prospect is for this scheme to be owned by users and with time be shaped into a script for their language.

pdf bib
Signing as Input for a Dictionary Query: Matching Signs Based on Joint Positions of the Dominant Hand
Manolis Fragkiadakis | Victoria Nyst | Peter van der Putten

This study presents a new methodology to search sign language lexica, using a full sign as input for a query. Thus, a dictionary user can look up information about a sign by signing the sign to a webcam. The recorded sign is then compared to potential matching signs in the lexicon. As such, it provides a new way of searching sign language dictionaries to complement existing methods based on (spoken language) glosses or phonological features, like handshape or location. The method utilizes OpenPose to extract the body and finger joint positions. Dynamic Time Warping (DTW) is used to quantify the variation of the trajectory of the dominant hand and the average trajectories of the fingers. Ten people with various degrees of sign language proficiency have participated in this study. Each subject viewed a set of 20 signs from the newly compiled Ghanaian sign language lexicon and was asked to replicate the signs. The results show that DTW can predict the matching sign with 87% and 74% accuracy at the Top-10 and Top-5 ranking levels respectively by using only the trajectory of the dominant hand. Additionally, more proficient signers obtain 90% accuracy at the Top-10 ranking. The methodology has the potential to be used also as a variation measurement tool to quantify the difference in signing between different signers or sign languages in general.
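The core matching step can be sketched with a textbook DTW implementation over 2D keypoint trajectories. This is a simplification of the paper's method: it compares only the dominant-hand trajectory, and real OpenPose output would first need normalization; the lexicon below is invented.

```python
import math

def dtw(seq_a, seq_b):
    """Dynamic Time Warping distance between two 2D point trajectories."""
    n, m = len(seq_a), len(seq_b)
    inf = float('inf')
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(seq_a[i - 1], seq_b[j - 1])  # Euclidean frame distance
            # extend the cheapest of the three admissible warping moves
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def rank_candidates(query, lexicon):
    """Rank lexicon signs by DTW distance to the query trajectory (best first)."""
    return sorted(lexicon, key=lambda name: dtw(query, lexicon[name]))
```

Top-k accuracy then amounts to checking whether the correct sign appears among the first k entries of `rank_candidates`.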

pdf bib
An Isolated-Signing RGBD Dataset of 100 American Sign Language Signs Produced by Fluent ASL Signers
Saad Hassan | Larwan Berke | Elahe Vahdani | Longlong Jing | Yingli Tian | Matt Huenerfauth

We have collected a new dataset consisting of color and depth videos of fluent American Sign Language (ASL) signers performing sequences of 100 ASL signs from a Kinect v2 sensor. This directed dataset had originally been collected as part of an ongoing collaborative project, to aid in the development of a sign-recognition system for identifying occurrences of these 100 signs in video. The set of words consists of vocabulary items that would commonly be learned in a first-year ASL course offered at a university, although the specific set of signs selected for inclusion in the dataset had been motivated by project-related factors. Given increasing interest among sign-recognition and other computer-vision researchers in red-green-blue-depth (RGBD) video, we release this dataset for use by the research community. In addition to the RGB video files, we share depth and HD face data as well as additional features of face, hands, and body produced through post-processing of this data.

pdf bib
Sign Language Motion Capture Dataset for Data-driven Synthesis
Pavel Jedlička | Zdeněk Krňoul | Jakub Kanis | Miloš Železný

This paper presents a new 3D motion capture dataset of Czech Sign Language (CSE). Its main purpose is to provide the data for further analysis and data-based automatic synthesis of CSE utterances. The content of the data in the given limited domain of weather forecasts was carefully selected by CSE linguists to provide the necessary utterances needed to produce any new weather forecast. The dataset was recorded using state-of-the-art motion capture (MoCap) technology to provide the most precise trajectories of the motion. In general, MoCap is a device capable of accurate recording of motion directly in 3D space. The data contains trajectories of body, arm, hand and face markers recorded at once to provide consistent data without the need for time alignment.

pdf bib
Recognition of Static Features in Sign Language Using Key-Points
Ioannis Koulierakis | Georgios Siolas | Eleni Efthimiou | Evita Fotinea | Andreas-Georgios Stafylopatis

In this paper we report on a research effort focusing on recognition of static features of sign formation in single sign videos. Three sequential models have been developed for handshape, palm orientation and location of sign formation respectively, which make use of key-points extracted via the OpenPose software. The models have been applied to a Danish and a Greek Sign Language dataset, providing results of around 96%. Moreover, during the reported research, a method has been developed for identifying the time frame of real signing in the video, which allows transition frames to be ignored during sign recognition processing.

pdf bib
Machine Learning for Enhancing Dementia Screening in Ageing Deaf Signers of British Sign Language
Xing Liang | Bencie Woll | Kapetanios Epaminondas | Anastasia Angelopoulou | Reda Al-Batat

The ageing trend in populations is correlated with an increased prevalence of acquired cognitive impairments such as dementia. Although there is no cure for dementia, a timely diagnosis helps in obtaining necessary support and appropriate medication. With this in mind, researchers are working urgently to develop effective technological tools that can help doctors undertake early identification of cognitive disorder. In this paper, we introduce an automatic dementia screening system for ageing Deaf signers of British Sign Language (BSL), using Convolutional Neural Networks (CNN), by analysing the sign space envelope and facial expression of BSL signers in normal 2D videos from a BSL corpus. Our approach firstly establishes an accurate real-time hand trajectory tracking model, together with a real-time landmark facial motion analysis model, to identify differences in sign space envelope and facial movement as the keys to identifying language changes associated with dementia. Based on the differences in patterns obtained from facial and trajectory motion data, CNN models (ResNet50/VGG16) are fine-tuned using Keras deep learning models to incrementally identify and improve dementia recognition rates. We report the results for two methods using different modalities (sign trajectory and facial motion), together with performance comparisons between the different deep learning CNN models, ResNet50 and VGG16. The experiments show the effectiveness of our deep learning based approach in terms of sign space tracking, facial motion tracking and early-stage dementia performance assessment tasks. The results are validated against cognitive assessment scores as ground truth, with a test set performance of 87.88%.

pdf bib
Machine Translation from Spoken Language to Sign Language using Pre-trained Language Model as Encoder
Taro Miyazaki | Yusuke Morita | Masanori Sano

Sign language is the first language for those who were born deaf or lost their hearing in early childhood, so such individuals require services provided in sign language. To achieve flexible open-domain services with sign language, machine translation into sign language is needed. Machine translation generally requires large-scale training corpora, but there are only small corpora for sign language. To overcome this data-shortage scenario, we developed a method that uses a pre-trained language model of spoken language as the initial model of the encoder of the machine translation model. We evaluated our method by comparing it to baseline methods, including phrase-based machine translation, using only 130,000 phrase pairs of training data. Our method outperformed the baseline method, and we found that one of the sources of translation error is pointing, which is a special feature used in sign language. We also conducted trials to improve the translation quality for pointing. The results are somewhat disappointing, so we believe that there is still room for improving translation quality, especially for pointing.

pdf bib
Automatic Classification of Handshapes in Russian Sign Language
Medet Mukushev | Alfarabi Imashev | Vadim Kimmelman | Anara Sandygulova

Handshapes are one of the basic parameters of signs, and any phonological or phonetic analysis of a sign language must account for them. Many sign languages have been carefully analysed by sign language linguists to create handshape inventories. This has theoretical implications, but also applied use, as such inventories are needed to generate corpora for sign languages that can be searched, filtered and sorted by different sign components (such as handshape, orientation, location, movement, etc.). However, creating them is a very time-consuming process, and thus only a handful of sign languages have such inventories. This work proposes a process for automatically generating such inventories for sign languages by applying automatic hand detection, cropping, and clustering techniques. We applied our proposed method to a commonly used resource, the Spreadthesign online dictionary (www.spreadthesign.com), in particular to Russian Sign Language (RSL). We then manually verified the data to be able to perform classification. The proposed pipeline can thus serve as an alternative to manual annotation, and can help linguists answer numerous research questions concerning handshape frequencies in sign languages.
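The clustering stage could be sketched as a minimal k-means over handshape feature vectors. The abstract does not name the clustering algorithm, so k-means is only a stand-in; the vectors below are invented, and a real pipeline would more likely cluster image embeddings of the cropped hands.

```python
import math
import random

def kmeans(vectors, k, iters=50, seed=0):
    """Minimal k-means: group handshape feature vectors into k clusters."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)  # pick k distinct vectors as initial centers
    labels = [0] * len(vectors)
    for _ in range(iters):
        # assign each vector to its nearest center
        labels = [min(range(k), key=lambda c: math.dist(v, centers[c]))
                  for v in vectors]
        # move each center to the mean of its assigned members
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members)
                                   for dim in zip(*members))
    return labels, centers
```

Each resulting cluster would then be manually inspected and, if coherent, named as one handshape in the inventory.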

pdf bib
BosphorusSign22k Sign Language Recognition Dataset
Oğulcan Özdemir | Ahmet Alp Kındıroğlu | Necati Cihan Camgöz | Lale Akarun

Sign Language Recognition is a challenging research domain. It has recently seen several advancements with the increased availability of data. In this paper, we introduce the BosphorusSign22k, a publicly available large scale sign language dataset aimed at computer vision, video recognition and deep learning research communities. The primary objective of this dataset is to serve as a new benchmark in Turkish Sign Language Recognition for its vast lexicon, the high number of repetitions by native signers, high recording quality, and the unique syntactic properties of the signs it encompasses. We also provide state-of-the-art human pose estimates to encourage other tasks such as Sign Language Production. We survey other publicly available datasets and expand on how BosphorusSign22k can contribute to future research that is being made possible through the widespread availability of similar Sign Language resources. We have conducted extensive experiments and present baseline results to underpin future research on our dataset.

pdf bib
Video-to-HamNoSys Automated Annotation System
Victor Skobov | Yves Lepage

The Hamburg Notation System (HamNoSys) was developed for movement annotation of any sign language (SL) and can be used to produce signing animations for a virtual avatar with the JASigning platform. This provides the potential to use HamNoSys, i.e., strings of characters, as a representation of an SL corpus instead of video material. Processing strings of characters instead of images can significantly contribute to sign language research. However, the complexity of HamNoSys makes it difficult to annotate without a lot of time and effort. Therefore annotation has to be automatized. This work proposes a conceptually new approach to this problem. It includes a new tree representation of the HamNoSys grammar that serves as a basis for the generation of grammatical training data and classification of complex movements using machine learning. Our automatic annotation system relies on HamNoSys grammar structure and can potentially be used on already existing SL corpora. It is retrainable for specific settings such as camera angles, speed, and gestures. Our approach is conceptually different from other SL recognition solutions and offers a developed methodology for future research.

pdf bib
Cross-Lingual Keyword Search for Sign Language
Nazif Can Tamer | Murat Saraçlar

Sign language research most often relies on exhaustively annotated and segmented data, which is scarce even for the most studied sign languages. However, parallel corpora consisting of sign language interpreting are rarely explored. By utilizing such data for the task of keyword search, this work aims to enable information retrieval from sign language with queries from the translated written language. With the written language translations as labels, we train a weakly supervised keyword search model for sign language and further improve the retrieval performance with two context modeling strategies. In our experiments, we compare the gloss retrieval and cross-language retrieval performance on the RWTH-PHOENIX-Weather 2014T dataset.

up

bib (full)
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)

pdf bib
Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL)
Dorothee Beermann | Laurent Besacier | Sakriani Sakti | Claudia Soria

pdf bib
Open-Source High Quality Speech Datasets for Basque, Catalan and Galician
Oddur Kjartansson | Alexander Gutkin | Alena Butryna | Isin Demirsahin | Clara Rivera

This paper introduces new open speech datasets for three of the languages of Spain: Basque, Catalan and Galician. Catalan is furthermore the official language of the Principality of Andorra. The datasets consist of high-quality multi-speaker recordings of the three languages along with the associated transcriptions. The resulting corpora include over 33 hours of crowd-sourced recordings of 132 male and female native speakers. The recording scripts also include material for elicitation of global and local place names, personal and business names. The datasets are released under a permissive license and are available for free download for commercial, academic and personal use. The high-quality annotated speech datasets described in this paper can be used to, among other things, build text-to-speech systems, serve as adaptation data in automatic speech recognition and provide useful phonetic and phonological insights in corpus linguistics.

pdf bib
Morphological Disambiguation of South Sámi with FSTs and Neural Networks
Mika Hämäläinen | Linda Wiechetek

We present a method for conducting morphological disambiguation for South Sámi, an endangered language. Our method uses an FST-based morphological analyzer to produce an ambiguous set of morphological readings for each word in a sentence. These readings are disambiguated with a Bi-RNN model trained on the related North Sámi UD Treebank and some synthetically generated South Sámi data. The disambiguation is done on the level of morphological tags, ignoring word forms and lemmas; this makes it possible to use North Sámi training data for South Sámi without the need for a bilingual dictionary or aligned word embeddings. Our approach requires only minimal resources for South Sámi, which makes it usable and applicable in the context of any other endangered language as well.
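The tag-level idea, scoring candidate readings with a model trained on a related language, can be illustrated with a count-based stand-in for the paper's Bi-RNN: train tag-bigram counts on (related-language) tag sequences, then choose the reading combination with the highest total count. The tags and sentences below are invented toy data.

```python
from collections import Counter
from itertools import product

def train_tag_bigrams(tagged_sentences):
    """Count tag bigrams (with sentence boundaries) from a treebank's tag sequences."""
    counts = Counter()
    for sent in tagged_sentences:
        tags = ['<s>'] + sent + ['</s>']
        counts.update(zip(tags, tags[1:]))
    return counts

def disambiguate(readings, bigrams):
    """Pick one tag per word, maximizing the summed bigram counts over the sentence."""
    best, best_score = None, -1
    for path in product(*readings):  # one candidate tag sequence per combination
        tags = ['<s>'] + list(path) + ['</s>']
        score = sum(bigrams[b] for b in zip(tags, tags[1:]))
        if score > best_score:
            best, best_score = list(path), score
    return best
```

Enumerating `product(*readings)` is exponential in sentence length; a real disambiguator would use Viterbi decoding, and the paper's Bi-RNN replaces the counts with learned scores.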

pdf bib
Neural Text-to-Speech Synthesis for an Under-Resourced Language in a Diglossic Environment: the Case of Gascon Occitan
Ander Corral | Igor Leturia | Aure Séguier | Michäel Barret | Benaset Dazéas | Philippe Boula de Mareüil | Nicolas Quint

Occitan is a minority language spoken in Southern France, some Alpine valleys of Italy, and the Val d’Aran in Spain, which only very recently started developing language and speech technologies. This paper describes the first project for designing a text-to-speech synthesis system for one of its main regional varieties, namely Gascon. We used a state-of-the-art deep neural network approach, the Tacotron2-WaveGlow system. However, we faced two additional difficulties or challenges: on the one hand, we wanted to test whether it was possible to obtain good quality results with fewer recording hours than is usually reported for such systems; on the other hand, we needed to achieve a standard, non-Occitan pronunciation of French proper names, so we needed to record French words and test phoneme-based approaches. The evaluation carried out over the various developed systems and approaches shows promising results with near production-ready quality. It has also allowed us to detect the phenomena for which flaws or drops in quality occur, pointing to directions for future work to improve the quality of the current system and of new systems for other language varieties and voices.

pdf bib
Poio Text Prediction: Lessons on the Development and Sustainability of LTs for Endangered Languages
Gema Zamora Fernández | Vera Ferreira | Pedro Manha

2019, the International Year of Indigenous Languages (IYIL), marked a crucial milestone for a diverse community united by a strong sense of urgency. In this presentation, we evaluate the impact of IYIL’s outcomes on the development of LTs for endangered languages. We give a brief description of the field of Language Documentation, whose experts have led the research and data collection efforts surrounding endangered languages for the past 30 years. We introduce the work of the Interdisciplinary Centre for Social and Language Documentation and we look at Poio as an example of an LT developed specifically with speakers of endangered languages in mind. This example illustrates how the deeper systemic causes of language endangerment are reflected in the development of LTs. Additionally, we share some of the strategic decisions that have guided the development of this project. Finally, we advocate the importance of bridging the divide between research and activism, pushing for the inclusion of threatened languages in the world of LTs, and doing so in close collaboration with the speaker community.

pdf bib
Scaling Language Data Import/Export with a Data Transformer Interface
Nicholas Buckeridge | Ben Foley

This paper focuses on the technical improvement of Elpis, a language technology which assists people in the process of transcription, particularly for low-resource language documentation situations. To provide better support for the diversity of file formats encountered by people working to document the world’s languages, a Data Transformer interface has been developed to abstract the complexities of designing individual data import scripts. This work took place as part of a larger project of code quality improvement and the publication of template code that can be used for development of other language technologies.

pdf bib
Natural Language Processing Chains Inside a Cross-lingual Event-Centric Knowledge Pipeline for European Union Under-resourced Languages
Diego Alves | Gaurish Thakkar | Marko Tadić

This article presents our strategy for developing a platform containing Language Processing Chains for the European Union languages, spanning tokenization to parsing and also including named entity recognition and sentiment analysis. These chains form the first step of an event-centric knowledge processing pipeline whose aim is to process multilingual media information about major events that can have an impact on Europe and the rest of the world. Because languages differ in the availability of language resources, we built this strategy in three steps, starting with processing chains for the well-resourced languages and finishing with the development of new modules for the under-resourced ones. In order to classify all European Union official languages in terms of resources, we analysed the size of annotated corpora as well as the existence of pre-trained models in mainstream language processing tools, and we combined this information with the classification proposed in the META-NET White Paper Series.

pdf bib
Acoustic-Phonetic Approach for ASR of Less Resourced Languages Using Monolingual and Cross-Lingual Information
Shweta Bansal

The exploration of speech processing for endangered languages has increased substantially in recent years. In this paper, we present an acoustic-phonetic approach to automatic speech recognition (ASR) using monolingual and cross-lingual information, with application to the under-resourced Indian languages Punjabi, Nepali and Hindi. The most challenging task while developing the ASR was the collection of acoustic corpora for these under-resourced languages. We briefly describe the strategies used for designing the corpora and highlight the issues encountered while collecting data for these languages. A bootstrap GMM-UBM based approach is used, which integrates a pronunciation lexicon, a language model and an acoustic-phonetic model. Mel-frequency cepstral coefficients were used to extract acoustic features for training in monolingual and cross-lingual settings. The experimental results show the overall performance of the ASR in cross-lingual and monolingual settings. Phone substitution plays a key role in both cross-lingual and monolingual recognition. The cross-lingual recognition results were compared with other baseline systems, and we found that the performance of the recognition system depends on the phonemic units. Cross-lingual recognition rates generally decline compared with monolingual ones.

pdf bib
A Sentiment Analysis Dataset for Code-Mixed Malayalam-English
Bharathi Raja Chakravarthi | Navya Jose | Shardul Suryawanshi | Elizabeth Sherly | John Philip McCrae

There is an increasing demand for sentiment analysis of text from social media, which is mostly code-mixed. Systems trained on monolingual data fail on code-mixed data due to the complexity of mixing at different levels of the text. However, very few resources are available for creating models specific to code-mixed data. Although much research in multilingual and cross-lingual sentiment analysis has used semi-supervised or unsupervised methods, supervised methods still perform better. Only a few datasets for popular language pairs such as English-Spanish, English-Hindi, and English-Chinese are available, and no resources exist for Malayalam-English code-mixed data. This paper presents a new gold-standard corpus for sentiment analysis of code-mixed Malayalam-English text, annotated by voluntary annotators, which obtained a Krippendorff’s alpha above 0.8. We use this new corpus to provide a benchmark for sentiment analysis in Malayalam-English code-mixed texts.
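The abstract above reports inter-annotator agreement with Krippendorff’s alpha. As an illustration only (not the authors’ code), a minimal sketch of the nominal-data variant, computed from a coincidence matrix:

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists: one list of annotator labels per item
    (items with fewer than two labels are skipped).
    """
    # Coincidence matrix: each ordered label pair within a unit,
    # weighted by 1/(m-1) for a unit with m labels.
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1 / (m - 1)
    # Marginals and totals.
    n_c = Counter()
    for (a, _), w in coincidences.items():
        n_c[a] += w
    n = sum(n_c.values())
    # Observed vs. expected disagreement.
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(n_c[a] * n_c[b] for a, b in permutations(n_c, 2)) / (n - 1)
    if expected == 0:
        return 1.0
    return 1 - observed / expected
```

For perfectly agreeing annotators the function returns 1.0; for a single item with two conflicting labels it returns 0.0, since observed disagreement equals the chance-expected disagreement.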

pdf bib
Macsen: A Voice Assistant for Speakers of a Lesser Resourced Language
Dewi Jones

This paper reports on the development of a voice assistant mobile app for speakers of a lesser-resourced language, Welsh. An assistant with a smaller set of effective but useful skills is both desirable and urgent for the wider Welsh-speaking community. Descriptions of the app’s skills, architecture, design decisions and user interface are provided, before elaborating on the most recent research and activities in open-source speech technology for Welsh. The paper reports on the progress to date in crowdsourcing Welsh speech data in Mozilla Common Voice and on its suitability for training Mozilla’s DeepSpeech speech recognition for a voice assistant application, using conventional and transfer learning methods. We demonstrate that with smaller datasets of speech data, transfer learning and a domain-specific language model, acceptable speech recognition is achievable that facilitates, as confirmed by beta users, a practical and useful voice assistant for Welsh speakers. We hope that this work informs and serves as a model for researchers and developers in other lesser-resourced linguistic communities and helps bring into being voice assistant apps for their languages.

pdf bib
Gender Detection from Human Voice Using Tensor Analysis
Prasanta Roy | Parabattina Bhagath | Pradip Das

Speech-based communication is one of the most preferred modes of communication for humans. The human voice contains important information and clues that help in interpreting the voice message. The gender of a speaker, for instance, can be accurately guessed by a listener from the voice alone. Knowledge of the speaker’s gender can be a great aid in designing accurate speech recognition systems. GMM-based classifiers are a popular choice for gender detection. In this paper, we propose a tensor-based approach for detecting the gender of a speaker and discuss its implementation details for low-resource languages. Experiments were conducted using the TIMIT and SHRUTI datasets. An average gender detection accuracy of 91% is recorded. An analysis of the results with the proposed method is also presented.

pdf bib
Data-Driven Parametric Text Normalization: Rapidly Scaling Finite-State Transduction Verbalizers to New Languages
Sandy Ritchie | Eoin Mahon | Kim Heiligenstein | Nikos Bampounis | Daan van Esch | Christian Schallhart | Jonas Mortensen | Benoit Brard

This paper presents a methodology for rapidly generating FST-based verbalizers for ASR and TTS systems by efficiently sourcing language-specific data. We describe a questionnaire which collects the necessary data to bootstrap the number grammar induction system and parameterize the verbalizer templates described in Ritchie et al. (2019), and a machine-readable data store which allows the data collected through the questionnaire to be supplemented by additional data from other sources. This system allows us to rapidly scale technologies such as ASR and TTS to more languages, including low-resource languages.
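To illustrate what a verbalizer does, here is a hand-written English cardinal-number sketch; this is illustrative only, not the parameterized FST grammars the paper induces:

```python
# Rule-based cardinal verbalization for 0..999999 (English-only sketch).
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def verbalize(n: int) -> str:
    """Map a digit string's integer value to its spoken form."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rem = divmod(n, 10)
        return TENS[tens] + (" " + ONES[rem] if rem else "")
    if n < 1000:
        hundreds, rem = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + verbalize(rem) if rem else "")
    thousands, rem = divmod(n, 1000)
    return verbalize(thousands) + " thousand" + (" " + verbalize(rem) if rem else "")
```

An FST verbalizer compiles rules of exactly this shape into a transducer, so that the same machinery can run in both ASR (words to digits) and TTS (digits to words) directions.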

pdf bib
Adapting a Welsh Terminology Tool to Develop a Cornish Dictionary
Delyth Prys

Cornish and Welsh are closely related Celtic languages, and this paper provides a brief description of a recent project to publish an online bilingual English/Cornish dictionary, the Gerlyver Kernewek, based on similar work previously undertaken for Welsh. Both languages are endangered, Cornish critically so, but both can benefit from the use of language technology. Welsh has previous experience of using language technologies for language revitalization, and this is now being used to help the Cornish language create new tools and resources, including lexicographical ones, helping a dispersed team of language specialists and editors, many of them working in a voluntary capacity, to collaborate online. Details are given of the Maes T dictionary writing and publication platform, originally developed for Welsh, and of some of the adaptations that had to be made to accommodate the specific needs of Cornish, including its use of Middle and Late varieties owing to its development as a revived language.

pdf bib
On the Exploration of English to Urdu Machine Translation
Sadaf Abdul Rauf | Syeda Abida | Noor-e- Hira | Syeda Zahra | Dania Parvez | Javeria Bashir | Qurat-ul-ain Majid

Machine translation is an indispensable technology for reducing communication barriers in today’s world. It has made substantial progress in recent years and is widely used in commercial as well as non-profit sectors, but this is only the case for European and other high-resource languages. For the English-Urdu language pair, the technology is in its infancy due to a scarcity of resources. The present research is an important milestone in English-Urdu machine translation, as we present results for four major domains, Biomedical, Religious, Technological and General, using statistical and neural machine translation. We performed a series of experiments to optimize the performance of each system and to study the impact of data sources on the systems. Finally, we compare the data sources and examine the effect of language model size on statistical machine translation performance.

pdf bib
Adapting Language Specific Components of Cross-Media Analysis Frameworks to Less-Resourced Languages: the Case of Amharic
Yonas Woldemariam | Adam Dahlgren

We present an ASR-based pipeline for Amharic that orchestrates NLP components within a cross-media analysis framework (CMAF). One of the major challenges inherently associated with CMAFs is effectively addressing multilingual issues; as a result, many languages remain under-resourced and fail to benefit from available media analysis solutions. Although Amharic is spoken natively by over 22 million people and there is an ever-increasing amount of Amharic multimedia content on the Web, querying it with simple text search is difficult. Searching audio or video content with simple keywords is even harder, as such content exists in raw form. In this study, we introduce a spoken and textual content processing workflow into a CMAF for Amharic. We design an ASR-named entity recognition (NER) pipeline that includes three main components: ASR, a transliterator and NER. We explore various acoustic modeling techniques and develop an OpenNLP-based NER extractor along with a transliterator that interfaces between ASR and NER. The designed ASR-NER pipeline for Amharic promotes the multilingual support of CMAFs. The state-of-the-art design principles and techniques employed in this study also shed light on the path for other less-resourced languages, particularly Semitic ones.

pdf bib
Owksape – An Online Language Learning Platform for Lakota
Jan Ullrich | Elliot Thornton | Peter Vieira | Logan Swango | Marek Kupiec

This paper presents Owksape, an online language learning platform for the under-resourced language Lakota. The Lakota language (Lakȟótiyapi) is a Siouan language native to the United States with fewer than 2000 fluent speakers. Owksape was developed by The Language Conservancy to support revitalization efforts, including reaching younger generations and providing a tool to complement traditional teaching methods. This project grew out of various multimedia resources in order to combine their most effective aspects into a single, self-paced learning tool. The first section of this paper discusses the motivation for and background of Owksape. Section two details the linguistic features and language documentation principles that form the backbone of the platform. Section three lays out the unique integration of cultural aspects of the Lakota people into the visual design of the application. Section four explains the pedagogical principles of Owksape. Application features and exercise types are then discussed in detail with visual examples, followed by an overview of the software design, as well as the effort required to develop the platform. Finally, a description of future features and considerations is presented.

pdf bib
Speech Transcription Challenges for Resource Constrained Indigenous Language Cree
Vishwa Gupta | Gilles Boulianne

Cree is one of the most widely spoken Indigenous languages in Canada. From a speech recognition perspective, it is a low-resource language, since very little data is available for either acoustic or language modeling. This has prevented the development of speech technology that could help revitalize the language. We describe our experiments with available Cree data to improve automatic transcription in both speaker-independent and speaker-dependent scenarios. While it was difficult to get low speaker-independent word error rates with only six speakers, we were able to get low word and phoneme error rates in the speaker-dependent scenario. We compare our phoneme recognition with two state-of-the-art open-source phoneme recognition toolkits, which use end-to-end training and sequence-to-sequence modeling. Our phoneme error rate (8.7%) is significantly lower than that achieved by the best of these systems (15.1%). With these systems and varying amounts of transcribed and text data, we show that pre-training on other languages is important for speaker-independent recognition, and that even small amounts of additional text-only documents are useful. These results can guide practical language documentation work when deciding how much transcribed and text data is needed to achieve useful phoneme accuracies.
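The phoneme error rates quoted above are standard Levenshtein-alignment scores: edit distance between reference and hypothesis phoneme sequences, normalized by reference length. A minimal sketch of the computation (illustrative only, not the authors’ scoring tooling):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two symbol sequences (rolling-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution (0 if match)
        prev = cur
    return prev[-1]

def phoneme_error_rate(ref_phones, hyp_phones):
    """PER = (substitutions + deletions + insertions) / reference length."""
    return edit_distance(ref_phones, hyp_phones) / len(ref_phones)
```

Word error rate is the same computation applied to word sequences instead of phoneme sequences.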


pdf (full)
bib (full)
Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media

pdf bib
Proceedings of the Eighth International Workshop on Natural Language Processing for Social Media
Lun-Wei Ku | Cheng-Te Li

pdf bib
Enhancing Bias Detection in Political News Using Pragmatic Presupposition
Lalitha Kameswari | Dama Sravani | Radhika Mamidi

Usage of presuppositions in social media and news discourse can be a powerful way to influence readers, as they usually tend not to examine the truth value of hidden or indirectly expressed information. Fairclough and Wodak (1997) discuss presupposition at a discourse level, where some implicit claims are taken for granted in the explicit meaning of a text or utterance. From the Gricean perspective, the presuppositions of a sentence determine the class of contexts in which the sentence could be felicitously uttered. This paper aims to correlate the type of knowledge presupposed in a news article with the bias present in it. We propose a set of guidelines for identifying various kinds of presuppositions in news articles and present a dataset of 1050 articles annotated for bias (positive, negative or neutral) and the magnitude of presupposition. We introduce a supervised classification approach for detecting bias in political news which significantly outperforms existing systems.

pdf bib
NARMADA: Need and Available Resource Managing Assistant for Disasters and Adversities
Kaustubh Hiware | Ritam Dutt | Sayan Sinha | Sohan Patro | Kripa Ghosh | Saptarshi Ghosh

Although a lot of research has been done on utilising online social media during disasters, there exists no system for a specific task that is critical in a post-disaster scenario: identifying resource-needs and resource-availabilities in the disaster-affected region, coupled with their subsequent matching. To this end, we present NARMADA, a semi-automated platform which leverages crowd-sourced information from social media posts to assist post-disaster relief coordination efforts. The system employs natural language processing and information retrieval techniques for identifying resource-needs and resource-availabilities in microblogs, extracting resources from the posts, and matching the needs to suitable availabilities. The system is thus capable of facilitating the judicious management of resources during post-disaster relief operations.


bib (full) Proceedings for the First International Workshop on Social Threats in Online Conversations: Understanding and Management

pdf bib
Proceedings for the First International Workshop on Social Threats in Online Conversations: Understanding and Management
Archna Bhatia | Samira Shaikh

pdf bib
A Privacy Preserving Data Publishing Middleware for Unstructured, Textual Social Media Data
Prasadi Abeywardana | Uthayasanker Thayasivam

Privacy is going to be an integral part of data science and analytics in the coming years. The next wave of data experimentation will depend heavily on privacy-preserving techniques, mainly because privacy is going to be a legal responsibility rather than a mere social responsibility. Privacy preservation becomes even more challenging in the context of unstructured data. Social networks have become predominantly popular over the past couple of decades and are creating a huge data lake at high velocity. Social media profiles contain a wealth of personal and sensitive information, creating enormous opportunities for third parties to analyze them with different algorithms, draw conclusions, and use them in disinformation campaigns and micro-targeting based dark advertising. This study provides a mitigation mechanism for disinformation campaigns that are based on insights extracted from personal/sensitive data analysis. Specifically, this research aims to build a privacy-preserving data publishing middleware for unstructured social media data without compromising its true analytical value. A novel way is proposed to apply traditional structured privacy-preserving techniques to unstructured data. Creating a comprehensive Twitter corpus annotated with privacy attributes is another objective of this research, especially because the research community lacks one.

pdf bib
Information Space Dashboard
Theresa Krumbiegel | Albert Pritzkau | Hans-Christian Schmitz

The information space, where information is generated, stored, exchanged and discussed, is not idyllic but a space where campaigns of disinformation and destabilization are conducted. Such campaigns are subsumed under the terms hybrid warfare and information warfare (Woolley and Howard, 2017). In order to enable awareness of them, we propose an information state dashboard comprising various components/apps for data collection, analysis and visualization. The aim of the dashboard is to support an analyst in generating a common operational picture of the information space, link it with an operational picture of the physical space and thus contribute to overarching situational awareness. The dashboard is work in progress; however, a first prototype with components for exploiting elementary language statistics, keyword and metadata analysis, text classification and network analysis has been implemented. Further components, in particular for event extraction and sentiment analysis, are under development. As a demonstration case, we briefly discuss the analysis of historical data regarding violent anti-migrant protests and respective counter-protests that took place in Chemnitz in 2018.


bib (full) Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying

pdf bib
Proceedings of the Second Workshop on Trolling, Aggression and Cyberbullying
Ritesh Kumar | Atul Kr. Ojha | Bornini Lahiri | Marcos Zampieri | Shervin Malmasi | Vanessa Murdock | Daniel Kadar

pdf bib
Aggression Identification in Social Media: a Transfer Learning Based Approach
Faneva Ramiandrisoa | Josiane Mothe

The way people communicate has changed in many ways with the rise of social media. One aspect of social media is the ability of information producers to hide their identity, fully or partially, during a discussion, which can lead to cyber-aggression and interpersonal aggression. Automatically monitoring user-generated content in order to help moderate it is thus a very hot topic. In this paper, we propose to use the transformer-based language model BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2019) to identify aggressive content. Our model is also used to predict the level of aggressiveness. The evaluation part of this paper is based on the dataset provided by the TRAC shared task (Kumar et al., 2018a). When compared to the other participants in this shared task, our model achieved the third best performance according to the weighted F1 measure on both the Facebook and Twitter collections.

pdf bib
A Comparative Study of Different State-of-the-Art Hate Speech Detection Methods in Hindi-English Code-Mixed Data
Priya Rani | Shardul Suryawanshi | Koustava Goswami | Bharathi Raja Chakravarthi | Theodorus Fransen | John Philip McCrae

Hate speech detection in social media communication has become one of the primary concerns for avoiding conflicts and curbing undesired activities. In an environment where multilingual speakers switch among multiple languages, hate speech detection becomes a challenging task when using methods designed for monolingual corpora. In our work, we attempt to analyze and detect hate speech in code-mixed social media text and provide a comparative study. We also provide a Hindi-English code-mixed dataset consisting of Facebook and Twitter posts and comments. Our experiments show that deep learning models trained on this code-mixed corpus perform better.

pdf bib
Bagging BERT Models for Robust Aggression Identification
Julian Risch | Ralf Krestel

Modern transformer-based models with hundreds of millions of parameters, such as BERT, achieve impressive results at text classification tasks. This also holds for aggression identification and offensive language detection, where deep learning approaches consistently outperform less complex models, such as decision trees. While the complex models fit the training data well (low bias), they also come with an unwanted high variance. Especially when fine-tuning them on small datasets, the classification performance varies significantly for slightly different training data. To overcome the high variance and provide more robust predictions, we propose an ensemble of multiple fine-tuned BERT models based on bootstrap aggregating (bagging). In this paper, we describe such an ensemble system and present our submission to the shared tasks on aggression identification 2020 (team name: Julian). Our submission is the best-performing system for five out of six subtasks. For example, we achieve a weighted F1-score of 80.3% for task A on the test dataset of English social media posts. In our experiments, we compare different model configurations and vary the number of models used in the ensemble. We find that the F1-score increases drastically when ensembling up to 15 models, but the returns diminish with more models.
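The bootstrap-aggregating scheme described above can be sketched generically as follows. The 1-nearest-neighbour "model" here is a toy stand-in for a fine-tuned BERT classifier, and all names and data are illustrative, not the authors' system:

```python
import random
from collections import Counter

def bag(train_fn, data, n_models=15, seed=0):
    """Train `n_models` classifiers, each on a bootstrap resample of `data`."""
    rng = random.Random(seed)
    return [train_fn([rng.choice(data) for _ in data]) for _ in range(n_models)]

def vote(models, x):
    """Aggregate the ensemble by majority vote over its predictions."""
    return Counter(m(x) for m in models).most_common(1)[0][0]

# Toy stand-in for an expensive base learner: a 1-nearest-neighbour
# classifier over (feature, label) pairs with a single numeric feature.
def train_1nn(sample):
    return lambda x: min(sample, key=lambda p: abs(p[0] - x))[1]
```

Because every model sees a slightly different resample, the individual high-variance predictions decorrelate and the majority vote is more stable, which is the effect the paper exploits with 15 fine-tuned BERT models.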

pdf bib
Scmhl5 at TRAC-2 Shared Task on Aggression Identification: Bert Based Ensemble Learning Approach
Han Liu | Pete Burnap | Wafa Alorainy | Matthew Williams

This paper presents a system developed during our participation (team name: scmhl5) in the TRAC-2 Shared Task on aggression identification. In particular, we participated in English Sub-task A, a three-class classification task (‘Overtly Aggressive’, ‘Covertly Aggressive’ and ‘Non-aggressive’), and English Sub-task B, a binary classification task for misogynistic aggression (‘gendered’ or ‘non-gendered’). For both sub-tasks, our method uses the pre-trained Bert model to encode the text of each instance into a 768-dimensional embedding vector, and then trains an ensemble of classifiers on the embedding features. Our method obtained an accuracy of 0.703 and a weighted F-measure of 0.664 for Sub-task A, whereas for Sub-task B the accuracy was 0.869 and the weighted F-measure was 0.851. In terms of rankings, the weighted F-measure obtained using our method is ranked 10th out of 16 teams for Sub-task A and 8th out of 15 teams for Sub-task B.

pdf bib
Spyder: Aggression Detection on Multilingual Tweets
Anisha Datta | Shukrity Si | Urbi Chakraborty | Sudip Kumar Naskar

In the last few years, hate speech and aggressive comments have spread across almost all social media platforms, such as Facebook and Twitter, and as a result hatred is increasing. This paper describes our (team name: Spyder) participation in the Shared Task on Aggression Detection organised by TRAC-2, the Second Workshop on Trolling, Aggression and Cyberbullying. The organizers provided datasets in three languages: English, Hindi and Bengali. The task was to classify each instance of the test sets into three categories: “Overtly Aggressive” (OAG), “Covertly Aggressive” (CAG) and “Non-Aggressive” (NAG). In this paper, we propose three different models using TF-IDF, sentiment polarity and machine learning based classifiers. We obtained F1 scores of 43.10%, 59.45% and 44.84% for English, Hindi and Bengali, respectively.
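As a reminder of how the TF-IDF features mentioned above are built, here is a textbook sketch (not the authors' exact weighting scheme, which may differ in normalization and smoothing):

```python
import math
from collections import Counter

def tfidf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.

    TF is the term's relative frequency within a document; IDF is
    log(N / document-frequency), so terms occurring in every document
    receive weight zero.
    """
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return weights
```

The resulting per-document weight dictionaries can then be fed as sparse feature vectors into any conventional classifier, which is the pipeline shape the abstract describes.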

pdf bib
BERT of all trades, master of some
Denis Gordeev | Olga Lykova

This paper describes our results for the TRAC 2020 competition held together with the LREC 2020 conference. Our team name was Ms8qQxMbnjJMgYcw. The competition consisted of two subtasks in three languages (Bengali, English and Hindi), where the participants’ task was to classify aggression in short texts from social media and decide whether it is gendered or not. We used a single BERT-based system with two outputs for all tasks simultaneously. Our model placed first in the English and second in the Bengali gendered text classification tasks, with F1-scores of 0.87 and 0.93 respectively.

pdf bib
FlorUniTo@TRAC-2: Retrofitting Word Embeddings on an Abusive Lexicon for Aggressive Language Detection
Anna Koufakou | Valerio Basile | Viviana Patti

This paper describes our participation in the TRAC-2 Shared Tasks on Aggression Identification. Our team, FlorUniTo, investigated the applicability of using an abusive lexicon to enhance word embeddings for improved detection of aggressive language. The embeddings used in our paper are word-aligned pre-trained vectors for English, Hindi, and Bengali, reflecting the languages in the shared task datasets. The embeddings are retrofitted to a multilingual abusive lexicon, HurtLex. We experimented with an LSTM model using the original as well as the transformed embeddings and different language and setting variations. Overall, our systems placed toward the middle of the official rankings based on weighted F1 score. However, the results on the development and test sets show promising improvements across languages, especially on the misogynistic aggression sub-task.
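Retrofitting, as used above, nudges each word vector toward its lexicon neighbours while staying anchored to its original position. A minimal sketch of the iterative update of Faruqui et al. (2015) with uniform weights (illustrative only, not the FlorUniTo code):

```python
def retrofit(embeddings, lexicon, iterations=10):
    """Retrofit word vectors toward a relational lexicon.

    embeddings: {word: [float, ...]}; lexicon: {word: [neighbour, ...]}.
    Each update averages the original vector with the mean of the
    current neighbour vectors (alpha = sum of betas = 1).
    """
    new = {w: list(v) for w, v in embeddings.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            nbrs = [n for n in neighbours if n in new]
            if word not in new or not nbrs:
                continue
            beta = 1.0 / len(nbrs)  # uniform neighbour weight
            for d in range(len(new[word])):
                num = embeddings[word][d] + beta * sum(new[n][d] for n in nbrs)
                new[word][d] = num / 2.0  # denominator: alpha (1) + sum of betas (1)
    return new
```

With an abusive lexicon such as HurtLex, the neighbour lists group words by abusive category, pulling offensive terms closer together in the embedding space before the LSTM is trained.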

pdf bib
Multilingual Joint Fine-tuning of Transformer models for identifying Trolling, Aggression and Cyberbullying at TRAC 2020
Sudhanshu Mishra | Shivangi Prasad | Shubhanshu Mishra

We present the approach of our team ‘3Idiots’ (referred to as ‘sdhanshu’ in the official rankings) for the Trolling, Aggression and Cyberbullying (TRAC) 2020 shared tasks. Our approach relies on fine-tuning various Transformer models on the different datasets. We also investigated the utility of task label marginalization, joint label classification, and joint training on multilingual datasets as possible improvements to our models. Our team came second in English sub-task A, a close fourth in English sub-task B and third in the remaining four sub-tasks. We find the multilingual joint training approach to be the best trade-off between the computational efficiency of model deployment and the model’s evaluation performance. We open-source our approach at https://github.com/socialmediaie/TRAC2020.

pdf bib
Aggression and Misogyny Detection using BERT: A Multi-Task Approach
Niloofar Safi Samghabadi | Parth Patwa | Srinivas PYKL | Prerana Mukherjee | Amitava Das | Thamar Solorio

In recent times, the focus of the NLP community has increasingly turned towards offensive language, aggression, and hate-speech detection. This paper presents our system for the TRAC-2 shared task on Aggression Identification (sub-task A) and Misogynistic Aggression Identification (sub-task B). The data for this shared task is provided in three different languages: English, Hindi, and Bengali. Each data instance is annotated with one of three aggression classes (Not Aggressive, Covertly Aggressive, Overtly Aggressive) as well as one of two misogyny classes (Gendered and Non-Gendered). We propose an end-to-end neural model using attention on top of BERT that incorporates a multi-task learning paradigm to address both sub-tasks simultaneously. Our team, na14, scored a 0.8579 weighted F1-measure on English sub-task B and secured 3rd rank out of 15 teams for the task. The code and the model weights are publicly available at https://github.com/NiloofarSafi/TRAC-2. Keywords: Aggression, Misogyny, Abusive Language, Hate-Speech Detection, BERT, NLP, Neural Networks, Social Media

pdf bib
Lexicon-Enhancement of Embedding-based Approaches Towards the Detection of Abusive Language
Anna Koufakou | Jason Scott

Detecting abusive language is a significant research topic which has received a lot of attention recently. Our work focuses on detecting personal attacks in online conversations. As previous research on this task has largely used deep learning based on embeddings, we explore the use of lexicons to enhance embedding-based methods and see how these methods apply to the particular task of detecting personal attacks. The methods implemented and experimented with in this paper differ considerably from each other, not only in the type of lexicons they use (sentiment or semantic), but also in the way they use the knowledge from the lexicons to construct or modify the embeddings that are ultimately fed into the learning model. The sentiment lexicon approaches focus on integrating sentiment information (in the form of sentiment embeddings) into the learning model. The semantic lexicon approaches focus on transforming the original word embeddings so that they better represent relationships extracted from a semantic lexicon. Based on our experimental results, the semantic lexicon methods are superior to the other methods in this paper, with at least a 4% macro-averaged F1 improvement over the baseline.

up

bib (full) Proceedings of the 12th Web as Corpus Workshop

pdf bib
Proceedings of the 12th Web as Corpus Workshop
Adrien Barbaresi | Felix Bildhauer | Roland Schäfer | Egon Stemle

pdf bib
Current Challenges in Web Corpus Building
Miloš Jakubíček | Vojtěch Kovář | Pavel Rychlý | Vit Suchomel

In this paper we discuss some of the current challenges in web corpus building that we have faced in recent years when expanding the corpora in Sketch Engine. The purpose of the paper is to provide an overview and raise discussion on possible solutions, rather than to bring ready solutions to the readers. For every issue we try to assess its severity and briefly discuss possible mitigation options.

pdf bib
The ELTE.DH Pilot Corpus – Creating a Handcrafted Gigaword Web Corpus with Metadata
Balázs Indig | Árpád Knap | Zsófia Sárközi-Lindner | Mária Timári | Gábor Palkó

In this article, we present the method we used to create a middle-sized corpus using targeted web crawling. Our corpus contains news portal articles along with their metadata, which can be useful for diverse audiences ranging from digital humanists to NLP users. The method presented in this paper applies rule-based components that allow curation of the text and the metadata content. The curated data can thereafter serve as a reference for various tasks and measurements. We designed our workflow to encourage modification and customisation. Our concept can also be applied to other genres of portals by using the discovered patterns in the architecture of the portals. We found that for the systematic creation or extension of a similar corpus, our method provides superior accuracy and ease of use compared to The Wayback Machine, while requiring minimal manpower and computational resources. Reproducing the corpus is possible if changes are introduced to the text-extraction process. The standard TEI format and Schema.org-encoded metadata are used for the output format, but we stress that placing the corpus in a digital repository system is recommended in order to be able to define semantic relations between the segments and to add rich annotation.

up

bib (full) Proceedings of the WILDRE5 – 5th Workshop on Indian Language Data: Resources and Evaluation

pdf bib
Proceedings of the WILDRE5 – 5th Workshop on Indian Language Data: Resources and Evaluation
Girish Nath Jha | Kalika Bali | Sobha L. | S. S. Agrawal | Atul Kr. Ojha

pdf bib
Handling Noun-Noun Coreference in Tamil
Vijay Sundar Ram | Sobha Lalitha Devi

Natural language understanding by automatic tools is a vital requirement for document processing tools. To achieve it, an automatic system has to understand the coherence in the text. Coreference chains bring coherence to the text. The commonly occurring reference markers which bring cohesiveness are pronominals, reflexives, reciprocals, distributives, one-anaphors, and noun-noun reference. In this paper, we deal with noun-noun reference in Tamil. We present a methodology to resolve these noun-noun anaphors and also present the challenges in handling noun-noun anaphoric relations in Tamil.

pdf bib
Determination of Idiomatic Sentences in Paragraphs Using Statement Classification and Generalization of Grammar Rules
Naziya Shaikh

Translation systems are often not able to determine the presence of an idiom in a given paragraph. Because of this, many systems tend to return word-for-word translations of such statements, leading to a loss of the flavor of the idioms in the paragraph. This paper suggests a novel approach to efficiently determine the probability that any statement in a given English paragraph is an idiom. This approach combines rule-based generalization of idioms in the English language and classification of statements based on context to determine the idioms in a sentence. The context-based classification method can be further used for the determination of idioms in regional Indian languages such as Marathi, Konkani, and Hindi, as the difference between the semantic context of the proverb and the context of the paragraph is also evident in these languages.

pdf bib
A Deeper Study on Features for Named Entity Recognition
Malarkodi C S | Sobha Lalitha Devi

This paper deals with the various features used for the identification of named entities. The performance of a machine learning system depends heavily on the feature selection criteria. The intention to trace the essential features required for the development of a named entity system across languages motivated us to conduct this study. A linguistic analysis was done to find the part-of-speech patterns surrounding the context of named entities, and from these observations linguistically oriented features were identified for both Indian and European languages. The Indian languages used belong to the Dravidian family (Tamil, Telugu, Malayalam) and the Indo-Aryan family (Hindi, Punjabi, Bengali, Marathi); the European languages are English, Spanish, Dutch, German, and Hungarian. The machine learning technique CRFs was used for system development. The experiments were conducted using the linguistic features, and the results obtained for each language are comparable with state-of-the-art systems.

up

pdf (full)
bib (full)
Proceedings of the Fourth Widening Natural Language Processing Workshop

pdf bib
Proceedings of the Fourth Widening Natural Language Processing Workshop
Rossana Cunha | Samira Shaikh | Erika Varis | Ryan Georgi | Alicia Tsai | Antonios Anastasopoulos | Khyathi Raghavi Chandu

bib
Corpus based Amharic sentiment lexicon generation
Girma Neshir Alemneh | Andreas Rauber | Solomon Atnafu

Sentiment classification is an active research area with several applications, including analysis of political opinions and classification of comments, movie reviews, news reviews, and product reviews. To employ rule-based sentiment classification, we require sentiment lexicons. However, manual construction of a sentiment lexicon is time-consuming and costly for resource-limited languages. To bypass manual development time and costs, we build Amharic sentiment lexicons relying on a corpus-based approach. The intention of this approach is to handle sentiment terms specific to the Amharic language, drawn from an Amharic corpus. A small set of seed terms is manually prepared from three parts of speech: noun, adjective, and verb. We developed algorithms for constructing Amharic sentiment lexicons automatically from an Amharic news corpus. The corpus-based approach relies on word co-occurrence distributional embeddings, including frequency-based embeddings (i.e., Positive Point-wise Mutual Information, PPMI). Using PPMI with threshold values of 100 and 200, we obtained corpus-based Amharic sentiment lexicons of size 1811 and 3794, respectively, by expanding 519 seeds. Finally, the lexicon generated by the corpus-based approach is evaluated.
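
The seed-expansion step described above can be sketched in a few lines. This is a minimal illustration of PPMI-based lexicon expansion, not the authors' implementation; the sentence-level co-occurrence window, the seed list, and the threshold semantics are all illustrative assumptions.

```python
import math
from collections import Counter
from itertools import combinations

def ppmi_expand(sentences, seeds, threshold=0.0):
    """Add to the seed lexicon every word whose PPMI co-occurrence
    with at least one seed exceeds the threshold."""
    word_counts, pair_counts = Counter(), Counter()
    for sent in sentences:
        tokens = sent.split()
        word_counts.update(tokens)
        # sentence-level co-occurrence window (an illustrative choice)
        for a, b in combinations(set(tokens), 2):
            pair_counts[frozenset((a, b))] += 1
    total = sum(word_counts.values())
    expanded = set(seeds)
    for word in word_counts:
        if word in seeds:
            continue
        for seed in seeds:
            pair = pair_counts[frozenset((word, seed))]
            if pair == 0:
                continue
            # PPMI(w, s) = max(0, log2(P(w, s) / (P(w) * P(s))))
            pmi = math.log2((pair / total) /
                            ((word_counts[word] / total) *
                             (word_counts[seed] / total)))
            if max(pmi, 0.0) > threshold:
                expanded.add(word)
                break
    return expanded
```

With a seed set such as {"good"}, words that strongly co-occur with the seeds are pulled into the lexicon; a threshold (presumably the role of the 100 and 200 values mentioned in the abstract) filters out weakly associated candidates.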

bib
Negation handling for Amharic sentiment classification
Girma Neshir Alemneh | Andreas Rauber | Solomon Atnafu

User-generated content is bringing new aspects of processing data on the web. Due to the advancement of World Wide Web technology, users are not only consumers of web content but also producers of content in the form of text, audio, video, and pictures. This study focuses on the analysis of textual content with subjective information (i.e., sentiment analysis). Most conventional approaches to sentiment analysis do not effectively capture negation in languages with limited computational linguistic resources (e.g., Amharic). For this research, we propose an Amharic negation handling framework for Amharic sentiment classification. The proposed framework combines a lexicon-based sentiment classification approach and character n-gram based machine learning algorithms. Finally, the performance of the framework is evaluated using annotated Amharic news comments. The system performs best of all the models and baselines, with an accuracy of 98.0. The result is compared with the baselines (without negation handling, and a word-level n-gram model).
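
The interaction between a sentiment lexicon and negation can be illustrated with a toy scorer. This is only a sketch of the general lexicon-plus-negation idea, with English placeholders standing in for Amharic entries; the lexicon, the cue list, and the to-end-of-clause scope rule are assumptions, not the framework from the paper.

```python
# Lexicon-based sentiment scoring with simple negation handling:
# a negation cue flips the polarity of lexicon words in its scope
# (here: all following tokens). Lexicon and cues are illustrative.
LEXICON = {"good": 1, "great": 1, "bad": -1, "poor": -1}
NEGATION_CUES = {"not", "never", "no"}

def score(tokens):
    total, negated = 0, False
    for tok in tokens:
        if tok in NEGATION_CUES:
            negated = True  # open a negation scope
            continue
        polarity = LEXICON.get(tok, 0)
        total += -polarity if negated else polarity
    return total
```

Without such handling, "not good" and "good" would receive the same positive score, which is exactly the failure mode the abstract describes for conventional approaches.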

bib
Embedding Oriented Adaptable Semantic Annotation Framework for Amharic Web Documents
Kidane Woldemariyam | Dr. Fekade Getahun

The Web has become a source of information where information is provided by humans for humans, and its growth has increased the necessity for solutions that intelligently extract valuable knowledge from existing and newly added web documents with no (or minimal) supervision. However, due to the unstructured nature of existing data on the Web, effective extraction of this knowledge is limited for both human beings and software agents. Thus, this research work designed a generic, embedding-oriented framework that automatically annotates Amharic web documents semantically using an ontology. This framework significantly reduces the manual annotation and learning cost of semantic annotation of Amharic web documents, and is adaptable with minimal modification. The results also imply that neural network techniques are promising for semantic annotation, especially for less-resourced languages like Amharic, in comparison to language-dependent techniques, which come at a cost in speed and are challenging to adapt to new domains and languages. We experiment with the feasibility of the proposed approach using Amharic news collected from the WALTA news agency and Amharic Wikipedia. Our results show that the proposed solution achieves 70.68% precision, 66.89% recall, and 68.53% f-measure in semantic annotation for the morphologically complex Amharic language with a limited-size dataset.

bib
Similarity and Farness Based Bidirectional Neural Co-Attention for Amharic Natural Language Inference
Abebawu Eshetu | Getenesh Teshome | Ribka Alemayehu

In natural language, one idea can be conveyed using different sentences, and higher-level Natural Language Processing applications have difficulty capturing the meaning of ideas stated in different expressions. To address this difficulty, scholars have conducted Natural Language Inference (NLI) research for different languages, using methods ranging from traditional discrete models with hard logic to end-to-end neural networks. In the context of Amharic, even though there are a number of research efforts on higher-level NLP applications, they are still limited in understanding ideas expressed in different ways, due to the absence of NLI for the Amharic language. Accordingly, we propose deep learning based Natural Language Inference using similarity- and farness-aware bidirectional attentive matching for Amharic texts. The experiment on the limited Amharic NLI dataset we prepared shows a promising result that can be used as a baseline for subsequent work.

bib
Large Vocabulary Read Speech Corpora for Four Ethiopian Languages: Amharic, Tigrigna, Oromo, and Wolaytta
Solomon Teferra Abate | Martha Yifiru Tachbelie | Michael Melese | Hafte Abera | Tewodros Gebreselassie | Wondwossen Mulugeta | Yaregal Assabie | Million Meshesha Beyene | Solomon Atinafu | Binyam Ephrem Seyoum

Automatic Speech Recognition (ASR) is one of the most important technologies to help people live a better life in the 21st century. However, its development requires a big speech corpus for a language. The development of such a corpus is expensive, especially for under-resourced Ethiopian languages. To address this problem we have developed four medium-sized (longer than 22 hours each) speech corpora for four Ethiopian languages: Amharic, Tigrigna, Oromo, and Wolaytta. As a way of checking the usability of the corpora, we also deliver a baseline ASR system for each language. In this paper, we present the corpora and the baseline ASR systems. The word error rates (WERs) we achieved show that the corpora are usable for further investigation, and we recommend the collection of text corpora to train strong language models for Oromo and Wolaytta, compared to the others.

bib
SIMPLEX-PB 2.0: A Reliable Dataset for Lexical Simplification in Brazilian Portuguese
Nathan Hartmann | Gustavo Henrique Paetzold | Sandra Aluísio

Most research on Lexical Simplification (LS) addresses non-native speakers of English, since they are numerous and easy to recruit. This makes it difficult to create LS solutions for other languages and target audiences. This paper presents SIMPLEX-PB 2.0, a dataset for LS in Brazilian Portuguese that, unlike its predecessor SIMPLEX-PB, accurately captures the needs of Brazilian underprivileged children. To create SIMPLEX-PB 2.0, we addressed all limitations of the old SIMPLEX-PB through multiple rounds of manual annotation. As a result, SIMPLEX-PB 2.0 features much more reliable and numerous candidate substitutions for complex words, as well as word complexity rankings produced by a group of underprivileged children.

bib
Bi-directional Answer-to-Answer Co-attention for Short Answer Grading using Deep Learning
Abebawu Eshetu | Getenesh Teshome | Ribka Alemahu

So far, different research works have been conducted on grading short answer questions. Given the advancement of artificial intelligence and the adaptability of deep learning models, we introduce a new model to score short subjective answer questions. Using bi-directional answer-to-answer co-attention, we demonstrate the extent to which word and sentence features of a student answer are detected by the model, and show promising results on both the Kaggle and Mohler datasets. The experiment on the Amharic short answer dataset prepared for this research work also shows a promising result that can be used as a baseline for subsequent works.

bib
Effective questions in referential visual dialogue
Mauricio Mazuecos | Alberto Testoni | Raffaella Bernardi | Luciana Benotti

An interesting challenge for situated dialogue systems is referential visual dialog: by asking questions, the system has to identify the referent to which the user refers. Task success is the standard metric used to evaluate these systems. However, it does not consider how effective each question is, that is, how much each question contributes to the goal. We propose a new metric that measures question effectiveness. As a preliminary study, we report the new metric for state-of-the-art publicly available models on GuessWhat?!. Surprisingly, successful dialogues do not have a higher percentage of effective questions than failed dialogues. This suggests that a system with high task success is not necessarily one that generates good questions.

bib
A Translation-Based Approach to Morphology Learning for Low Resource Languages
Tewodros Gebreselassie | Amanuel Mersha | Michael Gasser

“Low resource languages” usually refers to languages that lack corpora and basic tools such as part-of-speech taggers. But a significant number of such languages do benefit from the availability of relatively complex linguistic descriptions of phonology, morphology, and syntax, as well as dictionaries. A further category, probably the majority of the world’s languages, suffers from the lack of even these resources. In this paper, we investigate the possibility of learning the morphology of such a language by relying on its close relationship to a language with more resources. Specifically, we use a transfer-based approach to learn the morphology of the severely under-resourced language Gofa, starting with a neural morphological generator for the closely related language, Wolaytta. Both languages are members of the Omotic family, spoken in southwestern Ethiopia, and, like other Omotic languages, both are morphologically complex. We first create a finite-state transducer for morphological analysis and generation for Wolaytta, based on relatively complete linguistic descriptions and lexicons for the language. Next, we train an encoder-decoder neural network on the task of morphological generation for Wolaytta, using data generated by the FST. Such a network takes a root and a set of grammatical features as input and generates a word form as output. We then elicit Gofa translations of a small set of Wolaytta words from bilingual speakers. Finally, we retrain the decoder of the Wolaytta network, using a small set of Gofa target words that are translations of the Wolaytta outputs of the original network. The evaluation shows that the transfer network performs better than a separate encoder-decoder network trained on a larger set of Gofa words. We conclude with implications for the learning of morphology for severely under-resourced languages in regions where there are related languages with more resources.

bib
Tigrinya Automatic Speech recognition with Morpheme based recognition units
Hafte Abera | Sebsibe Hailemariam

The Tigrinya language is agglutinative and has a large number of inflected and derived word forms. A Tigrinya large-vocabulary continuous speech recognition system therefore has a large number of distinct units and a high out-of-vocabulary (OOV) rate if the word is used as the recognition unit of the language model (LM) and lexicon. A morpheme-based approach has therefore often been used, with the morpheme as the recognition unit, to reduce the high OOV rate. This paper presents an automatic speech recognition experiment conducted to see the effect of OOV words on the performance of a speech recognition system for Tigrinya. We address the OOV problem by using morphemes as lexicon and language model units. We find that morphemes are better lexical and language-modeling units than words. An absolute improvement (in word recognition accuracy) of 3.45 for tokens and 8.36 for types was obtained as a result of using a morph-based vocabulary.
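
The motivation for morpheme units can be made concrete by comparing OOV rates under the two vocabularies. The hyphenated segmentation below is a hypothetical stand-in (in English) for a real Tigrinya morphological analysis.

```python
def oov_rate(train_units, test_units):
    """Fraction of test units absent from the training vocabulary."""
    vocab = set(train_units)
    unseen = sum(1 for u in test_units if u not in vocab)
    return unseen / len(test_units)

# Toy example: hyphens mark a (hypothetical) morpheme segmentation.
train = ["sell-er", "buy-er", "sell-ing"]
test = ["buy-ing"]

word_oov = oov_rate([w.replace("-", "") for w in train],
                    [w.replace("-", "") for w in test])
morph_oov = oov_rate([m for w in train for m in w.split("-")],
                     [m for w in test for m in w.split("-")])
# word_oov == 1.0: the whole word "buying" was never seen in training.
# morph_oov == 0.0: both morphemes "buy" and "ing" occur in training.
```

For an agglutinative language the same effect holds at scale: unseen inflected forms decompose into already-seen morphemes, which is why the morpheme-based lexicon and LM reduce the OOV rate.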

bib
Variants of Vector Space Reductions for Predicting the Compositionality of English Noun Compounds
Pegah Alipoormolabashi | Sabine Schulte im Walde

Predicting the degree of compositionality of noun compounds is a crucial ingredient for lexicography and NLP applications, to know whether the compound should be treated as a whole, or through its constituents. Computational approaches for an automatic prediction typically represent compounds and their constituents within a vector space to have a numeric relatedness measure for the words. This paper provides a systematic evaluation of using different vector-space reduction variants for the prediction. We demonstrate that Word2vec and nouns-only dimensionality reductions are the most successful and stable vector space reduction variants for our task.
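
The core relatedness measure these approaches share can be sketched as the cosine between the compound's vector and a vector composed from its constituents. Additive composition, used below, is one common choice and an assumption here, not necessarily the paper's exact setup.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def compositionality(compound_vec, head_vec, modifier_vec):
    """Predicted compositionality: similarity between the compound's
    vector and the (additive) composition of its constituents."""
    composed = [h + m for h, m in zip(head_vec, modifier_vec)]
    return cosine(compound_vec, composed)
```

A transparent compound sits near the sum of its constituents (score near 1), while an opaque one drifts away (score near 0); dimensionality-reduction variants change the vectors fed into this measure, not the measure itself.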

bib
An Assessment of Language Identification Methods on Tweets and Wikipedia Articles
Pedro Vernetti | Larissa Freitas

Language identification is the task of determining the language in which a given text is written. This task is important for Natural Language Processing and Information Retrieval activities. Two popular approaches to language identification are N-gram and stopword models. In this paper, these two models were tested on different types of documents, such as short, irregular texts (tweets) and long, regular texts (Wikipedia articles).
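
An N-gram identifier of the kind evaluated here can be sketched by comparing character n-gram frequency profiles. The trigram order and the cosine-style similarity below are illustrative choices; a rank-based "out-of-place" measure (Cavnar & Trenkle) is a common alternative.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-gram frequency profile of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def similarity(profile_a, profile_b):
    """Cosine similarity between two n-gram frequency profiles."""
    shared = set(profile_a) & set(profile_b)
    dot = sum(profile_a[g] * profile_b[g] for g in shared)
    norm = lambda p: sum(v * v for v in p.values()) ** 0.5
    return dot / (norm(profile_a) * norm(profile_b))

def identify(text, training_texts):
    """Pick the language whose training profile is most similar."""
    profile = char_ngrams(text.lower())
    return max(training_texts,
               key=lambda lang: similarity(
                   profile, char_ngrams(training_texts[lang].lower())))
```

Short, irregular texts like tweets yield sparse profiles with little signal, which is precisely why the paper contrasts them with long, regular Wikipedia articles.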

bib
A Comparison of Identification Methods of Brazilian Music Styles by Lyrics
Patrick Guimarães | Jader Froes | Douglas Costa | Larissa Freitas

In our work, we applied different techniques for the task of genre classification using lyrics. Utilizing our dataset with lyrics of typical genres in Brazil divided into seven classes, we apply some models used in machine learning and deep learning classification tasks. We explore the performance of usual models for text classification using an input in the Portuguese language. We also compare the use of RNN and classic machine learning approaches for text classification, exploring the most used methods in the field.

bib
Enabling fast and correct typing in ‘Leichte Sprache’ (Easy Language)
Ina Steinmetz | Karin Harbusch

Simplified languages are instruments for inclusion aiming to overcome language barriers. Leichte Sprache (LS), for instance, is a variety of German with reduced complexity (cf. Basic English). So far, LS is mainly provided for, but rarely written by, its target groups, e.g. people with cognitive impairments. One reason may be the lack of technical support during the process from message conceptualization to sentence realization. In the following, we present a system for assisted typing in LS whose accuracy and speed is largely due to the deployment of real time natural-language processing enabling efficient prediction and context-sensitive grammar support.

bib
AI4D - African Language Dataset Challenge
Kathleen Siminyu | Sackey Freshia

As language and speech technologies become more advanced, the lack of fundamental digital resources for African languages, such as data, spell checkers and PoS taggers, means that the digital divide between these languages and others keeps growing. This work details the organisation of the AI4D - African Language Dataset Challenge, an effort to incentivize the creation, curation and uncovering of African language datasets through a competitive challenge, particularly datasets that are annotated or prepared for use in a downstream NLP task.

bib
Can Wikipedia Categories Improve Masked Language Model Pretraining?
Diksha Meghwal | Katharina Kann | Iacer Calixto | Stanislaw Jastrzebski

Pretrained language models have obtained impressive results for a large set of natural language understanding tasks. However, training these models is computationally expensive and requires huge amounts of data. Thus, it would be desirable to automatically detect groups of more or less important examples. Here, we investigate if we can leverage sources of information which are commonly overlooked, Wikipedia categories as listed in DBPedia, to identify useful or harmful data points during pretraining. We define an experimental setup in which we analyze correlations between language model perplexity on specific clusters and downstream NLP task performances during pretraining. Our experiments show that Wikipedia categories are not a good indicator of the importance of specific sentences for pretraining.

bib
FFR v1.1: Fon-French Neural Machine Translation
Chris Chinenye Emezue | Femi Pancrace Bonaventure Dossou

All over the world and especially in Africa, researchers are putting efforts into building Neural Machine Translation (NMT) systems to help tackle the language barriers in Africa, a continent of over 2000 different languages. However, the low-resourceness, diacritical, and tonal complexities of African languages are major issues being faced. The FFR project is a major step towards creating a robust translation model from Fon, a very low-resource and tonal language, to French, for research and public use. In this paper, we introduce FFR Dataset, a corpus of Fon-to-French translations, describe the diacritical encoding process, and introduce our FFR v1.1 model, trained on the dataset. The dataset and model are made publicly available, to promote collaboration and reproducibility.

bib
Classification and Analysis of Neologisms Produced by Learners of Spanish: Effects of Proficiency and Task
Shira Wein

The Spanish Learner Language Oral Corpora (SPLLOC) of transcribed conversations between investigators and language learners contains a set of neologism tags. In this work, the utterances tagged as neologisms are broken down into three categories: true neologisms, loanwords, and errors. This work examines the relationships between neologism, loanword, and error production and both language learner level and conversation task. The results of this study suggest that loanwords and errors are produced most frequently by language learners with moderate experience, while neologisms are produced most frequently by native speakers. This study also indicates that tasks that require descriptions of images draw more neologism, loanword and error production. We ultimately present a unique analysis of the implications of neologism, loanword, and error production useful for further work in second language acquisition research, as well as for language educators.

bib
Developing a Monolingual Sentence Simplification Corpus for Urdu
Yusra Anees | Sadaf Abdul Rauf | Nauman Iqbal | Abdul Basit Siddiqi

Complex sentences are a hurdle in the learning process of language learners. Sentence simplification aims to convert a complex sentence into a simpler form such that it is easily comprehensible. To build such automated simplification systems, corpora of complex sentences and their simplified versions are the first step to understanding sentence complexity and enabling the development of automatic text simplification systems. No such corpus has yet been developed for Urdu, and we fill this gap by developing one to help start readability and automatic sentence simplification research. We present a lexically and syntactically simplified Urdu simplification corpus and a detailed analysis of the various simplification operations. We further analyze our corpora using text readability measures and present a comparison of the original, lexically simplified, and syntactically simplified corpora.

bib
Translating Natural Language Instructions for Behavioral Robot Navigation with a Multi-Head Attention Mechanism
Patricio Cerda-Mardini | Vladimir Araujo | Álvaro Soto

We propose a multi-head attention mechanism as a blending layer in a neural network model that translates natural language to a high-level behavioral language for indoor robot navigation. We follow the framework established by (Zang et al., 2018a) that proposes the use of a navigation graph as a knowledge base for the task. Our results show significant performance gains when translating instructions on previously unseen environments, thereby improving the generalization capabilities of the model.

bib
Towards Mitigating Gender Bias in a decoder-based Neural Machine Translation model by Adding Contextual Information
Christine Basta | Marta R. Costa-jussà | José A. R. Fonollosa

Gender bias negatively impacts many natural language processing applications, including machine translation (MT). The motivation behind this work is to study whether recently proposed MT techniques significantly contribute to attenuating biases in document-level and gender-balanced data. For the study, we consider approaches that add the previous sentence and the speaker information, implemented in a decoder-based neural MT system. We show improvements both in translation quality (+1 BLEU point) as well as in gender bias mitigation on WinoMT (+5% accuracy).

bib
Predicting and Analyzing Law-Making in Kenya
Oyinlola Babafemi | Adewale Akinfaderin

Modelling and analyzing parliamentary legislation, roll-call votes and order of proceedings in developed countries has received significant attention in recent years. In this paper, we focused on understanding the bills introduced in a developing democracy, the Kenyan bicameral parliament. We developed and trained machine learning models on a combination of features extracted from the bills to predict the outcome - if a bill will be enacted or not. We observed that the texts in a bill are not as relevant as the year and month the bill was introduced and the category the bill belongs to.

bib
Defining and Evaluating Fair Natural Language Generation
Catherine Yeo | Alyssa Chen

Our work focuses on the biases that emerge in the natural language generation (NLG) task of sentence completion. In this paper, we introduce a mathematical framework of fairness for NLG followed by an evaluation of gender biases in two state-of-the-art language models. Our analysis provides a theoretical formulation for biases in NLG and empirical evidence that existing language generation models embed gender bias.

bib
Political Advertising Dataset: the use case of the Polish 2020 Presidential Elections
Lukasz Augustyniak | Krzysztof Rajda | Tomasz Kajdanowicz | Michał Bernaczyk

Political campaigns are full of political ads posted by candidates on social media. Political advertisements constitute a basic form of campaigning, subjected to various social requirements. We present the first publicly open dataset for detecting specific text chunks and categories of political advertising in the Polish language. It contains 1,705 human-annotated tweets tagged with nine categories, which constitute campaigning under Polish electoral law. We achieved a 0.65 inter-annotator agreement (Cohen’s kappa score). An additional annotator resolved the mismatches between the first two annotators, improving the consistency and complexity of the annotation process. We used the newly created dataset to train a well-established neural tagger (achieving an F1 score of 70%). We also present a possible direction of use cases for such datasets and models, with an initial analysis of the Polish 2020 Presidential Elections on Twitter.

bib
The human unlikeness of neural language models in next-word prediction
Cassandra L. Jacobs | Arya D. McCarthy

The training objective of unidirectional language models (LMs) is similar to a psycholinguistic benchmark known as the cloze task, which measures next-word predictability. However, LMs lack the rich set of experiences that people do, and humans can be highly creative. To assess human parity in these models’ training objective, we compare the predictions of three neural language models to those of human participants in a freely available behavioral dataset (Luke & Christianson, 2016). Our results show that while neural models show a close correspondence to human productions, they nevertheless assign insufficient probability to how often speakers guess upcoming words, especially for open-class content words.

bib
Long-Tail Predictions with Continuous-Output Language Models
Shiran Dudy | Steven Bedrick

Neural language models typically employ a categorical approach to prediction and training, leading to well-known computational and numerical limitations. An under-explored alternative is to predict directly against a continuous word embedding space, which according to recent research is more akin to how lexemes are represented in the brain. This method opens the door to large-vocabulary language models and enables substantially smaller and simpler computational complexity. In this research we explore another important trait: continuous-output prediction models reach low-frequency vocabulary words, which we show are often ignored by the categorical model. Such words are essential, as they can contribute to personalization and user vocabulary adaptation. We explore continuous-space language modeling in the context of a word prediction task over two different textual domains (newswire text and biomedical journal articles). We investigate both traditional and adversarial training approaches, and report results using several different embedding spaces and decoding mechanisms. We find that our continuous-prediction approach outperforms the standard categorical approach in terms of term diversity, in particular for rare words.
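
Decoding against a continuous output space amounts to a nearest-neighbour search in the embedding table rather than a softmax over the vocabulary. The toy embedding table below is illustrative only; the paper experiments with several real embedding spaces and decoding mechanisms.

```python
import math

def nearest_words(predicted_vec, embeddings, k=2):
    """Decode a continuous prediction by cosine nearest neighbours
    in the embedding table, instead of a softmax over the vocabulary."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) *
                      math.sqrt(sum(b * b for b in v)))
    ranked = sorted(embeddings,
                    key=lambda w: cos(predicted_vec, embeddings[w]),
                    reverse=True)
    return ranked[:k]
```

Because ranking depends only on geometric proximity, a rare word whose embedding sits near the predicted vector can win the search, whereas a softmax trained on frequency-skewed data tends to suppress it.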

bib
Analyzing the Framing of 2020 Presidential Candidates in the News
Audrey Acken | Dorottya Demszky

In this study, we apply NLP methods to learn about the framing of the 2020 Democratic Presidential candidates in news media. We use both a lexicon-based approach and word embeddings to analyze how candidates are discussed in news sources with different political leanings. Our results show significant differences in the framing of candidates across the news sources along several dimensions, such as sentiment and agency, paving the way for a deeper investigation.

bib
Understanding the Impact of Experiment Design for Evaluating Dialogue System Output
Sashank Santhanam | Samira Shaikh

Evaluation of output from natural language generation (NLG) systems is typically conducted via crowdsourced human judgments. To understand how experiment design might affect the quality and consistency of such human judgments, we designed a between-subjects study with four experimental conditions. Through our systematic study with 40 crowdsourced workers in each task, we find that using continuous scales achieves more consistent ratings than Likert-scale or ranking-based experiment designs. Additionally, we find that factors such as raters' prior experience of participating in similar studies of rating dialogue system output also affect the judgments.

pdf bib
Studying The Effect of Emotional and Moral Language on Information Contagion during the Charlottesville Event
Khyati Mahajan | Samira Shaikh

We highlight the contribution of emotional and moral language towards information contagion online. We find that retweet count on Twitter is significantly predicted by the use of negative emotions with negative moral language. We find that a tweet is less likely to be retweeted (hence less engagement and less potential for contagion) when it has emotional language expressed as anger along with a specific type of moral language, known as authority-vice. Conversely, when sadness is expressed with authority-vice, the tweet is more likely to be retweeted. Our findings indicate how emotional and moral language can interact in predicting information contagion.

pdf bib
Mapping of Narrative Text Fields To ICD-10 Codes Using Natural Language Processing and Machine Learning
Risuna Nkolele

The assignment of ICD-10 codes is done manually, which is laborious and prone to errors. Natural language processing and machine learning approaches have been receiving increasing attention for automating the task of assigning ICD-10 codes. In this study, we investigate the effect of different approaches on this task using a South African clinical dataset containing three narrative text fields (Clinical Summary, Presenting Complaints, and Examination Findings). The following traditional machine learning algorithms were used as classifiers: Logistic Regression, Multinomial Naive Bayes, Support Vector Machine, Decision Tree, Random Forest, and Extreme Gradient Boosting. Our results show the strong potential of automated ICD-10 coding from narrative text fields. Extreme Gradient Boosting outperformed the other classifiers with an accuracy of 79%, precision of 75%, and recall of 78%, while our worst classifier (Decision Tree) achieved an accuracy of 54%, precision of 60%, and recall of 56%.

pdf bib
Multitask Models for Controlling the Complexity of Neural Machine Translation
Sweta Agrawal | Marine Carpuat

We introduce a machine translation task where the output is aimed at audiences of different levels of target language proficiency. We collect a novel dataset of news articles available in English and Spanish and written for diverse reading grade levels. We leverage this dataset to train multitask sequence-to-sequence models that translate Spanish into English targeted at an easier reading grade level than the original Spanish. We show that multitask models outperform pipeline approaches that translate and simplify text independently.

pdf bib
Using Social Media For Bitcoin Day Trading Behavior Prediction
Anna Paula Pawlicka Maule | Kristen Johnson

This abstract presents preliminary work in the application of natural language processing techniques and social network modeling for the prediction of cryptocurrency trading and investment behavior. Specifically, we are building models that use language and social network behaviors to predict whether the tweets from a 24-hour period can inform profitable buy or sell decisions for cryptocurrency. In this paper we present our novel task and initial language modeling studies.

pdf bib
HausaMT v1.0: Towards English–Hausa Neural Machine Translation
Adewale Akinfaderin

Neural Machine Translation (NMT) for low-resource languages suffers from low performance because of the lack of large amounts of parallel data and language diversity. To help ameliorate this problem, we built a baseline model for English–Hausa machine translation, which is considered a low-resource translation task. Hausa is the second largest Afro-Asiatic language in the world after Arabic, and the third largest language of trade across a large swath of West African countries, after English and French. In this paper, we curated different datasets containing Hausa–English parallel corpora for our translation task. We trained baseline models and evaluated their performance using Recurrent and Transformer encoder-decoder architectures with two tokenization approaches: standard word-level tokenization and Byte Pair Encoding (BPE) subword tokenization.

pdf bib
Outcomes of coming out: Analyzing stories of LGBTQ+
Krithika Ramesh | Tanvi Anand

The Internet is frequently used as a platform through which opinions and views on various topics can be expressed. One such topic that draws controversial attention is LGBTQ+ rights. This paper attempts to analyze the reactions that members of the LGBTQ+ community face when they reveal their gender or sexuality, or in other words, when they ‘come out of the closet’. We aim to classify the experiences shared by them as positive or negative. We collected data from various sources, primarily Twitter. We applied deep learning techniques and compared their results with those of other classifiers, as well as with the results obtained from applying classical sentiment analysis techniques.

pdf bib
An Evaluation of Subword Segmentation Strategies for Neural Machine Translation of Morphologically Rich Languages
Aquia Richburg | Ramy Eskander | Smaranda Muresan | Marine Carpuat

Byte-Pair Encoding (BPE) (Sennrich et al., 2016) has become a standard pre-processing step when building neural machine translation systems. However, it is not clear whether this is an optimal strategy in all settings. We conduct a controlled comparison of subword segmentation strategies for translating two low-resource morphologically rich languages (Swahili and Turkish) into English. We show that segmentations based on a unigram language model (Kudo, 2018) yield comparable BLEU and better recall for translating rare source words than BPE.
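For readers unfamiliar with the BPE pre-processing step this abstract compares against, the core of the algorithm is simple: repeatedly merge the most frequent adjacent symbol pair in the training vocabulary. The sketch below is a minimal, illustrative Python implementation of that merge loop only (not the authors' code, and not the unigram language model alternative); the toy word list and `</w>` end-of-word marker follow the convention of Sennrich et al. (2016).

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge operations from a list of words.

    Each word is represented as a tuple of symbols, ending with the
    conventional end-of-word marker "</w>".
    """
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count frequencies of all adjacent symbol pairs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Greedily pick the most frequent pair and record the merge.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge to every word in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

merges = bpe_merges(["low", "lower", "lowest", "low"], 3)
print(merges)  # first merges combine 'l'+'o', then 'lo'+'w'
```

The unigram-language-model segmentation the paper favors differs precisely here: instead of greedy frequency merges, it keeps a probabilistic subword inventory and prunes it to maximize corpus likelihood.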

pdf bib
Enhanced Urdu Word Segmentation using Conditional Random Fields and Morphological Context Features
Aamir Farhan | Mashrukh Islam | Dipti Misra Sharma

Word segmentation is a fundamental task for most NLP applications. Urdu adopts the Nastalique writing style, which does not have a concept of space. Furthermore, the inherent non-joining attributes of certain characters in Urdu create spaces within a word when writing in digital format. Thus, Urdu has both space-omission and space-insertion issues, which make the word segmentation task challenging. In this paper, we improve upon the results of Zia, Raza and Athar (2018) by using a manually annotated corpus of 19,651 sentences along with morphological context features. Using a Conditional Random Field sequence modeler, our model achieves an F1 score of 0.98 for word boundary identification and 0.92 for sub-word boundary identification. The results demonstrated in this paper outperform state-of-the-art methods.