Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

Wei Xu, Alan Ritter, Tim Baldwin, Afshin Rahimi (Editors)

Association for Computational Linguistics

Text Simplification for Comprehension-based Question-Answering
Tanvi Dadu | Kartikey Pant | Seema Nagar | Ferdous Barbhuiya | Kuntal Dey

Text simplification is the process of splitting and rephrasing a sentence into a sequence of sentences, making it easier to read and understand while preserving the content and approximating the original meaning. Text simplification has been exploited in NLP applications like machine translation, summarization, semantic role labeling, and information extraction, opening a broad avenue for its exploitation in comprehension-based question-answering downstream tasks. In this work, we investigate the effect of text simplification in the task of question-answering using a comprehension context. We release Simple-SQuAD, a simplified version of the widely-used SQuAD dataset. Firstly, we outline each step in the dataset creation pipeline, including style transfer, thresholding of sentences showing correct transfer, and offset finding for each answer. Secondly, we verify the quality of the transferred sentences through various methodologies involving both automated and human evaluation. Thirdly, we benchmark the newly created corpus and perform an ablation study for examining the effect of the simplification process in the SQuAD-based question answering task. Our experiments show that simplification leads to increases of up to 2.04% and 1.74% in Exact Match and F1, respectively. Finally, we conclude with an analysis of the transfer process, investigating the types of edits made by the model, and the effect of sentence length on the transfer model.
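
The "offset finding" step named in this abstract can be pictured with a minimal Python sketch. It assumes that an answer string is located in the simplified context by plain substring search, with a case-insensitive fallback; the function name `find_answer_offset` and the example sentence are hypothetical, and the authors' actual procedure is not described here.

```python
def find_answer_offset(context: str, answer: str) -> int:
    """Return the character offset of `answer` in `context`, or -1 if absent.

    Assumption: SQuAD-style answers are stored as (text, start offset), so a
    rewritten context needs the offset to be recomputed.
    """
    # Try an exact match first.
    idx = context.find(answer)
    if idx == -1:
        # Fall back to a case-insensitive match.
        idx = context.lower().find(answer.lower())
    return idx


simplified = "The tower is in Paris. It was finished in 1889."
offset = find_answer_offset(simplified, "1889")
print(offset, simplified[offset:offset + 4])
```

An answer that cannot be found returns -1; in a pipeline like the one described, such examples would presumably be filtered out or handled separately.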

Keyphrase Extraction with Incomplete Annotated Training Data
Yanfei Lei | Chunming Hu | Guanghui Ma | Richong Zhang

Extracting keyphrases that summarize the main points of a document is a fundamental task in natural language processing. Supervised approaches to keyphrase extraction (KPE) are largely developed based on the assumption that the training data is fully annotated. However, due to the difficulty of keyphrase annotation, KPE models suffer severely from the incomplete annotation problem in many scenarios. To this end, we propose a more robust training method that learns to mitigate the misguidance brought by unlabeled keyphrases. We introduce negative sampling to adjust the training loss, and conduct experiments under different scenarios. Empirical studies on synthetic datasets and an open-domain dataset show that our model is robust to the incomplete annotation problem and surpasses prior baselines. Extensive experiments on five scientific domain datasets of different scales demonstrate that our model is competitive with the state-of-the-art method.

Fine-grained Temporal Relation Extraction with Ordered-Neuron LSTM and Graph Convolutional Networks
Minh Tran Phu | Minh Van Nguyen | Thien Huu Nguyen

Fine-grained temporal relation extraction (FineTempRel) aims to recognize the durations and timeline of event mentions in text. A missing part in the current deep learning models for FineTempRel is their failure to exploit the syntactic structures of the input sentences to enrich the representation vectors. In this work, we propose to fill this gap by introducing novel methods to integrate the syntactic structures into the deep learning models for FineTempRel. The proposed model focuses on two types of syntactic information from the dependency trees, i.e., the syntax-based importance scores for representation learning of the words and the syntactic connections to identify important context words for the event mentions. We also present two novel techniques to facilitate the knowledge transfer between the subtasks of FineTempRel, leading to a novel model with the state-of-the-art performance for this task.

A Text Editing Approach to Joint Japanese Word Segmentation, POS Tagging, and Lexical Normalization
Shohei Higashiyama | Masao Utiyama | Taro Watanabe | Eiichiro Sumita

Lexical normalization, in addition to word segmentation and part-of-speech tagging, is a fundamental task for Japanese user-generated text processing. In this paper, we propose a text editing model to solve the three tasks jointly and methods of pseudo-labeled data generation to overcome the problem of data deficiency. Our experiments showed that the proposed model achieved better normalization performance when trained on more diverse pseudo-labeled data.

Intrinsic evaluation of language models for code-switching
Sik Feng Cheong | Hai Leong Chieu | Jing Lim

Language models used in speech recognition are often either evaluated intrinsically using perplexity on test data, or extrinsically with an automatic speech recognition (ASR) system. The former evaluation does not always correlate well with ASR performance, while the latter could be specific to particular ASR systems. Recent work proposed to evaluate language models by using them to classify ground truth sentences among alternative phonetically similar sentences generated by a finite state transducer. Underlying such an evaluation is the assumption that the generated sentences are linguistically incorrect. In this paper, we first put this assumption into question, and observe that alternatively generated sentences could often be linguistically correct when they differ from the ground truth by only one edit. Secondly, we show that by using multi-lingual BERT, we can achieve better performance than previous work on two code-switching data sets. Our implementation is publicly available on GitHub at

Perceived and Intended Sarcasm Detection with Graph Attention Networks
Joan Plepi | Lucie Flek

Existing sarcasm detection systems focus on exploiting linguistic markers, context, or user-level priors. However, social studies suggest that the relationship between the author and the audience can be equally relevant for the sarcasm usage and interpretation. In this work, we propose a framework jointly leveraging (1) a user context from their historical tweets together with (2) the social information from a user’s conversational neighborhood in an interaction graph, to contextualize the interpretation of the post. We use graph attention networks (GAT) over users and tweets in a conversation thread, combined with dense user history representations. Apart from achieving state-of-the-art results on the recently published dataset of 19K Twitter users with 30K labeled tweets, adding 10M unlabeled tweets as context, our results indicate that the model contributes to interpreting the sarcastic intentions of an author more than to predicting the sarcasm perception by others.

Comparing Grammatical Theories of Code-Mixing
Adithya Pratapa | Monojit Choudhury

Code-mixed text generation systems have found applications in many downstream tasks, including speech recognition, translation and dialogue. A paradigm of these generation systems relies on well-defined grammatical theories of code-mixing, and there is a lack of comparison of these theories. We present a large-scale human evaluation of two popular grammatical theories, Matrix-Embedded Language (ML) and Equivalence Constraint (EC). We compare them against three heuristic-based models and quantitatively demonstrate the effectiveness of the two grammatical theories.

Mitigation of Diachronic Bias in Fake News Detection Dataset
Taichi Murayama | Shoko Wakamiya | Eiji Aramaki

Fake news causes significant damage to society. To deal with fake news, several studies on building detection models and arranging datasets have been conducted. Most fake news datasets depend on a specific time period. Consequently, detection models trained on such a dataset have difficulty detecting novel fake news generated by political and social changes; they may produce biased output for input that includes specific person names and organizational names. We refer to this problem as Diachronic Bias because it is caused by the creation date of news in each dataset. In this study, we confirm the bias, especially for proper nouns including person names, from the deviation of phrase appearances in each dataset. Based on these findings, we propose masking methods using Wikidata to mitigate the influence of person names and validate whether they make fake news detection models robust through experiments with in-domain and out-of-domain data.

Changes in Twitter geolocations: Insights and suggestions for future usage
Anna Kruspe | Matthias Häberle | Eike J. Hoffmann | Samyo Rode-Hasinger | Karam Abdulahhad | Xiao Xiang Zhu

Twitter data has become established as a valuable source of data for various application scenarios in the past years. For many such applications, it is necessary to know where Twitter posts (tweets) were sent from or what location they refer to. Researchers have frequently used exact coordinates provided in a small percentage of tweets, but Twitter removed the option to share these coordinates in mid-2019. Moreover, there is reason to suspect that a large share of the provided coordinates did not correspond to GPS coordinates of the user even before that. In this paper, we explain the situation and the 2019 policy change and shed light on the various options of still obtaining location information from tweets. We provide usage statistics including changes over time, and analyze what the removal of exact coordinates means for various common research tasks performed with Twitter data. Finally, we make suggestions for future research requiring geolocated tweets.

Coping with Noisy Training Data Labels in Paraphrase Detection
Teemu Vahtola | Mathias Creutz | Eetu Sjöblom | Sami Itkonen

We present new state-of-the-art benchmarks for paraphrase detection on all six languages in the Opusparcus sentential paraphrase corpus: English, Finnish, French, German, Russian, and Swedish. We reach these baselines by fine-tuning BERT. The best results are achieved on smaller and cleaner subsets of the training sets than was observed in previous research. Additionally, we study a translation-based approach that is competitive for the languages with more limited and noisier training data.

Detecting Cross-Geographic Biases in Toxicity Modeling on Social Media
Sayan Ghosh | Dylan Baker | David Jurgens | Vinodkumar Prabhakaran

Online social media platforms increasingly rely on Natural Language Processing (NLP) techniques to detect abusive content at scale in order to mitigate the harms it causes to their users. However, these techniques suffer from various sampling and association biases present in training data, often resulting in sub-par performance on content relevant to marginalized groups, potentially furthering disproportionate harms towards them. Studies on such biases so far have focused on only a handful of axes of disparities and subgroups that have annotations/lexicons available. Consequently, biases concerning non-Western contexts are largely ignored in the literature. In this paper, we introduce a weakly supervised method to robustly detect lexical biases in broader geo-cultural contexts. Through a case study on a publicly available toxicity detection model, we demonstrate that our method identifies salient groups of cross-geographic errors, and, in a follow-up, demonstrate that these groupings reflect human judgments of offensive and inoffensive language in those geographic contexts. We also conduct an analysis of a model trained on a dataset with ground truth labels to better understand these biases, and present preliminary mitigation experiments.

Detection of Puffery on the English Wikipedia
Amanda Bertsch | Steven Bethard

On Wikipedia, an online crowdsourced encyclopedia, volunteers enforce the encyclopedia’s editorial policies. Wikipedia’s policy on maintaining a neutral point of view has inspired recent research on bias detection, including weasel words and hedges. Yet to date, little work has been done on identifying puffery, phrases that are overly positive without a verifiable source. We demonstrate that collecting training data for this task requires some care, and construct a dataset by combining Wikipedia editorial annotations and information retrieval techniques. We compare several approaches to predicting puffery, and achieve an F1 score of 0.963 by incorporating citation features into a RoBERTa model. Finally, we demonstrate how to integrate our model with Wikipedia’s public infrastructure to give back to the Wikipedia editor community.

Robustness and Sensitivity of BERT Models Predicting Alzheimer’s Disease from Text
Jekaterina Novikova

Understanding robustness and sensitivity of BERT models predicting Alzheimer’s disease from text is important for both developing better classification models and for understanding their capabilities and limitations. In this paper, we analyze how a controlled amount of desired and undesired text alterations impacts performance of BERT. We show that BERT is robust to natural linguistic variations in text. On the other hand, we show that BERT is not sensitive to removing clinically important information from text.

CIDEr-R: Robust Consensus-based Image Description Evaluation
Gabriel Oliveira dos Santos | Esther Luna Colombini | Sandra Avila

This paper shows that CIDEr-D, a traditional evaluation metric for image description, does not work properly on datasets where the number of words in the sentence is significantly greater than in the MS COCO Captions dataset. We also show that the performance of CIDEr-D is hampered by the lack of multiple reference sentences and by high variance of sentence length. To bypass this problem, we introduce CIDEr-R, which improves CIDEr-D, making it more flexible in dealing with datasets with high sentence length variance. We demonstrate that CIDEr-R is more accurate and closer to human judgment than CIDEr-D; CIDEr-R is also more robust regarding the number of available references. Our results reveal that using Self-Critical Sequence Training to optimize CIDEr-R generates descriptive captions. In contrast, when CIDEr-D is optimized, the generated captions’ length tends to be similar to the reference length. However, the models also repeat the same word several times to increase the sentence length.

Improved Multilingual Language Model Pretraining for Social Media Text via Translation Pair Prediction
Shubhanshu Mishra | Aria Haghighi

We evaluate a simple approach to improving zero-shot multilingual transfer of mBERT on a social media corpus by adding a pretraining task called translation pair prediction (TPP), which predicts whether a pair of cross-lingual texts are a valid translation. Our approach assumes access to translations (exact or approximate) between source-target language pairs, where we fine-tune a model on source language task data and evaluate the model in the target language. In particular, we focus on language pairs where transfer learning is difficult for mBERT: those where source and target languages are different in script, vocabulary, and linguistic typology. We show improvements from TPP pretraining over mBERT alone in zero-shot transfer from English to Hindi, Arabic, and Japanese on two social media tasks: NER (a 37% average relative improvement in F1 across target languages) and sentiment classification (12% relative improvement in F1) on social media text, while also benchmarking on a non-social media task of Universal Dependency POS tagging (6.7% relative improvement in accuracy). Our results are promising given the lack of social media bitext corpora. Our code can be found at:
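
As a rough illustration of what training data for translation pair prediction might look like, the Python sketch below turns a list of translation pairs into labeled examples. Treating each aligned pair as a positive and pairing each source with a randomly chosen target from a different pair as a negative is an assumption made here for illustration; the abstract does not say how negatives are built, and `make_tpp_examples` is a hypothetical helper.

```python
import random


def make_tpp_examples(pairs, seed=0):
    """Build (source, target, label) examples; 1 = valid translation, 0 = mismatch.

    Assumption: negatives are made by pairing a source text with the target
    of a different translation pair, chosen at random.
    """
    if len(pairs) < 2:
        raise ValueError("need at least two translation pairs to build negatives")
    rng = random.Random(seed)
    # Every aligned pair is a positive example.
    examples = [(src, tgt, 1) for src, tgt in pairs]
    targets = [tgt for _, tgt in pairs]
    for i, (src, _) in enumerate(pairs):
        # Draw an index from the other len(pairs) - 1 positions, skipping i.
        j = rng.randrange(len(pairs) - 1)
        if j >= i:
            j += 1
        examples.append((src, targets[j], 0))
    return examples


toy_pairs = [("good morning", "suprabhat"), ("thank you", "dhanyavaad"), ("water", "paani")]
tpp_examples = make_tpp_examples(toy_pairs)
```

Each labeled pair would then be fed to a binary classifier head during pretraining; the toy English and romanized Hindi strings are placeholders only.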

Character Transformations for Non-Autoregressive GEC Tagging
Milan Straka | Jakub Náplava | Jana Straková

We propose a character-based non-autoregressive GEC approach, with automatically generated character transformations. Recently, per-word classification of correction edits has proven an efficient, parallelizable alternative to current encoder-decoder GEC systems. We show that word replacement edits may be suboptimal and lead to explosion of rules for spelling, diacritization and errors in morphologically rich languages, and propose a method for generating character transformations from GEC corpus. Finally, we train character transformation models for Czech, German and Russian, reaching solid results and dramatic speedup compared to autoregressive systems. The source code is released at

Can Character-based Language Models Improve Downstream Task Performances In Low-Resource And Noisy Language Scenarios?
Arij Riabi | Benoît Sagot | Djamé Seddah

Recent impressive improvements in NLP, largely based on the success of contextual neural language models, have been mostly demonstrated on at most a couple dozen high-resource languages. Building language models and, more generally, NLP systems for non-standardized and low-resource languages remains a challenging task. In this work, we focus on North-African colloquial dialectal Arabic written using an extension of the Latin script, called NArabizi, found mostly on social media and messaging communication. In this low-resource scenario with data displaying a high level of variability, we compare the downstream performance of a character-based language model on part-of-speech tagging and dependency parsing to that of monolingual and multilingual models. We show that a character-based model trained on only 99k sentences of NArabizi and fine-tuned on a small treebank of this language leads to performance close to those obtained with the same architecture pre-trained on large multilingual and monolingual models. Confirming these results on a much larger data set of noisy French user-generated content, we argue that such character-based language models can be an asset for NLP in low-resource and high language variability settings.

Something Something Hota Hai! An Explainable Approach towards Sentiment Analysis on Indian Code-Mixed Data
Aman Priyanshu | Aleti Vardhan | Sudarshan Sivakumar | Supriti Vijay | Nipuna Chhabra

The increasing use of social media sites in countries like India has given rise to large volumes of code-mixed data. Sentiment analysis of this data can provide integral insights into people’s perspectives and opinions. Code-mixed data is often noisy in nature due to multiple spellings for the same word, lack of definite order of words in a sentence, and random abbreviations. Thus, working with code-mixed data is more challenging than monolingual data. Interpreting a model’s predictions allows us to determine the robustness of the model against different forms of noise. In this paper, we propose a methodology to integrate explainable approaches into code-mixed sentiment analysis. By interpreting the predictions of sentiment analysis models we evaluate how well the model is able to adapt to the implicit noises present in code-mixed data.

BERTweetFR: Domain Adaptation of Pre-Trained Language Models for French Tweets
Yanzhu Guo | Virgile Rennard | Christos Xypolopoulos | Michalis Vazirgiannis

We introduce BERTweetFR, the first large-scale pre-trained language model for French tweets. Our model is initialised using a general-domain French language model CamemBERT which follows the base architecture of BERT. Experiments show that BERTweetFR outperforms all previous general-domain French language models on two downstream Twitter NLP tasks of offensiveness identification and named entity recognition. The dataset used in the offensiveness detection task is first created and annotated by our team, filling in the gap of such analytic datasets in French. We make our model publicly available in the transformers library with the aim of promoting future research in analytic tasks for French tweets.

To What Extent Does Lexical Normalization Help English-as-a-Second Language Learners to Read Noisy English Texts?
Yo Ehara

How difficult is it for English-as-a-second language (ESL) learners to read noisy English texts? Do ESL learners need lexical normalization to read noisy English texts? These questions may also affect community formation on social networking sites where differences can be attributed to ESL learners and native English speakers. However, few studies have addressed these questions. To this end, we built highly accurate readability assessors to evaluate the readability of texts for ESL learners. We then applied these assessors to noisy English texts to further assess the readability of the texts. The experimental results showed that although intermediate-level ESL learners can read most noisy English texts in the first place, lexical normalization significantly improves the readability of noisy English texts for ESL learners.

Multilingual Sequence Labeling Approach to solve Lexical Normalization
Divesh Kubal | Apurva Nagvenkar

The task of converting a nonstandard text to a standard and readable text is known as lexical normalization. Almost all Natural Language Processing (NLP) applications require text data in normalized form to build quality task-specific models. Hence, lexical normalization has been proven to improve the performance of numerous natural language processing tasks on social media. This study formulates lexical normalization as a sequence labeling problem and proposes a sequence labeling approach to solve it in combination with a word-alignment technique. The goal is to use a single model to normalize text in various languages, namely Croatian, Danish, Dutch, English, Indonesian-English, German, Italian, Serbian, Slovenian, Spanish, Turkish, and Turkish-German. This is a shared task at the 7th Workshop on Noisy User-generated Text (W-NUT 2021), in which the participants are expected to create a system/model that performs lexical normalization, which is the translation of non-canonical texts into their canonical equivalents, comprising data from over 12 languages. The proposed single multilingual model achieves an overall Error Reduction Rate (ERR) of 43.75 on intrinsic evaluation and an overall Labeled Attachment Score (LAS) of 63.12 on extrinsic evaluation. Further, the proposed method achieves the highest ERR score of 61.33 among the participants in the shared task.
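
The abstract reports Error Reduction Rate without defining it. The Python sketch below assumes the definition commonly used for lexical normalization: word-level accuracy of the system, rescaled against a baseline that leaves every token unchanged. Both that definition and the function name `error_reduction_rate` are assumptions here, not something stated in this listing, and the toy tokens are made up.

```python
def error_reduction_rate(raw, gold, pred):
    """ERR = (acc_system - acc_baseline) / (1 - acc_baseline).

    Assumed definition: the baseline copies each raw token unchanged, so ERR
    is 0 for a system that does nothing and 1 for a perfect system.
    """
    if not (len(raw) == len(gold) == len(pred)):
        raise ValueError("token lists must be aligned one-to-one")
    total = len(gold)
    if total == 0:
        raise ValueError("empty input")
    acc_system = sum(p == g for p, g in zip(pred, gold)) / total
    acc_baseline = sum(r == g for r, g in zip(raw, gold)) / total
    if acc_baseline == 1.0:
        raise ValueError("nothing to normalize; ERR is undefined")
    return (acc_system - acc_baseline) / (1.0 - acc_baseline)


raw_tokens = ["u", "r", "gr8", "!"]
gold_tokens = ["you", "are", "great", "!"]
pred_tokens = ["you", "r", "great", "!"]
err = error_reduction_rate(raw_tokens, gold_tokens, pred_tokens)
print(round(err, 4))
```

Under this assumed definition the function returns a fraction; the scores quoted in the abstract appear to be on a 0 to 100 scale, which would correspond to multiplying by 100.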

Sesame Street to Mount Sinai: BERT-constrained character-level Moses models for multilingual lexical normalization
Yves Scherrer | Nikola Ljubešić

This paper describes the HEL-LJU submissions to the MultiLexNorm shared task on multilingual lexical normalization. Our system is based on a BERT token classification preprocessing step, where for each token the type of the necessary transformation is predicted (none, uppercase, lowercase, capitalize, modify), and a character-level SMT step where the text is translated from original to normalized given the BERT-predicted transformation constraints. For some languages, depending on the results on development data, the training data was extended by back-translating OpenSubtitles data. In the final ordering of the ten participating teams, the HEL-LJU team has taken the second place, scoring better than the previous state-of-the-art.
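
To make the five transformation types in this abstract concrete, the Python sketch below derives a label from an (original, normalized) token pair. This is one plausible reading of the label set, written for illustration; how the HEL-LJU system actually derives its training labels is not stated here, and `transformation_label` is a hypothetical name.

```python
def transformation_label(original: str, normalized: str) -> str:
    """Map a token pair to one of: none, uppercase, lowercase, capitalize, modify.

    Assumption: pure casing changes get their own labels, and anything else
    is "modify" (left to the character-level translation step).
    """
    if normalized == original:
        return "none"
    if normalized == original.upper():
        return "uppercase"
    if normalized == original.lower():
        return "lowercase"
    # str.capitalize() upper-cases the first character and lower-cases the rest.
    if normalized == original.capitalize():
        return "capitalize"
    return "modify"


for orig_tok, norm_tok in [("hello", "hello"), ("nasa", "NASA"), ("paris", "Paris"), ("gr8", "great")]:
    print(orig_tok, "->", norm_tok, ":", transformation_label(orig_tok, norm_tok))
```

Labels produced this way could serve as targets for a token classifier, with only the "modify" tokens needing a full rewrite.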

Sequence-to-Sequence Lexical Normalization with Multilingual Transformers
Ana-Maria Bucur | Adrian Cosma | Liviu P. Dinu

Current benchmark tasks for natural language processing contain text that is qualitatively different from the text used in informal day to day digital communication. This discrepancy has led to severe performance degradation of state-of-the-art NLP models when fine-tuned on real-world data. One way to resolve this issue is through lexical normalization, which is the process of transforming non-standard text, usually from social media, into a more standardized form. In this work, we propose a sentence-level sequence-to-sequence model based on mBART, which frames the problem as a machine translation problem. As the noisy text is a pervasive problem across languages, not just English, we leverage the multi-lingual pre-training of mBART to fine-tune it to our data. While current approaches mainly operate at the word or subword level, we argue that this approach is straightforward from a technical standpoint and builds upon existing pre-trained transformer networks. Our results show that while word-level, intrinsic, performance evaluation is behind other methods, our model improves performance on extrinsic, downstream tasks through normalization compared to models operating on raw, unprocessed, social media text.

ÚFAL at MultiLexNorm 2021: Improving Multilingual Lexical Normalization by Fine-tuning ByT5
David Samuel | Milan Straka

We present the winning entry to the Multilingual Lexical Normalization (MultiLexNorm) shared task at W-NUT 2021 (van der Goot et al., 2021a), which evaluates lexical-normalization systems on 12 social media datasets in 11 languages. We base our solution on a pre-trained byte-level language model, ByT5 (Xue et al., 2021a), which we further pre-train on synthetic data and then fine-tune on authentic normalization data. Our system achieves the best performance by a wide margin in intrinsic evaluation, and also the best performance in extrinsic evaluation through dependency parsing. The source code is released at and the fine-tuned models at