International Conference on Recent Advances in Natural Language Processing (2021)



pdf bib
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
Ruslan Mitkov | Galia Angelova

pdf bib
English-Arabic Cross-language Plagiarism Detection
Naif Alotaibi | Mike Joy

The advancement of the web and information technology has contributed to the rapid growth of digital libraries and of automatic machine translation tools that easily translate texts from one language into another. These have increased the amount of content accessible in different languages, making translated plagiarism, referred to as cross-language plagiarism, easy to commit. Recognizing plagiarism among texts in different languages is more challenging than identifying plagiarism within a corpus written in the same language. This paper proposes a new technique for enhancing English-Arabic cross-language plagiarism detection at the sentence level. The technique is based on semantic and syntactic feature extraction using word order, word embedding and word alignment with multilingual encoders. Those features, and their combinations with different machine learning (ML) algorithms, are then used to classify sentences as either plagiarized or non-plagiarized. The proposed approach has been deployed and assessed using datasets presented at SemEval-2017. Analysis of the experimental data demonstrates that utilizing the extracted features and their combinations with various ML classifiers achieves promising results.
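As an illustration of the kind of multilingual-encoder similarity feature the abstract mentions, here is a minimal sketch (not the authors' system); the sentence-transformers package, the model name, and the threshold are assumptions.

```python
# Minimal sketch, not the authors' system: score an English-Arabic sentence pair
# with a multilingual sentence encoder and threshold the cosine similarity.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def is_plagiarized(en_sentence: str, ar_sentence: str, threshold: float = 0.7) -> bool:
    # Encode both sentences into the shared multilingual vector space.
    en, ar = model.encode([en_sentence, ar_sentence], convert_to_tensor=True)
    return util.cos_sim(en, ar).item() >= threshold
```

In the paper, such similarity scores are only one feature family among several (word order, word alignment), fed to trained ML classifiers rather than a fixed threshold.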

pdf bib
Enriching the Transformer with Linguistic Factors for Low-Resource Machine Translation
Jordi Armengol-Estapé | Marta R. Costa-jussà | Carlos Escolano

Introducing factors, that is to say, word features such as linguistic information referring to the source tokens, is known to improve the results of neural machine translation systems in certain settings, typically in recurrent architectures. This study proposes enhancing the current state-of-the-art neural machine translation architecture, the Transformer, so that it can incorporate external knowledge. In particular, our proposed modification, the Factored Transformer, uses linguistic factors that insert additional knowledge into the machine translation system. Apart from using different kinds of features, we study the effect of different architectural configurations. Specifically, we analyze the performance of combining words and features at the embedding level or at the encoder level, and we experiment with two different combination strategies. With the best-found configuration, we show improvements of 0.8 BLEU over the baseline Transformer in the IWSLT German-to-English task. Moreover, we experiment with the more challenging FLoRes English-to-Nepali benchmark, which involves an extremely low-resourced and very distant language pair, and obtain an improvement of 1.2 BLEU.
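As an illustration of the embedding-level combination of words and linguistic factors the abstract describes, here is a minimal PyTorch sketch (not the authors' code); the dimension split used for concatenation is an assumption.

```python
import torch
import torch.nn as nn

class FactoredEmbedding(nn.Module):
    """Combine word embeddings with embeddings of a linguistic factor (e.g. POS tags)."""
    def __init__(self, vocab_size, factor_size, d_model, combine="sum"):
        super().__init__()
        self.combine = combine
        if combine == "sum":
            # Summation keeps the model dimension unchanged.
            self.tok = nn.Embedding(vocab_size, d_model)
            self.fac = nn.Embedding(factor_size, d_model)
        else:
            # Concatenation reserves a slice of the model dimension for the factor.
            self.tok = nn.Embedding(vocab_size, d_model - d_model // 4)
            self.fac = nn.Embedding(factor_size, d_model // 4)

    def forward(self, token_ids, factor_ids):
        if self.combine == "sum":
            return self.tok(token_ids) + self.fac(factor_ids)
        return torch.cat([self.tok(token_ids), self.fac(factor_ids)], dim=-1)
```

The two branches correspond to the two kinds of combination strategy one can compare: summing leaves every dimension shared, while concatenating dedicates capacity to the factor.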

pdf bib
A Multi-Pass Sieve Coreference Resolution for Indonesian
Valentina Kania Prameswara Artari | Rahmad Mahendra | Meganingrum Arista Jiwanggi | Adityo Anggraito | Indra Budi

Coreference resolution is an NLP task that determines whether a set of referring expressions belong to the same concept in discourse. A multi-pass sieve is a deterministic coreference model that implements several layers of sieves, where each sieve takes a pair of correlated mentions from a collection of non-coherent mentions. The multi-pass sieve is based on the principle of starting with high precision and increasing recall in each subsequent sieve. In this work, we examine the portability of the multi-pass sieve coreference resolution model to the Indonesian language. We conduct experiments on 201 Wikipedia documents, and the multi-pass sieve system yields a MUC F-measure of 72.74% and a BCUBED F-measure of 52.18%.
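For illustration (not the authors' code), a minimal sketch of the sieve idea: every mention starts in its own cluster, and sieves merge clusters in order of decreasing precision, so early, safe decisions constrain later, riskier ones.

```python
def run_sieves(mentions, sieves):
    """Each mention starts as a singleton cluster; sieves are applied in order."""
    clusters = [{m} for m in mentions]
    for sieve in sieves:
        clusters = sieve(clusters)
    return clusters

def exact_match_sieve(clusters):
    """Highest-precision sieve: merge clusters whose mentions share an identical string."""
    merged = []
    for c in clusters:
        for m in merged:
            if {s.lower() for s in c} & {s.lower() for s in m}:
                m |= c
                break
        else:
            merged.append(set(c))
    return merged

print(run_sieves(["Jokowi", "the president", "jokowi"], [exact_match_sieve]))
# [{'Jokowi', 'jokowi'}, {'the president'}]
```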

pdf bib
PyEuroVoc: A Tool for Multilingual Legal Document Classification with EuroVoc Descriptors
Andrei-Marius Avram | Vasile Pais | Dan Ioan Tufis

EuroVoc is a multilingual thesaurus built for organizing the legislative documents of the European Union institutions. It contains thousands of categories at different levels of specificity, and its descriptors are targeted by legal texts in almost thirty languages. In this work we propose a unified framework for EuroVoc classification in 22 languages by fine-tuning modern Transformer-based pretrained language models. We extensively study the performance of our trained models and show that they significantly improve the results obtained by a similar tool, JEX, on the same dataset. The code and the fine-tuned models were open-sourced, together with a programmatic interface that eases the process of loading the weights of a trained model and of classifying a new document.

pdf bib
TEASER: Towards Efficient Aspect-based SEntiment Analysis and Recognition
Vaibhav Bajaj | Kartikey Pant | Ishan Upadhyay | Srinath Nair | Radhika Mamidi

Sentiment analysis aims to detect the overall sentiment, i.e., the polarity of a sentence, paragraph, or text span, without considering the entities mentioned and their aspects. Aspect-based sentiment analysis aims to extract the aspects of the given target entities and their respective sentiments. Prior works formulate this as a sequence tagging problem, or solve this task using a span-based extract-then-classify framework, where first all the opinion targets are extracted from the sentence and then, with the help of span representations, the targets are classified as positive, negative, or neutral. The sequence tagging formulation suffers from issues like sentiment inconsistency and a colossal search space, whereas the span-based extract-then-classify framework suffers from issues such as half-word coverage and overlapping spans. To overcome these, we propose a similar span-based extract-then-classify framework with a novel and improved heuristic. Experiments on three benchmark datasets (Restaurant14, Laptop14, Restaurant15) show that our model consistently outperforms the current state-of-the-art. Moreover, we present a novel supervised movie reviews dataset (Movie20) and a pseudo-labeled movie reviews dataset (moviesLarge) made explicitly for this task, and report results on the novel Movie20 dataset as well.

pdf bib
Litescale: A Lightweight Tool for Best-worst Scaling Annotation
Valerio Basile | Christian Cagnazzo

Best-worst Scaling (BWS) is a methodology for annotation based on comparing and ranking instances, rather than classifying or scoring individual instances. Studies have shown the efficacy of this methodology applied to NLP tasks in terms of a higher quality of the datasets produced by following it. In this system demonstration paper, we present Litescale, a free software library to create and manage BWS annotation tasks. Litescale computes the tuples to annotate, manages the users and the annotation process, and creates the final gold standard. The functionalities of Litescale can be accessed programmatically through a Python module, or via two alternative user interfaces, a textual console-based one and a graphical Web-based one. We further developed and deployed a fully online version of Litescale complete with multi-user support.
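For illustration (not Litescale's API), the standard counting procedure for turning best-worst annotations into real-valued scores, score(item) = (#times best - #times worst) / #times shown:

```python
from collections import Counter

def bws_scores(annotations, tuples):
    """annotations: list of (best_item, worst_item) pairs; tuples: the n-tuples shown."""
    best, worst, seen = Counter(), Counter(), Counter()
    for b, w in annotations:
        best[b] += 1
        worst[w] += 1
    for t in tuples:
        for item in t:
            seen[item] += 1
    # Scores fall in [-1, 1]; items never picked as best or worst score 0.
    return {i: (best[i] - worst[i]) / seen[i] for i in seen}
```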

pdf bib
Cross-Lingual Wolastoqey-English Definition Modelling
Diego Bear | Paul Cook

Definition modelling is the task of automatically generating a dictionary-style definition given a target word. In this paper, we consider cross-lingual definition generation. Specifically, we generate English definitions for Wolastoqey (Malecite-Passamaquoddy) words. Wolastoqey is an endangered, low-resource polysynthetic language. We hypothesize that sub-word representations based on byte pair encoding (Sennrich et al., 2016) can be leveraged to represent morphologically-complex Wolastoqey words and overcome the challenge of not having large corpora available for training. Our experimental results demonstrate that this approach outperforms baseline methods in terms of BLEU score.

pdf bib
On the Contribution of Per-ICD Attention Mechanisms to Classify Health Records in Languages with Fewer Resources than English
Alberto Blanco | Sonja Remmer | Alicia Pérez | Hercules Dalianis | Arantza Casillas

We introduce a multi-label text classifier with per-label attention for the classification of Electronic Health Records according to the International Classification of Diseases. We apply the model to two Electronic Health Record datasets with discharge summaries in two languages with fewer resources than English: Spanish and Swedish. Our model leverages the Multilingual BERT model (trained on the 104 languages with the largest Wikipedia dumps, including Spanish and Swedish [1]) to share language modelling capabilities across the languages. With per-label attention, the model can compute the relevance of each word from the EHR towards the prediction of each label. For the experimental framework, we apply 157 labels from Chapter XI, Diseases of the Digestive System, of the ICD, which makes the attention especially important as the model has to discriminate between similar diseases. [1] https://github.com/google-research/bert/blob/master/multilingual.md#list-of-languages
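As an illustration of per-label attention (a sketch, not the authors' implementation), one learned query vector per ICD label attends over the encoder's token states, so each label gets its own distribution over the words:

```python
import torch
import torch.nn as nn

class PerLabelAttention(nn.Module):
    def __init__(self, hidden, n_labels):
        super().__init__()
        # One learned attention query per label.
        self.label_queries = nn.Parameter(torch.randn(n_labels, hidden))
        self.out = nn.Linear(hidden, 1)

    def forward(self, token_states):                      # (batch, seq, hidden)
        # One attention distribution per label: (batch, n_labels, seq)
        att = torch.softmax(self.label_queries @ token_states.transpose(1, 2), dim=-1)
        label_repr = att @ token_states                   # (batch, n_labels, hidden)
        return torch.sigmoid(self.out(label_repr)).squeeze(-1)  # (batch, n_labels)
```

The per-label attention weights are also what makes such a model inspectable: they expose which words in the record drive each predicted code.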

pdf bib
Can the Transformer Be Used as a Drop-in Replacement for RNNs in Text-Generating GANs?
Kevin Blin | Andrei Kucharavy

In this paper we address the problem of fine-tuned text generation with a limited computational budget. For that, we use a well-performing text generative adversarial network (GAN) architecture, Diversity-Promoting GAN (DPGAN), and attempt a drop-in replacement of the LSTM layer with a self-attention-based Transformer layer in order to leverage its efficiency. The resulting Self-Attention DPGAN (SADPGAN) was evaluated for performance, for quality and diversity of the generated text, and for stability. Computational experiments suggested that the Transformer architecture is unable to drop-in replace the LSTM layer, under-performing during the pre-training phase and undergoing a complete mode collapse during the GAN tuning phase. Our results suggest that the Transformer architecture needs to be adapted before it can be used as a replacement for RNNs in text-generating GANs.

pdf bib
Predicting the Factuality of Reporting of News Media Using Observations about User Attention in Their YouTube Channels
Krasimira Bozhanova | Yoan Dinkov | Ivan Koychev | Maria Castaldo | Tommaso Venturini | Preslav Nakov

We propose a novel framework for predicting the factuality of reporting of news media outlets by studying the user attention cycles in their YouTube channels. In particular, we design a rich set of features derived from the temporal evolution of the number of views, likes, dislikes, and comments for a video, which we then aggregate to the channel level. We develop and release a dataset for the task, containing observations of user attention on YouTube channels for 489 news media. Our experiments demonstrate both complementarity and sizable improvements over state-of-the-art textual representations.

pdf bib
A Psychologically Informed Part-of-Speech Analysis of Depression in Social Media
Ana-Maria Bucur | Ioana R. Podina | Liviu P. Dinu

In this work, we provide an extensive part-of-speech analysis of the discourse of social media users with depression. Research in psychology has revealed that depressed users tend to be self-focused and preoccupied with themselves, and to ruminate more about their lives and emotions. Our work aims to make use of large-scale datasets and computational methods for a quantitative exploration of discourse. We use the publicly available depression dataset from the Early Risk Prediction on the Internet Workshop (eRisk) 2018 and extract part-of-speech features and several indices based on them. Our results reveal statistically significant differences between depressed and non-depressed individuals, confirming findings from the existing psychology literature. Our work provides insights into the way depressed individuals express themselves on social media platforms, allowing for better-informed computational models to help monitor and prevent mental illnesses.

pdf bib
Evaluating Recognizing Question Entailment Methods for a Portuguese Community Question-Answering System about Diabetes Mellitus
Thiago Castro Ferreira | João Victor de Pinho Costa | Isabela Rigotto | Vitoria Portella | Gabriel Frota | Ana Luisa A. R. Guimarães | Adalberto Penna | Isabela Lee | Tayane A. Soares | Sophia Rolim | Rossana Cunha | Celso França | Ariel Santos | Rivaney F. Oliveira | Abisague Langbehn | Daniel Hasan Dalip | Marcos André Gonçalves | Rodrigo Bastos Fóscolo | Adriana Pagano

This study describes the development of a Portuguese community question-answering benchmark in the domain of Diabetes Mellitus using a Recognizing Question Entailment (RQE) approach. Given a premise question, RQE aims to retrieve semantically similar, already answered, archived questions. We build a new Portuguese benchmark corpus with 785 pairs of premise questions and archived answered questions marked with relevance judgments by medical experts. Based on the benchmark corpus, we leveraged and evaluated several RQE approaches ranging from traditional information retrieval methods to novel large pre-trained language models and ensemble techniques using learning-to-rank approaches. Our experimental results show that a supervised transformer-based method trained with multiple languages and for multiple tasks (MUSE) outperforms the alternatives. Our results also show that ensembles of methods (stacking), as well as a traditional (light) information retrieval method (BM25), can produce competitive results. Finally, among the tested strategies, those that exploit only the question (not the answer) provide the best effectiveness-efficiency trade-off. Code is publicly available.
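As an illustration of the light BM25 alternative that the abstract reports as competitive, here is a minimal retrieval sketch (not the paper's pipeline); the rank_bm25 package, the whitespace tokenization, and the toy archived questions are assumptions.

```python
from rank_bm25 import BM25Okapi

# Toy archive of already-answered questions (placeholders, not from the corpus).
archived = ["o que e diabetes tipo 2", "posso comer acucar", "qual a dieta ideal"]
bm25 = BM25Okapi([q.split() for q in archived])

def top_k(premise: str, k: int = 2):
    """Return the k archived questions that BM25 scores highest for the premise."""
    scores = bm25.get_scores(premise.split())
    return sorted(zip(archived, scores), key=lambda x: -x[1])[:k]

print(top_k("diabetes tipo 2 tem cura"))
```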

pdf bib
On the Usability of Transformers-based Models for a French Question-Answering Task
Oralie Cattan | Christophe Servan | Sophie Rosset

For many tasks, state-of-the-art results have been achieved with Transformer-based architectures, resulting in a paradigmatic shift in practices from the use of task-specific architectures to the fine-tuning of pre-trained language models. The ongoing trend consists in training models with an ever-increasing amount of data and parameters, which requires considerable resources. It has led to a strong push to improve resource efficiency based on algorithmic and hardware improvements, evaluated only for English. This raises questions about the usability of such models when applied to small-scale learning problems, for which a limited amount of training data is available, especially for tasks in under-resourced languages. The lack of appropriately sized corpora is a hindrance to applying data-driven and transfer learning-based approaches, with strong instability cases. In this paper, we survey the efforts dedicated to the usability of Transformer-based models and propose to evaluate these improvements on the question-answering performance for French, a language with few resources. We address the instability caused by data scarcity by investigating various training strategies with data augmentation, hyperparameter optimization and cross-lingual transfer. We also introduce a new compact model for French, FrALBERT, which proves to be competitive in low-resource settings.

pdf bib
Character-based Thai Word Segmentation with Multiple Attentions
Thodsaporn Chay-intr | Hidetaka Kamigaito | Manabu Okumura

Character-based word-segmentation models have been extensively applied to agglutinative languages, including Thai, due to their high performance. These models estimate word boundaries from a character sequence. However, a character unit in sequences has no essential meaning, compared with word, subword, and character cluster units. We propose a Thai word-segmentation model that uses various types of information, including words, subwords, and character clusters, from a character sequence. Our model applies multiple attentions to refine segmentation inferences by estimating the significant relationships among characters and various unit types. The experimental results indicate that our model can outperform other state-of-the-art Thai word-segmentation models.

pdf bib
Are Language-Agnostic Sentence Representations Actually Language-Agnostic?
Yu Chen | Tania Avgustinova

With the emergence of pre-trained multilingual models, multilingual embeddings have been widely applied in various natural language processing tasks. Language-agnostic models provide a versatile way to convert linguistic units from different languages into a shared vector representation space. Relevant work on multilingual sentence embeddings has reportedly reached low error rates in cross-lingual similarity search tasks. In this paper, we apply pre-trained embedding models to the cross-lingual similarity search task in diverse scenarios, and observe large discrepancies with the results reported in the original paper. Our findings on cross-lingual similarity search with different newly constructed multilingual datasets show not only a correlation with observable language similarities but also a strong influence from factors such as translation paths, which limits the interpretation of the language-agnostic property of the LASER model.
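For illustration (not the paper's exact setup), the cross-lingual similarity search error can be computed as the fraction of source sentences whose nearest target-language neighbour is not their translation:

```python
import numpy as np

def similarity_search_error(src_emb: np.ndarray, tgt_emb: np.ndarray) -> float:
    """src_emb[i] and tgt_emb[i] embed a sentence and its translation (L2-normalised rows)."""
    sims = src_emb @ tgt_emb.T                  # cosine similarity matrix
    nearest = sims.argmax(axis=1)               # index of each source's nearest target
    return float(np.mean(nearest != np.arange(len(src_emb))))
```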

pdf bib
Towards an Etymological Map of Romanian
Alina Maria Cristea | Anca Dinu | Liviu P. Dinu | Simona Georgescu | Ana Sabina Uban | Laurentiu Zoicas

In this paper we investigate the etymology of Romanian words. We start from the Romanian lexicon and automatically extract information from multiple etymological dictionaries. We evaluate the results and perform extensive quantitative and qualitative analyses with the goal of building an etymological map of the language.

pdf bib
Event Prominence Extraction Combining a Knowledge-Based Syntactic Parser and a BERT Classifier for Dutch
Thierry Desot | Orphee De Clercq | Veronique Hoste

A core task in information extraction is event detection, which identifies event triggers in sentences and typically classifies them into event types. In this study an event is considered the unit for measuring diversity and similarity in news articles in the framework of a news recommendation system. Current typology-based event detection approaches fail to handle the variety of events expressed in real-world situations. To overcome this, we aim to perform event prominence classification and explore whether a transformer model is capable of classifying new information into less and more general prominence classes. After comparing the performance of a Support Vector Machine (SVM) baseline and our transformer-based classifier on several event span formats, we conceived multi-word event spans as syntactic clauses. Those are fed into our prominence classifier, which is fine-tuned on pre-trained Dutch BERT word embeddings. Our approach also outperforms a pipeline combining a Conditional Random Field (CRF) approach to event-trigger word detection with the BERT-based classifier. To the best of our knowledge we present the first event extraction approach that combines an expert-based syntactic parser with a transformer-based classifier for Dutch.

pdf bib
Automatic Detection and Classification of Mental Illnesses from General Social Media Texts
Anca Dinu | Andreea-Codrina Moldovan

Mental health has been receiving increasing attention recently; depression is a very common illness nowadays, but so are other disorders such as anxiety, obsessive-compulsive disorder, eating disorders, autism, and attention-deficit/hyperactivity disorder. The huge amount of data from social media and the recent advances in deep learning models provide valuable means for automatically detecting mental disorders from plain text. In this article, we experiment with state-of-the-art methods on the SMHD mental health conditions dataset from Reddit (Cohan et al., 2018). Our contribution is threefold: we use a dataset consisting of more illnesses than most studies, we focus on general text rather than mental health support groups, and we classify by post rather than by individual or group. For the automatic classification of the diseases, we employ three deep learning models: BERT, RoBERTa and XLNet. We double the baseline established by Cohan et al. (2018), on just a sample of their dataset. We improve on the results obtained by Jiang et al. (2020) for post-level classification. The accuracy obtained by the eating disorder classifier is the highest, due to the prevalence of discussions related to calories, diets, recipes, etc., whereas depression had the lowest F1 score, probably because depression is more difficult to identify in linguistic acts.

pdf bib
A Pre-trained Transformer and CNN Model with Joint Language ID and Part-of-Speech Tagging for Code-Mixed Social-Media Text
Suman Dowlagar | Radhika Mamidi

Code-mixing (CM) is a frequently observed phenomenon in which multiple languages are used in an utterance or sentence. Code-mixing observes no strict grammatical constraints and features non-standard spelling variations. The linguistic complexity resulting from these factors makes the computational analysis of code-mixed language a challenging task. Language identification (LI) and part-of-speech (POS) tagging are the fundamental steps that help analyze the structure of code-mixed text, and the two tasks are often interdependent in the code-mixing scenario. We cast the problem of dealing with multilingualism and grammatical structure while analyzing a code-mixed sentence as a joint learning task. In this paper, we jointly train and optimize language detection and part-of-speech tagging models in the code-mixed scenario, using a Transformer with a convolutional neural network architecture. We train the joint model by combining the POS tagging and LI objectives on code-mixed social media text obtained from the ICON shared task.

pdf bib
Knowledge Discovery in COVID-19 Research Literature
Ernesto L. Estevanell-Valladares | Suilan Estevez-Velarde | Alejandro Piad-Morffis | Yoan Gutierrez | Andres Montoyo | Rafael Muñoz | Yudivián Almeida Cruz

This paper presents the preliminary results of an ongoing project that analyzes the growing body of scientific research published around the COVID-19 pandemic. In this research, a general-purpose semantic model is used to double annotate a batch of 500 sentences that were manually selected from the CORD-19 corpus. Afterwards, a baseline text-mining pipeline is designed and evaluated via a large batch of 100,959 sentences. We present a qualitative analysis of the most interesting facts automatically extracted and highlight possible future lines of development. The preliminary results show that general-purpose semantic models are a useful tool for discovering fine-grained knowledge in large corpora of scientific documents.

pdf bib
Online Learning over Time in Adaptive Neural Machine Translation
Thierry Etchegoyhen | David Ponce | Harritxu Gete | Victor Ruiz

Adaptive Machine Translation purports to dynamically include user feedback to improve translation quality. In a post-editing scenario, user corrections of machine translation output are thus continuously incorporated into translation models, reducing or eliminating repetitive error editing and increasing the usefulness of automated translation. In neural machine translation, this goal may be achieved via online learning approaches, where network parameters are updated based on each new sample. This type of adaptation typically requires higher learning rates, which can affect the quality of the models over time. Alternatively, less aggressive online learning setups may preserve model stability, at the cost of reduced adaptation to user-generated corrections. In this work, we evaluate different online learning configurations over time, measuring their impact on user-generated samples, as well as on separate in-domain and out-of-domain datasets. Results in two different domains indicate that mixed approaches combining online learning with periodic batch fine-tuning might be needed to balance the benefits of online learning with model stability.
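As a sketch of the setup under discussion (the model interface below is hypothetical, not the authors' code): each post-edited sample triggers one online update, optionally complemented by the periodic batch fine-tuning pass that the conclusions suggest.

```python
def adapt_online(model, stream, online_lr=1e-4, batch_every=1000):
    """stream yields (source, post_edit) pairs; translate/update/fine_tune are
    assumed interfaces on a hypothetical NMT model wrapper."""
    seen = []
    for i, (source, post_edit) in enumerate(stream, start=1):
        yield model.translate(source)                   # translate before adapting
        model.update(source, post_edit, lr=online_lr)   # one online step per sample
        seen.append((source, post_edit))
        if i % batch_every == 0:                        # periodic, less aggressive pass
            model.fine_tune(seen, lr=online_lr / 10)
```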

pdf bib
Cross-lingual Offensive Language Identification for Low Resource Languages: The Case of Marathi
Saurabh Sampatrao Gaikwad | Tharindu Ranasinghe | Marcos Zampieri | Christopher Homan

The widespread presence of offensive language on social media has motivated the development of systems capable of recognizing such content automatically. Apart from a few notable exceptions, most research on automatic offensive language identification has dealt with English. To address this shortcoming, we introduce MOLD, the Marathi Offensive Language Dataset. MOLD is the first dataset of its kind compiled for Marathi, thus opening a new domain for research in low-resource Indo-Aryan languages. We present results from several machine learning experiments on this dataset, including zero-shot and other transfer learning experiments on state-of-the-art cross-lingual transformers from existing data in Bengali, English, and Hindi.

pdf bib
Syntax and Themes: How Context Free Grammar Rules and Semantic Word Association Influence Book Success
Henry Gorelick | Biddut Sarker Bijoy | Syeda Jannatus Saba | Sudipta Kar | Md Saiful Islam | Mohammad Ruhul Amin

In this paper, we attempt to improve upon the state-of-the-art in predicting a novel's success by modeling the lexical semantic relationships of its contents. We created the largest dataset used in such a project, containing lexical data from 17,962 books from Project Gutenberg. We utilized domain-specific feature reduction techniques to implement the most accurate models to date for predicting book success, with our best model achieving an average accuracy of 94.0%. By analyzing the model parameters, we extracted the successful semantic relationships from books of 12 different genres. We then mapped those semantic relations to a set of themes, as defined in Roget's Thesaurus, and discovered the themes that successful books of a given genre prioritize. Finally, we showed that our model demonstrates similar performance for book success prediction even when the Goodreads rating is used instead of the download count to measure success.

pdf bib
Apples to Apples: A Systematic Evaluation of Topic Models
Ismail Harrando | Pasquale Lisena | Raphael Troncy

From statistical to neural models, a wide variety of topic modelling algorithms have been proposed in the literature. However, because of the diversity of datasets and metrics, there have not been many efforts to systematically compare their performance on the same benchmarks and under the same conditions. In this paper, we present a selection of 9 topic modelling techniques from the state of the art reflecting a diversity of approaches to the task, an overview of the different metrics used to compare their performance, and the challenges of conducting such a comparison. We empirically evaluate the performance of these models in different settings reflecting a variety of real-life conditions in terms of dataset size, number of topics, and distribution of topics, following identical preprocessing and evaluation processes. Using both metrics that rely on the intrinsic characteristics of the dataset (different coherence metrics) and metrics that rely on external knowledge (word embeddings and ground-truth topic labels), our experiments reveal several shortcomings in the common practices of topic model evaluation.

pdf bib
Semi-Supervised and Unsupervised Sense Annotation via Translations
Bradley Hauer | Grzegorz Kondrak | Yixing Luan | Arnob Mallik | Lili Mou

Acquisition of multilingual training data continues to be a challenge in word sense disambiguation (WSD). To address this problem, unsupervised approaches have been proposed to automatically generate sense annotations for training supervised WSD systems. We present three new methods for creating sense-annotated corpora which leverage translations, parallel bitexts, lexical resources, as well as contextual and synset embeddings. Our semi-supervised method applies machine translation to transfer existing sense annotations to other languages. Our two unsupervised methods refine sense annotations produced by a knowledge-based WSD system via lexical translations in a parallel corpus. We obtain state-of-the-art results on standard WSD benchmarks.

pdf bib
Application of Deep Learning Methods to SNOMED CT Encoding of Clinical Texts: From Data Collection to Extreme Multi-Label Text-Based Classification
Anton Hristov | Aleksandar Tahchiev | Hristo Papazov | Nikola Tulechki | Todor Primov | Svetla Boytcheva

Concept normalization of clinical texts to standard medical classifications and ontologies is a task of high importance for healthcare and medical research. We attempt to solve this problem through automatic SNOMED CT encoding, SNOMED CT being one of the most widely used and comprehensive clinical term ontologies. Applying basic deep learning models, however, leads to undesirable results due to the unbalanced nature of the data and the extreme number of classes. We propose a classification procedure featuring a multiple-step workflow consisting of label clustering, multi-cluster classification, and clusters-to-labels mapping. For multi-cluster classification, BioBERT is fine-tuned over our custom dataset. The clusters-to-labels mapping is carried out by a one-vs-all classifier (SVC) applied to every single cluster. We also present the steps for automatic generation of a dataset of textual descriptions annotated with SNOMED CT codes, based on public data and linked open data. To cope with the high imbalance of our dataset, some data augmentation methods are applied. The results of the conducted experiments show the high accuracy and reliability of our approach for predicting SNOMED CT codes relevant to a clinical text.

pdf bib
Transfer Learning for Czech Historical Named Entity Recognition
Helena Hubková | Pavel Kral

Nowadays, named entity recognition (NER) achieves excellent results on standard corpora. However, applying it in a specific domain remains a serious issue, because it requires a suitably annotated corpus with an adapted NE tag-set. This is particularly evident in the historical document processing field. The main goal of this paper consists of proposing and evaluating several transfer learning methods to increase the score of Czech historical NER. We study several information sources, and we use two neural nets for NE modeling and recognition. We employ two corpora for the evaluation of our transfer learning methods, namely the Czech Named Entity Corpus and the Czech Historical Named Entity Corpus. We show that a BERT representation with fine-tuning, and only a simple classifier trained on the union of the corpora, achieves excellent results.

pdf bib
Domain-Specific Japanese ELECTRA Model Using a Small Corpus
Youki Itoh | Hiroyuki Shinnou

Recently, domain shift, which affects accuracy due to differences in data between source and target domains, has become a serious issue when using machine learning methods to solve natural language processing tasks. With additional pretraining and fine-tuning on a target domain corpus, pretrained models such as BERT (Bidirectional Encoder Representations from Transformers) can address this issue. However, the additional pretraining of the BERT model is difficult because it requires significant computing resources. The ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately) pretraining model replaces the masked language modeling of BERT pretraining with a method called replaced token detection, which improves computational efficiency and makes additional pretraining practical. Herein, we propose a method for addressing the computational cost of pretraining models under domain shift by constructing an ELECTRA pretraining model on a Japanese dataset and additionally pretraining this model for a downstream task using a corpus from the target domain. We constructed a pretraining model for ELECTRA in Japanese and conducted experiments on a document classification task using data from Japanese news articles. Results show that even a model smaller than the pretrained model performs equally well.

pdf bib
Behavior of Modern Pre-trained Language Models Using the Example of Probing Tasks
Ekaterina Kalyaeva | Oleg Durandin | Alexey Malafeev

Modern transformer-based language models are revolutionizing NLP. However, existing studies into language modelling with BERT have been mostly limited to English-language material and do not pay enough attention to the implicit knowledge of language, such as semantic roles, presupposition and negation, that can be acquired by the model during training. Thus, the aim of this study is to examine the behavior of the BERT model in the task of masked language modelling and to provide a linguistic interpretation of the unexpected effects and errors produced by the model. For this purpose, we used a new Russian-language dataset based on educational texts for learners of Russian, annotated with the help of the National Corpus of the Russian Language. In terms of quality metrics (the proportion of words semantically related to the target word), the multilingual BERT is recognized as the best model. Generally, each model has distinct strengths in relation to certain linguistic phenomena. These observations have meaningful implications for research into applied linguistics and pedagogy, contribute to dialogue system development, automatic exercise making and text generation, and could potentially improve the quality of existing linguistic technologies.

pdf bib
Application of Mix-Up Method in Document Classification Task Using BERT
Naoki Kikuta | Hiroyuki Shinnou

The mix-up method (Zhang et al., 2017), one of the methods for data augmentation, is known to be easy to implement and highly effective. Although the mix-up method was designed for image recognition, it can also be applied to natural language processing. In this paper, we apply the mix-up method to a document classification task using bidirectional encoder representations from transformers (BERT) (Devlin et al., 2018). Since BERT allows two-sentence input, we concatenated word sequences from two documents with different labels and used the mixed one-hot multi-class output as the supervised target. In an experiment on the livedoor news corpus, which is in Japanese, we compared the accuracy of document classification using two methods for selecting the documents to be concatenated with that of ordinary document classification. We found that the proposed method outperforms ordinary classification when documents from under-represented labels are preferentially mixed in, indicating that the choice of documents for mix-up has a significant impact on the results.
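For illustration (not the authors' code), the label side of this adaptation: the pair of concatenated documents is supervised with a soft, mixed one-hot target; the default mixing weight here is an assumption.

```python
import torch

def mixup_target(label_a: int, label_b: int, n_classes: int, lam: float = 0.5):
    """Soft target for a two-document mix-up pair; train with a soft cross-entropy loss."""
    y = torch.zeros(n_classes)
    y[label_a] += lam
    y[label_b] += 1.0 - lam
    return y

print(mixup_target(0, 2, 3))  # tensor([0.5000, 0.0000, 0.5000])
```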

pdf bib
Neural Machine Translation for Sinhala-English Code-Mixed Text
Archchana Kugathasan | Sagara Sumathipala

Code-mixing has become a common mode of communication among multilingual speakers, and most of the social media content of multilingual societies is written in code-mixed text. However, most current translation systems neglect to convert code-mixed texts into a standard language, and much user-written code-mixed content on social media remains unprocessed due to the unavailability of linguistic resources such as parallel corpora. This paper proposes a Neural Machine Translation (NMT) model to translate Sinhala-English code-mixed text into the Sinhala language. Because of the limited resources available for Sinhala-English code-mixed (SECM) text, we created a parallel corpus of SECM sentences and Sinhala sentences; Sri Lankan social media sites contain SECM texts more frequently than the standard languages. The model proposed for code-mixed text translation in this study combines an encoder-decoder framework with LSTM units and the teacher forcing algorithm. The translated sentences from the model are evaluated using the BLEU (Bilingual Evaluation Understudy) metric, on which our model achieved a remarkable score for this translation task.

pdf bib
Multilingual Multi-Domain NMT for Indian Languages
Sourav Kumar | Salil Aggarwal | Dipti Sharma

India is known as the land of many tongues and dialects. Neural machine translation (NMT) is the current state-of-the-art approach for machine translation (MT), but it performs well only with large datasets, which Indian languages usually lack, making the standard approach infeasible. In this paper, we therefore address the problem of data scarcity by efficiently training multilingual and multilingual multi-domain NMT systems involving languages of the Indian subcontinent, and we propose a technique for using joint domain and language tags in a multilingual setup. We draw three major conclusions from our experiments: (i) training a multilingual system that exploits lexical similarity based on language family achieves an overall average improvement over bilingual baselines, (ii) incorporating domain information into the language tokens gives the multilingual multi-domain system a significant average improvement over the baselines, and (iii) multistage fine-tuning yields a further improvement for the language pair of interest.

pdf bib
Fiction in Russian Translation: A Translationese Study
Maria Kunilovskaya | Ekaterina Lapshinova-Koltunski | Ruslan Mitkov

This paper presents a translationese study based on the parallel data from the Russian National Corpus (RNC). We explored differences between literary texts originally authored in Russian and fiction translated into Russian from 11 languages. The texts are represented with frequency-based features that capture structural and lexical properties of language. Binary classification results indicate that literary translations can be distinguished from non-translations with an accuracy ranging from 82% to 92%, depending on the source language and feature set. Multiclass classification confirms that translations from distant languages are more distinct from non-translations than translations from languages that are typologically close to Russian. It also demonstrates that translations from same-family source languages share translationese properties. Structural features return more consistent results than features relying on external resources and capturing lexical properties of texts in both translationese detection and source language identification tasks.

pdf bib
Sentiment Analysis in Code-Mixed Telugu-English Text with Unsupervised Data Normalization
Siva Subrahamanyam Varma Kusampudi | Preetham Sathineni | Radhika Mamidi

In a multilingual society, people communicate in more than one language, leading to code-mixed data. Sentiment analysis of code-mixed Telugu-English text (CMTET) poses unique challenges: the unstructured nature of code-mixed data is due to informal language, informal transliterations, and spelling errors. In this paper, we introduce an annotated dataset for sentiment analysis in CMTET and report an accuracy of 80.22% on this dataset using a novel unsupervised data normalization with a multilayer perceptron (MLP) model. The proposed data normalization technique can be extended to any NLP task involving CMTET. Further, we report an accuracy increase of 2.53% from this data normalization approach in our best model.
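As a simple illustration of what normalizing noisy code-mixed tokens can look like (a sketch, not the authors' unsupervised technique), the standard library's difflib can map informal spellings onto frequent in-corpus forms:

```python
import difflib

def normalize(token: str, frequent_forms: list[str]) -> str:
    """Map a noisy token to its closest frequent form, or keep it unchanged."""
    match = difflib.get_close_matches(token.lower(), frequent_forms, n=1, cutoff=0.8)
    return match[0] if match else token

print(normalize("superrr", ["super", "movie", "chala"]))  # -> "super"
```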

pdf bib
Making Your Tweets More Fancy: Emoji Insertion to Texts
Jingun Kwon | Naoki Kobayashi | Hidetaka Kamigaito | Hiroya Takamura | Manabu Okumura

On social media, users frequently use small images called emojis in their posts. Although using emojis in texts plays a key role in recent communication systems, little attention has been paid to their positions in the given texts, even though users carefully choose and place an emoji that matches their post. Exploring the positions of emojis in texts will enhance our understanding of the relationship between emojis and texts. We extend the emoji label prediction task to take emoji position into account, jointly learning the emoji position in a tweet while predicting the emoji label. The results demonstrate that the position of emojis in texts is a good clue for boosting the performance of emoji label prediction. Human evaluation validates that there exists a suitable emoji position in a tweet, and that our proposed task is able to make tweets more fancy and natural. In addition, considering emoji position can further improve performance on the irony detection task compared to emoji label prediction alone. We also report experimental results on a modified version of the dataset, owing to problems with the original dataset from the first SemEval-2018 shared task on emoji label prediction.

pdf bib
Addressing Slot-Value Changes in Task-oriented Dialogue Systems through Dialogue Domain Adaptation
Tiziano Labruna | Bernardo Magnini

Recent task-oriented dialogue systems learn a model from annotated dialogues, and such dialogues are in turn collected and annotated so that they are consistent with certain domain knowledge. However, in real scenarios, domain knowledge is subject to frequent changes, and initial training dialogues may soon become obsolete, resulting in a significant decrease in model performance. In this paper, we investigate the relationship between training dialogues and domain knowledge, and propose Dialogue Domain Adaptation, a methodology aiming at adapting initial training dialogues to changes that have occurred in the domain knowledge. We focus on slot-value changes (e.g., when new slot values become available to describe domain entities) and define an experimental setting for dialogue domain adaptation. First, we show that current state-of-the-art models for dialogue state tracking are still poorly robust to slot-value changes in the domain knowledge. Then, we compare different domain adaptation strategies, showing that simple techniques are effective in reducing the gap between training dialogues and domain knowledge.

pdf bib
Developing a Clinical Language Model for Swedish: Continued Pretraining of Generic BERT with In-Domain Data
Anastasios Lamproudis | Aron Henriksson | Hercules Dalianis

The use of pretrained language models, fine-tuned to perform a specific downstream task, has become widespread in NLP. Using a generic language model in specialized domains may, however, be sub-optimal due to differences in language use and vocabulary. In this paper, it is investigated whether an existing, generic language model for Swedish can be improved for the clinical domain through continued pretraining with clinical text. The generic and domain-specific language models are fine-tuned and evaluated on three representative clinical NLP tasks: (i) identifying protected health information, (ii) assigning ICD-10 diagnosis codes to discharge summaries, and (iii) sentence-level uncertainty prediction. The results show that continued pretraining on in-domain data leads to improved performance on all three downstream tasks, indicating that there is a potential added value of domain-specific language models for clinical NLP.

pdf bib
Frustration Level Annotation in Latvian Tweets with Non-Lexical Means of Expression
Viktorija Leonova | Janis Zuters

We present a neural-network-driven model for annotating frustration intensity in customer support tweets, based on representing tweet texts using a bag-of-words encoding after processing with subword segmentation, together with non-lexical features. The model was evaluated on tweets in English and Latvian, focusing on aspects beyond the pure bag-of-words representations used in previous research. The experimental results show that the model can be successfully applied to texts in a non-English language, and that adding non-lexical features to tweet representations significantly improves performance, while subword segmentation has a moderate but positive effect on model accuracy. Our code and training data are publicly available.

pdf bib
System Combination for Grammatical Error Correction Based on Integer Programming
Ruixi Lin | Hwee Tou Ng

In this paper, we propose a system combination method for grammatical error correction (GEC) based on nonlinear integer programming (IP). Our method optimizes a novel F score objective based on error types, and combines multiple end-to-end GEC systems. The proposed IP approach optimizes the selection of a single best system for each grammatical error type present in the data. Experiments applying the IP approach to combinations of state-of-the-art standalone GEC systems show that the combined system outperforms all standalone systems. It improves the F0.5 score by 3.61% when combining the two best participating systems in the BEA 2019 shared task, achieving an F0.5 score of 73.08%. We also perform experiments comparing our IP approach with another state-of-the-art system combination method for GEC, demonstrating IP's competitive combination capability.
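For illustration (a sketch, not the authors' IP formulation), the combination objective: a choice of one system per error type is scored by the corpus-level F0.5 computed from the pooled per-type counts of the chosen systems.

```python
def f05(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """F-beta from raw counts, guarding against empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r) if p + r else 0.0

def combined_f05(choice: dict, stats: dict) -> float:
    """choice: {error_type: system}; stats[system][error_type] = (tp, fp, fn)."""
    tp = sum(stats[s][t][0] for t, s in choice.items())
    fp = sum(stats[s][t][1] for t, s in choice.items())
    fn = sum(stats[s][t][2] for t, s in choice.items())
    return f05(tp, fp, fn)
```

Because F0.5 is a ratio of pooled counts, the objective is nonlinear in the per-type choices, which is why a nonlinear IP is needed rather than picking the best system for each type independently.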

pdf bib
Naturalness Evaluation of Natural Language Generation in Task-oriented Dialogues Using BERT
Ye Liu | Wolfgang Maier | Wolfgang Minker | Stefan Ultes

This paper presents an automatic method to evaluate the naturalness of natural language generation in dialogue systems. While naturalness was previously assessed through expensive and time-consuming human labor, we present the novel task of automatic naturalness evaluation of generated language. By fine-tuning the BERT model, our proposed naturalness evaluation method shows robust results and outperforms the baselines: support vector machines, bi-directional LSTMs, and BLEURT. In addition, the training speed and evaluation performance of the naturalness model are improved by transfer learning from quality and informativeness linguistic knowledge.

pdf bib
Active Learning for Interactive Relation Extraction in a French Newspaper’s Articles
Cyrielle Mallart | Michel Le Nouy | Guillaume Gravier | Pascale Sébillot

Relation extraction is a subtask of natural language processing that has seen many improvements in recent years, with the advent of complex pre-trained architectures. Many of these state-of-the-art approaches are tested against benchmarks with labelled sentences containing tagged entities, and require substantial pre-training and fine-tuning on task-specific data. However, in a real use-case scenario such as that of a newspaper company mostly dedicated to local information, relations are of varied and highly specific types, with virtually no annotated data for such relations, and many entities co-occur in a sentence without being related. We question the use of supervised state-of-the-art models in such a context, where resources such as time, computing power and human annotators are limited. To adapt to these constraints, we experiment with an active-learning-based relation extraction pipeline, consisting of a binary LSTM-based lightweight model for detecting the relations that do exist, and a state-of-the-art model for relation classification. We compare several choices for the classification model in this scenario, from basic word embedding averaging to graph neural networks and BERT-based models, as well as several active learning acquisition strategies, in order to find the most cost-efficient yet accurate approach for the use case of our company, the largest French daily newspaper.

pdf bib
ROFF - A Romanian Twitter Dataset for Offensive Language
Mihai Manolescu | Çağrı Çöltekin

This paper describes the annotation process of an offensive language data set for Romanian on social media. To facilitate comparable multi-lingual research on offensive language, the annotation guidelines follow some of the recent annotation efforts for other languages. The final corpus contains 5000 micro-blogging posts annotated by a large number of volunteer annotators. The inter-annotator agreement and the initial automatic discrimination results we present are in line with earlier annotation efforts.

pdf bib
Monitoring Fact Preservation, Grammatical Consistency and Ethical Behavior of Abstractive Summarization Neural Models
Iva Marinova | Yolina Petrova | Milena Slavcheva | Petya Osenova | Ivaylo Radev | Kiril Simov

The paper describes a system for automatic summarization, in English, of online news data that come from different non-English languages. The system is designed to be used in a production environment for media monitoring. Automatic summarization can be very helpful in this domain when applied as a helper tool for journalists, so that they can review just the important information from the news channels. However, like every software solution, automatic summarization needs performance monitoring and an assured safe environment for the clients. In a media monitoring environment, the most problematic features to be addressed are: copyright issues, factual consistency, the style of the text, and ethical norms in journalism. Thus, the main contribution of our present work is that the above-mentioned characteristics are successfully monitored in neural automatic summarization models and improved with the help of validation, fact-preserving and fact-checking procedures.

pdf bib
Improving Distantly Supervised Relation Extraction with Self-Ensemble Noise Filtering
Tapas Nayak | Navonil Majumder | Soujanya Poria

Distantly supervised models are very popular for relation extraction since we can obtain a large amount of training data using the distant supervision method without human annotation. In distant supervision, a sentence is considered as a source of a tuple if the sentence contains both entities of the tuple. However, this condition is too permissive and does not guarantee the presence of relevant relation-specific information in the sentence. As such, distantly supervised training data contains much noise which adversely affects the performance of the models. In this paper, we propose a self-ensemble filtering mechanism to filter out the noisy samples during the training process. We evaluate our proposed framework on the New York Times dataset which is obtained via distant supervision. Our experiments with multiple state-of-the-art neural relation extraction models show that our proposed filtering mechanism improves the robustness of the models and increases their F1 scores.

pdf bib
Transfer-based Enrichment of a Hungarian Named Entity Dataset
Attila Novák | Borbála Novák

In this paper, we present a major update to the first Hungarian named entity dataset, the Szeged NER corpus. We used zero-shot cross-lingual transfer to initialize the enrichment of the entity types annotated in the corpus, using three neural NER models: two based on the English OntoNotes corpus and one based on the Czech Named Entity Corpus, fine-tuned from multilingual neural language models. The output of the models was automatically merged with the original NER annotation, corrected automatically and manually, and further enriched with additional annotation, such as qualifiers for various entity types. We present an evaluation of the zero-shot performance of the two OntoNotes-based models and of a transformer-based new NER model trained on the training part of the final corpus. We release the corpus and the trained model.

pdf bib
A Call for Clarity in Contemporary Authorship Attribution Evaluation
Allen Riddell | Haining Wang | Patrick Juola

Recent research has documented that results reported in frequently-cited authorship attribution papers are difficult to reproduce. Inaccessible code and data are often proposed as factors which block successful reproductions. Even when original materials are available, problems remain which prevent researchers from comparing the effectiveness of different methods. To solve the remaining problems (the lack of fixed test sets and the use of inappropriately homogeneous corpora), our paper contributes materials for five closed-set authorship identification experiments. The five experiments feature texts from 106 distinct authors. Experiments involve a range of contemporary non-fiction American English prose. These experiments provide the foundation for comparable and reproducible authorship attribution research involving contemporary writing.

pdf bib
Varieties of Plain Language
Allen Riddell | Yohei Igarashi

Many organizations seek or need to produce documents that are written plainly. In the United States, the Plain Writing Act of 2010 requires that many federal agencies’ documents for the public are written in plain English. In particular, the government’s Plain Language Action and Information Network (PLAIN) recommends that writers use short sentences and everyday words, as does the Securities and Exchange Commission’s Plain English Rule. Since the 1970s, American plain language advocates have moved away from readability measures and favored usability testing and document design considerations. But in this paper we use quantitative measures of sentence length and word difficulty that (1) reveal stylistic variation among PLAIN’s exemplars of plain writing, and (2) help us position PLAIN’s exemplars relative to documents written in other kinds of accessible English (e.g., The New York Times, Voice of America Special English, and Wikipedia) and one academic document likely to be perceived as difficult. Uncombined measures for sentences and vocabulary (left separate, unlike in traditional readability formulas) can complement usability testing and document design considerations, and advance knowledge about different types of plainer English.
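For illustration (a sketch under assumptions: whitespace tokenization and a supplied set of everyday words, not the authors' exact measures), the two uncombined quantities kept separate here are mean sentence length and the share of words outside the easy-word list:

```python
def plainness_measures(sentences: list[str], easy_words: set[str]):
    """Return (mean sentence length in tokens, share of tokens outside easy_words)."""
    tokens = [w.lower().strip('.,;:!?"') for s in sentences for w in s.split()]
    mean_sent_len = len(tokens) / len(sentences)
    hard_share = sum(w not in easy_words for w in tokens) / len(tokens)
    return mean_sent_len, hard_share
```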

pdf bib
Word Discriminations for Vocabulary Inventory Prediction
Frankie Robertson

The aim of vocabulary inventory prediction is to predict a learner’s whole vocabulary based on a limited sample of query words. This paper approaches the problem starting from the 2-parameter Item Response Theory (IRT) model, giving each word in the vocabulary a difficulty and discrimination parameter. The discrimination parameter is evaluated on the sub-problem of question item selection, familiar from the fields of Computerised Adaptive Testing (CAT) and active learning. Next, the effect of the discrimination parameter on prediction performance is examined, both in a binary classification setting, and in an information retrieval setting. Performance is compared with baselines based on word frequency. A number of different generalisation scenarios are examined, including generalising word difficulty and discrimination using word embeddings with a predictor network and testing on out-of-dataset data.
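For reference, a minimal sketch of the 2-parameter IRT model the paper starts from: the probability that a learner of ability theta knows word i is a logistic function of ability minus difficulty b_i, scaled by discrimination a_i.

```python
import math

def p_knows(theta: float, a_i: float, b_i: float) -> float:
    """2PL item response function: P(learner knows word i)."""
    return 1.0 / (1.0 + math.exp(-a_i * (theta - b_i)))

print(p_knows(0.5, a_i=2.0, b_i=0.0))  # ~0.73
```

High-discrimination words separate learners of nearby ability sharply, which is what makes them informative query items in the CAT-style item selection sub-problem.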

pdf bib
FrenLyS: A Tool for the Automatic Simplification of French General Language Texts
Eva Rolin | Quentin Langlois | Patrick Watrin | Thomas François

Lexical simplification (LS) aims at replacing words considered complex in a sentence with simpler equivalents. In this paper, we present the first automatic LS service for French, FrenLyS, which offers different techniques to generate, select and rank substitutes. The paper describes the different methods offered by our tool, which include both classical approaches (e.g., generation of candidates from lexical resources, frequency filters) and more innovative approaches such as the exploitation of CamemBERT, a model for French based on the RoBERTa architecture. To evaluate the different methods, a new evaluation dataset for French is introduced.

pdf bib
Sentiment-Aware Measure (SAM) for Evaluating Sentiment Transfer by Machine Translation Systems
Hadeel Saadany | Constantin Orăsan | Emad Mohamed | Ashraf Tantavy

In translating text where sentiment is the main message, human translators give particular attention to sentiment-carrying words. The reason is that an incorrect translation of such words would miss the fundamental aspect of the source text, i.e. the author’s sentiment. In the online world, MT systems are extensively used to translate User-Generated Content (UGC) such as reviews, tweets, and social media posts, where the main message is often the author’s positive or negative attitude towards the topic of the text. It is important in such scenarios to accurately measure how far an MT system can be a reliable real-life utility in transferring the correct affect message. This paper tackles an under-recognized problem in the field of machine translation evaluation which is judging to what extent automatic metrics concur with the gold standard of human evaluation for a correct translation of sentiment. We evaluate the efficacy of conventional quality metrics in spotting a mistranslation of sentiment, especially when it is the sole error in the MT output. We propose a numerical sentiment-closeness measure appropriate for assessing the accuracy of a translated affect message in UGC text by an MT system. We will show that incorporating this sentiment-aware measure can significantly enhance the correlation of some available quality metrics with the human judgement of an accurate translation of sentiment.
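
As a hedged sketch of the general idea (not the paper's exact formula), a sentiment-closeness term can down-weight a conventional quality score when source and translation disagree in polarity; `sentiment_score` stands in for any classifier returning a polarity in [-1, 1], and `base_metric` for any conventional quality metric:

```python
def sentiment_aware_score(source, translation, base_metric, sentiment_score):
    # Closeness is 1.0 when the polarities match and 0.0 when they are opposite.
    closeness = 1.0 - abs(sentiment_score(source) - sentiment_score(translation)) / 2.0
    return base_metric(source, translation) * closeness
```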

pdf bib
Multilingual Epidemic Event Extraction: From Simple Classification Methods to Open Information Extraction (OIE) and Ontology
Sihem Sahnoun | Gaël Lejeune

The growth of information sources has made an enormous amount of information available in the form of textual documents. To make this information actionable, it is common to use information extraction, and more specifically event extraction, which has become crucial in various domains, including public health. In this paper, we address the problem of epidemic event extraction in potentially any language, testing different corpora on an existing multilingual system for tele-epidemiology: the Data Analysis for Information Extraction in any Language (DANIEL) system. We focus on the influence of the number of documents on the performance of the system: on average it achieves precision and recall around 82%, but when we evaluate event by event, checking whether each event has really been detected, the results are not satisfactory according to this paper’s evaluation. We therefore propose a system that uses an ontology covering specific epidemiological concepts in different languages, and that relies on multilingual open information extraction for the relation extraction step, in order to reduce expert intervention and restrict the content of each text. We describe a methodology of five main stages: pre-processing, relation extraction, named entity recognition (NER), event recognition, and matching the extracted information against the ontology.

pdf bib
Exploiting Domain-Specific Knowledge for Judgment Prediction Is No Panacea
Olivier Salaün | Philippe Langlais | Karim Benyekhlef

Legal judgment prediction (LJP) usually consists of a text classification task aimed at predicting the verdict on the basis of the fact description. The literature shows that using articles as input features helps improve classification performance. In this work, we designed a verdict prediction task based on landlord-tenant disputes and applied BERT-based models fed with different article-based features. Although the results obtained are consistent with the literature, the improvements from the articles are mostly obtained on the most frequent labels, suggesting that pre-trained and fine-tuned transformer-based models do not scale as-is to legal reasoning in real-life scenarios, as they would only excel at accurately predicting the most recurrent verdicts to the detriment of other legal outcomes.

pdf bib
A Semi-Supervised Approach to Detect Toxic Comments
Ghivvago Damas Saraiva | Rafael Anchiêta | Francisco Assis Ricarte Neto | Raimundo Moura

Toxic comments contain forms of non-acceptable language targeted towards groups or individuals. These types of comments have become a serious concern for government organizations, online communities, and social media platforms. Although there are some approaches to handling non-acceptable language, most of them focus on supervised learning and the English language. In this paper, we treat toxic comment detection as a semi-supervised strategy over a heterogeneous graph. We evaluate the approach on a toxic dataset in Portuguese, outperforming several graph-based methods and achieving competitive results compared to transformer architectures.

pdf bib
A Domain-Independent Holistic Approach to Deception Detection
Sadat Shahriar | Arjun Mukherjee | Omprakash Gnawali

Deception in text can take different forms in different domains, including fake news, rumor tweets, and spam emails. Irrespective of the domain, the main intent of deceptive text is to deceive the reader. Although domain-specific deception detection exists, domain-independent deception detection can provide a holistic picture, which can be crucial to understanding how deception occurs in text. In this paper, we detect deception in a domain-independent setting using deep learning architectures. Our method outperforms the state-of-the-art performance on most benchmark datasets with an overall accuracy of 93.42% and F1-score of 93.22%. The domain-independent training allows us to capture subtler nuances of deceptive writing style. Furthermore, we analyze how much in-domain data may be helpful to accurately detect deception, especially for cases where data may not be readily available for training. Our results and analysis indicate that there may be a universal pattern of deception underlying text independent of the domain, which can create a novel area of research and open up new avenues in the field of deception detection.

pdf bib
Towards Domain-Generalizable Paraphrase Identification by Avoiding the Shortcut Learning
Xin Shen | Wai Lam

In this paper, we investigate the Domain Generalization (DG) problem for supervised Paraphrase Identification (PI). We observe that the performance of existing PI models deteriorates dramatically when tested in an out-of-distribution (OOD) domain. We conjecture that it is caused by shortcut learning, i.e., these models tend to utilize the cue words that are unique for a particular dataset or domain. To alleviate this issue and enhance the DG ability, we propose a PI framework based on Optimal Transport (OT). Our method forces the network to learn the necessary features for all the words in the input, which alleviates the shortcut learning problem. Experimental results show that our method improves the DG ability for the PI models.

pdf bib
How to Obtain Reliable Labels for MBTI Classification from Texts?
Sanja Stajner | Seren Yenikent

Automatic detection of the Myers-Briggs Type Indicator (MBTI) from short posts has attracted noticeable attention in the last few years. Recent studies showed that this is quite a difficult task, especially on commonly used Twitter data. Obtaining MBTI labels is also difficult, as human annotation requires trained psychologists, and the automatic way of obtaining them is through long questionnaires of questionable usability for the task. In this paper, we present a method for collecting reliable MBTI labels via only four carefully selected questions that can be applied to any type of textual data.

pdf bib
Does BERT Understand Idioms? A Probing-Based Empirical Study of BERT Encodings of Idioms
Minghuan Tan | Jing Jiang

Understanding idioms is important in NLP. In this paper, we study to what extent a pre-trained BERT model can encode the meaning of a potentially idiomatic expression (PIE) in a certain context. We make use of a few existing datasets and perform two probing tasks: PIE usage classification and idiom paraphrase identification. Our experimental results suggest that BERT can indeed separate the literal and idiomatic usages of a PIE with high accuracy. It is also able to encode the idiomatic meaning of a PIE to some extent.
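
A minimal probing setup in the spirit of the paper might freeze BERT, pool the contextual embedding of the PIE span, and train a lightweight classifier on top; the model name, span handling and data variables (`sentences`, `spans`, `labels`) are assumptions:

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def pie_embedding(sentence, span):
    """Mean-pool the hidden states of the PIE's subword tokens
    (span is a (start, end) pair of BERT token indices)."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**inputs).last_hidden_state[0]  # (seq_len, 768)
    return hidden[span[0]:span[1]].mean(dim=0).numpy()

X = [pie_embedding(s, sp) for s, sp in zip(sentences, spans)]
probe = LogisticRegression(max_iter=1000).fit(X, labels)  # 1 = idiomatic usage
```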

pdf bib
TR-SEQ: Named Entity Recognition Dataset for Turkish Search Engine Queries
Berkay Topçu | İlknur Durgar El-Kahlout

Recognizing named entities in short search engine queries is a difficult task due to their weaker contextual information compared to long sentences. Standard named entity recognition (NER) systems that are trained on grammatically correct and long sentences fail to perform well on such queries. In this study, we share our efforts towards creating a cleaned and labeled dataset of real Turkish search engine queries (TR-SEQ) and introduce an extended label set to satisfy the search engine needs. A NER system is trained by applying the state-of-the-art deep learning method BERT to the collected data and its high performance on search engine queries is reported. Moreover, we compare our results with the state-of-the-art Turkish NER systems.

pdf bib
Contextual-Lexicon Approach for Abusive Language Detection
Francielle Vargas | Fabiana Rodrigues de Góes | Isabelle Carvalho | Fabrício Benevenuto | Thiago Pardo

Since a lexicon-based approach is scientifically more elegant, explains the components of the solution, and is easier to generalize to other applications, this paper provides a new approach for offensive language and hate speech detection on social media, which embodies a lexicon of implicit and explicit offensive and swearing expressions annotated with contextual information. Due to the severity of abusive comments on social media in Brazil, and the lack of research in Portuguese, Brazilian Portuguese is the language used to validate the models. Nevertheless, our method may be applied to any other language. The conducted experiments show the effectiveness of the proposed approach, outperforming the current baseline methods for the Portuguese language.

pdf bib
Comparative Analysis of Fine-tuned Deep Learning Language Models for ICD-10 Classification Task for Bulgarian Language
Boris Velichkov | Sylvia Vassileva | Simeon Gerginov | Boris Kraychev | Ivaylo Ivanov | Philip Ivanov | Ivan Koychev | Svetla Boytcheva

The task of automatically encoding diagnoses into standard medical classifications and ontologies is of great importance in medicine, both to support the daily tasks of physicians in preparing and reporting clinical documentation and for the automatic processing of clinical reports. In this paper we investigate the application and performance of different deep learning transformers for the automatic encoding of clinical texts in Bulgarian into ICD-10. The comparative analysis attempts to find out which approach is more efficient for fine-tuning a pretrained BERT-family transformer to deal with domain-specific terminology in a rare language such as Bulgarian. On the one side, we use SlavicBERT and MultilingualBERT, which are pretrained on common vocabulary in Bulgarian but lack medical terminology. On the other side, we use BioBERT, ClinicalBERT, SapBERT, and BlueBERT, which are pretrained on medical terminology in English but lack Bulgarian language modelling, and moreover lack Cyrillic vocabulary. In our study, all BERT models are fine-tuned with additional medical texts in Bulgarian and then applied to the task of classifying medical diagnoses in Bulgarian into ICD-10 codes. A big corpus of diagnoses in Bulgarian annotated with ICD-10 codes is used for the classification task. Such an analysis gives a good idea of which of the models would be suitable for tasks of a similar type and domain. The experiments and evaluation results show that both approaches achieve comparable accuracy.

pdf bib
Mistake Captioning: A Machine Learning Approach for Detecting Mistakes and Generating Instructive Feedback
Anton Vinogradov | Andrew Miles Byrd | Brent Harrison

Giving feedback to students is not just about marking their answers as correct or incorrect, but also finding mistakes in their thought process that led them to that incorrect answer. In this paper, we introduce a machine learning technique for mistake captioning, a task that attempts to identify mistakes and provide feedback meant to help learners correct these mistakes. We do this by training a sequence-to-sequence network to generate this feedback based on domain experts. To evaluate this system, we explore how it can be used on a Linguistics assignment studying Grimm’s Law. We show that our approach generates feedback that outperforms a baseline on a set of automated NLP metrics. In addition, we perform a series of case studies in which we examine successful and unsuccessful system outputs.

pdf bib
A Novel Machine Learning Based Approach for Post-OCR Error Detection
Shafqat Mumtaz Virk | Dana Dannélls | Azam Sheikh Muhammad

Post-processing is the most conventional approach for correcting errors caused by Optical Character Recognition (OCR) systems. Two steps are usually taken to correct OCR errors: detection and correction. For the first task, supervised machine learning methods have shown state-of-the-art performance. Previously proposed approaches have focused most prominently on combining lexical, contextual and statistical features for detecting errors. In this study, we report a novel approach to error detection which is based merely on the n-gram counts of a candidate token. In addition to being simple and computationally less expensive, our proposed system beats previous systems reported in the ICDAR2019 competition on OCR error detection with notable margins. We achieved state-of-the-art F1-scores for eight out of the ten European languages involved. The maximum improvement is for Spanish, which improved from 0.69 to 0.90, and the minimum for Polish, from 0.82 to 0.84.
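
A sketch of the core signal, assuming a background corpus of clean text (the thresholds or classifier placed on top of these counts are not specified here):

```python
from collections import Counter

def char_ngrams(token, n=3):
    padded = f"^{token}$"                      # mark word boundaries
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def ngram_features(token, ngram_counts):
    """Low counts for a token's character n-grams suggest an OCR error."""
    counts = [ngram_counts.get(g, 0) for g in char_ngrams(token)]
    return min(counts), sum(counts) / len(counts)

# Built once from clean background text:
# ngram_counts = Counter(g for tok in clean_tokens for g in char_ngrams(tok))
```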

pdf bib
ComboNER: A Lightweight All-In-One POS Tagger, Dependency Parser and NER
Aleksander Wawer

Current natural language processing is strongly focused on raising accuracy. The progress comes at the cost of super-heavy models with hundreds of millions or even billions of parameters. However, simple syntactic tasks such as part-of-speech (POS) tagging, dependency parsing or named entity recognition (NER) do not require the largest models to achieve acceptable results. In line with this assumption, we try to minimize the size of a model that jointly performs all three tasks. We introduce ComboNER: a lightweight tool, orders of magnitude smaller than state-of-the-art transformers. It is based on pre-trained subword embeddings and a recurrent neural network architecture. ComboNER operates on Polish language data. The model has outputs for POS tagging, dependency parsing and NER. Our paper contains some insights from fine-tuning the model and reports its overall results.

pdf bib
Investigating Annotator Bias in Abusive Language Datasets
Maximilian Wich | Christian Widmer | Gerhard Hagerer | Georg Groh

Nowadays, social media platforms use classification models to cope with hate speech and abusive language. The problem of these models is their vulnerability to bias. A prevalent form of bias in hate speech and abusive language datasets is annotator bias caused by the annotator’s subjective perception and the complexity of the annotation task. In our paper, we develop a set of methods to measure annotator bias in abusive language datasets and to identify different perspectives on abusive language. We apply these methods to four different abusive language datasets. Our proposed approach supports annotation processes of such datasets and future research addressing different perspectives on the perception of abusive language.

pdf bib
Transformer with Syntactic Position Encoding for Machine Translation
Yikuan Xie | Wenyong Wang | Mingqian Du | Qing He

It has been widely recognized that syntax information can help end-to-end neural machine translation (NMT) systems achieve better translation. In order to integrate dependency information into Transformer-based NMT, existing approaches either exploit words’ local head-dependent relations, ignoring their non-local neighbors carrying important context, or approximate two words’ syntactic relation by their relative distance on the dependency tree, sacrificing exactness. To address these issues, we propose global positional encoding for the dependency tree, a new scheme that facilitates syntactic relation modeling between any two words while keeping exactness and without an immediate-neighbor constraint. Experiment results on the NC11 German-English and English-German and WMT English-German datasets show that our approach is more effective than the above two strategies. In addition, our experiments quantitatively show that, compared with higher layers, lower layers of the model are more proper places to incorporate syntax information, in terms of each layer’s preference for syntactic patterns and the final performance.

pdf bib
Towards Sentiment Analysis of Tobacco Products’ Usage in Social Media
Venkata Himakar Yanamandra | Kartikey Pant | Radhika Mamidi

Contemporary tobacco-related studies are mostly concerned with a single social media platform while missing out on a broader audience. Moreover, they are heavily reliant on labeled datasets, which are expensive to make. In this work, we explore sentiment and product identification on tobacco-related text from two social media platforms. We release the SentiSmoke-Twitter and SentiSmoke-Reddit datasets, along with a comprehensive annotation schema for identifying sentiment towards tobacco products. We then perform benchmarking text classification experiments using state-of-the-art models, including BERT, RoBERTa, and DistilBERT. Our experiments show F1 scores as high as 0.72 for sentiment identification in the Twitter dataset, and, using semi-supervised learning for Reddit, 0.46 for sentiment identification and 0.57 for product identification.

pdf bib
Sentence Structure and Word Relationship Modeling for Emphasis Selection
Haoran Yang | Wai Lam

Emphasis Selection is a newly proposed task which focuses on choosing words for emphasis in short sentences. Traditional methods only consider the sequence information of a sentence while ignoring the rich sentence structure and word relationship information. In this paper, we propose a new framework that considers sentence structure via a sentence structure graph and word relationship via a word similarity graph. The sentence structure graph is derived from the parse tree of a sentence. The word similarity graph allows nodes to share information with their neighbors since we argue that in emphasis selection, similar words are more likely to be emphasized together. Graph neural networks are employed to learn the representation of each node of these two graphs. Experimental results demonstrate that our framework can achieve superior performance.

pdf bib
Utterance Position-Aware Dialogue Act Recognition
Yuki Yano | Akihiro Tamura | Takashi Ninomiya | Hiroaki Obayashi

This study proposes an utterance position-aware approach for a neural network-based dialogue act recognition (DAR) model, which incorporates positional encoding for an utterance’s absolute or relative position. The proposed approach is inspired by the observation that some dialogue acts tend to occur at particular positions. The evaluations on the Switchboard corpus show that the proposed positional encoding of utterances statistically significantly improves the performance of DAR.
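
As one concrete possibility (the sinusoidal form is an assumption; the paper also studies relative positions), an absolute utterance-position encoding can be added to each utterance representation before dialogue act classification:

```python
import numpy as np

def utterance_positional_encoding(position, dim):
    """Sinusoidal encoding of an utterance's absolute position in the dialogue."""
    pe = np.zeros(dim)
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        pe[i] = np.sin(angle)
        if i + 1 < dim:
            pe[i + 1] = np.cos(angle)
    return pe

# e.g. add utterance_positional_encoding(k, hidden_dim) to the k-th utterance vector
```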

pdf bib
Generic Mechanism for Reducing Repetitions in Encoder-Decoder Models
Ying Zhang | Hidetaka Kamigaito | Tatsuya Aoki | Hiroya Takamura | Manabu Okumura

Encoder-decoder models have been commonly used for many tasks such as machine translation and response generation. As previous research reported, these models suffer from generating redundant repetition. In this research, we propose a new mechanism for encoder-decoder models that estimates the semantic difference of a source sentence before and after being fed into the encoder-decoder model to capture the consistency between two sides. This mechanism helps reduce repeatedly generated tokens for a variety of tasks. Evaluation results on publicly available machine translation and response generation datasets demonstrate the effectiveness of our proposal.

pdf bib
Delexicalized Cross-lingual Dependency Parsing for Xibe
He Zhou | Sandra Kübler

Manually annotating a treebank is time-consuming and labor-intensive. We conduct delexicalized cross-lingual dependency parsing experiments, where we train the parser on one language and test on our target language. As our test case, we use Xibe, a severely under-resourced Tungusic language. We assume that choosing a closely related language as the source language will provide better results than more distant relatives. However, it is not clear how to determine those closely related languages. We investigate three different methods: choosing the typologically closest language, using LangRank, and choosing the most similar language based on perplexity. We train parsing models on the selected languages using UDify and test on different genres of Xibe data. The results show that languages selected based on typology and perplexity scores outperform those predicted by LangRank; Japanese is the optimal source language. In determining the source language, proximity to the target language is more important than large training sizes. Parsing is also influenced by genre differences, but these have little effect as long as the training data is at least as complex as the target.
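
The perplexity criterion can be sketched as scoring each candidate source language by a language model's perplexity on Xibe text and picking the lowest; a character-trigram model with add-one smoothing stands in here for whatever model the authors actually use:

```python
import math
from collections import Counter

def train_char_trigram_lm(text):
    tri, bi = Counter(), Counter()
    for i in range(len(text) - 2):
        tri[text[i:i + 3]] += 1  # trigram counts
        bi[text[i:i + 2]] += 1   # context (bigram) counts
    return tri, bi

def perplexity(text, lm, vocab_size):
    tri, bi = lm
    n = len(text) - 2
    logp = sum(
        math.log((tri[text[i:i + 3]] + 1) / (bi[text[i:i + 2]] + vocab_size))
        for i in range(n)
    )
    return math.exp(-logp / n)

# best_source = min(candidates, key=lambda lang: perplexity(xibe_text, lms[lang], V))
```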

up

bib (full) Proceedings of the Student Research Workshop Associated with RANLP 2021

pdf bib
Proceedings of the Student Research Workshop Associated with RANLP 2021
Souhila Djabri | Dinara Gimadi | Tsvetomila Mihaylova | Ivelina Nikolova-Koleva

pdf bib
Towards Code-Mixed Hinglish Dialogue Generation
Vibhav Agarwal | Pooja Rao | Dinesh Babu Jayagopi

Code-mixed language plays a crucial role in communication in multilingual societies. Though the recent growth of web users has greatly boosted the use of such mixed languages, the current generation of dialog systems is primarily monolingual. This increase in the usage of code-mixed language has prompted the need for dialog systems in the same mixed language. We present our work on Code-Mixed Dialog Generation, an unexplored task in code-mixed languages, generating utterances in a code-mixed language rather than in a single language, which is more often just English. We present a new synthetic corpus for code-mixed dialogs, CM-DailyDialog, by converting an existing English-only dialog corpus to a mixed Hindi-English corpus. We then propose a baseline approach in which we show the effectiveness of using mBART-like multilingual sequence-to-sequence transformers for code-mixed dialog generation. Our best-performing dialog models can conduct coherent conversations in mixed Hindi-English, as evaluated by human and automatic metrics, setting new benchmarks for the Code-Mixed Dialog Generation task.

pdf bib
Bilingual Terminology Extraction Using Neural Word Embeddings on Comparable Corpora
Darya Filippova | Burcu Can | Gloria Corpas Pastor

Term and glossary management are vital steps in the preparation of every language specialist, and they play a very important role in the education of translation professionals. The growing trend of efficient time management and the constant time constraints we observe in every job sector increase the necessity of automatic glossary compilation. Many well-performing bilingual AET systems are based on processing parallel data; however, such parallel corpora are not always available for a specific domain or language pair. Domain-specific, bilingual access to information and its retrieval based on comparable corpora is a very promising area of research that requires a detailed analysis of both the available data sources and the possible extraction techniques. This work focuses on domain-specific automatic terminology extraction from comparable corpora for the English-Russian language pair by utilizing neural word embeddings.

pdf bib
Web-sentiment analysis of public comments (public reviews) for languages with limited resources such as the Kazakh language
Dinara Gimadi

In the pandemic period, the stay-at-home trend forced businesses to switch their activities to digital mode; for example, app-based payment methods, social distancing via social media platforms, and other digital means have become an integral part of our lives. Sentiment analysis of textual information in user comments is a topical task in emotion AI because user comments and reviews are not homogeneous, carry sparse context, and can be misleading for both humans and computers. Barriers arise from emotional language enriched with slang, peculiar spelling, transliteration, use of emoji and their symbolic counterparts, and code-switching. For low-resource languages, sentiment analysis has not been explored extensively because of the absence of ready-made tools and linguistic resources. This research focuses on developing a method for aspect-based sentiment analysis of Kazakh-language reviews in the Android Google Play Market.

pdf bib
Compiling a specialised corpus for translation research in the environmental domain
Anastasiia Laktionova

The present study is ongoing research that aims to investigate lexico-grammatical and stylistic features of texts in the environmental domain in English, their implications for translation into Ukrainian, as well as the translation of key terminological units, based on specialised parallel and comparable corpora.

pdf bib
Paragraph Similarity Matches for Generating Multiple-choice Test Items
Halyna Maslak | Ruslan Mitkov

Multiple-choice questions (MCQs) are widely used in knowledge assessment in educational institutions, in job interviews, and in entertainment quizzes and games. Although research on the automatic or semi-automatic generation of multiple-choice test items has been conducted since the beginning of this millennium, most approaches focus on generating questions from a single sentence. In this research, a state-of-the-art method of creating questions based on multiple sentences is introduced. It was inspired by the semantic similarity matches used in the translation memory component of translation management systems. The performance of two deep learning algorithms, doc2vec and SBERT, is compared on the paragraph similarity task. The experiments are performed on an ad-hoc corpus within the EU domain. For the automatic evaluation, a smaller corpus of manually selected matching paragraphs has been compiled. The results demonstrate the good performance of sentence embeddings for the given task.
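
A minimal SBERT paragraph-matching sketch (the model name and corpus variables are assumptions, not the authors' exact configuration):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
corpus_emb = model.encode(corpus_paragraphs, convert_to_tensor=True)
query_emb = model.encode(query_paragraph, convert_to_tensor=True)

scores = util.cos_sim(query_emb, corpus_emb)[0]  # cosine similarity to every paragraph
best = scores.argmax().item()
print(corpus_paragraphs[best])                   # closest match, as in a translation memory
```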

pdf bib
Does local pruning offer task-specific models to learn effectively?
Abhishek Kumar Mishra | Mohna Chakraborty

The need to deploy large-scale pre-trained models on edge devices under limited computational resources has led to substantial research on compressing these large models. However, less attention has been given to compressing task-specific models. In this work, we investigate different methods of unstructured pruning on task-specific models for Aspect-based Sentiment Analysis (ABSA) tasks. Specifically, we analyze differences in the learning dynamics of pruned models, using standard pruning techniques to achieve high-performing sparse networks. We develop a hypothesis to demonstrate the effectiveness of local pruning over global pruning, considering a simple CNN model. Later, we utilize the hypothesis to demonstrate the efficacy of the pruned state-of-the-art model compared to the over-parameterized state-of-the-art model under two settings: the first considering the baselines for the same task used for generating the hypothesis, i.e., aspect extraction, and the second considering a different task, i.e., sentiment analysis. We also discuss the generalization of the pruning hypothesis.

pdf bib
A Dataset for Research on Modelling Depression Severity in Online Forum Data
Isuri Anuradha Nanomi Arachchige | Vihangi Himaya Jayasuriya | Ruvan Weerasinghe

People use online forums either to look for information or to contribute it. Because of their growing popularity, certain online forums have been created specifically to provide support, assistance, and opinions for people suffering from mental illness. Depression is one of the most frequent psychological illnesses worldwide, and people increasingly turn to online forums to find answers about their condition. However, there is no mechanism to measure the severity of depression in each post and to give higher importance to those who are diagnosed as more severely depressed. Although numerous studies based on online forum data and the identification of depression have been conducted, the severity of depression is rarely explored. In addition, the absence of datasets will stymie the development of novel diagnostic procedures for practitioners. In this study, we offer a dataset to support research on the evaluation of depression severity. The computational approach of automatically measuring the severity of depression presented here is quite novel, and such an elaborate measurement of depression severity in online forum posts is needed to ensure that the measurement scales used in our research meet the expected norms of scientific research.

pdf bib
On the Evolution of Word Order
Idan Rejwan | Avi Caciularu

Most natural languages have a predominant or fixed word order. For example in English the word order is usually Subject-Verb-Object. This work attempts to explain this phenomenon as well as other typological findings regarding word order from a functional perspective. In particular, we examine whether fixed word order provides a functional advantage, explaining why these languages are prevalent. To this end, we consider an evolutionary model of language and demonstrate, both theoretically and using genetic algorithms, that a language with a fixed word order is optimal. We also show that adding information to the sentence, such as case markers and noun-verb distinction, reduces the need for fixed word order, in accordance with the typological findings.

pdf bib
EmoPars: A Collection of 30K Emotion-Annotated Persian Social Media Texts
Nazanin Sabri | Reyhane Akhavan | Behnam Bahrak

The wide reach of social media platforms, such as Twitter, has enabled many users to share their thoughts, opinions and emotions on various topics online. The ability to detect these emotions automatically would allow social scientists, as well as businesses, to better understand responses from nations and customers. In this study we introduce a dataset of 30,000 Persian tweets labeled with Ekman’s six basic emotions (Anger, Fear, Happiness, Sadness, Hatred, and Wonder). This is the first publicly available emotion dataset in the Persian language. In this paper, we explain the data collection and labeling scheme used for the creation of this dataset. We also analyze the created dataset, showing the different features and characteristics of the data. Among other things, we investigate the co-occurrence of different emotions in the dataset, and the relationship between the sentiment and emotion of textual instances. The dataset is publicly available at https://github.com/nazaninsbr/Persian-Emotion-Detection.

pdf bib
A Review on Document Information Extraction Approaches
Kanishka Silva | Thushari Silva

Information extraction from documents has become an area of great use for novel natural language processing research. Most entity extraction methodologies vary with context, such as the medical or financial domains, and are often limited to a given language. It would be better to have one generic approach, applicable to any document type, that extracts entity information regardless of language, context, and structure. Another issue in such research is structural analysis while keeping the hierarchical, semantic, and heuristic features. A further problem is that a massive training corpus is usually required. Therefore, this research focuses on mitigating such barriers. Several approaches to building document information extractors focusing on different disciplines have been identified. This research area involves natural language processing, semantic analysis, information extraction, and conceptual modelling. This paper presents a review of information extraction mechanisms in order to construct a generic framework for document extraction, with the aim of providing a solid base for upcoming research.

pdf bib
Question answering in Natural Language: the Special Case of Temporal Expressions
Armand Stricker

Although general question answering has been well explored in recent years, temporal question answering is a task which has not received as much focus. Our work aims to leverage a popular approach used for general question answering, answer extraction, in order to find answers to temporal questions within a paragraph. To train our model, we propose a new dataset, inspired by SQuAD, a state-of-the-art question answering corpus, specifically tailored to provide rich temporal information by adapting the corpus WikiWars, which contains several documents on history’s greatest conflicts. Our evaluation shows that a pattern-matching deep learning model, often used in general question answering, can be adapted to temporal question answering, provided we accept asking questions whose answers must be directly present within the text.

pdf bib
Building A Corporate Corpus For Threads Constitution
Lionel Tadonfouet Tadjou | Fabrice Bourge | Tiphaine Marie | Laurent Romary | Éric de la Clergerie

In this paper we describe the process of building a corporate corpus that will be used as a reference for modelling and computing threads from conversations generated using communication and collaboration tools. The overall goal of the reconstruction of threads is to be able to provide value to the collaborator in various use cases, such as highlighting the important parts of a running discussion, reviewing the upcoming commitments or deadlines, etc. Since, to our knowledge, there is no available corporate corpus for the French language which could allow us to address this problem of thread constitution, we present here a method for building such corpora, including the different aspects and steps which allowed the creation of a pipeline to pseudo-anonymise data. Such a pipeline is a response to the constraints induced by the General Data Protection Regulation (GDPR) in Europe and compliance with the secrecy of correspondence.

pdf bib
Toward Discourse-Aware Models for Multilingual Fake News Detection
Francielle Vargas | Fabrício Benevenuto | Thiago Pardo

Statements that are intentionally misstated (or manipulated) are of considerable interest to researchers, governments, and security and financial systems. According to the deception literature, there are reliable cues for detecting deception, and the belief that liars give off cues that may indicate their deception is near-universal. Given that deceiving actions require advanced cognitive development that honesty simply does not require, and that people’s cognitive mechanisms offer promising guidance for deception detection, in this ongoing Ph.D. research we propose to examine discourse structure patterns in multilingual deceptive news corpora using the Rhetorical Structure Theory framework. Since our work is the first to exploit multilingual discourse-aware strategies for fake news detection, the research community currently lacks multilingual annotated deceptive corpora. Accordingly, this paper describes the current progress of this thesis, including (i) the construction of the first multilingual deceptive corpus, annotated by specialists according to the Rhetorical Structure Theory framework, and (ii) the introduction of two newly proposed rhetorical relations: INTERJECTION and IMPERATIVE, which we assume to be relevant for the fake news detection task.

up

bib (full) Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)

pdf bib
Proceedings of the 14th Workshop on Building and Using Comparable Corpora (BUCC 2021)
Reinhard Rapp | Serge Sharoff | Pierre Zweigenbaum

pdf bib
Syntax-aware Transformers for Neural Machine Translation: The Case of Text to Sign Gloss Translation
Santiago Egea Gómez | Euan McGill | Horacio Saggion

It is well established that the preferred mode of communication of the deaf and hard of hearing (DHH) community is Sign Languages (SLs), but these are considered low-resource languages as far as natural language processing technologies are concerned. In this paper we study the problem of text to SL gloss Machine Translation (MT) using Transformer-based architectures. Despite the significant advances of MT for spoken languages in the recent couple of decades, MT is in its infancy when it comes to SLs. We enrich a Transformer-based architecture by aggregating syntactic information extracted from a dependency parser to word embeddings. We test our model on a well-known dataset, showing that the syntax-aware model obtains performance gains in terms of MT evaluation metrics.

pdf bib
Employing Wikipedia as a resource for Named Entity Recognition in Morphologically complex under-resourced languages
Aravind Krishnan | Stefan Ziehe | Franziska Pannach | Caroline Sporleder

We propose a novel approach for rapid prototyping of named entity recognisers through the development of semi-automatically annotated datasets. We demonstrate the proposed pipeline on two under-resourced agglutinating languages: the Dravidian language Malayalam and the Bantu language isiZulu. Our approach is weakly supervised and bootstraps training data from Wikipedia and Google Knowledge Graph. Moreover, our approach is relatively language independent and can consequently be ported quickly (and hence cost-effectively) from one language to another, requiring only minor language-specific tailoring.

pdf bib
Majority Voting with Bidirectional Pre-translation For Bitext Retrieval
Alexander Jones | Derry Tanti Wijaya

Obtaining high-quality parallel corpora is of paramount importance for training NMT systems. However, as many language pairs lack adequate gold-standard training data, a popular approach has been to mine so-called pseudo-parallel sentences from paired documents in two languages. In this paper, we outline some drawbacks of current methods that rely on an embedding similarity threshold, and propose a heuristic method in its place. Our method involves translating both halves of a paired corpus before mining, and then performing a majority vote on sentence pairs mined in three ways: after translating documents in language x to language y, after translating language y to x, and using the original documents in languages x and y. We demonstrate success with this novel approach on the Tatoeba similarity search benchmark in 64 low-resource languages, and on NMT in Kazakh and Gujarati. We also uncover the effect of resource-related factors (i.e., how much monolingual/bilingual data is available for a given language) on the optimal choice of bitext mining method, demonstrating that there is currently no one-size-fits-all approach for this task. We make the code and data used in our experiments publicly available.
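
The majority vote itself can be sketched as follows, where `mine(docs_x, docs_y)` stands in for any pseudo-parallel sentence miner returning (x_index, y_index) pairs, and the translation functions preserve sentence indices:

```python
def majority_vote_mine(docs_x, docs_y, translate_xy, translate_yx, mine):
    runs = [
        mine(translate_xy(docs_x), docs_y),  # x side pre-translated into y
        mine(docs_x, translate_yx(docs_y)),  # y side pre-translated into x
        mine(docs_x, docs_y),                # original documents
    ]
    votes = {}
    for run in runs:
        for pair in run:
            votes[pair] = votes.get(pair, 0) + 1
    # Keep a sentence pair only if at least two of the three runs agree.
    return {pair for pair, v in votes.items() if v >= 2}
```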

pdf bib
EM Corpus: a comparable corpus for a less-resourced language pair Manipuri-English
Rudali Huidrom | Yves Lepage | Khogendra Khomdram

In this paper, we introduce a sentence-level comparable text corpus crawled and created for the less-resourced language pair, Manipuri (mni) and English (eng). Our monolingual corpora comprise 1.88 million Manipuri sentences and 1.45 million English sentences, and our parallel corpus comprises 124,975 Manipuri-English sentence pairs. These data were crawled and collected over a year from August 2020 to March 2021 from a local newspaper website called ‘The Sangai Express.’ The resources reported in this paper are made available to help the low-resourced languages community with MT/NLP tasks.

pdf bib
On Pronunciations in Wiktionary: Extraction and Experiments on Multilingual Syllabification and Stress Prediction
Winston Wu | David Yarowsky

We constructed parsers for five non-English editions of Wiktionary, which, combined with pronunciations from the English edition, comprise over 5.3 million IPA pronunciations, the largest pronunciation lexicon of its kind. This dataset is a unique comparable corpus of IPA pronunciations annotated from multiple sources. We analyze the dataset, noting the presence of machine-generated pronunciations. We develop a novel visualization method to quantify syllabification. We experiment on the new combined task of multilingual IPA syllabification and stress prediction, finding that training a massively multilingual neural sequence-to-sequence model with copy attention can improve performance on both high- and low-resource languages, and that multi-task training on stress prediction helps with syllabification.

pdf bib
A Dutch Dataset for Cross-lingual Multilabel Toxicity Detection
Ben Burtenshaw | Mike Kestemont

Multi-label toxicity detection is highly prominent, with many research groups, companies, and individuals engaging with it through shared tasks and dedicated venues. This paper describes a cross-lingual approach to annotating multi-label text classification on a newly developed Dutch language dataset, using a model trained on English data. We present an ensemble model of one Transformer model and an LSTM using Multilingual embeddings. The combination of multilingual embeddings and the Transformer model improves performance in a cross-lingual setting.

up

bib (full) Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL 2021)

pdf bib
Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL 2021)
Thoudam Doren Singh | Cristina España i Bonet | Sivaji Bandyopadhyay | Josef van Genabith

pdf bib
Models and Tasks for Human-Centered Machine Translation
Marine Carpuat

In this talk, I will describe current research directions in my group that aim to make machine translation (MT) more human-centered. Instead of viewing MT solely as a task that aims to transduce a source sentence into a well-formed target language equivalent, we revisit all steps of the MT research and development lifecycle with the goal of designing MT systems that are able to help people communicate across language barriers. I will present methods to better characterize the parallel training data that powers MT systems, and how the degree of equivalence impacts translation quality. I will introduce models that enable flexible conditional language generation, and will discuss recent work on framing machine translation tasks and evaluation to center human factors.

pdf bib
Multimodal Simultaneous Machine Translation
Lucia Specia

Simultaneous machine translation (SiMT) aims to translate a continuous input text stream into another language with the lowest latency and highest quality possible. Therefore, translation has to start with an incomplete source text, which is read progressively, creating the need for anticipation. In this talk I will present work where we seek to understand whether the addition of visual information can compensate for the missing source context. We analyse the impact of different multimodal approaches and visual features on state-of-the-art SiMT frameworks, including fixed and dynamic policy approaches using reinforcement learning. Our results show that visual context is helpful and that visually-grounded models based on explicit object region information perform the best. Our qualitative analysis illustrates cases where only the multimodal systems are able to translate correctly from English into gender-marked languages, as well as deal with differences in word order, such as adjective-noun placement between English and French.

pdf bib
Multimodal Neural Machine Translation System for English to Bengali
Shantipriya Parida | Subhadarshi Panda | Satya Prakash Biswal | Ketan Kotwal | Arghyadeep Sen | Satya Ranjan Dash | Petr Motlicek

Multimodal Machine Translation (MMT) systems utilize additional information from modalities beyond text to improve the quality of machine translation (MT). The additional modality is typically in the form of images. Despite proven advantages, it is difficult to develop an MMT system for various languages, primarily due to the lack of suitable multimodal datasets. In this work, we develop an MMT system for English-Bengali using the recently published Bengali Visual Genome (BVG) dataset, which contains images with associated bilingual textual descriptions. Through a comparative study of the developed MMT system vis-a-vis a text-to-text translation system, we demonstrate that the use of multimodal data not only improves translation performance (BLEU improvements of +1.3 on the development set, +3.9 on the evaluation test set, and +0.9 on the challenge test set) but also helps to resolve ambiguities in the pure text description. To the best of our knowledge, our English-Bengali MMT system is the first attempt in this direction, and thus can act as a baseline for subsequent research on MMT for low-resource languages.