Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Yves Scherrer (Editors)


Anthology ID:
2020.vardial-1
Month:
December
Year:
2020
Address:
Barcelona, Spain (Online)
Venues:
COLING | VarDial
Publisher:
International Committee on Computational Linguistics (ICCL)
URL:
https://aclanthology.org/2020.vardial-1

ASR for Non-standardised Languages with Dialectal Variation: the case of Swiss German
Iuliia Nigmatulina | Tannon Kew | Tanja Samardzic

Strong regional variation, together with the lack of standard orthography, makes Swiss German automatic speech recognition (ASR) particularly difficult in a multi-dialectal setting. This paper focuses on one of the many challenges, namely, the choice of the output text to represent non-standardised Swiss German. We investigate two potential options: a) dialectal writing, i.e. approximate phonemic transcriptions that provide close correspondence between grapheme labels and the acoustic signal but are highly inconsistent, and b) normalised writing, i.e. transcriptions resembling standard German that are relatively consistent but distant from the acoustic signal. To find out which writing facilitates Swiss German ASR, we build several systems using the Kaldi toolkit and a dataset covering 14 regional varieties. A formal comparison shows that the system trained on the normalised transcriptions achieves better results in word error rate (WER) (29.39%) but underperforms at the character level, suggesting dialectal transcriptions offer a viable solution for downstream applications where dialectal differences are important. To better assess word-level performance for dialectal transcriptions, we use a flexible WER measure (FlexWER). When evaluated with this metric, the system trained on dialectal transcriptions outperforms that trained on the normalised writing. Besides establishing a benchmark for Swiss German multi-dialectal ASR, our findings can be helpful in designing ASR systems for other languages without standard orthography.
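The word-level and character-level comparison in this abstract rests on edit-distance error rates. The sketch below shows the standard definitions of WER and character error rate (CER) only. It is not the authors' Kaldi scoring pipeline, and it does not implement FlexWER, whose definition is not given here. The function names are illustrative assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion of a reference token
                            curr[j - 1] + 1,      # insertion of a hypothesis token
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

A system can score worse on WER yet better on CER when its errors are small spelling differences inside otherwise correct words, which is the pattern the abstract reports for dialectal transcriptions.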

Machine-oriented NMT Adaptation for Zero-shot NLP tasks: Comparing the Usefulness of Close and Distant Languages
Amirhossein Tebbifakhr | Matteo Negri | Marco Turchi

Neural Machine Translation (NMT) models are typically trained by considering humans as end-users and maximizing human-oriented objectives. However, in some scenarios, their output is consumed by automatic NLP components rather than by humans. In these scenarios, translation quality is measured in terms of fitness for purpose (i.e. maximizing performance of external NLP tools) rather than in terms of standard human fluency/adequacy criteria. Recently, reinforcement learning techniques exploiting the feedback from downstream NLP tools have been proposed for machine-oriented NMT adaptation. In this work, we tackle the problem in a multilingual setting where a single NMT model translates from multiple languages for downstream automatic processing in the target language. Knowledge sharing across close and distant languages allows us to apply our machine-oriented approach in the zero-shot setting where no labeled data for the test language is seen at training time. Moreover, we incorporate multilingual BERT in the source side of our NMT system to benefit from the knowledge embedded in this model. Our experiments show coherent performance gains for different language directions over both i) generic NMT models (trained for human consumption) and ii) fine-tuned multilingual BERT. This gain for zero-shot language directions (e.g. Spanish-English) is higher when the models are fine-tuned on a closely-related source language (Italian) than a distant one (German).

Character Alignment in Morphologically Complex Translation Sets for Related Languages
Michael Gasser | Binyam Ephrem Seyoum | Nazareth Amlesom Kifle

For languages with complex morphology, word-to-word translation is a task with various potential applications, for example, in information retrieval, language instruction, and dictionary creation, as well as in machine translation. In this paper, we confine ourselves to the subtask of character alignment for the particular case of families of related languages with very few resources for most or all members. There are many such families; we focus on the subgroup of Semitic languages spoken in Ethiopia and Eritrea. We begin with an adaptation of the familiar alignment algorithms behind statistical machine translation, modifying them as appropriate for our task. We show how character alignment can reveal morphological, phonological, and orthographic correspondences among related languages.

Bilingual Lexicon Induction across Orthographically-distinct Under-Resourced Dravidian Languages
Bharathi Raja Chakravarthi | Navaneethan Rajasekaran | Mihael Arcan | Kevin McGuinness | Noel E. O’Connor | John P. McCrae

Bilingual lexicons are a vital tool for under-resourced languages, and recent state-of-the-art approaches to inducing them leverage pretrained monolingual word embeddings using supervised or semi-supervised approaches. However, these approaches require cross-lingual information such as seed dictionaries to train the model and find a linear transformation between the word embedding spaces. Especially in the case of low-resourced languages, seed dictionaries are not readily available, and as such, these methods produce extremely weak results on these languages. In this work, we focus on the Dravidian languages, namely Tamil, Telugu, Kannada, and Malayalam, which are even more challenging as they are written in unique scripts. To take advantage of orthographic information and cognates in these languages, we bring the related languages into a single script. Previous approaches have used linguistically sub-optimal measures such as the Levenshtein edit distance to detect cognates, whereas we demonstrate that the longest common sub-sequence is linguistically more sound and improves the performance of bilingual lexicon induction. We show that our approach can increase the accuracy of bilingual lexicon induction methods on these languages many times, making bilingual lexicon induction approaches feasible for such under-resourced languages.
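The longest common sub-sequence measure mentioned in this abstract can be sketched as follows. This is a minimal illustration that assumes both words have already been brought into a single script. Normalising by the longer word's length is an assumption made here, not necessarily the formulation used in the paper, and the function names are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)             # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))   # skip a character in a or in b
        prev = curr
    return prev[-1]


def lcs_ratio(a, b):
    """LCS length normalised by the longer word (assumed normalisation), in [0, 1]."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))
```

Unlike Levenshtein distance, the subsequence measure rewards shared characters that appear in the same order regardless of what is inserted between them, which tends to suit cognates that differ by added affixes or vowels.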

Recycling and Comparing Morphological Annotation Models for Armenian Diachronic-Variational Corpus Processing
Chahan Vidal-Gorène | Victoria Khurshudyan | Anaïd Donabédian-Demopoulos

Armenian is a language with significant variation and unevenly distributed NLP resources for different varieties. An attempt is made to process an RNN model for morphological annotation on the basis of different Armenian data (provided or not with morphologically annotated corpora), and to compare the annotation results of RNN and rule-based models. Different tests were carried out to evaluate the reuse of an unspecialized model of lemmatization and POS-tagging for under-resourced language varieties. The research focused on three dialects and further extended to Western Armenian with a mean accuracy of 94.00% in lemmatization and 97.02% in POS-tagging, as well as a possible reusability of models to cover different other Armenian varieties. Interestingly, the comparison of an RNN model trained on Eastern Armenian with the Eastern Armenian National Corpus rule-based model applied to Western Armenian showed an enhancement of 19% in parsing. This model covers 88.79% of a short heterogeneous dataset in Western Armenian, and could be a baseline for a massive corpus annotation in that standard. It is argued that an RNN-based model can be a valid alternative to a rule-based one giving consideration to such factors as time-consumption, reusability for different varieties of a target language and significant qualitative results in morphological annotation.

Uralic Language Identification (ULI) 2020 shared task dataset and the Wanca 2017 corpora
Tommi Jauhiainen | Heidi Jauhiainen | Niko Partanen | Krister Lindén

This article introduces the Wanca 2017 web corpora from which the sentences written in minor Uralic languages were collected for the test set of the Uralic Language Identification (ULI) 2020 shared task. We describe the ULI shared task and how the test set was constructed using the Wanca 2017 corpora and texts in different languages from the Leipzig corpora collection. We also provide the results of a baseline language identification experiment conducted using the ULI 2020 dataset.

HeLju@VarDial 2020: Social Media Variety Geolocation with BERT Models
Yves Scherrer | Nikola Ljubešić

This paper describes the Helsinki-Ljubljana contribution to the VarDial shared task on social media variety geolocation. Our solutions are based on the BERT Transformer models, the constrained versions of our models reaching 1st place in two subtasks and 3rd place in one subtask, while our unconstrained models outperform all the constrained systems by a large margin. We show in our analyses that Transformer-based models outperform traditional models by far, and that improvements obtained by pre-training models on large quantities of (mostly standard) text are significant, but not drastic, with single-language models also outperforming multilingual models. Our manual analysis shows that two types of signals are the most crucial for a (mis)prediction: named entities and dialectal features, both of which are handled well by our models.

Experiments in Language Variety Geolocation and Dialect Identification
Tommi Jauhiainen | Heidi Jauhiainen | Krister Lindén

In this paper we describe the systems we used when participating in the VarDial Evaluation Campaign organized as part of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects. The shared tasks we participated in were the second edition of the Romanian Dialect Identification (RDI) and the first edition of the Social Media Variety Geolocation (SMG). The submissions of our SUKI team used generative language models based on Naive Bayes and character n-grams.
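A generative classifier of the kind named in this abstract can be sketched as a multinomial Naive Bayes model over character n-grams. The code below is a generic textbook version with add-alpha smoothing, not the SUKI team's actual system. The class name, the n-gram range and the smoothing constant are assumptions chosen for illustration.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """All character n-grams of lengths n_min..n_max, with one space of padding."""
    padded = f" {text} "
    return [padded[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(padded) - n + 1)]


class CharNgramNaiveBayes:
    """Multinomial Naive Bayes over character n-grams (illustrative sketch)."""

    def __init__(self, n_min=1, n_max=3, alpha=1.0):
        self.n_min, self.n_max, self.alpha = n_min, n_max, alpha

    def fit(self, texts, labels):
        self.ngram_counts = defaultdict(Counter)   # label -> n-gram frequencies
        self.doc_counts = Counter(labels)          # label -> number of training texts
        for text, label in zip(texts, labels):
            self.ngram_counts[label].update(char_ngrams(text, self.n_min, self.n_max))
        self.vocab = {g for counts in self.ngram_counts.values() for g in counts}
        self.totals = {label: sum(c.values()) for label, c in self.ngram_counts.items()}
        return self

    def predict(self, text):
        n_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.doc_counts.items():
            score = math.log(n_label / n_docs)     # log prior of the class
            denom = self.totals[label] + self.alpha * len(self.vocab)
            for gram in char_ngrams(text, self.n_min, self.n_max):
                # Add-alpha smoothing keeps unseen n-grams from zeroing the score.
                score += math.log((self.ngram_counts[label][gram] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Summing log-probabilities rather than multiplying probabilities avoids numerical underflow on longer texts.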

Exploring the Power of Romanian BERT for Dialect Identification
George-Eduard Zaharia | Andrei-Marius Avram | Dumitru-Clementin Cercel | Traian Rebedea

Dialect identification represents a key aspect for improving a series of tasks, for example, opinion mining, considering that the location of the speaker can greatly influence the attitude towards a subject. In this work, we describe the systems developed by our team for VarDial 2020: Romanian Dialect Identification, a task specifically created for challenging participants to solve the previously mentioned issue. More specifically, we introduce a series of neural systems based on Transformers that combine a BERT model exclusively pre-trained on the Romanian language with techniques such as adversarial training or character-level embeddings. By using these approaches, we were able to obtain a 0.6475 macro F1 score on the test dataset, thus allowing us to be ranked 5th out of 8 participant teams.
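The macro F1 score reported in this abstract averages per-class F1 without weighting by class frequency. The sketch below shows the usual definition. Taking the label set as the union of gold and predicted labels is an assumption, and the shared task's official scorer may differ in such details.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)
```

Because every class contributes equally, a system that neglects the rarer dialect is penalised more heavily than it would be under plain accuracy.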

Geolocation of Tweets with a BiLSTM Regression Model
Piyush Mishra

Identifying a user’s location can be useful for recommendation systems, demographic analyses, and disaster outbreak monitoring. Although Twitter allows users to voluntarily reveal their location, such information isn’t universally available. Analyzing a tweet can provide a general estimation of a tweet location while giving insight into the dialect of the user and other linguistic markers. Such linguistic attributes can be used to provide a regional approximation of tweet origins. In this paper, we present a neural regression model that can identify the linguistic intricacies of a tweet to predict the location of the user. The final model identifies the dialect embedded in the tweet and predicts the location of the tweet.