Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

Marcos Zampieri, Preslav Nakov, Nikola Ljubešić, Jörg Tiedemann, Yves Scherrer, Tommi Jauhiainen (Editors)

Anthology ID:
Kiyv, Ukraine
EACL | VarDial
Association for Computational Linguistics
Bib Export formats:

pdf bib
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Yves Scherrer | Tommi Jauhiainen

pdf bib
Hierarchical Transformer for Multilingual Machine Translation
Albina Khusainova | Adil Khan | Adín Ramírez Rivera | Vitaly Romanov

The choice of parameter sharing strategy in multilingual machine translation models determines how optimally parameter space is used and hence, directly influences ultimate translation quality. Inspired by linguistic trees that show the degree of relatedness between different languages, the new general approach to parameter sharing in multilingual machine translation was suggested recently. The main idea is to use these expert language hierarchies as a basis for multilingual architecture : the closer two languages are, the more parameters they share. In this work, we test this idea using the Transformer architecture and show that despite the success in previous work there are problems inherent to training such hierarchical models. We demonstrate that in case of carefully chosen training strategy the hierarchical architecture can outperform bilingual models and multilingual models with full parameter sharing.

pdf bib
Representations of Language Varieties Are Reliable Given Corpus Similarity Measures
Jonathan Dunn

This paper measures similarity both within and between 84 language varieties across nine languages. These corpora are drawn from digital sources (the web and tweets), allowing us to evaluate whether such geo-referenced corpora are reliable for modelling linguistic variation. The basic idea is that, if each source adequately represents a single underlying language variety, then the similarity between these sources should be stable across all languages and countries. The paper shows that there is a consistent agreement between these sources using frequency-based corpus similarity measures. This provides further evidence that digital geo-referenced corpora consistently represent local language varieties.

pdf bib
Whit’s the Richt Pairt o Speech : PoS tagging for ScotsPoS tagging for Scots
Harm Lameris | Sara Stymne

In this paper we explore PoS tagging for the Scots language. Scots is spoken in Scotland and Northern Ireland, and is closely related to English. As no linguistically annotated Scots data were available, we manually PoS tagged a small set that is used for evaluation and training. We use English as a transfer language to examine zero-shot transfer and transfer learning methods. We find that training on a very small amount of Scots data was superior to zero-shot transfer from English. Combining the Scots and English data led to further improvements, with a concatenation method giving the best results. We also compared the use of two different English treebanks and found that a treebank containing web data was superior in the zero-shot setting, while it was outperformed by a treebank containing a mix of genres when combined with Scots data.

pdf bib
Discriminating Between Similar Nordic Languages
René Haas | Leon Derczynski

Automatic language identification is a challenging problem. Discriminating between closely related languages is especially difficult. This paper presents a machine learning approach for automatic language identification for the Nordic languages, which often suffer miscategorisation by existing state-of-the-art tools. Concretely we will focus on discrimination between six Nordic languages : Danish, Swedish, Norwegian (Nynorsk), Norwegian (Bokml), Faroese and Icelandic.

pdf bib
Optimizing a Supervised Classifier for a Difficult Language Identification Problem
Yves Bestgen

This paper describes the system developed by the Laboratoire d’analyse statistique des textes for the Dravidian Language Identification (DLI) shared task of VarDial 2021. This task is particularly difficult because the materials consists of short YouTube comments, written in Roman script, from three closely related Dravidian languages, and a fourth category consisting of several other languages in varying proportions, all mixed with English. The proposed system is made up of a logistic regression model which uses as only features n-grams of characters with a maximum length of 5. After its optimization both in terms of the feature weighting and the classifier parameters, it ranked first in the challenge. The additional analyses carried out underline the importance of optimization, especially when the measure of effectiveness is the Macro-F1.

pdf bib
Comparing Approaches to Dravidian Language IdentificationDravidian Language Identification
Tommi Jauhiainen | Tharindu Ranasinghe | Marcos Zampieri

This paper describes the submissions by team HWR to the Dravidian Language Identification (DLI) shared task organized at VarDial 2021 workshop. The DLI training set includes 16,674 YouTube comments written in Roman script containing code-mixed text with English and one of the three South Dravidian languages : Kannada, Malayalam, and Tamil. We submitted results generated using two models, a Naive Bayes classifier with adaptive language models, which has shown to obtain competitive performance in many language and dialect identification tasks, and a transformer-based model which is widely regarded as the state-of-the-art in a number of NLP tasks. Our first submission was sent in the closed submission track using only the training set provided by the shared task organisers, whereas the second submission is considered to be open as it used a pretrained model trained with external data. Our team attained shared second position in the shared task with the submission based on Naive Bayes. Our results reinforce the idea that deep learning methods are not as competitive in language identification related tasks as they are in many other text classification tasks.