Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing

Bogdan Babych, Olga Kanishcheva, Preslav Nakov, Jakub Piskorski, Lidia Pivovarova, Vasyl Starko, Josef Steinberger, Roman Yangarber, Michał Marcińczuk, Senja Pollak, Pavel Přibáň, Marko Robnik-Šikonja (Editors)


Anthology ID:
2021.bsnlp-1
Month:
April
Year:
2021
Address:
Kyiv, Ukraine
Venues:
BSNLP | EACL
SIG:
SIGSLAV
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/2021.bsnlp-1

HerBERT: Efficiently Pretrained Transformer-based Language Model for Polish
Robert Mroczkowski | Piotr Rybak | Alina Wróblewska | Ireneusz Gawlik

BERT-based models are currently used for solving nearly all Natural Language Processing (NLP) tasks and most often achieve state-of-the-art results. Therefore, the NLP community conducts extensive research on understanding these models, but above all on designing effective and efficient training procedures. Several ablation studies investigating how to train BERT-like models have been carried out, but the vast majority of them concerned only the English language. A training procedure designed for English does not have to be universal and applicable to other languages, especially typologically different ones. Therefore, this paper presents the first ablation study focused on Polish, which, unlike the isolating English language, is a fusional language. We design and thoroughly evaluate a pretraining procedure of transferring knowledge from multilingual to monolingual BERT-based models. In addition to multilingual model initialization, other factors that possibly influence pretraining are also explored, namely training objective, corpus size, BPE-Dropout, and pretraining length. Based on the proposed procedure, a Polish BERT-based language model, HerBERT, is trained. This model achieves state-of-the-art results on multiple downstream tasks.

Detecting Inappropriate Messages on Sensitive Topics that Could Harm a Company’s Reputation
Nikolay Babakov | Varvara Logacheva | Olga Kozlova | Nikita Semenov | Alexander Panchenko

Not all topics are equally flammable in terms of toxicity: a calm discussion of turtles or fishing less often fuels inappropriate toxic dialogues than a discussion of politics or sexual minorities. We define a set of sensitive topics that can yield inappropriate and toxic messages and describe the methodology of collecting and labelling a dataset for appropriateness. While toxicity in user-generated data is well-studied, we aim at defining a more fine-grained notion of inappropriateness. The core of inappropriateness is that it can harm the reputation of a speaker. This is different from toxicity in two respects: (i) inappropriateness is topic-related, and (ii) an inappropriate message is not toxic but still unacceptable. We collect and release two datasets for Russian: a topic-labelled dataset and an appropriateness-labelled dataset. We also release pre-trained classification models trained on this data.

RuSentEval: Linguistic Source, Encoder Force!
Vladislav Mikhailov | Ekaterina Taktasheva | Elina Sigdel | Ekaterina Artemova

The success of pre-trained transformer language models has brought a great deal of interest in how these models work and what they learn about language. However, prior research in the field is mainly devoted to English, and little is known regarding other languages. To this end, we introduce RuSentEval, an enhanced set of 14 probing tasks for Russian, including ones that have not been explored yet. We apply a combination of complementary probing methods to explore the distribution of various linguistic properties in five multilingual transformers for two typologically contrasting languages, Russian and English. Our results provide intriguing findings that contradict the common understanding of how linguistic knowledge is represented, and demonstrate that some properties are learned in a similar manner despite the language differences.

Exploratory Analysis of News Sentiment Using Subgroup Discovery
Anita Valmarska | Luis Adrián Cabrera-Diego | Elvys Linhares Pontes | Senja Pollak

In this study, we present an exploratory analysis of a Slovenian news corpus, in which we investigate the association between named entities and sentiment in the news. We propose a methodology that combines Named Entity Recognition and Subgroup Discovery, a descriptive rule learning technique for identifying groups of examples that share the same class label (sentiment) and pattern (features: named entities). The approach is used to induce the positive and negative sentiment class rules that reveal interesting patterns related to different Slovenian and international politicians, organizations, and locations.

Creating an Aligned Russian Text Simplification Dataset from Language Learner Data
Anna Dmitrieva | Jörg Tiedemann

Parallel language corpora where regular texts are aligned with their simplified versions can be used in both natural language processing and theoretical linguistic studies. They are essential for the task of automatic text simplification, but can also provide valuable insights into the characteristics that make texts more accessible and reveal strategies that human experts use to simplify texts. Today, there exist a few parallel datasets for English and Simple English, but many other languages lack such data. In this paper we describe our work on creating an aligned Russian-Simple Russian dataset composed of Russian literature texts adapted for learners of Russian as a foreign language. This will be the first parallel dataset in this domain, and one of the first Simple Russian datasets in general.

Multilingual Named Entity Recognition and Matching Using BERT and Dedupe for Slavic Languages
Marko Prelevikj | Slavko Zitnik

This paper describes the University of Ljubljana (UL FRI) Group’s submissions to the shared task at the Balto-Slavic Natural Language Processing (BSNLP) 2021 Workshop. We experiment with multiple BERT-based models, pre-trained on multilingual, Croatian-Slovene-English and Slovene-only data. We perform training iteratively and on the concatenated data of previously available NER datasets. For the normalization task we use the Stanza lemmatizer, while for entity matching we implemented a baseline using the Dedupe library. The evaluation results suggest that multi-source settings outperform less-resourced approaches. The best NER models achieve 0.91 F-score on Slovene training data splits, while the best official submission achieved F-scores of 0.84 and 0.78 for relaxed partial matching and strict settings, respectively. In the multilingual NER setting we achieve F-scores of 0.82 and 0.74.

Benchmarking Pre-trained Language Models for Multilingual NER: TraSpaS at the BSNLP2021 Shared Task
Marek Suppa | Ondrej Jariabka

In this paper we describe TraSpaS, a submission to the third shared task on named entity recognition hosted as part of the Balto-Slavic Natural Language Processing (BSNLP) Workshop. In it we evaluate various pre-trained language models on the NER task using three open-source NLP toolkits: a character-level language model with Stanza, language-specific BERT-style models with SpaCy and Adapter-enabled XLM-R with Trankit. Our results show that the Trankit-based models outperformed those based on the other two toolkits, even when trained on smaller amounts of data. Our code is available at https://github.com/NaiveNeuron/slavner-2021.

Named Entity Recognition and Linking Augmented with Large-Scale Structured Data
Paweł Rychlikowski | Bartłomiej Najdecki | Adrian Lancucki | Adam Kaczmarek

In this paper we describe our submissions to the 2nd and 3rd SlavNER Shared Tasks held at BSNLP 2019 and BSNLP 2021, respectively. The tasks focused on the analysis of Named Entities in multilingual Web documents in Slavic languages with rich inflection. Our solution takes advantage of large collections of both unstructured and structured documents. The former serve as data for unsupervised training of language models and embeddings of lexical units. The latter are Wikipedia and its structured counterpart, Wikidata, our sources of lemmatization rules and real-world entities. With the aid of those resources, our system could recognize, normalize and link entities, while being trained with only small amounts of labeled data.

Slav-NER: the 3rd Cross-lingual Challenge on Recognition, Normalization, Classification, and Linking of Named Entities across Slavic Languages
Jakub Piskorski | Bogdan Babych | Zara Kancheva | Olga Kanishcheva | Maria Lebedeva | Michał Marcińczuk | Preslav Nakov | Petya Osenova | Lidia Pivovarova | Senja Pollak | Pavel Přibáň | Ivaylo Radev | Marko Robnik-Šikonja | Vasyl Starko | Josef Steinberger | Roman Yangarber

This paper describes Slav-NER: the 3rd Multilingual Named Entity Challenge in Slavic languages. The tasks involve recognizing mentions of named entities in Web documents, normalization of the names, and cross-lingual linking. The Challenge covers six languages and five entity types, and is organized as part of the 8th Balto-Slavic Natural Language Processing Workshop, co-located with the EACL 2021 Conference. Ten teams participated in the competition. Performance for the named entity recognition task reached 90% F-measure, much higher than reported in the first edition of the Challenge. Seven teams covered all six languages, and five teams participated in the cross-lingual entity linking task. Detailed evaluation information is available on the shared task web page.