Varvara Logacheva


ParaDetox: Detoxification with Parallel Data
Varvara Logacheva | Daryna Dementieva | Sergey Ustyantsev | Daniil Moskovskiy | David Dale | Irina Krotova | Nikita Semenov | Alexander Panchenko
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a novel pipeline for the collection of parallel data for the detoxification task. We collect non-toxic paraphrases for over 10,000 English toxic sentences. We also show that this pipeline can be used to distill a large existing corpus of paraphrases to get toxic-neutral sentence pairs. We release two parallel corpora which can be used for the training of detoxification models. To the best of our knowledge, these are the first parallel datasets for this task. We describe our pipeline in detail to make it fast to set up for a new language or domain, thus contributing to faster and easier development of new parallel resources. We train several detoxification models on the collected data and compare them with several baselines and state-of-the-art unsupervised approaches. We conduct both automatic and manual evaluations. All models trained on parallel data outperform the state-of-the-art unsupervised models by a large margin. This suggests that our novel datasets can boost the performance of detoxification systems.


Which is Better for Deep Learning: Python or MATLAB? Answering Comparative Questions in Natural Language
Viktoriia Chekalina | Alexander Bondarenko | Chris Biemann | Meriem Beloucif | Varvara Logacheva | Alexander Panchenko
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations

We present a system for answering comparative questions (Is X better than Y with respect to Z?) in natural language. Answering such questions is important for assisting humans in making informed decisions. The key component of our system is a natural language interface for comparative QA that can be used in personal assistants, chatbots, and similar NLP devices. Comparative QA is a challenging NLP task, since it requires collecting supporting evidence from many different sources, and direct comparisons of rare objects may not be available even on the entire Web. We take the first step towards a solution for such a task, offering a testbed for comparative QA in natural language by probing several methods and making the three best ones available as an online demo.

Text Detoxification using Large Pre-trained Neural Models
David Dale | Anton Voronov | Daryna Dementieva | Varvara Logacheva | Olga Kozlova | Nikita Semenov | Alexander Panchenko
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We present two novel unsupervised methods for eliminating toxicity in text. Our first method combines two recent ideas: (1) guidance of the generation process with small style-conditional language models and (2) use of paraphrasing models to perform style transfer. We use a well-performing paraphraser guided by style-trained language models to keep the text content and remove toxicity. Our second method uses BERT to replace toxic words with their non-offensive synonyms. We make the method more flexible by enabling BERT to replace mask tokens with a variable number of words. Finally, we present the first large-scale comparative study of style transfer models on the task of toxicity removal. We compare our models with a number of methods for style transfer. The models are evaluated in a reference-free way using a combination of unsupervised style transfer metrics. Both methods we suggest yield new SOTA results.

Detecting Inappropriate Messages on Sensitive Topics that Could Harm a Company’s Reputation
Nikolay Babakov | Varvara Logacheva | Olga Kozlova | Nikita Semenov | Alexander Panchenko
Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing

Not all topics are equally flammable in terms of toxicity: a calm discussion of turtles or fishing fuels inappropriate toxic dialogues less often than a discussion of politics or sexual minorities. We define a set of sensitive topics that can yield inappropriate and toxic messages and describe the methodology of collecting and labelling a dataset for appropriateness. While toxicity in user-generated data is well-studied, we aim at defining a more fine-grained notion of inappropriateness. The core of inappropriateness is that it can harm the reputation of a speaker. This is different from toxicity in two respects: (i) inappropriateness is topic-related, and (ii) an inappropriate message is not toxic but still unacceptable. We collect and release two datasets for Russian: a topic-labelled dataset and an appropriateness-labelled dataset. We also release pre-trained classification models trained on this data.


Proceedings of Knowledgeable NLP: the First Workshop on Integrating Structured Knowledge and Neural Networks for NLP
Oren Sar Shalom | Alexander Panchenko | Cicero dos Santos | Varvara Logacheva | Alessandro Moschitti | Ido Dagan
Proceedings of Knowledgeable NLP: the First Workshop on Integrating Structured Knowledge and Neural Networks for NLP

Word Sense Disambiguation for 158 Languages using Word Embeddings Only
Varvara Logacheva | Denis Teslenko | Artem Shelmanov | Steffen Remus | Dmitry Ustalov | Andrey Kutuzov | Ekaterina Artemova | Chris Biemann | Simone Paolo Ponzetto | Alexander Panchenko
Proceedings of the 12th Language Resources and Evaluation Conference

Disambiguation of word senses in context is easy for humans, but is a major challenge for automatic approaches. Sophisticated supervised and knowledge-based models were developed to solve this task. However, (i) the inherent Zipfian distribution of supervised training instances for a given word and/or (ii) the quality of linguistic knowledge representations motivate the development of completely unsupervised and knowledge-free approaches to word sense disambiguation (WSD). They are particularly useful for under-resourced languages which do not have any resources for building either supervised or knowledge-based models. In this paper, we present a method that takes as input a standard pre-trained word embedding model and induces a fully-fledged word sense inventory, which can be used for disambiguation in context. We use this method to induce a collection of sense inventories for 158 languages on the basis of the original pre-trained fastText word embeddings by Grave et al. (2018), enabling WSD in these languages. Models and system are available online.

Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages
Alina Karakanta | Atul Kr. Ojha | Chao-Hong Liu | Jade Abbott | John Ortega | Jonathan Washington | Nathaniel Oco | Surafel Melaku Lakew | Tommi A Pirinen | Valentin Malykh | Varvara Logacheva | Xiaobing Zhao
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages


MIPT System for World-Level Quality Estimation
Mikhail Mosyagin | Varvara Logacheva
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

We explore different model architectures for the WMT 19 shared task on word-level quality estimation of automatic translation. We start with a model similar to Shef-bRNN, which we modify by using conditional random fields for sequence labelling. Additionally, we use a different approach for labelling gaps and source words. We further develop this model by including features from different sources such as BERT, baseline features for the task and transformer encoders. We evaluate the performance of our models on the English-German dataset for the corresponding shared task.


DeepPavlov: Open-Source Library for Dialogue Systems
Mikhail Burtsev | Alexander Seliverstov | Rafael Airapetyan | Mikhail Arkhipov | Dilyara Baymurzina | Nickolay Bushkov | Olga Gureenkova | Taras Khakhulin | Yuri Kuratov | Denis Kuznetsov | Alexey Litinsky | Varvara Logacheva | Alexey Lymar | Valentin Malykh | Maxim Petrov | Vadim Polulyakh | Leonid Pugachev | Alexey Sorokin | Maria Vikhreva | Marat Zaynutdinov
Proceedings of ACL 2018, System Demonstrations

Adoption of messaging communication and voice assistants has grown rapidly in recent years. This creates a demand for tools that speed up prototyping of feature-rich dialogue systems. The open-source library DeepPavlov is tailored for the development of conversational agents. The library prioritises efficiency, modularity, and extensibility, with the goal of making it easier to develop dialogue systems from scratch and with limited data available. It supports modular as well as end-to-end approaches to the implementation of conversational agents. A conversational agent consists of skills, and every skill can be decomposed into components. Components are usually models which solve typical NLP tasks, such as intent classification or named entity recognition, or resources such as pre-trained word vectors. A sequence-to-sequence chit-chat skill, a question answering skill, or a task-oriented skill can be assembled from components provided in the library.