Yves Scherrer


2022

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts
Luciana Benotti | Naoaki Okazaki | Yves Scherrer | Marcos Zampieri
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Tutorial Abstracts

2021

Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Yves Scherrer | Tommi Jauhiainen
Proceedings of the Eighth Workshop on NLP for Similar Languages, Varieties and Dialects

The Helsinki submission to the AmericasNLP shared task
Raúl Vázquez | Yves Scherrer | Sami Virpioja | Jörg Tiedemann
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

The University of Helsinki participated in the AmericasNLP shared task for all ten language pairs. Our multilingual NMT models reached the first rank on all language pairs in track 1, and first rank on nine out of ten language pairs in track 2. We focused our efforts on three aspects: (1) the collection of additional data from various sources such as Bibles and political constitutions, (2) the cleaning and filtering of training data with the OpusFilter toolkit, and (3) different multilingual training techniques enabled by the latest version of the OpenNMT-py toolkit to make the most efficient use of the scarce data. This paper describes our efforts in detail.
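
As a rough illustration of step (2), the sketch below shows the kind of length and length-ratio filtering that OpusFilter applies to parallel data. It is a minimal Python approximation with invented thresholds; the actual submission configured OpusFilter itself rather than writing custom code.

```python
# Illustrative sketch of length-based parallel-data filtering in the style
# of OpusFilter. Thresholds are assumptions, not the paper's settings.

def keep_pair(src: str, tgt: str,
              max_len: int = 100, max_ratio: float = 3.0) -> bool:
    """Return True if a sentence pair passes basic sanity filters."""
    src_toks, tgt_toks = src.split(), tgt.split()
    # Drop empty segments.
    if not src_toks or not tgt_toks:
        return False
    # Drop overly long segments.
    if len(src_toks) > max_len or len(tgt_toks) > max_len:
        return False
    # Drop pairs whose lengths diverge too much (likely misalignments).
    ratio = max(len(src_toks), len(tgt_toks)) / min(len(src_toks), len(tgt_toks))
    return ratio <= max_ratio

pairs = [("a boat on the river", "una canoa en el río"),
         ("x", "a long unrelated segment that a ratio filter should remove")]
print([p for p in pairs if keep_pair(*p)])
```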

Sesame Street to Mount Sinai: BERT-constrained character-level Moses models for multilingual lexical normalization
Yves Scherrer | Nikola Ljubešić
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

This paper describes the HEL-LJU submissions to the MultiLexNorm shared task on multilingual lexical normalization. Our system is based on a BERT token classification preprocessing step, where for each token the type of the necessary transformation is predicted (none, uppercase, lowercase, capitalize, modify), and a character-level SMT step where the text is translated from original to normalized given the BERT-predicted transformation constraints. For some languages, depending on the results on development data, the training data was extended by back-translating OpenSubtitles data. In the final ranking of the ten participating teams, HEL-LJU took second place, scoring better than the previous state of the art.
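
The division of labour between the two steps can be made concrete with a small sketch: deterministic labels are applied directly, and only "modify" tokens reach the character-level SMT system. This is an illustrative reconstruction; `smt_normalize` is a hypothetical stand-in for the Moses decoding step, not the actual system.

```python
# Hedged reconstruction of the constraint step: deterministic labels are
# applied directly, and only "modify" tokens are routed to the
# character-level SMT model. `smt_normalize` is a hypothetical placeholder.

def apply_constraints(tokens, labels, smt_normalize):
    out = []
    for tok, lab in zip(tokens, labels):
        if lab == "uppercase":
            out.append(tok.upper())
        elif lab == "lowercase":
            out.append(tok.lower())
        elif lab == "capitalize":
            out.append(tok.capitalize())
        elif lab == "modify":
            out.append(smt_normalize(tok))  # character-level translation
        else:                               # "none": keep as is
            out.append(tok)
    return out

toy_lexicon = {"soo": "so", "2morrow": "tomorrow"}
print(apply_constraints(["soo", "cool", "2morrow"],
                        ["modify", "none", "modify"],
                        smt_normalize=lambda t: toy_lexicon.get(t, t)))
```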

2020

Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects
Marcos Zampieri | Preslav Nakov | Nikola Ljubešić | Jörg Tiedemann | Yves Scherrer
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

HeLju@VarDial 2020: Social Media Variety Geolocation with BERT Models
Yves Scherrer | Nikola Ljubešić
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

This paper describes the Helsinki-Ljubljana contribution to the VarDial shared task on social media variety geolocation. Our solutions are based on BERT Transformer models; the constrained versions of our models reached 1st place in two subtasks and 3rd place in one subtask, while our unconstrained models outperformed all the constrained systems by a large margin. We show in our analyses that Transformer-based models outperform traditional models by far, and that improvements obtained by pre-training models on large quantities of (mostly standard) text are significant but not drastic, with single-language models also outperforming multilingual models. Our manual analysis shows that two types of signals are the most crucial for a (mis)prediction: named entities and dialectal features, both of which are handled well by our models.
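
One plausible way to read "geolocation with BERT models" is as coordinate regression on top of a pretrained encoder, as in the hedged sketch below; the model name, the [CLS] pooling, and the MSE objective are illustrative assumptions, not necessarily the submission's exact setup.

```python
# Hedged sketch: geolocation as coordinate regression on top of a BERT
# encoder. Model name, pooling and loss are assumptions for illustration.
import torch.nn as nn
from transformers import AutoModel

class GeoBert(nn.Module):
    def __init__(self, model_name="bert-base-multilingual-cased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.head = nn.Linear(self.encoder.config.hidden_size, 2)  # (lat, lon)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids, attention_mask=attention_mask)
        cls = hidden.last_hidden_state[:, 0]  # [CLS] representation
        return self.head(cls)                 # predicted coordinates

# Training would minimize e.g. nn.MSELoss() between predicted and gold
# coordinates; geolocation tasks typically score by distance to the gold point.
```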

The University of Helsinki and Aalto University submissions to the WMT 2020 news and low-resource translation tasks
Yves Scherrer | Stig-Arne Grönroos | Sami Virpioja
Proceedings of the Fifth Conference on Machine Translation

This paper describes the joint participation of the University of Helsinki and Aalto University in two shared tasks of WMT 2020: the news translation task between Inuktitut and English and the low-resource translation task between German and Upper Sorbian. For both tasks, our efforts concentrate on the efficient use of monolingual and related bilingual corpora with scheduled multi-task learning, as well as on optimized subword segmentation with sampling. Our submission obtained the highest score for Upper Sorbian-German and was ranked second for German-Upper Sorbian according to BLEU scores. For English-Inuktitut, we reached ranks 8 and 10 out of 11 according to BLEU scores.
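
Subword segmentation with sampling can be illustrated with SentencePiece, which resegments the same sentence differently on each call and thereby acts as a regularizer in low-resource settings; the model path and sampling parameters below are placeholders, not the submission's actual values.

```python
# Hedged illustration of subword segmentation with sampling using
# SentencePiece. The model file path and parameters are placeholders.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="subwords.model")  # placeholder

text = "Die Übersetzung obersorbischer Sätze"
for _ in range(3):
    # nbest_size=-1 samples from all segmentations; alpha controls the
    # smoothing of the sampling distribution.
    print(sp.encode(text, out_type=str, enable_sampling=True,
                    nbest_size=-1, alpha=0.1))
```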

TaPaCo: A Corpus of Sentential Paraphrases for 73 Languages
Yves Scherrer
Proceedings of the 12th Language Resources and Evaluation Conference

This paper presents TaPaCo, a freely available paraphrase corpus for 73 languages extracted from the Tatoeba database. Tatoeba is a crowdsourcing project mainly geared towards language learners. Its aim is to provide example sentences and translations for particular linguistic constructions and words. The paraphrase corpus is created by populating a graph with Tatoeba sentences and equivalence links between sentences meaning the same thing. This graph is then traversed to extract sets of paraphrases. Several language-independent filters and pruning steps are applied to remove uninteresting sentences. A manual evaluation performed on three languages shows that between half and three quarters of the inferred paraphrases are correct and that most remaining ones are either correct but trivial, or near-paraphrases that neutralize a morphological distinction. The corpus contains a total of 1.9 million sentences, with 200,000 to 250,000 sentences per language. It covers a range of languages for which, to our knowledge, no other paraphrase dataset exists. The dataset is available at https://doi.org/10.5281/zenodo.3707949.
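
The extraction step can be pictured as finding connected components in a sentence graph and keeping the same-language sentences of each component, as in this simplified sketch (the data structures and toy data are invented for illustration):

```python
# Simplified sketch of the extraction: sentences are nodes, equivalence
# links are edges, and the same-language sentences of each connected
# component form a candidate paraphrase set. Toy data is invented.
from collections import defaultdict

def paraphrase_sets(sentences, links):
    """sentences: {id: (lang, text)}; links: iterable of (id, id) pairs."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    seen, result = set(), []
    for node in sentences:
        if node in seen:
            continue
        stack, comp = [node], []
        while stack:                      # depth-first traversal
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.append(n)
            stack.extend(graph[n])
        by_lang = defaultdict(list)       # group the component by language
        for n in comp:
            lang, text = sentences[n]
            by_lang[lang].append(text)
        result.extend(texts for texts in by_lang.values() if len(texts) > 1)
    return result

sents = {1: ("eng", "He is hungry."), 2: ("eng", "He wants to eat."),
         3: ("deu", "Er hat Hunger.")}
print(paraphrase_sets(sents, [(1, 3), (2, 3)]))
```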

2019

Analysing concatenation approaches to document-level NMT in two different domains
Yves Scherrer | Jörg Tiedemann | Sharid Loáiciga
Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019)

In this paper, we investigate how different aspects of discourse context affect the performance of recent neural MT systems. We describe two popular datasets covering news and movie subtitles and we provide a thorough analysis of the distribution of various document-level features in their domains. Furthermore, we train a set of context-aware MT models on both datasets and propose a comparative evaluation scheme that contrasts coherent context with artificially scrambled documents and absent context, arguing that the impact of discourse-aware MT models will become visible in this way. Our results show that the models are indeed affected by the manipulation of the test data, providing a different view on document-level translation quality than absolute sentence-level scores.
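
The scrambling manipulation at the heart of the evaluation scheme is simple to state in code; the sketch below is an illustrative reconstruction, not the authors' released implementation.

```python
# Illustrative reconstruction of the "scrambled documents" condition:
# shuffling sentence order within each document destroys coherent context
# while keeping the sentences themselves identical.
import random

def scramble_documents(docs, seed=42):
    rng = random.Random(seed)
    scrambled = []
    for doc in docs:
        shuffled = doc[:]       # copy, leave the original intact
        rng.shuffle(shuffled)
        scrambled.append(shuffled)
    return scrambled

docs = [["He opened the door.", "The room was dark.", "She waited."]]
print(scramble_documents(docs))
```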

Measuring Semantic Abstraction of Multilingual NMT with Paraphrase Recognition and Generation Tasks
Jörg Tiedemann | Yves Scherrer
Proceedings of the 3rd Workshop on Evaluating Vector Space Representations for NLP

In this paper, we investigate whether multilingual neural translation models learn stronger semantic abstractions of sentences than bilingual ones. We test this hypothesis by measuring the perplexity of such models when applied to paraphrases of the source language. The intuition is that an encoder produces better representations if a decoder is capable of recognizing synonymous sentences in the same language even though the model is never trained for that task. In our setup, we add 16 different auxiliary languages to a bidirectional bilingual baseline model (English-French) and test it with in-domain and out-of-domain paraphrases in English. The results show that the perplexity is significantly reduced in each of the cases, indicating that meaning can be grounded in translation. This is further supported by a study on paraphrase generation that we also include at the end of the paper.
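
The core measurement reduces to converting the token log-probabilities that the decoder assigns to a paraphrase into a perplexity score; a minimal sketch with invented numbers:

```python
# Minimal sketch of the perplexity measure: force-decode a paraphrase and
# convert the mean token log-probability into perplexity. The numbers are
# invented; in the paper they come from the multilingual NMT decoder.
import math

def perplexity(token_logprobs):
    """token_logprobs: natural-log probabilities of each target token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

print(perplexity([-0.9, -1.2, -0.4, -2.1]))  # lower = better recognized
```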

The University of Helsinki Submissions to the WMT19 News Translation Task
Aarne Talman | Umut Sulubacak | Raúl Vázquez | Yves Scherrer | Sami Virpioja | Alessandro Raganato | Arvi Hurskainen | Jörg Tiedemann
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

In this paper we present the University of Helsinki submissions to the WMT 2019 shared news translation task in three language pairs: English-German, English-Finnish and Finnish-English. This year we focused first on cleaning and filtering the training data using multiple data-filtering approaches, resulting in much smaller and cleaner training sets. For English-German we trained sentence-level transformer models and compared different document-level translation approaches. For Finnish-English and English-Finnish we focused on different segmentation approaches, and we also included a rule-based system for English-Finnish.

2017

Findings of the VarDial Evaluation Campaign 2017
Marcos Zampieri | Shervin Malmasi | Nikola Ljubešić | Preslav Nakov | Ahmed Ali | Jörg Tiedemann | Yves Scherrer | Noëmi Aepli
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

We present the results of the VarDial Evaluation Campaign on Natural Language Processing (NLP) for Similar Languages, Varieties and Dialects, which we organized as part of the fourth edition of the VarDial workshop at EACL’2017. This year, we included four shared tasks: Discriminating between Similar Languages (DSL), Arabic Dialect Identification (ADI), German Dialect Identification (GDI), and Cross-lingual Dependency Parsing (CLP). A total of 19 teams submitted runs across the four tasks, and 15 of them wrote system description papers.

Multi-source morphosyntactic tagging for spoken Rusyn
Yves Scherrer | Achim Rabus
Proceedings of the Fourth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial)

This paper deals with the development of morphosyntactic taggers for spoken varieties of the Slavic minority language Rusyn. As neither annotated corpora nor parallel corpora are electronically available for Rusyn, we propose to combine existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish and adapt them to Rusyn. Using MarMoT as the tagging toolkit, we show that a tagger trained on a balanced set of the four source languages outperforms single-language taggers by about 9%, and that additional automatically induced morphosyntactic lexicons lead to further improvements. The best observed accuracies for Rusyn are 82.4% for part-of-speech tagging and 75.5% for full morphological tagging.
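
The "balanced set of the four source languages" can be approximated by downsampling each source corpus to a common size before training, as in this hedged sketch (the sampling strategy is an assumption; MarMoT itself is a Java tool and is not shown here):

```python
# Hedged sketch of the "balanced set" construction: each source-language
# corpus is downsampled to a common size before the tagger is trained.
# The strategy is an assumption, not the paper's documented procedure.
import random

def balanced_training_set(corpora, seed=1):
    """corpora: {lang: list of tagged sentences}; balance to the smallest."""
    rng = random.Random(seed)
    n = min(len(sents) for sents in corpora.values())
    combined = []
    for sents in corpora.values():
        combined.extend(rng.sample(sents, n))
    rng.shuffle(combined)
    return combined
```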

Lexicon Induction for Spoken Rusyn – Challenges and Results
Achim Rabus | Yves Scherrer
Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing

This paper reports on challenges and results in developing NLP resources for spoken Rusyn. As a Slavic minority language, Rusyn has no existing NLP resources of its own to draw on. We propose to build a morphosyntactic dictionary for Rusyn by combining existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish. We adapt these resources to Rusyn using vowel-sensitive Levenshtein distance, hand-written language-specific transformation rules, and combinations of the two. Compared to an exact-match baseline, we increase the coverage of the resulting morphological dictionary by up to 77.4% relative (42.9% absolute), which increases tagging recall by 11.6% relative (9.1% absolute). Our research confirms and expands the results of previous studies showing the effectiveness of using NLP resources from neighboring languages for low-resource languages.
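
A vowel-sensitive Levenshtein distance simply discounts vowel-for-vowel substitutions relative to other edits; the sketch below illustrates the idea with invented costs and a rough vowel inventory, not the paper's tuned parameters.

```python
# Hedged sketch of a vowel-sensitive Levenshtein distance: substituting one
# vowel for another is cheaper than other substitutions, reflecting regular
# vowel correspondences between Rusyn and its neighbours. Costs and the
# vowel inventory are illustrative assumptions.
VOWELS = set("aeiouyаеиіоуыєюя")

def vowel_levenshtein(a, b, vowel_cost=0.5, sub_cost=1.0):
    m, n = len(a), len(b)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i
    for j in range(1, n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i-1] == b[j-1]:
                cost = 0.0
            elif a[i-1] in VOWELS and b[j-1] in VOWELS:
                cost = vowel_cost          # discounted vowel substitution
            else:
                cost = sub_cost
            d[i][j] = min(d[i-1][j] + 1,       # deletion
                          d[i][j-1] + 1,       # insertion
                          d[i-1][j-1] + cost)  # substitution
    return d[m][n]

print(vowel_levenshtein("хыжа", "хижа"))  # vowel change only -> 0.5
```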

Neural Machine Translation with Extended Context
Jörg Tiedemann | Yves Scherrer
Proceedings of the Third Workshop on Discourse in Machine Translation

We investigate the use of extended context in attention-based neural machine translation. We base our experiments on translated movie subtitles and discuss the effect of increasing the segments beyond single translation units. We study the use of extended source language context as well as bilingual context extensions. The models learn to distinguish between information from different segments and are surprisingly robust with respect to translation quality. In this pilot study, we observe interesting cross-sentential attention patterns that improve textual coherence in translation at least in some selected cases.
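
Concretely, extended-context inputs can be built by concatenating each sentence with its predecessor(s) around a special boundary token, as in this illustrative sketch (the token name and context size are assumptions):

```python
# Illustrative sketch of extended-context input construction: each source
# sentence is concatenated with its predecessor(s) around a boundary token.
# The token name and context size are assumptions for illustration.
BREAK = "_BREAK_"

def with_context(sentences, context_size=1):
    """Build concatenated inputs from a document's sentence list."""
    examples = []
    for i, sent in enumerate(sentences):
        context = sentences[max(0, i - context_size):i]
        examples.append(f" {BREAK} ".join(context + [sent]))
    return examples

for ex in with_context(["She sat down.", "The phone rang.", "It was him."]):
    print(ex)
```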