Dana Ruiter


2021

The Effect of Domain and Diacritics in Yoruba–English Neural Machine Translation
David Adelani | Dana Ruiter | Jesujoba Alabi | Damilola Adebonojo | Adesina Ayeni | Mofe Adeyemi | Ayodele Esther Awokoya | Cristina España-Bonet
Proceedings of Machine Translation Summit XVIII: Research Track

Massively multilingual machine translation (MT) has shown impressive capabilities, including zero and few-shot translation between low-resource language pairs. However, these models are often evaluated on high-resource languages with the assumption that they generalize to low-resource ones. The difficulty of evaluating MT models on low-resource pairs is often due to the lack of standardized evaluation datasets. In this paper, we present MENYO-20k, the first multi-domain parallel corpus with a specially curated orthography for Yoruba–English with standardized train-test splits for benchmarking. We provide several neural MT benchmarks and compare them to the performance of popular pre-trained (massively multilingual) MT models both for the heterogeneous test set and its subdomains. Since these pre-trained models use huge amounts of data with uncertain quality, we also analyze the effect of diacritics, a major characteristic of Yoruba, in the training data. We investigate how and when this training condition affects the final quality of a translation and its understandability. Our models outperform massively multilingual models such as Google (+8.7 BLEU) and Facebook M2M (+9.1) when translating to Yoruba, setting a high quality benchmark for future research.
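
As an illustration of the diacritics condition discussed in this abstract, the following is a minimal sketch of how Yoruba text might be stripped of its diacritics (tone marks and under-dots) to simulate undiacritized training data. The function name `strip_diacritics` is hypothetical and this is not the authors' preprocessing code; it assumes only Python's standard `unicodedata` module.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (e.g. tone marks and under-dots) from text.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Decompose each character into its base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever remains into the canonical composed form.
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("Yorùbá"))  # prints "Yoruba"
```

Applying such a function to one side of a parallel corpus would give a diacritic-free variant of the training data to compare against the fully diacritized one.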

2020

Label Propagation-Based Semi-Supervised Learning for Hate Speech Classification
Ashwin Geet D’Sa | Irina Illina | Dominique Fohr | Dietrich Klakow | Dana Ruiter
Proceedings of the First Workshop on Insights from Negative Results in NLP

Research on hate speech classification has received increased attention. In real-life scenarios, only a small amount of labeled hate speech data is available to train a reliable classifier. Semi-supervised learning takes advantage of a small amount of labeled data and a large amount of unlabeled data. In this paper, label propagation-based semi-supervised learning is explored for the task of hate speech classification. The quality of labeling the unlabeled set depends on the input representations. In this work, we show that pre-trained representations are label agnostic and, when used with label propagation, yield poor results. Neural network-based fine-tuning can be adopted to learn task-specific representations using a small amount of labeled data. We show that fully fine-tuned representations may not always be the best representations for label propagation, and intermediate representations may perform better in a semi-supervised setup.
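
For readers unfamiliar with label propagation, the following is a minimal sketch of the generic algorithm on a dense RBF similarity graph. It is not the paper's setup: the function name `label_propagation`, the `sigma` bandwidth, and the toy one-dimensional points are assumptions for illustration, and in the paper's setting the inputs would instead be neural sentence representations.

```python
import math


def label_propagation(points, labels, sigma=1.0, iterations=50):
    """Propagate labels over an RBF similarity graph.

    points: list of feature vectors (lists of floats).
    labels: class index for labeled points, None for unlabeled ones.
    Returns a predicted class index for every point.
    """
    n = len(points)
    classes = sorted({y for y in labels if y is not None})

    def affinity(a, b):
        # RBF similarity: close points get weights near 1, far points near 0.
        dist2 = sum((x - y) ** 2 for x, y in zip(a, b))
        return math.exp(-dist2 / (2 * sigma ** 2))

    # Dense affinity matrix with zero self-similarity.
    weights = [[0.0 if i == j else affinity(points[i], points[j])
                for j in range(n)] for i in range(n)]

    def initial_row(y):
        # One-hot scores for labeled points, uniform scores otherwise.
        if y is None:
            return [1.0 / len(classes)] * len(classes)
        return [1.0 if c == y else 0.0 for c in classes]

    scores = [initial_row(y) for y in labels]

    for _ in range(iterations):
        updated = []
        for i in range(n):
            if labels[i] is not None:
                updated.append(scores[i])  # clamp the known labels
                continue
            total = sum(weights[i])
            # New score: similarity-weighted average of the neighbours' scores.
            updated.append([
                sum(weights[i][j] * scores[j][k] for j in range(n)) / total
                for k in range(len(classes))
            ])
        scores = updated

    return [classes[max(range(len(classes)), key=lambda k: row[k])]
            for row in scores]


# Toy example: two well-separated clusters, one labeled point in each.
points = [[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]]
labels = [0, None, None, 1, None, None]
print(label_propagation(points, labels))  # prints [0, 0, 0, 1, 1, 1]
```

Because the propagated labels depend entirely on the similarity graph, representations that do not separate the classes give poor labels, which is consistent with the abstract's point about label-agnostic pre-trained representations.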

UdS-DFKI@WMT20: Unsupervised MT and Very Low Resource Supervised MT for German-Upper Sorbian
Sourav Dutta | Jesujoba Alabi | Saptarashmi Bandyopadhyay | Dana Ruiter | Josef van Genabith
Proceedings of the Fifth Conference on Machine Translation

This paper describes the UdS-DFKI submission to the shared task for unsupervised machine translation (MT) and very low-resource supervised MT between German (de) and Upper Sorbian (hsb) at the Fifth Conference on Machine Translation (WMT20). We submit systems for both the supervised and unsupervised tracks. Apart from various experimental approaches like bitext mining, model pre-training, and iterative back-translation, we employ a factored machine translation approach on a small BPE vocabulary.

2019

UdS-DFKI Participation at WMT 2019: Low-Resource (en-gu) and Coreference-Aware (en-de) Systems
Cristina España-Bonet | Dana Ruiter
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes the UdS-DFKI submission to the WMT2019 news translation task for Gujarati–English (low-resourced pair) and German–English (document-level evaluation). Our systems rely on the on-line extraction of parallel sentences from comparable corpora for the first scenario and on the inclusion of coreference-related information in the training data in the second one.

Self-Supervised Neural Machine Translation
Dana Ruiter | Cristina España-Bonet | Josef van Genabith
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We present a simple new method where an emergent NMT system is used for simultaneously selecting training data and learning internal NMT representations. This is done in a self-supervised way without parallel data, in such a way that both tasks enhance each other during training. The method is language independent, introduces no additional hyper-parameters, and achieves BLEU scores of 29.21 (en2fr) and 27.36 (fr2en) on newstest2014 using English and French Wikipedia data for training.