Tom Kocmi


2021

pdf bib
Proceedings of the Sixth Conference on Machine Translation
Loic Barrault | Ondrej Bojar | Fethi Bougares | Rajen Chatterjee | Marta R. Costa-jussa | Christian Federmann | Mark Fishel | Alexander Fraser | Markus Freitag | Yvette Graham | Roman Grundkiewicz | Paco Guzman | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Tom Kocmi | Andre Martins | Makoto Morishita | Christof Monz
Proceedings of the Sixth Conference on Machine Translation

2020

pdf bib
Efficiently Reusing Old Models Across Languages via Transfer Learning
Tom Kocmi | Ondřej Bojar
Proceedings of the 22nd Annual Conference of the European Association for Machine Translation

Recent progress in neural machine translation (NMT) is directed towards larger neural networks trained on an increasing amount of hardware resources. As a result, NMT models are costly to train, both financially, due to the electricity and hardware cost, and environmentally, due to the carbon footprint. It is especially true in transfer learning for its additional cost of training the parent model before transferring knowledge and training the desired child model. In this paper, we propose a simple method of re-using an already trained model for different language pairs where there is no need for modifications in model architecture. Our approach does not need a separate parent model for each investigated language pair, as it is typical in NMT transfer learning. To show the applicability of our method, we recycle a Transformer model trained by different researchers and use it to seed models for different language pairs. We achieve better translation quality and shorter convergence times than when training from random initialization.

pdf bib
CUNI Submission for the Inuktitut Language in WMT News 2020CUNI Submission for the Inuktitut Language in WMT News 2020
Tom Kocmi
Proceedings of the Fifth Conference on Machine Translation

This paper describes CUNI submission to the WMT 2020 News Translation Shared Task for the low-resource scenario InuktitutEnglish in both translation directions. Our system combines transfer learning from a CzechEnglish high-resource language pair and backtranslation. We notice surprising behaviour when using synthetic data, which can be possibly attributed to a narrow domain of training and test data. We are using the Transformer model in a constrained submission.

pdf bib
Gender Coreference and Bias Evaluation at WMT 2020WMT 2020
Tom Kocmi | Tomasz Limisiewicz | Gabriel Stanovsky
Proceedings of the Fifth Conference on Machine Translation

Gender bias in machine translation can manifest when choosing gender inflections based on spurious gender correlations. For example, always translating doctors as men and nurses as women. This can be particularly harmful as models become more popular and deployed within commercial systems. Our work presents the largest evidence for the phenomenon in more than 19 systems submitted to the WMT over four diverse target languages : Czech, German, Polish, and Russian. To achieve this, we use WinoMT, a recent automatic test suite which examines gender coreference and bias when translating from English to languages with grammatical gender. We extend WinoMT to handle two new languages tested in WMT : Polish and Czech. We find that all systems consistently use spurious correlations in the data rather than meaningful contextual information.

pdf bib
CUNI Systems for the Unsupervised and Very Low Resource Translation Task in WMT20CUNI Systems for the Unsupervised and Very Low Resource Translation Task in WMT20
Ivana Kvapilíková | Tom Kocmi | Ondřej Bojar
Proceedings of the Fifth Conference on Machine Translation

This paper presents a description of CUNI systems submitted to the WMT20 task on unsupervised and very low-resource supervised machine translation between German and Upper Sorbian. We experimented with training on synthetic data and pre-training on a related language pair. In the fully unsupervised scenario, we achieved 25.5 and 23.7 BLEU translating from and into Upper Sorbian, respectively. Our low-resource systems relied on transfer learning from German-Czech parallel data and achieved 57.4 BLEU and 56.1 BLEU, which is an improvement of 10 BLEU points over the baseline trained only on the available small German-Upper Sorbian parallel corpus.

2019

pdf bib
CUNI Submission for Low-Resource Languages in WMT News 2019CUNI Submission for Low-Resource Languages in WMT News 2019
Tom Kocmi | Ondřej Bojar
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes the CUNI submission to the WMT 2019 News Translation Shared Task for the low-resource languages : Gujarati-English and Kazakh-English. We participated in both language pairs in both translation directions. Our system combines transfer learning from a different high-resource language pair followed by training on backtranslated monolingual data. Thanks to the simultaneous training in both directions, we can iterate the backtranslation process. We are using the Transformer model in a constrained submission.

2018

pdf bib
Trivial Transfer Learning for Low-Resource Neural Machine Translation
Tom Kocmi | Ondřej Bojar
Proceedings of the Third Conference on Machine Translation: Research Papers

Transfer learning has been proven as an effective technique for neural machine translation under low-resource conditions. Existing methods require a common target language, language relatedness, or specific training tricks and regimes. We present a simple transfer learning method, where we first train a parent model for a high-resource language pair and then continue the training on a low-resource pair only by replacing the training corpus. This child model performs significantly better than the baseline trained for low-resource pair only. We are the first to show this for targeting different languages, and we observe the improvements even for unrelated languages with different alphabets.

pdf bib
CUNI Submissions in WMT18CUNI Submissions in WMT18
Tom Kocmi | Roman Sudarikov | Ondřej Bojar
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

We participated in the WMT 2018 shared news translation task in three language pairs : English-Estonian, English-Finnish, and English-Czech. Our main focus was the low-resource language pair of Estonian and English for which we utilized Finnish parallel data in a simple method. We first train a parent model for the high-resource language pair followed by adaptation on the related low-resource language pair. This approach brings a substantial performance boost over the baseline system trained only on Estonian-English parallel data. Our systems are based on the Transformer architecture. For the English to Czech translation, we have evaluated our last year models of hybrid phrase-based approach and neural machine translation mainly for comparison purposes.

pdf bib
CUNI Basque-to-English Submission in IWSLT18CUNI Basque-to-English Submission in IWSLT18
Tom Kocmi | Dušan Variš | Ondřej Bojar
Proceedings of the 15th International Conference on Spoken Language Translation

We present our submission to the IWSLT18 Low Resource task focused on the translation from Basque-to-English. Our submission is based on the current state-of-the-art self-attentive neural network architecture, Transformer. We further improve this strong baseline by exploiting available monolingual data using the back-translation technique. We also present further improvements gained by a transfer learning, a technique that trains a model using a high-resource language pair (Czech-English) and then fine-tunes the model using the target low-resource language pair (Basque-English).

2017

pdf bib
CUNI NMT System for WAT 2017 Translation TasksCUNI NMT System for WAT 2017 Translation Tasks
Tom Kocmi | Dušan Variš | Ondřej Bojar
Proceedings of the 4th Workshop on Asian Translation (WAT2017)

The paper presents this year’s CUNI submissions to the WAT 2017 Translation Task focusing on the Japanese-English translation, namely Scientific papers subtask, Patents subtask and Newswire subtask. We compare two neural network architectures, the standard sequence-to-sequence with attention (Seq2Seq) and an architecture using convolutional sentence encoder (FBConv2Seq), both implemented in the NMT framework Neural Monkey that we currently participate in developing. We also compare various types of preprocessing of the source Japanese sentences and their impact on the overall results. Furthermore, we include the results of our experiments with out-of-domain data obtained by combining the corpora provided for each subtask.

pdf bib
LanideNN : Multilingual Language Identification on Character WindowLanideNN: Multilingual Language Identification on Character Window
Tom Kocmi | Ondřej Bojar
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

In language identification, a common first step in natural language processing, we want to automatically determine the language of some input text. Monolingual language identification assumes that the given document is written in one language. In multilingual language identification, the document is usually in two or three languages and we just want their names. We aim one step further and propose a method for textual language identification where languages can change arbitrarily and the goal is to identify the spans of each of the languages. Our method is based on Bidirectional Recurrent Neural Networks and it performs well in monolingual and multilingual language identification tasks on six datasets covering 131 languages. The method keeps the accuracy also for short documents and across domains, so it is ideal for off-the-shelf use without preparation of training data.