Philipp Koehn


2021

pdf bib
Learning Curricula for Multilingual Neural Machine Translation Training
Gaurav Kumar | Philipp Koehn | Sanjeev Khudanpur
Proceedings of Machine Translation Summit XVIII: Research Track

Low-resource Multilingual Neural Machine Translation (MNMT) is typically tasked with improving the translation performance on one or more language pairs with the aid of high-resource language pairs. In this paper and we propose two simple search based curricula orderings of the multilingual training data which help improve translation performance in conjunction with existing techniques such as fine-tuning. Additionally and we attempt to learn a curriculum for MNMT from scratch jointly with the training of the translation system using contextual multi-arm bandits. We show on the FLORES low-resource translation dataset that these learned curricula can provide better starting points for fine tuning and improve overall performance of the translation system.

pdf bib
Adapting High-resource NMT Models to Translate Low-resource Related Languages without Parallel DataNMT Models to Translate Low-resource Related Languages without Parallel Data
Wei-Jen Ko | Ahmed El-Kishky | Adithya Renduchintala | Vishrav Chaudhary | Naman Goyal | Francisco Guzmán | Pascale Fung | Philipp Koehn | Mona Diab
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The scarcity of parallel data is a major obstacle for training high-quality machine translation systems for low-resource languages. Fortunately, some low-resource languages are linguistically related or similar to high-resource languages ; these related languages may share many lexical or syntactic structures. In this work, we exploit this linguistic overlap to facilitate translating to and from a low-resource language with only monolingual data, in addition to any parallel data in the related high-resource language. Our method, NMT-Adapt, combines denoising autoencoding, back-translation and adversarial objectives to utilize monolingual data for low-resource adaptation. We experiment on 7 languages from three different language families and show that our technique significantly improves translation into low-resource language compared to other translation baselines.

pdf bib
XLEnt : Mining a Large Cross-lingual Entity Dataset with Lexical-Semantic-Phonetic Word AlignmentXLEnt: Mining a Large Cross-lingual Entity Dataset with Lexical-Semantic-Phonetic Word Alignment
Ahmed El-Kishky | Adithya Renduchintala | James Cross | Francisco Guzmán | Philipp Koehn
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Cross-lingual named-entity lexica are an important resource to multilingual NLP tasks such as machine translation and cross-lingual wikification. While knowledge bases contain a large number of entities in high-resource languages such as English and French, corresponding entities for lower-resource languages are often missing. To address this, we propose Lexical-Semantic-Phonetic Align (LSP-Align), a technique to automatically mine cross-lingual entity lexica from mined web data. We demonstrate LSP-Align outperforms baselines at extracting cross-lingual entity pairs and mine 164 million entity pairs from 120 different languages aligned with English. We release these cross-lingual entity pairs along with the massively multilingual tagged named entity corpus as a resource to the NLP community.

pdf bib
Proceedings of the Sixth Conference on Machine Translation
Loic Barrault | Ondrej Bojar | Fethi Bougares | Rajen Chatterjee | Marta R. Costa-jussa | Christian Federmann | Mark Fishel | Alexander Fraser | Markus Freitag | Yvette Graham | Roman Grundkiewicz | Paco Guzman | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Tom Kocmi | Andre Martins | Makoto Morishita | Christof Monz
Proceedings of the Sixth Conference on Machine Translation

pdf bib
The JHU-Microsoft Submission for WMT21 Quality Estimation Shared TaskJHU-Microsoft Submission for WMT21 Quality Estimation Shared Task
Shuoyang Ding | Marcin Junczys-Dowmunt | Matt Post | Christian Federmann | Philipp Koehn
Proceedings of the Sixth Conference on Machine Translation

This paper presents the JHU-Microsoft joint submission for WMT 2021 quality estimation shared task. We only participate in Task 2 (post-editing effort estimation) of the shared task, focusing on the target-side word-level quality estimation. The techniques we experimented with include Levenshtein Transformer training and data augmentation with a combination of forward, backward, round-trip translation, and pseudo post-editing of the MT output. We demonstrate the competitiveness of our system compared to the widely adopted OpenKiwi-XLM baseline. Our system is also the top-ranking system on the MT MCC metric for the English-German language pair.

pdf bib
Zero-Shot Cross-Lingual Dependency Parsing through Contextual Embedding Transformation
Haoran Xu | Philipp Koehn
Proceedings of the Second Workshop on Domain Adaptation for NLP

Linear embedding transformation has been shown to be effective for zero-shot cross-lingual transfer tasks and achieve surprisingly promising results. However, cross-lingual embedding space mapping is usually studied in static word-level embeddings, where a space transformation is derived by aligning representations of translation pairs that are referred from dictionaries. We move further from this line and investigate a contextual embedding alignment approach which is sense-level and dictionary-free. To enhance the quality of the mapping, we also provide a deep view of properties of contextual embeddings, i.e., the anisotropy problem and its solution. Experiments on zero-shot dependency parsing through the concept-shared space built by our embedding transformation substantially outperform state-of-the-art methods using multilingual embeddings.

2020

pdf bib
Simulated multiple reference training improves low-resource machine translation
Huda Khayrallah | Brian Thompson | Matt Post | Philipp Koehn
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Many valid translations exist for a given sentence, yet machine translation (MT) is trained with a single reference translation, exacerbating data sparsity in low-resource settings. We introduce Simulated Multiple Reference Training (SMRT), a novel MT training method that approximates the full space of possible translations by sampling a paraphrase of the reference sentence from a paraphraser and training the MT model to predict the paraphraser’s distribution over possible tokens. We demonstrate the effectiveness of SMRT in low-resource settings when translating to English, with improvements of 1.2 to 7.0 BLEU. We also find SMRT is complementary to back-translation.

pdf bib
ParaCrawl : Web-Scale Acquisition of Parallel CorporaParaCrawl: Web-Scale Acquisition of Parallel Corpora
Marta Bañón | Pinzhen Chen | Barry Haddow | Kenneth Heafield | Hieu Hoang | Miquel Esplà-Gomis | Mikel L. Forcada | Amir Kamran | Faheem Kirefu | Philipp Koehn | Sergio Ortiz Rojas | Leopoldo Pla Sempere | Gema Ramírez-Sánchez | Elsa Sarrías | Marek Strelec | Brian Thompson | William Waites | Dion Wiggins | Jaume Zaragoza
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We report on methods to create the largest publicly available parallel corpora by crawling the web, using open source software. We empirically compare alternative methods and publish benchmark data sets for sentence alignment and sentence pair filtering. We also describe the parallel corpora released and evaluate their quality and their usefulness to create machine translation systems.

pdf bib
Proceedings of the Fifth Conference on Machine Translation
Loïc Barrault | Ondřej Bojar | Fethi Bougares | Rajen Chatterjee | Marta R. Costa-jussà | Christian Federmann | Mark Fishel | Alexander Fraser | Yvette Graham | Paco Guzman | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | André Martins | Makoto Morishita | Christof Monz | Masaaki Nagata | Toshiaki Nakazawa | Matteo Negri
Proceedings of the Fifth Conference on Machine Translation

pdf bib
An exploratory approach to the Parallel Corpus Filtering shared task WMT20WMT20
Ankur Kejriwal | Philipp Koehn
Proceedings of the Fifth Conference on Machine Translation

In this document we describe our submission to the parallel corpus filtering task using multilingual word embedding, language models and an ensemble of pre and post filtering rules. We use the norms of embedding and the perplexities of language models along with pre / post filtering rules to complement the LASER baseline scores and in the end get an improvement on the dev set in both language pairs.

2019

pdf bib
Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers)
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | André Martins | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Matt Post | Marco Turchi | Karin Verspoor
Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers)

pdf bib
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | André Martins | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Matt Post | Marco Turchi | Karin Verspoor
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

pdf bib
Findings of the First Shared Task on Machine Translation Robustness
Xian Li | Paul Michel | Antonios Anastasopoulos | Yonatan Belinkov | Nadir Durrani | Orhan Firat | Philipp Koehn | Graham Neubig | Juan Pino | Hassan Sajjad
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

We share the findings of the first shared task on improving robustness of Machine Translation (MT). The task provides a testbed representing challenges facing MT models deployed in the real world, and facilitates new approaches to improve models’ robustness to noisy input and domain mismatch. We focus on two language pairs (English-French and English-Japanese), and the submitted systems are evaluated on a blind test set consisting of noisy comments on Reddit and professionally sourced translations. As a new task, we received 23 submissions by 11 participating teams from universities, companies, national labs, etc. All submitted systems achieved large improvements over baselines, with the best improvement having +22.33 BLEU. We evaluated submissions by both human judgment and automatic evaluation (BLEU), which shows high correlations (Pearson’s r = 0.94 and 0.95). Furthermore, we conducted a qualitative analysis of the submitted systems using compare-mt, which revealed their salient differences in handling challenges in this task. Such analysis provides additional insights when there is occasional disagreement between human judgment and BLEU, e.g. systems better at producing colloquial expressions received higher score from human judgment.

pdf bib
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | André Martins | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Matt Post | Marco Turchi | Karin Verspoor
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

pdf bib
Findings of the WMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource ConditionsWMT 2019 Shared Task on Parallel Corpus Filtering for Low-Resource Conditions
Philipp Koehn | Francisco Guzmán | Vishrav Chaudhary | Juan Pino
Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2)

Following the WMT 2018 Shared Task on Parallel Corpus Filtering, we posed the challenge of assigning sentence-level quality scores for very noisy corpora of sentence pairs crawled from the web, with the goal of sub-selecting 2 % and 10 % of the highest-quality data to be used to train machine translation systems. This year, the task tackled the low resource condition of Nepali-English and Sinhala-English. Eleven participants from companies, national research labs, and universities participated in this task.

pdf bib
Overcoming Catastrophic Forgetting During Domain Adaptation of Neural Machine Translation
Brian Thompson | Jeremy Gwinnup | Huda Khayrallah | Kevin Duh | Philipp Koehn
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Continued training is an effective method for domain adaptation in neural machine translation. However, in-domain gains from adaptation come at the expense of general-domain performance. In this work, we interpret the drop in general-domain performance as catastrophic forgetting of general-domain knowledge. To mitigate it, we adapt Elastic Weight Consolidation (EWC)a machine learning method for learning a new task without forgetting previous tasks. Our method retains the majority of general-domain performance lost in continued training without degrading in-domain performance, outperforming the previous state-of-the-art. We also explore the full range of general-domain performance available when some in-domain degradation is acceptable.

pdf bib
De-Mixing Sentiment from Code-Mixed Text
Yash Kumar Lal | Vaibhav Kumar | Mrinal Dhar | Manish Shrivastava | Philipp Koehn
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop

Code-mixing is the phenomenon of mixing the vocabulary and syntax of multiple languages in the same sentence. It is an increasingly common occurrence in today’s multilingual society and poses a big challenge when encountered in different downstream tasks. In this paper, we present a hybrid architecture for the task of Sentiment Analysis of English-Hindi code-mixed data. Our method consists of three components, each seeking to alleviate different issues. We first generate subword level representations for the sentences using a CNN architecture. The generated representations are used as inputs to a Dual Encoder Network which consists of two different BiLSTMs-the Collective and Specific Encoder. The Collective Encoder captures the overall sentiment of the sentence, while the Specific Encoder utilizes an attention mechanism in order to focus on individual sentiment-bearing sub-words. This, combined with a Feature Network consisting of orthographic features and specially trained word embeddings, achieves state-of-the-art results-83.54 % accuracy and 0.827 F1 score-on a benchmark dataset.

2018

pdf bib
Context and Copying in Neural Machine Translation
Rebecca Knowles | Philipp Koehn
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Neural machine translation systems with subword vocabularies are capable of translating or copying unknown words. In this work, we show that they learn to copy words based on both the context in which the words appear as well as features of the words themselves. In contexts that are particularly copy-prone, they even copy words that they have already learned they should translate. We examine the influence of context and subword features on this and other types of copying behavior.

pdf bib
Regularized Training Objective for Continued Training for Domain Adaptation in Neural Machine Translation
Huda Khayrallah | Brian Thompson | Kevin Duh | Philipp Koehn
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

Supervised domain adaptationwhere a large generic corpus and a smaller in-domain corpus are both available for trainingis a challenge for neural machine translation (NMT). Standard practice is to train a generic model and use it to initialize a second model, then continue training the second model on in-domain data to produce an in-domain model. We add an auxiliary term to the training objective during continued training that minimizes the cross entropy between the in-domain model’s output word distribution and that of the out-of-domain model to prevent the model’s output from differing too much from the original out-of-domain model. We perform experiments on EMEA (descriptions of medicines) and TED (rehearsed presentations), initialized from a general domain (WMT) model. Our method shows improvements over standard continued training by up to 1.5 BLEU.

pdf bib
On the Impact of Various Types of Noise on Neural Machine Translation
Huda Khayrallah | Philipp Koehn
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

We examine how various types of noise in the parallel training data impact the quality of neural machine translation systems. We create five types of artificial noise and analyze how they degrade performance in neural and statistical machine translation. We find that neural models are generally more harmed by noise than statistical models. For one especially egregious type of noise they learn to just copy the input sentence.

pdf bib
Proceedings of the Third Conference on Machine Translation: Research Papers
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Matt Post | Lucia Specia | Marco Turchi | Karin Verspoor
Proceedings of the Third Conference on Machine Translation: Research Papers

pdf bib
Freezing Subnetworks to Analyze Domain Adaptation in Neural Machine Translation
Brian Thompson | Huda Khayrallah | Antonios Anastasopoulos | Arya D. McCarthy | Kevin Duh | Rebecca Marvin | Paul McNamee | Jeremy Gwinnup | Tim Anderson | Philipp Koehn
Proceedings of the Third Conference on Machine Translation: Research Papers

To better understand the effectiveness of continued training, we analyze the major components of a neural machine translation system (the encoder, decoder, and each embedding space) and consider each component’s contribution to, and capacity for, domain adaptation. We find that freezing any single component during continued training has minimal impact on performance, and that performance is surprisingly good when a single component is adapted while holding the rest of the model fixed. We also find that continued training does not move the model very far from the out-of-domain model, compared to a sensitivity analysis metric, suggesting that the out-of-domain model can provide a good generic initialization for the new domain.

bib
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Matt Post | Lucia Specia | Marco Turchi | Karin Verspoor
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

2017

pdf bib
Proceedings of the Second Conference on Machine Translation
Ondřej Bojar | Christian Buck | Rajen Chatterjee | Christian Federmann | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Julia Kreutzer
Proceedings of the Second Conference on Machine Translation

pdf bib
Zipporah : a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel CorporaZipporah: a Fast and Scalable Data Cleaning System for Noisy Web-Crawled Parallel Corpora
Hainan Xu | Philipp Koehn
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We introduce Zipporah, a fast and scalable data cleaning system. We propose a novel type of bag-of-words translation feature, and train logistic regression models to classify good data and synthetic noisy data in the proposed feature space. The trained model is used to score parallel sentences in the data pool for selection. As shown in experiments, Zipporah selects a high-quality parallel corpus from a large, mixed quality data pool. In particular, for one noisy dataset, Zipporah achieves a 2.1 BLEU score improvement with using 1/5 of the data over using the entire corpus.