François Yvon

Also published as: Francois Yvon


2022

pdf bib
Graph Neural Networks for Multiparallel Word Alignment
Ayyoob Imani | Lütfi Kerem Senel | Masoud Jalili Sabet | François Yvon | Hinrich Schuetze
Findings of the Association for Computational Linguistics: ACL 2022

After a period of decrease interest in word alignments is increasing again for their usefulness in domains such as typological research cross lingual annotation projection and machine translation Generally alignment algorithms only use bitext and do not make use of the fact that many parallel corpora are multiparallel Here we compute high quality word alignments between multiple language pairs by considering all language pairs together First we create a multiparallel word alignment graph joining all bilingual word alignment pairs in one graph Next we use graph neural networks GNNs to exploit the graph structure Our GNN approach i utilizes information about the meaning position and language of the input words ii incorporates information from multiple parallel sentences iii adds and removes edges from the initial alignments and iv yields a prediction model that can generalize beyond the training sentences We show that community detection algorithms can provide valuable information for multiparallel word alignment Our method outperforms previous work on three word alignment datasets and on a downstream task

2021

pdf bib
One Source, Two Targets : Challenges and Rewards of Dual DecodingChallenges and Rewards of Dual Decoding
Jitao Xu | François Yvon
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Machine translation is generally understood as generating one target text from an input source document. In this paper, we consider a stronger requirement : to jointly generate two texts so that each output side effectively depends on the other. As we discuss, such a device serves several practical purposes, from multi-target machine translation to the generation of controlled variations of the target text. We present an analysis of possible implementations of dual decoding, and experiment with four applications. Viewing the problem from multiple angles allows us to better highlight the challenges of dual decoding and to also thoroughly analyze the benefits of generating matched, rather than independent, translations.

pdf bib
Revisiting Multi-Domain Machine Translation
MinhQuang Pham | Josep Maria Crego | François Yvon
Transactions of the Association for Computational Linguistics, Volume 9

When building machine translation systems, one often needs to make the best out of heterogeneous sets of parallel data in training, and to robustly handle inputs from unexpected domains in testing. This multi-domain scenario has attracted a lot of recent work that fall under the general umbrella of transfer learning. In this study, we revisit multi-domain machine translation, with the aim to formulate the motivations for developing such systems and the associated expectations with respect to performance. Our experiments with a large sample of multi-domain systems show that most of these expectations are hardly met and suggest that further work is needed to better analyze the current behaviour of multi-domain systems and to make them fully hold their promises.

2020

pdf bib
SimAlign : High Quality Word Alignments Without Parallel Training Data Using Static and Contextualized EmbeddingsSimAlign: High Quality Word Alignments Without Parallel Training Data Using Static and Contextualized Embeddings
Masoud Jalili Sabet | Philipp Dufter | François Yvon | Hinrich Schütze
Findings of the Association for Computational Linguistics: EMNLP 2020

Word alignments are useful for tasks like statistical and neural machine translation (NMT) and cross-lingual annotation projection. Statistical word aligners perform well, as do methods that extract alignments jointly with translations in NMT. However, most approaches require parallel training data and quality decreases as less training data is available. We propose word alignment methods that require no parallel data. The key idea is to leverage multilingual word embeddings both static and contextualized for word alignment. Our multilingual embeddings are created from monolingual data only without relying on any parallel data or dictionaries. We find that alignments created from embeddings are superior for four and comparable for two language pairs compared to those produced by traditional statistical aligners even with abundant parallel data ; e.g., contextualized embeddings achieve a word alignment F1 for English-German that is 5 percentage points higher than eflomal, a high-quality statistical aligner, trained on 100k parallel sentences.

pdf bib
LIMSI @ WMT 2020LIMSI @ WMT 2020
Sadaf Abdul Rauf | José Carlos Rosales Núñez | Minh Quang Pham | François Yvon
Proceedings of the Fifth Conference on Machine Translation

This paper describes LIMSI’s submissions to the translation shared tasks at WMT’20. This year we have focused our efforts on the biomedical translation task, developing a resource-heavy system for the translation of medical abstracts from English into French, using back-translated texts, terminological resources as well as multiple pre-processing pipelines, including pre-trained representations. Systems were also prepared for the robustness task for translating from English into German ; for this large-scale task we developed multi-domain, noise-robust, translation systems aim to handle the two test conditions : zero-shot and few-shot domain adaptation.

pdf bib
The European Language Technology Landscape in 2020 : Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual EuropeEuropean Language Technology Landscape in 2020: Language-Centric and Human-Centric AI for Cross-Cultural Communication in Multilingual Europe
Georg Rehm | Katrin Marheinecke | Stefanie Hegele | Stelios Piperidis | Kalina Bontcheva | Jan Hajič | Khalid Choukri | Andrejs Vasiļjevs | Gerhard Backfried | Christoph Prinz | José Manuel Gómez-Pérez | Luc Meertens | Paul Lukowicz | Josef van Genabith | Andrea Lösch | Philipp Slusallek | Morten Irgens | Patrick Gatellier | Joachim Köhler | Laure Le Bars | Dimitra Anastasiou | Albina Auksoriūtė | Núria Bel | António Branco | Gerhard Budin | Walter Daelemans | Koenraad De Smedt | Radovan Garabík | Maria Gavriilidou | Dagmar Gromann | Svetla Koeva | Simon Krek | Cvetana Krstev | Krister Lindén | Bernardo Magnini | Jan Odijk | Maciej Ogrodniczuk | Eiríkur Rögnvaldsson | Mike Rosner | Bolette Pedersen | Inguna Skadiņa | Marko Tadić | Dan Tufiș | Tamás Váradi | Kadri Vider | Andy Way | François Yvon
Proceedings of the 12th Language Resources and Evaluation Conference

Multilingualism is a cultural cornerstone of Europe and firmly anchored in the European treaties including full language equality. However, language barriers impacting business, cross-lingual and cross-cultural communication are still omnipresent. Language Technologies (LTs) are a powerful means to break down these barriers. While the last decade has seen various initiatives that created a multitude of approaches and technologies tailored to Europe’s specific needs, there is still an immense level of fragmentation. At the same time, AI has become an increasingly important concept in the European Information and Communication Technology area. For a few years now, AI including many opportunities, synergies but also misconceptions has been overshadowing every other topic. We present an overview of the European LT landscape, describing funding programmes, activities, actions and challenges in the different countries with regard to LT, including the current state of play in industry and the LT market. We present a brief overview of the main LT-related activities on the EU level in the last ten years and develop strategic guidance with regard to four key dimensions.

pdf bib
Proceedings of the 17th International Conference on Spoken Language Translation
Marcello Federico | Alex Waibel | Kevin Knight | Satoshi Nakamura | Hermann Ney | Jan Niehues | Sebastian Stüker | Dekai Wu | Joseph Mariani | Francois Yvon
Proceedings of the 17th International Conference on Spoken Language Translation

2019

pdf bib
Generic and Specialized Word Embeddings for Multi-Domain Machine Translation
MinhQuang Pham | Josep Crego | François Yvon | Jean Senellart
Proceedings of the 16th International Conference on Spoken Language Translation

Supervised machine translation works well when the train and test data are sampled from the same distribution. When this is not the case, adaptation techniques help ensure that the knowledge learned from out-of-domain texts generalises to in-domain sentences. We study here a related setting, multi-domain adaptation, where the number of domains is potentially large and adapting separately to each domain would waste training resources. Our proposal transposes to neural machine translation the feature expansion technique of (Daum III, 2007): it isolates domain-agnostic from domain-specific lexical representations, while sharing the most of the network across domains. Our experiments use two architectures and two language pairs : they show that our approach, while simple and computationally inexpensive, outperforms several strong baselines and delivers a multi-domain system that successfully translates texts from diverse sources.

2018

pdf bib
Fixing Translation Divergences in Parallel Corpora for Neural MTMT
MinhQuang Pham | Josep Crego | Jean Senellart | François Yvon
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Corpus-based approaches to machine translation rely on the availability of clean parallel corpora. Such resources are scarce, and because of the automatic processes involved in their preparation, they are often noisy. This paper describes an unsupervised method for detecting translation divergences in parallel sentences. We rely on a neural network that computes cross-lingual sentence similarity scores, which are then used to effectively filter out divergent translations. Furthermore, similarity scores predicted by the network are used to identify and fix some partial divergences, yielding additional parallel segments. We evaluate these methods for English-French and English-German machine translation tasks, and show that using filtered / corrected corpora actually improves MT performance.

pdf bib
Adaptor Grammars for the Linguist : Word Segmentation Experiments for Very Low-Resource LanguagesAdaptor Grammars for the Linguist: Word Segmentation Experiments for Very Low-Resource Languages
Pierre Godard | Laurent Besacier | François Yvon | Martine Adda-Decker | Gilles Adda | Hélène Maynard | Annie Rialland
Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology

Computational Language Documentation attempts to make the most recent research in speech and language technologies available to linguists working on language preservation and documentation. In this paper, we pursue two main goals along these lines. The first is to improve upon a strong baseline for the unsupervised word discovery task on two very low-resource Bantu languages, taking advantage of the expertise of linguists on these particular languages. The second consists in exploring the Adaptor Grammar framework as a decision and prediction tool for linguists studying a new language. We experiment 162 grammar configurations for each language and show that using Adaptor Grammars for word segmentation enables us to test hypotheses about a language. Specializing a generic grammar with language specific knowledge leads to great improvements for the word discovery task, ultimately achieving a leap of about 30 % token F-score from the results of a strong baseline.

pdf bib
Using Monolingual Data in Neural Machine Translation : a Systematic Study
Franck Burlot | François Yvon
Proceedings of the Third Conference on Machine Translation: Research Papers

Neural Machine Translation (MT) has radically changed the way systems are developed. A major difference with the previous generation (Phrase-Based MT) is the way monolingual target data, which often abounds, is used in these two paradigms. While Phrase-Based MT can seamlessly integrate very large language models trained on billions of sentences, the best option for Neural MT developers seems to be the generation of artificial parallel data through back-translation-a technique that fails to fully take advantage of existing datasets. In this paper, we conduct a systematic study of back-translation, comparing alternative uses of monolingual data, as well as multiple data generation procedures. Our findings confirm that back-translation is very effective and give new explanations as to why this is the case. We also introduce new data simulation techniques that are almost as effective, yet much cheaper to implement.

pdf bib
Automatically Selecting the Best Dependency Annotation Design with Dynamic Oracles
Guillaume Wisniewski | Ophélie Lacroix | François Yvon
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

This work introduces a new strategy to compare the numerous conventions that have been proposed over the years for expressing dependency structures and discover the one for which a parser will achieve the highest parsing performance. Instead of associating each sentence in the training set with a single gold reference we propose to consider a set of references encoding alternative syntactic representations. Training a parser with a dynamic oracle will then automatically select among all alternatives the reference that will be predicted with the highest accuracy. Experiments on the UD corpora show the validity of this approach.

pdf bib
Exploiting Dynamic Oracles to Train Projective Dependency Parsers on Non-Projective Trees
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

Because the most common transition systems are projective, training a transition-based dependency parser often implies to either ignore or rewrite the non-projective training examples, which has an adverse impact on accuracy. In this work, we propose a simple modification of dynamic oracles, which enables the use of non-projective data when training projective parsers. Evaluation on 73 treebanks shows that our method achieves significant gains (+2 to +7 UAS for the most non-projective languages) and consistently outperforms traditional projectivization and pseudo-projectivization approaches.

2017

pdf bib
LIMSI@CoNLL’17 : UD Shared TaskLIMSI@CoNLL’17: UD Shared Task
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

This paper describes LIMSI’s submission to the CoNLL 2017 UD Shared Task, which is focused on small treebanks, and how to improve low-resourced parsing only by ad hoc combination of multiple views and resources. We present our approach for low-resourced parsing, together with a detailed analysis of the results for each test treebank. We also report extensive analysis experiments on model selection for the PUD treebanks, and on annotation consistency among UD treebanks.

pdf bib
Learning the Structure of Variable-Order CRFs : a finite-state perspectiveCRFs: a finite-state perspective
Thomas Lavergne | François Yvon
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The computational complexity of linear-chain Conditional Random Fields (CRFs) makes it difficult to deal with very large label sets and long range dependencies. Such situations are not rare and arise when dealing with morphologically rich languages or joint labelling tasks. We extend here recent proposals to consider variable order CRFs. Using an effective finite-state representation of variable-length dependencies, we propose new ways to perform feature selection at large scale and report experimental results where we outperform strong baselines on a tagging task.

pdf bib
Do n’t Stop Me Now ! Using Global Dynamic Oracles to Correct Training Biases of Transition-Based Dependency Parsers
Lauriane Aufrant | Guillaume Wisniewski | François Yvon
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

This paper formalizes a sound extension of dynamic oracles to global training, in the frame of transition-based dependency parsers. By dispensing with the pre-computation of references, this extension widens the training strategies that can be entertained for such parsers ; we show this by revisiting two standard training procedures, early-update and max-violation, to correct some of their search space sampling biases. Experimentally, on the SPMRL treebanks, this improvement increases the similarity between the train and test distributions and yields performance improvements up to 0.7 UAS, without any computation overhead.