Proceedings of the 16th International Conference on Spoken Language Translation

Jan Niehues, Rolando Cattoni, Sebastian Stüker, Matteo Negri, Marco Turchi, Thanh-Le Ha, Elizabeth Salesky, Ramon Sanabria, Loic Barrault, Lucia Specia, Marcello Federico (Editors)


Anthology ID:
2019.iwslt-1
Month:
November 2-3
Year:
2019
Address:
Hong Kong
Venues:
EMNLP | IWSLT
SIG:
SIGSLT
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/2019.iwslt-1

The IWSLT 2019 Evaluation Campaign
Jan Niehues | Rolando Cattoni | Sebastian Stüker | Matteo Negri | Marco Turchi | Thanh-Le Ha | Elizabeth Salesky | Ramon Sanabria | Loic Barrault | Lucia Specia | Marcello Federico

The IWSLT 2019 evaluation campaign featured three tasks: speech translation of (i) TED talks and (ii) How2 instructional videos from English into German and Portuguese, and (iii) text translation of TED talks from English into Czech. For the first two tasks we encouraged submissions of end-to-end speech-to-text systems, and for the second task participants could also use the video as additional input. We received submissions from 12 research teams. This overview provides detailed descriptions of the data and evaluation conditions of each task and reports results of the participating systems.

ESPnet How2 Speech Translation System for IWSLT 2019: Pre-training, Knowledge Distillation, and Going Deeper
Hirofumi Inaguma | Shun Kiyono | Nelson Enrique Yalta Soplin | Jun Suzuki | Kevin Duh | Shinji Watanabe

This paper describes the ESPnet submissions to the How2 Speech Translation task at IWSLT 2019. This year, we mainly build our systems on Transformer architectures in all tasks and focus on end-to-end speech translation (E2E-ST). We first compare RNN-based and Transformer models, and confirm that Transformer models significantly and consistently outperform RNN models in all tasks and corpora. Next, we investigate pre-training of E2E-ST models with the ASR and MT tasks. On top of the pre-training, we further explore knowledge distillation from the NMT model and a deeper speech encoder, and confirm drastic improvements over the baseline model. All of our code is publicly available in ESPnet.

ON-TRAC Consortium End-to-End Speech Translation Systems for the IWSLT 2019 Shared Task
Ha Nguyen

This paper describes the ON-TRAC Consortium translation systems developed for the end-to-end model task of the IWSLT 2019 evaluation for the English→Portuguese language pair. The ON-TRAC Consortium is composed of researchers from three French academic laboratories: LIA (Avignon Université), LIG (Université Grenoble Alpes), and LIUM (Le Mans Université). A single end-to-end model built as a neural encoder-decoder architecture with an attention mechanism was used for two primary submissions corresponding to the two EN-PT evaluation sets: (1) TED (MuST-C) and (2) How2. In this paper, we notably investigate the impact of pooling heterogeneous corpora for training, the impact of target tokenization (characters or BPEs), and the impact of speech input segmentation, and we also compare our best end-to-end model (BLEU of 26.91 on the MuST-C and 43.82 on the How2 validation sets) to a pipeline (ASR+MT) approach.

Transformer-based Cascaded Multimodal Speech Translation
Zixiu Wu | Ozan Caglayan | Julia Ive | Josiah Wang | Lucia Specia

This paper describes the cascaded multimodal speech translation systems developed by Imperial College London for the IWSLT 2019 evaluation campaign. The architecture consists of an automatic speech recognition (ASR) system followed by a Transformer-based multimodal machine translation (MMT) system. While the ASR component is identical across the experiments, the MMT model varies in terms of the way of integrating the visual context (simple conditioning vs. attention), the type of visual features exploited (pooled, convolutional, action categories) and the underlying architecture. For the latter, we explore both the canonical transformer and its deliberation version with additive and cascade variants which differ in how they integrate the textual attention. Upon conducting extensive experiments, we found that (i) the explored visual integration schemes often harm the translation performance for the transformer and additive deliberation, but considerably improve the cascade deliberation; (ii) the transformer and cascade deliberation integrate the visual modality better than the additive deliberation, as shown by the incongruence analysis.

The LIG system for the English-Czech Text Translation Task of IWSLT 2019
Loïc Vial | Benjamin Lecouteux | Didier Schwab | Hang Le | Laurent Besacier

In this paper, we present our submission for the English to Czech Text Translation Task of IWSLT 2019. Our system aims to study how pre-trained language models, used as input embeddings, can improve a specialized machine translation system trained on little data. Therefore, we implemented a Transformer-based encoder-decoder neural system which is able to use the output of a pre-trained language model as input embeddings, and we compared its performance under three configurations: 1) without any pre-trained language model (constrained), 2) using a language model trained on the monolingual parts of the allowed English-Czech data (constrained), and 3) using a language model trained on a large quantity of external monolingual data (unconstrained). We used BERT as the external pre-trained language model (configuration 3), and the BERT architecture for training our own language model (configuration 2). Regarding the training data, we trained our MT system on a small quantity of parallel text: one set consists only of the provided MuST-C corpus, and the other set consists of the MuST-C corpus and the News Commentary corpus from WMT. We observed that using the external pre-trained BERT improves the scores of our system by +0.8 to +1.5 BLEU on our development set, and +0.97 to +1.94 BLEU on the test set. However, using our own language model trained only on the allowed parallel data seems to improve machine translation performance only when the system is trained on the smallest dataset.

KIT’s Submission to the IWSLT 2019 Shared Task on Text Translation
Felix Schneider | Alex Waibel

In this paper, we describe KIT’s submission for the IWSLT 2019 shared task on text translation. Our system is based on the transformer model [1] using our in-house implementation. We augment the available training data using back-translation and employ fine-tuning for the final model. For our best results, we used a 12-layer transformer-big configuration, achieving state-of-the-art results on the WMT2018 test set. We also experiment with student-teacher models to improve performance of smaller models.

Adapting Multilingual Neural Machine Translation to Unseen Languages
Surafel M. Lakew | Alina Karakanta | Marcello Federico | Matteo Negri | Marco Turchi

Multilingual Neural Machine Translation (MNMT) for low-resource languages (LRL) can be enhanced by the presence of related high-resource languages (HRL), but the relatedness of HRL usually relies on predefined linguistic assumptions about language similarity. Recently, adapting MNMT to a LRL has been shown to greatly improve performance. In this work, we explore the problem of adapting an MNMT model to an unseen LRL using data selection and model adaptation. In order to improve NMT for LRL, we employ perplexity to select HRL data that are most similar to the LRL on the basis of language distance. We extensively explore data selection in popular multilingual NMT settings, namely in (zero-shot) translation, and in adaptation from a multilingual pre-trained model, for both directions (LRL↔en). We further show that dynamic adaptation of the model’s vocabulary results in a more favourable segmentation for the LRL in comparison with direct adaptation. Experiments show reductions in training time and significant performance gains over LRL baselines, even with zero LRL data (+13.0 BLEU), up to +17.0 BLEU for pre-trained multilingual model dynamic adaptation with related data selection. Our method outperforms current approaches, such as massively multilingual models and data augmentation, on four LRL.

Transformers without Tears: Improving the Normalization of Self-Attention
Toan Q. Nguyen | Julian Salazar

We evaluate three simple, normalization-centric changes to improve Transformer training. First, we show that pre-norm residual connections (PRENORM) and smaller initializations enable warmup-free, validation-based training with large learning rates. Second, we propose ℓ2 normalization with a single scale parameter (SCALENORM) for faster training and better performance. Finally, we reaffirm the effectiveness of normalizing word embeddings to a fixed length (FIXNORM). On five low-resource translation pairs from TED Talks-based corpora, these changes always converge, giving an average +1.1 BLEU over state-of-the-art bilingual baselines and a new 32.8 BLEU on IWSLT’15 English-Vietnamese. We observe sharper performance curves, more consistent gradient norms, and a linear relationship between activation scaling and decoder depth. Surprisingly, in the high-resource setting (WMT’14 English-German), SCALENORM and FIXNORM remain competitive but PRENORM degrades performance.
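
As an illustration of the two normalizations named in this abstract, here is a minimal sketch in plain Python. It assumes that SCALENORM rescales a vector to ℓ2 norm `g` (a single learned scalar) and that FIXNORM rescales a word embedding to unit length; the function names, the `eps` guard, and the use of Python lists are choices made for this example, not taken from the paper's code.

```python
import math


def l2_norm(x):
    """Euclidean (l2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in x))


def scale_norm(x, g, eps=1e-5):
    """Rescale x so that its l2 norm equals the single scale parameter g.

    eps is an assumed guard against division by zero for all-zero inputs.
    """
    norm = max(l2_norm(x), eps)
    return [g * v / norm for v in x]


def fix_norm(embedding, eps=1e-5):
    """Normalize a word embedding to a fixed (here: unit) length."""
    return scale_norm(embedding, g=1.0, eps=eps)


if __name__ == "__main__":
    hidden = [3.0, 4.0]
    print(scale_norm(hidden, g=2.0))  # a vector of norm 2 pointing along hidden
    print(fix_norm(hidden))           # a unit vector pointing along hidden
```

In a real model `g` would be a trainable parameter and the operation would run on tensors; the sketch only shows the arithmetic.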

Harnessing Indirect Training Data for End-to-End Automatic Speech Translation: Tricks of the Trade
Juan Pino | Liezl Puzon | Jiatao Gu | Xutai Ma | Arya D. McCarthy | Deepak Gopinath

For automatic speech translation (AST), end-to-end approaches are outperformed by cascaded models that transcribe with automatic speech recognition (ASR), then translate with machine translation (MT). A major cause of the performance gap is that, while existing AST corpora are small, massive datasets exist for both the ASR and MT subsystems. In this work, we evaluate several data augmentation and pretraining approaches for AST, by comparing all on the same datasets. Simple data augmentation by translating ASR transcripts proves most effective on the English-French augmented LibriSpeech dataset, closing the performance gap from 8.2 to 1.4 BLEU, compared to a very strong cascade that could directly utilize copious ASR and MT data. The same end-to-end approach plus fine-tuning closes the gap on the English-Romanian MuST-C dataset from 6.7 to 3.7 BLEU. In addition to these results, we present practical recommendations for augmentation and pretraining approaches. Finally, we decrease the performance gap to 0.01 BLEU using a Transformer-based architecture.

On Using SpecAugment for End-to-End Speech Translation
Parnia Bahar | Albert Zeyer | Ralf Schlüter | Hermann Ney

This work investigates a simple data augmentation technique, SpecAugment, for end-to-end speech translation. SpecAugment is a low-cost method applied directly to the audio input features, and it consists of masking blocks of frequency channels and/or time steps. We apply SpecAugment on end-to-end speech translation tasks and achieve up to +2.2% BLEU on LibriSpeech Audiobooks En→Fr and +1.2% on IWSLT TED-talks En→De by alleviating overfitting to some extent. We also examine the effectiveness of the method in a variety of data scenarios and show that the method also leads to significant improvements in various data conditions irrespective of the amount of training data.
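
The masking described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: features are a non-empty list of frames (time steps), each a list of frequency-channel values; one frequency block and one time block are masked per call; mask widths are drawn uniformly up to the hypothetical limits `max_freq_width` and `max_time_width`; masked cells are set to `mask_value`.

```python
import random


def spec_augment(features, max_freq_width, max_time_width,
                 rng=None, mask_value=0.0):
    """Return a copy of features with one frequency block and one time block masked.

    features: non-empty list of frames, each a list of channel values.
    """
    rng = rng or random.Random()
    num_frames = len(features)
    num_channels = len(features[0])
    out = [list(frame) for frame in features]  # copy, leave the input untouched

    # Frequency masking: blank a contiguous band of channels in every frame.
    f_width = rng.randint(0, min(max_freq_width, num_channels))
    f_start = rng.randint(0, num_channels - f_width)
    for frame in out:
        for c in range(f_start, f_start + f_width):
            frame[c] = mask_value

    # Time masking: blank a contiguous run of whole frames.
    t_width = rng.randint(0, min(max_time_width, num_frames))
    t_start = rng.randint(0, num_frames - t_width)
    for t in range(t_start, t_start + t_width):
        out[t] = [mask_value] * num_channels

    return out


if __name__ == "__main__":
    feats = [[float(t * 10 + c + 1) for c in range(8)] for t in range(20)]
    masked = spec_augment(feats, max_freq_width=3, max_time_width=5,
                          rng=random.Random(0))
    for frame in masked[:5]:
        print(frame)
```

Because the mask is resampled on every call, applying it on the fly during training exposes the model to a different corrupted view of each utterance in each epoch.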

Estimating post-editing effort: a study on human judgements, task-based and reference-based metrics of MT quality
Carolina Scarton | Mikel L. Forcada | Miquel Esplà-Gomis | Lucia Specia

Devising metrics to assess translation quality has always been at the core of machine translation (MT) research. Traditional automatic reference-based metrics, such as BLEU, have shown correlations with human judgements of adequacy and fluency and have been paramount for the advancement of MT system development. Crowd-sourcing has popularised and enabled the scalability of metrics based on human judgments, such as subjective direct assessments (DA) of adequacy, that are believed to be more reliable than reference-based automatic metrics. Finally, task-based measurements, such as post-editing time, are expected to provide a more detailed evaluation of the usefulness of translations for a specific task. Therefore, while DA averages adequacy judgements to obtain an appraisal of (perceived) quality independently of the task, and reference-based automatic metrics try to objectively estimate quality also in a task-independent way, task-based metrics are measurements obtained either during or after performing a specific task. In this paper we argue that, although expensive, task-based measurements are the most reliable when estimating MT quality in a specific task; in our case, this task is post-editing. To that end, we report experiments on a dataset with newly-collected post-editing indicators and show their usefulness when estimating post-editing effort. Our results show that task-based metrics comparing machine-translated and post-edited versions are the best at tracking post-editing effort, as expected.

Exploring Kernel Functions in the Softmax Layer for Contextual Word Classification
Yingbo Gao | Christian Herold | Weiyue Wang | Hermann Ney

Prominently used in support vector machines and logistic regressions, kernel functions (kernels) can implicitly map data points into high dimensional spaces and make it easier to learn complex decision boundaries. In this work, by replacing the inner product function in the softmax layer, we explore the use of kernels for contextual word classification. In order to compare the individual kernels, experiments are conducted on standard language modeling and machine translation tasks. We observe a wide range of performances across different kernel settings. Extending the results, we look at the gradient properties, investigate various mixture strategies and examine the disambiguation abilities.
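
To make the idea of replacing the inner product in the softmax layer concrete, the sketch below scores each vocabulary entry with a pluggable kernel and then normalizes with a softmax. The three kernels shown (linear, polynomial, Gaussian/RBF) and their hyperparameters are standard textbook examples chosen for illustration; the abstract does not say which kernels the paper evaluates, and all names here are hypothetical.

```python
import math


def linear_kernel(h, w):
    """Plain inner product: recovers the standard softmax layer."""
    return sum(a * b for a, b in zip(h, w))


def polynomial_kernel(h, w, degree=2, c=1.0):
    """Polynomial kernel (assumed hyperparameters degree and c)."""
    return (linear_kernel(h, w) + c) ** degree


def rbf_kernel(h, w, gamma=0.5):
    """Gaussian (RBF) kernel (assumed hyperparameter gamma)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(h, w)))


def kernel_softmax(h, word_vectors, kernel=linear_kernel):
    """Probability of each word given context vector h, scored by `kernel`."""
    scores = [kernel(h, w) for w in word_vectors]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    context = [1.0, 0.0]
    vocab = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
    for k in (linear_kernel, polynomial_kernel, rbf_kernel):
        print(k.__name__, kernel_softmax(context, vocab, kernel=k))
```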

Generic and Specialized Word Embeddings for Multi-Domain Machine Translation
MinhQuang Pham | Josep Crego | François Yvon | Jean Senellart

Supervised machine translation works well when the train and test data are sampled from the same distribution. When this is not the case, adaptation techniques help ensure that the knowledge learned from out-of-domain texts generalises to in-domain sentences. We study here a related setting, multi-domain adaptation, where the number of domains is potentially large and adapting separately to each domain would waste training resources. Our proposal transposes to neural machine translation the feature expansion technique of (Daumé III, 2007): it isolates domain-agnostic from domain-specific lexical representations, while sharing most of the network across domains. Our experiments use two architectures and two language pairs: they show that our approach, while simple and computationally inexpensive, outperforms several strong baselines and delivers a multi-domain system that successfully translates texts from diverse sources.

Lexical Micro-adaptation for Neural Machine Translation
Jitao Xu | Josep Crego | Jean Senellart

This work is inspired by a typical machine translation industry scenario in which translators make use of in-domain data for facilitating translation of similar or repeating sentences. We introduce a generic framework applied at inference in which a subset of segment pairs are first extracted from training data according to their similarity to the input sentences. These segments are then used to dynamically update the parameters of a generic NMT network, thus performing a lexical micro-adaptation. Our approach demonstrates strong adaptation performance to new and existing datasets including pseudo in-domain data. We evaluate our approach on a heterogeneous English-French training dataset showing accuracy gains on all evaluated domains when compared to strong adaptation baselines.

Efficient Bilingual Generalization from Neural Transduction Grammar Induction
Yuchen Yan | Dekai Wu | Serkan Kumyol

We introduce (1) a novel neural network structure for bilingual modeling of sentence pairs that allows efficient capturing of bilingual relationships via biconstituent composition, (2) the concept of neural network biparsing, which applies not only to machine translation (MT) but also to a variety of other bilingual research areas, and (3) the concept of a biparsing-backpropagation training loop, which we hypothesize can efficiently learn complex biparse tree patterns. Our work differs from sequential attention-based models, which are more traditionally found in neural machine translation (NMT), in three aspects. First, our model enforces compositional constraints. Second, our model has a smaller search space in terms of discovering bilingual relationships from bilingual sentence pairs. Third, our model produces explicit biparse trees, which enable transparent error analysis during evaluation and external tree constraints during training.

Breaking the Data Barrier: Towards Robust Speech Translation via Adversarial Stability Training
Qiao Cheng | Meiyuan Fan | Yaqian Han | Jin Huang | Yitao Duan

In a pipeline speech translation system, the automatic speech recognition (ASR) system transmits its recognition errors to the downstream machine translation (MT) system. A standard machine translation system is usually trained on a parallel corpus composed of clean text and will perform poorly on text with recognition noise, a gap well known in the speech translation community. In this paper, we propose a training architecture which aims at making a neural machine translation model more robust against speech recognition errors. Our approach addresses the encoder and the decoder simultaneously using adversarial learning and data augmentation, respectively. Experimental results on the IWSLT2018 speech translation task show that our approach can bridge the gap between the ASR output and the MT input, outperforming the baseline by up to 2.83 BLEU on noisy ASR output while maintaining close performance on clean text.

Controlling the Output Length of Neural Machine Translation
Surafel Melaku Lakew | Mattia Di Gangi | Marcello Federico

The recent advances introduced by neural machine translation (NMT) are rapidly expanding the application fields of machine translation, as well as reshaping the quality level to be targeted. In particular, if translations have to fit some given layout, quality should not only be measured in terms of adequacy and fluency, but also length. Exemplary cases are the translation of document files, subtitles, and scripts for dubbing, where the output length should ideally be as close as possible to the length of the input text. This paper addresses for the first time, to the best of our knowledge, the problem of controlling the output length in NMT. We investigate two methods for biasing the output length with a transformer architecture: i) conditioning the output to a given target-source length-ratio class and ii) enriching the transformer positional embedding with length information. Our experiments show that both methods can induce the network to generate shorter translations, as well as acquiring interpretable linguistic skills.
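
Method (i) from this abstract, conditioning on a target-source length-ratio class, might be prepared at the data level roughly as sketched below. Everything specific here is an assumption made for illustration: the three class tokens, the thresholds 0.95 and 1.05, measuring length in characters, and prepending the token to the source sentence. The abstract itself only states that the output is conditioned on a length-ratio class.

```python
def length_ratio_class(source, target, low=0.95, high=1.05):
    """Map a sentence pair to a length-ratio class token.

    Lengths are counted in characters; low/high are assumed thresholds.
    """
    if not source:
        raise ValueError("source sentence must be non-empty")
    ratio = len(target) / len(source)
    if ratio < low:
        return "<short>"
    if ratio > high:
        return "<long>"
    return "<normal>"


def tag_training_pair(source, target):
    """Prepend the class token to the source side of a training pair."""
    return f"{length_ratio_class(source, target)} {source}", target


def tag_for_inference(source, desired_class="<short>"):
    """At inference time the desired class is chosen by the user."""
    return f"{desired_class} {source}"


if __name__ == "__main__":
    src = "The output should fit the subtitle box."
    tgt = "Die Ausgabe passt."
    print(tag_training_pair(src, tgt))
    print(tag_for_inference(src))
```

Under this setup the model sees the class token as an ordinary input symbol during training, so requesting `<short>` at inference time biases it toward the shorter translations it observed for that class.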