Lucia Specia


2022

pdf bib
Bias Mitigation in Machine Translation Quality Estimation
Hanna Behnke | Marina Fomicheva | Lucia Specia
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Machine Translation Quality Estimation (QE) aims to build predictive models to assess the quality of machine-generated translations in the absence of reference translations. While state-of-the-art QE models have been shown to achieve good results, they over-rely on features that do not have a causal impact on the quality of a translation. In particular, there appears to be a partial input bias, i.e. a tendency to assign high quality scores to translations that are fluent and grammatically correct even though they do not preserve the meaning of the source. We analyse the partial input bias in further detail and evaluate four approaches to use auxiliary tasks for bias mitigation. Two approaches use additional data to inform and support the main task, while the other two are adversarial, actively discouraging the model from learning the bias. We compare the methods with respect to their ability to reduce the partial input bias while maintaining the overall performance. We find that training a multitask architecture with an auxiliary binary classification task that utilises additional augmented data best achieves the desired effects and generalises well to different languages and quality metrics.
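A minimal sketch of the multitask idea described above, assuming pre-computed sentence-pair representations as input; the layer sizes, loss weighting and the "meaning preserved" auxiliary label are illustrative assumptions, not the paper's exact architecture.

```python
# Hedged sketch: a multitask QE model with an auxiliary binary head trained on
# augmented data (all dimensions and the 0.5 loss weight are assumptions).
import torch
import torch.nn as nn

class MultitaskQE(nn.Module):
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU())
        self.quality_head = nn.Linear(hidden, 1)   # main task: quality score regression
        self.aux_head = nn.Linear(hidden, 2)       # auxiliary task: binary "meaning preserved?" label

    def forward(self, x):
        h = self.shared(x)
        return self.quality_head(h).squeeze(-1), self.aux_head(h)

model = MultitaskQE()
x = torch.randn(8, 768)                            # batch of sentence-pair representations
score_true = torch.rand(8)                         # gold quality scores
aux_true = torch.randint(0, 2, (8,))               # labels from augmented data
score_pred, aux_logits = model(x)
loss = nn.MSELoss()(score_pred, score_true) + 0.5 * nn.CrossEntropyLoss()(aux_logits, aux_true)
loss.backward()
```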

pdf bib
Translation Error Detection as Rationale Extraction
Marina Fomicheva | Lucia Specia | Nikolaos Aletras
Findings of the Association for Computational Linguistics: ACL 2022

Recent Quality Estimation (QE) models based on multilingual pre-trained representations have achieved very competitive results in predicting the overall quality of translated sentences. However, detecting specifically which translated words are incorrect is a more challenging task, especially when dealing with limited amounts of training data. We hypothesize that, not unlike humans, successful QE models rely on translation errors to predict overall sentence quality. By exploring a set of feature attribution methods that assign relevance scores to the inputs to explain model predictions, we study the behaviour of state-of-the-art sentence-level QE models and show that explanations (i.e. rationales) extracted from these models can indeed be used to detect translation errors. We therefore (i) introduce a novel semi-supervised method for word-level QE and (ii) propose to use the QE task as a new benchmark for evaluating the plausibility of feature attribution, i.e. how interpretable model explanations are to humans.
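A small illustration of the final step this abstract describes: turning per-token relevance scores from a sentence-level QE model into word-level error tags. The function name and the adaptive threshold are assumptions; the attribution method itself is assumed to run upstream.

```python
# Hedged sketch: converting token-level attribution scores into OK/BAD error tags.
import numpy as np

def rationale_to_error_tags(attributions, threshold=None):
    """Tag target tokens whose relevance to the (low) predicted quality score is
    strongest as 'BAD'; the attribution method (e.g. gradient-based) is assumed
    to have produced one score per target token."""
    scores = np.asarray(attributions, dtype=float)
    if threshold is None:
        threshold = scores.mean() + scores.std()   # simple adaptive cut-off (assumption)
    return ["BAD" if s > threshold else "OK" for s in scores]

print(rationale_to_error_tags([0.05, 0.02, 0.91, 0.10, 0.88]))
# ['OK', 'OK', 'BAD', 'OK', 'BAD']
```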

2021

pdf bib
Continual Quality Estimation with Online Bayesian Meta-Learning
Abiola Obamuyide | Marina Fomicheva | Lucia Specia
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Most current quality estimation (QE) models for machine translation are trained and evaluated in a static setting where training and test data are assumed to be from a fixed distribution. However, in real-life settings, the test data that a deployed QE model would be exposed to may differ from its training data. In particular, training samples are often labelled by one or a small set of annotators, whose perceptions of translation quality and needs may differ substantially from those of end-users, who will employ predictions in practice. To address this challenge, we propose an online Bayesian meta-learning framework for the continuous training of QE models that is able to adapt them to the needs of different users, while being robust to distributional shifts in training and test data. Experiments on data with varying number of users and language characteristics validate the effectiveness of the proposed approach.
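For intuition only, a first-order (Reptile-style) stand-in for the adapt-per-user / update-prior loop implied above; this deliberately omits the paper's Bayesian treatment, and the model, learning rates and step counts are assumptions.

```python
# Hedged sketch: continual per-user adaptation with a shared ("prior") model.
import copy
import torch
import torch.nn as nn

def adapt_to_user(meta_model, user_batches, inner_lr=1e-2, steps=3):
    """Clone the shared model and take a few gradient steps on one user's labelled data."""
    model = copy.deepcopy(meta_model)
    opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
    for _ in range(steps):
        for x, y in user_batches:
            opt.zero_grad()
            nn.MSELoss()(model(x).squeeze(-1), y).backward()
            opt.step()
    return model

def meta_update(meta_model, adapted_model, meta_lr=0.1):
    """Move the shared parameters towards the user-adapted ones (the continual update)."""
    with torch.no_grad():
        for p_meta, p_user in zip(meta_model.parameters(), adapted_model.parameters()):
            p_meta += meta_lr * (p_user - p_meta)

meta_model = nn.Linear(768, 1)                        # toy QE regressor over sentence features
user_batches = [(torch.randn(4, 768), torch.rand(4))]
user_model = adapt_to_user(meta_model, user_batches)  # serve this user's predictions
meta_update(meta_model, user_model)                   # fold the experience back into the shared model
```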

pdf bib
Cross-lingual Visual Pre-training for Multimodal Machine Translation
Ozan Caglayan | Menekse Kuyu | Mustafa Sercan Amac | Pranava Madhyastha | Erkut Erdem | Aykut Erdem | Lucia Specia
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Pre-trained language models have been shown to improve performance in many natural language tasks substantially. Although the early focus of such models was single language pre-training, recent advances have resulted in cross-lingual and visual pre-training methods. In this paper, we combine these two approaches to learn visually-grounded cross-lingual representations. Specifically, we extend the translation language modelling (Lample and Conneau, 2019) with masked region classification and perform pre-training with three-way parallel vision & language corpora. We show that when fine-tuned for multimodal machine translation, these models obtain state-of-the-art performance. We also provide qualitative insights into the usefulness of the learned grounded representations.
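A plausible reading of the combined pre-training objective described above; the abstract does not give the exact weighting, so the interpolation weight below is an assumption.

```latex
% Sketch of the joint objective: translation language modelling (TLM) over the
% parallel text plus masked region classification (MRC) over image regions.
% \lambda is an assumed interpolation weight.
\mathcal{L}_{\text{pretrain}} \;=\; \mathcal{L}_{\text{TLM}} \;+\; \lambda\,\mathcal{L}_{\text{MRC}}
```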

pdf bib
Exploiting Multimodal Reinforcement Learning for Simultaneous Machine Translation
Julia Ive | Andy Mingren Li | Yishu Miao | Ozan Caglayan | Pranava Madhyastha | Lucia Specia
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

This paper addresses the problem of simultaneous machine translation (SiMT) by exploring two main concepts: (a) adaptive policies to learn a good trade-off between high translation quality and low latency; and (b) visual information to support this process by providing additional (visual) contextual information which may be available before the textual input is produced. For that, we propose a multimodal approach to simultaneous machine translation using reinforcement learning, with strategies to integrate visual and textual information in both the agent and the environment. We provide an exploration on how different types of visual information and integration strategies affect the quality and latency of simultaneous translation models, and demonstrate that visual cues lead to higher quality while keeping the latency low.
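To make the agent/environment framing concrete, here is a toy READ/WRITE loop of the kind an adaptive SiMT policy controls; the placeholder policy and toy translator are assumptions, not the paper's learned multimodal policy.

```python
# Hedged sketch of the simultaneous-translation loop: READ consumes one more source
# token (raising latency), WRITE emits a target token from the prefix read so far.
import random

READ, WRITE = 0, 1

def simultaneous_translate(source_tokens, translate_prefix, policy=None):
    policy = policy or (lambda state: random.choice([READ, WRITE]))  # placeholder policy
    read, output = 0, []
    while True:
        state = {"read": read, "written": len(output), "source": source_tokens[:read]}
        action = READ if read == 0 else policy(state)
        if action == READ and read < len(source_tokens):
            read += 1                                    # latency grows with every READ
        else:
            token = translate_prefix(source_tokens[:read], output)
            if token == "</s>":
                break
            output.append(token)                         # quality depends on how much was read
    return output

# Toy "translator": echo the next source token from the prefix read so far (illustration only).
toy = lambda src, out: src[len(out)] if len(out) < len(src) else "</s>"
print(simultaneous_translate(["a", "cat", "sleeps"], toy))
```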

pdf bib
The (Un)Suitability of Automatic Evaluation Metrics for Text Simplification
Fernando Alva-Manchego | Carolina Scarton | Lucia Specia
Computational Linguistics, Volume 47, Issue 4 - December 2021

In order to simplify sentences, several rewriting operations can be performed, such as replacing complex words with simpler synonyms, deleting unnecessary information, and splitting long sentences. Despite this multi-operation nature, evaluation of automatic simplification systems relies on metrics that moderately correlate with human judgments on the simplicity achieved by executing specific operations (e.g., simplicity gain based on lexical replacements). In this article, we investigate how well existing metrics can assess sentence-level simplifications where multiple operations may have been applied and which, therefore, require more general simplicity judgments. For that, we first collect a new and more reliable data set for evaluating the correlation of metrics and human judgments of overall simplicity. Second, we conduct the first meta-evaluation of automatic metrics in Text Simplification, using our new data set (and other existing data) to analyze the variation of the correlation between metrics’ scores and human judgments across three dimensions: the perceived simplicity level, the system type, and the set of references used for computation. We show that these three aspects affect the correlations and, in particular, highlight the limitations of commonly used operation-specific metrics. Finally, based on our findings, we propose a set of recommendations for automatic evaluation of multi-operation simplifications, suggesting which metrics to compute and how to interpret their scores.
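The core meta-evaluation step is simply correlating metric scores with human judgements; a minimal sketch with toy numbers (not the paper's data) follows.

```python
# Hedged sketch: correlating automatic metric scores with human simplicity ratings.
from scipy.stats import pearsonr

human_simplicity = [3.2, 4.5, 2.1, 4.8, 3.9]      # averaged human ratings per system output
metric_scores    = [0.41, 0.62, 0.35, 0.70, 0.52]  # e.g. SARI or BLEU for the same outputs

r, p_value = pearsonr(metric_scores, human_simplicity)
print(f"Pearson r = {r:.3f} (p = {p_value:.3f})")
# In the article this is repeated across simplicity levels, system types and reference sets.
```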

pdf bib
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Marie-Francine Moens | Xuanjing Huang | Lucia Specia | Scott Wen-tau Yih
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

pdf bib
Classification-based Quality Estimation: Small and Efficient Models for Real-world Applications
Shuo Sun | Ahmed El-Kishky | Vishrav Chaudhary | James Cross | Lucia Specia | Francisco Guzmán
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Sentence-level Quality estimation (QE) of machine translation is traditionally formulated as a regression task, and the performance of QE models is typically measured by Pearson correlation with human labels. Recent QE models have achieved previously-unseen levels of correlation with human judgments, but they rely on large multilingual contextualized language models that are computationally expensive and thus infeasible for real-world applications. In this work, we evaluate several model compression techniques for QE and find that, despite their popularity in other NLP tasks, they lead to poor performance in this regression setting. We observe that a full model parameterization is required to achieve SoTA results in a regression task. However, we argue that the level of expressiveness of a model in a continuous range is unnecessary given the downstream applications of QE, and show that reframing QE as a classification problem and evaluating QE models using classification metrics would better reflect their actual performance in real-world applications.
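A minimal sketch of the reframing argued for above: bin continuous quality scores into discrete quality bands and evaluate with a classification metric instead of Pearson correlation. The thresholds and three-band scheme are illustrative assumptions.

```python
# Hedged sketch: QE as classification rather than regression.
import numpy as np
from sklearn.metrics import f1_score

def to_quality_class(scores, thresholds=(0.3, 0.7)):
    """Map continuous scores in [0, 1] to 3 bands: 0 = bad, 1 = usable, 2 = good."""
    return np.digitize(scores, thresholds)

gold_scores = np.array([0.10, 0.55, 0.92, 0.40, 0.85])
pred_scores = np.array([0.20, 0.60, 0.80, 0.35, 0.90])

gold_cls = to_quality_class(gold_scores)
pred_cls = to_quality_class(pred_scores)
print(f1_score(gold_cls, pred_cls, average="macro"))   # classification metric instead of Pearson
```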

pdf bib
A Generative Framework for Simultaneous Machine Translation
Yishu Miao | Phil Blunsom | Lucia Specia
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We propose a generative framework for simultaneous machine translation. Conventional approaches use a fixed number of source words to translate or learn dynamic policies for the number of source words by reinforcement learning. Here we formulate simultaneous translation as a structural sequence-to-sequence learning problem. A latent variable is introduced to model read or translate actions at every time step, which is then integrated out to consider all the possible translation policies. A re-parameterised Poisson prior is used to regularise the policies which allows the model to explicitly balance translation quality and latency. The experiments demonstrate the effectiveness and robustness of the generative framework, which achieves the best BLEU scores given different average translation latencies on benchmark datasets.
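A schematic form of the latent-variable objective described above; the notation is ours and the paper's exact parameterisation of the re-parameterised Poisson prior may differ.

```latex
% z is the sequence of latent actions (read or translate) over time steps; it is
% marginalised out, and the prior p(z) is regularised towards a (re-parameterised)
% Poisson, trading off translation quality against latency.
p(y \mid x) \;=\; \sum_{z} p(y \mid x, z)\, p(z)
```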

pdf bib
Findings of the WMT 2021 Shared Task on Quality Estimation
Lucia Specia | Frédéric Blain | Marina Fomicheva | Chrysoula Zerva | Zhenhao Li | Vishrav Chaudhary | André F. T. Martins
Proceedings of the Sixth Conference on Machine Translation

We report the results of the WMT 2021 shared task on Quality Estimation, where the challenge is to predict the quality of the output of neural machine translation systems at the word and sentence levels. This edition focused on two main novel additions: (i) prediction for unseen languages, i.e. zero-shot settings, and (ii) prediction of sentences with catastrophic errors. In addition, new data was released for a number of languages, especially post-edited data. Participating teams from 19 institutions submitted altogether 1263 systems to different task variants and language pairs.

pdf bib
Backtranslation Feedback Improves User Confidence in MT, Not Quality
Vilém Zouhar | Michal Novák | Matúš Žilinec | Ondřej Bojar | Mateo Obregón | Robin L. Hill | Frédéric Blain | Marina Fomicheva | Lucia Specia | Lisa Yankovskaya
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Translating text into a language unknown to the text’s author, dubbed outbound translation, is a modern need for which the user experience has significant room for improvement, beyond the basic machine translation facility. We demonstrate this by showing three ways in which user confidence in the outbound translation, as well as its overall final quality, can be affected: backward translation, quality estimation (with alignment) and source paraphrasing. In this paper, we describe an experiment on outbound translation from English to Czech and Estonian. We examine the effects of each proposed feedback module and further focus on how the quality of machine translation systems influences these findings and the user perception of success. We show that backward translation feedback has a mixed effect on the whole process: it increases user confidence in the produced translation, but not the objective quality.

pdf bib
Multimodal Simultaneous Machine Translation
Lucia Specia
Proceedings of the First Workshop on Multimodal Machine Translation for Low Resource Languages (MMTLRL 2021)

Simultaneous machine translation (SiMT) aims to translate a continuous input text stream into another language with the lowest latency and highest quality possible. Therefore, translation has to start with an incomplete source text, which is read progressively, creating the need for anticipation. In this talk I will present work where we seek to understand whether the addition of visual information can compensate for the missing source context. We analyse the impact of different multimodal approaches and visual features on state-of-the-art SiMT frameworks, including fixed and dynamic policy approaches using reinforcement learning. Our results show that visual context is helpful and that visually-grounded models based on explicit object region information perform the best. Our qualitative analysis illustrates cases where only the multimodal systems are able to translate correctly from English into gender-marked languages, as well as deal with differences in word order, such as adjective-noun placement between English and French.

2020

pdf bib
Simultaneous Machine Translation with Visual Context
Ozan Caglayan | Julia Ive | Veneta Haralampieva | Pranava Madhyastha | Loïc Barrault | Lucia Specia
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Simultaneous machine translation (SiMT) aims to translate a continuous input text stream into another language with the lowest latency and highest quality possible. The translation thus has to start with an incomplete source text, which is read progressively, creating the need for anticipation. In this paper, we seek to understand whether the addition of visual information can compensate for the missing source context. To this end, we analyse the impact of different multimodal approaches and visual features on state-of-the-art SiMT frameworks. Our results show that visual context is helpful and that visually-grounded models based on explicit object region information are much better than commonly used global features, reaching up to 3 BLEU points improvement under low latency scenarios. Our qualitative analysis illustrates cases where only the multimodal systems are able to translate correctly from English into gender-marked languages, as well as deal with differences in word order, such as adjective-noun placement between English and French.

pdf bib
Multimodal Quality Estimation for Machine Translation
Shu Okabe | Frédéric Blain | Lucia Specia
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We propose approaches to Quality Estimation (QE) for Machine Translation that explore both text and visual modalities for Multimodal QE. We compare various multimodality integration and fusion strategies. For both sentence-level and document-level predictions, we show that state-of-the-art neural and feature-based QE frameworks obtain better results when using the additional modality.
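For illustration, two of the simplest fusion strategies one could compare in this setting are concatenation and element-wise product of the two modalities; the feature extractors and dimensions below are assumptions.

```python
# Hedged sketch of multimodal fusion for QE: combine text and visual feature vectors
# before feeding a standard sentence-level QE regressor.
import numpy as np

text_feats  = np.random.rand(512)   # e.g. sentence-pair representation
image_feats = np.random.rand(512)   # e.g. pooled CNN features of the accompanying image

concat_fusion  = np.concatenate([text_feats, image_feats])   # 1024-d joint feature
product_fusion = text_feats * image_feats                    # 512-d joint feature

print(concat_fusion.shape, product_fusion.shape)
```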

pdf bib
Proceedings of the First International Workshop on Natural Language Processing Beyond Text
Giuseppe Castellucci | Simone Filice | Soujanya Poria | Erik Cambria | Lucia Specia
Proceedings of the First International Workshop on Natural Language Processing Beyond Text

pdf bib
An Exploratory Study on Multilingual Quality Estimation
Shuo Sun | Marina Fomicheva | Frédéric Blain | Vishrav Chaudhary | Ahmed El-Kishky | Adithya Renduchintala | Francisco Guzmán | Lucia Specia
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing

Predicting the quality of machine translation has traditionally been addressed with language-specific models, under the assumption that the quality label distribution or linguistic features exhibit traits that are not shared across languages. An obvious disadvantage of this approach is the need for labelled data for each given language pair. We challenge this assumption by exploring different approaches to multilingual Quality Estimation (QE), including using scores from translation models. We show that these outperform single-language models, particularly in less balanced quality label distributions and low-resource settings. In the extreme case of zero-shot QE, we show that it is possible to accurately predict quality for any given new language from models trained on other languages. Our findings indicate that state-of-the-art neural QE models based on powerful pre-trained representations generalise well across languages, making them more applicable in real-world settings.

pdf bib
Proceedings of the 28th International Conference on Computational Linguistics: Tutorial Abstracts
Lucia Specia | Daniel Beck
Proceedings of the 28th International Conference on Computational Linguistics: Tutorial Abstracts

pdf bib
Data-Driven Sentence Simplification: Survey and Benchmark
Fernando Alva-Manchego | Carolina Scarton | Lucia Specia
Computational Linguistics, Volume 46, Issue 1 - March 2020

Sentence Simplification (SS) aims to modify a sentence in order to make it easier to read and understand. In order to do so, several rewriting transformations can be performed such as replacement, reordering, and splitting. Executing these transformations while keeping sentences grammatical, preserving their main idea, and generating simpler output, is a challenging and still far from solved problem. In this article, we survey research on SS, focusing on approaches that attempt to learn how to simplify using corpora of aligned original-simplified sentence pairs in English, which is the dominant paradigm nowadays. We also include a benchmark of different approaches on common data sets so as to compare them and highlight their strengths and limitations. We expect that this survey will serve as a starting point for researchers interested in the task and help spark new ideas for future developments.

2019

pdf bib
Deep Copycat Networks for Text-to-Text Generation
Julia Ive | Pranava Madhyastha | Lucia Specia
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Most text-to-text generation tasks, for example text summarisation and text simplification, require copying words from the input to the output. We introduce Copycat, a transformer-based pointer network for such tasks which obtains competitive results in abstractive text summarisation and generates more abstractive summaries. We propose a further extension of this architecture for automatic post-editing, where generation is conditioned over two inputs (source language and machine translation), and the model is capable of deciding where to copy information from. This approach achieves competitive performance when compared to state-of-the-art automated post-editing systems. More importantly, we show that it addresses a well-known limitation of automatic post-editing (overcorrecting translations) and that our novel mechanism for copying source language words improves the results.
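A compact sketch of the copy mechanism at the heart of a pointer network of this kind: the final word distribution mixes generation from the vocabulary with copying from the input via the attention weights. Names, shapes and the mixing gate value are assumptions.

```python
# Hedged sketch: pointer-network style mixture of a generation distribution and a copy distribution.
import torch

def copy_mixture(p_vocab, attention, src_ids, p_gen):
    """p_vocab: (vocab,) generation distribution; attention: (src_len,) weights over
    input tokens; src_ids: (src_len,) vocabulary ids of input tokens; p_gen in [0, 1]."""
    p_final = p_gen * p_vocab
    p_final = p_final.scatter_add(0, src_ids, (1 - p_gen) * attention)  # add copy mass
    return p_final

p_vocab = torch.softmax(torch.randn(10), dim=0)
attention = torch.softmax(torch.randn(3), dim=0)
src_ids = torch.tensor([2, 5, 2])            # two input positions share vocabulary id 2
p_final = copy_mixture(p_vocab, attention, src_ids, p_gen=0.7)
print(p_final.sum())                         # still a valid distribution (sums to 1)
```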

bib
Cross-Sentence Transformations in Text Simplification
Fernando Alva-Manchego | Carolina Scarton | Lucia Specia
Proceedings of the 2019 Workshop on Widening NLP

Current approaches to Text Simplification focus on simplifying sentences individually. However, certain simplification transformations span beyond single sentences (e.g. joining and re-ordering sentences). In this paper, we motivate the need for modelling the simplification task at the document level, and assess the performance of sequence-to-sequence neural models in this setup. We analyse parallel original-simplified documents created by professional editors and show that there are frequent rewriting transformations that are not restricted to sentence boundaries. We also propose strategies to automatically evaluate the performance of a simplification model on these cross-sentence transformations. Our experiments show the inability of standard sequence-to-sequence neural models to learn these transformations, and suggest directions towards document-level simplification.

pdf bib
Is artificial data useful for biomedical Natural Language Processing algorithms?
Zixu Wang | Julia Ive | Sumithra Velupillai | Lucia Specia
Proceedings of the 18th BioNLP Workshop and Shared Task

A major obstacle to the development of Natural Language Processing (NLP) methods in the biomedical domain is data accessibility. This problem can be addressed by generating medical data artificially. Most previous studies have focused on the generation of short clinical text, and evaluation of the data utility has been limited. We propose a generic methodology to guide the generation of clinical text with key phrases. We use the artificial data as additional training data in two key biomedical NLP tasks: text classification and temporal relation extraction. We show that artificially generated training data used in conjunction with real training data can lead to performance boosts for data-greedy neural network algorithms. We also demonstrate the usefulness of the generated data for NLP setups where it fully replaces real training data.

pdf bib
A Comparison on Fine-grained Pre-trained Embeddings for the WMT19 Chinese-English News Translation Task
Zhenhao Li | Lucia Specia
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes our submission to the WMT 2019 Chinese-English (zh-en) news translation shared task. Our systems are based on RNN architectures with pre-trained embeddings which utilize character and sub-character information. We compare models with these different granularity levels using different evaluation metrics. We find that finer-granularity embeddings can help the model according to character-level evaluation, and that the pre-trained embeddings can also be marginally beneficial for model performance when the training data is limited.

pdf bib
WMDO: Fluency-based Word Mover’s Distance for Machine Translation Evaluation
Julian Chow | Lucia Specia | Pranava Madhyastha
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

We propose WMDO, a metric based on distance between distributions in the semantic vector space. Matching in the semantic space has been investigated for translation evaluation, but the constraints of a translation’s word order have not been fully explored. Building on the Word Mover’s Distance metric and various word embeddings, we introduce a fragmentation penalty to account for fluency of a translation. This word order extension is shown to perform better than standard WMD, with promising results against other types of metrics.
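To illustrate the kind of fluency term WMDO adds on top of Word Mover’s Distance, here is a METEOR-style fragmentation penalty computed from a hypothesis-reference alignment; the exact formula and the delta constant in the paper may differ, so treat this as an assumption-laden sketch.

```python
# Hedged sketch: a fragmentation penalty that grows when matched words appear
# out of order (i.e. in many small chunks) in the hypothesis.
def fragmentation_penalty(alignment, delta=0.18):
    """alignment: hypothesis-order list of matched reference positions (None = unmatched)."""
    matched = [pos for pos in alignment if pos is not None]
    if not matched:
        return delta
    chunks = 1 + sum(1 for prev, cur in zip(matched, matched[1:]) if cur != prev + 1)
    return delta * (chunks / len(matched))

# Perfectly ordered matches -> one chunk, minimal penalty; scrambled order -> larger penalty.
print(fragmentation_penalty([0, 1, 2, 3]))      # 0.045
print(fragmentation_penalty([2, 0, 3, 1]))      # 0.18
```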

pdf bib
Probing the Need for Visual Context in Multimodal Machine Translation
Ozan Caglayan | Pranava Madhyastha | Lucia Specia | Loïc Barrault
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Current work on multimodal machine translation (MMT) has suggested that the visual modality is either unnecessary or only marginally beneficial. We posit that this is a consequence of the very simple, short and repetitive sentences used in the only available dataset for the task (Multi30K), rendering the source text sufficient as context. In the general case, however, we believe that it is possible to combine visual and textual information in order to ground translations. In this paper we probe the contribution of the visual modality to state-of-the-art MMT models by conducting a systematic analysis where we partially deprive the models from source-side textual context. Our results show that under limited textual context, models are capable of leveraging the visual input to generate better translations. This contradicts the current belief that MMT models disregard the visual modality because of either the quality of the image features or the way they are integrated into the model.

pdf bib
The IWSLT 2019 Evaluation Campaign
Jan Niehues | Rolando Cattoni | Sebastian Stüker | Matteo Negri | Marco Turchi | Thanh-Le Ha | Elizabeth Salesky | Ramon Sanabria | Loic Barrault | Lucia Specia | Marcello Federico
Proceedings of the 16th International Conference on Spoken Language Translation

The IWSLT 2019 evaluation campaign featured three tasks: speech translation of (i) TED talks and (ii) How2 instructional videos from English into German and Portuguese, and (iii) text translation of TED talks from English into Czech. For the first two tasks we encouraged submissions of end-to-end speech-to-text systems, and for the second task participants could also use the video as additional input. We received submissions by 12 research teams. This overview provides detailed descriptions of the data and evaluation conditions of each task and reports results of the participating systems.

pdf bib
Transformer-based Cascaded Multimodal Speech Translation
Zixiu Wu | Ozan Caglayan | Julia Ive | Josiah Wang | Lucia Specia
Proceedings of the 16th International Conference on Spoken Language Translation

This paper describes the cascaded multimodal speech translation systems developed by Imperial College London for the IWSLT 2019 evaluation campaign. The architecture consists of an automatic speech recognition (ASR) system followed by a Transformer-based multimodal machine translation (MMT) system. While the ASR component is identical across the experiments, the MMT model varies in terms of the way of integrating the visual context (simple conditioning vs. attention), the type of visual features exploited (pooled, convolutional, action categories) and the underlying architecture. For the latter, we explore both the canonical transformer and its deliberation version with additive and cascade variants which differ in how they integrate the textual attention. Upon conducting extensive experiments, we found that (i) the explored visual integration schemes often harm the translation performance for the transformer and additive deliberation, but considerably improve the cascade deliberation; (ii) the transformer and cascade deliberation integrate the visual modality better than the additive deliberation, as shown by the incongruence analysis.

pdf bib
Estimating post-editing effort: a study on human judgements, task-based and reference-based metrics of MT quality
Carolina Scarton | Mikel L. Forcada | Miquel Esplà-Gomis | Lucia Specia
Proceedings of the 16th International Conference on Spoken Language Translation

Devising metrics to assess translation quality has always been at the core of machine translation (MT) research. Traditional automatic reference-based metrics, such as BLEU, have shown correlations with human judgements of adequacy and fluency and have been paramount for the advancement of MT system development. Crowd-sourcing has popularised and enabled the scalability of metrics based on human judgments, such as subjective direct assessments (DA) of adequacy, that are believed to be more reliable than reference-based automatic metrics. Finally, task-based measurements, such as post-editing time, are expected to provide a more detailed evaluation of the usefulness of translations for a specific task. Therefore, while DA averages adequacy judgements to obtain an appraisal of (perceived) quality independently of the task, and reference-based automatic metrics try to objectively estimate quality also in a task-independent way, task-based metrics are measurements obtained either during or after performing a specific task. In this paper we argue that, although expensive, task-based measurements are the most reliable when estimating MT quality in a specific task; in our case, this task is post-editing. To that end, we report experiments on a dataset with newly-collected post-editing indicators and show their usefulness when estimating post-editing effort. Our results show that task-based metrics comparing machine-translated and post-edited versions are the best at tracking post-editing effort, as expected.

pdf bib
Distilling Translations with Visual Awareness
Julia Ive | Pranava Madhyastha | Lucia Specia
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Previous work on multimodal machine translation has shown that visual information is only needed in very specific cases, for example in the presence of ambiguous words where the textual context is not sufficient. As a consequence, models tend to learn to ignore this information. We propose a translate-and-refine approach to this problem where images are only used by a second stage decoder. This approach is trained jointly to generate a good first draft translation and to improve over this draft by (i) making better use of the target language textual context (both left and right-side contexts) and (ii) making use of visual context. This approach leads to state-of-the-art results. Additionally, we show that it has the ability to recover from erroneous or missing words in the source language.

pdf bib
VIFIDEL: Evaluating the Visual Fidelity of Image Descriptions
Pranava Madhyastha | Josiah Wang | Lucia Specia
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

We address the task of evaluating image description generation systems. We propose a novel image-aware metric for this task: VIFIDEL. It estimates the faithfulness of a generated caption with respect to the content of the actual image, based on the semantic similarity between labels of objects depicted in images and words in the description. The metric is also able to take into account the relative importance of objects mentioned in human reference descriptions during evaluation. Even if these human reference descriptions are not available, VIFIDEL can still reliably evaluate system descriptions. The metric achieves high correlation with human judgments on two well-known datasets and is competitive with metrics that rely exclusively on human references.
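To make the core idea concrete, here is a toy version of scoring a description against detected object labels in embedding space. The 2-d embeddings and the simple best-match average are assumptions; the actual metric also weights objects by their importance in human references.

```python
# Hedged sketch of an image-aware fidelity score: compare detected object labels
# with the words of a candidate description via embedding similarity.
import numpy as np

emb = {"dog": [0.9, 0.1], "puppy": [0.85, 0.2], "car": [0.1, 0.9], "street": [0.2, 0.8]}

def cos(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fidelity(object_labels, description_words):
    """For every detected object, take its best match in the description; average."""
    return np.mean([max(cos(emb[o], emb[w]) for w in description_words)
                    for o in object_labels])

print(fidelity(["dog", "street"], ["puppy", "car"]))   # high: both objects are covered
print(fidelity(["dog", "street"], ["car"]))            # lower: the dog is not described
```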

2018

pdf bib
A Report on the Complex Word Identification Shared Task 2018
Seid Muhie Yimam | Chris Biemann | Shervin Malmasi | Gustavo Paetzold | Lucia Specia | Sanja Štajner | Anaïs Tack | Marcos Zampieri
Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications

We report the findings of the second Complex Word Identification (CWI) shared task organized as part of the BEA workshop co-located with NAACL-HLT’2018. The second CWI shared task featured multilingual and multi-genre datasets divided into four tracks: English monolingual, German monolingual, Spanish monolingual, and a multilingual track with a French test set, and two tasks: binary classification and probabilistic classification. A total of 12 teams submitted their results in different task/track combinations and 11 of them wrote system description papers that are referred to in this report and appear in the BEA workshop proceedings.

pdf bib
End-to-end Image Captioning Exploits Distributional Similarity in Multimodal Space
Pranava Swaroop Madhyastha | Josiah Wang | Lucia Specia
Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP

We hypothesize that end-to-end neural image captioning systems work seemingly well because they exploit and learn ‘distributional similarity’ in a multimodal feature space, by mapping a test image to similar training images in this space and generating a caption from the same space. To validate our hypothesis, we focus on the ‘image’ side of image captioning, and vary the input image representation but keep the RNN text generation model of a CNN-RNN constant. Our analysis indicates that image captioning models (i) are capable of separating structure from noisy input representations; (ii) experience virtually no significant performance loss when a high dimensional representation is compressed to a lower dimensional space; (iii) cluster images with similar visual and linguistic information together. Our experiments all point to one fact: that our distributional similarity hypothesis holds. We conclude that, regardless of the image representation, image captioning systems seem to match images and generate captions in a learned joint image-text semantic subspace.

pdf bib
Proceedings of the Third Conference on Machine Translation: Research Papers
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Matt Post | Lucia Specia | Marco Turchi | Karin Verspoor
Proceedings of the Third Conference on Machine Translation: Research Papers

pdf bib
Exploring gap filling as a cheaper alternative to reading comprehension questionnaires when evaluating machine translation for gisting
Mikel L. Forcada | Carolina Scarton | Lucia Specia | Barry Haddow | Alexandra Birch
Proceedings of the Third Conference on Machine Translation: Research Papers

A popular application of machine translation (MT) is gisting: MT is consumed as is to make sense of text in a foreign language. Evaluation of the usefulness of MT for gisting is surprisingly uncommon. The classical method uses reading comprehension questionnaires (RCQ), in which informants are asked to answer professionally-written questions in their language about a foreign text that has been machine-translated into their language. Recently, gap-filling (GF), a form of cloze testing, has been proposed as a cheaper alternative to RCQ. In GF, certain words are removed from reference translations and readers are asked to fill the gaps left using the machine-translated text as a hint. This paper reports, for the first time, a comparative evaluation, using both RCQ and GF, of translations from multiple MT systems for the same foreign texts, and a systematic study on the effect of variables such as gap density, gap-selection strategies, and document context in GF. The main findings of the study are: (a) both RCQ and GF clearly identify MT to be useful; (b) global RCQ and GF rankings for the MT systems are mostly in agreement; (c) GF scores vary very widely across informants, making comparisons among MT systems hard, and (d) unlike RCQ, which is framed around documents, GF evaluation can be framed at the sentence level. These findings support the use of GF as a cheaper alternative to RCQ.
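As a small illustration of how gap-filling items can be built from a reference translation, the sketch below removes every n-th content word; the stopword list, gap density and selection rule are assumptions (the paper studies several such strategies).

```python
# Hedged sketch: building a gap-filling (cloze) item from a reference translation.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "for"}

def make_gap_filling_item(reference, gap_every=3):
    tokens, gaps, content_seen = [], [], 0
    for tok in reference.split():
        if tok.lower() not in STOPWORDS:
            content_seen += 1
            if content_seen % gap_every == 0:
                gaps.append(tok)
                tokens.append("_____")
                continue
        tokens.append(tok)
    return " ".join(tokens), gaps

item, answers = make_gap_filling_item("The commission approved the new budget for rural schools")
print(item)      # The commission approved the _____ budget for rural _____
print(answers)   # ['new', 'schools']
```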

bib
Proceedings of the Third Conference on Machine Translation: Shared Task Papers
Ondřej Bojar | Rajen Chatterjee | Christian Federmann | Mark Fishel | Yvette Graham | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Christof Monz | Matteo Negri | Aurélie Névéol | Mariana Neves | Matt Post | Lucia Specia | Marco Turchi | Karin Verspoor
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

pdf bib
Sheffield Submissions for WMT18 Multimodal Translation Shared Task
Chiraag Lala | Pranava Swaroop Madhyastha | Carolina Scarton | Lucia Specia
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

This paper describes the University of Sheffield’s submissions to the WMT18 Multimodal Machine Translation shared task. We participated in both tasks 1 and 1b. For task 1, we build on a standard sequence to sequence attention-based neural machine translation system (NMT) and investigate the utility of multimodal re-ranking approaches. More specifically, n-best translation candidates from this system are re-ranked using novel multimodal cross-lingual word sense disambiguation models. For task 1b, we explore three approaches: (i) re-ranking based on cross-lingual word sense disambiguation (as for task 1), (ii) re-ranking based on consensus of NMT n-best lists from German-Czech, French-Czech and English-Czech systems, and (iii) data augmentation by generating English source data through machine translation from French to English and from German to English followed by hypothesis selection using a multimodal re-ranker.

pdf bib
Sheffield Submissions for the WMT18 Quality Estimation Shared Task
Julia Ive | Carolina Scarton | Frédéric Blain | Lucia Specia
Proceedings of the Third Conference on Machine Translation: Shared Task Papers

In this paper we present the University of Sheffield submissions for the WMT18 Quality Estimation shared task. We discuss our submissions to all four sub-tasks, where ours is the only team to participate in all language pairs and variations (37 combinations). Our systems show competitive results and outperform the baseline in nearly all cases.

pdf bib
Object Counts! Bringing Explicit Detections Back into Image Captioning
Josiah Wang | Pranava Swaroop Madhyastha | Lucia Specia
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

The use of explicit object detectors as an intermediate step to image captioning, which used to constitute an essential stage in early work, is often bypassed in the currently dominant end-to-end approaches, where the language model is conditioned directly on a mid-level image embedding. We argue that explicit detections provide rich semantic information, and can thus be used as an interpretable representation to better understand why end-to-end image captioning systems work well. We provide an in-depth analysis of end-to-end image captioning by exploring a variety of cues that can be derived from such object detections. Our study reveals that end-to-end image captioning systems rely on matching image representations to generate captions, and that encoding the frequency, size and position of objects are complementary and all play a role in forming a good image representation. It also reveals that different object categories contribute in different ways towards image captioning.

pdf bib
Learning Simplifications for Specific Target Audiences
Carolina Scarton | Lucia Specia
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Text simplification (TS) is a monolingual text-to-text transformation task where an original (complex) text is transformed into a target (simpler) text. Most recent work is based on sequence-to-sequence neural models similar to those used for machine translation (MT). Different from MT, TS data comprises more elaborate transformations, such as sentence splitting. It can also contain multiple simplifications of the same original text targeting different audiences, such as school grade levels. We explore these two features of TS to build models tailored for specific grade levels. Our approach uses a standard sequence-to-sequence architecture where the original sequence is annotated with information about the target audience and/or the (predicted) type of simplification operation. We show that it outperforms state-of-the-art TS approaches (up to 3 and 12 BLEU and SARI points, respectively), including when training data for the specific complex-simple combination of grade levels is not available, i.e. zero-shot learning.
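The "annotated source sequence" idea above is easy to picture with control tokens; the tag format, grade value and operation names below are illustrative assumptions, not the paper's exact annotation scheme.

```python
# Hedged sketch: audience-aware simplification via control tokens prepended to the source,
# so one sequence-to-sequence model can serve several target grade levels.
def annotate_source(sentence, grade_level, operation=None):
    tags = [f"<grade_{grade_level}>"]
    if operation:                       # e.g. a predicted operation such as "split" or "rewrite"
        tags.append(f"<op_{operation}>")
    return " ".join(tags + [sentence])

src = "The committee deliberated extensively before reaching a verdict."
print(annotate_source(src, grade_level=4, operation="rewrite"))
# <grade_4> <op_rewrite> The committee deliberated extensively before reaching a verdict.
```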

2017

pdf bib
Learning How to Simplify From Explicit Labeling of Complex-Simplified Text Pairs
Fernando Alva-Manchego | Joachim Bingel | Gustavo Paetzold | Carolina Scarton | Lucia Specia
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Current research in text simplification has been hampered by two central problems: (i) the small amount of high-quality parallel simplification data available, and (ii) the lack of explicit annotations of simplification operations, such as deletions or substitutions, on existing data. While the recently introduced Newsela corpus has alleviated the first problem, simplifications still need to be learned directly from parallel text using black-box, end-to-end approaches rather than from explicit annotations. These complex-simple parallel sentence pairs often differ to such a high degree that generalization becomes difficult. End-to-end models also make it hard to interpret what is actually learned from data. We propose a method that decomposes the task of text simplification (TS) into its sub-problems. We devise a way to automatically identify operations in a parallel corpus and introduce a sequence-labeling approach based on these annotations. Finally, we provide insights on the types of transformations that different approaches can model.

pdf bib
MUSST: A Multilingual Syntactic Simplification Tool
Carolina Scarton | Alessio Palmero Aprosio | Sara Tonelli | Tamara Martín Wanton | Lucia Specia
Proceedings of the IJCNLP 2017, System Demonstrations

We describe MUSST, a multilingual syntactic simplification tool. The tool supports sentence simplifications for English, Italian and Spanish, and can be easily extended to other languages. Our implementation includes a set of general-purpose simplification rules, as well as a sentence selection module (to select sentences to be simplified) and a confidence model (to select only promising simplifications). The tool was implemented in the context of the European project SIMPATICO on text simplification for Public Administration (PA) texts. Our evaluation on sentences in the PA domain shows that we obtain correct simplifications for 76% of the simplified cases in English and 71% of the cases in Spanish. For Italian, the results are lower (38%), but the tool is still under development.

pdf bib
The Ultimate Presentation Makeup Tutorial: How to Polish your Posters, Slides and Presentations Skills
Gustavo Paetzold | Lucia Specia
Proceedings of the IJCNLP 2017, Tutorial Abstracts

There is no question that our research community has been, and still is, producing an insurmountable amount of interesting strategies, models and tools for a wide array of problems and challenges in diverse areas of knowledge. But for as long as interesting work has existed, we’ve been plagued by a great unsolved mystery: how come there is so much interesting work being published in conferences, but not as many interesting and engaging posters and presentations being featured in them? In this tutorial, we present practical step-by-step makeup solutions for posters, slides and oral presentations in order to help researchers who feel like they are not able to convey the importance of their research to the community in conferences.

pdf bib
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)
Roger Levy | Lucia Specia
Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017)

pdf bib
Complex Word Identification: Challenges in Data Annotation and System Performance
Marcos Zampieri | Shervin Malmasi | Gustavo Paetzold | Lucia Specia
Proceedings of the 4th Workshop on Natural Language Processing Techniques for Educational Applications (NLPTEA 2017)

This paper revisits the problem of complex word identification (CWI) following up the SemEval CWI shared task. We use ensemble classifiers to investigate how well computational methods can discriminate between complex and non-complex words. Furthermore, we analyze the classification performance to understand what makes lexical complexity challenging. Our findings show that most systems performed poorly on the SemEval CWI dataset, and one of the reasons for that is the way in which human annotation was performed.

pdf bib
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Lucia Specia | Matt Post | Michael Paul
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

pdf bib
Personalized Machine Translation: Preserving Original Author Traits
Ella Rabinovich | Raj Nath Patel | Shachar Mirkin | Lucia Specia | Shuly Wintner
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

The language that we produce reflects our personality, and various personal and demographic characteristics can be detected in natural language texts. We focus on one particular personal trait of the author, gender, and study how it is manifested in original texts and in translations. We show that author’s gender has a powerful, clear signal in original texts, but this signal is obfuscated in human and machine translation. We then propose simple domain-adaptation techniques that help retain the original gender traits in the translation, without harming the quality of the translation, thereby creating more personalized machine translation systems.

pdf bib
Lexical Simplification with Neural Ranking
Gustavo Paetzold | Lucia Specia
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

We present a new Lexical Simplification approach that exploits Neural Networks to learn substitutions from the Newsela corpus, a large set of professionally produced simplifications. We extract candidate substitutions by combining the Newsela corpus with a retrofitted context-aware word embeddings model and rank them using a new neural regression model that learns rankings from annotated data. This strategy leads to the highest Accuracy, Precision and F1 scores to date in standard datasets for the task.