Graham Neubig


2022

pdf bib
Breaking Down Multilingual Machine Translation
Ting-Rui Chiang | Yi-Pei Chen | Yi-Ting Yeh | Graham Neubig
Findings of the Association for Computational Linguistics: ACL 2022

While multilingual training is now an essential ingredient in machine translation (MT) systems, recent work has demonstrated that it has different effects in different multilingual settings, such as many-to-one, one-to-many, and many-to-many learning. These training settings expose the encoder and the decoder in a machine translation model with different data distributions. In this paper, we examine how different varieties of multilingual training contribute to learning these two components of the MT model. Specifically, we compare bilingual models with encoders and/or decoders initialized by multilingual training. We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs). We further find the important attention heads for each language pair and compare their correlations during inference. Our analysis sheds light on how multilingual translation models work and also enables us to propose methods to improve performance by training with highly related languages. Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al.

2021

pdf bib
CitationIE: Leveraging the Citation Graph for Scientific Information Extraction
Vijay Viswanathan | Graham Neubig | Pengfei Liu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Automatically extracting key information from scientific documents has the potential to help scientists work more efficiently and accelerate the pace of scientific progress. Prior work has considered extracting document-level entity clusters and relations end-to-end from raw scientific text, which can improve literature search and help identify methods and materials for a given problem. Despite the importance of this task, most existing works on scientific information extraction (SciIE) consider extraction solely based on the content of an individual paper, without considering the paper’s place in the broader literature. In contrast to prior work, we augment our text representations by leveraging a complementary source of document context: the citation graph of referential links between citing and cited papers. On a test set of English-language scientific documents, we show that simple ways of utilizing the structure and content of the citation graph can each lead to significant gains in different scientific information extraction tasks. When these tasks are combined, we observe a sizable improvement in end-to-end information extraction over the state-of-the-art, suggesting the potential for future work along this direction. We release software tools to facilitate citation-aware SciIE development.

pdf bib
Do Context-Aware Translation Models Pay the Right Attention?
Kayo Yin | Patrick Fernandes | Danish Pruthi | Aditi Chaudhary | André F. T. Martins | Graham Neubig
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Context-aware machine translation models are designed to leverage contextual information, but often fail to do so. As a result, they inaccurately disambiguate pronouns and polysemous words that require context for resolution. In this paper, we ask several questions: What contexts do human translators use to resolve ambiguous words? Are models paying large amounts of attention to the same context? What if we explicitly train them to do so? To answer these questions, we introduce SCAT (Supporting Context for Ambiguous Translations), a new English-French dataset comprising supporting context words for 14K translations that professional translators found useful for pronoun disambiguation. Using SCAT, we perform an in-depth analysis of the context used to disambiguate, examining positional and lexical characteristics of the supporting words. Furthermore, we measure the degree of alignment between the model’s attention scores and the supporting context from SCAT, and apply a guided attention strategy to encourage agreement between the two.
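
As a rough illustration of the guided attention idea (not necessarily the paper's exact objective), the sketch below penalizes attention mass that falls outside the human-marked supporting context; the array shapes and the cross-entropy form of the loss are assumptions.

    import numpy as np

    def attention_guidance_loss(attn, support_mask, eps=1e-9):
        # attn: attention weights over context tokens for one ambiguous word (sums to 1)
        # support_mask: 1.0 for tokens marked as supporting context in SCAT, else 0.0
        # Cross-entropy between the normalized human mask and the model's attention,
        # one plausible way to encourage agreement between the two.
        target = support_mask / (support_mask.sum() + eps)
        return float(-(target * np.log(attn + eps)).sum())

    # Hypothetical usage: 5 context tokens, the 2nd and 4th marked as supporting.
    attn = np.array([0.1, 0.4, 0.1, 0.3, 0.1])
    mask = np.array([0.0, 1.0, 0.0, 1.0, 0.0])
    loss = attention_guidance_loss(attn, mask)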

pdf bib
Measuring and Increasing Context Usage in Context-Aware Machine Translation
Patrick Fernandes | Kayo Yin | Graham Neubig | André F. T. Martins
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Recent work in neural machine translation has demonstrated both the necessity and feasibility of using inter-sentential context, context from sentences other than those currently being translated. However, while many current methods present model architectures that theoretically can use this extra context, it is often not clear how much they do actually utilize it at translation time. In this paper, we introduce a new metric, conditional cross-mutual information, to quantify usage of context by these models. Using this metric, we measure how much document-level machine translation systems use particular varieties of context. We find that target context is referenced more than source context, and that including more context has a diminishing effect on results. We then introduce a new, simple training method, context-aware word dropout, to increase the usage of context by context-aware models. Experiments show that our method not only increases context usage, but also improves the translation quality according to metrics such as BLEU and COMET, as well as performance on anaphoric pronoun resolution and lexical cohesion contrastive datasets.
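
A minimal sketch of how a conditional cross-mutual information style measurement could be computed, assuming one already has per-token target log-probabilities from the same model scored with and without the extra context; the exact estimator used in the paper may differ.

    def cxmi_estimate(logprobs_with_context, logprobs_without_context):
        # Average gain in target log-likelihood from conditioning on context.
        # Positive values indicate the model actually uses the extra context.
        assert len(logprobs_with_context) == len(logprobs_without_context)
        n = len(logprobs_with_context)
        return sum(a - b for a, b in zip(logprobs_with_context, logprobs_without_context)) / n

    # Hypothetical per-token log-probabilities for one target sentence.
    print(cxmi_estimate([-1.2, -0.4, -2.0], [-1.5, -0.9, -2.1]))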

pdf bib
Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021)
Royi Lachmy | Ziyu Yao | Greg Durrett | Milos Gligoric | Junyi Jessy Li | Ray Mooney | Graham Neubig | Yu Su | Huan Sun | Reut Tsarfaty
Proceedings of the 1st Workshop on Natural Language Processing for Programming (NLP4Prog 2021)

pdf bib
Efficient Nearest Neighbor Language Models
Junxian He | Graham Neubig | Taylor Berg-Kirkpatrick
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore, which allows them to learn through explicitly memorizing the training datapoints. While effective, these models often require retrieval from a large datastore at test time, significantly increasing the inference overhead and thus limiting the deployment of non-parametric NLMs in practical applications. In this paper, we take the recently proposed k-nearest neighbors language model as an example, exploring methods to improve its efficiency along various dimensions. Experiments on the standard WikiText-103 benchmark and domain-adaptation datasets show that our methods are able to achieve up to a 6x speed-up in inference speed while retaining comparable performance. The empirical analysis we present may provide guidelines for future research seeking to develop or deploy more efficient non-parametric NLMs.
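
For context, here is a toy sketch of the k-nearest-neighbors language model mechanism whose retrieval cost the paper targets; the datastore layout, k, and the interpolation weight lam are illustrative, and real systems use approximate-search libraries rather than the exact distances shown here.

    import numpy as np

    def knn_lm_distribution(p_lm, query, keys, values, vocab_size, k=8, lam=0.25):
        # keys: datastore of context vectors; values: token id that followed each context.
        dists = np.linalg.norm(keys - query, axis=1)      # exact search (the expensive step)
        nn = np.argsort(dists)[:k]                        # k nearest datastore entries
        w = np.exp(-dists[nn])
        w /= w.sum()
        p_knn = np.zeros(vocab_size)
        np.add.at(p_knn, values[nn], w)                   # pool neighbors sharing a next token
        return lam * p_knn + (1.0 - lam) * p_lm           # interpolate with the parametric LM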

pdf bib
When is Wall a Pared and when a Muro?: Extracting Rules Governing Lexical Selection
Aditi Chaudhary | Kayo Yin | Antonios Anastasopoulos | Graham Neubig
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Learning fine-grained distinctions between vocabulary items is a key challenge in learning a new language. For example, the noun wall has different lexical manifestations in Spanish: pared refers to an indoor wall while muro refers to an outside wall. However, this variety of lexical distinction may not be obvious to non-native learners unless the distinction is explained in such a way. In this work, we present a method for automatically identifying fine-grained lexical distinctions, and extracting rules explaining these distinctions in a human- and machine-readable format. We confirm the quality of these extracted rules in a language learning setup for two languages, Spanish and Greek, where we use the rules to teach non-native speakers when to translate a given ambiguous word into its different possible translations.

pdf bib
Evaluating the Morphosyntactic Well-formedness of Generated Texts
Adithya Pratapa | Antonios Anastasopoulos | Shruti Rijhwani | Aditi Chaudhary | David R. Mortensen | Graham Neubig | Yulia Tsvetkov
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Text generation systems are ubiquitous in natural language processing applications. However, evaluation of these systems remains a challenge, especially in multilingual settings. In this paper, we propose L’AMBRE, a metric to evaluate the morphosyntactic well-formedness of text using its dependency parse and morphosyntactic rules of the language. We present a way to automatically extract various rules governing morphosyntax directly from dependency treebanks. To tackle the noisy outputs from text generation systems, we propose a simple methodology to train robust parsers. We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.

pdf bib
Multi-view Subword Regularization
Xinyi Wang | Sebastian Ruder | Graham Neubig
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Multilingual pretrained representations generally rely on subword segmentation algorithms to create a shared multilingual vocabulary. However, standard heuristic algorithms often lead to sub-optimal segmentation, especially for languages with limited amounts of data. In this paper, we take two major steps towards alleviating this problem. First, we demonstrate empirically that applying existing subword regularization methods (Kudo, 2018; Provilkov et al., 2020) during fine-tuning of pre-trained multilingual representations improves the effectiveness of cross-lingual transfer. Second, to take full advantage of different possible input segmentations, we propose Multi-view Subword Regularization (MVR), a method that enforces the consistency of predictors between using inputs tokenized by the standard and probabilistic segmentations. Results on the XTREME multilingual benchmark (Hu et al., 2020) show that MVR brings consistent improvements of up to 2.5 points over using standard segmentation algorithms.
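
A sketch of the consistency idea behind MVR, assuming two predictive distributions for the same example, one from the deterministic segmentation and one from a sampled (probabilistic) segmentation; the symmetric KL form here is an assumption, not necessarily the paper's exact regularizer.

    import numpy as np

    def segmentation_consistency(p_deterministic, p_sampled, eps=1e-9):
        # Symmetric KL divergence between predictions made from two segmentations
        # of the same input; a term like this can be added to the task loss so the
        # model behaves consistently across tokenizations.
        def kl(p, q):
            return float((p * (np.log(p + eps) - np.log(q + eps))).sum())
        return 0.5 * (kl(p_deterministic, p_sampled) + kl(p_sampled, p_deterministic))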

pdf bib
MetaXL: Meta Representation Transformation for Low-resource Cross-lingual Learning
Mengzhou Xia | Guoqing Zheng | Subhabrata Mukherjee | Milad Shokouhi | Graham Neubig | Ahmed Hassan Awadallah
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The combination of multilingual pre-trained representations and cross-lingual transfer learning is one of the most effective methods for building functional NLP systems for low-resource languages. However, for extremely low-resource languages without large-scale monolingual corpora for pre-training or sufficient annotated data for fine-tuning, transfer learning remains an understudied and challenging task. Moreover, recent work shows that multilingual representations are surprisingly disjoint across languages, bringing additional challenges for transfer onto extremely low-resource languages. In this paper, we propose MetaXL, a meta-learning based framework that learns to transform representations judiciously from auxiliary languages to a target one and brings their representation spaces closer for effective transfer. Extensive experiments on real-world low-resource languages without access to large-scale monolingual corpora or large amounts of labeled data for tasks like cross-lingual sentiment analysis and named entity recognition show the effectiveness of our approach. Code for MetaXL is publicly available at github.com/microsoft/MetaXL.

pdf bib
Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models
Po-Yao Huang | Mandela Patrick | Junjie Hu | Graham Neubig | Florian Metze | Alexander Hauptmann
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

This paper studies zero-shot cross-lingual transfer of vision-language models. Specifically, we focus on multilingual text-to-video search and propose a Transformer-based model that learns contextual multilingual multimodal embeddings. Under a zero-shot setting, we empirically demonstrate that performance degrades significantly when we query the multilingual text-video model with non-English sentences. To address this problem, we introduce a multilingual multimodal pre-training strategy, and collect a new multilingual instructional video dataset (Multi-HowTo100M) for pre-training. Experiments on VTT show that our method significantly improves video search in non-English languages without additional annotations. Furthermore, when multilingual annotations are available, our method outperforms recent baselines by a large margin in multilingual text-to-video search on VTT and VATEX, as well as in multilingual text-to-image search on Multi30K. Our model and Multi-HowTo100M are available at http://github.com/berniebear/Multi-HT100M.

pdf bib
Compositional Generalization for Neural Semantic Parsing via Span-level Supervised Attention
Pengcheng Yin | Hao Fang | Graham Neubig | Adam Pauls | Emmanouil Antonios Platanios | Yu Su | Sam Thomson | Jacob Andreas
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We describe a span-level supervised attention loss that improves compositional generalization in semantic parsers. Our approach builds on existing losses that encourage attention maps in neural sequence-to-sequence models to imitate the output of classical word alignment algorithms. Where past work has used word-level alignments, we focus on spans; borrowing ideas from phrase-based machine translation, we align subtrees in semantic parses to spans of input sentences, and encourage neural attention mechanisms to mimic these alignments. This method improves the performance of transformers, RNNs, and structured decoders on three benchmarks of compositional generalization.

pdf bib
GSum: A General Framework for Guided Neural Abstractive Summarization
Zi-Yi Dou | Pengfei Liu | Hiroaki Hayashi | Zhengbao Jiang | Graham Neubig
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Neural abstractive summarization models are flexible and can produce coherent summaries, but they are sometimes unfaithful and can be difficult to control. While previous studies attempt to provide different types of guidance to control the output and increase faithfulness, it is not clear how these strategies compare and contrast to each other. In this paper, we propose a general and extensible guided summarization framework (GSum) that can effectively take different kinds of external guidance as input, and we perform experiments across several different varieties. Experiments demonstrate that this model is effective, achieving state-of-the-art performance according to ROUGE on 4 popular summarization datasets when using highlighted sentences as guidance. In addition, we show that our guided model can generate more faithful summaries and demonstrate how different types of guidance generate qualitatively different summaries, lending a degree of controllability to the learned models.

pdf bib
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas
Manuel Mager | Arturo Oncevay | Annette Rios | Ivan Vladimir Meza Ruiz | Alexis Palmer | Graham Neubig | Katharina Kann
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

pdf bib
Findings of the AmericasNLP 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas
Manuel Mager | Arturo Oncevay | Abteen Ebrahimi | John Ortega | Annette Rios | Angela Fan | Ximena Gutierrez-Vasques | Luis Chiruzzo | Gustavo Giménez-Lugo | Ricardo Ramos | Ivan Vladimir Meza Ruiz | Rolando Coto-Solano | Alexis Palmer | Elisabeth Mager-Hois | Vishrav Chaudhary | Graham Neubig | Ngoc Thang Vu | Katharina Kann
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This paper presents the results of the 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas. The shared task featured two independent tracks, and participants submitted machine translation systems for up to 10 indigenous languages. Overall, 8 teams participated with a total of 214 submissions. We provided training sets consisting of data collected from various sources, as well as manually translated sentences for the development and test sets. An official baseline trained on this data was also provided. Team submissions featured a variety of architectures, including both statistical and neural models, and for the majority of languages, many teams were able to considerably improve over the baseline. The best performing systems achieved 12.97 ChrF higher than baseline, when averaged across languages.

2020

pdf bib
Automatic Extraction of Rules Governing Morphological Agreement
Aditi Chaudhary | Antonios Anastasopoulos | Adithya Pratapa | David R. Mortensen | Zaid Sheikh | Yulia Tsvetkov | Graham Neubig
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Creating a descriptive grammar of a language is an indispensable step for language documentation and preservation. However, at the same time it is a tedious, time-consuming task. In this paper, we take steps towards automating this process by devising an automated framework for extracting a first-pass grammatical specification from raw text in a concise, human- and machine-readable format. We focus on extracting rules describing agreement, a morphosyntactic phenomenon at the core of the grammars of many of the world’s languages. We apply our framework to all languages included in the Universal Dependencies project, with promising results. Using cross-lingual transfer, even with no expert annotations in the language of interest, our framework extracts a grammatical specification which is nearly equivalent to those created with large amounts of gold-standard annotated data. We confirm this finding with human expert evaluations of the rules that our framework produces, which have an average accuracy of 78%. We release an interface demonstrating the extracted rules at https://neulab.github.io/lase/

pdf bib
OCR Post Correction for Endangered Language Texts
Shruti Rijhwani | Antonios Anastasopoulos | Graham Neubig
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages.

pdf bib
X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models
Zhengbao Jiang | Antonios Anastasopoulos | Jun Araki | Haibo Ding | Graham Neubig
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Language models (LMs) have proven surprisingly successful at capturing factual knowledge by completing cloze-style fill-in-the-blank questions such as Punta Cana is located in _. However, while knowledge is both written and queried in many languages, studies on LMs’ factual representation ability have almost invariably been performed on English. To assess factual knowledge retrieval in LMs in different languages, we create a multilingual benchmark of cloze-style probes for typologically diverse languages. To properly handle language variations, we expand probing methods from single- to multi-word entities, and develop several decoding algorithms to generate multi-token predictions. Extensive experimental results provide insights about how well (or poorly) current state-of-the-art LMs perform at this task in languages with more or fewer available resources. We further propose a code-switching-based method to improve the ability of multilingual LMs to access knowledge, and verify its effectiveness on several benchmark languages. Benchmark data and code have been released at https://x-factr.github.io.

pdf bib
Re-evaluating Evaluation in Text Summarization
Manik Bhandari | Pranav Narayan Gour | Atabak Ashfaq | Pengfei Liu | Graham Neubig
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not: for nearly 20 years, ROUGE has been the standard evaluation in most summarization papers. In this paper, we make an attempt to re-evaluate the evaluation method for text summarization: assessing the reliability of automatic metrics using top-scoring system outputs, both abstractive and extractive, on recently popular datasets for both system-level and summary-level evaluation settings. We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems. We release a dataset of human judgments that are collected from 25 top-scoring neural summarization systems (14 abstractive and 11 extractive).

pdf bib
Politeness Transfer: A Tag and Generate Approach
Aman Madaan | Amrith Setlur | Tanmay Parekh | Barnabas Poczos | Graham Neubig | Yiming Yang | Ruslan Salakhutdinov | Alan W Black | Shrimai Prabhumoye
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This paper introduces a new task of politeness transfer which involves converting non-polite sentences to polite sentences while preserving the meaning. We also provide a dataset of more than 1.39 million instances automatically labeled for politeness to encourage benchmark evaluations on this new task. We design a tag and generate pipeline that identifies stylistic attributes and subsequently generates a sentence in the target style while preserving most of the source content. For politeness as well as five other transfer tasks, our model outperforms the state-of-the-art methods on automatic metrics for content preservation, with a comparable or better performance on style transfer accuracy. Additionally, our model surpasses existing methods on human evaluations for grammaticality, meaning preservation and transfer accuracy across all the six style transfer tasks. The data and code are located at https://github.com/tag-and-generate.

pdf bib
Incorporating External Knowledge through Pre-training for Natural Language to Code Generation
Frank F. Xu | Zhengbao Jiang | Pengcheng Yin | Bogdan Vasilescu | Graham Neubig
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Open-domain code generation aims to generate code in a general-purpose programming language (such as Python) from natural language (NL) intents. Motivated by the intuition that developers usually retrieve resources on the web when writing code, we explore the effectiveness of incorporating two varieties of external knowledge into NL-to-code generation: automatically mined NL-code pairs from the online programming QA forum StackOverflow and programming language API documentation. Our evaluations show that combining the two sources with data augmentation and retrieval-based data re-sampling improves the current state-of-the-art by up to 2.2% absolute BLEU score on the code generation testbed CoNaLa. The code and resources are available at https://github.com/neulab/external-knowledge-codegen.

pdf bib
AlloVera: A Multilingual Allophone Database
David R. Mortensen | Xinjian Li | Patrick Littell | Alexis Michaud | Shruti Rijhwani | Antonios Anastasopoulos | Alan W Black | Florian Metze | Graham Neubig
Proceedings of the 12th Language Resources and Evaluation Conference

We introduce a new resource, AlloVera, which provides mappings from 218 allophones to phonemes for 14 languages. Phonemes are contrastive phonological units, and allophones are their various concrete realizations, which are predictable from phonological context. While phonemic representations are language specific, phonetic representations (stated in terms of (allo)phones) are much closer to a universal (language-independent) transcription. AlloVera allows the training of speech recognition models that output phonetic transcriptions in the International Phonetic Alphabet (IPA), regardless of the input language. We show that a universal allophone model, Allosaurus, built with AlloVera, outperforms universal phonemic models and language-specific models on a speech-transcription task. We explore the implications of this technology (and related technologies) for the documentation of endangered and minority languages. We further explore other applications for which AlloVera will be suitable as it grows, including phonological typology.

pdf bib
Improving Candidate Generation for Low-resource Cross-lingual Entity Linking
Shuyan Zhou | Shruti Rijhwani | John Wieting | Jaime Carbonell | Graham Neubig
Transactions of the Association for Computational Linguistics, Volume 8

Cross-lingual entity linking (XEL) is the task of finding referents in a target-language knowledge base (KB) for mentions extracted from source-language texts. The first step of (X)EL is candidate generation, which retrieves a list of plausible candidate entities from the target-language KB for each mention. Approaches based on resources from Wikipedia have proven successful in the realm of relatively high-resource languages, but these do not extend well to low-resource languages with few, if any, Wikipedia pages. Recently, transfer learning methods have been shown to reduce the demand for resources in the low-resource languages by utilizing resources in closely related languages, but the performance still lags far behind their high-resource counterparts. In this paper, we first assess the problems faced by current entity candidate generation methods for low-resource XEL, then propose three improvements that (1) reduce the disconnect between entity mentions and KB entries, and (2) improve the robustness of the model to low-resource scenarios. The methods are simple, but effective: We experiment with our approach on seven XEL datasets and find that they yield an average gain of 16.9% in Top-30 gold candidate recall, compared with state-of-the-art baselines. Our improved model also yields an average gain of 7.9% in in-KB accuracy of end-to-end XEL.

pdf bib
Proceedings of the Fourth Workshop on Neural Generation and Translation
Alexandra Birch | Andrew Finch | Hiroaki Hayashi | Kenneth Heafield | Marcin Junczys-Dowmunt | Ioannis Konstas | Xian Li | Graham Neubig | Yusuke Oda
Proceedings of the Fourth Workshop on Neural Generation and Translation

2019

pdf bib
Handling Syntactic Divergence in Low-resource Machine Translation
Chunting Zhou | Xuezhe Ma | Junjie Hu | Graham Neubig
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Despite impressive empirical successes of neural machine translation (NMT) on standard benchmarks, limited parallel data impedes the application of NMT models to many language pairs. Data augmentation methods such as back-translation make it possible to use monolingual data to help alleviate these issues, but back-translation itself fails in extreme low-resource scenarios, especially for syntactically divergent languages. In this paper, we propose a simple yet effective solution, whereby target-language sentences are re-ordered to match the order of the source and used as an additional source of training-time supervision. Experiments with simulated low-resource Japanese-to-English, and real low-resource Uyghur-to-English scenarios find significant improvements over other semi-supervised alternatives.
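
A simplified sketch of the reordering step described above, assuming word alignments are already available; here `alignment` maps each target token index to the source index it aligns to, and the handling of unaligned tokens is an assumption.

    def reorder_target_to_source_order(target_tokens, alignment):
        # Sort target tokens by the position of the source word they align to,
        # using the original index as a tiebreaker (and as a fallback for
        # unaligned tokens), to create the extra training-time supervision.
        order = sorted(range(len(target_tokens)),
                       key=lambda i: (alignment.get(i, i), i))
        return [target_tokens[i] for i in order]

    # Hypothetical example: an English target reordered to follow a verb-final source order.
    print(reorder_target_to_source_order(
        ["I", "ate", "an", "apple"], {0: 0, 1: 3, 2: 1, 3: 2}))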

pdf bib
Unsupervised Domain Adaptation for Neural Machine Translation with Domain-Aware Feature Embeddings
Zi-Yi Dou | Junjie Hu | Antonios Anastasopoulos | Graham Neubig
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

The recent success of neural machine translation models relies on the availability of high quality, in-domain data. Domain adaptation is required when domain-specific data is scarce or nonexistent. Previous unsupervised domain adaptation strategies include training the model with in-domain copied monolingual or back-translated data. However, these methods use generic representations for text regardless of domain shift, which makes it infeasible for translation models to control outputs conditional on a specific domain. In this work, we propose an approach that adapts models with domain-aware feature embeddings, which are learned via an auxiliary language modeling task. Our approach allows the model to assign domain-specific representations to words and output sentences in the desired domain. Our empirical results demonstrate the effectiveness of the proposed strategy, achieving consistent improvements in multiple experimental settings. In addition, we show that combining our method with back translation can further improve the performance of the model.

pdf bib
FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow
Xuezhe Ma | Chunting Zhou | Xian Li | Graham Neubig | Eduard Hovy
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Most sequence-to-sequence (seq2seq) models are autoregressive; they generate each token by conditioning on previously generated tokens. In contrast, non-autoregressive seq2seq models generate all tokens in one pass, which leads to increased efficiency through parallel processing on hardware such as GPUs. However, directly modeling the joint distribution of all tokens simultaneously is challenging, and even with increasingly complex model structures accuracy lags significantly behind autoregressive models. In this paper, we propose a simple, efficient, and effective model for non-autoregressive sequence generation using latent variable models. Specifically, we turn to generative flow, an elegant technique to model complex distributions using neural networks, and design several layers of flow tailored for modeling the conditional density of sequential latent variables. We evaluate this model on three neural machine translation (NMT) benchmark datasets, achieving comparable performance with state-of-the-art non-autoregressive NMT models and almost constant decoding time w.r.t. the sequence length.

pdf bib
Proceedings of the 3rd Workshop on Neural Generation and Translation
Alexandra Birch | Andrew Finch | Hiroaki Hayashi | Ioannis Konstas | Thang Luong | Graham Neubig | Yusuke Oda | Katsuhito Sudoh
Proceedings of the 3rd Workshop on Neural Generation and Translation

pdf bib
Findings of the First Shared Task on Machine Translation Robustness
Xian Li | Paul Michel | Antonios Anastasopoulos | Yonatan Belinkov | Nadir Durrani | Orhan Firat | Philipp Koehn | Graham Neubig | Juan Pino | Hassan Sajjad
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

We share the findings of the first shared task on improving robustness of Machine Translation (MT). The task provides a testbed representing challenges facing MT models deployed in the real world, and facilitates new approaches to improve models’ robustness to noisy input and domain mismatch. We focus on two language pairs (English-French and English-Japanese), and the submitted systems are evaluated on a blind test set consisting of noisy comments on Reddit and professionally sourced translations. As a new task, we received 23 submissions by 11 participating teams from universities, companies, national labs, etc. All submitted systems achieved large improvements over baselines, with the best improvement having +22.33 BLEU. We evaluated submissions by both human judgment and automatic evaluation (BLEU), which shows high correlations (Pearson’s r = 0.94 and 0.95). Furthermore, we conducted a qualitative analysis of the submitted systems using compare-mt, which revealed their salient differences in handling challenges in this task. Such analysis provides additional insights when there is occasional disagreement between human judgment and BLEU, e.g. systems better at producing colloquial expressions received higher scores from human judgment.

pdf bib
Competence-based Curriculum Learning for Neural Machine Translation
Emmanouil Antonios Platanios | Otilia Stretcu | Graham Neubig | Barnabas Poczos | Tom Mitchell
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Current state-of-the-art NMT systems use large neural networks that are not only slow to train, but also often require many heuristics and optimization tricks, such as specialized learning rate schedules and large batch sizes. This is undesirable as it requires extensive hyperparameter tuning. In this paper, we propose a curriculum learning framework for NMT that reduces training time, reduces the need for specialized heuristics or large batch sizes, and results in overall better performance. Our framework consists of a principled way of deciding which training samples are shown to the model at different times during training, based on the estimated difficulty of a sample and the current competence of the model. Filtering training samples in this manner prevents the model from getting stuck in bad local optima, making it converge faster and reach a better solution than the common approach of uniformly sampling training examples. Furthermore, the proposed method can be easily applied to existing NMT models by simply modifying their input data pipelines. We show that our framework can help improve the training time and the performance of both recurrent neural network models and Transformers, achieving up to a 70% decrease in training time, while at the same time obtaining accuracy improvements of up to 2.2 BLEU.
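
A small sketch of one way such a curriculum can be realized, under the assumption that training samples have already been sorted from easiest to hardest; the square-root schedule and the initial competence c0 are illustrative choices, not necessarily the exact configuration from the paper.

    import math

    def competence(step, total_steps, c0=0.01):
        # Fraction of the difficulty-sorted training data the model may see at this step;
        # grows from c0 to 1.0 following a square-root schedule.
        return min(1.0, math.sqrt(step * (1.0 - c0 ** 2) / total_steps + c0 ** 2))

    def allowed_samples(sorted_samples, step, total_steps):
        # Keep only the easiest prefix of the data permitted by the current competence.
        cutoff = max(1, int(len(sorted_samples) * competence(step, total_steps)))
        return sorted_samples[:cutoff]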

pdf bib
Density Matching for Bilingual Word Embedding
Chunting Zhou | Xuezhe Ma | Di Wang | Graham Neubig
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Recent approaches to cross-lingual word embedding have generally been based on linear transformations between the sets of embedding vectors in the two languages. In this paper, we propose an approach that instead expresses the two monolingual embedding spaces as probability densities defined by a Gaussian mixture model, and matches the two densities using a method called normalizing flow. The method requires no explicit supervision, and can be learned with only a seed dictionary of words that have identical strings. We argue that this formulation has several intuitively attractive properties, particularly with respect to improving robustness and generalization to mappings between difficult language pairs or word pairs. On a benchmark data set of bilingual lexicon induction and cross-lingual word similarity, our approach can achieve competitive or superior performance compared to state-of-the-art published results, with particularly strong results being found on etymologically distant and/or morphologically rich languages.

pdf bib
Self-Attentional Models for Lattice Inputs
Matthias Sperber | Graham Neubig | Ngoc-Quan Pham | Alex Waibel
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Lattices are an efficient and effective method to encode ambiguity of upstream systems in natural language processing tasks, for example to compactly capture multiple speech recognition hypotheses, or to represent multiple linguistic analyses. Previous work has extended recurrent neural networks to model lattice inputs and achieved improvements in various tasks, but these models suffer from very slow computation speeds. This paper extends the recently proposed paradigm of self-attention to handle lattice inputs. Self-attention is a sequence modeling technique that relates inputs to one another by computing pairwise similarities and has gained popularity for both its strong results and its computational efficiency. To extend such models to handle lattices, we introduce probabilistic reachability masks that incorporate lattice structure into the model and support lattice scores if available. We also propose a method for adapting positional embeddings to lattice structures. We apply the proposed model to a speech translation task and find that it outperforms all examined baselines while being much faster to compute than previous neural lattice models during both training and inference.

pdf bib
Domain Adaptation of Neural Machine Translation by Lexicon Induction
Junjie Hu | Mengzhou Xia | Graham Neubig | Jaime Carbonell
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

It has been previously noted that neural machine translation (NMT) is very sensitive to domain shift. In this paper, we argue that this is a dual effect of the highly lexicalized nature of NMT, resulting in failure for sentences with large numbers of unknown words, and lack of supervision for domain-specific words. To remedy this problem, we propose an unsupervised adaptation method which fine-tunes a pre-trained out-of-domain NMT model using a pseudo-in-domain corpus. Specifically, we perform lexicon induction to extract an in-domain lexicon, and construct a pseudo-parallel in-domain corpus by performing word-for-word back-translation of monolingual in-domain target sentences. In five domains over twenty pairwise adaptation settings and two model architectures, our method achieves consistent improvements without using any in-domain parallel sentences, improving up to 14 BLEU over unadapted models, and up to 2 BLEU over strong back-translation baselines.
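
A minimal sketch of the pseudo-parallel corpus construction described above; the lexicon entries and language pair are hypothetical, and the induced lexicon itself comes from a separate lexicon-induction step.

    def word_for_word_backtranslate(target_sentence, lexicon):
        # Replace each in-domain target word with its induced source-side translation,
        # falling back to copying the word when the lexicon has no entry.
        return " ".join(lexicon.get(w, w) for w in target_sentence.split())

    # Hypothetical English->German lexicon used to back-translate monolingual
    # in-domain English (target) text into pseudo-German (source) text.
    lex = {"blood": "Blut", "pressure": "Druck", "elevated": "erhöht"}
    pseudo_source = word_for_word_backtranslate("blood pressure is elevated", lex)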

pdf bib
Choosing Transfer Languages for Cross-Lingual Learning
Yu-Hsiang Lin | Chian-Yu Chen | Jean Lee | Zirui Li | Yuyan Zhang | Mengzhou Xia | Shruti Rijhwani | Junxian He | Zhisong Zhang | Xuezhe Ma | Antonios Anastasopoulos | Patrick Littell | Graham Neubig
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Cross-lingual transfer, where a high-resource transfer language is used to improve the accuracy of a low-resource task language, is now an invaluable tool for improving performance of natural language processing (NLP) on low-resource languages. However, given a particular task language, it is not clear which language to transfer from, and the standard strategy is to select languages based on ad hoc criteria, usually the intuition of the experimenter. Since a large number of features contribute to the success of cross-lingual transfer (including phylogenetic similarity, typological properties, lexical overlap, or size of available data), even the most enlightened experimenter rarely considers all these factors for the particular task at hand. In this paper, we consider this task of automatically selecting optimal transfer languages as a ranking problem, and build models that consider the aforementioned features to perform this prediction. In experiments on representative NLP tasks, we demonstrate that our model predicts good transfer languages much better than ad hoc baselines considering single features in isolation, and glean insights on what features are most informative for each different NLP task, which may inform future ad hoc selection even without use of our method.

pdf bib
Cross-Lingual Syntactic Transfer through Unsupervised Adaptation of Invertible Projections
Junxian He | Zhisong Zhang | Taylor Berg-Kirkpatrick | Graham Neubig
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Cross-lingual transfer is an effective way to build syntactic analysis tools in low-resource languages. However, transfer is difficult when transferring to typologically distant languages, especially when neither annotated target data nor parallel corpora are available. In this paper, we focus on methods for cross-lingual transfer to distant languages and propose to learn a generative model with a structured prior that utilizes labeled source data and unlabeled target data jointly. The parameters of source model and target model are softly shared through a regularized log likelihood objective. An invertible projection is employed to learn a new interlingual latent embedding space that compensates for imperfect cross-lingual word embedding input. We evaluate our method on two syntactic tasks: part-of-speech (POS) tagging and dependency parsing. On the Universal Dependency Treebanks, we use English as the only source corpus and transfer to a wide range of target languages. On the 10 languages in this dataset that are distant from English, our method yields an average of 5.2% absolute improvement on POS tagging and 8.3% absolute improvement on dependency parsing over a direct transfer method using state-of-the-art discriminative models.

pdf bib
Reranking for Neural Semantic Parsing
Pengcheng Yin | Graham Neubig
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Semantic parsing considers the task of transducing natural language (NL) utterances into machine executable meaning representations (MRs). While neural network-based semantic parsers have achieved impressive improvements over previous methods, results are still far from perfect, and cursory manual inspection can easily identify obvious problems such as lack of adequacy or coherence of the generated MRs. This paper presents a simple approach to quickly iterate and improve the performance of an existing neural semantic parser by reranking an n-best list of predicted MRs, using features that are designed to fix observed problems with baseline models. We implement our reranker in a competitive neural semantic parser and test on four semantic parsing (GEO, ATIS) and Python code generation (Django, CoNaLa) tasks, improving the strong baseline parser by up to 5.7% absolute in BLEU (CoNaLa) and 2.9% in accuracy (Django), outperforming the best published neural parser results on all four datasets.

pdf bib
Improving Open Information Extraction via Iterative Rank-Aware Learning
Zhengbao Jiang | Pengcheng Yin | Graham Neubig
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Open information extraction (IE) is the task of extracting open-domain assertions from natural language sentences. A key step in open IE is confidence modeling, ranking the extractions based on their estimated quality to adjust precision and recall of extracted assertions. We found that the extraction likelihood, a confidence measure used by current supervised open IE systems, is not well calibrated when comparing the quality of assertions extracted from different sentences. We propose an additional binary classification loss to calibrate the likelihood to make it more globally comparable, and an iterative learning process, where extractions generated by the open IE model are incrementally included as training samples to help the model learn from trial and error. Experiments on OIE2016 demonstrate the effectiveness of our method. Code and data are available at https://github.com/jzbjyb/oie_rank.

pdf bib
Generalized Data Augmentation for Low-Resource Translation
Mengzhou Xia | Xiang Kong | Antonios Anastasopoulos | Graham Neubig
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Low-resource language pairs with a paucity of parallel data pose challenges for machine translation in terms of both adequacy and fluency. Data augmentation utilizing a large amount of monolingual data is regarded as an effective way to alleviate the problem. In this paper, we propose a general framework of data augmentation for low-resource machine translation not only using target-side monolingual data, but also by pivoting through a related high-resource language. Specifically, we experiment with a two-step pivoting method to convert high-resource data to the low-resource language, making best use of available resources to better approximate the true distribution of the low-resource language. First, we inject low-resource words into high-resource sentences through an induced bilingual dictionary. Second, we further edit the high-resource data injected with low-resource words using a modified unsupervised machine translation framework. Extensive experiments on four low-resource datasets show that under extreme low-resource settings, our data augmentation techniques improve translation quality by 1.5 to 8 BLEU points compared to supervised back-translation baselines.

pdf bib
Target Conditioned Sampling: Optimizing Data Selection for Multilingual Neural Machine Translation
Xinyi Wang | Graham Neubig
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

To improve low-resource Neural Machine Translation (NMT) with multilingual corpus, training on the most related high-resource language only is generally more effective than using all data available (Neubig and Hu, 2018). However, it remains a question whether a smart data selection strategy can further improve low-resource NMT with data from other auxiliary languages. In this paper, we seek to construct a sampling distribution over all multilingual data, so that it minimizes the training loss of the low-resource language. Based on this formulation, we propose an efficient algorithm, Target Conditioned Sampling (TCS), which first samples a target sentence, and then conditionally samples its source sentence. Experiments show that TCS brings significant gains of up to 2 BLEU on three of the four languages we test, with minimal training overhead.
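
A toy sketch of the two-step sampling order described above (target first, then a source conditioned on it); the data layout and the per-language weights are assumptions, not the optimized distribution from the paper.

    import random

    def target_conditioned_sample(parallel_data, source_lang_weights, rng=random):
        # parallel_data: list of (target_sentence, {source_lang: source_sentence}) entries.
        # Step 1: sample a target sentence. Step 2: sample one of its source sentences
        # according to per-language weights.
        target, sources = rng.choice(parallel_data)
        langs = list(sources)
        weights = [source_lang_weights.get(l, 1.0) for l in langs]
        lang = rng.choices(langs, weights=weights, k=1)[0]
        return sources[lang], target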

pdf bib
Comparing Top-Down and Bottom-Up Neural Generative Dependency Models
Austin Matthews | Graham Neubig | Chris Dyer
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

Recurrent neural network grammars generate sentences using phrase-structure syntax and perform very well on both parsing and language modeling. To explore whether generative dependency models are similarly effective, we propose two new generative models of dependency syntax. Both models use recurrent neural nets to avoid making explicit independence assumptions, but they differ in the order used to construct the trees: one builds the tree bottom-up and the other top-down, which profoundly changes the estimation problem faced by the learner. We evaluate the two models on three typologically different languages: English, Arabic, and Japanese. While both generative models improve parsing performance over a discriminative baseline, they are significantly less effective than non-syntactic LSTM language models. Surprisingly, little difference between the construction orders is observed for either parsing or language modeling.

2018

pdf bib
SwitchOut: an Efficient Data Augmentation Algorithm for Neural Machine Translation
Xinyi Wang | Hieu Pham | Zihang Dai | Graham Neubig
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In this work, we examine methods for data augmentation for text-based tasks such as neural machine translation (NMT). We formulate the design of a data augmentation policy with desirable properties as an optimization problem, and derive a generic analytic solution. This solution not only subsumes some existing augmentation schemes, but also leads to an extremely simple data augmentation strategy for NMT: randomly replacing words in both the source sentence and the target sentence with other random words from their corresponding vocabularies. We name this method SwitchOut. Experiments on three translation datasets of different scales show that SwitchOut yields consistent improvements of about 0.5 BLEU, achieving better or comparable performances to strong alternatives such as word dropout (Sennrich et al., 2016a). Code to implement this method is included in the appendix.
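
A minimal sketch of the replacement idea summarized above, applied independently to the source and target sides of a training pair; the fixed per-token replacement probability here stands in for the paper's temperature-controlled formulation and is an assumption.

    import random

    def switchout(tokens, vocabulary, replace_prob=0.1, rng=random):
        # Replace each token, with some probability, by a word sampled uniformly
        # from the corresponding vocabulary.
        return [rng.choice(vocabulary) if rng.random() < replace_prob else t
                for t in tokens]

    # Apply to both sides of a training pair, each with its own vocabulary.
    src_vocab = ["the", "a", "cat", "dog", "sat", "ran"]
    tgt_vocab = ["le", "un", "chat", "chien", "assis", "couru"]
    aug_src = switchout("the cat sat".split(), src_vocab)
    aug_tgt = switchout("le chat assis".split(), tgt_vocab)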

pdf bib
Rapid Adaptation of Neural Machine Translation to New Languages
Graham Neubig | Junjie Hu
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

This paper examines the problem of adapting neural machine translation systems to new, low-resourced languages (LRLs) as effectively and rapidly as possible. We propose methods based on starting with massively multilingual seed models, which can be trained ahead-of-time, and then continuing training on data related to the LRL. We contrast a number of strategies, leading to a novel, simple, yet effective method of similar-language regularization, where we jointly train on both a LRL of interest and a similar high-resourced language to prevent over-fitting to small LRL data. Experiments demonstrate that massively multilingual models, even without any explicit adaptation, are surprisingly effective, achieving BLEU scores of up to 15.5 with no data from the LRL, and that the proposed similar-language regularization method improves over other adaptation methods by 1.7 BLEU points average over 4 LRL settings.
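
A rough sketch of similar-language regularization viewed as data mixing between the low-resource language and a related high-resource language; the 1:1 ratio and the batch construction are illustrative assumptions.

    import random

    def mixed_batch(lrl_pairs, hrl_pairs, batch_size=32, lrl_fraction=0.5, rng=random):
        # Each batch mixes the low-resource language of interest with a related
        # high-resource language so training does not overfit the tiny LRL corpus.
        n_lrl = int(batch_size * lrl_fraction)
        batch = rng.choices(lrl_pairs, k=n_lrl) + rng.choices(hrl_pairs, k=batch_size - n_lrl)
        rng.shuffle(batch)
        return batch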

pdf bib
Retrieval-Based Neural Code Generation
Shirley Anugrah Hayati | Raphael Olivier | Pravalika Avvaru | Pengcheng Yin | Anthony Tomasic | Graham Neubig
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

In models to generate program source code from natural language, representing this code in a tree structure has been a common approach. However, existing methods often fail to generate complex code correctly due to a lack of ability to memorize large and complex structures. We introduce RECODE, a method based on subtree retrieval that makes it possible to explicitly reference existing code examples within a neural code generation model. First, we retrieve sentences that are similar to input sentences using a dynamic-programming-based sentence similarity scoring method. Next, we extract n-grams of action sequences that build the associated abstract syntax tree. Finally, we increase the probability of actions that cause the retrieved n-gram action subtree to be in the predicted code. We show that our approach improves the performance on two code generation tasks by up to +2.6 BLEU.

pdf bib
Unsupervised Learning of Syntactic Structure with Invertible Neural Projections
Junxian He | Graham Neubig | Taylor Berg-Kirkpatrick
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Unsupervised learning of syntactic structure is typically performed using generative models with discrete latent variables and multinomial parameters. In most cases, these models have not leveraged continuous word representations. In this work, we propose a novel generative model that jointly learns discrete syntactic structure and continuous word representations in an unsupervised fashion by cascading an invertible neural network with a structured generative prior. We show that the invertibility condition allows for efficient exact inference and marginal likelihood computation in our model so long as the prior is well-behaved. In experiments we instantiate our approach with both Markov and tree-structured priors, evaluating on two tasks: part-of-speech (POS) induction, and unsupervised dependency parsing without gold POS annotation. On the Penn Treebank, our Markov-structured model surpasses state-of-the-art results on POS induction. Similarly, we find that our tree-structured model achieves state-of-the-art performance on unsupervised dependency parsing for the difficult training condition where neither gold POS annotation nor punctuation-based constraints are available.

pdf bib
A Tree-based Decoder for Neural Machine Translation
Xinyi Wang | Hieu Pham | Pengcheng Yin | Graham Neubig
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Recent advances in Neural Machine Translation (NMT) show that adding syntactic information to NMT systems can improve the quality of their translations. Most existing work utilizes some specific types of linguistically-inspired tree structures, like constituency and dependency parse trees. This is often done via a standard RNN decoder that operates on a linearized target tree structure. However, it is an open question of what specific linguistic formalism, if any, is the best structural representation for NMT. In this paper, we (1) propose an NMT model that can naturally generate the topology of an arbitrary tree structure on the target side, and (2) experiment with various target tree structures. Our experiments show the surprising result that our model delivers the best improvements with balanced binary trees constructed without any linguistic knowledge; this model outperforms standard seq2seq models by up to 2.1 BLEU points, and other methods for incorporating target-side syntax by up to 0.7 BLEU.

pdf bib
Neural Lattice Language Models
Jacob Buckman | Graham Neubig
Transactions of the Association for Computational Linguistics, Volume 6

In this work, we propose a new language modeling paradigm that has the ability to perform both prediction and moderation of information flow at multiple granularities: neural lattice language models. These models construct a lattice of possible paths through a sentence and marginalize across this lattice to calculate sequence probabilities or optimize parameters. This approach allows us to seamlessly incorporate linguistic intuitions including polysemy and the existence of multiword lexical items into our language model. Experiments on multiple language modeling tasks show that English neural lattice language models that utilize polysemous embeddings are able to improve perplexity by 9.95% relative to a word-level baseline, and that a Chinese model that handles multi-character tokens is able to improve perplexity by 20.94% relative to a character-level baseline.

pdf bib
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)
Colin Cherry | Graham Neubig
Proceedings of the 13th Conference of the Association for Machine Translation in the Americas (Volume 1: Research Track)

pdf bib
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation
Alexandra Birch | Andrew Finch | Thang Luong | Graham Neubig | Yusuke Oda
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

pdf bib
Findings of the Second Workshop on Neural Machine Translation and Generation
Alexandra Birch | Andrew Finch | Minh-Thang Luong | Graham Neubig | Yusuke Oda
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

This document describes the findings of the Second Workshop on Neural Machine Translation and Generation, held in concert with the annual conference of the Association for Computational Linguistics (ACL 2018). First, we summarize the research trends of papers presented in the proceedings, and note that there is particular interest in linguistic structure, domain adaptation, data augmentation, handling inadequate resources, and analysis of models. Second, we describe the results of the workshop’s shared task on efficient neural machine translation, where participants were tasked with creating MT systems that are both accurate and efficient.

pdf bib
Multi-Source Neural Machine Translation with Missing Data
Yuta Nishimura | Katsuhito Sudoh | Graham Neubig | Satoshi Nakamura
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation

Multi-source translation is an approach to exploit multiple inputs (e.g. in two different languages) to increase translation accuracy. In this paper, we examine approaches for multi-source neural machine translation (NMT) using an incomplete multilingual corpus in which some translations are missing. In practice, many multilingual corpora are not complete due to the difficulty of providing translations in all of the relevant languages (for example, in TED talks, most English talks only have subtitles for a small portion of the languages that TED supports). Existing studies on multi-source translation did not explicitly handle such situations. This study focuses on the use of incomplete multilingual corpora in multi-encoder NMT and mixture of NMT experts, and examines a very simple implementation in which missing source translations are replaced by a special NULL symbol. These methods allow us to use incomplete corpora both at training time and test time. In experiments with real incomplete multilingual corpora of TED Talks, multi-source NMT with NULL tokens achieved higher BLEU scores than any of the corresponding one-to-one NMT systems.
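The core of the NULL-symbol trick is small enough to sketch directly (the token spelling and function name below are illustrative, not taken from the paper):

NULL = "<NULL>"  # special symbol standing in for a missing source sentence

def fill_missing_sources(example, source_langs):
    # Give the multi-encoder model one input per source language, substituting
    # the NULL symbol wherever a translation is missing from the corpus.
    # `example` maps language codes to sentences; some may be absent.
    return {lang: example.get(lang, NULL) for lang in source_langs}

# fill_missing_sources({"en": "Hello .", "fr": "Bonjour ."}, ["en", "fr", "de"])
# -> {"en": "Hello .", "fr": "Bonjour .", "de": "<NULL>"}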

pdf bib
Attentive Interaction Model: Modeling Changes in View in Argumentation
Yohan Jo | Shivani Poddar | Byungsoo Jeon | Qinlan Shen | Carolyn Rosé | Graham Neubig
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

We present a neural architecture for modeling argumentative dialogue that explicitly models the interplay between an Opinion Holder’s (OH’s) reasoning and a challenger’s argument, with the goal of predicting if the argument successfully changes the OH’s view. The model has two components: (1) vulnerable region detection, an attention model that identifies parts of the OH’s reasoning that are amenable to change, and (2) interaction encoding, which identifies the relationship between the content of the OH’s reasoning and that of the challenger’s argument. Based on evaluation on discussions from the Change My View forum on Reddit, the two components work together to predict an OH’s change in view, outperforming several baselines. A post-hoc analysis suggests that sentences picked out by the attention model are addressed more frequently by successful arguments than by unsuccessful ones.

pdf bib
Guiding Neural Machine Translation with Retrieved Translation Pieces
Jingyi Zhang | Masao Utiyama | Eiichro Sumita | Graham Neubig | Satoshi Nakamura
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

One of the difficulties of neural machine translation (NMT) is the recall and appropriate translation of low-frequency words or phrases. In this paper, we propose a simple, fast, and effective method for recalling previously seen translation examples and incorporating them into the NMT decoding process. Specifically, for an input sentence, we use a search engine to retrieve sentence pairs whose source sides are similar to the input sentence, and then collect n-grams that are both in the retrieved target sentences and aligned with words that match in the source sentences, which we call translation pieces. We compute pseudo-probabilities for each retrieved sentence based on similarities between the input sentence and the retrieved source sentences, and use these to weight the retrieved translation pieces. Finally, an existing NMT model is used to translate the input sentence, with an additional bonus given to outputs that contain the collected translation pieces. We show that our method improves NMT translation results by up to 6 BLEU points on three narrow-domain translation tasks where repetitiveness of the target sentences is particularly salient. It also causes little increase in translation time, and compares favorably to an alternative retrieval-based method with respect to accuracy, speed, and simplicity of implementation.
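A minimal sketch of the piece collection and decoding bonus described here (it omits the word-alignment filter the paper applies, and all names are mine):

def collect_translation_pieces(retrieved, max_n=4):
    # `retrieved` is a list of (similarity, target_tokens) pairs; the similarity
    # plays the role of the pseudo-probability used to weight each piece.
    pieces = {}
    for sim, tgt in retrieved:
        for n in range(1, max_n + 1):
            for i in range(len(tgt) - n + 1):
                gram = tuple(tgt[i:i + n])
                pieces[gram] = max(pieces.get(gram, 0.0), sim)
    return pieces

def piece_bonus(hypothesis, pieces, weight=1.0, max_n=4):
    # Additive reward given to a decoder output for every collected piece it contains.
    bonus = 0.0
    for n in range(1, max_n + 1):
        for i in range(len(hypothesis) - n + 1):
            bonus += weight * pieces.get(tuple(hypothesis[i:i + n]), 0.0)
    return bonus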

pdf bib
Handling Homographs in Neural Machine Translation
Frederick Liu | Han Lu | Graham Neubig
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Homographs, words with different meanings but the same surface form, have long caused difficulty for machine translation systems, as it is difficult to select the correct translation based on the context. However, with the advent of neural machine translation (NMT) systems, which can theoretically take into account global sentential context, one may hypothesize that this problem has been alleviated. In this paper, we first provide empirical evidence that existing NMT systems in fact still have significant problems in properly translating ambiguous words. We then proceed to describe methods, inspired by the word sense disambiguation literature, that model the context of the input word with context-aware word embeddings that help to differentiate the word sense before feeding it into the encoder. Experiments on three language pairs demonstrate that such models improve the performance of NMT systems both in terms of BLEU score and in the accuracy of translating homographs.

pdf bib
When and Why Are Pre-Trained Word Embeddings Useful for Neural Machine Translation?
Ye Qi | Devendra Sachan | Matthieu Felix | Sarguna Padmanabhan | Graham Neubig
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

The performance of Neural Machine Translation (NMT) systems often suffers in low-resource scenarios where sufficiently large-scale parallel corpora cannot be obtained. Pre-trained word embeddings have proven to be invaluable for improving performance in natural language analysis tasks, which often suffer from a paucity of data. However, their utility for NMT has not been extensively explored. In this work, we perform five sets of experiments that analyze when we can expect pre-trained word embeddings to help in NMT tasks. We show that such embeddings can be surprisingly effective in some cases, providing gains of up to 20 BLEU points in the most favorable setting.
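The most common way to plug pre-trained embeddings into an NMT system is simply to use them to initialize the embedding matrix; the sketch below shows one such recipe and is not a description of the paper's exact experimental settings:

import numpy as np

def init_embeddings(vocab, pretrained, dim, seed=0):
    # Initialize an embedding matrix from pre-trained vectors where available,
    # falling back to small random vectors for words without a pre-trained entry.
    rng = np.random.default_rng(seed)
    E = rng.normal(scale=0.1, size=(len(vocab), dim))
    for i, word in enumerate(vocab):
        if word in pretrained:
            E[i] = np.asarray(pretrained[word])
    return E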

pdf bib
Stack-Pointer Networks for Dependency Parsing
Xuezhe Ma | Zecong Hu | Jingzhou Liu | Nanyun Peng | Graham Neubig | Eduard Hovy
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We introduce a novel architecture for dependency parsing: stack-pointer networks (StackPtr). Combining pointer networks (Vinyals et al., 2015) with an internal stack, the proposed model first reads and encodes the whole sentence, then builds the dependency tree top-down (from root to leaf) in a depth-first fashion. The stack tracks the status of the depth-first search and the pointer network selects one child for the word at the top of the stack at each step. The StackPtr parser benefits from the information of the whole sentence and all previously derived subtree structures, and removes the left-to-right restriction of classical transition-based parsers. Yet the number of steps for building any (non-projective) parse tree is linear in the length of the sentence, just as in other transition-based parsers, yielding an efficient decoding algorithm with O(n^2) time complexity. We evaluate our model on 29 treebanks spanning 20 languages and different dependency annotation schemas, and achieve state-of-the-art performance on 21 of them.
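The decoding loop can be pictured roughly as follows (a sketch of my own, with an oracle `select_child` standing in for the pointer network; it is not the authors' implementation):

def stackptr_decode(n_words, select_child):
    # Word 0 is a pseudo-root.  select_child(head, remaining) returns the index
    # of the next child to attach under `head`, or `head` itself to signal that
    # `head` has no further children and should be popped.
    heads = {}
    remaining = set(range(1, n_words + 1))
    stack = [0]
    while stack and remaining:
        top = stack[-1]
        child = select_child(top, remaining)
        if child == top:
            stack.pop()                 # head is finished: backtrack
        else:
            heads[child] = top          # attach the chosen child and descend into it
            remaining.remove(child)
            stack.append(child)
    return heads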

pdf bib
Learning to Generate Move-by-Move Commentary for Chess Games from Large-Scale Social Forum Data
Harsh Jhamtani | Varun Gangal | Eduard Hovy | Graham Neubig | Taylor Berg-Kirkpatrick
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

This paper examines the problem of generating natural language descriptions of chess games. We introduce a new large-scale chess commentary dataset and propose methods to generate commentary for individual moves in a chess game. The introduced dataset consists of more than 298 K chess move-commentary pairs across 11 K chess games. We highlight how this task poses unique research challenges in natural language generation: the data contain a large variety of styles of commentary and frequently depend on pragmatic context. We benchmark various baselines and propose an end-to-end trainable neural model which takes into account multiple pragmatic aspects of the game state that may be commented upon to describe a given chess move. Through a human study on predictions for a subset of the data which deals with direct move descriptions, we observe that outputs from our models are rated similar to ground truth commentary texts in terms of correctness and fluency.

pdf bib
Extreme Adaptation for Personalized Neural Machine Translation
Paul Michel | Graham Neubig
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Every person speaks or writes their own flavor of their native language, influenced by a number of factors: the content they tend to talk about, their gender, their social status, or their geographical origin. When attempting to perform Machine Translation (MT), these variations have a significant effect on how the system should perform translation, but this is not captured well by standard one-size-fits-all models. In this paper, we propose a simple and parameter-efficient adaptation technique that only requires adapting the bias of the output softmax to each particular user of the MT system, either directly or through a factored approximation. Experiments on TED talks in three languages demonstrate improvements in translation accuracy, and better reflection of speaker traits in the target text.
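Since the only user-specific parameters are an additive bias on the output vocabulary, the adapted output layer fits in a few lines (a sketch under that reading of the abstract; the factored approximation it mentions is not shown, and all names are mine):

import numpy as np

def personalized_logits(h, W, b, user_bias, user_id):
    # h: decoder state (d,); W: output projection (V, d); b: shared bias (V,).
    # user_bias maps a user id to a (V,) vector; unseen users get a zero bias,
    # which recovers the unadapted model.
    b_u = user_bias.get(user_id, np.zeros(W.shape[0]))
    return W @ h + b + b_u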

pdf bib
Multi-Source Neural Machine Translation with Data Augmentation
Yuta Nishimura | Katsuhito Sudoh | Graham Neubig | Satoshi Nakamura
Proceedings of the 15th International Conference on Spoken Language Translation

Multi-source translation systems translate from multiple languages to a single target language. By using information from these multiple sources, these systems achieve large gains in accuracy. To train these systems, it is necessary to have corpora with parallel text in multiple sources and the target language. However, these corpora are rarely complete in practice due to the difficulty of providing human translations in all of the relevant languages. In this paper, we propose a data augmentation approach to fill such incomplete parts using multi-source neural machine translation (NMT). In our experiments, results varied over different language combinations but significant gains were observed when using a source language similar to the target language.

2017

pdf bib
Improving Neural Machine Translation through Phrase-based Forced Decoding
Jingyi Zhang | Masao Utiyama | Eiichro Sumita | Graham Neubig | Satoshi Nakamura
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Compared to traditional statistical machine translation (SMT), neural machine translation (NMT) often sacrifices adequacy for the sake of fluency. We propose a method to combine the advantages of traditional SMT and NMT by exploiting an existing phrase-based SMT model to compute the phrase-based decoding cost for an NMT output and then using the phrase-based decoding cost to rerank the n-best NMT outputs. The main challenge in implementing this approach is that NMT outputs may not be in the search space of the standard phrase-based decoding algorithm, because the search space of phrase-based SMT is limited by the phrase-based translation rule table. We propose a soft forced decoding algorithm, which can always successfully find a decoding path for any NMT output. We show that using the forced decoding cost to rerank the NMT outputs can successfully improve translation quality on four different language pairs.
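At reranking time the idea reduces to combining two scores per hypothesis; the linear interpolation below is an illustrative stand-in for however the paper actually combines them:

def rerank_nbest(nbest, forced_decoding_cost, alpha=0.5):
    # nbest: list of (nmt_log_prob, hypothesis) pairs.
    # forced_decoding_cost(hypothesis): lower-is-better cost of forcing the
    # phrase-based model to reproduce the hypothesis (the soft forced decoder
    # guarantees such a cost exists for any output).
    def score(item):
        nmt_log_prob, hyp = item
        return nmt_log_prob - alpha * forced_decoding_cost(hyp)
    return sorted(nbest, key=score, reverse=True)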

pdf bib
Multi-space Variational Encoder-Decoders for Semi-supervised Labeled Sequence Transduction
Chunting Zhou | Graham Neubig
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Labeled sequence transduction is the task of transforming one sequence into another sequence that satisfies desiderata specified by a set of labels. In this paper we propose multi-space variational encoder-decoders, a new model for labeled sequence transduction with semi-supervised learning. The generative model can use neural networks to handle both discrete and continuous latent variables to exploit various features of the data. Experiments show that our model not only provides a powerful supervised framework but also effectively takes advantage of unlabeled data. On the SIGMORPHON morphological inflection benchmark, our model outperforms single-model state-of-the-art results by a large margin for the majority of languages.

pdf bib
A Syntactic Neural Model for General-Purpose Code Generation
Pengcheng Yin | Graham Neubig
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We consider the problem of parsing natural language descriptions into source code written in a general-purpose programming language like Python. Existing data-driven methods treat this problem as a language generation task without considering the underlying syntax of the target programming language. Informed by previous work in semantic parsing, in this paper we propose a novel neural architecture powered by a grammar model to explicitly capture the target syntax as prior knowledge. Experiments find this an effective way to scale up to generation of complex programs from natural language descriptions, achieving state-of-the-art results that substantially outperform previous code generation and semantic parsing approaches.

pdf bib
Neural Machine Translation via Binary Code Prediction
Yusuke Oda | Philip Arthur | Graham Neubig | Koichiro Yoshino | Satoshi Nakamura
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In this paper, we propose a new method for calculating the output layer in neural machine translation systems. The method is based on predicting a binary code for each word, and can reduce the computation time and memory requirements of the output layer to be logarithmic in vocabulary size in the best case. In addition, we also introduce two advanced approaches to improve the robustness of the proposed model: using error-correcting codes and combining softmax and binary codes. Experiments on two English-Japanese bidirectional translation tasks show that the proposed models achieve BLEU scores that approach the softmax, while reducing memory usage to less than 1/10 of the softmax and improving decoding speed on CPUs by 5x to 10x.
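The basic (error-correction-free) variant of a binary-code output layer can be sketched as follows; the parameterization and names are mine, and a real implementation would batch this and guard against saturated sigmoids:

import numpy as np

def word_bits(word_id, n_bits):
    # Binary code of a word id, lowest bit first.
    return np.array([(word_id >> k) & 1 for k in range(n_bits)], dtype=np.float64)

def binary_code_log_prob(h, W, b, word_id):
    # Instead of a V-way softmax, predict ceil(log2 V) independent bits with
    # sigmoids, making the output layer logarithmic in the vocabulary size.
    # h: decoder state (d,); W: (n_bits, d); b: (n_bits,).
    n_bits = W.shape[0]
    q = 1.0 / (1.0 + np.exp(-(W @ h + b)))   # P(bit_k = 1)
    bits = word_bits(word_id, n_bits)
    return float(np.sum(bits * np.log(q) + (1 - bits) * np.log(1 - q)))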

pdf bib
Learning Character-level Compositionality with Visual Features
Frederick Liu | Han Lu | Chieh Lo | Graham Neubig
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Previous work has modeled the compositionality of words by creating character-level models of meaning, reducing problems of sparsity for rare words. However, in many writing systems compositionality has an effect even at the character level: the meaning of a character is derived from the sum of its parts. In this paper, we model this effect by creating embeddings for characters based on their visual characteristics, creating an image for the character and running it through a convolutional neural network to produce a visual character embedding. Experiments on a text classification task demonstrate that such a model allows for better processing of instances with rare characters in languages such as Chinese, Japanese, and Korean. Additionally, qualitative analyses demonstrate that our proposed model learns to focus on the parts of characters that carry topical content, resulting in embeddings that are coherent in visual space.
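In outline, the visual embedding is a small CNN over a rendered glyph; the module below is a toy version with placeholder layer sizes of my own choosing, not the paper's architecture:

import torch
import torch.nn as nn

class VisualCharEmbedder(nn.Module):
    # Embed a character by running its rendered glyph image through a CNN.
    def __init__(self, emb_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.proj = nn.Linear(32 * 9 * 9, emb_dim)  # assumes 36x36 glyph images

    def forward(self, glyphs):  # glyphs: (batch, 1, 36, 36) grayscale renderings
        return self.proj(self.conv(glyphs).flatten(1))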

pdf bib
Proceedings of the First Workshop on Neural Machine Translation
Thang Luong | Alexandra Birch | Graham Neubig | Andrew Finch
Proceedings of the First Workshop on Neural Machine Translation

pdf bib
Stronger Baselines for Trustable Results in Neural Machine Translation
Michael Denkowski | Graham Neubig
Proceedings of the First Workshop on Neural Machine Translation

Interest in neural machine translation has grown rapidly as its effectiveness has been demonstrated across language and data scenarios. New research regularly introduces architectural and algorithmic improvements that lead to significant gains over vanilla NMT implementations. However, these new techniques are rarely evaluated in the context of previously published techniques, specifically those that are widely used in state-of-the-art production and shared-task systems. As a result, it is often difficult to determine whether improvements from research will carry over to systems deployed for real-world use. In this work, we recommend three specific methods that are relatively easy to implement and result in much stronger experimental systems. Beyond reporting significantly higher BLEU scores, we conduct an in-depth analysis of where improvements originate and what inherent weaknesses of basic NMT models are being addressed. We then compare the relative gains afforded by several other techniques proposed in the literature when starting with vanilla systems versus our stronger baselines, showing that experimental conclusions may change depending on the baseline chosen. This indicates that choosing a strong baseline is crucial for reporting reliable experimental results.

pdf bib
How Would You Say It? Eliciting Lexically Diverse Dialogue for Supervised Semantic Parsing
Abhilasha Ravichander | Thomas Manzini | Matthias Grabmair | Graham Neubig | Jonathan Francis | Eric Nyberg
Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue

Building dialogue interfaces for real-world scenarios often entails training semantic parsers starting from zero examples. How can we build datasets that better capture the variety of ways users might phrase their queries, and what queries are actually realistic? Wang et al. (2015) proposed a method to build semantic parsing datasets by generating canonical utterances using a grammar and having crowdworkers paraphrase them into natural wording. A limitation of this approach is that it induces bias towards using similar language as the canonical utterances. In this work, we present a methodology that elicits meaningful and lexically diverse queries from users for semantic parsing tasks. Starting from a seed lexicon and a generative grammar, we pair logical forms with mixed text-image representations and ask crowdworkers to paraphrase and confirm the plausibility of the queries that they generated. We use this method to build a semantic parsing dataset from scratch for a dialog agent in a smart-home simulation. We find evidence that this dataset, which we have named SmartHome, is demonstrably more lexically diverse and difficult to parse than existing domain-specific semantic parsing datasets.

pdf bib
Learning Language Representations for Typology Prediction
Chaitanya Malaviya | Graham Neubig | Patrick Littell
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

One central mystery of neural NLP is what neural models know about their subject matter. When a neural machine translation system learns to translate from one language to another, does it learn the syntax or semantics of the languages? Can this knowledge be extracted from the system to fill holes in human scientific knowledge? Existing typological databases contain relatively full feature specifications for only a few hundred languages. Exploiting the existence of parallel texts in more than a thousand languages, we build a massive many-to-one NMT system from 1017 languages into English, and use this to predict information missing from typological databases. Experiments show that the proposed method is able to infer not only syntactic, but also phonological and phonetic inventory features, and improves over a baseline that has access to information about the languages’ geographic and phylogenetic neighbors.
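One simple way to realize the prediction step, assuming per-language representations have already been extracted from the NMT system (the classifier choice and data layout here are my own illustration):

from sklearn.linear_model import LogisticRegression

def predict_typology_feature(lang_vecs, known_values, query_langs):
    # lang_vecs: language code -> representation learned by the NMT system.
    # known_values: language code -> gold value of one typological feature.
    # Returns predicted values for query languages that have representations.
    train_langs = [l for l in known_values if l in lang_vecs]
    clf = LogisticRegression(max_iter=1000)
    clf.fit([lang_vecs[l] for l in train_langs], [known_values[l] for l in train_langs])
    return {l: clf.predict([lang_vecs[l]])[0] for l in query_langs if l in lang_vecs}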

pdf bib
Charmanteau: Character Embedding Models For Portmanteau Creation
Varun Gangal | Harsh Jhamtani | Graham Neubig | Eduard Hovy | Eric Nyberg
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Portmanteaus are a word formation phenomenon where two words combine into a new word. We propose character-level neural sequence-to-sequence (S2S) methods for the task of portmanteau generation that are end-to-end-trainable, language independent, and do not explicitly use additional phonetic information. We propose a noisy-channel-style model, which allows for the incorporation of unsupervised word lists, improving performance over a standard source-to-target model. This model is made possible by an exhaustive candidate generation strategy specifically enabled by the features of the portmanteau task. Experiments find our approach superior to a state-of-the-art FST-based baseline with respect to ground truth accuracy and human evaluation.

pdf bib
Cross-Lingual Word Embeddings for Low-Resource Language Modeling
Oliver Adams | Adam Makarucha | Graham Neubig | Steven Bird | Trevor Cohn
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Most languages have no established writing system and minimal written records. However, textual data is essential for natural language processing, and particularly important for training language models to support speech recognition. Even in cases where text data is missing, there are some languages for which bilingual lexicons are available, since creating lexicons is a fundamental task of documentary linguistics. We investigate the use of such lexicons to improve language models when textual training data is limited to as few as a thousand sentences. The method involves learning cross-lingual word embeddings as a preliminary step in training monolingual language models. Results across a number of languages show that language models are improved by this pre-training. Application to Yongning Na, a threatened language, highlights challenges in deploying the approach in real low-resource environments.