Makoto Morishita


2021

Proceedings of the Sixth Conference on Machine Translation
Loïc Barrault | Ondřej Bojar | Fethi Bougares | Rajen Chatterjee | Marta R. Costa-jussà | Christian Federmann | Mark Fishel | Alexander Fraser | Markus Freitag | Yvette Graham | Roman Grundkiewicz | Paco Guzman | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | Tom Kocmi | André Martins | Makoto Morishita | Christof Monz
Proceedings of the Sixth Conference on Machine Translation

2020

Proceedings of the Fifth Conference on Machine Translation
Loïc Barrault | Ondřej Bojar | Fethi Bougares | Rajen Chatterjee | Marta R. Costa-jussà | Christian Federmann | Mark Fishel | Alexander Fraser | Yvette Graham | Paco Guzman | Barry Haddow | Matthias Huck | Antonio Jimeno Yepes | Philipp Koehn | André Martins | Makoto Morishita | Christof Monz | Masaaki Nagata | Toshiaki Nakazawa | Matteo Negri
Proceedings of the Fifth Conference on Machine Translation

Tohoku-AIP-NTT at WMT 2020 News Translation Task
Shun Kiyono | Takumi Ito | Ryuto Konno | Makoto Morishita | Jun Suzuki
Proceedings of the Fifth Conference on Machine Translation

In this paper, we describe the submission of Tohoku-AIP-NTT to the WMT’20 news translation task. We participated in this task in two language pairs and four language directions: English↔German and English↔Japanese. Our system consists of techniques such as back-translation and fine-tuning, which are already widely adopted in translation tasks. We attempted to develop new methods for both synthetic data filtering and reranking. However, the methods turned out to be ineffective, and they provided us with no significant improvement over the baseline. We analyze these negative results to provide insights for future studies.

JParaCrawl: A Large Scale Web-Based English-Japanese Parallel Corpus
Makoto Morishita | Jun Suzuki | Masaaki Nagata
Proceedings of the 12th Language Resources and Evaluation Conference

Recent machine translation algorithms mainly rely on parallel corpora. However, since the availability of parallel corpora remains limited, only some resource-rich language pairs can benefit from them. We constructed a parallel corpus for English-Japanese, for which the amount of publicly available parallel corpora is still limited. We constructed the parallel corpus by broadly crawling the web and automatically aligning parallel sentences. Our collected corpus, called JParaCrawl, amassed over 8.7 million sentence pairs. We show how it includes a broader range of domains and how a neural machine translation model trained with it works as a good pre-trained model for fine-tuning on specific domains. The pre-training and fine-tuning approach matched or surpassed the performance of models trained from the initial state while reducing the training time. Additionally, we trained the model with an in-domain dataset and JParaCrawl to show how we achieved the best performance with them. JParaCrawl and the pre-trained models are freely available online for research purposes.

A Test Set for Discourse Translation from Japanese to English
Masaaki Nagata | Makoto Morishita
Proceedings of the 12th Language Resources and Evaluation Conference

We made a test set for Japanese-to-English discourse translation to evaluate the power of context-aware machine translation. For each discourse phenomenon, we systematically collected examples where the translation of the second sentence depends on the first sentence. Compared with a previous study on test sets for English-to-French discourse translation (CITATION), we needed different approaches to make the data because Japanese has zero pronouns and represents different senses in different characters. We improved the translation accuracy using context-aware neural machine translation, and the improvement mainly reflects better translation of zero pronouns.

2019

NTT’s Machine Translation Systems for WMT19 Robustness Task
Soichiro Murakami | Makoto Morishita | Tsutomu Hirao | Masaaki Nagata
Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1)

This paper describes NTT’s submission to the WMT19 robustness task. This task mainly focuses on translating noisy text (e.g., posts on Twitter), which presents different difficulties from typical translation tasks such as news. Our submission combined techniques including utilization of a synthetic corpus, domain adaptation, and a placeholder mechanism, which significantly improved over the previous baseline. Experimental results revealed that the placeholder mechanism, which temporarily replaces non-standard tokens such as emojis and emoticons with special placeholder tokens during translation, improves translation accuracy even with noisy texts.
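
The placeholder idea described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the regular expression, the `<PH0>`-style token format, and the function names `mask`, `unmask`, and `translate_with_placeholders` are all assumptions made for illustration, and `translate` stands in for any translation function that passes placeholder tokens through unchanged.

```python
import re

# Assumed pattern for "non-standard tokens": a few emoji blocks plus
# simple ASCII emoticons. A real system would need a much broader pattern.
NONSTANDARD = re.compile(
    "[\U0001F300-\U0001FAFF\u2600-\u27BF]"  # emoji and pictograph blocks
    r"|[:;]-?[()DP]"                         # emoticons such as :) ;-) :D
)


def mask(text):
    """Replace each non-standard token with an indexed placeholder token."""
    saved = []

    def _sub(match):
        saved.append(match.group(0))
        return f"<PH{len(saved) - 1}>"

    return NONSTANDARD.sub(_sub, text), saved


def unmask(text, saved):
    """Restore the original tokens in place of their placeholders."""
    for i, token in enumerate(saved):
        text = text.replace(f"<PH{i}>", token)
    return text


def translate_with_placeholders(text, translate):
    """Mask, translate, then restore. `translate` is a hypothetical callable
    that is assumed to copy placeholder tokens to its output unchanged."""
    masked, saved = mask(text)
    return unmask(translate(masked), saved)
```

For example, `mask("great game 😀 :)")` yields the masked string `great game <PH0> <PH1>` together with the list of saved tokens, which `unmask` uses to restore the originals after translation.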

2018

Improving Neural Machine Translation by Incorporating Hierarchical Subword Features
Makoto Morishita | Jun Suzuki | Masaaki Nagata
Proceedings of the 27th International Conference on Computational Linguistics

This paper focuses on subword-based Neural Machine Translation (NMT). We hypothesize that in the NMT model, the appropriate subword units for the following three modules (layers) can differ: (1) the encoder embedding layer, (2) the decoder embedding layer, and (3) the decoder output layer. We find that the subword segmentation of Sennrich et al. (2016) has the property that a large vocabulary is a superset of a small vocabulary, and we modify the NMT model to enable the incorporation of several different subword units in a single embedding layer. We refer to these small subword features as hierarchical subword features. To empirically investigate our assumption, we compare the performance of several different subword units and hierarchical subword features for both the encoder and decoder embedding layers. We confirmed that incorporating hierarchical subword features in the encoder consistently improves BLEU scores on the IWSLT evaluation datasets.
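
The superset property mentioned in this abstract can be seen in a toy byte-pair-encoding learner. The sketch below is a simplified illustration in the style of Sennrich et al. (2016), not the authors' code: the function names, the tie-breaking rule, and the definition of "vocabulary" as the base characters plus every merged symbol are assumptions. Because merges are learned one after another, the first k merges of a longer merge list give exactly the smaller vocabulary.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Merge every adjacent occurrence of `pair` in a tuple of symbols."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words."""
    corpus = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties broken lexicographically (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[apply_merge(symbols, best)] += freq
        corpus = merged
    return merges


def vocabulary(words, merges):
    """Base characters plus every symbol produced by the given merges."""
    chars = {c for w in words for c in w} | {"</w>"}
    return chars | {a + b for a, b in merges}


def segment(word, merges):
    """Segment a word by applying the merges in the order they were learned."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


words = ["low", "lower", "lowest", "newer", "wider"]
merges = learn_bpe(words, 10)
small_vocab = vocabulary(words, merges[:3])  # fewer merges, finer units
large_vocab = vocabulary(words, merges)      # more merges, coarser units
print(small_vocab <= large_vocab)
```

Segmenting the same word with `merges[:3]` and with the full `merges` list gives a fine and a coarse view of it, which is the kind of multi-granularity information the abstract refers to as hierarchical subword features.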