Workshop on Technologies for MT of Low Resource Languages (2020)


bib (full) Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages

pdf bib
Proceedings of the 3rd Workshop on Technologies for MT of Low Resource Languages
Alina Karakanta | Atul Kr. Ojha | Chao-Hong Liu | Jade Abbott | John Ortega | Jonathan Washington | Nathaniel Oco | Surafel Melaku Lakew | Tommi A Pirinen | Valentin Malykh | Varvara Logacheva | Xiaobing Zhao

pdf bib
Bridging Philippine Languages With Multilingual Neural Machine TranslationPhilippine Languages With Multilingual Neural Machine Translation
Renz Iver Baliber | Charibeth Cheng | Kristine Mae Adlaon | Virgion Mamonong

The Philippines is home to more than 150 languages that is considered to be low-resourced even on its major languages. This results into a lack of pursuit in developing a translation system for the underrepresented languages. To simplify the process of developing translation system for multiple languages, and to aid in improving the translation quality of zero to low-resource languages, multilingual NMT became an active area of research. However, existing works in multilingual NMT disregards the analysis of a multilingual model on a closely related and low-resource language group in the context of pivot-based translation and zero-shot translation. In this paper, we benchmarked translation for several Philippine Languages, provided an analysis of a multilingual NMT system for morphologically rich and low-resource languages in terms of its effectiveness in translating zero-resource languages with zero-shot translations. To further evaluate the capability of the multilingual NMT model in translating unseen language pairs in training, we tested the model to translate between Tagalog and Cebuano and compared its performance with a simple NMT model that is directly trained on a parallel Tagalog and Cebuano data in which we showed that zero-shot translation outperforms a directly trained model in some instances, while utilizing English as a pivot language in translating outperform both approaches.

pdf bib
Findings of the LoResMT 2020 Shared Task on Zero-Shot for Low-Resource languagesLoResMT 2020 Shared Task on Zero-Shot for Low-Resource languages
Atul Kr. Ojha | Valentin Malykh | Alina Karakanta | Chao-Hong Liu

This paper presents the findings of the LoResMT 2020 Shared Task on zero-shot translation for low resource languages. This task was organised as part of the 3rd Workshop on Technologies for MT of Low Resource Languages (LoResMT) at AACL-IJCNLP 2020. The focus was on the zero-shot approach as a notable development in Neural Machine Translation to build MT systems for language pairs where parallel corpora are small or even non-existent. The shared task experience suggests that back-translation and domain adaptation methods result in better accuracy for small-size datasets. We further noted that, although translation between similar languages is no cakewalk, linguistically distinct languages require more data to give better results.

pdf bib
Zero-Shot Neural Machine Translation : Russian-Hindi @LoResMT 2020Russian-Hindi @LoResMT 2020
Sahinur Rahman Laskar | Abdullah Faiz Ur Rahman Khilji | Partha Pakray | Sivaji Bandyopadhyay

Neural machine translation (NMT) is a widely accepted approach in the machine translation (MT) community, translating from one natural language to another natural language. Although, NMT shows remarkable performance in both high and low resource languages, it needs sufficient training corpus. The availability of a parallel corpus in low resource language pairs is one of the challenging tasks in MT. To mitigate this issue, NMT attempts to utilize a monolingual corpus to get better at translation for low resource language pairs. Workshop on Technologies for MT of Low Resource Languages (LoResMT 2020) organized shared tasks of low resource language pair translation using zero-shot NMT. Here, the parallel corpus is not used and only monolingual corpora is allowed. We have participated in the same shared task with our team name CNLP-NITS for the Russian-Hindi language pair. We have used masked sequence to sequence pre-training for language generation (MASS) with only monolingual corpus following the unsupervised NMT architecture. The evaluated results are declared at the LoResMT 2020 shared task, which reports that our system achieves the bilingual evaluation understudy (BLEU) score of 0.59, precision score of 3.43, recall score of 5.48, F-measure score of 4.22, and rank-based intuitive bilingual evaluation score (RIBES) of 0.180147 in Russian to Hindi translation. And for Hindi to Russian translation, we have achieved BLEU, precision, recall, F-measure, and RIBES score of 1.11, 4.72, 4.41, 4.56, and 0.026842 respectively.

pdf bib
Unsupervised Approach for Zero-Shot Experiments : BhojpuriHindi and MagahiHindi@LoResMT 2020Bhojpuri–Hindi and Magahi–Hindi@LoResMT 2020
Amit Kumar | Rajesh Kumar Mundotiya | Anil Kumar Singh

This paper reports a Machine Translation (MT) system submitted by the NLPRL team for the BhojpuriHindi and MagahiHindi language pairs at LoResMT 2020 shared task. We used an unsupervised domain adaptation approach that gives promising results for zero or extremely low resource languages. Task organizers provide the development and the test sets for evaluation and the monolingual data for training. Our approach is a hybrid approach of domain adaptation and back-translation. Metrics used to evaluate the trained model are BLEU, RIBES, Precision, Recall and F-measure. Our approach gives relatively promising results, with a wide range, of 19.5, 13.71, 2.54, and 3.16 BLEU points for Bhojpuri to Hindi, Magahi to Hindi, Hindi to Bhojpuri and Hindi to Magahi language pairs, respectively.

pdf bib
Towards Machine Translation for the Kurdish LanguageKurdish Language
Sina Ahmadi | Maraim Masoud

Machine translation is the task of translating texts from one language to another using computers. It has been one of the major tasks in natural language processing and computational linguistics and has been motivating to facilitate human communication. Kurdish, an Indo-European language, has received little attention in this realm due to the language being less-resourced. Therefore, in this paper, we are addressing the main issues in creating a machine translation system for the Kurdish language, with a focus on the Sorani dialect. We describe the available scarce parallel data suitable for training a neural machine translation model for Sorani Kurdish-English translation. We also discuss some of the major challenges in Kurdish language translation and demonstrate how fundamental text processing tasks, such as tokenization, can improve translation performance.

pdf bib
Investigating Low-resource Machine Translation for English-to-TamilEnglish-to-Tamil
Akshai Ramesh | Venkatesh Balavadhani parthasa | Rejwanul Haque | Andy Way

Statistical machine translation (SMT) which was the dominant paradigm in machine translation (MT) research for nearly three decades has recently been superseded by the end-to-end deep learning approaches to MT. Although deep neural models produce state-of-the-art results in many translation tasks, they are found to under-perform on resource-poor scenarios. Despite some success, none of the present-day benchmarks that have tried to overcome this problem can be regarded as a universal solution to the problem of translation of many low-resource languages. In this work, we investigate the performance of phrase-based SMT (PB-SMT) and neural MT (NMT) on a rarely-tested low-resource language-pair, English-to-Tamil, taking a specialised data domain (software localisation) into consideration. In particular, we produce rankings of our MT systems via a social media platform-based human evaluation scheme, and demonstrate our findings in the low-resource domain-specific text translation task.