Sadao Kurohashi


2021

Contextualized and Generalized Sentence Representations by Contrastive Self-Supervised Learning: A Case Study on Discourse Relation Analysis
Hirokazu Kiyomaru | Sadao Kurohashi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose a method to learn contextualized and generalized sentence representations using contrastive self-supervised learning. In the proposed method, a model is given a text consisting of multiple sentences. One sentence is randomly selected as a target sentence. The model is trained to maximize the similarity between the representation of the target sentence with its context and that of the masked target sentence with the same context. Simultaneously, the model minimizes the similarity between the latter representation and the representation of a random sentence with the same context. We apply our method to discourse relation analysis in English and Japanese and show that it outperforms strong baseline methods based on BERT, XLNet, and RoBERTa.
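
The training signal described in this abstract can be sketched as a two-way contrastive loss. This is a minimal illustration, not the authors' implementation: the softmax cross-entropy form, the cosine similarity, the `temperature` parameter, and all function names are assumptions, since the abstract only states which similarity is maximized and which is minimized.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def contrastive_loss(target_repr, masked_repr, random_repr, temperature=1.0):
    """Loss for one training example (assumed softmax cross-entropy form).

    target_repr: representation of the target sentence in its context
    masked_repr: representation of the masked target in the same context
    random_repr: representation of a random sentence in the same context
    """
    pos = cosine(masked_repr, target_repr) / temperature  # to be maximized
    neg = cosine(masked_repr, random_repr) / temperature  # to be minimized
    # Negative log-softmax of the positive score, computed stably.
    m = max(pos, neg)
    log_denominator = m + np.log(np.exp(pos - m) + np.exp(neg - m))
    return float(log_denominator - pos)
```

The loss shrinks as the masked representation moves toward the true target sentence and away from the random one.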

Proceedings of the 8th Workshop on Asian Translation (WAT2021)
Toshiaki Nakazawa | Hideki Nakayama | Isao Goto | Hideya Mino | Chenchen Ding | Raj Dabre | Anoop Kunchukuttan | Shohei Higashiyama | Hiroshi Manabe | Win Pa Pa | Shantipriya Parida | Ondřej Bojar | Chenhui Chu | Akiko Eriguchi | Kaori Abe | Yusuke Oda | Katsuhito Sudoh | Sadao Kurohashi | Pushpak Bhattacharyya
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

2020

BERT-based Cohesion Analysis of Japanese Texts
Nobuhiro Ueda | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 28th International Conference on Computational Linguistics

The meaning of natural language text is supported by cohesion among various kinds of entities, including coreference relations, predicate-argument structures, and bridging anaphora relations. However, predicate-argument structures for nominal predicates and bridging anaphora relations have not been studied well, and their analysis remains very difficult. Recent advances in neural networks, in particular self training-based language models including BERT (Devlin et al., 2019), have significantly improved many natural language processing tasks, making it possible to study the analysis of cohesion across a whole text. In this study, we tackle an integrated analysis of cohesion in Japanese texts. Our results significantly outperformed existing studies in each task, with improvements of about 10 to 20 points for both zero anaphora and coreference resolution. Furthermore, we also showed that coreference resolution is different in nature from the other tasks and should be treated specially.

Proceedings of the 7th Workshop on Asian Translation
Toshiaki Nakazawa | Hideki Nakayama | Chenchen Ding | Raj Dabre | Anoop Kunchukuttan | Win Pa Pa | Ondřej Bojar | Shantipriya Parida | Isao Goto | Hideya Mino | Hiroshi Manabe | Katsuhito Sudoh | Sadao Kurohashi | Pushpak Bhattacharyya
Proceedings of the 7th Workshop on Asian Translation

Meta Ensemble for Japanese-Chinese Neural Machine Translation: Kyoto-U+ECNU Participation to WAT 2020
Zhuoyuan Mao | Yibin Shen | Chenhui Chu | Sadao Kurohashi | Cheqing Jin
Proceedings of the 7th Workshop on Asian Translation

This paper describes the Japanese-Chinese Neural Machine Translation (NMT) system submitted by the joint team of Kyoto University and East China Normal University (Kyoto-U+ECNU) to WAT 2020 (Nakazawa et al., 2020). We participate in the ASPEC Japanese-Chinese translation task. We revisit several techniques for NMT, including various architectures, different data selection and augmentation methods, denoising pre-training, and some specific tricks for Japanese-Chinese translation. We eventually perform a meta ensemble to combine all of the models into a single model. The BLEU results of this meta-ensembled model rank first in both directions of ASPEC Japanese-Chinese translation.

Development of a Japanese Personality Dictionary based on Psychological Methods
Ritsuko Iwai | Daisuke Kawahara | Takatsune Kumada | Sadao Kurohashi
Proceedings of the 12th Language Resources and Evaluation Conference

We propose a new approach to constructing a personality dictionary with psychological evidence. In this study, we collect personality words using word embeddings and construct a personality dictionary with weights for the Big Five traits. The weights are calculated based on the responses of a large sample (N = 1,938; female = 1,004; mean age M = 49.8 years, range 20-78, SD = 16.3). All the respondents answered a 20-item personality questionnaire and 537 personality items derived from word embeddings. We present the procedures to examine the quality of the responses with psychological methods and to calculate the weights. These result in a personality dictionary with two sub-dictionaries. We also discuss an application of the acquired resources.

Coursera Corpus Mining and Multistage Fine-Tuning for Improving Lectures Translation
Haiyue Song | Raj Dabre | Atsushi Fujita | Sadao Kurohashi
Proceedings of the 12th Language Resources and Evaluation Conference

Lectures translation is a case of spoken language translation, and there is a lack of publicly available parallel corpora for this purpose. To address this, we examine a framework for parallel corpus mining, which is a quick and effective way to mine a parallel corpus from publicly available lectures at Coursera. Our approach determines sentence alignments, relying on machine translation and cosine similarity over continuous-space sentence representations. We also show how to use the resulting corpora in a multistage fine-tuning based domain adaptation for high-quality lectures translation. For Japanese-English lectures translation, we extracted parallel data of approximately 40,000 lines and created development and test sets through manual filtering for benchmarking translation performance. We demonstrate that the mined corpus greatly enhances the quality of translation when used in conjunction with out-of-domain parallel corpora via multistage training. This paper also suggests some guidelines to gather and clean corpora, mine parallel sentences, address noise in the mined data, and create high-quality evaluation splits. For the sake of reproducibility, we have released our code for parallel data creation.
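
The alignment step in this abstract (cosine similarity over sentence representations) can be sketched roughly as follows. This is an illustration under assumptions, not the released code: the function name, the greedy best-match strategy, and the `threshold` value are hypothetical, and the source-side vectors are assumed to be embeddings of machine-translated source sentences so that both sides live in the same space.

```python
import numpy as np

def align_by_cosine(src_vecs, tgt_vecs, threshold=0.8):
    """Pair each source sentence with its most similar target sentence.

    src_vecs, tgt_vecs: 2-D float arrays, one sentence embedding per row.
    Returns (source index, target index, similarity) triples whose
    similarity reaches the (assumed) threshold.
    """
    # Normalize rows so that a dot product equals cosine similarity.
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    similarity = src @ tgt.T
    pairs = []
    for i, row in enumerate(similarity):
        j = int(np.argmax(row))
        if row[j] >= threshold:
            pairs.append((i, j, float(row[j])))
    return pairs
```

Low-similarity candidates are simply dropped here; a real pipeline would likely add further noise filtering, as the abstract's guidelines suggest.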

2019

Overview of the 6th Workshop on Asian Translation
Toshiaki Nakazawa | Nobushige Doi | Shohei Higashiyama | Chenchen Ding | Raj Dabre | Hideya Mino | Isao Goto | Win Pa Pa | Anoop Kunchukuttan | Yusuke Oda | Shantipriya Parida | Ondřej Bojar | Sadao Kurohashi
Proceedings of the 6th Workshop on Asian Translation

This paper presents the results of the shared tasks from the 6th Workshop on Asian Translation (WAT2019), including Ja-En, Ja-Zh scientific paper translation subtasks, Ja-En, Ja-Ko, Ja-En patent translation subtasks, Hi-En, My-En, Km-En, Ta-En mixed domain subtasks, and the Ru-Ja news commentary translation task. For WAT2019, 25 teams participated in the shared tasks. We also received 10 research paper submissions, out of which 6 were accepted. About 400 translation results were submitted to the automatic evaluation server, and selected submissions were manually evaluated.

Machine Comprehension Improves Domain-Specific Japanese Predicate-Argument Structure Analysis
Norio Takahashi | Tomohide Shibata | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2nd Workshop on Machine Reading for Question Answering

To improve the accuracy of predicate-argument structure (PAS) analysis, large-scale training data and knowledge for PAS analysis are indispensable. We focus on a specific domain, Japanese blogs on driving, and use crowdsourcing to construct two wide-coverage datasets in the form of QA: a PAS-QA dataset and a reading comprehension QA (RC-QA) dataset. We train a machine comprehension (MC) model based on these datasets to perform PAS analysis. Our experiments show that a stepwise training method is the most effective: it pre-trains an MC model on the RC-QA dataset to acquire domain knowledge and then fine-tunes it on the PAS-QA dataset.

Diversity-aware Event Prediction based on a Conditional Variational Autoencoder with Reconstruction
Hirokazu Kiyomaru | Kazumasa Omura | Yugo Murawaki | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the First Workshop on Commonsense Inference in Natural Language Processing

Typical event sequences are an important class of commonsense knowledge. Formalizing the task as the generation of a next event conditioned on a current event, previous work in event prediction employs sequence-to-sequence (seq2seq) models. However, what can happen after a given event is usually diverse, a fact that can hardly be captured by deterministic models. In this paper, we propose to incorporate a conditional variational autoencoder (CVAE) into seq2seq for its ability to represent diverse next events as a probabilistic distribution. We further extend the CVAE-based seq2seq with a reconstruction mechanism to prevent the model from concentrating on highly typical events. To facilitate fair and systematic evaluation of the diversity-aware models, we also extend existing evaluation datasets by tying each current event to multiple next events. Experiments show that the CVAE-based models drastically outperform deterministic models in terms of precision and that the reconstruction mechanism improves the recall of CVAE-based models without sacrificing precision.

Shrinking Japanese Morphological Analyzers With Neural Networks and Semi-supervised Learning
Arseny Tolmachev | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

For languages without natural word boundaries, like Japanese and Chinese, word segmentation is a prerequisite for downstream analysis. For Japanese, segmentation is often done jointly with part of speech tagging, and this process is usually referred to as morphological analysis. Morphological analyzers are trained on data hand-annotated with segmentation boundaries and part of speech tags. A segmentation dictionary or character n-gram information is also provided as additional input to the model. Incorporating this extra information makes models large. Modern neural morphological analyzers can consume gigabytes of memory. We propose a compact alternative to these cumbersome approaches that does not rely on any externally provided n-gram or word representations. The model uses only unigram character embeddings, encodes them using either a stacked bi-LSTM or a self-attention network, and independently infers both segmentation and part of speech information. The model is trained in an end-to-end and semi-supervised fashion, on labels produced by a state-of-the-art analyzer. We demonstrate that the proposed technique rivals the performance of a previous dictionary-based state-of-the-art approach and can even surpass it when trained on a combination of human-annotated and automatically annotated data. Our model itself is significantly smaller than the dictionary-based one: it uses less than 15 megabytes of space.

2018

Cross-lingual Knowledge Projection Using Machine Translation and Target-side Knowledge Base Completion
Naoki Otani | Hirokazu Kiyomaru | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 27th International Conference on Computational Linguistics

Considerable effort has been devoted to building commonsense knowledge bases (KBs). However, they are not available in many languages because the construction of KBs is expensive. To bridge the gap between languages, this paper addresses the problem of projecting the knowledge in English, a resource-rich language, into other languages, where the main challenge lies in projection ambiguity. This ambiguity is partially solved by machine translation and target-side knowledge base completion, but neither of them is adequately reliable by itself. We show that their combination can project English commonsense knowledge into Japanese and Chinese with high precision. Our method also achieves a top-10 accuracy of 90% on the crowdsourced English-Japanese benchmark. Furthermore, we use our method to obtain 18,747 facts of accurate Japanese commonsense within a very short period.

Juman++: A Morphological Analysis Toolkit for Scriptio Continua
Arseny Tolmachev | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present a three-part toolkit for developing morphological analyzers for languages without natural word boundaries. The first part is a C++11/14 lattice-based morphological analysis library that uses a combination of linear and recurrent neural net language models for analysis. The other parts are a tool for exposing problems in the trained model and a partial annotation tool. Our morphological analyzer of Japanese achieves a new state of the art (SOTA) on Jumandic-based corpora while being 250 times faster than the previous one. We also report a small experiment, a quantitative analysis, and our experience of using the development tools. All components of the toolkit are open source and available under the permissive Apache 2 License.

Knowledge-Enriched Two-Layered Attention Network for Sentiment Analysis
Abhishek Kumar | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We propose a novel two-layered attention network based on Bidirectional Long Short-Term Memory for sentiment analysis. The network takes advantage of external knowledge bases to improve sentiment prediction, using knowledge graph embeddings generated from WordNet. We build our model by combining the two-layered attention network with a supervised model based on Support Vector Regression, using a Multilayer Perceptron network for sentiment analysis. We evaluate our model on the benchmark dataset of SemEval 2017 Task 5. Experimental results show that the proposed model surpasses the top system of SemEval 2017 Task 5, improving on the state-of-the-art system by 1.7 and 3.7 points for sub-tracks 1 and 2, respectively.

Neural Adversarial Training for Semi-supervised Japanese Predicate-argument Structure Analysis
Shuhei Kurita | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Japanese predicate-argument structure (PAS) analysis involves zero anaphora resolution, which is notoriously difficult. To improve the performance of Japanese PAS analysis, it is straightforward to increase the size of corpora annotated with PAS. However, since it is prohibitively expensive, it is promising to take advantage of a large amount of raw corpora. In this paper, we propose a novel Japanese PAS analysis model based on semi-supervised adversarial training with a raw corpus. In our experiments, our model outperforms existing state-of-the-art models for Japanese PAS analysis.

Entity-Centric Joint Modeling of Japanese Coreference Resolution and Predicate Argument Structure Analysis
Tomohide Shibata | Sadao Kurohashi
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Predicate argument structure analysis is the task of identifying structured events. To improve this field, we need to identify salient entities, which cannot be identified without performing coreference resolution and predicate argument structure analysis simultaneously. This paper presents an entity-centric joint model for Japanese coreference resolution and predicate argument structure analysis. Each entity is assigned an embedding, and when the result of either analysis refers to an entity, the entity embedding is updated. The analyses take the entity embedding into consideration to access the global information of entities. Our experimental results demonstrate that the proposed method can drastically improve the performance of inter-sentential zero anaphora resolution, which is a notoriously difficult task in predicate argument structure analysis.

2017

Proceedings of the IJCNLP 2017, Tutorial Abstracts
Sadao Kurohashi | Michael Strube
Proceedings of the IJCNLP 2017, Tutorial Abstracts

An Empirical Comparison of Domain Adaptation Methods for Neural Machine Translation
Chenhui Chu | Raj Dabre | Sadao Kurohashi
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

In this paper, we propose a novel domain adaptation method named mixed fine tuning for neural machine translation (NMT). We combine two existing approaches namely fine tuning and multi domain NMT. We first train an NMT model on an out-of-domain parallel corpus, and then fine tune it on a parallel corpus which is a mix of the in-domain and out-of-domain corpora. All corpora are augmented with artificial tags to indicate specific domains. We empirically compare our proposed method against fine tuning and multi domain methods and discuss its benefits and shortcomings.
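
The data preparation for the second (mixed fine tuning) stage described in this abstract could look roughly like the sketch below. The tag strings, the function names, placing the tag on the source side, and the oversampling of the in-domain data are all assumptions made for illustration; the abstract only states that corpora are tagged by domain and that the fine-tuning corpus mixes in-domain and out-of-domain data.

```python
import random

def tag_corpus(pairs, domain_tag):
    """Prepend an artificial domain tag to each source sentence."""
    return [(f"{domain_tag} {src}", tgt) for src, tgt in pairs]

def build_mixed_corpus(in_domain, out_domain,
                       in_tag="<2in>", out_tag="<2out>",
                       oversample=True, seed=0):
    """Build the mixed corpus used for the fine-tuning stage.

    in_domain, out_domain: lists of (source, target) sentence pairs.
    oversample: repeat the smaller in-domain corpus so both domains are
    roughly balanced (an assumption, not stated in the abstract).
    """
    tagged_in = tag_corpus(in_domain, in_tag)
    tagged_out = tag_corpus(out_domain, out_tag)
    if oversample and tagged_in:
        repeats = max(1, len(tagged_out) // len(tagged_in))
        tagged_in = tagged_in * repeats
    mixed = tagged_in + tagged_out
    random.Random(seed).shuffle(mixed)  # deterministic shuffle for the sketch
    return mixed
```

A model would first be trained on the tagged out-of-domain corpus alone and then fine-tuned on the output of `build_mixed_corpus`.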

Automatic Extraction of High-Quality Example Sentences for Word Learning Using a Determinantal Point Process
Arseny Tolmachev | Sadao Kurohashi
Proceedings of the 12th Workshop on Innovative Use of NLP for Building Educational Applications

Flashcard systems are effective tools for learning words but have their limitations in teaching word usage. To overcome this problem, we propose a novel flashcard system that shows a new example sentence on each repetition. This extension requires high-quality example sentences, automatically extracted from a huge corpus. To do this, we use a Determinantal Point Process, which scales well to large data and allows sentence similarity and quality to be represented naturally as features. Our human evaluation experiment on Japanese indicates that the proposed method successfully extracted high-quality example sentences.
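
To illustrate how a Determinantal Point Process can trade off quality against similarity, the sketch below builds a kernel from per-sentence quality scores and feature vectors and then selects a diverse subset greedily. The quality-times-similarity kernel decomposition is a standard DPP construction, but the greedy selection, the function names, and the feature representation are assumptions here and are not taken from the paper; this naive loop also does not reflect the scalability the abstract mentions.

```python
import numpy as np

def build_kernel(quality, features):
    """L[i, j] = quality[i] * cosine_similarity(i, j) * quality[j]."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = feats @ feats.T
    return quality[:, None] * similarity * quality[None, :]

def greedy_dpp_select(kernel, k):
    """Greedily pick up to k items, maximizing the log-determinant
    of the kernel submatrix over the selected items."""
    selected = []
    for _ in range(min(k, kernel.shape[0])):
        best_item, best_logdet = None, -np.inf
        for i in range(kernel.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            sign, logdet = np.linalg.slogdet(kernel[np.ix_(idx, idx)])
            if sign > 0 and logdet > best_logdet:
                best_item, best_logdet = i, logdet
        if best_item is None:  # remaining items are redundant (singular)
            break
        selected.append(best_item)
    return selected
```

With two near-duplicate sentences and one dissimilar sentence, the selection keeps the highest-quality sentence and then prefers the dissimilar one over the near-duplicate, even if the duplicate has a higher quality score.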

Kyoto University Participation to WAT 2017
Fabien Cromieres | Raj Dabre | Toshiaki Nakazawa | Sadao Kurohashi
Proceedings of the 4th Workshop on Asian Translation (WAT2017)

We describe here our approaches and results on the WAT 2017 shared translation tasks. Following our good results with Neural Machine Translation in the previous shared task, we continue this approach this year, with incremental improvements in models and training methods. We focused on the ASPEC dataset and could improve the state-of-the-art results for Chinese-to-Japanese and Japanese-to-Chinese translations.

Automatically Acquired Lexical Knowledge Improves Japanese Joint Morphological and Dependency Analysis
Daisuke Kawahara | Yuta Hayashibe | Hajime Morita | Sadao Kurohashi
Proceedings of the 15th International Conference on Parsing Technologies

This paper presents a joint model for morphological and dependency analysis based on automatically acquired lexical knowledge. This model takes advantage of rich lexical knowledge to simultaneously resolve word segmentation, POS, and dependency ambiguities. In our experiments on Japanese, we show the effectiveness of our joint model over conventional pipeline models.

Kyoto University MT System Description for IWSLT 2017
Raj Dabre | Fabien Cromieres | Sadao Kurohashi
Proceedings of the 14th International Conference on Spoken Language Translation

We describe here our Machine Translation (MT) model and the results we obtained for the IWSLT 2017 Multilingual Shared Task. Motivated by Zero Shot NMT [1], we trained a Multilingual Neural Machine Translation model by combining all the training data into a single collection, appending tokens to the source sentences to indicate the target language they should be translated into. We observed that even in a low-resource situation we were able to get translations whose quality surpasses that of translations obtained by Phrase Based Statistical Machine Translation by several BLEU points. The most surprising result we obtained was in the zero-shot setting for Dutch-German and Italian-Romanian, where we observed that, despite using no parallel corpora between these language pairs, the NMT model was able to translate between these languages, and the translations were either as good as or better (in terms of BLEU) than those in the non-zero-resource setting. We also verify that NMT models that use feed-forward layers and self-attention instead of recurrent layers are extremely fast to train, which is useful in an NMT experimental setting.

Improving Chinese Semantic Role Labeling using High-quality Surface and Deep Case Frames
Gongye Jin | Daisuke Kawahara | Sadao Kurohashi
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

This paper presents a method for applying automatically acquired knowledge to semantic role labeling (SRL). We use a large amount of automatically extracted knowledge to improve the performance of SRL. We present two varieties of knowledge, which we call surface case frames and deep case frames. Although the surface case frames are compiled from syntactic parses and can be used as rich syntactic knowledge, they have limited capability for resolving semantic ambiguity. To compensate for this deficiency of the surface case frames, we compile deep case frames from automatic semantic roles. We also consider quality management for both types of knowledge in order to get rid of the noise introduced by the automatic analyses. The experimental results show that Chinese SRL can be improved using automatically acquired knowledge and that quality management has a positive effect on this task.