Dat Quoc Nguyen


2021

pdf bib
PhoMT : A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine TranslationPhoMT: A High-Quality and Large-Scale Benchmark Dataset for Vietnamese-English Machine Translation
Long Doan | Linh The Nguyen | Nguyen Luong Tran | Thai Hoang | Dat Quoc Nguyen
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

We introduce a high-quality and large-scale Vietnamese-English parallel dataset of 3.02 M sentence pairs, which is 2.9 M pairs larger than the benchmark Vietnamese-English machine translation corpus IWSLT15. We conduct experiments comparing strong neural baselines and well-known automatic translation engines on our dataset and find that in both automatic and human evaluations : the best performance is obtained by fine-tuning the pre-trained sequence-to-sequence denoising auto-encoder mBART. To our best knowledge, this is the first large-scale Vietnamese-English machine translation study. We hope our publicly available dataset and study can serve as a starting point for future research and applications on Vietnamese-English machine translation. We release our dataset at : https://github.com/VinAIResearch/PhoMT

pdf bib
PhoNLP : A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsingPhoNLP: A joint multi-task learning model for Vietnamese part-of-speech tagging, named entity recognition and dependency parsing
Linh The Nguyen | Dat Quoc Nguyen
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Demonstrations

We present the first multi-task learning model named PhoNLP for joint Vietnamese part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT (Nguyen and Nguyen, 2020) for each task independently. We publicly release PhoNLP as an open-source toolkit under the Apache License 2.0. Although we specify PhoNLP for Vietnamese, our PhoNLP training and evaluation command scripts in fact can directly work for other languages that have a pre-trained BERT-based language model and gold annotated corpora available for the three tasks of POS tagging, NER and dependency parsing. We hope that PhoNLP can serve as a strong baseline and useful toolkit for future NLP research and applications to not only Vietnamese but also the other languages. Our PhoNLP is available at https://github.com/VinAIResearch/PhoNLP

2020

pdf bib
A survey of embedding models of entities and relationships for knowledge graph completion
Dat Quoc Nguyen
Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs)

Knowledge graphs (KGs) of real-world facts about entities and their relationships are useful resources for a variety of natural language processing tasks. However, because knowledge graphs are typically incomplete, it is useful to perform knowledge graph completion or link prediction, i.e. predict whether a relationship not in the knowledge graph is likely to be true. This paper serves as a comprehensive survey of embedding models of entities and relationships for knowledge graph completion, summarizing up-to-date experimental results on standard benchmark datasets and pointing out potential future research directions.

pdf bib
BERTweet : A pre-trained language model for English TweetsBERTweet: A pre-trained language model for English Tweets
Dat Quoc Nguyen | Thanh Vu | Anh Tuan Nguyen
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

We present BERTweet, the first public large-scale pre-trained language model for English Tweets. Our BERTweet, having the same architecture as BERT-base (Devlin et al., 2019), is trained using the RoBERTa pre-training procedure (Liu et al., 2019). Experiments show that BERTweet outperforms strong baselines RoBERTa-base and XLM-R-base (Conneau et al., 2020), producing better performance results than the previous state-of-the-art models on three Tweet NLP tasks : Part-of-speech tagging, Named-entity recognition and text classification. We release BERTweet under the MIT License to facilitate future research and applications on Tweet data. Our BERTweet is available at https://github.com/VinAIResearch/BERTweet

pdf bib
WNUT-2020 Task 2 : Identification of Informative COVID-19 English TweetsWNUT-2020 Task 2: Identification of Informative COVID-19 English Tweets
Dat Quoc Nguyen | Thanh Vu | Afshin Rahimi | Mai Hoang Dao | Linh The Nguyen | Long Doan
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

In this paper, we provide an overview of the WNUT-2020 shared task on the identification of informative COVID-19 English Tweets. We describe how we construct a corpus of 10 K Tweets and organize the development and evaluation phases for this task. In addition, we also present a brief summary of results obtained from the final system evaluation submissions of 55 teams, finding that (i) many systems obtain very high performance, up to 0.91 F1 score, (ii) the majority of the submissions achieve substantially higher results than the baseline fastText (Joulin et al., 2017), and (iii) fine-tuning pre-trained language models on relevant language data followed by supervised training performs well in this task.

2019

pdf bib
Detecting Chemical Reactions in Patents
Hiyori Yoshikawa | Dat Quoc Nguyen | Zenan Zhai | Christian Druckenbrodt | Camilo Thorne | Saber A. Akhondi | Timothy Baldwin | Karin Verspoor
Proceedings of the The 17th Annual Workshop of the Australasian Language Technology Association

Extracting chemical reactions from patents is a crucial task for chemists working on chemical exploration. In this paper we introduce the novel task of detecting the textual spans that describe or refer to chemical reactions within patents. We formulate this task as a paragraph-level sequence tagging problem, where the system is required to return a sequence of paragraphs which contain a description of a reaction. To address this new task, we construct an annotated dataset from an existing proprietary database of chemical reactions manually extracted from patents. We introduce several baseline methods for the task and evaluate them over our dataset. Through error analysis, we discuss what makes the task complex and challenging, and suggest possible directions for future research.

2018

pdf bib
NIHRIO at SemEval-2018 Task 3 : A Simple and Accurate Neural Network Model for Irony Detection in TwitterNIHRIO at SemEval-2018 Task 3: A Simple and Accurate Neural Network Model for Irony Detection in Twitter
Thanh Vu | Dat Quoc Nguyen | Xuan-Son Vu | Dai Quoc Nguyen | Michael Catt | Michael Trenell
Proceedings of The 12th International Workshop on Semantic Evaluation

This paper describes our NIHRIO system for SemEval-2018 Task 3 Irony detection in English tweets. We propose to use a simple neural network architecture of Multilayer Perceptron with various types of input features including : lexical, syntactic, semantic and polarity features. Our system achieves very high performance in both subtasks of binary and multi-class irony detection in tweets. In particular, we rank at least fourth using the accuracy metric and sixth using the F1 metric. Our code is available at :https://github.com/NIHRIO/IronyDetectionInTwitter\n

pdf bib
Comparing CNN and LSTM character-level embeddings in BiLSTM-CRF models for chemical and disease named entity recognitionCNN and LSTM character-level embeddings in BiLSTM-CRF models for chemical and disease named entity recognition
Zenan Zhai | Dat Quoc Nguyen | Karin Verspoor
Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis

We compare the use of LSTM-based and CNN-based character-level word embeddings in BiLSTM-CRF models to approach chemical and disease named entity recognition (NER) tasks. Empirical results over the BioCreative V CDR corpus show that the use of either type of character-level word embeddings in conjunction with the BiLSTM-CRF models leads to comparable state-of-the-art performance. However, the models using CNN-based character-level word embeddings have a computational performance advantage, increasing training time over word-based models by 25 % while the LSTM-based character-level word embeddings more than double the required training time.

pdf bib
A Novel Embedding Model for Knowledge Base Completion Based on Convolutional Neural Network
Dai Quoc Nguyen | Tu Dinh Nguyen | Dat Quoc Nguyen | Dinh Phung
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

In this paper, we propose a novel embedding model, named ConvKB, for knowledge base completion. Our model ConvKB advances state-of-the-art models by employing a convolutional neural network, so that it can capture global relationships and transitional characteristics between entities and relations in knowledge bases. In ConvKB, each triple (head entity, relation, tail entity) is represented as a 3-column matrix where each column vector represents a triple element. This 3-column matrix is then fed to a convolution layer where multiple filters are operated on the matrix to generate different feature maps. These feature maps are then concatenated into a single feature vector representing the input triple. The feature vector is multiplied with a weight vector via a dot product to return a score. This score is then used to predict whether the triple is valid or not. Experiments show that ConvKB achieves better link prediction performance than previous state-of-the-art embedding models on two benchmark datasets WN18RR and FB15k-237.

pdf bib
VnCoreNLP : A Vietnamese Natural Language Processing ToolkitVnCoreNLP: A Vietnamese Natural Language Processing Toolkit
Thanh Vu | Dat Quoc Nguyen | Dai Quoc Nguyen | Mark Dras | Mark Johnson
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

We present an easy-to-use and fast toolkit, namely VnCoreNLPa Java NLP annotation pipeline for Vietnamese. Our VnCoreNLP supports key natural language processing (NLP) tasks including word segmentation, part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing, and obtains state-of-the-art (SOTA) results for these tasks. We release VnCoreNLP to provide rich linguistic annotations to facilitate research work on Vietnamese NLP. Our VnCoreNLP is open-source and available at :https://github.com/vncorenlp/VnCoreNLP\n

pdf bib
An Improved Neural Network Model for Joint POS Tagging and Dependency ParsingPOS Tagging and Dependency Parsing
Dat Quoc Nguyen | Karin Verspoor
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We propose a novel neural network model for joint part-of-speech (POS) tagging and dependency parsing. Our model extends the well-known BIST graph-based dependency parser (Kiperwasser and Goldberg, 2016) by incorporating a BiLSTM-based tagging component to produce automatically predicted POS tags for the parser. On the benchmark English Penn treebank, our model obtains strong UAS and LAS scores at 94.51 % and 92.87 %, respectively, producing 1.5+% absolute improvements to the BIST graph-based parser, and also obtaining a state-of-the-art POS tagging accuracy at 97.97 %. Furthermore, experimental results on parsing 61 big Universal Dependencies treebanks from raw texts show that our model outperforms the baseline UDPipe (Straka and Strakova, 2017) with 0.8 % higher average POS tagging score and 3.6 % higher average LAS score. In addition, with our model, we also obtain state-of-the-art downstream task scores for biomedical event extraction and opinion analysis applications. Our code is available together with all pre-trained models at :https://github.com/datquocnguyen/jPTDP\n

2017

pdf bib
A Novel Neural Network Model for Joint POS Tagging and Graph-based Dependency ParsingPOS Tagging and Graph-based Dependency Parsing
Dat Quoc Nguyen | Mark Dras | Mark Johnson
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We present a novel neural network model that learns POS tagging and graph-based dependency parsing jointly. Our model uses bidirectional LSTMs to learn feature representations shared for both POS tagging and dependency parsing tasks, thus handling the feature-engineering problem. Our extensive experiments, on 19 languages from the Universal Dependencies project, show that our model outperforms the state-of-the-art neural network-based Stack-propagation model for joint POS tagging and transition-based dependency parsing, resulting in a new state of the art. Our code is open-source and available together with pre-trained models at :https://github.com/datquocnguyen/jPTDP\n

pdf bib
A Mixture Model for Learning Multi-Sense Word Embeddings
Dai Quoc Nguyen | Dat Quoc Nguyen | Ashutosh Modi | Stefan Thater | Manfred Pinkal
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

Word embeddings are now a standard technique for inducing meaning representations for words. For getting good representations, it is important to take into account different senses of a word. In this paper, we propose a mixture model for learning multi-sense word embeddings. Our model generalizes the previous works in that it allows to induce different weights of different senses of a word. The experimental results show that our model outperforms previous models on standard evaluation tasks.