El Moatez Billah Nagoudi


2021

pdf bib
ARBERT & MARBERT : Deep Bidirectional Transformers for ArabicARBERT & MARBERT: Deep Bidirectional Transformers for Arabic
Muhammad Abdul-Mageed | AbdelRahim Elmadany | El Moatez Billah Nagoudi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Pre-trained language models (LMs) are currently integral to many natural language processing systems. Although multilingual LMs were also introduced to serve many languages, these have limitations such as being costly at inference time and the size and diversity of non-English data involved in their pre-training. We remedy these issues for a collection of diverse Arabic varieties by introducing two powerful deep bidirectional transformer-based models, ARBERT and MARBERT. To evaluate our models, we also introduce ARLUE, a new benchmark for multi-dialectal Arabic language understanding evaluation. ARLUE is built using 42 datasets targeting six different task clusters, allowing us to offer a series of standardized experiments under rich conditions. When fine-tuned on ARLUE, our models collectively achieve new state-of-the-art results across the majority of tasks (37 out of 48 classification tasks, on the 42 datasets). Our best model acquires the highest ARLUE score (77.40) across all six task clusters, outperforming all other models including XLM-R Large (3.4x larger size). Our models are publicly available at https://github.com/UBC-NLP/marbert and ARLUE will be released through the same repository.

pdf bib
Exploring Text-to-Text Transformers for English to Hinglish Machine Translation with Synthetic Code-MixingEnglish to Hinglish Machine Translation with Synthetic Code-Mixing
Ganesh Jawahar | El Moatez Billah Nagoudi | Muhammad Abdul-Mageed | Laks Lakshmanan, V.S.
Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching

We describe models focused at the understudied problem of translating between monolingual and code-mixed language pairs. More specifically, we offer a wide range of models that convert monolingual English text into Hinglish (code-mixed Hindi and English). Given the recent success of pretrained language models, we also test the utility of two recent Transformer-based encoder-decoder models (i.e., mT5 and mBART) on the task finding both to work well. Given the paucity of training data for code-mixing, we also propose a dependency-free method for generating code-mixed texts from bilingual distributed representations that we exploit for improving language model performance. In particular, armed with this additional data, we adopt a curriculum learning approach where we first finetune the language models on synthetic data then on gold code-mixed data. We find that, although simple, our synthetic code-mixing method is competitive with (and in some cases is even superior to) several standard methods (backtranslation, method based on equivalence constraint theory) under a diverse set of conditions. Our work shows that the mT5 model, finetuned following the curriculum learning procedure, achieves best translation performance (12.67 BLEU). Our models place first in the overall ranking of the English-Hinglish official shared task.

2020

pdf bib
Translating Similar Languages : Role of Mutual Intelligibility in Multilingual Transformers
Ife Adebara | El Moatez Billah Nagoudi | Muhammad Abdul Mageed
Proceedings of the Fifth Conference on Machine Translation

In this work we investigate different approaches to translate between similar languages despite low resource limitations. This work is done as the participation of the UBC NLP research group in the WMT 2019 Similar Languages Translation Shared Task. We participated in all language pairs and performed various experiments. We used a transformer architecture for all the models and used back-translation for one of the language pairs. We explore both bilingual and multi-lingual approaches. We describe the pre-processing, training, translation and results for each model. We also investigate the role of mutual intelligibility in model performance.

2018

pdf bib
ARB-SEN at SemEval-2018 Task1 : A New Set of Features for Enhancing the Sentiment Intensity Prediction in Arabic TweetsARB-SEN at SemEval-2018 Task1: A New Set of Features for Enhancing the Sentiment Intensity Prediction in Arabic Tweets
El Moatez Billah Nagoudi
Proceedings of The 12th International Workshop on Semantic Evaluation

This article describes our proposed Arabic Sentiment Analysis system named ARB-SEN. This system is designed for the International Workshop on Semantic Evaluation 2018 (SemEval-2018), Task1 : Affect in Tweets. ARB-SEN proposes two supervised models to estimate the sentiment intensity in Arabic tweets. Both models use a set of features including sentiment lexicon, negation, word embedding and emotion symbols features. Our system combines these features to assist the sentiment analysis task. ARB-SEN system achieves a correlation score of 0.720, ranking 6th among all participants in the valence intensity regression (V-reg) for the Arabic sub-task organized within the SemEval 2018 evaluation campaign.

2017

pdf bib
Semantic Similarity of Arabic Sentences with Word EmbeddingsArabic Sentences with Word Embeddings
El Moatez Billah Nagoudi | Didier Schwab
Proceedings of the Third Arabic Natural Language Processing Workshop

Semantic textual similarity is the basis of countless applications and plays an important role in diverse areas, such as information retrieval, plagiarism detection, information extraction and machine translation. This article proposes an innovative word embedding-based system devoted to calculate the semantic similarity in Arabic sentences. The main idea is to exploit vectors as word representations in a multidimensional space in order to capture the semantic and syntactic properties of words. IDF weighting and Part-of-Speech tagging are applied on the examined sentences to support the identification of words that are highly descriptive in each sentence. The performance of our proposed system is confirmed through the Pearson correlation between our assigned semantic similarity scores and human judgments.

pdf bib
LIM-LIG at SemEval-2017 Task1 : Enhancing the Semantic Similarity for Arabic Sentences with Vectors WeightingLIM-LIG at SemEval-2017 Task1: Enhancing the Semantic Similarity for Arabic Sentences with Vectors Weighting
El Moatez Billah Nagoudi | Jérémy Ferrero | Didier Schwab
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This article describes our proposed system named LIM-LIG. This system is designed for SemEval 2017 Task1 : Semantic Textual Similarity (Track1). LIM-LIG proposes an innovative enhancement to word embedding-based model devoted to measure the semantic similarity in Arabic sentences. The main idea is to exploit the word representations as vectors in a multidimensional space to capture the semantic and syntactic properties of words. IDF weighting and Part-of-Speech tagging are applied on the examined sentences to support the identification of words that are highly descriptive in each sentence. LIM-LIG system achieves a Pearson’s correlation of 0.74633, ranking 2nd among all participants in the Arabic monolingual pairs STS task organized within the SemEval 2017 evaluation campaign