Didier Schwab


2022

pdf bib
Automatic Speech Recognition and Query By Example for Creole Languages Documentation
Cécile Macaire | Didier Schwab | Benjamin Lecouteux | Emmanuel Schang
Findings of the Association for Computational Linguistics: ACL 2022

We investigate the exploitation of self supervised models for two Creole languages with few resources Gwadloupyen and Morisien Automatic language processing tools are almost non existent for these two languages We propose to use about one hour of annotated data to design an automatic speech recognition system for each language We evaluate how much data is needed to obtain a query by example system that is usable by linguists Moreover our experiments show that multilingual self supervised models are not necessarily the most efficient for Creole languages

2020

pdf bib
Dual-decoder Transformer for Joint Automatic Speech Recognition and Multilingual Speech Translation
Hang Le | Juan Pino | Changhan Wang | Jiatao Gu | Didier Schwab | Laurent Besacier
Proceedings of the 28th International Conference on Computational Linguistics

We introduce dual-decoder Transformer, a new model architecture that jointly performs automatic speech recognition (ASR) and multilingual speech translation (ST). Our models are based on the original Transformer architecture (Vaswani et al., 2017) but consist of two decoders, each responsible for one task (ASR or ST). Our major contribution lies in how these decoders interact with each other : one decoder can attend to different information sources from the other via a dual-attention mechanism. We propose two variants of these architectures corresponding to two different levels of dependencies between the decoders, called the parallel and cross dual-decoder Transformers, respectively. Extensive experiments on the MuST-C dataset show that our models outperform the previously-reported highest translation performance in the multilingual settings, and outperform as well bilingual one-to-one results. Furthermore, our parallel models demonstrate no trade-off between ASR and ST compared to the vanilla multi-task architecture. Our code and pre-trained models are available at https://github.com/formiel/speech-translation.

pdf bib
WIKIR : A Python Toolkit for Building a Large-scale Wikipedia-based English Information Retrieval DatasetWIKIR: A Python Toolkit for Building a Large-scale Wikipedia-based English Information Retrieval Dataset
Jibril Frej | Didier Schwab | Jean-Pierre Chevallet
Proceedings of the 12th Language Resources and Evaluation Conference

Over the past years, deep learning methods allowed for new state-of-the-art results in ad-hoc information retrieval. However such methods usually require large amounts of annotated data to be effective. Since most standard ad-hoc information retrieval datasets publicly available for academic research (e.g. Robust04, ClueWeb09) have at most 250 annotated queries, the recent deep learning models for information retrieval perform poorly on these datasets. These models (e.g. DUET, Conv-KNRM) are trained and evaluated on data collected from commercial search engines not publicly available for academic research which is a problem for reproducibility and the advancement of research. In this paper, we propose WIKIR : an open-source toolkit to automatically build large-scale English information retrieval datasets based on Wikipedia. WIKIR is publicly available on GitHub. We also provide wikIR59k : a large-scale publicly available dataset that contains 59,252 queries and 2,617,003 (query, relevant documents) pairs.

pdf bib
FlauBERT : Unsupervised Language Model Pre-training for FrenchFlauBERT: Unsupervised Language Model Pre-training for French
Hang Le | Loïc Vial | Jibril Frej | Vincent Segonne | Maximin Coavoux | Benjamin Lecouteux | Alexandre Allauzen | Benoit Crabbé | Laurent Besacier | Didier Schwab
Proceedings of the 12th Language Resources and Evaluation Conference

Language models have become a key step to achieve state-of-the art results in many different Natural Language Processing (NLP) tasks. Leveraging the huge amount of unlabeled texts nowadays available, they provide an efficient way to pre-train continuous word representations that can be fine-tuned for a downstream task, along with their contextualization at the sentence level. This has been widely demonstrated for English using contextualized representations (Dai and Le, 2015 ; Peters et al., 2018 ; Howard and Ruder, 2018 ; Radford et al., 2018 ; Devlin et al., 2019 ; Yang et al., 2019b). In this paper, we introduce and share FlauBERT, a model learned on a very large and heterogeneous French corpus. Models of different sizes are trained using the new CNRS (French National Centre for Scientific Research) Jean Zay supercomputer. We apply our French language models to diverse NLP tasks (text classification, paraphrasing, natural language inference, parsing, word sense disambiguation) and show that most of the time they outperform other pre-training approaches. Different versions of FlauBERT as well as a unified evaluation protocol for the downstream tasks, called FLUE (French Language Understanding Evaluation), are shared to the research community for further reproducible experiments in French NLP.

2019

pdf bib
Sense Vocabulary Compression through the Semantic Knowledge of WordNet for Neural Word Sense DisambiguationWordNet for Neural Word Sense Disambiguation
Loïc Vial | Benjamin Lecouteux | Didier Schwab
Proceedings of the 10th Global Wordnet Conference

In this article, we tackle the issue of the limited quantity of manually sense annotated corpora for the task of word sense disambiguation, by exploiting the semantic relationships between senses such as synonymy, hypernymy and hyponymy, in order to compress the sense vocabulary of Princeton WordNet, and thus reduce the number of different sense tags that must be observed to disambiguate all words of the lexical database. We propose two different methods that greatly reduce the size of neural WSD models, with the benefit of improving their coverage without additional training data, and without impacting their precision. In addition to our methods, we present a WSD system which relies on pre-trained BERT word vectors in order to achieve results that significantly outperforms the state of the art on all WSD evaluation tasks.

pdf bib
The LIG system for the English-Czech Text Translation Task of IWSLT 2019LIG system for the English-Czech Text Translation Task of IWSLT 2019
Loïc Vial | Benjamin Lecouteux | Didier Schwab | Hang Le | Laurent Besacier
Proceedings of the 16th International Conference on Spoken Language Translation

In this paper, we present our submission for the English to Czech Text Translation Task of IWSLT 2019. Our system aims to study how pre-trained language models, used as input embeddings, can improve a specialized machine translation system trained on few data. Therefore, we implemented a Transformer-based encoder-decoder neural system which is able to use the output of a pre-trained language model as input embeddings, and we compared its performance under three configurations : 1) without any pre-trained language model (constrained), 2) using a language model trained on the monolingual parts of the allowed English-Czech data (constrained), and 3) using a language model trained on a large quantity of external monolingual data (unconstrained). We used BERT as external pre-trained language model (configuration 3), and BERT architecture for training our own language model (configuration 2). Regarding the training data, we trained our MT system on a small quantity of parallel text : one set only consists of the provided MuST-C corpus, and the other set consists of the MuST-C corpus and the News Commentary corpus from WMT. We observed that using the external pre-trained BERT improves the scores of our system by +0.8 to +1.5 of BLEU on our development set, and +0.97 to +1.94 of BLEU on the test set. However, using our own language model trained only on the allowed parallel data seems to improve the machine translation performances only when the system is trained on the smallest dataset.

2017

pdf bib
Semantic Similarity of Arabic Sentences with Word EmbeddingsArabic Sentences with Word Embeddings
El Moatez Billah Nagoudi | Didier Schwab
Proceedings of the Third Arabic Natural Language Processing Workshop

Semantic textual similarity is the basis of countless applications and plays an important role in diverse areas, such as information retrieval, plagiarism detection, information extraction and machine translation. This article proposes an innovative word embedding-based system devoted to calculate the semantic similarity in Arabic sentences. The main idea is to exploit vectors as word representations in a multidimensional space in order to capture the semantic and syntactic properties of words. IDF weighting and Part-of-Speech tagging are applied on the examined sentences to support the identification of words that are highly descriptive in each sentence. The performance of our proposed system is confirmed through the Pearson correlation between our assigned semantic similarity scores and human judgments.

pdf bib
Deep Investigation of Cross-Language Plagiarism Detection Methods
Jérémy Ferrero | Laurent Besacier | Didier Schwab | Frédéric Agnès
Proceedings of the 10th Workshop on Building and Using Comparable Corpora

This paper is a deep investigation of cross-language plagiarism detection methods on a new recently introduced open dataset, which contains parallel and comparable collections of documents with multiple characteristics (different genres, languages and sizes of texts). We investigate cross-language plagiarism detection methods for 6 language pairs on 2 granularities of text units in order to draw robust conclusions on the best methods while deeply analyzing correlations across document styles and languages.

pdf bib
CompiLIG at SemEval-2017 Task 1 : Cross-Language Plagiarism Detection Methods for Semantic Textual SimilarityCompiLIG at SemEval-2017 Task 1: Cross-Language Plagiarism Detection Methods for Semantic Textual Similarity
Jérémy Ferrero | Laurent Besacier | Didier Schwab | Frédéric Agnès
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

We present our submitted systems for Semantic Textual Similarity (STS) Track 4 at SemEval-2017. Given a pair of Spanish-English sentences, each system must estimate their semantic similarity by a score between 0 and 5. In our submission, we use syntax-based, dictionary-based, context-based, and MT-based methods. We also combine these methods in unsupervised and supervised way. Our best run ranked 1st on track 4a with a correlation of 83.02 % with human annotations.

pdf bib
LIM-LIG at SemEval-2017 Task1 : Enhancing the Semantic Similarity for Arabic Sentences with Vectors WeightingLIM-LIG at SemEval-2017 Task1: Enhancing the Semantic Similarity for Arabic Sentences with Vectors Weighting
El Moatez Billah Nagoudi | Jérémy Ferrero | Didier Schwab
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

This article describes our proposed system named LIM-LIG. This system is designed for SemEval 2017 Task1 : Semantic Textual Similarity (Track1). LIM-LIG proposes an innovative enhancement to word embedding-based model devoted to measure the semantic similarity in Arabic sentences. The main idea is to exploit the word representations as vectors in a multidimensional space to capture the semantic and syntactic properties of words. IDF weighting and Part-of-Speech tagging are applied on the examined sentences to support the identification of words that are highly descriptive in each sentence. LIM-LIG system achieves a Pearson’s correlation of 0.74633, ranking 2nd among all participants in the Arabic monolingual pairs STS task organized within the SemEval 2017 evaluation campaign

pdf bib
Using Word Embedding for Cross-Language Plagiarism Detection
Jérémy Ferrero | Laurent Besacier | Didier Schwab | Frédéric Agnès
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

This paper proposes to use distributed representation of words (word embeddings) in cross-language textual similarity detection. The main contributions of this paper are the following : (a) we introduce new cross-language similarity detection methods based on distributed representation of words ; (b) we combine the different methods proposed to verify their complementarity and finally obtain an overall F1 score of 89.15 % for English-French similarity detection at chunk level (88.5 % at sentence level) on a very challenging corpus.