Wafia Adouane
2019
Normalising Non-standardised Orthography in Algerian Code-switched User-generated Data
Wafia Adouane | Jean-Philippe Bernardy | Simon Dobnik
Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019)
We work with Algerian, an under-resourced non-standardised Arabic variety, for which we compile a new parallel corpus consisting of user-generated textual data matched with normalised and corrected human annotations that follow our data-driven, linguistically motivated standard. We use an end-to-end deep neural model designed to deal with context-dependent spelling correction and normalisation. Results indicate that a model with two CNN sub-network encoders and an LSTM decoder performs best, and that word context matters. Additionally, pre-processing data token-by-token with an edit-distance-based aligner significantly improves the performance. Used as a pre-processing step for a downstream task, detecting binary Semantic Textual Similarity, spelling correction and normalisation give promising results.
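The token-by-token pre-processing mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' aligner: the function names, the gap cost, and the normalised substitution cost are assumptions chosen for illustration, and the example tokens are made up. The idea is to pair each token of a noisy sentence with a token of its normalised counterpart by dynamic programming, using character-level edit distance as the substitution cost.

```python
from typing import List, Optional, Tuple


def char_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def align_tokens(
    src: List[str], tgt: List[str], gap_cost: float = 1.0
) -> List[Tuple[Optional[str], Optional[str]]]:
    """Align two token sequences; None marks a token with no counterpart.

    Assumption: substitution cost is the character edit distance divided by
    the longer token's length, so it lies in [0, 1].
    """
    def sub(a: str, b: str) -> float:
        return char_edit_distance(a, b) / max(len(a), len(b), 1)

    n, m = len(src), len(tgt)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = cost[i - 1][0] + gap_cost, "skip_src"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = cost[0][j - 1] + gap_cost, "skip_tgt"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = [
                (cost[i - 1][j - 1] + sub(src[i - 1], tgt[j - 1]), "match"),
                (cost[i - 1][j] + gap_cost, "skip_src"),
                (cost[i][j - 1] + gap_cost, "skip_tgt"),
            ]
            # min() keeps the first minimum, so ties prefer pairing tokens.
            cost[i][j], back[i][j] = min(candidates, key=lambda c: c[0])

    # Follow the back-pointers from the end to recover the alignment.
    pairs: List[Tuple[Optional[str], Optional[str]]] = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "match":
            pairs.append((src[i - 1], tgt[j - 1]))
            i, j = i - 1, j - 1
        elif move == "skip_src":
            pairs.append((src[i - 1], None))
            i -= 1
        else:
            pairs.append((None, tgt[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs
```

Aligned pairs of this kind could then serve as token-level training examples; how the paper actually uses its aligner output is not specified in the abstract.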
2018
Improving Neural Network Performance by Injecting Background Knowledge: Detecting Code-switching and Borrowing in Algerian texts
Wafia Adouane | Jean-Philippe Bernardy | Simon Dobnik
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching
We explore the effect of injecting background knowledge into different deep neural network (DNN) configurations in order to mitigate the scarcity of annotated data when applying these models to datasets of low-resourced languages. The background knowledge is encoded in the form of lexicons and pre-trained sub-word embeddings. The DNN models are evaluated on the task of detecting code-switching and borrowing points in non-standardised user-generated Algerian texts. Overall results show that DNNs benefit from adding background knowledge. However, the gain varies between models and categories. The proposed DNN architectures are generic and could be applied to other low-resourced languages.
2017
Identification of Languages in Algerian Arabic Multilingual Documents
Wafia Adouane | Simon Dobnik
Proceedings of the Third Arabic Natural Language Processing Workshop
This paper presents a language identification system designed to detect the language of each word, in its context, in multilingual documents as generated in social media by bilingual / multilingual communities, in our case speakers of Algerian Arabic. We frame the task as a sequence tagging problem and use supervised machine learning with standard methods like HMM and n-gram classification tagging. We also experiment with a lexicon-based method. Combining all the methods in a fall-back mechanism and introducing some linguistic rules, to deal with unseen tokens and ambiguous words, gives an overall accuracy of 93.14%. Finally, we introduce rules for language identification from sequences of recognised words.
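A fall-back mechanism of the kind the abstract describes can be sketched as a chain of taggers in which each one either labels a token or declines, and the first answer wins. The sketch below is an assumption-laden illustration, not the paper's system: the tagger order, the labels (`ALG`, `FRA`, `digit`, `unknown`), the tiny lexicon, and the digit rule are all hypothetical, and the HMM and n-gram taggers are left out.

```python
from typing import Callable, Dict, List, Optional, Sequence

# A tagger sees the whole sentence plus a position, so it can use context,
# and returns a label or None when it cannot decide.
Tagger = Callable[[Sequence[str], int], Optional[str]]


def lexicon_tagger(lexicon: Dict[str, str]) -> Tagger:
    """Build a tagger that looks each lower-cased token up in a lexicon."""
    def tag(tokens: Sequence[str], i: int) -> Optional[str]:
        return lexicon.get(tokens[i].lower())
    return tag


def digit_rule_tagger(tokens: Sequence[str], i: int) -> Optional[str]:
    """Hypothetical rule: tokens made only of digits get the label 'digit'."""
    return "digit" if tokens[i].isdigit() else None


def tag_with_fallback(
    tokens: Sequence[str], taggers: Sequence[Tagger], default: str = "unknown"
) -> List[str]:
    """Label each token with the first tagger that returns a label."""
    labels: List[str] = []
    for i in range(len(tokens)):
        label = default
        for tagger in taggers:
            guess = tagger(tokens, i)
            if guess is not None:
                label = guess
                break
        labels.append(label)
    return labels


# Illustrative lexicon entries; the labels are placeholders.
toy_lexicon = {"merci": "FRA", "bezzaf": "ALG"}
taggers = [lexicon_tagger(toy_lexicon), digit_rule_tagger]
labels = tag_with_fallback(["Merci", "bezzaf", "2017", "xyz"], taggers)
```

Passing the sentence and an index, rather than a bare token, leaves room for context-sensitive taggers such as a sequence model to be slotted into the same chain.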