Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching

Gustavo Aguilar, Fahad AlGhamdi, Victor Soto, Thamar Solorio, Mona Diab, Julia Hirschberg (Editors)


Anthology ID:
W18-32
Month:
July
Year:
2018
Address:
Melbourne, Australia
Venues:
ACL | WS
SIG:
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/W18-32
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
https://aclanthology.org/W18-32.pdf

pdf bib
Proceedings of the Third Workshop on Computational Approaches to Linguistic Code-Switching
Gustavo Aguilar | Fahad AlGhamdi | Victor Soto | Thamar Solorio | Mona Diab | Julia Hirschberg

pdf bib
Joint Part-of-Speech and Language ID Tagging for Code-Switched DataID Tagging for Code-Switched Data
Victor Soto | Julia Hirschberg

Code-switching is the fluent alternation between two or more languages in conversation between bilinguals. Large populations of speakers code-switch during communication, but little effort has been made to develop tools for code-switching, including part-of-speech taggers. In this paper, we propose an approach to POS tagging of code-switched English-Spanish data based on recurrent neural networks. We test our model on known monolingual benchmarks to demonstrate that our neural POS tagging model is on par with state-of-the-art methods. We next test our code-switched methods on the Miami Bangor corpus of English Spanish conversation, focusing on two types of experiments : POS tagging alone, for which we achieve 96.34 % accuracy, and joint part-of-speech and language ID tagging, which achieves similar POS tagging accuracy (96.39 %) and very high language ID accuracy (98.78 %). Finally, we show that our proposed models outperform other state-of-the-art code-switched taggers.

pdf bib
Phone Merging For Code-Switched Speech Recognition
Sunit Sivasankaran | Brij Mohan Lal Srivastava | Sunayana Sitaram | Kalika Bali | Monojit Choudhury

Speakers in multilingual communities often switch between or mix multiple languages in the same conversation. Automatic Speech Recognition (ASR) of code-switched speech faces many challenges including the influence of phones of different languages on each other. This paper shows evidence that phone sharing between languages improves the Acoustic Model performance for Hindi-English code-switched speech. We compare baseline system built with separate phones for Hindi and English with systems where the phones were manually merged based on linguistic knowledge. Encouraged by the improved ASR performance after manually merging the phones, we further investigate multiple data-driven methods to identify phones to be merged across the languages. We show detailed analysis of automatic phone merging in this language pair and the impact it has on individual phone accuracies and WER. Though the best performance gain of 1.2 % WER was observed with manually merged phones, we show experimentally that the manual phone merge is not optimal.

pdf bib
Improving Neural Network Performance by Injecting Background Knowledge : Detecting Code-switching and Borrowing in Algerian textsAlgerian texts
Wafia Adouane | Jean-Philippe Bernardy | Simon Dobnik

We explore the effect of injecting background knowledge to different deep neural network (DNN) configurations in order to mitigate the problem of the scarcity of annotated data when applying these models on datasets of low-resourced languages. The background knowledge is encoded in the form of lexicons and pre-trained sub-word embeddings. The DNN models are evaluated on the task of detecting code-switching and borrowing points in non-standardised user-generated Algerian texts. Overall results show that DNNs benefit from adding background knowledge. However, the gain varies between models and categories. The proposed DNN architectures are generic and could be applied to other low-resourced languages.

pdf bib
Code-Mixed Question Answering Challenge : Crowd-sourcing Data and Techniques
Khyathi Chandu | Ekaterina Loginova | Vishal Gupta | Josef van Genabith | Günter Neumann | Manoj Chinnakotla | Eric Nyberg | Alan W. Black

Code-Mixing (CM) is the phenomenon of alternating between two or more languages which is prevalent in bi- and multi-lingual communities. Most NLP applications today are still designed with the assumption of a single interaction language and are most likely to break given a CM utterance with multiple languages mixed at a morphological, phrase or sentence level. For example, popular commercial search engines do not yet fully understand the intents expressed in CM queries. As a first step towards fostering research which supports CM in NLP applications, we systematically crowd-sourced and curated an evaluation dataset for factoid question answering in three CM languages-Hinglish (Hindi+English), Tenglish (Telugu+English) and Tamlish (Tamil+English) which belong to two language families (Indo-Aryan and Dravidian). We share the details of our data collection process, techniques which were used to avoid inducing lexical bias amongst the crowd workers and other CM specific linguistic properties of the dataset. Our final dataset, which is available freely for research purposes, has 1,694 Hinglish, 2,848 Tamlish and 1,391 Tenglish factoid questions and their answers. We discuss the techniques used by the participants for the first edition of this ongoing challenge.

pdf bib
Transliteration Better than Translation? Answering Code-mixed Questions over a Knowledge Base
Vishal Gupta | Manoj Chinnakotla | Manish Shrivastava

Humans can learn multiple languages. If they know a fact in one language, they can answer a question in another language they understand. They can also answer Code-mix (CM) questions : questions which contain both languages. This behavior is attributed to the unique learning ability of humans. Our task aims to study if machines can achieve this. We demonstrate how effectively a machine can answer CM questions. In this work, we adopt a two phase approach : candidate generation and candidate re-ranking to answer questions. We propose a Triplet-Siamese-Hybrid CNN (TSHCNN) to re-rank candidate answers. We show experiments on the SimpleQuestions dataset. Our network is trained only on English questions provided in this dataset and noisy Hindi translations of these questions and can answer English-Hindi CM questions effectively without the need of translation into English. Back-transliterated CM questions outperform their lexical and sentence level translated counterparts by 5 % & 35 % in accuracy respectively, highlighting the efficacy of our approach in a resource constrained setting.

pdf bib
Predicting the presence of a Matrix Language in code-switching
Barbara Bullock | Wally Guzmán | Jacqueline Serigos | Vivek Sharath | Almeida Jacqueline Toribio

One language is often assumed to be dominant in code-switching but this assumption has not been empirically tested. We operationalize the matrix language (ML) at the level of the sentence, using three common definitions from linguistics. We test whether these converge and then model this convergence via a set of metrics that together quantify the nature of C-S. We conduct our experiment on four Spanish-English corpora. Our results demonstrate that our model can separate some corpora according to whether they have a dominant ML or not but that the corpora span a range of mixing types that can not be sorted neatly into an insertional vs. alternational dichotomy.

pdf bib
Accommodation of Conversational Code-Choice
Anshul Bawa | Monojit Choudhury | Kalika Bali

Bilingual speakers often freely mix languages. However, in such bilingual conversations, are the language choices of the speakers coordinated? How much does one speaker’s choice of language affect other speakers? In this paper, we formulate code-choice as a linguistic style, and show that speakers are indeed sensitive to and accommodating of each other’s code-choice. We find that the saliency or markedness of a language in context directly affects the degree of accommodation observed. More importantly, we discover that accommodation of code-choices persists over several conversational turns. We also propose an alternative interpretation of conversational accommodation as a retrieval problem, and show that the differences in accommodation characteristics of code-choices are based on their markedness in context.

pdf bib
Language Informed Modeling of Code-Switched Text
Khyathi Chandu | Thomas Manzini | Sumeet Singh | Alan W. Black

Code-switching (CS), the practice of alternating between two or more languages in conversations, is pervasive in most multi-lingual communities. CS texts have a complex interplay between languages and occur in informal contexts that make them harder to collect and construct NLP tools for. We approach this problem through Language Modeling (LM) on a new Hindi-English mixed corpus containing 59,189 unique sentences collected from blogging websites. We implement and discuss different Language Models derived from a multi-layered LSTM architecture. We hypothesize that encoding language information strengthens a language model by helping to learn code-switching points. We show that our highest performing model achieves a test perplexity of 19.52 on the CS corpus that we collected and processed. On this data we demonstrate that our performance is an improvement over AWD-LSTM LM (a recent state of the art on monolingual English).

pdf bib
GHHT at CALCS 2018 : Named Entity Recognition for Dialectal Arabic Using Neural NetworksGHHT at CALCS 2018: Named Entity Recognition for Dialectal Arabic Using Neural Networks
Mohammed Attia | Younes Samih | Wolfgang Maier

This paper describes our system submission to the CALCS 2018 shared task on named entity recognition on code-switched data for the language variant pair of Modern Standard Arabic and Egyptian dialectal Arabic. We build a a Deep Neural Network that combines word and character-based representations in convolutional and recurrent networks with a CRF layer. The model is augmented with stacked layers of enriched information such pre-trained embeddings, Brown clusters and named entity gazetteers. Our system is ranked second among those participating in the shared task achieving an FB1 average of 70.09 %.

pdf bib
Simple Features for Strong Performance on Named Entity Recognition in Code-Switched Twitter DataTwitter Data
Devanshu Jain | Maria Kustikova | Mayank Darbari | Rishabh Gupta | Stephen Mayhew

In this work, we address the problem of Named Entity Recognition (NER) in code-switched tweets as a part of the Workshop on Computational Approaches to Linguistic Code-switching (CALCS) at ACL’18. Code-switching is the phenomenon where a speaker switches between two languages or variants of the same language within or across utterances, known as intra-sentential or inter-sentential code-switching, respectively. Processing such data is challenging using state of the art methods since such technology is generally geared towards processing monolingual text. In this paper we explored ways to use language identification and translation to recognize named entities in such data, however, utilizing simple features (sans multi-lingual features) with Conditional Random Field (CRF) classifier achieved the best results. Our experiments were mainly aimed at the (ENG-SPA) English-Spanish dataset but we submitted a language-independent version of our system to the (MSA-EGY) Arabic-Egyptian dataset as well and achieved good results.

pdf bib
Bilingual Character Representation for Efficiently Addressing Out-of-Vocabulary Words in Code-Switching Named Entity Recognition
Genta Indra Winata | Chien-Sheng Wu | Andrea Madotto | Pascale Fung

We propose an LSTM-based model with hierarchical architecture on named entity recognition from code-switching Twitter data. Our model uses bilingual character representation and transfer learning to address out-of-vocabulary words. In order to mitigate data noise, we propose to use token replacement and normalization. In the 3rd Workshop on Computational Approaches to Linguistic Code-Switching Shared Task, we achieved second place with 62.76 % harmonic mean F1-score for English-Spanish language pair without using any gazetteer and knowledge-based information.

pdf bib
The University of Texas System Submission for the Code-Switching Workshop Shared Task 2018University of Texas System Submission for the Code-Switching Workshop Shared Task 2018
Florian Janke | Tongrui Li | Eric Rincón | Gualberto Guzmán | Barbara Bullock | Almeida Jacqueline Toribio

This paper describes the system for the Named Entity Recognition Shared Task of the Third Workshop on Computational Approaches to Linguistic Code-Switching (CALCS) submitted by the Bilingual Annotations Tasks (BATs) research group of the University of Texas. Our system uses several features to train a Conditional Random Field (CRF) model for classifying input words as Named Entities (NEs) using the Inside-Outside-Beginning (IOB) tagging scheme. We participated in the Modern Standard Arabic-Egyptian Arabic (MSA-EGY) and English-Spanish (ENG-SPA) tasks, achieving weighted average F-scores of 65.62 and 54.16 respectively. We also describe the performance of a deep neural network (NN) trained on a subset of the CRF features, which did not surpass CRF performance.

pdf bib
Tackling Code-Switched NER : Participation of CMUNER: Participation of CMU
Parvathy Geetha | Khyathi Chandu | Alan W Black

Named Entity Recognition plays a major role in several downstream applications in NLP. Though this task has been heavily studied in formal monolingual texts and also noisy texts like Twitter data, it is still an emerging task in code-switched (CS) content on social media. This paper describes our participation in the shared task of NER on code-switched data for Spanglish (Spanish + English) and Arabish (Arabic + English). In this paper we describe models that intuitively developed from the data for the shared task Named Entity Recognition on Code-switched Data. Owing to the sparse and non-linear relationships between words in Twitter data, we explored neural architectures that are capable of non-linearities fairly well. In specific, we trained character level models and word level models based on Bidirectional LSTMs (Bi-LSTMs) to perform sequential tagging. We train multiple models to identify nominal mentions and subsequently use this information to predict the labels of named entity in a sequence. Our best model is a character level model along with word level pre-trained multilingual embeddings that gave an F-score of 56.72 in Spanglish and a word level model that gave an F-score of 65.02 in Arabish on the test data.

pdf bib
Multilingual Named Entity Recognition on Spanish-English Code-switched Tweets using Support Vector MachinesSpanish-English Code-switched Tweets using Support Vector Machines
Daniel Claeser | Samantha Kent | Dennis Felske

This paper describes our system submission for the ACL 2018 shared task on named entity recognition (NER) in code-switched Twitter data. Our best result (F1 = 53.65) was obtained using a Support Vector Machine (SVM) with 14 features combined with rule-based post processing.

pdf bib
IIT (BHU) Submission for the ACL Shared Task on Named Entity Recognition on Code-switched DataIIT (BHU) Submission for the ACL Shared Task on Named Entity Recognition on Code-switched Data
Shashwat Trivedi | Harsh Rangwani | Anil Kumar Singh

This paper describes the best performing system for the shared task on Named Entity Recognition (NER) on code-switched data for the language pair Spanish-English (ENG-SPA). We introduce a gated neural architecture for the NER task. Our final model achieves an F1 score of 63.76 %, outperforming the baseline by 10 %.

pdf bib
Code-Switched Named Entity Recognition with Embedding Attention
Changhan Wang | Kyunghyun Cho | Douwe Kiela

We describe our work for the CALCS 2018 shared task on named entity recognition on code-switched data. Our system ranked first place for MS Arabic-Egyptian named entity recognition and third place for English-Spanish.