Bidisha Samanta
2019
Improved Sentiment Detection via Label Transfer from Monolingual to Synthetic Code-Switched Text
Bidisha Samanta | Niloy Ganguly | Soumen Chakrabarti
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics
Multilingual writers and speakers often alternate between two languages in a single discourse. This practice is called code-switching. Existing sentiment detection methods are usually trained on sentiment-labeled monolingual text. Manually labeled code-switched text, especially involving minority languages, is extremely rare. Consequently, the best monolingual methods perform relatively poorly on code-switched text. We present an effective technique for synthesizing labeled code-switched text from labeled monolingual text, which is more readily available. The idea is to replace carefully selected subtrees of constituency parses of sentences in the resource-rich language with suitable token spans selected from automatic translations to the resource-poor language. By augmenting the scarce labeled code-switched text with plentiful synthetic labeled code-switched text, we achieve significant improvements in sentiment labeling accuracy (1.5%, 5.11%, 7.20%) for three different language pairs (English-Hindi, English-Spanish and English-Bengali). The improvement is also significant in hate-speech detection, where we achieve a 4% improvement using only synthetic code-switched data (6% with data augmentation).
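The abstract describes swapping constituency subtrees of a source-language sentence with spans from its translation. Below is a minimal sketch of that idea, not the authors' implementation: the hard-coded parse string and the translate() lookup are hypothetical stand-ins for a real constituency parser and a real MT/alignment step, and the paper's subtree-selection criteria are not reproduced.

```python
# Sketch: synthesize a code-switched sentence from a constituency parse by
# replacing one subtree's token span with a translated span.
from nltk import Tree

# Assumed output of an external constituency parser (hypothetical example).
PARSE = "(S (NP (PRP I)) (VP (VBP love) (NP (DT this) (NN movie))))"


def translate(tokens):
    """Hypothetical MT lookup: returns a Hindi span for a known English span,
    otherwise returns the span unchanged."""
    demo = {("this", "movie"): ["yeh", "film"]}
    return demo.get(tuple(t.lower() for t in tokens), tokens)


def find_span(tokens, span):
    """Locate the start index of `span` inside the flat token list."""
    for i in range(len(tokens) - len(span) + 1):
        if tokens[i:i + len(span)] == span:
            return i
    return 0


def synthesize_code_switched(parse_str, label="NP"):
    tree = Tree.fromstring(parse_str)
    tokens = tree.leaves()
    for subtree in tree.subtrees(lambda t: t.label() == label):
        span = subtree.leaves()
        if len(span) >= len(tokens):
            continue  # skip a subtree covering the whole sentence
        translated = translate(span)
        if translated == span:
            continue  # no translation available for this span
        out = tokens[:]
        start = find_span(tokens, span)
        out[start:start + len(span)] = translated
        return " ".join(out)
    return " ".join(tokens)


if __name__ == "__main__":
    print(synthesize_code_switched(PARSE))  # -> "I love yeh film"
```

The key point the sketch illustrates is that the sentiment label of the monolingual source sentence is carried over unchanged to the synthesized code-switched sentence, which is what makes the synthetic data usable for training.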
2017
All that is English may be Hindi: Enhancing language identification through automatic ranking of the likeliness of word borrowing in social media
Jasabanta Patro | Bidisha Samanta | Saurabh Singh | Abhipsa Basu | Prithwish Mukherjee | Monojit Choudhury | Animesh Mukherjee
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing
In this paper, we present a set of computational methods to identify the likeliness of a word being borrowed, based on signals from social media. In terms of Spearman's correlation values, our methods perform more than two times better (0.62) in predicting borrowing likeliness than the best-performing baseline (0.26) reported in the literature. Based on this likeliness estimate, we asked annotators to re-annotate the language tags of foreign words in predominantly native contexts. In 88% of cases the annotators felt that the foreign-language tag should be replaced by a native-language tag, indicating substantial scope for improvement in automatic language identification systems.
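The evaluation quoted above is Spearman's rank correlation between predicted borrowing likeliness and a gold ranking. A minimal sketch of that comparison is shown below; the word list and scores are hypothetical toy data, not the paper's, and serve only to show how the metric is computed.

```python
# Sketch: Spearman correlation between gold borrowing-likeliness ratings and
# model scores, as in the evaluation described in the abstract above.
from scipy.stats import spearmanr

# Hypothetical gold ratings (higher = more likely to be borrowed),
# e.g. aggregated from annotator judgments.
gold = {"glass": 4.6, "school": 4.1, "petrol": 3.2, "vote": 2.8, "moon": 0.4}

# Hypothetical model scores derived from social-media signals.
pred = {"glass": 0.91, "school": 0.84, "petrol": 0.70, "vote": 0.62, "moon": 0.05}

words = sorted(gold)
rho, p_value = spearmanr([gold[w] for w in words], [pred[w] for w in words])
print(f"Spearman rho = {rho:.2f}")  # 1.00 on this toy data; the paper reports 0.62
```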