Addressing Noise in Multidialectal Word Embeddings

Alexander Erdmann, Nasser Zalmout, Nizar Habash


Abstract
Word embeddings are crucial to many natural language processing tasks. Embedding quality depends on large, clean training corpora, but Arabic dialects lack large corpora and are noisy: they are linguistically disparate and have no standardized spelling. We make three contributions to address this noise. First, we describe simple but effective adaptations to word embedding tools that maximize the informative content leveraged from each training sentence. Second, we analyze methods for representing disparate dialects in a single embedding space, either by mapping individual dialects into a shared space or by learning a joint model over all dialects. Finally, we evaluate via dictionary induction, showing that two metrics not typically reported for the task let us analyze our contributions' effects on low and high frequency words. In addition to boosting performance by 2–53%, we specifically improve on noisy, low frequency forms without compromising accuracy on high frequency forms.
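The second contribution above, mapping individual dialects into a shared embedding space, is commonly realized as a linear mapping learned from a seed dictionary, with dictionary induction evaluated by nearest-neighbor lookup in the shared space. The sketch below illustrates that general family of methods (orthogonal Procrustes alignment) on synthetic data; it is not the authors' exact setup, and all data and sizes are illustrative.

```python
import numpy as np

# Illustrative sketch, not the paper's exact method: align two toy
# embedding spaces with an orthogonal Procrustes mapping learned from a
# seed dictionary, then evaluate by dictionary induction (nearest-
# neighbor retrieval in the shared space). All data are synthetic.

rng = np.random.default_rng(0)
dim, n_words = 5, 20

# "Dialect A" embeddings: random unit vectors.
X = rng.normal(size=(n_words, dim))
X /= np.linalg.norm(X, axis=1, keepdims=True)

# "Dialect B" embeddings: the same vectors under an unknown rotation,
# simulating two dialects whose relative geometry is consistent.
Q_true, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
Y = X @ Q_true

# Seed dictionary: the first 10 word pairs are known translations.
seed = list(range(10))

# Orthogonal Procrustes: W = argmin ||X_seed W - Y_seed||_F with W
# orthogonal, solved in closed form via the SVD of X_seed^T Y_seed.
U, _, Vt = np.linalg.svd(X[seed].T @ Y[seed])
W = U @ Vt  # maps dialect A vectors into dialect B space

# Dictionary induction on the held-out pairs: for each mapped A vector,
# retrieve the nearest B vector by cosine similarity (rows are unit
# norm, so the dot product is the cosine).
mapped = X @ W
sims = mapped @ Y.T
predictions = sims.argmax(axis=1)
accuracy = (predictions == np.arange(n_words))[10:].mean()
```

Because the toy "dialects" differ only by a rotation, the seed pairs suffice to recover the mapping exactly and held-out accuracy is perfect; real multidialectal data adds exactly the noise the paper targets, which is why low-frequency forms are the hard case.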
Anthology ID:
P18-2089
Volume:
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
July
Year:
2018
Address:
Melbourne, Australia
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
558–565
URL:
https://aclanthology.org/P18-2089
DOI:
10.18653/v1/P18-2089
Cite (ACL):
Alexander Erdmann, Nasser Zalmout, and Nizar Habash. 2018. Addressing Noise in Multidialectal Word Embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 558–565, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Addressing Noise in Multidialectal Word Embeddings (Erdmann et al., ACL 2018)
PDF:
https://aclanthology.org/P18-2089.pdf
Poster:
P18-2089.Poster.pdf