Proceedings of the Second Workshop on Subword/Character LEvel Models

Manaal Faruqui, Hinrich Schütze, Isabel Trancoso, Yulia Tsvetkov, Yadollah Yaghoobzadeh (Editors)

Anthology ID:
New Orleans
Association for Computational Linguistics
Bib Export formats:

pdf bib
Proceedings of the Second Workshop on Subword/Character LEvel Models
Manaal Faruqui | Hinrich Schütze | Isabel Trancoso | Yulia Tsvetkov | Yadollah Yaghoobzadeh

pdf bib
Morphological Word Embeddings for Arabic Neural Machine Translation in Low-Resource SettingsArabic Neural Machine Translation in Low-Resource Settings
Pamela Shapiro | Kevin Duh

Neural machine translation has achieved impressive results in the last few years, but its success has been limited to settings with large amounts of parallel data. One way to improve NMT for lower-resource settings is to initialize a word-based NMT model with pretrained word embeddings. However, rare words still suffer from lower quality word embeddings when trained with standard word-level objectives. We introduce word embeddings that utilize morphological resources, and compare to purely unsupervised alternatives. We work with Arabic, a morphologically rich language with available linguistic resources, and perform Ar-to-En MT experiments on a small corpus of TED subtitles. We find that word embeddings utilizing subword information consistently outperform standard word embeddings on a word similarity task and as initialization of the source word embeddings in a low-resource NMT system.

pdf bib
Entropy-Based Subword Mining with an Application to Word Embeddings
Ahmed El-Kishky | Frank Xu | Aston Zhang | Stephen Macke | Jiawei Han

Recent literature has shown a wide variety of benefits to mapping traditional one-hot representations of words and phrases to lower-dimensional real-valued vectors known as word embeddings. Traditionally, most word embedding algorithms treat each word as the finest meaningful semantic granularity and perform embedding by learning distinct embedding vectors for each word. Contrary to this line of thought, technical domains such as scientific and medical literature compose words from subword structures such as prefixes, suffixes, and root-words as well as compound words. Treating individual words as the finest-granularity unit discards meaningful shared semantic structure between words sharing substructures. This not only leads to poor embeddings for text corpora that have long-tail distributions, but also heuristic methods for handling out-of-vocabulary words. In this paper we propose SubwordMine, an entropy-based subword mining algorithm that is fast, unsupervised, and fully data-driven. We show that this allows for great cross-domain performance in identifying semantically meaningful subwords. We then investigate utilizing the mined subwords within the FastText embedding model and compare performance of the learned representations in a downstream language modeling task.

pdf bib
Addressing Low-Resource Scenarios with Character-aware Embeddings
Sean Papay | Sebastian Padó | Ngoc Thang Vu

Most modern approaches to computing word embeddings assume the availability of text corpora with billions of words. In this paper, we explore a setup where only corpora with millions of words are available, and many words in any new text are out of vocabulary. This setup is both of practical interests modeling the situation for specific domains and low-resource languages and of psycholinguistic interest, since it corresponds much more closely to the actual experiences and challenges of human language learning and use. We compare standard skip-gram word embeddings with character-based embeddings on word relatedness prediction. Skip-grams excel on large corpora, while character-based embeddings do well on small corpora generally and rare and complex words specifically. The models can be combined easily.

pdf bib
Subword-level Composition Functions for Learning Word Embeddings
Bofang Li | Aleksandr Drozd | Tao Liu | Xiaoyong Du

Subword-level information is crucial for capturing the meaning and morphology of words, especially for out-of-vocabulary entries. We propose CNN- and RNN-based subword-level composition functions for learning word embeddings, and systematically compare them with popular word-level and subword-level models (Skip-Gram and FastText). Additionally, we propose a hybrid training scheme in which a pure subword-level model is trained jointly with a conventional word-level embedding model based on lookup-tables. This increases the fitness of all types of subword-level word embeddings ; the word-level embeddings can be discarded after training, leaving only compact subword-level representation with much smaller data volume. We evaluate these embeddings on a set of intrinsic and extrinsic tasks, showing that subword-level models have advantage on tasks related to morphology and datasets with high OOV rate, and can be combined with other types of embeddings.

pdf bib
Discovering Phonesthemes with Sparse Regularization
Nelson F. Liu | Gina-Anne Levow | Noah A. Smith

We introduce a simple method for extracting non-arbitrary form-meaning representations from a collection of semantic vectors. We treat the problem as one of feature selection for a model trained to predict word vectors from subword features. We apply this model to the problem of automatically discovering phonesthemes, which are submorphemic sound clusters that appear in words with similar meaning. Many of our model-predicted phonesthemes overlap with those proposed in the linguistics literature, and we validate our approach with human judgments.

pdf bib
Incorporating Subword Information into Matrix Factorization Word Embeddings
Alexandre Salle | Aline Villavicencio

The positive effect of adding subword information to word embeddings has been demonstrated for predictive models. In this paper we investigate whether similar benefits can also be derived from incorporating subwords into counting models. We evaluate the impact of different types of subwords (n-grams and unsupervised morphemes), with results confirming the importance of subword information in learning representations of rare and out-of-vocabulary words.

pdf bib
A Multi-Context Character Prediction Model for a Brain-Computer Interface
Shiran Dudy | Shaobin Xu | Steven Bedrick | David Smith

Brain-computer interfaces and other augmentative and alternative communication devices introduce language-modeing challenges distinct from other character-entry methods. In particular, the acquired signal of the EEG (electroencephalogram) signal is noisier, which, in turn, makes the user intent harder to decipher. In order to adapt to this condition, we propose to maintain ambiguous history for every time step, and to employ, apart from the character language model, word information to produce a more robust prediction system. We present preliminary results that compare this proposed Online-Context Language Model (OCLM) to current algorithms that are used in this type of setting. Evaluation on both perplexity and predictive accuracy demonstrates promising results when dealing with ambiguous histories in order to provide to the front end a distribution of the next character the user might type.