Ryan Cotterell


2021

pdf bib
A Cognitive Regularizer for Language Modeling
Jason Wei | Clara Meister | Ryan Cotterell
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The uniform information density (UID) hypothesis, which posits that speakers behaving optimally tend to distribute information uniformly across a linguistic signal, has gained traction in psycholinguistics as an explanation for certain syntactic, morphological, and prosodic choices. In this work, we explore whether the UID hypothesis can be operationalized as an inductive bias for statistical language modeling. Specifically, we augment the canonical MLE objective for training language models with a regularizer that encodes UID. In experiments on ten languages spanning five language families, we find that using UID regularization consistently improves perplexity in language models, having a larger effect when training data is limited. Moreover, via an analysis of generated sequences, we find that UID-regularized language models have other desirable properties, e.g., they generate text that is more lexically diverse. Our results not only suggest that UID is a reasonable inductive bias for language modeling, but also provide an alternative validation of the UID hypothesis using modern-day NLP tools.

pdf bib
Language Model Evaluation Beyond Perplexity
Clara Meister | Ryan Cotterell
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We propose an alternate approach to quantifying how well language models learn natural language : we ask how well they match the statistical tendencies of natural language. To answer this question, we analyze whether text generated from language models exhibits the statistical tendencies present in the human-generated text on which they were trained. We provide a frameworkpaired with significance testsfor evaluating the fit of language models to these trends. We find that neural language models appear to learn only a subset of the tendencies considered, but align much more closely with empirical trends than proposed theoretical distributions (when present). Further, the fit to different distributions is highly-dependent on both model architecture and generation strategy. As concrete examples, text generated under the nucleus sampling scheme adheres more closely to the typetoken relationship of natural language than text produced using standard ancestral sampling ; text from LSTMs reflects the natural language distributions over length, stopwords, and symbols surprisingly well.

pdf bib
Higher-order Derivatives of Weighted Finite-state Machines
Ran Zmigrod | Tim Vieira | Ryan Cotterell
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Weighted finite-state machines are a fundamental building block of NLP systems. They have withstood the test of timefrom their early use in noisy channel models in the 1990s up to modern-day neurally parameterized conditional random fields. This work examines the computation of higher-order derivatives with respect to the normalization constant for weighted finite-state machines. We provide a general algorithm for evaluating derivatives of all orders, which has not been previously described in the literature. In the case of second-order derivatives, our scheme runs in the optimal O(A2 N4) time where A is the alphabet size and N is the number of states. Our algorithm is significantly faster than prior algorithms. Additionally, our approach leads to a significantly faster algorithm for computing second-order expectations, such as covariance matrices and gradients of first-order expectations.

pdf bib
Disambiguatory Signals are Stronger in Word-initial Positions
Tiago Pimentel | Ryan Cotterell | Brian Roark
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Psycholinguistic studies of human word processing and lexical access provide ample evidence of the preferred nature of word-initial versus word-final segments, e.g., in terms of attention paid by listeners (greater) or the likelihood of reduction by speakers (lower). This has led to the conjectureas in Wedel et al. (2019b), but common elsewherethat languages have evolved to provide more information earlier in words than later. Information-theoretic methods to establish such tendencies in lexicons have suffered from several methodological shortcomings that leave open the question of whether this high word-initial informativeness is actually a property of the lexicon or simply an artefact of the incremental nature of recognition. In this paper, we point out the confounds in existing methods for comparing the informativeness of segments early in the word versus later in the word, and present several new measures that avoid these confounds. When controlling for these confounds, we still find evidence across hundreds of languages that indeed there is a cross-linguistic tendency to front-load information in words.

pdf bib
Searching for Search Errors in Neural Morphological Inflection
Martina Forster | Clara Meister | Ryan Cotterell
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Neural sequence-to-sequence models are currently the predominant choice for language generation tasks. Yet, on word-level tasks, exact inference of these models reveals the empty string is often the global optimum. Prior works have speculated this phenomenon is a result of the inadequacy of neural models for language generation. However, in the case of morphological inflection, we find that the empty string is almost never the most probable solution under the model. Further, greedy search often finds the global optimum. These observations suggest that the poor calibration of many neural models may stem from characteristics of a specific subset of tasks rather than general ill-suitedness of such models for language generation.

pdf bib
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
Garrett Nicolai | Kyle Gorman | Ryan Cotterell
Proceedings of the 18th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

pdf bib
Conditional Poisson Stochastic BeamsPoisson Stochastic Beams
Clara Meister | Afra Amini | Tim Vieira | Ryan Cotterell
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Beam search is the default decoding strategy for many sequence generation tasks in NLP. The set of approximate K-best items returned by the algorithm is a useful summary of the distribution for many applications ; however, the candidates typically exhibit high overlap and may give a highly biased estimate for expectations under our model. These problems can be addressed by instead using stochastic decoding strategies. In this work, we propose a new method for turning beam search into a stochastic process : Conditional Poisson stochastic beam search. Rather than taking the maximizing set at each iteration, we sample K candidates without replacement according to the conditional Poisson sampling design. We view this as a more natural alternative to Kool et al. (2019)’s stochastic beam search (SBS). Furthermore, we show how samples generated under the CPSBS design can be used to build consistent estimators and sample diverse sets from sequence models. In our experiments, we observe CPSBS produces lower variance and more efficient estimators than SBS, even showing improvements in high entropy settings.

pdf bib
Revisiting the Uniform Information Density HypothesisUniform Information Density Hypothesis
Clara Meister | Tiago Pimentel | Patrick Haller | Lena Jäger | Ryan Cotterell | Roger Levy
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

The uniform information density (UID) hypothesis posits a preference among language users for utterances structured such that information is distributed uniformly across a signal. While its implications on language production have been well explored, the hypothesis potentially makes predictions about language comprehension and linguistic acceptability as well. Further, it is unclear how uniformity in a linguistic signalor lack thereofshould be measured, and over which linguistic unit, e.g., the sentence or language level, this uniformity should hold. Here we investigate these facets of the UID hypothesis using reading time and acceptability data. While our reading time results are generally consistent with previous work, they are also consistent with a weakly super-linear effect of surprisal, which would be compatible with UID’s predictions. For acceptability judgments, we find clearer evidence that non-uniformity in information density is predictive of lower acceptability. We then explore multiple operationalizations of UID, motivated by different interpretations of the original hypothesis, and analyze the scope over which the pressure towards uniformity is exerted. The explanatory power of a subset of the proposed operationalizations suggests that the strongest trend may be a regression towards a mean surprisal across the language, rather than the phrase, sentence, or documenta finding that supports a typical interpretation of UID, namely that it is the byproduct of language users maximizing the use of a (hypothetical) communication channel.

pdf bib
A Bayesian Framework for Information-Theoretic ProbingBayesian Framework for Information-Theoretic Probing
Tiago Pimentel | Ryan Cotterell
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Pimentel et al. (2020) recently analysed probing from an information-theoretic perspective. They argue that probing should be seen as approximating a mutual information. This led to the rather unintuitive conclusion that representations encode exactly the same information about a target task as the original sentences. The mutual information, however, assumes the true probability distribution of a pair of random variables is known, leading to unintuitive results in settings where it is not. This paper proposes a new framework to measure what we term Bayesian mutual information, which analyses information from the perspective of Bayesian agentsallowing for more intuitive findings in scenarios with finite data. For instance, under Bayesian MI we have that data can add information, processing can help, and information can hurt, which makes it more intuitive for machine learning applications. Finally, we apply our framework to probing where we believe Bayesian mutual information naturally operationalises ease of extraction by explicitly limiting the available background knowledge to solve a task.

pdf bib
Efficient Sampling of Dependency Structure
Ran Zmigrod | Tim Vieira | Ryan Cotterell
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Probabilistic distributions over spanning trees in directed graphs are a fundamental model of dependency structure in natural language processing, syntactic dependency trees. In NLP, dependency trees often have an additional root constraint : only one edge may emanate from the root. However, no sampling algorithm has been presented in the literature to account for this additional constraint. In this paper, we adapt two spanning tree sampling algorithms to faithfully sample dependency trees from a graph subject to the root constraint. Wilson (1996 (’s sampling algorithm has a running time of O(H) where H is the mean hitting time of the graph. Colbourn (1996)’s sampling algorithm has a running time of O(N3), which is often greater than the mean hitting time of a directed graph. Additionally, we build upon Colbourn’s algorithm and present a novel extension that can sample K trees without replacement in O(K N3 + K2 N) time. To the best of our knowledge, no algorithm has been given for sampling spanning trees without replacement from a directed graph.

pdf bib
Proceedings of the Third Workshop on Computational Typology and Multilingual NLP
Ekaterina Vylomova | Elizabeth Salesky | Sabrina Mielke | Gabriella Lapesa | Ritesh Kumar | Harald Hammarström | Ivan Vulić | Anna Korhonen | Roi Reichart | Edoardo Maria Ponti | Ryan Cotterell
Proceedings of the Third Workshop on Computational Typology and Multilingual NLP

pdf bib
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Kristina Toutanova | Anna Rumshisky | Luke Zettlemoyer | Dilek Hakkani-Tur | Iz Beltagy | Steven Bethard | Ryan Cotterell | Tanmoy Chakraborty | Yichao Zhou
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

pdf bib
A Non-Linear Structural Probe
Jennifer C. White | Tiago Pimentel | Naomi Saphra | Ryan Cotterell
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Probes are models devised to investigate the encoding of knowledgee.g. syntactic structurein contextual representations. Probes are often designed for simplicity, which has led to restrictions on probe design that may not allow for the full exploitation of the structure of encoded information ; one such restriction is linearity. We examine the case of a structural probe (Hewitt and Manning, 2019), which aims to investigate the encoding of syntactic structure in contextual representations through learning only linear transformations. By observing that the structural probe learns a metric, we are able to kernelize it and develop a novel non-linear variant with an identical number of parameters. We test on 6 languages and find that the radial-basis function (RBF) kernel, in conjunction with regularization, achieves a statistically significant improvement over the baseline in all languagesimplying that at least part of the syntactic knowledge is encoded non-linearly. We conclude by discussing how the RBF kernel resembles BERT’s self-attention layers and speculate that this resemblance leads to the RBF-based probe’s stronger performance.

pdf bib
Finding Concept-specific Biases in FormMeaning Associations
Tiago Pimentel | Brian Roark | Søren Wichmann | Ryan Cotterell | Damián Blasi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

This work presents an information-theoretic operationalisation of cross-linguistic non-arbitrariness. It is not a new idea that there are small, cross-linguistic associations between the forms and meanings of words. For instance, it has been claimed (Blasi et al., 2016) that the word for tongue is more likely than chance to contain the phone [ l ]. By controlling for the influence of language family and geographic proximity within a very large concept-aligned, cross-lingual lexicon, we extend methods previously used to detect within language non-arbitrariness (Pimentel et al., 2019) to measure cross-linguistic associations. We find that there is a significant effect of non-arbitrariness, but it is unsurprisingly small (less than 0.5 % on average according to our information-theoretic estimate). We also provide a concept-level analysis which shows that a quarter of the concepts considered in our work exhibit a significant level of cross-linguistic non-arbitrariness. In sum, the paper provides new methods to detect cross-linguistic associations at scale, and confirms their effects are minor.

2020

pdf bib
Exploring the Linear Subspace Hypothesis in Gender Bias Mitigation
Francisco Vargas | Ryan Cotterell
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Bolukbasi et al. (2016) presents one of the first gender bias mitigation techniques for word embeddings. Their method takes pre-trained word embeddings as input and attempts to isolate a linear subspace that captures most of the gender bias in the embeddings. As judged by an analogical evaluation task, their method virtually eliminates gender bias in the embeddings. However, an implicit and untested assumption of their method is that the bias subspace is actually linear. In this work, we generalize their method to a kernelized, non-linear version. We take inspiration from kernel principal component analysis and derive a non-linear bias isolation technique. We discuss and overcome some of the practical drawbacks of our method for non-linear gender bias mitigation in word embeddings and analyze empirically whether the bias subspace is actually linear. Our analysis shows that gender bias is in fact well captured by a linear subspace, justifying the assumption of Bolukbasi et al.

pdf bib
Information-Theoretic Probing for Linguistic Structure
Tiago Pimentel | Josef Valvoda | Rowan Hall Maudslay | Ran Zmigrod | Adina Williams | Ryan Cotterell
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

The success of neural networks on a diverse set of NLP tasks has led researchers to question how much these networks actually know about natural language. Probes are a natural way of assessing this. When probing, a researcher chooses a linguistic task and trains a supervised model to predict annotations in that linguistic task from the network’s learned representations. If the probe does well, the researcher may conclude that the representations encode knowledge related to the task. A commonly held belief is that using simpler models as probes is better ; the logic is that simpler models will identify linguistic structure, but not learn the task itself. We propose an information-theoretic operationalization of probing as estimating mutual information that contradicts this received wisdom : one should always select the highest performing probe one can, even if it is more complex, since it will result in a tighter estimate, and thus reveal more of the linguistic information inherent in the representation. The experimental portion of our paper focuses on empirically estimating the mutual information between a linguistic property and BERT, comparing these estimates to several baselines. We evaluate on a set of ten typologically diverse languages often underrepresented in NLP researchplus Englishtotalling eleven languages. Our implementation is available in https://github.com/rycolab/info-theoretic-probing.

pdf bib
The Paradigm Discovery Problem
Alexander Erdmann | Micha Elsner | Shijie Wu | Ryan Cotterell | Nizar Habash
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

This work treats the paradigm discovery problem (PDP), the task of learning an inflectional morphological system from unannotated sentences. We formalize the PDP and develop evaluation metrics for judging systems. Using currently available resources, we construct datasets for the task. We also devise a heuristic benchmark for the PDP and report empirical results on five diverse languages. Our benchmark system first makes use of word embeddings and string similarity to cluster forms by cell and by paradigm. Then, we bootstrap a neural transducer on top of the clustered data to predict words to realize the empty paradigm slots. An error analysis of our system suggests clustering by cell across different inflection classes is the most pressing challenge for future work.

pdf bib
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology
Garrett Nicolai | Kyle Gorman | Ryan Cotterell
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

pdf bib
Proceedings of the Second Workshop on Computational Research in Linguistic Typology
Ekaterina Vylomova | Edoardo M. Ponti | Eitan Grossman | Arya D. McCarthy | Yevgeni Berzak | Haim Dubossarsky | Ivan Vulić | Roi Reichart | Anna Korhonen | Ryan Cotterell
Proceedings of the Second Workshop on Computational Research in Linguistic Typology

pdf bib
SIGTYP 2020 Shared Task : Prediction of Typological FeaturesSIGTYP 2020 Shared Task: Prediction of Typological Features
Johannes Bjerva | Elizabeth Salesky | Sabrina J. Mielke | Aditi Chaudhary | Giuseppe G. A. Celano | Edoardo Maria Ponti | Ekaterina Vylomova | Ryan Cotterell | Isabelle Augenstein
Proceedings of the Second Workshop on Computational Research in Linguistic Typology

Typological knowledge bases (KBs) such as WALS (Dryer and Haspelmath, 2013) contain information about linguistic properties of the world’s languages. They have been shown to be useful for downstream applications, including cross-lingual transfer learning and linguistic probing. A major drawback hampering broader adoption of typological KBs is that they are sparsely populated, in the sense that most languages only have annotations for some features, and skewed, in that few features have wide coverage. As typological features often correlate with one another, it is possible to predict them and thus automatically populate typological KBs, which is also the focus of this shared task. Overall, the task attracted 8 submissions from 5 teams, out of which the most successful methods make use of such feature correlations. However, our error analysis reveals that even the strongest submitted systems struggle with predicting feature values for languages where few features are known.

2019

pdf bib
Towards Zero-shot Language Modeling
Edoardo Maria Ponti | Ivan Vulić | Ryan Cotterell | Roi Reichart | Anna Korhonen
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Can we construct a neural language model which is inductively biased towards learning human language? Motivated by this question, we aim at constructing an informative prior for held-out languages on the task of character-level, open-vocabulary language modelling. We obtain this prior as the posterior over network weights conditioned on the data from a sample of training languages, which is approximated through Laplace’s method. Based on a large and diverse sample of languages, the use of our prior outperforms baseline models with an uninformative prior in both zero-shot and few-shot settings, showing that the prior is imbued with universal linguistic knowledge. Moreover, we harness broad language-specific information available for most languages of the world, i.e., features from typological databases, as distant supervision for held-out languages. We explore several language modelling conditioning techniques, including concatenation and meta-networks for parameter generation. They appear beneficial in the few-shot setting, but ineffective in the zero-shot setting. Since the paucity of even plain digital text affects the majority of the world’s languages, we hope that these insights will broaden the scope of applications for language technology.

pdf bib
Examining Gender Bias in Languages with Grammatical Gender
Pei Zhou | Weijia Shi | Jieyu Zhao | Kuan-Hao Huang | Muhao Chen | Ryan Cotterell | Kai-Wei Chang
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Recent studies have shown that word embeddings exhibit gender bias inherited from the training corpora. However, most studies to date have focused on quantifying and mitigating such bias only in English. These analyses can not be directly extended to languages that exhibit morphological agreement on gender, such as Spanish and French. In this paper, we propose new metrics for evaluating gender bias in word embeddings of these languages and further demonstrate evidence of gender bias in bilingual embeddings which align these languages with English. Finally, we extend an existing approach to mitigate gender bias in word embedding of these languages under both monolingual and bilingual settings. Experiments on modified Word Embedding Association Test, word similarity, word translation, and word pair translation tasks show that the proposed approaches can effectively reduce the gender bias while preserving the utility of the original embeddings.

pdf bib
Quantifying the Semantic Core of Gender Systems
Adina Williams | Damian Blasi | Lawrence Wolf-Sonkin | Hanna Wallach | Ryan Cotterell
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Many of the world’s languages employ grammatical gender on the lexeme. For instance, in Spanish, house casa is feminine, whereas the word for paper papel is masculine. To a speaker of a genderless language, this categorization seems to exist with neither rhyme nor reason. But, is the association of nouns to gender classes truly arbitrary? In this work, we present the first large-scale investigation of the arbitrariness of gender assignment that uses canonical correlation analysis as a method for correlating the gender of inanimate nouns with their lexical semantic meaning. We find that the gender systems of 18 languages exhibit a significant correlation with an externally grounded definition of lexical semantics.

bib
Rethinking Phonotactic Complexity
Tiago Pimentel | Brian Roark | Ryan Cotterell
Proceedings of the 2019 Workshop on Widening NLP

In this work, we propose the use of phone-level language models to estimate phonotactic complexity—measured in bits per phoneme—which makes cross-linguistic comparison straightforward. We compare the entropy across languages using this simple measure, gaining insight on how complex different language’s phonotactics are. Finally, we show a very strong negative correlation between phonotactic complexity and the average length of words—Spearman rho=-0.744—when analysing a collection of 106 languages with 1016 basic concepts each.

pdf bib
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology
Garrett Nicolai | Ryan Cotterell
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

pdf bib
The SIGMORPHON 2019 Shared Task : Morphological Analysis in Context and Cross-Lingual Transfer for InflectionSIGMORPHON 2019 Shared Task: Morphological Analysis in Context and Cross-Lingual Transfer for Inflection
Arya D. McCarthy | Ekaterina Vylomova | Shijie Wu | Chaitanya Malaviya | Lawrence Wolf-Sonkin | Garrett Nicolai | Christo Kirov | Miikka Silfverberg | Sabrina J. Mielke | Jeffrey Heinz | Ryan Cotterell | Mans Hulden
Proceedings of the 16th Workshop on Computational Research in Phonetics, Phonology, and Morphology

The SIGMORPHON 2019 shared task on cross-lingual transfer and contextual analysis in morphology examined transfer learning of inflection between 100 language pairs, as well as contextual lemmatization and morphosyntactic description in 66 languages. The first task evolves past years’ inflection tasks by examining transfer of morphological inflection knowledge from a high-resource language to a low-resource language. This year also presents a new second challenge on lemmatization and morphological feature analysis in context. All submissions featured a neural component and built on either this year’s strong baselines or highly ranked systems from previous years’ shared tasks. Every participating team improved in accuracy over the baselines for the inflection task (though not Levenshtein distance), and every team in the contextual analysis task improved on both state-of-the-art neural and non-neural baselines.

pdf bib
Proceedings of TyP-NLP: The First Workshop on Typology for Polyglot NLP
Haim Dubossarsky | Arya D. McCarthy | Edoardo Maria Ponti | Ivan Vulić | Ekaterina Vylomova | Yevgeni Berzak | Ryan Cotterell | Manaal Faruqui | Anna Korhonen | Roi Reichart
Proceedings of TyP-NLP: The First Workshop on Typology for Polyglot NLP

pdf bib
Gender Bias in Contextualized Word Embeddings
Jieyu Zhao | Tianlu Wang | Mark Yatskar | Ryan Cotterell | Vicente Ordonez | Kai-Wei Chang
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

In this paper, we quantify, analyze and mitigate gender bias exhibited in ELMo’s contextualized word vectors. First, we conduct several intrinsic analyses and find that (1) training data for ELMo contains significantly more male than female entities, (2) the trained ELMo embeddings systematically encode gender information and (3) ELMo unequally encodes gender information about male and female entities. Then, we show that a state-of-the-art coreference system that depends on ELMo inherits its bias and demonstrates significant bias on the WinoBias probing corpus. Finally, we explore two methods to mitigate such gender bias and show that the bias demonstrated on WinoBias can be eliminated.

pdf bib
Combining Sentiment Lexica with a Multi-View Variational AutoencoderCombining Sentiment Lexica with a Multi-View Variational Autoencoder
Alexander Miserlis Hoyle | Lawrence Wolf-Sonkin | Hanna Wallach | Ryan Cotterell | Isabelle Augenstein
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

When assigning quantitative labels to a dataset, different methodologies may rely on different scales. In particular, when assigning polarities to words in a sentiment lexicon, annotators may use binary, categorical, or continuous labels. Naturally, it is of interest to unify these labels from disparate scales to both achieve maximal coverage over words and to create a single, more robust sentiment lexicon while retaining scale coherence. We introduce a generative model of sentiment lexica to combine disparate scales into a common latent representation. We realize this model with a novel multi-view variational autoencoder (VAE), called SentiVAE. We evaluate our approach via a downstream text classification task involving nine English-Language sentiment analysis datasets ; our representation outperforms six individual sentiment lexica, as well as a straightforward combination thereof.

pdf bib
A Simple Joint Model for Improved Contextual Neural Lemmatization
Chaitanya Malaviya | Shijie Wu | Ryan Cotterell
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

English verbs have multiple forms. For instance, talk may also appear as talks, talked or talking, depending on the context. The NLP task of lemmatization seeks to map these diverse forms back to a canonical one, known as the lemma. We present a simple joint neural model for lemmatization and morphological tagging that achieves state-of-the-art results on 20 languages from the Universal Dependencies corpora. Our paper describes the model in addition to training and decoding procedures. Error analysis indicates that joint morphological tagging and lemmatization is especially helpful in low-resource lemmatization and languages that display a larger degree of morphological complexity.

pdf bib
Contextualization of Morphological Inflection
Ekaterina Vylomova | Ryan Cotterell | Trevor Cohn | Timothy Baldwin | Jason Eisner
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Critical to natural language generation is the production of correctly inflected text. In this paper, we isolate the task of predicting a fully inflected sentence from its partially lemmatized version. Unlike traditional morphological inflection or surface realization, our task input does not provide gold tags that specify what morphological features to realize on each lemmatized word ; rather, such features must be inferred from sentential context. We develop a neural hybrid graphical model that explicitly reconstructs morphological features before predicting the inflected forms, and compare this to a system that directly predicts the inflected forms without relying on any morphological annotation. We experiment on several typologically diverse languages from the Universal Dependencies treebanks, showing the utility of incorporating linguistically-motivated latent variables into NLP models.

pdf bib
On the Idiosyncrasies of the Mandarin Chinese Classifier SystemMandarin Chinese Classifier System
Shijia Liu | Hongyuan Mei | Adina Williams | Ryan Cotterell
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

While idiosyncrasies of the Chinese classifier system have been a richly studied topic among linguists (Adams and Conklin, 1973 ; Erbaugh, 1986 ; Lakoff, 1986), not much work has been done to quantify them with statistical methods. In this paper, we introduce an information-theoretic approach to measuring idiosyncrasy ; we examine how much the uncertainty in Mandarin Chinese classifiers can be reduced by knowing semantic information about the nouns that the classifiers modify. Using the empirical distribution of classifiers from the parsed Chinese Gigaword corpus (Graff et al., 2005), we compute the mutual information (in bits) between the distribution over classifiers and distributions over other linguistic quantities. We investigate whether semantic classes of nouns and adjectives differ in how much they reduce uncertainty in classifier choice, and find that it is not fully idiosyncratic ; while there are no obvious trends for the majority of semantic classes, shape nouns reduce uncertainty in classifier choice the most.

pdf bib
Unsupervised Discovery of Gendered Language through Latent-Variable Modeling
Alexander Miserlis Hoyle | Lawrence Wolf-Sonkin | Hanna Wallach | Isabelle Augenstein | Ryan Cotterell
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Studying the ways in which language is gendered has long been an area of interest in sociolinguistics. Studies have explored, for example, the speech of male and female characters in film and the language used to describe male and female politicians. In this paper, we aim not to merely study this phenomenon qualitatively, but instead to quantify the degree to which the language used to describe men and women is different and, moreover, different in a positive or negative way. To that end, we introduce a generative latent-variable model that jointly represents adjective (or verb) choice, with its sentiment, given the natural gender of a head (or dependent) noun. We find that there are significant differences between descriptions of male and female nouns and that these differences align with common gender stereotypes : Positive adjectives used to describe women are more often related to their bodies than adjectives used to describe men.

pdf bib
Weird Inflects but OK : Making Sense of Morphological Generation ErrorsOK: Making Sense of Morphological Generation Errors
Kyle Gorman | Arya D. McCarthy | Ryan Cotterell | Ekaterina Vylomova | Miikka Silfverberg | Magdalena Markowska
Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL)

We conduct a manual error analysis of the CoNLL-SIGMORPHON Shared Task on Morphological Reinflection. This task involves natural language generation : systems are given a word in citation form (e.g., hug) and asked to produce the corresponding inflected form (e.g., the simple past hugged). We propose an error taxonomy and use it to annotate errors made by the top two systems across twelve languages. Many of the observed errors are related to inflectional patterns sensitive to inherent linguistic properties such as animacy or affect ; many others are failures to predict truly unpredictable inflectional behaviors. We also find nearly one quarter of the residual errors reflect errors in the gold data.

2018

pdf bib
A Discriminative Latent-Variable Model for Bilingual Lexicon Induction
Sebastian Ruder | Ryan Cotterell | Yova Kementchedjhieva | Anders Søgaard
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

We introduce a novel discriminative latent-variable model for the task of bilingual lexicon induction. Our model combines the bipartite matching dictionary prior of Haghighi et al. (2008) with a state-of-the-art embedding-based approach. To train the model, we derive an efficient Viterbi EM algorithm. We provide empirical improvements on six language pairs under two metrics and show that the prior theoretically and empirically helps to mitigate the hubness problem. We also demonstrate how previous work may be viewed as a similarly fashioned latent-variable model, albeit with a different prior.

pdf bib
Hard Non-Monotonic Attention for Character-Level Transduction
Shijie Wu | Pamela Shapiro | Ryan Cotterell
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Character-level string-to-string transduction is an important component of various NLP tasks. The goal is to map an input string to an output string, where the strings may be of different lengths and have characters taken from different alphabets. Recent approaches have used sequence-to-sequence models with an attention mechanism to learn which parts of the input string the model should focus on during the generation of the output string. Both soft attention and hard monotonic attention have been used, but hard non-monotonic attention has only been used in other sequence modeling tasks and has required a stochastic approximation to compute the gradient. In this work, we introduce an exact, polynomial-time algorithm for marginalizing over the exponential number of non-monotonic alignments between two strings, showing that hard attention models can be viewed as neural reparameterizations of the classical IBM Model 1. We compare soft and hard non-monotonic attention experimentally and find that the exact algorithm significantly improves performance over the stochastic approximation and outperforms soft attention.

pdf bib
Marrying Universal Dependencies and Universal MorphologyUniversal Dependencies and Universal Morphology
Arya D. McCarthy | Miikka Silfverberg | Ryan Cotterell | Mans Hulden | David Yarowsky
Proceedings of the Second Workshop on Universal Dependencies (UDW 2018)

The Universal Dependencies (UD) and Universal Morphology (UniMorph) projects each present schemata for annotating the morphosyntactic details of language. Each project also provides corpora of annotated text in many languagesUD at the token level and UniMorph at the type level. As each corpus is built by different annotators, language-specific decisions hinder the goal of universal schemata. With compatibility of tags, each project’s annotations could be used to validate the other’s. Additionally, the availability of both type- and token-level resources would be a boon to tasks such as parsing and homograph disambiguation. To ease this interoperability, we present a deterministic mapping from Universal Dependencies v2 features into the UniMorph schema. We validate our approach by lookup in the UniMorph corpora and find a macro-average of 64.13 % recall. We also note incompatibilities due to paucity of data on either side. Finally, we present a critical evaluation of the foundations, strengths, and weaknesses of the two annotation projects.

pdf bib
A Deep Generative Model of Vowel Formant Typology
Ryan Cotterell | Jason Eisner
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

What makes some types of languages more probable than others? For instance, we know that almost all spoken languages contain the vowel phoneme /i/ ; why should that be? The field of linguistic typology seeks to answer these questions and, thereby, divine the mechanisms that underlie human language. In our work, we tackle the problem of vowel system typology, i.e., we propose a generative probability model of which vowels a language contains. In contrast to previous work, we work directly with the acoustic informationthe first two formant valuesrather than modeling discrete sets of symbols from the international phonetic alphabet. We develop a novel generative probability model and report results on over 200 languages.

pdf bib
Proceedings of the CoNLLSIGMORPHON 2018 Shared Task: Universal Morphological Reinflection
Mans Hulden | Ryan Cotterell
Proceedings of the CoNLL–SIGMORPHON 2018 Shared Task: Universal Morphological Reinflection

pdf bib
A Structured Variational Autoencoder for Contextual Morphological Inflection
Lawrence Wolf-Sonkin | Jason Naradowsky | Sabrina J. Mielke | Ryan Cotterell
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Statistical morphological inflectors are typically trained on fully supervised, type-level data. One remaining open research question is the following : How can we effectively exploit raw, token-level data to improve their performance? To this end, we introduce a novel generative latent-variable model for the semi-supervised learning of inflection generation. To enable posterior inference over the latent variables, we derive an efficient variational inference procedure based on the wake-sleep algorithm. We experiment on 23 languages, using the Universal Dependencies corpora in a simulated low-resource setting, and find improvements of over 10 % absolute accuracy in some cases.

2017

pdf bib
Low-Resource Named Entity Recognition with Cross-lingual, Character-Level Neural Conditional Random Fields
Ryan Cotterell | Kevin Duh
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

Low-resource named entity recognition is still an open problem in NLP. Most state-of-the-art systems require tens of thousands of annotated sentences in order to obtain high performance. However, for most of the world’s languages it is unfeasible to obtain such annotation. In this paper, we present a transfer learning scheme, whereby we train character-level neural CRFs to predict named entities for both high-resource languages and low-resource languages jointly. Learning character representations for multiple related languages allows knowledge transfer from the high-resource languages to the low-resource ones, improving F1 by up to 9.8 points.

pdf bib
Probabilistic Typology : Deep Generative Models of Vowel Inventories
Ryan Cotterell | Jason Eisner
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Linguistic typology studies the range of structures present in human language. The main goal of the field is to discover which sets of possible phenomena are universal, and which are merely frequent. For example, all languages have vowels, while mostbut not alllanguages have an /u/ sound. In this paper we present the first probabilistic treatment of a basic question in phonological typology : What makes a natural vowel inventory? We introduce a series of deep stochastic point processes, and contrast them with previous computational, simulation-based approaches. We provide a comprehensive suite of experiments on over 200 distinct languages.

pdf bib
One-Shot Neural Cross-Lingual Transfer for Paradigm Completion
Katharina Kann | Ryan Cotterell | Hinrich Schütze
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a novel cross-lingual transfer method for paradigm completion, the task of mapping a lemma to its inflected forms, using a neural encoder-decoder model, the state of the art for the monolingual task. We use labeled data from a high-resource language to increase performance on a low-resource language. In experiments on 21 language pairs from four different language families, we obtain up to 58 % higher accuracy than without transfer and show that even zero-shot and one-shot learning are possible. We further find that the degree of language relatedness strongly influences the ability to transfer morphological knowledge.

pdf bib
Frame-Based Continuous Lexical Semantics through Exponential Family Tensor Factorization and Semantic Proto-Roles
Francis Ferraro | Adam Poliak | Ryan Cotterell | Benjamin Van Durme
Proceedings of the 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017)

We study how different frame annotations complement one another when learning continuous lexical semantics. We learn the representations from a tensorized skip-gram model that consistently encodes syntactic-semantic content better, with multiple 10 % gains over baselines.

pdf bib
Paradigm Completion for Derivational Morphology
Ryan Cotterell | Ekaterina Vylomova | Huda Khayrallah | Christo Kirov | David Yarowsky
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

The generation of complex derived word forms has been an overlooked problem in NLP ; we fill this gap by applying neural sequence-to-sequence models to the task. We overview the theoretical motivation for a paradigmatic treatment of derivational morphology, and introduce the task of derivational paradigm completion as a parallel to inflectional paradigm completion. State-of-the-art neural models adapted from the inflection task are able to learn the range of derivation patterns, and outperform a non-neural baseline by 16.4 %. However, due to semantic, historical, and lexical considerations involved in derivational morphology, future work will be needed to achieve performance parity with inflection-generating systems.

pdf bib
Neural Multi-Source Morphological Reinflection
Katharina Kann | Ryan Cotterell | Hinrich Schütze
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We explore the task of multi-source morphological reinflection, which generalizes the standard, single-source version. The input consists of (i) a target tag and (ii) multiple pairs of source form and source tag for a lemma. The motivation is that it is beneficial to have access to more than one source form since different source forms can provide complementary information, e.g., different stems. We further present a novel extension to the encoder-decoder recurrent neural architecture, consisting of multiple encoders, to better solve the task. We show that our new architecture outperforms single-source reinflection models and publish our dataset for multi-source morphological reinflection to facilitate future research.

pdf bib
A Rich Morphological Tagger for English : Exploring the Cross-Linguistic Tradeoff Between Morphology and SyntaxEnglish: Exploring the Cross-Linguistic Tradeoff Between Morphology and Syntax
Christo Kirov | John Sylak-Glassman | Rebecca Knowles | Ryan Cotterell | Matt Post
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

A traditional claim in linguistics is that all human languages are equally expressiveable to convey the same wide range of meanings. Morphologically rich languages, such as Czech, rely on overt inflectional and derivational morphology to convey many semantic distinctions. Languages with comparatively limited morphology, such as English, should be able to accomplish the same using a combination of syntactic and contextual cues. We capitalize on this idea by training a tagger for English that uses syntactic features obtained by automatic parsing to recover complex morphological tags projected from Czech. The high accuracy of the resulting model provides quantitative confirmation of the underlying linguistic hypothesis of equal expressivity, and bodes well for future improvements in downstream HLT tasks including machine translation.

pdf bib
Context-Aware Prediction of Derivational Word-forms
Ekaterina Vylomova | Ryan Cotterell | Timothy Baldwin | Trevor Cohn
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

Derivational morphology is a fundamental and complex characteristic of language. In this paper we propose a new task of predicting the derivational form of a given base-form lemma that is appropriate for a given context. We present an encoder-decoder style neural network to produce a derived form character-by-character, based on its corresponding character-level representation of the base form and the context. We demonstrate that our model is able to generate valid context-sensitive derivations from known base forms, but is less accurate under lexicon agnostic setting.

pdf bib
Explaining and Generalizing Skip-Gram through Exponential Family Principal Component Analysis
Ryan Cotterell | Adam Poliak | Benjamin Van Durme | Jason Eisner
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

The popular skip-gram model induces word embeddings by exploiting the signal from word-context coocurrence. We offer a new interpretation of skip-gram based on exponential family PCA-a form of matrix factorization to generalize the skip-gram model to tensor factorization. In turn, this lets us train embeddings through richer higher-order coocurrences, e.g., triples that include positional information (to incorporate syntax) or morphological information (to share parameters across related words). We experiment on 40 languages and show our model improves upon skip-gram.

pdf bib
Morphological Analysis of the Dravidian Language FamilyDravidian Language Family
Arun Kumar | Ryan Cotterell | Lluís Padró | Antoni Oliver
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers

The Dravidian languages are one of the most widely spoken language families in the world, yet there are very few annotated resources available to NLP researchers. To remedy this, we create DravMorph, a corpus annotated for morphological segmentation and part-of-speech. Additionally, we exploit novel features and higher-order models to set state-of-the-art results on these corpora on both tasks, beating techniques proposed in the literature by as much as 4 points in segmentation F1.
Search
Co-authors