Katharina Kann


pdf bib
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)
John Ortega | Atul Kr. Ojha | Katharina Kann | Chao-Hong Liu
Proceedings of the 4th Workshop on Technologies for MT of Low Resource Languages (LoResMT2021)

pdf bib
CLiMP : A Benchmark for Chinese Language Model EvaluationCLiMP: A Benchmark for Chinese Language Model Evaluation
Beilei Xiang | Changbing Yang | Yu Li | Alex Warstadt | Katharina Kann
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Linguistically informed analyses of language models (LMs) contribute to the understanding and improvement of such models. Here, we introduce the corpus of Chinese linguistic minimal pairs (CLiMP) to investigate what knowledge Chinese LMs acquire. CLiMP consists of sets of 1000 minimal pairs (MPs) for 16 syntactic contrasts in Chinese, covering 9 major Chinese linguistic phenomena. The MPs are semi-automatically generated, and human agreement with the labels in CLiMP is 95.8 %. We evaluate 11 different LMs on CLiMP, covering n-grams, LSTMs, and Chinese BERT. We find that classifiernoun agreement and verb complement selection are the phenomena that models generally perform best at. However, models struggle the most with the ba construction, binding, and filler-gap dependencies. Overall, Chinese BERT achieves an 81.8 % average accuracy, while the performances of LSTMs and 5-grams are only moderately above chance level.

pdf bib
The World of an Octopus : How Reporting Bias Influences a Language Model’s Perception of ColorThe World of an Octopus: How Reporting Bias Influences a Language Model’s Perception of Color
Cory Paik | Stéphane Aroca-Ouellette | Alessandro Roncone | Katharina Kann
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Recent work has raised concerns about the inherent limitations of text-only pretraining. In this paper, we first demonstrate that reporting bias, the tendency of people to not state the obvious, is one of the causes of this limitation, and then investigate to what extent multimodal training can mitigate this issue. To accomplish this, we 1) generate the Color Dataset (CoDa), a dataset of human-perceived color distributions for 521 common objects ; 2) use CoDa to analyze and compare the color distribution found in text, the distribution captured by language models, and a human’s perception of color ; and 3) investigate the performance differences between text-only and multimodal models on CoDa. Our results show that the distribution of colors that a language model recovers correlates more strongly with the inaccurate distribution found in text than with the ground-truth, supporting the claim that reporting bias negatively impacts and inherently limits text-only training. We then demonstrate that multimodal models can leverage their visual training to mitigate these effects, providing a promising avenue for future research.

pdf bib
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas
Manuel Mager | Arturo Oncevay | Annette Rios | Ivan Vladimir Meza Ruiz | Alexis Palmer | Graham Neubig | Katharina Kann
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

pdf bib
Findings of the AmericasNLP 2021 Shared Task on Open Machine Translation for Indigenous Languages of the AmericasAmericasNLP 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas
Manuel Mager | Arturo Oncevay | Abteen Ebrahimi | John Ortega | Annette Rios | Angela Fan | Ximena Gutierrez-Vasques | Luis Chiruzzo | Gustavo Giménez-Lugo | Ricardo Ramos | Ivan Vladimir Meza Ruiz | Rolando Coto-Solano | Alexis Palmer | Elisabeth Mager-Hois | Vishrav Chaudhary | Graham Neubig | Ngoc Thang Vu | Katharina Kann
Proceedings of the First Workshop on Natural Language Processing for Indigenous Languages of the Americas

This paper presents the results of the 2021 Shared Task on Open Machine Translation for Indigenous Languages of the Americas. The shared task featured two independent tracks, and participants submitted machine translation systems for up to 10 indigenous languages. Overall, 8 teams participated with a total of 214 submissions. We provided training sets consisting of data collected from various sources, as well as manually translated sentences for the development and test sets. An official baseline trained on this data was also provided. Team submissions featured a variety of architectures, including both statistical and neural models, and for the majority of languages, many teams were able to considerably improve over the baseline. The best performing systems achieved 12.97 ChrF higher than baseline, when averaged across languages.


pdf bib
Self-Training for Unsupervised Parsing with PRPNPRPN
Anhad Mohananey | Katharina Kann | Samuel R. Bowman
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

Neural unsupervised parsing (UP) models learn to parse without access to syntactic annotations, while being optimized for another task like language modeling. In this work, we propose self-training for neural UP models : we leverage aggregated annotations predicted by copies of our model as supervision for future copies. To be able to use our model’s predictions during training, we extend a recent neural UP architecture, the PRPN (Shen et al., 2018a), such that it can be trained in a semi-supervised fashion. We then add examples with parses predicted by our model to our unlabeled UP training data. Our self-trained model outperforms the PRPN by 8.1 % F1 and the previous state of the art by 1.6 % F1. In addition, we show that our architecture can also be helpful for semi-supervised parsing in ultra-low-resource settings.

pdf bib
Acrostic Poem Generation
Rajat Agarwal | Katharina Kann
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

We propose a new task in the area of computational creativity : acrostic poem generation in English. Acrostic poems are poems that contain a hidden message ; typically, the first letter of each line spells out a word or short phrase. We define the task as a generation task with multiple constraints : given an input word, 1) the initial letters of each line should spell out the provided word, 2) the poem’s semantics should also relate to it, and 3) the poem should conform to a rhyming scheme. We further provide a baseline model for the task, which consists of a conditional neural language model in combination with a neural rhyming model. Since no dedicated datasets for acrostic poem generation exist, we create training data for our task by first training a separate topic prediction model on a small set of topic-annotated poems and then predicting topics for additional poems. Our experiments show that the acrostic poems generated by our baseline are received well by humans and do not lose much quality due to the additional constraints. Last, we confirm that poems generated by our model are indeed closely related to the provided prompts, and that pretraining on Wikipedia can boost performance.

pdf bib
Unsupervised Morphological Paradigm Completion
Huiming Jin | Liwei Cai | Yihui Peng | Chen Xia | Arya McCarthy | Katharina Kann
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We propose the task of unsupervised morphological paradigm completion. Given only raw text and a lemma list, the task consists of generating the morphological paradigms, i.e., all inflected forms, of the lemmas. From a natural language processing (NLP) perspective, this is a challenging unsupervised task, and high-performing systems have the potential to improve tools for low-resource languages or to assist linguistic annotators. From a cognitive science perspective, this can shed light on how children acquire morphological knowledge. We further introduce a system for the task, which generates morphological paradigms via the following steps : (i) EDIT TREE retrieval, (ii) additional lemma retrieval, (iii) paradigm size discovery, and (iv) inflection generation. We perform an evaluation on 14 typologically diverse languages. Our system outperforms trivial baselines with ease and, for some languages, even obtains a higher accuracy than minimally supervised systems.

pdf bib
The IMSCUBoulder System for the SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm CompletionIMSCUBoulder System for the SIGMORPHON 2020 Shared Task on Unsupervised Morphological Paradigm Completion
Manuel Mager | Katharina Kann
Proceedings of the 17th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology

In this paper, we present the systems of the University of Stuttgart IMS and the University of Colorado Boulder (IMSCUBoulder) for SIGMORPHON 2020 Task 2 on unsupervised morphological paradigm completion (Kann et al., 2020). The task consists of generating the morphological paradigms of a set of lemmas, given only the lemmas themselves and unlabeled text. Our proposed system is a modified version of the baseline introduced together with the task. In particular, we experiment with substituting the inflection generation component with an LSTM sequence-to-sequence model and an LSTM pointer-generator network. Our pointer-generator system obtains the best score of all seven submitted systems on average over all languages, and outperforms the official baseline, which was best overall, on Bulgarian and Kannada.

Can Wikipedia Categories Improve Masked Language Model Pretraining?
Diksha Meghwal | Katharina Kann | Iacer Calixto | Stanislaw Jastrzebski
Proceedings of the The Fourth Widening Natural Language Processing Workshop

Pretrained language models have obtained impressive results for a large set of natural language understanding tasks. However, training these models is computationally expensive and requires huge amounts of data. Thus, it would be desirable to automatically detect groups of more or less important examples. Here, we investigate if we can leverage sources of information which are commonly overlooked, Wikipedia categories as listed in DBPedia, to identify useful or harmful data points during pretraining. We define an experimental setup in which we analyze correlations between language model perplexity on specific clusters and downstream NLP task performances during pretraining. Our experiments show that Wikipedia categories are not a good indicator of the importance of specific sentences for pretraining.


pdf bib
Towards Realistic Practices In Low-Resource Natural Language Processing : The Development Set
Katharina Kann | Kyunghyun Cho | Samuel R. Bowman
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Development sets are impractical to obtain for real low-resource languages, since using all available data for training is often more effective. However, development sets are widely used in research papers that purport to deal with low-resource natural language processing (NLP). Here, we aim to answer the following questions : Does using a development set for early stopping in the low-resource setting influence results as compared to a more realistic alternative, where the number of training epochs is tuned on development languages? And does it lead to overestimation or underestimation of performance? We repeat multiple experiments from recent work on neural models for low-resource NLP and compare results for models obtained by training with and without development sets. On average over languages, absolute accuracy differs by up to 1.4 %. However, for some languages and tasks, differences are as big as 18.0 % accuracy. Our results highlight the importance of realistic experimental setups in the publication of low-resource NLP research results.

pdf bib
Transductive Auxiliary Task Self-Training for Neural Multi-Task Models
Johannes Bjerva | Katharina Kann | Isabelle Augenstein
Proceedings of the 2nd Workshop on Deep Learning Approaches for Low-Resource NLP (DeepLo 2019)

Multi-task learning and self-training are two common ways to improve a machine learning model’s performance in settings with limited training data. Drawing heavily on ideas from those two approaches, we suggest transductive auxiliary task self-training : training a multi-task model on (i) a combination of main and auxiliary task training data, and (ii) test instances with auxiliary task labels which a single-task version of the model has previously generated. We perform extensive experiments on 86 combinations of languages and tasks. Our results are that, on average, transductive auxiliary task self-training improves absolute accuracy by up to 9.56 % over the pure multi-task model for dependency relation tagging and by up to 13.03 % for semantic tagging.

pdf bib
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)
Isabelle Augenstein | Spandana Gella | Sebastian Ruder | Katharina Kann | Burcu Can | Johannes Welbl | Alexis Conneau | Xiang Ren | Marek Rei
Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019)

pdf bib
Subword-Level Language Identification for Intra-Word Code-Switching
Manuel Mager | Özlem Çetinoğlu | Katharina Kann
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Language identification for code-switching (CS), the phenomenon of alternating between two or more languages in conversations, has traditionally been approached under the assumption of a single language per token. However, if at least one language is morphologically rich, a large number of words can be composed of morphemes from more than one language (intra-word CS). In this paper, we extend the language identification task to the subword-level, such that it includes splitting mixed words while tagging each part with a language ID. We further propose a model for this task, which is based on a segmental recurrent neural network. In experiments on a new SpanishWixarika dataset and on an adapted GermanTurkish dataset, our proposed model performs slightly better than or roughly on par with our best baseline, respectively. Considering only mixed words, however, it strongly outperforms all baselines.

pdf bib
Probing for Semantic Classes : Diagnosing the Meaning Content of Word Embeddings
Yadollah Yaghoobzadeh | Katharina Kann | T. J. Hazen | Eneko Agirre | Hinrich Schütze
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Word embeddings typically represent different meanings of a word in a single conflated vector. Empirical analysis of embeddings of ambiguous words is currently limited by the small size of manually annotated resources and by the fact that word senses are treated as unrelated individual concepts. We present a large dataset based on manual Wikipedia annotations and word senses, where word senses from different words are related by semantic classes. This is the basis for novel diagnostic tests for an embedding’s content : we probe word embeddings for semantic classes and analyze the embedding space by classifying embeddings into semantic classes. Our main findings are : (i) Information about a sense is generally represented well in a single-vector embedding if the sense is frequent. (ii) A classifier can accurately predict whether a word is single-sense or multi-sense, based only on its embedding. (iii) Although rare senses are not well represented in single-vector embeddings, this does not have negative impact on an NLP application whose performance depends on frequent senses.


pdf bib
Evaluating Word Embeddings in Multi-label Classification Using Fine-Grained Name Typing
Yadollah Yaghoobzadeh | Katharina Kann | Hinrich Schütze
Proceedings of The Third Workshop on Representation Learning for NLP

Embedding models typically associate each word with a single real-valued vector, representing its different properties. Evaluation methods, therefore, need to analyze the accuracy and completeness of these properties in embeddings. This requires fine-grained analysis of embedding subspaces. Multi-label classification is an appropriate way to do so. We propose a new evaluation method for word embeddings based on multi-label classification given a word embedding. The task we use is fine-grained name typing : given a large corpus, find all types that a name can refer to based on the name embedding. Given the scale of entities in knowledge bases, we can build datasets for this task that are complementary to the current embedding evaluation datasets in : they are very large, contain fine-grained classes, and allow the direct evaluation of embeddings without confounding factors like sentence context.

pdf bib
Lost in Translation : Analysis of Information Loss During Machine Translation Between Polysynthetic and Fusional Languages
Manuel Mager | Elisabeth Mager | Alfonso Medina-Urrea | Ivan Vladimir Meza Ruiz | Katharina Kann
Proceedings of the Workshop on Computational Modeling of Polysynthetic Languages

Machine translation from polysynthetic to fusional languages is a challenging task, which gets further complicated by the limited amount of parallel text available. Thus, translation performance is far from the state of the art for high-resource and more intensively studied language pairs. To shed light on the phenomena which hamper automatic translation to and from polysynthetic languages, we study translations from three low-resource, polysynthetic languages (Nahuatl, Wixarika and Yorem Nokki) into Spanish and vice versa. Doing so, we find that in a morpheme-to-morpheme alignment an important amount of information contained in polysynthetic morphemes has no Spanish counterpart, and its translation is often omitted. We further conduct a qualitative analysis and, thus, identify morpheme types that are commonly hard to align or ignored in the translation process.

pdf bib
Fortification of Neural Morphological Segmentation Models for Polysynthetic Minimal-Resource Languages
Katharina Kann | Jesus Manuel Mager Hois | Ivan Vladimir Meza-Ruiz | Hinrich Schütze
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Morphological segmentation for polysynthetic languages is challenging, because a word may consist of many individual morphemes and training data can be extremely scarce. Since neural sequence-to-sequence (seq2seq) models define the state of the art for morphological segmentation in high-resource settings and for (mostly) European languages, we first show that they also obtain competitive performance for Mexican polysynthetic languages in minimal-resource settings. We then propose two novel multi-task training approachesone with, one without need for external unlabeled resources, and two corresponding data augmentation methods, improving over the neural baseline for all languages. Finally, we explore cross-lingual transfer as a third way to fortify our neural model and show that we can train one single multi-lingual model for related languages while maintaining comparable or even improved performance, thus reducing the amount of parameters by close to 75 %. We provide our morphological segmentation datasets for Mexicanero, Nahuatl, Wixarika and Yorem Nokki for future research.

pdf bib
Sentence-Level Fluency Evaluation : References Help, But Can Be Spared !
Katharina Kann | Sascha Rothe | Katja Filippova
Proceedings of the 22nd Conference on Computational Natural Language Learning

Motivated by recent findings on the probabilistic modeling of acceptability judgments, we propose syntactic log-odds ratio (SLOR), a normalized language model score, as a metric for referenceless fluency evaluation of natural language generation output at the sentence level. We further introduce WPSLOR, a novel WordPiece-based version, which harnesses a more compact language model. Even though word-overlap metrics like ROUGE are computed with the help of hand-written references, our referenceless methods obtain a significantly higher correlation with human fluency scores on a benchmark dataset of compressed sentences. Finally, we present ROUGE-LM, a reference-based metric which is a natural extension of WPSLOR to the case of available references. We show that ROUGE-LM yields a significantly higher correlation with human judgments than all baseline metrics, including WPSLOR on its own.


pdf bib
One-Shot Neural Cross-Lingual Transfer for Paradigm Completion
Katharina Kann | Ryan Cotterell | Hinrich Schütze
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a novel cross-lingual transfer method for paradigm completion, the task of mapping a lemma to its inflected forms, using a neural encoder-decoder model, the state of the art for the monolingual task. We use labeled data from a high-resource language to increase performance on a low-resource language. In experiments on 21 language pairs from four different language families, we obtain up to 58 % higher accuracy than without transfer and show that even zero-shot and one-shot learning are possible. We further find that the degree of language relatedness strongly influences the ability to transfer morphological knowledge.

pdf bib
Exploring Cross-Lingual Transfer of Morphological Knowledge In Sequence-to-Sequence Models
Huiming Jin | Katharina Kann
Proceedings of the First Workshop on Subword and Character Level Models in NLP

Multi-task training is an effective method to mitigate the data sparsity problem. It has recently been applied for cross-lingual transfer learning for paradigm completionthe task of producing inflected forms of lemmatawith sequence-to-sequence networks. However, it is still vague how the model transfers knowledge across languages, as well as if and which information is shared. To investigate this, we propose a set of data-dependent experiments using an existing encoder-decoder recurrent neural network for the task. Our results show that indeed the performance gains surpass a pure regularization effect and that knowledge about language and morphology can be transferred.

pdf bib
Unlabeled Data for Morphological Generation With Character-Based Sequence-to-Sequence Models
Katharina Kann | Hinrich Schütze
Proceedings of the First Workshop on Subword and Character Level Models in NLP

We present a semi-supervised way of training a character-based encoder-decoder recurrent neural network for morphological reinflectionthe task of generating one inflected wordform from another. This is achieved by using unlabeled tokens or random strings as training data for an autoencoding task, adapting a network for morphological reinflection, and performing multi-task training. We thus use limited labeled data more effectively, obtaining up to 9.92 % improvement over state-of-the-art baselines for 8 different languages.

pdf bib
Neural Multi-Source Morphological Reinflection
Katharina Kann | Ryan Cotterell | Hinrich Schütze
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We explore the task of multi-source morphological reinflection, which generalizes the standard, single-source version. The input consists of (i) a target tag and (ii) multiple pairs of source form and source tag for a lemma. The motivation is that it is beneficial to have access to more than one source form since different source forms can provide complementary information, e.g., different stems. We further present a novel extension to the encoder-decoder recurrent neural architecture, consisting of multiple encoders, to better solve the task. We show that our new architecture outperforms single-source reinflection models and publish our dataset for multi-source morphological reinflection to facilitate future research.