Proceedings of the Fifteenth Workshop on Computational Research in Phonetics, Phonology, and Morphology

Sandra Kuebler, Garrett Nicolai (Editors)


Anthology ID: W18-58
Month: October
Year: 2018
Address: Brussels, Belgium
Venues: EMNLP | WS
SIG: SIGMORPHON
Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/W18-58
PDF: https://aclanthology.org/W18-58.pdf

Efficient Computation of Implicational Universals in Constraint-Based Phonology Through the Hyperplane Separation Theorem
Giorgio Magri

This paper focuses on the most basic implicational universals in phonological theory, called T-orders after Anttila and Andrus (2006). It develops necessary and sufficient constraint characterizations of T-orders within Harmonic Grammar and Optimality Theory. These conditions rest on the rich convex geometry underlying these frameworks. They are phonologically intuitive and have significant algorithmic implications.

Lexical Networks in !Xung
Syed-Amad Hussain | Micha Elsner | Amanda Miller

We investigate the lexical network properties of the large phoneme inventory Southern African language Mangetti Dune !Xung as it compares to English and other commonly-studied languages. Lexical networks are graphs in which nodes (words) are linked to their minimal pairs; global properties of these networks are believed to mediate lexical access in the minds of speakers. We show that the network properties of !Xung are within the range found in previously-studied languages. By simulating data (pseudolexicons) with varying levels of phonotactic structure, we find that the lexical network properties of !Xung diverge from previously-studied languages when fewer phonotactic constraints are retained. We conclude that lexical network properties are representative of an underlying cognitive structure which is necessary for efficient word retrieval and that the phonotactics of !Xung may be shaped by a selective pressure which preserves network properties within this cognitively useful range.
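The graph construction described in this abstract can be illustrated with a minimal sketch. It assumes that a minimal pair is two equal-length phoneme sequences differing in exactly one position (substitution only); the paper's own neighbor definition may differ. The toy lexicon and the function names are hypothetical.

```python
from itertools import combinations


def is_minimal_pair(a, b):
    """True if two equal-length phoneme sequences differ in exactly one position.

    Assumption: substitution-only neighbors; insertions and deletions are ignored.
    """
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


def build_lexical_network(lexicon):
    """Return an adjacency map linking every word to its minimal pairs."""
    graph = {word: set() for word in lexicon}
    for a, b in combinations(lexicon, 2):
        if is_minimal_pair(a, b):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def mean_degree(graph):
    """Average number of neighbors per word, one simple global network property."""
    return sum(len(neighbors) for neighbors in graph.values()) / len(graph)


# Hypothetical toy lexicon; each word is a tuple of phoneme symbols.
toy_lexicon = [("k", "a", "t"), ("b", "a", "t"), ("k", "i", "t"), ("d", "o", "g")]
network = build_lexical_network(toy_lexicon)
print(mean_degree(network))
```

Representing words as tuples of phoneme symbols rather than strings keeps multi-character symbols, such as click consonants, as single units.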

Acoustic Word Disambiguation with Phonogical Features in Danish ASR
Andreas Søeborg Kirkedal

Phonological features can indicate word class and we can use word class information to disambiguate both homophones and homographs in automatic speech recognition (ASR). We show Danish stød can be predicted from speech and used to improve ASR. We discover which acoustic features contain the signal of stød, how to use these features to predict stød and how we can make use of stød and stød-predictive acoustic features to improve overall ASR accuracy and decoding speed. In the process, we discover acoustic features that are novel to the phonetic characterisation of stød.

Adaptor Grammars for the Linguist: Word Segmentation Experiments for Very Low-Resource Languages
Pierre Godard | Laurent Besacier | François Yvon | Martine Adda-Decker | Gilles Adda | Hélène Maynard | Annie Rialland

Computational Language Documentation attempts to make the most recent research in speech and language technologies available to linguists working on language preservation and documentation. In this paper, we pursue two main goals along these lines. The first is to improve upon a strong baseline for the unsupervised word discovery task on two very low-resource Bantu languages, taking advantage of the expertise of linguists on these particular languages. The second is to explore the Adaptor Grammar framework as a decision and prediction tool for linguists studying a new language. We experiment with 162 grammar configurations for each language and show that using Adaptor Grammars for word segmentation enables us to test hypotheses about a language. Specializing a generic grammar with language-specific knowledge leads to great improvements for the word discovery task, ultimately achieving a leap of about 30% in token F-score over the results of a strong baseline.

String Transduction with Target Language Models and Insertion Handling
Garrett Nicolai | Saeed Najafi | Grzegorz Kondrak

Many character-level tasks can be framed as sequence-to-sequence transduction, where the target is a word from a natural language. We show that leveraging target language models derived from unannotated target corpora, combined with a precise alignment of the training data, yields state-of-the-art results on cognate projection, inflection generation, and phoneme-to-grapheme conversion.

Modeling Reduplication with 2-way Finite-State Transducers
Hossep Dolatian | Jeffrey Heinz

This article describes a novel approach to the computational modeling of reduplication. Reduplication is a well-studied linguistic phenomenon. However, it is often treated as a stumbling block within finite-state treatments of morphology. Most finite-state implementations of computational morphology cannot adequately capture the productivity of unbounded copying in reduplication, nor can they adequately capture bounded copying. We show that an understudied type of finite-state machine, the two-way finite-state transducer (2-way FST), captures virtually all reduplicative processes, including total reduplication. 2-way FSTs can model reduplicative typology in a way which is convenient, easy to design and debug in practice, and linguistically motivated. By virtue of being finite-state, 2-way FSTs are likewise incorporable into existing finite-state systems and programs. A small but representative typology of reduplicative processes is described in this article, alongside their corresponding 2-way FST models.
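To make the idea of a two-way transducer concrete, here is a minimal sketch of a deterministic machine for total reduplication. It is not the authors' model: the state names, the boundary markers `<` and `>`, and the function name are hypothetical, and the input is assumed not to contain the boundary markers. The read head copies the word while moving right, rewinds to the left edge without output, and then copies the word a second time.

```python
LEFT, STAY, RIGHT = -1, 0, +1
START, END = "<", ">"  # hypothetical boundary markers; assumed absent from the input


def total_reduplication(word):
    """Simulate a deterministic 2-way FST that maps w to ww."""
    tape = [START] + list(word) + [END]
    state, pos, output = "copy1", 0, []
    while state != "halt":
        symbol = tape[pos]
        if state == "copy1":
            # First pass: move right, emitting each segment of the word.
            if symbol == END:
                state, direction = "rewind", LEFT
            else:
                if symbol != START:
                    output.append(symbol)
                direction = RIGHT
        elif state == "rewind":
            # Move left with no output until the left boundary is reached.
            if symbol == START:
                state, direction = "copy2", RIGHT
            else:
                direction = LEFT
        else:
            # Second pass ("copy2"): move right again, emitting a second copy.
            if symbol == END:
                state, direction = "halt", STAY
            else:
                output.append(symbol)
                direction = RIGHT
        pos += direction
    return "".join(output)


print(total_reduplication("wanita"))
```

The machine uses three working states regardless of word length, which is the sense in which unbounded copying stays finite-state once the head may move in both directions.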

A Comparison of Entity Matching Methods between English and Japanese Katakana
Michiharu Yamashita | Hideki Awashima | Hidekazu Oiwa

Japanese Katakana is one component of the Japanese writing system and is used to express English terms, loanwords, and onomatopoeia in Japanese characters based on the phonemes. The main purpose of this research is to find the best entity matching methods between English and Katakana. We pose two research questions to clarify which types of entity matching systems work better than others. The first question is what transliteration should be used for conversion. We need to transliterate English or Katakana terms into the same form in order to compute the string similarity. We consider five conversions: English to Katakana directly, Katakana to English directly, English to Katakana via phoneme, Katakana to English via phoneme, and both English and Katakana to phoneme. The second question is what should be used as the similarity measure for entity matching. To investigate the problem, we choose six methods: Overlap Coefficient, Cosine, Jaccard, Jaro-Winkler, Levenshtein, and the similarity of the phoneme probability predicted by an RNN. Our results show that 1) matching using phonemes and conversion of Katakana to English works better than other methods, and 2) the similarity of phonemes outperforms other methods, while the other similarity scores vary depending on data and models.
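Three of the string similarity measures named in this abstract can be sketched as follows. This is an illustrative implementation rather than the authors' code: the choice of character bigrams as the unit for the set-based measures is an assumption, and the function names are hypothetical.

```python
def bigrams(s):
    """Set of character bigrams of a string (assumed unit for set-based measures)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """Jaccard similarity: |X & Y| / |X | Y| over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def overlap_coefficient(a, b):
    """Overlap coefficient: |X & Y| / min(|X|, |Y|) over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 0.0
    return len(x & y) / min(len(x), len(y))


def levenshtein(a, b):
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))
print(overlap_coefficient("night", "nacht"), jaccard("night", "nacht"))
```

Any of these measures requires both strings to be in the same form first, which is why the choice of transliteration direction is treated as a separate question.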

Seq2Seq Models with Dropout can Learn Generalizable Reduplication
Brandon Prickett | Aaron Traylor | Joe Pater

Natural language reduplication can pose a challenge to neural models of language, and has been argued to require variables (Marcus et al., 1999). Sequence-to-sequence neural networks have been shown to perform well at a number of other morphological tasks (Cotterell et al., 2016), and produce results that highly correlate with human behavior (Kirov, 2017; Kirov & Cotterell, 2018) but do not include any explicit variables in their architecture. We find that they can learn a reduplicative pattern that generalizes to novel segments if they are trained with dropout (Srivastava et al., 2014). We argue that this matches the scope of generalization observed in human reduplication.

A Characterwise Windowed Approach to Hebrew Morphological Segmentation
Amir Zeldes

This paper presents a novel approach to the segmentation of orthographic word forms in contemporary Hebrew, focusing purely on splitting without carrying out morphological analysis or disambiguation. Casting the analysis task as character-wise binary classification and using adjacent character and word-based lexicon-lookup features, this approach achieves over 98% accuracy on the benchmark SPMRL shared task data for Hebrew, and 97% accuracy on a new out-of-domain Wikipedia dataset, an improvement of 4% and 5% over previous state-of-the-art performance.

Phonetic Vector Representations for Sound Sequence Alignment
Pavel Sofroniev | Çağrı Çöltekin

This study explores a number of data-driven vector representations of IPA-encoded sound segments for the purpose of sound sequence alignment. We test the alternative representations based on the alignment accuracy in the context of computational historical linguistics. We show that the data-driven methods consistently do better than linguistically-motivated articulatory-acoustic features. The similarity scores obtained using the data-driven representations in a monolingual context, however, perform worse than the state-of-the-art distance (or similarity) scoring methods proposed in earlier studies of computational historical linguistics. We also show that adapting representations to the task at hand improves the results, yielding alignment accuracy comparable to the state-of-the-art methods.

Sounds Wilde. Phonetically Extended Embeddings for Author-Stylized Poetry Generation
Aleksey Tikhonov | Ivan P. Yamshchikov

This paper addresses author-stylized text generation. Using a version of a language model with extended phonetic and semantic embeddings for poetry generation, we show that phonetics contributes to the overall model performance comparably to the information on the target author. Phonetic information is shown to be important for both English and Russian. Humans tend to attribute machine-generated texts to the target author.

On Hapax Legomena and Morphological Productivity
Janet Pierrehumbert | Ramon Granell

Quantifying and predicting morphological productivity is a long-standing challenge in corpus linguistics and psycholinguistics. The same challenge reappears in natural language processing in the context of handling words that were not seen in the training set (out-of-vocabulary, or OOV, words). Prior research showed that a good indicator of the productivity of a morpheme is the number of words involving it that occur exactly once (the hapax legomena). A technical connection was adduced between this result and Good-Turing smoothing, which assigns probability mass to unseen events on the basis of the simplifying assumption that word frequencies are stationary. In a large-scale study of 133 affixes in Wikipedia, we develop evidence that success in fact depends on tapping the frequency range in which the assumptions of Good-Turing are violated.
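The connection between hapax legomena and Good-Turing smoothing mentioned in this abstract can be illustrated with a short sketch. It uses the standard Good-Turing estimate of the total probability mass of unseen events, N1 / N, where N1 is the number of types seen exactly once and N is the number of tokens. The toy token list and the function names are hypothetical, and this is not the authors' experimental procedure.

```python
from collections import Counter


def hapax_count(tokens):
    """Number of word types that occur exactly once (the hapax legomena)."""
    counts = Counter(tokens)
    return sum(1 for c in counts.values() if c == 1)


def good_turing_unseen_mass(tokens):
    """Good-Turing estimate of the probability mass of unseen types: N1 / N."""
    if not tokens:
        raise ValueError("at least one token is required")
    return hapax_count(tokens) / len(tokens)


# Hypothetical sample of tokens formed with one affix (here "-ness").
ness_tokens = [
    "kindness", "kindness", "kindness",
    "darkness", "darkness",
    "awareness",
    "bluntness",
]
print(hapax_count(ness_tokens), good_turing_unseen_mass(ness_tokens))
```

Applied to the tokens of a single affix, the same ratio is a hapax-based productivity score: the more of the affix's tokens are one-off formations, the more likely the next token is a word not seen before.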

An Arabic Morphological Analyzer and Generator with Copious Features
Dima Taji | Salam Khalifa | Ossama Obeid | Fadhl Eryani | Nizar Habash

We introduce CALIMA-Star, a very rich Arabic morphological analyzer and generator that provides functional and form-based morphological features as well as built-in tokenization, phonological representation, lexical rationality and much more. This tool includes a fast engine that can be easily integrated into other systems, as well as an easy-to-use API and a web interface. CALIMA-Star also supports morphological reinflection. We evaluate CALIMA-Star against four commonly used analyzers for Arabic in terms of speed and morphological content.

Sanskrit n-Retroflexion is Input-Output Tier-Based Strictly Local
Thomas Graf | Connor Mayer

Sanskrit /n/-retroflexion is one of the most complex segmental processes in phonology. While it is still star-free, it does not fit in any of the subregular classes that are commonly entertained in the literature. We show that when construed as a phonotactic dependency, the process fits into a class we call input-output tier-based strictly local (IO-TSL), a natural extension of the familiar class TSL. IO-TSL increases the power of TSL’s tier projection function by making it an input-output strictly local transduction. Assuming that /n/-retroflexion represents the upper bound on the complexity of segmental phonology, this shows that all of segmental phonology can be captured by combining the intuitive notion of tiers with the independently motivated machinery of strictly local mappings.
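As background for the class that IO-TSL extends, the following sketch shows a plain TSL grammar with a window of two segments. It is not IO-TSL and not the paper's analysis of Sanskrit: the tier projection here simply keeps segments that belong to a fixed set, whereas IO-TSL lets projection depend on local input and output context. The toy sibilant harmony pattern, the symbol `S` standing in for a postalveolar sibilant, and the function names are hypothetical.

```python
def project_tier(word, tier):
    """Keep only the segments that belong to the tier, preserving their order."""
    return [seg for seg in word if seg in tier]


def is_tsl_wellformed(word, tier, forbidden_bigrams):
    """Check a TSL-2 grammar: no forbidden pair may be adjacent on the tier."""
    projected = project_tier(word, tier)
    return all(
        (a, b) not in forbidden_bigrams
        for a, b in zip(projected, projected[1:])
    )


# Hypothetical toy pattern: sibilants in a word must agree, at any distance.
sibilant_tier = {"s", "S"}
disagreeing_pairs = {("s", "S"), ("S", "s")}

print(is_tsl_wellformed("sokos", sibilant_tier, disagreeing_pairs))
print(is_tsl_wellformed("sokoS", sibilant_tier, disagreeing_pairs))
```

Because the intervening vowels and stops are not on the tier, the two sibilants become adjacent after projection, so a long-distance dependency is checked with a strictly local window.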

Phonological Features for Morphological Inflection
Adam Wiemerslage | Miikka Silfverberg | Mans Hulden

Modeling morphological inflection is an important task in Natural Language Processing. In contrast to earlier work that has largely used orthographic representations, we experiment with this task in a phonetic character space, representing inputs as either IPA segments or bundles of phonological distinctive features. We show that both of these inputs, somewhat counterintuitively, achieve similar accuracies on morphological inflection, slightly lower than orthographic models. We conclude that providing detailed phonological representations is largely redundant when compared to IPA segments, and that articulatory distinctions relevant for word inflection are already latently present in the distributional properties of many graphemic writing systems.

Extracting Morphophonology from Small Corpora
Marina Ermolaeva

Probabilistic approaches have proven effective in learning phonological structure. In contrast, theoretical linguistics usually works with deterministic generalizations. The goal of this paper is to explore possible interactions between information-theoretic methods and deterministic linguistic knowledge and to examine some ways in which both can be used in tandem to extract phonological and morphophonological patterns from a small annotated dataset. Local and nonlocal processes in Mishar Tatar (Turkic / Kipchak) are examined as a case study.