Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature

Beatrice Alex, Stefania Degaetano-Ortlieb, Anna Kazantseva, Nils Reiter, Stan Szpakowicz (Editors)

Anthology ID:
Minneapolis, USA
Association for Computational Linguistics
Bib Export formats:

pdf bib
Proceedings of the 3rd Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
Beatrice Alex | Stefania Degaetano-Ortlieb | Anna Kazantseva | Nils Reiter | Stan Szpakowicz

pdf bib
Modeling Word Emotion in Historical Language : Quantity Beats Supposed Stability in Seed Word Selection
Johannes Hellrich | Sven Buechel | Udo Hahn

To understand historical texts, we must be aware that languageincluding the emotional connotation attached to wordschanges over time. In this paper, we aim at estimating the emotion which is associated with a given word in former language stages of English and German. Emotion is represented following the popular Valence-Arousal-Dominance (VAD) annotation scheme. While being more expressive than polarity alone, existing word emotion induction methods are typically not suited for addressing it. To overcome this limitation, we present adaptations of two popular algorithms to VAD. To measure their effectiveness in diachronic settings, we present the first gold standard for historical word emotions, which was created by scholars with proficiency in the respective language stages and covers both English and German. In contrast to claims in previous work, our findings indicate that hand-selecting small sets of seed words with supposedly stable emotional meaning is actually harm- rather than helpful.

pdf bib
Are Fictional Voices Distinguishable? Classifying Character Voices in Modern Drama
Krishnapriya Vishnubhotla | Adam Hammond | Graeme Hirst

According to the literary theory of Mikhail Bakhtin, a dialogic novel is one in which characters speak in their own distinct voices, rather than serving as mouthpieces for their authors. We use text classification to determine which authors best achieve dialogism, looking at a corpus of plays from the late nineteenth and early twentieth centuries. We find that the SAGE model of text generation, which highlights deviations from a background lexical distribution, is an effective method of weighting the words of characters’ utterances. Our results show that it is indeed possible to distinguish characters by their speech in the plays of canonical writers such as George Bernard Shaw, whereas characters are clustered more closely in the works of lesser-known playwrights.

pdf bib
Automatic Alignment and Annotation Projection for Literary Texts
Uli Steinbach | Ines Rehbein

This paper presents a modular NLP pipeline for the creation of a parallel literature corpus, followed by annotation transfer from the source to the target language. The test case we use to evaluate our pipeline is the automatic transfer of quote and speaker mention annotations from English to German. We evaluate the different components of the pipeline and discuss challenges specific to literary texts. Our experiments show that after applying a reasonable amount of semi-automatic postprocessing we can obtain high-quality aligned and annotated resources for a new language.

pdf bib
Inferring missing metadata from environmental policy texts
Steven Bethard | Egoitz Laparra | Sophia Wang | Yiyun Zhao | Ragheb Al-Ghezi | Aaron Lien | Laura López-Hoffman

The National Environmental Policy Act (NEPA) provides a trove of data on how environmental policy decisions have been made in the United States over the last 50 years. Unfortunately, there is no central database for this information and it is too voluminous to assess manually. We describe our efforts to enable systematic research over US environmental policy by extracting and organizing metadata from the text of NEPA documents. Our contributions include collecting more than 40,000 NEPA-related documents, and evaluating rule-based baselines that establish the difficulty of three important tasks : identifying lead agencies, aligning document versions, and detecting reused text.

pdf bib
A framework for streamlined statistical prediction using topic models
Vanessa Glenny | Jonathan Tuke | Nigel Bean | Lewis Mitchell

In the Humanities and Social Sciences, there is increasing interest in approaches to information extraction, prediction, intelligent linkage, and dimension reduction applicable to large text corpora. With approaches in these fields being grounded in traditional statistical techniques, the need arises for frameworks whereby advanced NLP techniques such as topic modelling may be incorporated within classical methodologies. This paper provides a classical, supervised, statistical learning framework for prediction from text, using topic models as a data reduction method and the topics themselves as predictors, alongside typical statistical tools for predictive modelling. We apply this framework in a Social Sciences context (applied animal behaviour) as well as a Humanities context (narrative analysis) as examples of this framework. The results show that topic regression models perform comparably to their much less efficient equivalents that use individual words as predictors.

pdf bib
Graph convolutional networks for exploring authorship hypotheses
Tom Lippincott

This work considers a task from traditional literary criticism : annotating a structured, composite document with information about its sources. We take the Documentary Hypothesis, a prominent theory regarding the composition of the first five books of the Hebrew bible, extract stylistic features designed to avoid bias or overfitting, and train several classification models. Our main result is that the recently-introduced graph convolutional network architecture outperforms structurally-uninformed models. We also find that including information about the granularity of text spans is a crucial ingredient when employing hidden layers, in contrast to simple logistic regression. We perform error analysis at several levels, noting how some characteristic limitations of the models and simple features lead to misclassifications, and conclude with an overview of future work.

pdf bib
Semantics and Homothetic Clustering of Hafez Poetry
Arya Rahgozar | Diana Inkpen

We have created two sets of labels for Hafez (1315-1390) poems, using unsupervised learning. Our labels are the only semantic clustering alternative to the previously existing, hand-labeled, gold-standard classification of Hafez poems, to be used for literary research. We have cross-referenced, measured and analyzed the agreements of our clustering labels with Houman’s chronological classes. Our features are based on topic modeling and word embeddings. We also introduced a similarity of similarities’ features, we called homothetic clustering approach that proved effective, in case of Hafez’s small corpus of ghazals2. Although all our experiments showed different clusters when compared with Houman’s classes, we think they were valid in their own right to have provided further insights, and have proved useful as a contrasting alternative to Houman’s classes. Our homothetic clusterer and its feature design and engineering framework can be used for further semantic analysis of Hafez’s poetry and other similar literary research.

pdf bib
Computational Linguistics Applications for Multimedia Services
Kyeongmin Rim | Kelley Lynch | James Pustejovsky

We present Computational Linguistics Applications for Multimedia Services (CLAMS), a platform that provides access to computational content analysis tools for archival multimedia material that appear in different media, such as text, audio, image, and video. The primary goal of CLAMS is : (1) to develop an interchange format between multimodal metadata generation tools to ensure interoperability between tools ; (2) to provide users with a portable, user-friendly workflow engine to chain selected tools to extract meaningful analyses ; and (3) to create a public software development kit (SDK) for developers that eases deployment of analysis tools within the CLAMS platform. CLAMS is designed to help archives and libraries enrich the metadata associated with their mass-digitized multimedia collections, that would otherwise be largely unsearchable.

pdf bib
On the Feasibility of Automated Detection of Allusive Text Reuse
Enrique Manjavacas | Brian Long | Mike Kestemont

The detection of allusive text reuse is particularly challenging due to the sparse evidence on which allusive references rely commonly based on none or very few shared words. Arguably, lexical semantics can be resorted to since uncovering semantic relations between words has the potential to increase the support underlying the allusion and alleviate the lexical sparsity. A further obstacle is the lack of evaluation benchmark corpora, largely due to the highly interpretative character of the annotation process. In the present paper, we aim to elucidate the feasibility of automated allusion detection. We approach the matter from an Information Retrieval perspective in which referencing texts act as queries and referenced texts as relevant documents to be retrieved, and estimate the difficulty of benchmark corpus compilation by a novel inter-annotator agreement study on query segmentation. Furthermore, we investigate to what extent the integration of lexical semantic information derived from distributional models and ontologies can aid retrieving cases of allusive reuse. The results show that (i) despite low agreement scores, using manual queries considerably improves retrieval performance with respect to a windowing approach, and that (ii) retrieval performance can be moderately boosted with distributional semantics.

pdf bib
Sign Clustering and Topic Extraction in Proto-ElamiteProto-Elamite
Logan Born | Kate Kelley | Nishant Kambhatla | Carolyn Chen | Anoop Sarkar

We describe a first attempt at using techniques from computational linguistics to analyze the undeciphered proto-Elamite script. Using hierarchical clustering, n-gram frequencies, and LDA topic models, we both replicate results obtained by manual decipherment and reveal previously-unobserved relationships between signs. This demonstrates the utility of these techniques as an aid to manual decipherment.