David Jurgens


2022

pdf bib
Classification without (Proper) Representation: Political Heterogeneity in Social Media and Its Implications for Classification and Behavioral Analysis
Kenan Alkiek | Bohan Zhang | David Jurgens
Findings of the Association for Computational Linguistics: ACL 2022

Reddit is home to a broad spectrum of political activity, and users signal their political affiliations in multiple ways—from self-declarations to community participation. Frequently, computational studies have treated political users as a single bloc, both in developing models to infer political leaning and in studying political behavior. Here, we test this assumption of political users and show that commonly-used political-inference models do not generalize, indicating heterogeneous types of political users. The models remain imprecise at best for most users, regardless of which sources of data or methods are used. Across a 14-year longitudinal analysis, we demonstrate that the choice in definition of a political user has significant implications for behavioral analysis. Controlling for multiple factors, political users are more toxic on the platform and inter-party interactions are even more toxic—but not all political users behave this way. Last, we identify a subset of political users who repeatedly flip affiliations, showing that these users are the most controversial of all, acting as provocateurs by more frequently bringing up politics, and are more likely to be banned, suspended, or deleted.

2021

pdf bib
Proceedings of the Fifth Workshop on Teaching NLP
David Jurgens | Varada Kolhatkar | Lucy Li | Margot Mieskes | Ted Pedersen
Proceedings of the Fifth Workshop on Teaching NLP

pdf bib
Using Sociolinguistic Variables to Reveal Changing Attitudes Towards Sexuality and Gender
Sky CH-Wang | David Jurgens
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Individuals signal aspects of their identity and beliefs through linguistic choices. Studying these choices in aggregate allows us to examine large-scale attitude shifts within a population. Here, we develop computational methods to study word choice within a sociolinguistic lexical variablealternate words used to express the same conceptin order to test for change in the United States towards sexuality and gender. We examine two variables : i) referents to significant others, such as the word partner and ii) referents to an indefinite person, both of which could optionally be marked with gender. The linguistic choices in each variable allow us to study increased rates of acceptances of gay marriage and gender equality, respectively. In longitudinal analyses across Twitter and Reddit over 87 M messages, we demonstrate that attitudes are changing but that these changes are driven by specific demographics within the United States. Further, in a quasi-causal analysis, we show that passages of Marriage Equality Acts in different states are drivers of linguistic change.

pdf bib
Modeling Framing in Immigration Discourse on Social Media
Julia Mendelsohn | Ceren Budak | David Jurgens
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

The framing of political issues can influence policy and public opinion. Even though the public plays a key role in creating and spreading frames, little is known about how ordinary people on social media frame political issues. By creating a new dataset of immigration-related tweets labeled for multiple framing typologies from political communication theory, we develop supervised models to detect frames. We demonstrate how users’ ideology and region impact framing choices, and how a message’s framing influences audience responses. We find that the more commonly-used issue-generic frames obscure important ideological and regional patterns that are only revealed by immigration-specific frames. Furthermore, frames oriented towards human interests, culture, and politics are associated with higher user engagement. This large-scale analysis of a complex social and linguistic phenomenon contributes to both NLP and social science research.

pdf bib
Detecting Cross-Geographic Biases in Toxicity Modeling on Social Media
Sayan Ghosh | Dylan Baker | David Jurgens | Vinodkumar Prabhakaran
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

Online social media platforms increasingly rely on Natural Language Processing (NLP) techniques to detect abusive content at scale in order to mitigate the harms it causes to their users. However, these techniques suffer from various sampling and association biases present in training data, often resulting in sub-par performance on content relevant to marginalized groups, potentially furthering disproportionate harms towards them. Studies on such biases so far have focused on only a handful of axes of disparities and subgroups that have annotations / lexicons available. Consequently, biases concerning non-Western contexts are largely ignored in the literature. In this paper, we introduce a weakly supervised method to robustly detect lexical biases in broader geo-cultural contexts. Through a case study on a publicly available toxicity detection model, we demonstrate that our method identifies salient groups of cross-geographic errors, and, in a follow up, demonstrate that these groupings reflect human judgments of offensive and inoffensive language in those geographic contexts. We also conduct analysis of a model trained on a dataset with ground truth labels to better understand these biases, and present preliminary mitigation experiments.

pdf bib
HamiltonDinggg at SemEval-2021 Task 5 : Investigating Toxic Span Detection using RoBERTa Pre-trainingHamiltonDinggg at SemEval-2021 Task 5: Investigating Toxic Span Detection using RoBERTa Pre-training
Huiyang Ding | David Jurgens
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This paper presents our system submission to task 5 : Toxic Spans Detection of the SemEval-2021 competition. The competition aims at detecting the spans that make a toxic span toxic. In this paper, we demonstrate our system for detecting toxic spans, which includes expanding the toxic training set with Local Interpretable Model-Agnostic Explanations (LIME), fine-tuning RoBERTa model for detection, and error analysis. We found that feeding the model with an expanded training set using Reddit comments of polarized-toxicity and labeling with LIME on top of logistic regression classification could help RoBERTa more accurately learn to recognize toxic spans. We achieved a span-level F1 score of 0.6715 on the testing phase. Our quantitative and qualitative results show that the predictions from our system could be a good supplement to the gold training set’s annotations.

2020

pdf bib
Condolence and Empathy in Online Communities
Naitian Zhou | David Jurgens
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Offering condolence is a natural reaction to hearing someone’s distress. Individuals frequently express distress in social media, where some communities can provide support. However, not all condolence is equaltrite responses offer little actual support despite their good intentions. Here, we develop computational tools to create a massive dataset of 11.4 M expressions of distress and 2.8 M corresponding offerings of condolence in order to examine the dynamics of condolence online. Our study reveals widespread disparity in what types of distress receive supportive condolence rather than just engagement. Building on studies from social psychology, we analyze the language of condolence and develop a new dataset for quantifying the empathy in a condolence using appraisal theory. Finally, we demonstrate that the features of condolence individuals find most helpful online differ substantially in their features from those seen in interpersonal settings.

pdf bib
Quantifying Intimacy in Language
Jiaxin Pei | David Jurgens
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Intimacy is a fundamental aspect of how we relate to others in social settings. Language encodes the social information of intimacy through both topics and other more subtle cues (such as linguistic hedging and swearing). Here, we introduce a new computational framework for studying expressions of the intimacy in language with an accompanying dataset and deep learning model for accurately predicting the intimacy level of questions (Pearson r = 0.87). Through analyzing a dataset of 80.5 M questions across social media, books, and films, we show that individuals employ interpersonal pragmatic moves in their language to align their intimacy with social settings. Then, in three studies, we further demonstrate how individuals modulate their intimacy to match social norms around gender, social distance, and audience, each validating key findings from studies in social psychology. Our work demonstrates that intimacy is a pervasive and impactful social dimension of language.

pdf bib
Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science
David Bamman | Dirk Hovy | David Jurgens | Brendan O'Connor | Svitlana Volkova
Proceedings of the Fourth Workshop on Natural Language Processing and Computational Social Science

2019

pdf bib
Finding Microaggressions in the Wild : A Case for Locating Elusive Phenomena in Social Media Posts
Luke Breitfeller | Emily Ahn | David Jurgens | Yulia Tsvetkov
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Microaggressions are subtle, often veiled, manifestations of human biases. These uncivil interactions can have a powerful negative impact on people by marginalizing minorities and disadvantaged groups. The linguistic subtlety of microaggressions in communication has made it difficult for researchers to analyze their exact nature, and to quantify and extract microaggressions automatically. Specifically, the lack of a corpus of real-world microaggressions and objective criteria for annotating them have prevented researchers from addressing these problems at scale. In this paper, we devise a general but nuanced, computationally operationalizable typology of microaggressions based on a small subset of data that we have. We then create two datasets : one with examples of diverse types of microaggressions recollected by their targets, and another with gender-based microaggressions in public conversations on social media. We introduce a new, more objective, criterion for annotation and an active-learning based procedure that increases the likelihood of surfacing posts containing microaggressions. Finally, we analyze the trends that emerge from these new datasets.

pdf bib
Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science
Svitlana Volkova | David Jurgens | Dirk Hovy | David Bamman | Oren Tsur
Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science

2017

pdf bib
Incorporating Dialectal Variability for Socially Equitable Language Identification
David Jurgens | Yulia Tsvetkov | Dan Jurafsky
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Language identification (LID) is a critical first step for processing multilingual text. Yet most LID systems are not designed to handle the linguistic diversity of global platforms like Twitter, where local dialects and rampant code-switching lead language classifiers to systematically miss minority dialect speakers and multilingual speakers. We propose a new dataset and a character-based sequence-to-sequence model for LID designed to support dialectal and multilingual language varieties. Our model achieves state-of-the-art performance on multiple LID benchmarks. Furthermore, in a case study using Twitter for health tracking, our method substantially increases the availability of texts written by underrepresented populations, enabling the development of socially inclusive NLP tools.

pdf bib
Proceedings of the Second Workshop on NLP and Computational Social Science
Dirk Hovy | Svitlana Volkova | David Bamman | David Jurgens | Brendan O’Connor | Oren Tsur | A. Seza Doğruöz
Proceedings of the Second Workshop on NLP and Computational Social Science

pdf bib
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)
Steven Bethard | Marine Carpuat | Marianna Apidianaki | Saif M. Mohammad | Daniel Cer | David Jurgens
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)