Dan Jurafsky


2021

pdf bib
Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation
Yasuhide Miura | Yuhao Zhang | Emily Tsai | Curtis Langlotz | Dan Jurafsky
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Neural image-to-text radiology report generation systems offer the potential to improve radiology reporting by reducing the repetitive process of report drafting and identifying possible medical errors. However, existing report generation systems, despite achieving high performances on natural language generation metrics such as CIDEr or BLEU, still suffer from incomplete and inconsistent generations. Here we introduce two new simple rewards to encourage the generation of factually complete and consistent radiology reports : one that encourages the system to generate radiology domain entities consistent with the reference, and one that uses natural language inference to encourage these entities to be described in inferentially consistent ways. We combine these with the novel use of an existing semantic equivalence metric (BERTScore). We further propose a report generation system that optimizes these rewards via reinforcement learning. On two open radiology report datasets, our system substantially improved the F1 score of a clinical information extraction performance by +22.1 (Delta +63.9 %). We further show via a human evaluation and a qualitative analysis that our system leads to generations that are more factually complete and consistent compared to the baselines.

2020

pdf bib
Utility is in the Eye of the User : A Critique of NLP LeaderboardsNLP Leaderboards
Kawin Ethayarajh | Dan Jurafsky
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Benchmarks such as GLUE have helped drive advances in NLP by incentivizing the creation of more accurate models. While this leaderboard paradigm has been remarkably successful, a historical focus on performance-based evaluation has been at the expense of other qualities that the NLP community values in models, such as compactness, fairness, and energy efficiency. In this opinion paper, we study the divergence between what is incentivized by leaderboards and what is useful in practice through the lens of microeconomic theory. We frame both the leaderboard and NLP practitioners as consumers and the benefit they get from a model as its utility to them. With this framing, we formalize how leaderboards in their current form can be poor proxies for the NLP community at large. For example, a highly inefficient model would provide less utility to practitioners but not to a leaderboard, since it is a cost that only the former must bear. To allow practitioners to better estimate a model’s utility to them, we advocate for more transparency on leaderboards, such as the reporting of statistics that are of practical concern (e.g., model size, energy efficiency, and inference latency).

pdf bib
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Dan Jurafsky | Joyce Chai | Natalie Schluter | Joel Tetreault
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

pdf bib
Pretraining with Contrastive Sentence Objectives Improves Discourse Performance of Language Models
Dan Iter | Kelvin Guu | Larry Lansing | Dan Jurafsky
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Recent models for unsupervised representation learning of text have employed a number of techniques to improve contextual word representations but have put little focus on discourse-level representations. We propose Conpono, an inter-sentence objective for pretraining language models that models discourse coherence and the distance between sentences. Given an anchor sentence, our model is trained to predict the text k sentences away using a sampled-softmax objective where the candidates consist of neighboring sentences and sentences randomly sampled from the corpus. On the discourse representation benchmark DiscoEval, our model improves over the previous state-of-the-art by up to 13 % and on average 4 % absolute across 7 tasks. Our model is the same size as BERT-Base, but outperforms the much larger BERT-Large model and other more recent approaches that incorporate discourse. We also show that Conpono yields gains of 2%-6 % absolute even for tasks that do not explicitly evaluate discourse : textual entailment (RTE), common sense reasoning (COPA) and reading comprehension (ReCoRD).

pdf bib
Social Bias Frames : Reasoning about Social and Power Implications of Language
Maarten Sap | Saadia Gabriel | Lianhui Qin | Dan Jurafsky | Noah A. Smith | Yejin Choi
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Warning : this paper contains content that may be offensive or upsetting. Language has the power to reinforce stereotypes and project social biases onto others. At the core of the challenge is that it is rarely what is stated explicitly, but rather the implied meanings, that frame people’s judgments about others. For example, given a statement that we should n’t lower our standards to hire more women, most listeners will infer the implicature intended by the speaker-that women (candidates) are less qualified. Most semantic formalisms, to date, do not capture such pragmatic implications in which people express social biases and power differentials in language. We introduce Social Bias Frames, a new conceptual formalism that aims to model the pragmatic frames in which people project social biases and stereotypes onto others. In addition, we introduce the Social Bias Inference Corpus to support large-scale modelling and evaluation with 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups. We then establish baseline approaches that learn to recover Social Bias Frames from unstructured text. We find that while state-of-the-art neural models are effective at high-level categorization of whether a given statement projects unwanted social bias (80 % F1), they are not effective at spelling out more detailed explanations in terms of Social Bias Frames. Our study motivates future work that combines structured pragmatic inference with commonsense reasoning on social implications.

2019

pdf bib
Analyzing Polarization in Social Media : Method and Application to Tweets on 21 Mass Shootings
Dorottya Demszky | Nikhil Garg | Rob Voigt | James Zou | Jesse Shapiro | Matthew Gentzkow | Dan Jurafsky
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We provide an NLP framework to uncover four linguistic dimensions of political polarization in social media : topic choice, framing, affect and illocutionary force. We quantify these aspects with existing lexical methods, and propose clustering of tweet embeddings as a means to identify salient topics for analysis across events ; human evaluations show that our approach generates more cohesive topics than traditional LDA-based models. We apply our methods to study 4.4 M tweets on 21 mass shootings. We provide evidence that the discussion of these events is highly polarized politically and that this polarization is primarily driven by partisan differences in framing rather than topic choice. We identify framing devices, such as grounding and the contrasting use of the terms terrorist and crazy, that contribute to polarization. Results pertaining to topic choice, affect and illocutionary force suggest that Republicans focus more on the shooter and event-specific facts (news) while Democrats focus more on the victims and call for policy changes. Our work contributes to a deeper understanding of the way group divisions manifest in language and to computational methods for studying them.

pdf bib
Let’s Make Your Request More Persuasive : Modeling Persuasive Strategies via Semi-Supervised Neural Nets on Crowdfunding Platforms
Diyi Yang | Jiaao Chen | Zichao Yang | Dan Jurafsky | Eduard Hovy
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

Modeling what makes a request persuasive-eliciting the desired response from a reader-is critical to the study of propaganda, behavioral economics, and advertising. Yet current models ca n’t quantify the persuasiveness of requests or extract successful persuasive strategies. Building on theories of persuasion, we propose a neural network to quantify persuasiveness and identify the persuasive strategies in advocacy requests. Our semi-supervised hierarchical neural network model is supervised by the number of people persuaded to take actions and partially supervised at the sentence level with human-labeled rhetorical strategies. Our method outperforms several baselines, uncovers persuasive strategies-offering increased interpretability of persuasive speech-and has applications for other situations with document-level supervision but only partial sentence supervision.

pdf bib
Recursive Routing Networks : Learning to Compose Modules for Language Understanding
Ignacio Cases | Clemens Rosenbaum | Matthew Riemer | Atticus Geiger | Tim Klinger | Alex Tamkin | Olivia Li | Sandhini Agarwal | Joshua D. Greene | Dan Jurafsky | Christopher Potts | Lauri Karttunen
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We introduce Recursive Routing Networks (RRNs), which are modular, adaptable models that learn effectively in diverse environments. RRNs consist of a set of functions, typically organized into a grid, and a meta-learner decision-making component called the router. The model jointly optimizes the parameters of the functions and the meta-learner’s policy for routing inputs through those functions. RRNs can be incorporated into existing architectures in a number of ways ; we explore adding them to word representation layers, recurrent network hidden layers, and classifier layers. Our evaluation task is natural language inference (NLI). Using the MultiNLI corpus, we show that an RRN’s routing decisions reflect the high-level genre structure of that corpus. To show that RRNs can learn to specialize to more fine-grained semantic distinctions, we introduce a new corpus of NLI examples involving implicative predicates, and show that the model components become fine-tuned to the inferential signatures that are characteristic of these predicates.

2018

pdf bib
Framing and Agenda-setting in Russian News : a Computational Analysis of Intricate Political StrategiesRussian News: a Computational Analysis of Intricate Political Strategies
Anjalie Field | Doron Kliger | Shuly Wintner | Jennifer Pan | Dan Jurafsky | Yulia Tsvetkov
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Amidst growing concern over media manipulation, NLP attention has focused on overt strategies like censorship and fake news. Here, we draw on two concepts from political science literature to explore subtler strategies for government media manipulation : agenda-setting (selecting what topics to cover) and framing (deciding how topics are covered). We analyze 13 years (100 K articles) of the Russian newspaper Izvestia and identify a strategy of distraction : articles mention the U.S. more frequently in the month directly following an economic downturn in Russia. We introduce embedding-based methods for cross-lingually projecting English frames to Russian, and discover that these articles emphasize U.S. moral failings and threats to the U.S. Our work offers new ways to identify subtle media manipulation strategies at the intersection of agenda-setting and framing.

pdf bib
Detecting Institutional Dialog Acts in Police Traffic Stops
Vinodkumar Prabhakaran | Camilla Griffiths | Hang Su | Prateek Verma | Nelson Morgan | Jennifer L. Eberhardt | Dan Jurafsky
Transactions of the Association for Computational Linguistics, Volume 6

We apply computational dialog methods to police body-worn camera footage to model conversations between police officers and community members in traffic stops. Relying on the theory of institutional talk, we develop a labeling scheme for police speech during traffic stops, and a tagger to detect institutional dialog acts (Reasons, Searches, Offering Help) from transcribed text at the turn (78 % F-score) and stop (89 % F-score) level. We then develop speech recognition and segmentation algorithms to detect these acts at the stop level from raw camera audio (81 % F-score, with even higher accuracy for crucial acts like conveying the reason for the stop). We demonstrate that the dialog structures produced by our tagger could reveal whether officers follow law enforcement norms like introducing themselves, explaining the reason for the stop, and asking permission for searches. This work may therefore inform and aid efforts to ensure the procedural justice of police-community interactions.

pdf bib
Automatic Detection of Incoherent Speech for Diagnosing Schizophrenia
Dan Iter | Jong Yoon | Dan Jurafsky
Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic

Schizophrenia is a mental disorder which afflicts an estimated 0.7 % of adults world wide. It affects many areas of mental function, often evident from incoherent speech. Diagnosing schizophrenia relies on subjective judgments resulting in disagreements even among trained clinicians. Recent studies have proposed the use of natural language processing for diagnosis by drawing on automatically-extracted linguistic features like discourse coherence and lexicon. Here, we present the first benchmark comparison of previously proposed coherence models for detecting symptoms of schizophrenia and evaluate their performance on a new dataset of recorded interviews between subjects and clinicians. We also present two alternative coherence metrics based on modern sentence embedding techniques that outperform the previous methods on our dataset. Lastly, we propose a novel computational model for reference incoherence based on ambiguous pronoun usage and show that it is a highly predictive feature on our data. While the number of subjects is limited in this pilot study, our results suggest new directions for diagnosing common symptoms of schizophrenia.

pdf bib
Noising and Denoising Natural Language : Diverse Backtranslation for Grammar Correction
Ziang Xie | Guillaume Genthial | Stanley Xie | Andrew Ng | Dan Jurafsky
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Translation-based methods for grammar correction that directly map noisy, ungrammatical text to their clean counterparts are able to correct a broad range of errors ; however, such techniques are bottlenecked by the need for a large parallel corpus of noisy and clean sentence pairs. In this paper, we consider synthesizing parallel data by noising a clean monolingual corpus. While most previous approaches introduce perturbations using features computed from local context windows, we instead develop error generation processes using a neural sequence transduction model trained to translate clean examples to their noisy counterparts. Given a corpus of clean examples, we propose beam search noising procedures to synthesize additional noisy examples that human evaluators were nearly unable to discriminate from nonsynthesized examples. Surprisingly, when trained on additional data synthesized using our best-performing noising scheme, our model approaches the same performance as when trained on additional nonsynthesized data.

pdf bib
Deconfounded Lexicon Induction for Interpretable Social Science
Reid Pryzant | Kelly Shen | Dan Jurafsky | Stefan Wagner
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

NLP algorithms are increasingly used in computational social science to take linguistic observations and predict outcomes like human preferences or actions. Making these social models transparent and interpretable often requires identifying features in the input that predict outcomes while also controlling for potential confounds. We formalize this need as a new task : inducing a lexicon that is predictive of a set of target variables yet uncorrelated to a set of confounding variables. We introduce two deep learning algorithms for the task. The first uses a bifurcated architecture to separate the explanatory power of the text and confounds. The second uses an adversarial discriminator to force confound-invariant text encodings. Both elicit lexicons from learned weights and attentional scores. We use them to induce lexicons that are predictive of timely responses to consumer complaints (controlling for product), enrollment from course descriptions (controlling for subject), and sales from product descriptions (controlling for seller). In each domain our algorithms pick words that are associated with narrative persuasion ; more predictive and less confound-related than those of standard feature weighting and lexicon induction techniques like regression and log odds.narrative persuasion; more\n predictive and less confound-related than those of standard\n feature weighting and lexicon induction techniques like\n regression and log odds.\n

pdf bib
Sharp Nearby, Fuzzy Far Away : How Neural Language Models Use Context
Urvashi Khandelwal | He He | Peng Qi | Dan Jurafsky
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We know very little about how neural language models (LM) use prior linguistic context. In this paper, we investigate the role of context in an LSTM LM, through ablation studies. Specifically, we analyze the increase in perplexity when prior context words are shuffled, replaced, or dropped. On two standard datasets, Penn Treebank and WikiText-2, we find that the model is capable of using about 200 tokens of context on average, but sharply distinguishes nearby context (recent 50 tokens) from the distant history. The model is highly sensitive to the order of words within the most recent sentence, but ignores word order in the long-range context (beyond 50 tokens), suggesting the distant past is modeled only as a rough semantic field or topic. We further find that the neural caching model (Grave et al., 2017b) especially helps the LSTM to copy words from within this distant context. Overall, our analysis not only provides a better understanding of how neural LMs use their context, but also sheds light on recent success from cache-based models.

2017

pdf bib
Incorporating Dialectal Variability for Socially Equitable Language Identification
David Jurgens | Yulia Tsvetkov | Dan Jurafsky
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Language identification (LID) is a critical first step for processing multilingual text. Yet most LID systems are not designed to handle the linguistic diversity of global platforms like Twitter, where local dialects and rampant code-switching lead language classifiers to systematically miss minority dialect speakers and multilingual speakers. We propose a new dataset and a character-based sequence-to-sequence model for LID designed to support dialectal and multilingual language varieties. Our model achieves state-of-the-art performance on multiple LID benchmarks. Furthermore, in a case study using Twitter for health tracking, our method substantially increases the availability of texts written by underrepresented populations, enabling the development of socially inclusive NLP tools.

pdf bib
Neural Net Models of Open-domain Discourse Coherence
Jiwei Li | Dan Jurafsky
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Discourse coherence is strongly associated with text quality, making it important to natural language generation and understanding. Yet existing models of coherence focus on measuring individual aspects of coherence (lexical overlap, rhetorical structure, entity centering) in narrow domains. In this paper, we describe domain-independent neural models of discourse coherence that are capable of measuring multiple aspects of coherence in existing sentences and can maintain coherence while generating new sentences. We study both discriminative models that learn to distinguish coherent from incoherent discourse, and generative models that produce coherent text, including a novel neural latent-variable Markovian generative model that captures the latent discourse dependencies between sentences in a text. Our work achieves state-of-the-art performance on multiple coherence evaluations, and marks an initial step in generating coherent texts given discourse contexts.

pdf bib
Adversarial Learning for Neural Dialogue Generation
Jiwei Li | Will Monroe | Tianlin Shi | Sébastien Jean | Alan Ritter | Dan Jurafsky
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

We apply adversarial training to open-domain dialogue generation, training a system to produce sequences that are indistinguishable from human-generated dialogue utterances. We cast the task as a reinforcement learning problem where we jointly train two systems : a generative model to produce response sequences, and a discriminatoranalagous to the human evaluator in the Turing test to distinguish between the human-generated dialogues and the machine-generated ones. In this generative adversarial network approach, the outputs from the discriminator are used to encourage the system towards more human-like dialogue. Further, we investigate models for adversarial evaluation that uses success in fooling an adversary as a dialogue evaluation metric, while avoiding a number of potential pitfalls. Experimental results on several metrics, including adversarial evaluation, demonstrate that the adversarially-trained system generates higher-quality responses than previous baselines

pdf bib
A Two-stage Sieve Approach for Quote Attribution
Grace Muzny | Michael Fang | Angel Chang | Dan Jurafsky
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

We present a deterministic sieve-based system for attributing quotations in literary text and a new dataset : QuoteLi3. Quote attribution, determining who said what in a given text, is important for tasks like creating dialogue systems, and in newer areas like computational literary studies, where it creates opportunities to analyze novels at scale rather than only a few at a time. We release QuoteLi3, which contains more than 6,000 annotations linking quotes to speaker mentions and quotes to speaker entities, and introduce a new algorithm for quote attribution. Our two-stage algorithm first links quotes to mentions, then mentions to entities. Using two stages encapsulates difficult sub-problems and improves system performance. The modular design allows us to tune for overall performance or higher precision, which is useful for many real-world use cases. Our system achieves an average F-score of 87.5 across three novels, outperforming previous systems, and can be tuned for precision of 90.4 at a recall of 65.1.