Rowan Zellers


2021

Edited Media Understanding Frames: Reasoning About the Intent and Implications of Visual Misinformation
Jeff Da | Maxwell Forbes | Rowan Zellers | Anthony Zheng | Jena D. Hwang | Antoine Bosselut | Yejin Choi
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Understanding manipulated media, from automatically generated ‘deepfakes’ to manually edited ones, raises novel research challenges. Because the vast majority of edited or manipulated images are benign, such as photoshopped images for visual enhancements, the key challenge is to understand the complex layers of underlying intents of media edits and their implications with respect to disinformation. In this paper, we study Edited Media Frames, a new formalism to understand visual media manipulation as structured annotations with respect to the intents, emotional reactions, attacks on individuals, and the overall implications of disinformation. We introduce a dataset for our task, EMU, with 56k question-answer pairs written in rich natural language. We evaluate a wide variety of vision-and-language models for our task, and introduce a new model, PELICAN, which builds upon recent progress in pretrained multimodal representations. Our model obtains promising results on our dataset, with humans rating its answers as accurate 48.2% of the time. At the same time, there is still much work to be done, and we provide analysis that highlights areas for further progress.

TuringAdvice: A Generative and Dynamic Evaluation of Language Use
Rowan Zellers | Ari Holtzman | Elizabeth Clark | Lianhui Qin | Ali Farhadi | Yejin Choi
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

We propose TuringAdvice, a new challenge task and dataset for language understanding models. Given a written situation that a real person is currently facing, a model must generate helpful advice in natural language. Our evaluation framework tests a fundamental aspect of human language understanding: our ability to use language to resolve open-ended situations by communicating with each other. Empirical results show that today’s models struggle at TuringAdvice, even multibillion-parameter models finetuned on 600k in-domain training examples. The best model, T5, writes advice that is at least as helpful as human-written advice in only 14% of cases; a much larger non-finetunable GPT3 model does even worse at 4%. This low performance reveals language understanding errors that are hard to spot outside of a generative setting, showing much room for progress.

2018

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference
Rowan Zellers | Yonatan Bisk | Roy Schwartz | Yejin Choi
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Given a partial description like "she opened the hood of the car," humans can reason about the situation and anticipate what might come next ("then, she examined the engine"). In this paper, we introduce the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning. We present SWAG, a new dataset with 113k multiple-choice questions about a rich spectrum of grounded situations. To address the recurring challenges of the annotation artifacts and human biases found in many existing datasets, we propose Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data. To account for the aggressive adversarial filtering, we use state-of-the-art language models to massively oversample a diverse set of potential counterfactuals. Empirical results demonstrate that while humans can solve the resulting inference problems with high accuracy (88%), various competitive models struggle on our task. We provide comprehensive analysis that indicates significant opportunities for future research.
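
The abstract describes Adversarial Filtering only at a high level, so the following is a rough sketch of the general idea rather than the paper's procedure. It assumes a single toy nearest-centroid classifier instead of an ensemble, two made-up stylistic features, and a hypothetical item layout with `real`, `chosen`, and `pool` fields. In each round, a classifier is trained on one half of the data, and on the other half each machine-written ending is swapped for whichever pooled candidate the classifier finds least detectable.

```python
import random


def featurize(text):
    # Hypothetical stylistic features: token count and mean token length.
    tokens = text.split()
    return (len(tokens), sum(len(t) for t in tokens) / max(len(tokens), 1))


def train_centroids(real_texts, fake_texts):
    # Toy stand-in for a stylistic classifier: one feature centroid per class.
    def centroid(texts):
        feats = [featurize(t) for t in texts]
        return tuple(sum(col) / len(feats) for col in zip(*feats))

    return centroid(real_texts), centroid(fake_texts)


def fake_score(model, text):
    # Higher means the text sits closer to the "fake" centroid than the "real" one.
    real_c, fake_c = model
    f = featurize(text)
    d_real = sum((a - b) ** 2 for a, b in zip(f, real_c))
    d_fake = sum((a - b) ** 2 for a, b in zip(f, fake_c))
    return d_real - d_fake


def adversarial_filter(items, rounds=5, seed=0):
    """Each item is a dict: 'real' (str), 'chosen' (str), 'pool' (list of str).

    Items are modified in place and also returned.
    """
    if len(items) < 2:
        raise ValueError("need at least two items to form a train/held-out split")
    rng = random.Random(seed)
    for _ in range(rounds):
        order = items[:]
        rng.shuffle(order)
        half = len(order) // 2
        train, held_out = order[:half], order[half:]
        model = train_centroids(
            [it["real"] for it in train], [it["chosen"] for it in train]
        )
        for it in held_out:
            candidates = [it["chosen"]] + it["pool"]
            # Keep the candidate that the current classifier finds hardest to flag.
            best = min(candidates, key=lambda c: fake_score(model, c))
            if best != it["chosen"]:
                it["pool"].remove(best)
                it["pool"].append(it["chosen"])
                it["chosen"] = best
    return items
```

Because the classifier is retrained every round on the current choices, an ending that looked hard to detect in one round can be replaced in a later one; that moving target is the point of iterating.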

2017

Zero-Shot Activity Recognition with Verb Attribute Induction
Rowan Zellers | Yejin Choi
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

In this paper, we investigate large-scale zero-shot activity recognition by modeling the visual and linguistic attributes of action verbs. For example, the verb salute has several properties, such as being a light movement, a social act, and short in duration. We use these attributes as the internal mapping between visual and textual representations to reason about a previously unseen action. In contrast to much prior work that assumes access to gold standard attributes for zero-shot classes and focuses primarily on object attributes, our model uniquely learns to infer action attributes from dictionary definitions and distributed word representations. Experimental results confirm that action attributes inferred from language can provide a predictive signal for zero-shot prediction of previously unseen activities.
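
As a loose illustration of attribute-based zero-shot prediction, and not the model from the paper, the sketch below assumes a visual model has already produced attribute scores for an image. The attribute names echo the salute example above, but the verbs other than salute, all numeric values, and the `predict_verb` helper are invented for illustration. Here the per-verb attribute vectors are set by hand, whereas the paper infers them from dictionary definitions and word representations.

```python
# Hypothetical attribute order: (light_movement, social_act, short_duration).
# Values are made up for illustration; they are not taken from the paper.
VERB_ATTRIBUTES = {
    "salute": (1.0, 1.0, 1.0),
    "chat": (1.0, 1.0, 0.0),
    "hike": (0.0, 0.0, 0.0),
}


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def predict_verb(predicted_attributes, verb_attributes=VERB_ATTRIBUTES):
    # Pick the verb whose attribute vector lies closest to the scores
    # predicted from the image. No training images of that verb are needed,
    # only its attribute vector.
    return min(
        verb_attributes,
        key=lambda verb: squared_distance(predicted_attributes, verb_attributes[verb]),
    )


print(predict_verb((0.9, 0.8, 0.7)))  # closest to the "salute" vector
```

The attribute vector acts as the shared space: images map into it through a learned predictor, verbs map into it through language, and an unseen verb can be recognized as long as its attributes are known.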