Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Heike Adel, Shuming Shi (Editors)

Anthology ID:
Online and Punta Cana, Dominican Republic
Association for Computational Linguistics
Bib Export formats:

pdf bib
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Heike Adel | Shuming Shi

pdf bib
LMdiff : A Visual Diff Tool to Compare Language ModelsLMdiff: A Visual Diff Tool to Compare Language Models
Hendrik Strobelt | Benjamin Hoover | Arvind Satyanaryan | Sebastian Gehrmann

While different language models are ubiquitous in NLP, it is hard to contrast their outputs and identify which contexts one can handle better than the other. To address this question, we introduce LMdiff, a tool that visually compares probability distributions of two models that differ, e.g., through finetuning, distillation, or simply training with different parameter sizes. LMdiff allows the generation of hypotheses about model behavior by investigating text instances token by token and further assists in choosing these interesting text instances by identifying the most interesting phrases from large corpora. We showcase the applicability of LMdiff for hypothesis generation across multiple case studies. A demo is available at

pdf bib
Beyond Accuracy : A Consolidated Tool for Visual Question Answering Benchmarking
Dirk Väth | Pascal Tilli | Ngoc Thang Vu

On the way towards general Visual Question Answering (VQA) systems that are able to answer arbitrary questions, the need arises for evaluation beyond single-metric leaderboards for specific datasets. To this end, we propose a browser-based benchmarking tool for researchers and challenge organizers, with an API for easy integration of new models and datasets to keep up with the fast-changing landscape of VQA. Our tool helps test generalization capabilities of models across multiple datasets, evaluating not just accuracy, but also performance in more realistic real-world scenarios such as robustness to input noise. Additionally, we include metrics that measure biases and uncertainty, to further explain model behavior. Interactive filtering facilitates discovery of problematic behavior, down to the data sample level. As proof of concept, we perform a case study on four models. We find that state-of-the-art VQA models are optimized for specific tasks or datasets, but fail to generalize even to other in-domain test sets, for example they can not recognize text in images. Our metrics allow us to quantify which image and question embeddings provide most robustness to a model. All code s publicly available.

pdf bib
Athena 2.0 : Contextualized Dialogue Management for an Alexa Prize SocialBotAlexa Prize SocialBot
Juraj Juraska | Kevin Bowden | Lena Reed | Vrindavan Harrison | Wen Cui | Omkar Patil | Rishi Rajasekaran | Angela Ramirez | Cecilia Li | Eduardo Zamora | Phillip Lee | Jeshwanth Bheemanpally | Rohan Pandey | Adwait Ratnaparkhi | Marilyn Walker

Athena 2.0 is an Alexa Prize SocialBot that has been a finalist in the last two Alexa Prize Grand Challenges. One reason for Athena’s success is its novel dialogue management strategy, which allows it to dynamically construct dialogues and responses from component modules, leading to novel conversations with every interaction. Here we describe Athena’s system design and performance in the Alexa Prize during the 20/21 competition. A live demo of Athena as well as video recordings will provoke discussion on the state of the art in conversational AI.

pdf bib
UMR-Writer : A Web Application for Annotating Uniform Meaning RepresentationsUMR-Writer: A Web Application for Annotating Uniform Meaning Representations
Jin Zhao | Nianwen Xue | Jens Van Gysel | Jinho D. Choi

We present UMR-Writer, a web-based application for annotating Uniform Meaning Representations (UMR), a graph-based, cross-linguistically applicable semantic representation developed recently to support the development of interpretable natural language applications that require deep semantic analysis of texts. We present the functionalities of UMR-Writer and discuss the challenges in developing such a tool and how they are addressed.

pdf bib
Summary Explorer : Visualizing the State of the Art in Text Summarization
Shahbaz Syed | Tariq Yousef | Khalid Al Khatib | Stefan Jänicke | Martin Potthast

This paper introduces Summary Explorer, a new tool to support the manual inspection of text summarization systems by compiling the outputs of 55 state-of-the-art single document summarization approaches on three benchmark datasets, and visually exploring them during a qualitative assessment. The underlying design of the tool considers three well-known summary quality criteria (coverage, faithfulness, and position bias), encapsulated in a guided assessment based on tailored visualizations. The tool complements existing approaches for locally debugging summarization models and improves upon them. The tool is available at

pdf bib
MeetDot : Videoconferencing with Live Translation CaptionsMeetDot: Videoconferencing with Live Translation Captions
Arkady Arkhangorodsky | Christopher Chu | Scot Fang | Yiqi Huang | Denglin Jiang | Ajay Nagesh | Boliang Zhang | Kevin Knight

We present MeetDot, a videoconferencing system with live translation captions overlaid on screen. The system aims to facilitate conversation between people who speak different languages, thereby reducing communication barriers between multilingual participants. Currently, our system supports speech and captions in 4 languages and combines automatic speech recognition (ASR) and machine translation (MT) in a cascade. We use the re-translation strategy to translate the streamed speech, resulting in caption flicker. Additionally, our system has very strict latency requirements to have acceptable call quality. We implement several features to enhance user experience and reduce their cognitive load, such as smooth scrolling captions and reducing caption flicker. The modular architecture allows us to integrate different ASR and MT services in our backend. Our system provides an integrated evaluation suite to optimize key intrinsic evaluation metrics such as accuracy, latency and erasure. Finally, we present an innovative cross-lingual word-guessing game as an extrinsic evaluation metric to measure end-to-end system performance. We plan to make our system open-source for research purposes.

pdf bib
LexiClean : An annotation tool for rapid multi-task lexical normalisationLexiClean: An annotation tool for rapid multi-task lexical normalisation
Tyler Bikaun | Tim French | Melinda Hodkiewicz | Michael Stewart | Wei Liu

NLP systems are often challenged by difficulties arising from noisy, non-standard, and domain specific corpora. The task of lexical normalisation aims to standardise such corpora, but currently lacks suitable tools to acquire high-quality annotated data to support deep learning based approaches. In this paper, we present LexiClean, the first open-source web-based annotation tool for multi-task lexical normalisation. LexiClean’s main contribution is support for simultaneous in situ token-level modification and annotation that can be rapidly applied corpus wide. We demonstrate the usefulness of our tool through a case study on two sets of noisy corpora derived from the specialised-domain of industrial mining. We show that LexiClean allows for the rapid and efficient development of high-quality parallel corpora. A demo of our system is available at :

pdf bib
T3-Vis : visual analytic for Training and fine-Tuning Transformers in NLPNLP
Raymond Li | Wen Xiao | Lanjun Wang | Hyeju Jang | Giuseppe Carenini

Transformers are the dominant architecture in NLP, but their training and fine-tuning is still very challenging. In this paper, we present the design and implementation of a visual analytic framework for assisting researchers in such process, by providing them with valuable insights about the model’s intrinsic properties and behaviours. Our framework offers an intuitive overview that allows the user to explore different facets of the model (e.g., hidden states, attention) through interactive visualization, and allows a suite of built-in algorithms that compute the importance of model components and different parts of the input sequence. Case studies and feedback from a user focus group indicate that the framework is useful, and suggest several improvements. Our framework is available at :

pdf bib
OpenFraming : Open-sourced Tool for Computational Framing Analysis of Multilingual DataOpenFraming: Open-sourced Tool for Computational Framing Analysis of Multilingual Data
Vibhu Bhatia | Vidya Prasad Akavoor | Sejin Paik | Lei Guo | Mona Jalal | Alyssa Smith | David Assefa Tofu | Edward Edberg Halim | Yimeng Sun | Margrit Betke | Prakash Ishwar | Derry Tanti Wijaya

When journalists cover a news story, they can cover the story from multiple angles or perspectives. These perspectives are called frames, and usage of one frame or another may influence public perception and opinion of the issue at hand. We develop a web-based system for analyzing frames in multilingual text documents. We propose and guide users through a five-step end-to-end computational framing analysis framework grounded in media framing theory in communication research. Users can use the framework to analyze multilingual text data, starting from the exploration of frames in user’s corpora and through review of previous framing literature (step 1-3) to frame classification (step 4) and prediction (step 5). The framework combines unsupervised and supervised machine learning and leverages a state-of-the-art (SoTA) multilingual language model, which can significantly enhance frame prediction performance while requiring a considerably small sample of manual annotations. Through the interactive website, anyone can perform the proposed computational framing analysis, making advanced computational analysis available to researchers without a programming background and bridging the digital divide within the communication research discipline in particular and the academic community in general. The system is available online at, via an API, or through our GitHub page

pdf bib
CroAno : A Crowd Annotation Platform for Improving Label Consistency of Chinese NER DatasetCroAno : A Crowd Annotation Platform for Improving Label Consistency of Chinese NER Dataset
Baoli Zhang | Zhucong Li | Zhen Gan | Yubo Chen | Jing Wan | Kang Liu | Jun Zhao | Shengping Liu | Yafei Shi

In this paper, we introduce CroAno, a web-based crowd annotation platform for the Chinese named entity recognition (NER). Besides some basic features for crowd annotation like fast tagging and data management, CroAno provides a systematic solution for improving label consistency of Chinese NER dataset. 1) Disagreement Adjudicator : CroAno uses a multi-dimensional highlight mode to visualize instance-level inconsistent entities and makes the revision process user-friendly. 2) Inconsistency Detector : CroAno employs a detector to locate corpus-level label inconsistency and provides users an interface to correct inconsistent entities in batches. 3) Prediction Error Analyzer : We deconstruct the entity prediction error of the model to six fine-grained entity error types. Users can employ this error system to detect corpus-level inconsistency from a model perspective. To validate the effectiveness of our platform, we use CroAno to revise two public datasets. In the two revised datasets, we get an improvement of +1.96 % and +2.57 % F1 respectively in model performance.

pdf bib
SeqAttack : On Adversarial Attacks for Named Entity RecognitionSeqAttack: On Adversarial Attacks for Named Entity Recognition
Walter Simoncini | Gerasimos Spanakis

Named Entity Recognition is a fundamental task in information extraction and is an essential element for various Natural Language Processing pipelines. Adversarial attacks have been shown to greatly affect the performance of text classification systems but knowledge about their effectiveness against named entity recognition models is limited. This paper investigates the effectiveness and portability of adversarial attacks from text classification to named entity recognition and the ability of adversarial training to counteract these attacks. We find that character-level and word-level attacks are the most effective, but adversarial training can grant significant protection at little to no expense of standard performance. Alongside our results, we also release SeqAttack, a framework to conduct adversarial attacks against token classification models (used in this work for named entity recognition) and a companion web application to inspect and cherry pick adversarial examples.

pdf bib
DRIFT : A Toolkit for Diachronic Analysis of Scientific LiteratureDRIFT: A Toolkit for Diachronic Analysis of Scientific Literature
Abheesht Sharma | Gunjan Chhablani | Harshit Pandey | Rajaswa Patil

In this work, we present to the NLP community, and to the wider research community as a whole, an application for the diachronic analysis of research corpora. We open source an easy-to-use tool coined DRIFT, which allows researchers to track research trends and development over the years. The analysis methods are collated from well-cited research works, with a few of our own methods added for good measure. Succinctly put, some of the analysis methods are : keyword extraction, word clouds, predicting declining / stagnant / growing trends using Productivity, tracking bi-grams using Acceleration plots, finding the Semantic Drift of words, tracking trends using similarity, etc. To demonstrate the utility and efficacy of our tool, we perform a case study on the cs. CL corpus of the arXiv repository and draw inferences from the analysis methods. The toolkit and the associated code are available here :