Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems

Steffen Eger, Yang Gao, Maxime Peyrard, Wei Zhao, Eduard Hovy (Editors)


Anthology ID:
2020.eval4nlp-1
Month:
November
Year:
2020
Address:
Online
Venues:
EMNLP | Eval4NLP
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/2020.eval4nlp-1

Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems
Steffen Eger | Yang Gao | Maxime Peyrard | Wei Zhao | Eduard Hovy

Item Response Theory for Efficient Human Evaluation of Chatbots
João Sedoc | Lyle Ungar

Conversational agent quality is currently assessed using human evaluation, and often requires an exorbitant number of comparisons to achieve statistical significance. In this paper, we introduce Item Response Theory (IRT) for chatbot evaluation, using a paired comparison in which annotators judge which system responds better to the next turn of a conversation. IRT is widely used in educational testing for simultaneously assessing the ability of test takers and the quality of test questions. It is similarly well suited for chatbot evaluation since it allows the assessment of both models and the prompts used to evaluate them. We use IRT to efficiently assess chatbots, and show that different examples from the evaluation set are better suited for comparing high-quality (nearer to human performance) than low-quality systems. Finally, we use IRT to reduce the number of evaluation examples assessed by human annotators while retaining discriminative power.
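
As an illustration of the kind of model involved, the sketch below fits a Bradley-Terry/1PL-style paired-comparison model in which each system has an ability parameter and each prompt a discrimination parameter. This is only a minimal stand-in for the authors' IRT formulation; the function name and toy data are hypothetical.

```python
# Minimal sketch of an IRT-style paired-comparison model for chatbot evaluation.
# Illustrative only: a Bradley-Terry/1PL-style formulation, not the paper's exact model.
import numpy as np

def fit_paired_irt(comparisons, n_systems, n_prompts, lr=0.05, steps=2000):
    """comparisons: list of (winner, loser, prompt) index triples."""
    theta = np.zeros(n_systems)   # system ability
    a = np.ones(n_prompts)        # prompt discrimination
    for _ in range(steps):
        g_theta = np.zeros_like(theta)
        g_a = np.zeros_like(a)
        for w, l, p in comparisons:
            z = a[p] * (theta[w] - theta[l])
            p_win = 1.0 / (1.0 + np.exp(-z))   # P(winner beats loser)
            err = 1.0 - p_win                  # gradient of the log-likelihood w.r.t. z
            g_theta[w] += err * a[p]
            g_theta[l] -= err * a[p]
            g_a[p] += err * (theta[w] - theta[l])
        theta += lr * g_theta
        a += lr * g_a
        theta -= theta.mean()                  # center abilities for identifiability
    return theta, a

# toy usage: system 1 usually beats system 0, more clearly on prompt 0 than prompt 1
wins = [(1, 0, 0)] * 8 + [(0, 1, 0)] * 2 + [(1, 0, 1)] * 6 + [(0, 1, 1)] * 4
theta, a = fit_paired_irt(wins, n_systems=2, n_prompts=2)
print(theta, a)  # prompt 0 should come out as the more discriminative one
```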

ViLBERTScore: Evaluating Image Caption Using Vision-and-Language BERT
Hwanhee Lee | Seunghyun Yoon | Franck Dernoncourt | Doo Soon Kim | Trung Bui | Kyomin Jung

In this paper, we propose an evaluation metric for image captioning systems using both image and text information. Unlike previous methods that rely on textual representations in evaluating the caption, our approach uses visiolinguistic representations. The proposed method generates image-conditioned embeddings for each token using ViLBERT from both generated and reference texts. Then, the contextual embeddings of the two sentences in each pair are compared to compute the similarity score. Experimental results on three benchmark datasets show that our method correlates significantly better with human judgments than all existing metrics.
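
For intuition, the sketch below shows a BERTScore-style greedy matching step over token embeddings; the ViLBERT encoding itself is omitted, and `cand_emb`/`ref_emb` are hypothetical arrays standing in for the image-conditioned embeddings the paper describes.

```python
# Greedy cosine matching over token embeddings, assuming the image-conditioned
# embeddings have already been produced by ViLBERT (that step is omitted here).
import numpy as np

def greedy_similarity(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """F1-style score from greedy token matching on cosine similarity."""
    c = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                          # pairwise cosine similarities
    precision = sim.max(axis=1).mean()     # candidate token -> best reference token
    recall = sim.max(axis=0).mean()        # reference token -> best candidate token
    return 2 * precision * recall / (precision + recall)

# toy usage with random arrays standing in for ViLBERT outputs
rng = np.random.default_rng(0)
print(greedy_similarity(rng.normal(size=(7, 768)), rng.normal(size=(9, 768))))
```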

BLEU Neighbors: A Reference-less Approach to Automatic Evaluation
Kawin Ethayarajh | Dorsa Sadigh

Evaluation is a bottleneck in the development of natural language generation (NLG) models. Automatic metrics such as BLEU rely on references, but for tasks such as open-ended generation, there are no references to draw upon. Although language diversity can be estimated using statistical measures such as perplexity, measuring language quality requires human evaluation. However, because human evaluation at scale is slow and expensive, it is used sparingly; it cannot be used to rapidly iterate on NLG models the way BLEU is used for machine translation. To this end, we propose BLEU Neighbors, a nearest neighbors model for estimating language quality by using the BLEU score as a kernel function. On existing datasets for chitchat dialogue and open-ended sentence generation, we find that on average the quality estimation from a BLEU Neighbors model has a lower mean squared error and higher Spearman correlation with the ground truth than individual human annotators. Despite its simplicity, BLEU Neighbors even outperforms state-of-the-art models on automatically grading essays, including models that have access to a gold-standard reference essay.
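
The core idea lends itself to a very small sketch: estimate the quality of a new text as the average human score of its nearest neighbors under sentence-level BLEU. The kernel details, neighborhood size, and any similarity cutoff used in the paper may differ; the bank below is toy data.

```python
# BLEU-Neighbors-style quality estimation: score a candidate by the mean human
# rating of the k bank texts with the highest sentence-level BLEU against it.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def bleu_neighbors_score(candidate, bank, k=5):
    """bank: list of (text, human_quality_score) pairs; candidate: str."""
    smooth = SmoothingFunction().method1
    sims = []
    for text, score in bank:
        bleu = sentence_bleu([text.split()], candidate.split(),
                             smoothing_function=smooth)
        sims.append((bleu, score))
    top = sorted(sims, reverse=True)[:k]   # k nearest neighbors under BLEU
    return sum(s for _, s in top) / len(top)

# toy labelled bank of texts with human quality ratings
bank = [("the movie was great fun", 0.9),
        ("i really enjoyed the film", 0.8),
        ("terrible plot and bad acting", 0.2)]
print(bleu_neighbors_score("the film was great", bank, k=2))
```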

Artemis: A Novel Annotation Methodology for Indicative Single Document Summarization
Rahul Jha | Keping Bi | Yang Li | Mahdi Pakdaman | Asli Celikyilmaz | Ivan Zhiboedov | Kieran McDonald

We describe Artemis (Annotation methodology for Rich, Tractable, Extractive, Multi-domain, Indicative Summarization), a novel hierarchical annotation process that produces indicative summaries for documents from multiple domains. Current summarization evaluation datasets are single-domain and focused on a few domains for which naturally occurring summaries can be easily found, such as news and scientific articles. These are not sufficient for training and evaluation of summarization models for use in document management and information retrieval systems, which need to deal with documents from multiple domains. Compared to other annotation methods such as Relative Utility and Pyramid, Artemis is more tractable because judges don't need to look at all the sentences in a document when making an importance judgment for one of the sentences, while providing similarly rich sentence importance annotations. We describe the annotation process in detail and compare it with other similar evaluation systems. We also present analysis and experimental results over a sample set of 532 annotated documents.

Probabilistic Extension of Precision, Recall, and F1 Score for More Thorough Evaluation of Classification Models
Reda Yacouby | Dustin Axman

In pursuit of the perfect supervised NLP classifier, razor-thin margins and low-resource test sets can make modeling decisions difficult. Popular metrics such as Accuracy, Precision, and Recall are often insufficient as they fail to give a complete picture of the model’s behavior. We present a probabilistic extension of Precision, Recall, and F1 score, which we refer to as confidence-Precision (cPrecision), confidence-Recall (cRecall), and confidence-F1 (cF1), respectively. The proposed metrics address some of the challenges faced when evaluating large-scale NLP systems, specifically when the model’s confidence score assignments have an impact on the system’s behavior. We describe four key benefits of our proposed metrics as compared to their threshold-based counterparts. Two of these benefits, which we refer to as robustness to missing values and sensitivity to model confidence score assignments, are self-evident from the metrics’ definitions; the remaining benefits, generalization and functional consistency, are demonstrated empirically.
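
One plausible soft-count reading of such metrics, replacing hard true/false-positive counts with confidence mass, is sketched below. The paper's exact definitions of cPrecision, cRecall, and cF1 may differ, so this is illustrative only.

```python
# Illustrative soft-count formulation of confidence-weighted precision/recall/F1.
# `probs` are the model's confidences for the positive class, `labels` the gold binary labels.
import numpy as np

def confidence_prf1(probs, labels):
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    c_tp = np.sum(probs * labels)          # confidence mass on true positives
    c_fp = np.sum(probs * (1 - labels))    # confidence mass on false positives
    c_fn = np.sum((1 - probs) * labels)    # missing confidence on positives
    c_precision = c_tp / (c_tp + c_fp)
    c_recall = c_tp / (c_tp + c_fn)
    c_f1 = 2 * c_precision * c_recall / (c_precision + c_recall)
    return c_precision, c_recall, c_f1

print(confidence_prf1([0.9, 0.6, 0.2, 0.8], [1, 1, 0, 0]))
```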

On Aligning OpenIE Extractions with Knowledge Bases: A Case Study
Kiril Gashteovski | Rainer Gemulla | Bhushan Kotnis | Sven Hertling | Christian Meilicke

Open information extraction (OIE) is the task of extracting relations and their corresponding arguments from natural language text in an unsupervised manner. Outputs of such systems are used for downstream tasks such as question answering and automatic knowledge base (KB) construction. Many of these downstream tasks rely on aligning OIE triples with reference KBs. Such alignments are usually evaluated w.r.t. a specific downstream task and, to date, no direct manual evaluation of such alignments has been performed. In this paper, we directly evaluate how OIE triples from the OPIEC corpus are related to the DBpedia KB w.r.t. information content. First, we investigate OPIEC triples and DBpedia facts having the same arguments by comparing the information in the OIE surface relation with the KB relation. Second, we evaluate the expressibility of general OPIEC triples in DBpedia. We investigate whether, and if so how, a given OIE triple can be mapped to a single KB fact. We found that such mappings are not always possible because the information in the OIE triples tends to be more specific. Our evaluation suggests, however, that a significant part of OIE triples can be expressed by means of KB formulas instead of individual facts.
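
A minimal sketch of the argument-matching step, with hypothetical toy triples in place of OPIEC and DBpedia data, is given below; the actual study works over linked entities and is considerably more involved.

```python
# Group KB facts by their (subject, object) argument pair, then look up OIE
# triples sharing the same arguments so their surface relation can be compared
# with the KB relation. Toy data only, not the OPIEC/DBpedia pipeline.
from collections import defaultdict

kb_facts = [("Berlin", "capitalOf", "Germany"),
            ("Berlin", "locatedIn", "Germany")]
oie_triples = [("Berlin", "is the capital of", "Germany"),
               ("Berlin", "lies in", "Europe")]

kb_by_args = defaultdict(list)
for subj, rel, obj in kb_facts:
    kb_by_args[(subj, obj)].append(rel)

for subj, surface_rel, obj in oie_triples:
    matches = kb_by_args.get((subj, obj), [])
    print(f"({subj}, {surface_rel}, {obj}) -> KB relations with same arguments: {matches}")
```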

ClusterDataSplit: Exploring Challenging Clustering-Based Data Splits for Model Performance Evaluation
Hanna Wecker | Annemarie Friedrich | Heike Adel

This paper adds to the ongoing discussion in the natural language processing community on how to choose a good development set. Motivated by the real-life necessity of applying machine learning models to different data distributions, we propose a clustering-based data splitting algorithm. It creates development (or test) sets which are lexically different from the training data while ensuring similar label distributions. Hence, we are able to create challenging cross-validation evaluation setups while abstracting away from performance differences resulting from label distribution shifts between training and test data. In addition, we present a Python-based tool for analyzing and visualizing data split characteristics and model performance. We illustrate the workings and results of our approach using a sentiment analysis and a patent classification task.
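
A simplified sketch of such a split, clustering documents on TF-IDF features and assigning whole clusters to folds, is shown below; the released tool additionally balances label distributions and provides visualization, which this sketch does not attempt.

```python
# Simplified clustering-based split: cluster documents lexically, then assign
# whole clusters to folds, keeping fold sizes roughly balanced.
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_split(texts, n_clusters=4, n_folds=2, seed=0):
    X = TfidfVectorizer().fit_transform(texts)
    cluster_ids = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X)
    clusters = defaultdict(list)
    for idx, c in enumerate(cluster_ids):
        clusters[c].append(idx)
    folds = [[] for _ in range(n_folds)]
    # assign the largest clusters first, always to the currently smallest fold
    for members in sorted(clusters.values(), key=len, reverse=True):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members)
    return folds

texts = ["great phone battery", "battery life is great", "boring movie plot",
         "the plot was dull", "patent for a new battery", "film was fun"]
print(cluster_split(texts, n_clusters=3, n_folds=2))
```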

Best Practices for Crowd-based Evaluation of German Summarization: Comparing Crowd, Expert and Automatic Evaluation
Neslihan Iskender | Tim Polzehl | Sebastian Möller

One of the main challenges in the development of summarization tools is summarization quality evaluation. On the one hand, human assessment of summarization quality conducted by linguistic experts is slow, expensive, and still not a standardized procedure. On the other hand, automatic assessment metrics are reported not to correlate well enough with human quality ratings. As a solution, we propose crowdsourcing as a fast, scalable, and cost-effective alternative to expert evaluation for assessing the intrinsic and extrinsic quality of summaries, comparing crowd ratings with expert ratings and automatic metrics such as ROUGE, BLEU, or BERTScore on a German summarization data set. Our results provide a basis for best practices for crowd-based summarization evaluation regarding major influential factors such as the best annotation aggregation method, the influence of readability and reading effort on summarization evaluation, and the optimal number of crowd workers needed to achieve results comparable to those of experts, especially when determining factors such as overall quality, grammaticality, referential clarity, focus, structure & coherence, summary usefulness, and summary informativeness.

Evaluating Word Embeddings on Low-Resource Languages
Nathan Stringham | Mike Izbicki

The analogy task introduced by Mikolov et al. (2013) has become the standard metric for tuning the hyperparameters of word embedding models. In this paper, however, we argue that the analogy task is unsuitable for low-resource languages for two reasons: (1) it requires that word embeddings be trained on large amounts of text, and (2) analogies may not be well-defined in some low-resource settings. We solve these problems by introducing the OddOneOut and Topk tasks, which are specifically designed for model selection in the low-resource setting. We use these metrics to successfully tune hyperparameters for a low-resource emoji embedding task and word embeddings on 16 extinct languages. The largest of these languages (Ancient Hebrew) has a 41-million-token dataset, and the smallest (Old Gujarati) has a dataset of only 1,813 tokens.
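
A simplified reading of the OddOneOut idea is sketched below: given words from one category plus an intruder, the embedding is credited when the word least similar to the others is the intruder. The scoring details in the paper may differ, and `vectors` is a hypothetical word-to-vector mapping.

```python
# Simplified OddOneOut-style evaluation of a word embedding.
import numpy as np

def odd_one_out(words, vectors):
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    avg_sim = []
    for i, w in enumerate(words):
        others = [vectors[o] for j, o in enumerate(words) if j != i]
        avg_sim.append(np.mean([cos(vectors[w], o) for o in others]))
    return words[int(np.argmin(avg_sim))]   # the word least similar to the rest

def odd_one_out_accuracy(test_sets, vectors):
    """test_sets: list of (category_words, intruder) pairs."""
    hits = sum(odd_one_out(cat + [odd], vectors) == odd for cat, odd in test_sets)
    return hits / len(test_sets)

# toy usage with handcrafted 2-d vectors
vecs = {"cat": np.array([1.0, 0.1]), "dog": np.array([0.9, 0.2]),
        "lion": np.array([0.8, 0.15]), "spoon": np.array([0.0, 1.0])}
print(odd_one_out_accuracy([(["cat", "dog", "lion"], "spoon")], vecs))  # expect 1.0
```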