Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)

Anya Belz, Shubham Agarwal, Yvette Graham, Ehud Reiter, Anastasia Shimorina (Editors)


Anthology ID: 2021.humeval-1
Month: April
Year: 2021
Address: Online
Venues: EACL | HumEval
Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/2021.humeval-1
PDF: https://aclanthology.org/2021.humeval-1.pdf

Proceedings of the Workshop on Human Evaluation of NLP Systems (HumEval)
Anya Belz | Shubham Agarwal | Yvette Graham | Ehud Reiter | Anastasia Shimorina

Trading Off Diversity and Quality in Natural Language Generation
Hugh Zhang | Daniel Duckworth | Daphne Ippolito | Arvind Neelakantan

For open-ended language generation tasks such as storytelling or dialogue, choosing the right decoding algorithm is vital for controlling the tradeoff between generation quality and diversity. However, there presently exists no consensus on which decoding procedure is best or even the criteria by which to compare them. In this paper, we cast decoding as a tradeoff between response quality and diversity, and we perform the first large-scale evaluation of decoding methods along the entire quality-diversity spectrum. Our experiments confirm the existence of the likelihood trap: the counter-intuitive observation that high likelihood sequences are often surprisingly low quality. We also find that when diversity is a priority, all methods perform similarly, but when quality is viewed as more important, nucleus sampling (Holtzman et al., 2019) outperforms all other evaluated decoding algorithms.
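
The nucleus sampling procedure referenced above can be summarised in a few lines. The sketch below is an assumed, simplified illustration of the top-p rule of Holtzman et al. (2019) applied to a single logit vector; it is not code from the paper, and the toy vocabulary and cutoff value are made up for the example.

```python
import numpy as np

def nucleus_sample(logits, p=0.95, rng=None):
    """Sample one token id from the smallest set of most-probable tokens
    whose cumulative probability reaches p (nucleus / top-p sampling)."""
    rng = rng if rng is not None else np.random.default_rng()
    probs = np.exp(logits - logits.max())                   # numerically stable softmax
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]                         # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1        # size of the nucleus
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()   # renormalise within the nucleus
    return int(rng.choice(nucleus, p=nucleus_probs))

# Toy 5-token vocabulary: with p=0.9 the low-probability tail is never sampled.
logits = np.array([2.0, 1.5, 0.5, -1.0, -3.0])
print(nucleus_sample(logits, p=0.9))
```

Lowering p shrinks the candidate set, trading diversity for quality; this is the kind of quality-diversity knob the paper evaluates across decoding methods.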

Is This Translation Error Critical?: Classification-Based Human and Automatic Machine Translation Evaluation Focusing on Critical Errors
Katsuhito Sudoh | Kosuke Takahashi | Satoshi Nakamura

This paper discusses a classification-based approach to machine translation evaluation, as opposed to the common regression-based approach in the WMT Metrics task. Recent machine translation systems usually work well but sometimes make critical errors caused by just a few wrong word choices. Our classification-based approach focuses on such errors using several error type labels, for practical machine translation evaluation in the age of neural machine translation. We made additional annotations on the WMT 2015-2017 Metrics datasets with fluency and adequacy labels to distinguish different types of translation errors from syntactic and semantic viewpoints. We present our human evaluation criteria for the corpus development and report automatic evaluation experiments using the corpus. The human evaluation corpus will be made publicly available upon publication.

Towards Objectively Evaluating the Quality of Generated Medical Summaries
Francesco Moramarco | Damir Juric | Aleksandar Savkov | Ehud Reiter

We propose a method for evaluating the quality of generated text by asking evaluators to count facts and computing precision, recall, F-score, and accuracy from the raw counts. We believe this approach leads to a more objective and more easily reproducible evaluation. We apply it to the task of medical report summarisation, where measuring objective quality and accuracy is of paramount importance.
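
As a rough illustration of how raw fact counts turn into scores, the sketch below assumes evaluators report counts of correct, incorrect (hallucinated), and omitted facts for a generated summary; the counting scheme and function name are hypothetical, and the paper additionally reports accuracy, which is omitted here.

```python
def fact_counts_to_scores(correct, incorrect, omitted):
    """Derive precision, recall, and F-score from raw fact counts.

    Assumed counting scheme (illustrative, not the paper's exact protocol):
      correct   - facts in the generated summary supported by the source
      incorrect - facts in the summary that are wrong or unsupported
      omitted   - source facts missing from the summary
    """
    generated = correct + incorrect                  # facts the summary asserts
    reference = correct + omitted                    # facts it should have covered
    precision = correct / generated if generated else 0.0
    recall = correct / reference if reference else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f_score": f_score}

# Example: 8 correct facts, 2 hallucinated, 3 omitted.
print(fact_counts_to_scores(correct=8, incorrect=2, omitted=3))
# {'precision': 0.8, 'recall': 0.727..., 'f_score': 0.761...}
```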

Reliability of Human Evaluation for Text Summarization: Lessons Learned and Challenges Ahead
Neslihan Iskender | Tim Polzehl | Sebastian Möller

Only a small portion of research papers with human evaluation for text summarization provide information about participant demographics, task design, and experiment protocol. Additionally, many researchers use human evaluation as a gold standard without questioning its reliability or investigating the factors that might affect it. As a result, there is a lack of best practices for reliable human summarization evaluation grounded in empirical evidence. To investigate human evaluation reliability, we conduct a series of human evaluation experiments, provide an overview of participant demographics, task design, and experimental set-up, and compare the results across experiments. Based on our empirical analysis, we provide guidelines to ensure the reliability of expert and non-expert evaluations, and we determine the factors that might affect that reliability.

Interrater Disagreement Resolution: A Systematic Procedure to Reach Consensus in Annotation Tasks
Yvette Oortwijn | Thijs Ossenkoppele | Arianna Betti

We present a systematic procedure for interrater disagreement resolution. The procedure is general, but of particular use in multiple-annotator tasks geared towards ground truth construction. We motivate our proposal by arguing that, barring cases in which the researchers’ goal is to elicit different viewpoints, interrater disagreement is a sign of poor quality in the design or the description of a task. Consensus among annotators, we maintain, should be striven for through a systematic procedure for disagreement resolution such as the one we describe.