Proceedings of the 3rd Workshop on Machine Reading for Question Answering

Adam Fisch, Alon Talmor, Danqi Chen, Eunsol Choi, Minjoon Seo, Patrick Lewis, Robin Jia, Sewon Min (Editors)

Punta Cana, Dominican Republic
Association for Computational Linguistics

MFAQ: a Multilingual FAQ Dataset
Maxime De Bruyn | Ehsan Lotfi | Jeska Buhmann | Walter Daelemans

In this paper, we present the first publicly available multilingual FAQ dataset. We collected around 6M FAQ pairs from the web, in 21 different languages. Although this is significantly larger than existing FAQ retrieval datasets, it comes with its own challenges: duplication of content and uneven distribution of topics. We adopt a setup similar to Dense Passage Retrieval (DPR) and test various bi-encoders on this dataset. Our experiments reveal that a multilingual model based on XLM-RoBERTa achieves the best results, except for English. Lower-resource languages seem to learn from one another, as a multilingual model achieves a higher MRR than language-specific ones. Our qualitative analysis reveals the brittleness of the model on simple word changes. We publicly release our dataset, model, and training script.
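The abstract reports results as MRR (mean reciprocal rank). As a minimal sketch of how that metric is usually computed, assuming each query has exactly one correct answer and the retriever returns a ranked list of candidate IDs (the function name and the toy IDs are hypothetical, not taken from the MFAQ release):

```python
from typing import Sequence


def mean_reciprocal_rank(
    ranked_lists: Sequence[Sequence[str]], gold: Sequence[str]
) -> float:
    """Average of 1/rank of the correct answer; a query scores 0 if it is missing."""
    if len(ranked_lists) != len(gold):
        raise ValueError("ranked_lists and gold must have the same length")
    if not gold:
        return 0.0
    total = 0.0
    for candidates, answer in zip(ranked_lists, gold):
        # Ranks are 1-based: a correct answer in first position contributes 1.0.
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == answer:
                total += 1.0 / rank
                break
    return total / len(gold)


# Hypothetical toy run: answers found at rank 1, rank 3, and not at all.
rankings = [["a", "b"], ["x", "y", "z"], ["p"]]
answers = ["a", "z", "q"]
score = mean_reciprocal_rank(rankings, answers)
```

Here the three queries contribute 1, 1/3, and 0, so the score is their average.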

Can Question Generation Debias Question Answering Models? A Case Study on Question-Context Lexical Overlap
Kazutoshi Shinoda | Saku Sugawara | Akiko Aizawa

Question answering (QA) models for reading comprehension have been demonstrated to exploit unintended dataset biases such as question-context lexical overlap. This hinders QA models from generalizing to under-represented samples such as questions with low lexical overlap. Question generation (QG), a method for augmenting QA datasets, can be a solution for such performance degradation if QG can properly debias QA datasets. However, we discover that recent neural QG models are biased towards generating questions with high lexical overlap, which can amplify the dataset bias. Moreover, our analysis reveals that data augmentation with these QG models frequently impairs the performance on questions with low lexical overlap, while improving that on questions with high lexical overlap. To address this problem, we use a synonym replacement-based approach to augment questions with low lexical overlap. We demonstrate that the proposed data augmentation approach is simple yet effective to mitigate the degradation problem with only 70k synthetic examples.
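To make the notion of question-context lexical overlap concrete, here is a rough sketch that scores a question by the fraction of its word tokens that also occur in the context. This definition, the regex tokenizer, and the function name are assumptions for illustration; the paper may measure overlap differently.

```python
import re


def lexical_overlap(question: str, context: str) -> float:
    """Fraction of question tokens that also appear in the context (case-insensitive)."""
    question_tokens = re.findall(r"\w+", question.lower())
    context_vocab = set(re.findall(r"\w+", context.lower()))
    if not question_tokens:
        return 0.0
    shared = sum(token in context_vocab for token in question_tokens)
    return shared / len(question_tokens)


# Only "hamlet" is shared; "wrote" and "written" do not match as exact tokens.
overlap = lexical_overlap("Who wrote Hamlet?", "Hamlet was written by Shakespeare.")
```

Under this measure, a paraphrased question that swaps context words for synonyms scores lower, which is the kind of low-overlap sample the abstract describes as under-represented.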

Eliciting Bias in Question Answering Models through Ambiguity
Andrew Mao | Naveen Raman | Matthew Shu | Eric Li | Franklin Yang | Jordan Boyd-Graber

Question answering (QA) models use retriever and reader systems to answer questions. Reliance on training data by QA systems can amplify or reflect inequity through their responses. Many QA models, such as those for the SQuAD dataset, are trained and tested on a subset of Wikipedia articles, which encode their own biases and also reproduce real-world inequality. Understanding how training data affects bias in QA systems can inform methods to mitigate inequity. We develop two sets of questions, one for closed-domain and one for open-domain QA, which use ambiguous questions to probe QA models for bias. We feed our question sets to three deep-learning-based QA systems and evaluate their responses for bias with a set of metrics. Using our metrics, we find that open-domain QA models amplify biases more than their closed-domain counterparts and propose that biases in the retriever surface more readily due to greater freedom of choice.

Bilingual Alignment Pre-Training for Zero-Shot Cross-Lingual Transfer
Ziqing Yang | Wentao Ma | Yiming Cui | Jiani Ye | Wanxiang Che | Shijin Wang

Multilingual pre-trained models have achieved remarkable performance on cross-lingual transfer learning. Some multilingual models, such as mBERT, have been pre-trained on unlabeled corpora; therefore, the embeddings of different languages in the models may not be aligned very well. In this paper, we aim to improve the zero-shot cross-lingual transfer performance by proposing a pre-training task named Word-Exchange Aligning Model (WEAM), which uses the statistical alignment information as the prior knowledge to guide cross-lingual word prediction. We evaluate our model on the multilingual machine reading comprehension task MLQA and the natural language inference task XNLI. The results show that WEAM can significantly improve the zero-shot performance.

Investigating Post-pretraining Representation Alignment for Cross-Lingual Question Answering
Fahim Faisal | Antonios Anastasopoulos

Human knowledge is collectively encoded in the roughly 6500 languages spoken around the world, but it is not distributed equally across languages. Hence, for information-seeking question answering (QA) systems to adequately serve speakers of all languages, they need to operate cross-lingually. In this work we investigate the capabilities of multilingually pretrained language models on cross-lingual QA. We find that explicitly aligning the representations across languages with a post-hoc fine-tuning step generally leads to improved performance. We additionally investigate the effect of data size as well as the language choice in this fine-tuning step, and we release a dataset for evaluating cross-lingual QA systems.

Semantic Answer Similarity for Evaluating Question Answering Models
Julian Risch | Timo Möller | Julian Gutsch | Malte Pietsch

The evaluation of question answering models compares ground-truth annotations with model predictions. However, as of today, this comparison is mostly lexical-based and therefore misses out on answers that have no lexical overlap but are still semantically similar, thus treating correct answers as false. This underestimation of the true performance of models hinders user acceptance in applications and complicates a fair comparison of different models. Therefore, there is a need for an evaluation metric that is based on semantics instead of pure string similarity. In this short paper, we present SAS, a cross-encoder-based metric for the estimation of semantic answer similarity, and compare it to seven existing metrics. To this end, we create an English and a German three-way annotated evaluation dataset containing pairs of answers along with human judgment of their semantic similarity, which we release along with an implementation of the SAS metric and the experiments. We find that semantic similarity metrics based on recent transformer models correlate much better with human judgment than traditional lexical similarity metrics on our two newly created datasets and one dataset from related work.
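The limitation of lexical metrics described above can be illustrated with a simplified token-level F1, a common lexical measure for extractive QA. This is only a sketch of the lexical baseline, not the SAS metric itself; whitespace tokenization and the example answers are assumptions.

```python
from collections import Counter


def token_f1(prediction: str, truth: str) -> float:
    """Simplified token-level F1 between a predicted and a ground-truth answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = truth.lower().split()
    if not pred_tokens or not gold_tokens:
        # Two empty answers count as a match; one empty answer is a miss.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# Semantically equivalent answers with no shared tokens score zero.
no_overlap = token_f1("a hundred percent", "100%")
# A partially overlapping answer receives partial credit.
partial = token_f1("New York City", "New York")
```

The first pair is the case the abstract highlights: the answers mean the same thing, yet a purely lexical comparison treats the prediction as wrong.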