Annual Meeting of the Association for Computational Linguistics (2022)


Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Smaranda Muresan | Preslav Nakov | Aline Villavicencio

AdapLeR: Speeding up Inference by Adaptive Length Reduction
Ali Modarressi | Hosein Mohebbi | Mohammad Taher Pilehvar

Pre-trained language models have shown stellar performance in various downstream tasks. But this usually comes at the cost of high latency and computation, hindering their usage in resource-limited settings. In this work, we propose a novel approach for reducing the computational cost of BERT with minimal loss in downstream performance. Our method dynamically eliminates less contributing tokens through layers, resulting in shorter lengths and consequently lower computational cost. To determine the importance of each token representation, we train a Contribution Predictor for each layer using a gradient-based saliency method. Our experiments on several diverse classification tasks show speedups up to 22x during inference time without much sacrifice in performance. We also validate the quality of the selected tokens in our method using human annotations in the ERASER benchmark. In comparison to other widely used strategies for selecting important tokens, such as saliency and attention, our proposed method has a significantly lower false positive rate in generating rationales. Our code is freely available at

An Unsupervised Multiple-Task and Multiple-Teacher Model for Cross-lingual Named Entity Recognition
Zhuoran Li | Chunming Hu | Xiaohui Guo | Junfan Chen | Wenyi Qin | Richong Zhang

The cross-lingual named entity recognition task is one of the critical problems for evaluating potential transfer learning techniques on low-resource languages. Knowledge distillation using pre-trained multilingual language models between source and target languages has shown its superiority in transfer. However, existing cross-lingual distillation models merely consider the potential transferability between two identical single tasks across both domains. Other possible auxiliary tasks to improve the learning performance have not been fully investigated. In this study, based on the knowledge distillation framework and multi-task learning, we introduce the similarity metric model as an auxiliary task to improve the cross-lingual NER performance on the target domain. Specifically, an entity recognizer and a similarity evaluator are first trained in parallel as two teachers from the source domain. Then, two tasks in the student model are supervised by these teachers simultaneously. Empirical studies on three datasets across 7 different languages confirm the effectiveness of the proposed model.

Discriminative Marginalized Probabilistic Neural Method for Multi-Document Summarization of Medical Literature
Gianluca Moro | Luca Ragazzi | Lorenzo Valgimigli | Davide Freddi

Although current state-of-the-art Transformer-based solutions have succeeded in a wide range of single-document NLP tasks, they still struggle to address multi-input tasks such as multi-document summarization. Many solutions truncate the inputs, thus ignoring potential summary-relevant contents, which is unacceptable in the medical domain where each piece of information can be vital. Others leverage linear model approximations to apply multi-input concatenation, worsening the results because all information is considered, even if it is conflicting or noisy with respect to a shared background. Despite the importance and social impact of medicine, there are no ad-hoc solutions for multi-document summarization. For this reason, we propose a novel discriminative marginalized probabilistic method (DAMEN) trained to discriminate critical information from a cluster of topic-related medical documents and generate a multi-document summary via token probability marginalization. Results prove we outperform the previous state-of-the-art on a biomedical dataset for multi-document summarization of systematic literature reviews. Moreover, we perform extensive ablation studies to motivate the design choices and prove the importance of each module of our method.

MISC: A Mixed Strategy-Aware Model integrating COMET for Emotional Support Conversation
Quan Tu | Yanran Li | Jianwei Cui | Bin Wang | Ji-Rong Wen | Rui Yan

Applying existing methods to emotional support conversation, which provides valuable assistance to people who are in need, has two major limitations: (a) they generally employ a conversation-level emotion label, which is too coarse-grained to capture the user’s instant mental state; (b) most of them focus on expressing empathy in the response(s) rather than gradually reducing the user’s distress. To address these problems, we propose a novel model, MISC, which first infers the user’s fine-grained emotional status and then responds skillfully using a mixture of strategies. Experimental results on the benchmark dataset demonstrate the effectiveness of our method and reveal the benefits of fine-grained emotion understanding as well as mixed-up strategy modeling.

Cross-Utterance Conditioned VAE for Non-Autoregressive Text-to-Speech
Yang Li | Cheng Yu | Guangzhi Sun | Hua Jiang | Fanglei Sun | Weiqin Zu | Ying Wen | Yang Yang | Jun Wang

Modelling prosody variation is critical for synthesizing natural and expressive speech in end-to-end text-to-speech (TTS) systems. In this paper, a cross-utterance conditional VAE (CUC-VAE) is proposed to estimate a posterior probability distribution of the latent prosody features for each phoneme by conditioning on acoustic features, speaker information, and text features obtained from both past and future sentences. At inference time, instead of the standard Gaussian distribution used by VAE, CUC-VAE allows sampling from an utterance-specific prior distribution conditioned on cross-utterance information, which allows the prosody features generated by the TTS system to be related to the context and is more similar to how humans naturally produce prosody. The performance of CUC-VAE is evaluated via a qualitative listening test for naturalness, intelligibility and quantitative measurements, including word error rates and the standard deviation of prosody attributes. Experimental results on LJ-Speech and LibriTTS data show that the proposed CUC-VAE TTS system improves naturalness and prosody diversity with clear margins.

e-CARE: a New Dataset for Exploring Explainable Causal Reasoning
Li Du | Xiao Ding | Kai Xiong | Ting Liu | Bing Qin

Understanding causality has vital importance for various Natural Language Processing (NLP) applications. Beyond the labeled instances, conceptual explanations of the causality can provide deep understanding of the causal fact to facilitate the causal reasoning process. However, such explanation information still remains absent in existing causal reasoning resources. In this paper, we fill this gap by presenting a human-annotated explainable CAusal REasoning dataset (e-CARE), which contains over 20K causal reasoning questions, together with natural language formed explanations of the causal questions. Experimental results show that generating valid explanations for causal facts still remains especially challenging for the state-of-the-art models, and the explanation information can be helpful for promoting the accuracy and stability of causal reasoning models.

Improving Meta-learning for Low-resource Text Classification and Generation via Memory Imitation
Yingxiu Zhao | Zhiliang Tian | Huaxiu Yao | Yinhe Zheng | Dongkyu Lee | Yiping Song | Jian Sun | Nevin Zhang

Building models of natural language processing (NLP) is challenging in low-resource scenarios where limited data are available. Optimization-based meta-learning algorithms achieve promising results in low-resource scenarios by adapting a well-generalized model initialization to handle new tasks. Nonetheless, these approaches suffer from the memorization overfitting issue, where the model tends to memorize the meta-training tasks while ignoring support sets when adapting to new tasks. To address this issue, we propose a memory imitation meta-learning (MemIML) method that enhances the model’s reliance on support sets for task adaptation. Specifically, we introduce a task-specific memory module to store support set information and construct an imitation module to force query sets to imitate the behaviors of support sets stored in the memory. A theoretical analysis is provided to prove the effectiveness of our method, and empirical results also demonstrate that our method outperforms competitive baselines on both text classification and generation tasks.

Meta-learning via Language Model In-context Tuning
Yanda Chen | Ruiqi Zhong | Sheng Zha | George Karypis | He He

The goal of meta-learning is to learn to adapt to a new task with only a few labeled examples. Inspired by the recent progress in large language models, we propose in-context tuning (ICT), which recasts task adaptation and prediction as a simple sequence prediction problem: to form the input sequence, we concatenate the task instruction, labeled in-context examples, and the target input to predict; to meta-train the model to learn from in-context examples, we fine-tune a pre-trained language model (LM) to predict the target label given the input sequence on a collection of tasks. We benchmark our method on two collections of text classification tasks: LAMA and BinaryClfs. Compared to MAML which adapts the model through gradient descent, our method leverages the inductive bias of pre-trained LMs to perform pattern matching, and outperforms MAML by an absolute 6% average AUC-ROC score on BinaryClfs, gaining more advantage with increasing model size. Compared to non-fine-tuned in-context learning (i.e. prompting a raw LM), in-context tuning meta-trains the model to learn from in-context examples. On BinaryClfs, ICT improves the average AUC-ROC score by an absolute 10%, and reduces the variance due to example ordering by 6x and example choices by 2x.
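
The input construction described in this abstract (task instruction, then labeled in-context examples, then the target input) can be illustrated with a minimal sketch. The function name `build_ict_input`, the "Input:/Label:" template, and the newline separator are hypothetical choices for illustration; the abstract does not specify the exact format the authors use.

```python
from typing import List, Tuple


def build_ict_input(instruction: str,
                    examples: List[Tuple[str, str]],
                    target_input: str,
                    sep: str = "\n") -> str:
    """Concatenate instruction, labeled examples, and the target input.

    The "Input: ... Label: ..." template is an assumption, not the
    format used in the paper.
    """
    parts = [instruction]
    for text, label in examples:
        # Each in-context example carries its gold label.
        parts.append(f"Input: {text} Label: {label}")
    # The target input is left unlabeled; the LM is trained to predict it.
    parts.append(f"Input: {target_input} Label:")
    return sep.join(parts)


prompt = build_ict_input(
    "Is the review positive?",
    [("Great film.", "yes"), ("Dull plot.", "no")],
    "Loved every minute.",
)
print(prompt)
```

In a meta-training setup along these lines, a sequence like `prompt` would be paired with the target label as the prediction objective, across many tasks.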

Prompt-Based Rule Discovery and Boosting for Interactive Weakly-Supervised Learning
Rongzhi Zhang | Yue Yu | Pranav Shetty | Le Song | Chao Zhang

Weakly-supervised learning (WSL) has shown promising results in addressing label scarcity on many NLP tasks, but manually designing a comprehensive, high-quality labeling rule set is tedious and difficult. We study interactive weakly-supervised learning: the problem of iteratively and automatically discovering novel labeling rules from data to improve the WSL model. Our proposed model, named PRBoost, achieves this goal via iterative prompt-based rule discovery and model boosting. It uses boosting to identify large-error instances and discovers candidate rules from them by prompting pre-trained LMs with rule templates. The candidate rules are judged by human experts, and the accepted rules are used to generate complementary weak labels and strengthen the current model. Experiments on four tasks show PRBoost outperforms state-of-the-art WSL baselines by up to 7.1%, and bridges the gaps with fully supervised models.

HIBRIDS: Attention with Hierarchical Biases for Structure-aware Long Document Summarization
Shuyang Cao | Lu Wang

Document structure is critical for efficient information consumption. However, it is challenging to encode it efficiently into the modern Transformer architecture. In this work, we present HIBRIDS, which injects Hierarchical Biases foR Incorporating Document Structure into attention score calculation. We further present a new task, hierarchical question-summary generation, for summarizing salient content in the source document into a hierarchy of questions and summaries, where each follow-up question inquires about the content of its parent question-summary pair. We also annotate a new dataset with 6,153 question-summary hierarchies labeled on government reports. Experiment results show that our model produces better question-summary hierarchies than comparisons on both hierarchy quality and content coverage, a finding also echoed by human judges. Additionally, our model improves the generation of long-form summaries from long government reports and Wikipedia articles, as measured by ROUGE scores.

De-Bias for Generative Extraction in Unified NER Task
Shuai Zhang | Yongliang Shen | Zeqi Tan | Yiquan Wu | Weiming Lu

Named entity recognition (NER) is a fundamental task to recognize specific types of entities from a given sentence. Depending on how the entities appear in the sentence, it can be divided into three subtasks, namely, Flat NER, Nested NER, and Discontinuous NER. Among the existing approaches, only the generative model can be uniformly adapted to these three subtasks. However, when the generative model is applied to NER, its optimization objective is not consistent with the task, which makes the model vulnerable to incorrect biases. In this paper, we analyze the incorrect biases in the generation process from a causality perspective and attribute them to two confounders: pre-context confounder and entity-order confounder. Furthermore, we design Intra- and Inter-entity Deconfounding Data Augmentation methods to eliminate the above confounders according to the theory of backdoor adjustment. Experiments show that our method can improve the performance of the generative NER model in various datasets.

Learning Disentangled Semantic Representations for Zero-Shot Cross-Lingual Transfer in Multilingual Machine Reading Comprehension
Linjuan Wu | Shaojuan Wu | Xiaowang Zhang | Deyi Xiong | Shizhan Chen | Zhiqiang Zhuang | Zhiyong Feng

Multilingual pre-trained models are able to zero-shot transfer knowledge from rich-resource to low-resource languages in machine reading comprehension (MRC). However, inherent linguistic discrepancies in different languages could make answer spans predicted by zero-shot transfer violate syntactic constraints of the target language. In this paper, we propose a novel multilingual MRC framework equipped with a Siamese Semantic Disentanglement Model (S2DM) to disassociate semantics from syntax in representations learned by multilingual pre-trained models. To explicitly transfer only semantic knowledge to the target language, we propose two groups of losses tailored for semantic and syntactic encoding and disentanglement. Experimental results on three multilingual MRC datasets (i.e., XQuAD, MLQA, and TyDi QA) demonstrate the effectiveness of our proposed approach over models based on mBERT and XLM-100.

HiTab: A Hierarchical Table Dataset for Question Answering and Natural Language Generation
Zhoujun Cheng | Haoyu Dong | Zhiruo Wang | Ran Jia | Jiaqi Guo | Yan Gao | Shi Han | Jian-Guang Lou | Dongmei Zhang

Tables are often created with hierarchies, but existing works on table reasoning mainly focus on flat tables and neglect hierarchical tables. Hierarchical tables challenge numerical reasoning by complex hierarchical indexing, as well as implicit relationships of calculation and semantics. We present a new dataset, HiTab, to study question answering (QA) and natural language generation (NLG) over hierarchical tables. HiTab is a cross-domain dataset constructed from a wealth of statistical reports and Wikipedia pages, and has unique characteristics: (1) nearly all tables are hierarchical; (2) QA pairs are not proposed by annotators from scratch, but are revised from real and meaningful sentences authored by analysts; (3) to reveal complex numerical reasoning in statistical reports, we provide fine-grained annotations of quantity and entity alignment. Experiments suggest that HiTab presents a strong challenge for existing baselines and a valuable benchmark for future research. Targeting hierarchical structure, we devise a hierarchy-aware logical form for symbolic reasoning over tables, which shows high effectiveness. Targeting table reasoning, we leverage entity and quantity alignment to explore partially supervised training in QA and conditional generation in NLG, and largely reduce spurious predictions in QA and produce better descriptions in NLG.

FORTAP: Using Formulas for Numerical-Reasoning-Aware Table Pretraining
Zhoujun Cheng | Haoyu Dong | Ran Jia | Pengfei Wu | Shi Han | Fan Cheng | Dongmei Zhang

Tables store rich numerical data, but numerical reasoning over tables is still a challenge. In this paper, we find that the spreadsheet formula, a commonly used language to perform computations on numerical values in spreadsheets, is a valuable supervision for numerical reasoning in tables. Considering large amounts of spreadsheets available on the web, we propose FORTAP, the first exploration to leverage spreadsheet formulas for table pretraining. Two novel self-supervised pretraining objectives are derived from formulas, numerical reference prediction (NRP) and numerical calculation prediction (NCP). While our proposed objectives are generic for encoders, to better capture spreadsheet table layouts and structures, FORTAP is built upon TUTA, the first transformer-based method for spreadsheet table pretraining with tree attention. FORTAP outperforms state-of-the-art methods by large margins on three representative datasets of formula prediction, question answering, and cell type classification, showing the great potential of leveraging formulas for table pretraining.

Explanation Graph Generation via Pre-trained Language Models: An Empirical Study with Contrastive Learning
Swarnadeep Saha | Prateek Yadav | Mohit Bansal

Pre-trained sequence-to-sequence language models have led to widespread success in many natural language generation tasks. However, there has been relatively less work on analyzing their ability to generate structured outputs such as graphs. Unlike natural language, graphs have distinct structural and semantic properties in the context of a downstream NLP task, e.g., generating a graph that is connected and acyclic can be attributed to its structural constraints, while the semantics of a graph can refer to how meaningfully an edge represents the relation between two node concepts. In this work, we study pre-trained language models that generate explanation graphs in an end-to-end manner and analyze their ability to learn the structural constraints and semantics of such graphs. We first show that with limited supervision, pre-trained language models often generate graphs that either violate these constraints or are semantically incoherent. Since curating large amount of human-annotated graphs is expensive and tedious, we propose simple yet effective ways of graph perturbations via node and edge edit operations that lead to structurally and semantically positive and negative graphs. Next, we leverage these graphs in different contrastive learning models with Max-Margin and InfoNCE losses. Our methods lead to significant improvements in both structural and semantic accuracy of explanation graphs and also generalize to other similar graph generation tasks. Lastly, we show that human errors are the best negatives for contrastive learning and also that automatically generating more such human-like negative graphs can lead to further improvements.

Efficient Unsupervised Sentence Compression by Fine-tuning Transformers with Reinforcement Learning
Demian Ghalandari | Chris Hokamp | Georgiana Ifrim

Sentence compression reduces the length of text by removing non-essential content while preserving important facts and grammaticality. Unsupervised objective driven methods for sentence compression can be used to create customized models without the need for ground-truth training data, while allowing flexibility in the objective function(s) that are used for learning and inference. Recent unsupervised sentence compression approaches use custom objectives to guide discrete search; however, guided search is expensive at inference time. In this work, we explore the use of reinforcement learning to train effective sentence compression models that are also fast when generating predictions. In particular, we cast the task as binary sequence labelling and fine-tune a pre-trained transformer using a simple policy gradient approach. Our approach outperforms other unsupervised models while also being more efficient at inference time.
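
The formulation in this abstract (compression as binary sequence labelling, trained with a simple policy gradient) can be sketched in a few lines. This is a toy illustration under stated assumptions: `compress` and `reinforce_loss` are hypothetical names, the keep-probabilities would come from a fine-tuned transformer in practice, and the scalar reward stands in for whatever objective the authors optimise, which the abstract does not detail.

```python
import math
from typing import List


def compress(tokens: List[str], labels: List[int]) -> List[str]:
    """Keep tokens labelled 1 and drop tokens labelled 0."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return [tok for tok, keep in zip(tokens, labels) if keep == 1]


def reinforce_loss(keep_probs: List[float], labels: List[int],
                   reward: float) -> float:
    """REINFORCE surrogate loss: -reward * log p(labels).

    keep_probs[i] is the (assumed) model probability of keeping token i;
    labels are the sampled keep/drop decisions. No baseline is used.
    """
    log_prob = sum(math.log(p if y == 1 else 1.0 - p)
                   for p, y in zip(keep_probs, labels))
    return -reward * log_prob


tokens = "The quick brown fox jumped over the lazy dog yesterday".split()
labels = [1, 0, 0, 1, 1, 1, 1, 0, 1, 0]
print(" ".join(compress(tokens, labels)))  # The fox jumped over the dog
```

Minimising `reinforce_loss` raises the probability of label sequences that earned a high reward, which is what lets the trained model predict all labels in one forward pass instead of running a guided search at inference time.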

Tracing Origins: Coreference-aware Machine Reading Comprehension
Zhuosheng Zhang | Hai Zhao

Machine reading comprehension is a heavily-studied research and test field for evaluating new pre-trained language models (PrLMs) and fine-tuning strategies, and recent studies have enriched the pre-trained language models with syntactic, semantic and other linguistic information to improve the performance of the models. In this paper, we imitate the human reading process in connecting the anaphoric expressions and explicitly leverage the coreference information of the entities to enhance the word embeddings from the pre-trained language model, in order to highlight the coreference mentions of the entities that must be identified for coreference-intensive question answering in QUOREF, a relatively new dataset that is specifically designed to evaluate the coreference-related performance of a model. We use two strategies to fine-tune a pre-trained language model, namely, placing an additional encoder layer after a pre-trained language model to focus on the coreference mentions or constructing a relational graph convolutional network to model the coreference relations. We demonstrate that the explicit incorporation of coreference information in the fine-tuning stage performs better than the incorporation of the coreference information in pre-training a language model.

WatClaimCheck: A new Dataset for Claim Entailment and Inference
Kashif Khan | Ruizhe Wang | Pascal Poupart

We contribute a new dataset for the task of automated fact checking and an evaluation of state-of-the-art algorithms. The dataset includes claims (from speeches, interviews, social media and news articles), review articles published by professional fact checkers and premise articles used by those professional fact checkers to support their review and verify the veracity of the claims. An important challenge in the use of premise articles is the identification of relevant passages that will help to infer the veracity of a claim. We show that transferring a dense passage retrieval model trained with review articles improves the retrieval quality of passages in premise articles. We report results for the prediction of claim veracity by inference from premise articles.

Bias Mitigation in Machine Translation Quality Estimation
Hanna Behnke | Marina Fomicheva | Lucia Specia

Machine Translation Quality Estimation (QE) aims to build predictive models to assess the quality of machine-generated translations in the absence of reference translations. While state-of-the-art QE models have been shown to achieve good results, they over-rely on features that do not have a causal impact on the quality of a translation. In particular, there appears to be a partial input bias, i.e., a tendency to assign high-quality scores to translations that are fluent and grammatically correct, even though they do not preserve the meaning of the source. We analyse the partial input bias in further detail and evaluate four approaches to use auxiliary tasks for bias mitigation. Two approaches use additional data to inform and support the main task, while the other two are adversarial, actively discouraging the model from learning the bias. We compare the methods with respect to their ability to reduce the partial input bias while maintaining the overall performance. We find that training a multitask architecture with an auxiliary binary classification task that utilises additional augmented data best achieves the desired effects and generalises well to different languages and quality metrics.

Principled Paraphrase Generation with Parallel Corpora
Aitor Ormazabal | Mikel Artetxe | Aitor Soroa | Gorka Labaka | Eneko Agirre

Round-trip Machine Translation (MT) is a popular choice for paraphrase generation, which leverages readily available parallel corpora for supervision. In this paper, we formalize the implicit similarity function induced by this approach, and show that it is susceptible to non-paraphrase pairs sharing a single ambiguous translation. Based on these insights, we design an alternative similarity metric that mitigates this issue by requiring the entire translation distribution to match, and implement a relaxation of it through the Information Bottleneck method. Our approach incorporates an adversarial term into MT training in order to learn representations that encode as much information about the reference translation as possible, while keeping as little information about the input as possible. Paraphrases can be generated by decoding back to the source from this representation, without having to generate pivot translations. In addition to being more principled and efficient than round-trip MT, our approach offers an adjustable parameter to control the fidelity-diversity trade-off, and obtains better results in our experiments.

Composable Sparse Fine-Tuning for Cross-Lingual Transfer
Alan Ansell | Edoardo Ponti | Anna Korhonen | Ivan Vulić

Fine-tuning the entire set of parameters of a large pretrained model has become the mainstream approach for transfer learning. To increase its efficiency and prevent catastrophic forgetting and interference, techniques like adapters and sparse fine-tuning have been developed. Adapters are modular, as they can be combined to adapt a model towards different facets of knowledge (e.g., dedicated language and/or task adapters). Sparse fine-tuning is expressive, as it controls the behavior of all model components. In this work, we introduce a new fine-tuning method with both these desirable properties. In particular, we learn sparse, real-valued masks based on a simple variant of the Lottery Ticket Hypothesis. Task-specific masks are obtained from annotated data in a source language, and language-specific masks from masked language modeling in a target language. Both these masks can then be composed with the pretrained model. Unlike adapter-based fine-tuning, this method neither increases the number of parameters at inference time nor alters the original model architecture. Most importantly, it outperforms adapters in zero-shot cross-lingual transfer by a large margin in a series of multilingual benchmarks, including Universal Dependencies, MasakhaNER, and AmericasNLI. Based on an in-depth analysis, we additionally find that sparsity is crucial to prevent both 1) interference between the fine-tunings to be composed and 2) overfitting. We release the code and models at
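
The composition idea in this abstract (a task-specific and a language-specific sparse update, both applied to the same pretrained model) can be sketched on a toy parameter dictionary. This is a sketch under loud assumptions: selecting the top-k largest-magnitude changes and composing by addition are illustrative choices, and `sparse_diff` and `compose` are hypothetical names, not the authors' released code.

```python
from typing import Dict

Params = Dict[str, float]


def sparse_diff(pretrained: Params, fine_tuned: Params, k: int) -> Params:
    """Keep only the k largest-magnitude parameter changes (assumed rule)."""
    diffs = {name: fine_tuned[name] - pretrained[name] for name in pretrained}
    top = sorted(diffs, key=lambda name: abs(diffs[name]), reverse=True)[:k]
    return {name: diffs[name] for name in top}


def compose(pretrained: Params, *diffs: Params) -> Params:
    """Add each sparse difference onto a copy of the pretrained parameters."""
    out = dict(pretrained)
    for diff in diffs:
        for name, delta in diff.items():
            out[name] += delta
    return out


pretrained = {"w1": 0.0, "w2": 1.0, "w3": -1.0}
task_tuned = {"w1": 0.5, "w2": 1.0, "w3": -1.25}  # toy task fine-tuning
lang_tuned = {"w1": 0.0, "w2": 2.0, "w3": -1.0}   # toy language fine-tuning

task_diff = sparse_diff(pretrained, task_tuned, k=1)
lang_diff = sparse_diff(pretrained, lang_tuned, k=1)
composed = compose(pretrained, task_diff, lang_diff)
print(composed)
```

The composed dictionary has the same keys as the pretrained one, which mirrors the abstract's point that this style of fine-tuning adds no parameters at inference time and leaves the architecture unchanged.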

Toward Annotator Group Bias in Crowdsourcing
Haochen Liu | Joseph Thekinen | Sinem Mollaoglu | Da Tang | Ji Yang | Youlong Cheng | Hui Liu | Jiliang Tang

Crowdsourcing has emerged as a popular approach for collecting annotated data to train supervised machine learning models. However, annotator bias can lead to defective annotations. Though there are a few works investigating individual annotator bias, the group effects in annotators are largely overlooked. In this work, we reveal that annotators within the same demographic group tend to show consistent group bias in annotation tasks, and thus we conduct an initial study on annotator group bias. We first empirically verify the existence of annotator group bias in various real-world crowdsourcing datasets. Then, we develop a novel probabilistic graphical framework, GroupAnno, to capture annotator group bias with an extended Expectation Maximization (EM) algorithm. We conduct experiments on both synthetic and real-world datasets. Experimental results demonstrate the effectiveness of our model in modeling annotator group bias in label aggregation and model learning over competitive baselines.

BiTIIMT: A Bilingual Text-infilling Method for Interactive Machine Translation
Yanling Xiao | Lemao Liu | Guoping Huang | Qu Cui | Shujian Huang | Shuming Shi | Jiajun Chen

Interactive neural machine translation (INMT) is able to guarantee high-quality translations by taking human interactions into account. Existing IMT systems relying on lexical constrained decoding (LCD) enable humans to translate in a flexible translation order beyond the left-to-right. However, they typically suffer from two significant limitations in translation efficiency and quality due to the reliance on LCD. In this work, we propose a novel BiTIIMT system, Bilingual Text-Infilling for Interactive Neural Machine Translation. The key idea of BiTIIMT is Bilingual Text-infilling (BiTI), which aims to fill missing segments in a manually revised translation for a given source sentence. We propose a simple yet effective solution by casting this task as a sequence-to-sequence task. In this way, our system performs decoding without explicit constraints and makes full use of revised words for better translation prediction. Experiment results show that BiTIIMT performs significantly better and faster than state-of-the-art LCD-based IMT on three translation tasks.

Divide and Denoise: Learning from Noisy Labels in Fine-Grained Entity Typing with Cluster-Wise Loss Correction
Kunyuan Pang | Haoyu Zhang | Jie Zhou | Ting Wang

Fine-grained Entity Typing (FET) has made great progress based on distant supervision but still suffers from label noise. Existing FET noise learning methods rely on prediction distributions in an instance-independent manner, which causes the problem of confirmation bias. In this work, we propose a clustering-based loss correction framework named Feature Cluster Loss Correction (FCLC), to address these two problems. FCLC first trains a coarse backbone model as a feature extractor and noise estimator. Loss correction is then applied to each feature cluster, learning directly from the noisy labels. Experimental results on three public datasets show that FCLC achieves the best performance over existing competitive systems. Auxiliary experiments further demonstrate that FCLC is stable to hyperparameters and it does help mitigate confirmation bias. We also find that in the extreme case of no clean data, the FCLC framework still achieves competitive performance.

Towards Robustness of Text-to-SQL Models Against Natural and Realistic Adversarial Table Perturbation
Xinyu Pi | Bing Wang | Yan Gao | Jiaqi Guo | Zhoujun Li | Jian-Guang Lou

The robustness of Text-to-SQL parsers against adversarial perturbations plays a crucial role in delivering highly reliable applications. Previous studies along this line primarily focused on perturbations in the natural language question side, neglecting the variability of tables. Motivated by this, we propose the Adversarial Table Perturbation (ATP) as a new attacking paradigm to measure the robustness of Text-to-SQL models. Following this proposition, we curate ADVETA, the first robustness evaluation benchmark featuring natural and realistic ATPs. All tested state-of-the-art models experience dramatic performance drops on ADVETA, revealing significant room for improvement. To defend against ATP, we build a systematic adversarial training example generation framework tailored for better contextualization of tabular data. Experiments show that our approach brings models the best robustness improvement against ATP, while also substantially boosting model robustness against NL-side perturbations. We will release ADVETA and code to facilitate future research.

pdf bib
Metaphors in Pre-Trained Language Models: Probing and Generalization Across Datasets and Languages
Ehsan Aghazadeh | Mohsen Fayyaz | Yadollah Yaghoobzadeh

Human languages are full of metaphorical expressions. Metaphors help people understand the world by connecting new concepts and domains to more familiar ones. Large pre-trained language models (PLMs) are therefore assumed to encode metaphorical knowledge useful for NLP systems. In this paper, we investigate this hypothesis for PLMs, by probing metaphoricity information in their encodings, and by measuring the cross-lingual and cross-dataset generalization of this information. We present studies in multiple metaphor detection datasets and in four languages (i.e., English, Spanish, Russian, and Farsi). Our extensive experiments suggest that contextual representations in PLMs do encode metaphorical knowledge, and mostly in their middle layers. The knowledge is transferable between languages and datasets, especially when the annotation is consistent across training and testing sets. Our findings give helpful insights for both cognitive and NLP scientists.

pdf bib
bert2BERT: Towards Reusable Pretrained Language Models
Cheng Chen | Yichun Yin | Lifeng Shang | Xin Jiang | Yujia Qin | Fengyu Wang | Zhi Wang | Xiao Chen | Zhiyuan Liu | Qun Liu

In recent years, researchers have tended to pre-train ever-larger language models to explore the upper limit of deep models. However, large language model pre-training requires intensive computational resources, and most models are trained from scratch without reusing existing pre-trained models, which is wasteful. In this paper, we propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model through parameter initialization and significantly improve the pre-training efficiency of the large model. Specifically, we extend the previous function-preserving method proposed in computer vision to Transformer-based language models, and further improve it by proposing a novel method, advanced knowledge for large model’s initialization. In addition, a two-stage learning method is proposed to further accelerate the pre-training. We conduct extensive experiments on representative PLMs (e.g., BERT and GPT) and demonstrate that (1) our method can save a significant amount of training cost compared with baselines including learning from scratch, StackBERT and MSLT; (2) our method is generic and applicable to different types of pre-trained models. In particular, bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_{\rm BASE} and GPT_{\rm BASE}, respectively, by reusing models of almost half their sizes.

pdf bib
"You might think about slightly revising the title”: Identifying Hedges in Peer-tutoring Interactions
Yann Raphalen | Chloé Clavel | Justine Cassell

Hedges have an important role in the management of rapport. In peer-tutoring, they are notably used by tutors in dyads experiencing low rapport to tone down the impact of instructions and negative feedback. Pursuing the objective of building a tutoring agent that manages rapport with teenagers in order to improve learning, we used a multimodal peer-tutoring dataset to construct a computational framework for identifying hedges. We compared approaches relying on pre-trained resources with others that integrate insights from the social science literature. Our best performance involved a hybrid approach that outperforms the existing baseline while being easier to interpret. We employ a model explainability tool to explore the features that characterize hedges in peer-tutoring conversations, and we identify some novel features, and the benefits of such a hybrid model approach.

pdf bib
Efficient Cluster-Based k-Nearest-Neighbor Machine Translation
Dexin Wang | Kai Fan | Boxing Chen | Deyi Xiong

k-Nearest-Neighbor Machine Translation (kNN-MT) has been recently proposed as a non-parametric solution for domain adaptation in neural machine translation (NMT). It aims to alleviate the performance degradation of advanced MT systems in translating out-of-domain sentences by coordinating with an additional token-level feature-based retrieval module constructed from in-domain data. Previous studies (Khandelwal et al., 2021; Zheng et al., 2021) have already demonstrated that non-parametric NMT is even superior to models fine-tuned on out-of-domain data. In spite of this success, kNN retrieval comes at the expense of high latency, in particular for large datastores. To make it practical, in this paper, we explore a more efficient kNN-MT and propose to use clustering to improve the retrieval efficiency. Concretely, we first propose a cluster-based Compact Network for feature reduction in a contrastive learning manner to compress context features into 90+% lower dimensional vectors. We then suggest a cluster-based pruning solution to filter out 10%-40% redundant nodes in large datastores while retaining translation quality. Our proposed methods achieve better or comparable performance while reducing up to 57% inference latency against the advanced non-parametric MT model on several machine translation benchmarks. Experimental results indicate that the proposed methods maintain the most useful information of the original datastore and the Compact Network shows good generalization on unseen domains. Codes are available at
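
The retrieval step that kNN-MT adds on top of an NMT model can be illustrated with a minimal sketch. It assumes the commonly used formulation in which retrieved neighbours are turned into a token distribution and interpolated with the model distribution; it does not implement the Compact Network or the cluster-based pruning described above. The function names, the temperature, and the interpolation weight `lam` are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple


def knn_distribution(neighbors: List[Tuple[str, float]],
                     temperature: float = 10.0) -> Dict[str, float]:
    """Turn retrieved (target_token, distance) pairs into a token distribution.

    Closer neighbours get larger weight via exp(-distance / temperature).
    The temperature value is an assumed hyperparameter.
    """
    weights: Dict[str, float] = {}
    for token, distance in neighbors:
        weights[token] = weights.get(token, 0.0) + math.exp(-distance / temperature)
    total = sum(weights.values())
    return {token: w / total for token, w in weights.items()}


def interpolate(p_nmt: Dict[str, float], p_knn: Dict[str, float],
                lam: float = 0.5) -> Dict[str, float]:
    """Mix the retrieval distribution with the NMT distribution (lam is assumed)."""
    vocab = set(p_nmt) | set(p_knn)
    return {t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_nmt.get(t, 0.0)
            for t in vocab}


# Toy example: three neighbours retrieved from a hypothetical in-domain datastore.
p_knn = knn_distribution([("Haus", 1.0), ("Haus", 1.0), ("Heim", 1.0)])
p_final = interpolate({"Haus": 0.5, "Gebaeude": 0.5}, p_knn, lam=0.5)
```

In this picture, latency is dominated by the neighbour search that produces `neighbors`, which is the part that lower-dimensional features and a pruned datastore are meant to speed up.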

pdf bib
Headed-Span-Based Projective Dependency Parsing
Songlin Yang | Kewei Tu

We propose a new method for projective dependency parsing based on headed spans. In a projective dependency tree, the largest subtree rooted at each word covers a contiguous sequence (i.e., a span) in the surface order. We call such a span, marked by a root word, a headed span. A projective dependency tree can be represented as a collection of headed spans. We decompose the score of a dependency tree into the scores of the headed spans and design a novel O(n^3) dynamic programming algorithm to enable global training and exact inference. Our model achieves state-of-the-art or competitive results on PTB, CTB, and UD.
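
The notion of a headed span can be made concrete with a small sketch that reads the spans off a given projective tree. This only illustrates the representation, not the O(n^3) parsing algorithm; the function name and the head-array encoding (1-indexed words, 0 for the root) are assumptions made for the example.

```python
from typing import List, Tuple


def headed_spans(heads: List[int]) -> List[Tuple[int, int, int]]:
    """Return (head_word, left, right) for every word, 1-indexed and inclusive.

    heads[i - 1] is the parent of word i; 0 marks the root. For a projective
    tree, the subtree of each word is contiguous, so its leftmost and
    rightmost descendants delimit the headed span.
    """
    n = len(heads)
    left = list(range(n + 1))   # left[i]: leftmost word in the subtree of i
    right = list(range(n + 1))  # right[i]: rightmost word in the subtree of i
    # Propagate each word's position up through all of its ancestors.
    for word in range(1, n + 1):
        ancestor = heads[word - 1]
        while ancestor != 0:
            left[ancestor] = min(left[ancestor], word)
            right[ancestor] = max(right[ancestor], word)
            ancestor = heads[ancestor - 1]
    return [(w, left[w], right[w]) for w in range(1, n + 1)]


# "She reads old books": "reads" is the root, "She" and "books" attach to
# "reads", and "old" attaches to "books".
print(headed_spans([2, 0, 4, 2]))
# -> [(1, 1, 1), (2, 1, 4), (3, 3, 3), (4, 3, 4)]
```

Each word contributes exactly one span, so a sentence of n words is represented by n headed spans.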

pdf bib
Robust Lottery Tickets for Pre-trained Language Models
Rui Zheng | Bao Rong | Yuhao Zhou | Di Liang | Sirui Wang | Wei Wu | Tao Gui | Qi Zhang | Xuanjing Huang

Recent works on the Lottery Ticket Hypothesis have shown that pre-trained language models (PLMs) contain smaller matching subnetworks (winning tickets) which are capable of reaching accuracy comparable to the original models. However, these tickets have been shown to be not robust to adversarial examples, and even worse than their PLM counterparts. To address this problem, we propose a novel method based on learning binary weight masks to identify robust tickets hidden in the original PLMs. Since the loss is not differentiable for the binary mask, we assign the hard concrete distribution to the masks and encourage their sparsity using a smoothing approximation of L0 regularization. Furthermore, we design an adversarial loss objective to guide the search for robust tickets and ensure that the tickets perform well both in accuracy and robustness. Experimental results show the significant improvement of the proposed method over previous work on adversarial robustness evaluation.
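
A minimal sketch of a hard concrete gate and its smoothed L0 penalty may help make the masking idea concrete. It follows the usual hard concrete construction (a stretched and clipped binary concrete variable); the stretch limits, the temperature, and all names are assumed values for illustration and are not taken from the paper.

```python
import math
import random
from typing import List

# Assumed hyperparameters: stretch interval (GAMMA, ZETA) and temperature BETA.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(log_alpha: float, rng: random.Random) -> float:
    """Sample one mask value in [0, 1] from the hard concrete distribution."""
    u = rng.uniform(1e-6, 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    stretched = s * (ZETA - GAMMA) + GAMMA   # stretch to (GAMMA, ZETA)
    return min(1.0, max(0.0, stretched))     # clip, giving exact 0s and 1s


def prob_nonzero(log_alpha: float) -> float:
    """Probability that a gate is non-zero; differentiable in log_alpha."""
    return sigmoid(log_alpha - BETA * math.log(-GAMMA / ZETA))


def expected_l0(log_alphas: List[float]) -> float:
    """Smoothed L0 penalty: expected number of weights kept by the mask."""
    return sum(prob_nonzero(a) for a in log_alphas)


rng = random.Random(0)
mask = [sample_gate(a, rng) for a in (-20.0, 0.0, 20.0)]
penalty = expected_l0([-20.0, 0.0, 20.0])
```

Because the clipping yields gates that are exactly 0 or 1 with non-zero probability, the learned `log_alpha` values can be thresholded after training to obtain a binary mask.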

pdf bib
IAM: A Comprehensive and Large-Scale Dataset for Integrated Argument Mining Tasks
Liying Cheng | Lidong Bing | Ruidan He | Qian Yu | Yan Zhang | Luo Si

Traditionally, a debate usually requires a manual preparation process, including reading plenty of articles, selecting the claims, identifying the stances of the claims, seeking the evidence for the claims, etc. As AI debate has attracted more attention in recent years, it is worth exploring methods to automate the tedious process involved in the debating system. In this work, we introduce a comprehensive and large dataset named IAM, which can be applied to a series of argument mining tasks, including claim extraction, stance classification, evidence extraction, etc. Our dataset is collected from over 1k articles related to 123 topics. Nearly 70k sentences in the dataset are fully annotated based on their argument properties (e.g., claims, stances, evidence, etc.). We further propose two new integrated argument mining tasks associated with the debate preparation process: (1) claim extraction with stance classification (CESC) and (2) claim-evidence pair extraction (CEPE). We adopt a pipeline approach and an end-to-end method for each integrated task separately. Promising experimental results are reported to show the values and challenges of our proposed tasks, and motivate future research on argument mining.

pdf bib
CTRLEval: An Unsupervised Reference-Free Metric for Evaluating Controlled Text Generation
Pei Ke | Hao Zhou | Yankai Lin | Peng Li | Jie Zhou | Xiaoyan Zhu | Minlie Huang

Existing reference-free metrics have obvious limitations for evaluating controlled text generation models. Unsupervised metrics can only provide a task-agnostic evaluation result which correlates weakly with human judgments, whereas supervised ones may overfit task-specific data with poor generalization ability to other datasets. In this paper, we propose an unsupervised reference-free metric called CTRLEval, which evaluates controlled text generation from different aspects by formulating each aspect into multiple text infilling tasks. On top of these tasks, the metric assembles the generation probabilities from a pre-trained language model without any model training. Experimental results show that our metric has higher correlations with human judgments than other baselines, while obtaining better generalization of evaluating generated texts from different models and with different qualities.

pdf bib
Redistributing Low-Frequency Words: Making the Most of Monolingual Data in Non-Autoregressive Translation
Liang Ding | Longyue Wang | Shuming Shi | Dacheng Tao | Zhaopeng Tu

Knowledge distillation (KD) is the preliminary step for training non-autoregressive translation (NAT) models, which eases the training of NAT models at the cost of losing important information for translating low-frequency words. In this work, we provide an appealing alternative for NAT – monolingual KD, which trains NAT student on external monolingual data with AT teacher trained on the original bilingual data. Monolingual KD is able to transfer both the knowledge of the original bilingual data (implicitly encoded in the trained AT teacher model) and that of the new monolingual data to the NAT student model. Extensive experiments on eight WMT benchmarks over two advanced NAT models show that monolingual KD consistently outperforms the standard KD by improving low-frequency word translation, without introducing any computational cost. Monolingual KD enjoys desirable expandability, which can be further enhanced (when given more computational budget) by combining with the standard KD, a reverse monolingual KD, or enlarging the scale of monolingual data. Extensive analyses demonstrate that these techniques can be used together profitably to further recall the useful information lost in the standard KD. Encouragingly, combining with standard KD, our approach achieves 30.4 and 34.1 BLEU points on the WMT14 English-German and German-English datasets, respectively. Our code and trained models are freely available at

pdf bib
Alignment-Augmented Consistent Translation for Multilingual Open Information Extraction
Keshav Kolluru | Muqeeth Mohammed | Shubham Mittal | Soumen Chakrabarti | Mausam

Progress with supervised Open Information Extraction (OpenIE) has been primarily limited to English due to the scarcity of training data in other languages. In this paper, we explore techniques to automatically convert English text for training OpenIE systems in other languages. We introduce the Alignment-Augmented Constrained Translation (AACTrans) model to translate English sentences and their corresponding extractions consistently with each other — with no changes to vocabulary or semantic meaning which may result from independent translations. Using the data generated with AACTrans, we train a novel two-stage generative OpenIE model, which we call Gen2OIE, that outputs for each sentence: 1) relations in the first stage and 2) all extractions containing the relation in the second stage. Gen2OIE increases relation coverage using a training data transformation technique that is generalizable to multiple languages, in contrast to existing models that use an English-specific training loss. Evaluations on 5 languages — Spanish, Portuguese, Chinese, Hindi and Telugu — show that the Gen2OIE with AACTrans data outperforms prior systems by a margin of 6-25% in F1.

pdf bib
Text-to-Table: A New Way of Information Extraction
Xueqing Wu | Jiacheng Zhang | Hang Li

We study a new problem setting of information extraction (IE), referred to as text-to-table. In text-to-table, given a text, one creates a table or several tables expressing the main content of the text, while the model is learned from text-table pair data. The problem setting differs from those of the existing methods for IE. First, the extraction can be carried out from long texts to large tables with complex structures. Second, the extraction is entirely data-driven, and there is no need to explicitly define the schemas. As far as we know, there has been no previous work that studies the problem. In this work, we formalize text-to-table as a sequence-to-sequence (seq2seq) problem. We first employ a seq2seq model fine-tuned from a pre-trained language model to perform the task. We also develop a new method within the seq2seq approach, exploiting two additional techniques in table generation: table constraint and table relation embeddings. We consider text-to-table as an inverse problem of the well-studied table-to-text, and make use of four existing table-to-text datasets in our experiments on text-to-table. Experimental results show that the vanilla seq2seq model can outperform the baseline methods of using relation extraction and named entity extraction. The results also show that our method can further boost the performances of the vanilla seq2seq model. We further discuss the main challenges of the proposed task. The code and data are available at

pdf bib
Accelerating Code Search with Deep Hashing and Code Classification
Wenchao Gu | Yanlin Wang | Lun Du | Hongyu Zhang | Shi Han | Dongmei Zhang | Michael Lyu

Code search is the task of retrieving reusable code snippets from a source code corpus based on natural language queries. Deep learning-based methods for code search have shown promising results. However, previous methods focus on retrieval accuracy but pay little attention to the efficiency of the retrieval process. We propose a novel method, CoSHC, to accelerate code search with deep hashing and code classification, aiming to perform efficient code search without sacrificing too much accuracy. To evaluate the effectiveness of CoSHC, we apply our method to five code search models. Extensive experimental results indicate that, compared with previous code search baselines, CoSHC can save more than of retrieval time while preserving at least of retrieval accuracy.
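
One plausible way hashing can speed up retrieval is a two-stage search: recall candidates cheaply with binary codes, then re-rank only those candidates with the original embeddings. The sketch below assumes exactly that and uses sign binarization as a stand-in for a learned hash function; it omits the code classification component, and all names and the `recall_k` parameter are hypothetical.

```python
import math
from typing import List, Sequence


def to_hash(vec: Sequence[float]) -> List[int]:
    """Binarize an embedding by sign (a stand-in for a learned deep hash)."""
    return [1 if x > 0 else 0 for x in vec]


def hamming(a: Sequence[int], b: Sequence[int]) -> int:
    """Number of differing bits between two hash codes."""
    return sum(x != y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec: Sequence[float], code_vecs: List[Sequence[float]],
           recall_k: int = 2) -> List[int]:
    """Return indices of recalled snippets, best match first."""
    q_hash = to_hash(query_vec)
    hashes = [to_hash(v) for v in code_vecs]  # would be precomputed offline
    # Stage 1: cheap recall by Hamming distance over binary codes.
    candidates = sorted(range(len(code_vecs)),
                        key=lambda i: hamming(q_hash, hashes[i]))[:recall_k]
    # Stage 2: precise re-ranking of the few candidates by cosine similarity.
    return sorted(candidates, key=lambda i: cosine(query_vec, code_vecs[i]),
                  reverse=True)


code_vecs = [[1.0, 0.2, -0.5], [0.9, 0.8, -0.1], [-1.0, -0.3, 0.7]]
ranking = search([0.8, 0.9, -0.2], code_vecs, recall_k=2)
```

The expensive similarity is computed for only `recall_k` snippets instead of the whole corpus, which is where the time saving in such a scheme would come from.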

pdf bib
Learning Disentangled Textual Representations via Statistical Measures of Similarity
Pierre Colombo | Guillaume Staerman | Nathan Noiry | Pablo Piantanida

When working with textual data, a natural application of disentangled representations is fair classification, where the goal is to make predictions without being biased (or influenced) by sensitive attributes that may be present in the data (e.g., age, gender, or race). Dominant approaches to disentangle a sensitive attribute from textual representations rely on simultaneously learning a penalization term that involves either an adversarial loss (e.g., a discriminator) or an information measure (e.g., mutual information). However, these methods require the training of a deep neural network with several parameter updates for each update of the representation model. As a matter of fact, the resulting nested optimization loop is time-consuming, adds complexity to the optimization dynamic, and requires fine hyperparameter selection (e.g., learning rates, architecture). In this work, we introduce a family of regularizers for learning disentangled representations that do not require training. These regularizers are based on statistical measures of similarity between the conditional probability distributions with respect to the sensitive attributes. Our novel regularizers do not require additional training, are faster, and do not involve additional tuning, while achieving better results both when combined with pretrained and randomly initialized text encoders.

pdf bib
GL-CLeF: A Global–Local Contrastive Learning Framework for Cross-lingual Spoken Language Understanding
Libo Qin | Qiguang Chen | Tianbao Xie | Qixin Li | Jian-Guang Lou | Wanxiang Che | Min-Yen Kan

Due to the high data demands of current methods, attention to zero-shot cross-lingual spoken language understanding (SLU) has grown, as such approaches greatly reduce human annotation effort. However, existing models rely solely on shared parameters, which can only perform implicit alignment across languages. We present the Global-Local Contrastive Learning Framework (GL-CLeF) to address this shortcoming. Specifically, we employ contrastive learning, leveraging bilingual dictionaries to construct multilingual views of the same utterance, then encourage their representations to be more similar than negative example pairs, thereby explicitly aligning representations of similar sentences across languages. In addition, a key step in GL-CLeF is a proposed Local and Global component, which achieves a fine-grained cross-lingual transfer (i.e., sentence-level Local intent transfer, token-level Local slot transfer, and semantic-level Global transfer across intent and slot). Experiments on MultiATIS++ show that GL-CLeF achieves the best performance and successfully pulls representations of similar sentences across languages closer.

pdf bib
Good Examples Make A Faster Learner: Simple Demonstration-based Learning for Low-resource NER
Dong-Ho Lee | Akshen Kadakia | Kangmin Tan | Mahak Agarwal | Xinyu Feng | Takashi Shibuya | Ryosuke Mitani | Toshiyuki Sekiya | Jay Pujara | Xiang Ren

Recent advances in prompt-based learning have shown strong results on few-shot text classification by using cloze-style templates. Similar attempts have been made on named entity recognition (NER) which manually design templates to predict entity types for every text span in a sentence. However, such methods may suffer from error propagation induced by entity span detection, high cost due to enumeration of all possible text spans, and omission of inter-dependencies among token labels in a sentence. Here we present a simple demonstration-based learning method for NER, which lets the input be prefaced by task demonstrations for in-context learning. We perform a systematic study on demonstration strategy regarding what to include (entity examples, with or without surrounding context), how to select the examples, and what templates to use. Results on in-domain learning and domain adaptation show that the model’s performance in low-resource settings can be largely improved with a suitable demonstration strategy (e.g., a 4-17% improvement on 25 train instances). We also find that good demonstration can save many labeled examples and consistency in demonstration contributes to better performance.
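
Prefacing an input with task demonstrations can be sketched as simple string construction. The template ("X is TYPE."), the separator token, and the function name below are hypothetical choices for illustration; the paper studies several demonstration templates and selection strategies that are not reproduced here.

```python
from typing import List, Tuple

# A demonstration: an example sentence plus its (entity text, entity type) pairs.
Demonstration = Tuple[str, List[Tuple[str, str]]]


def build_input(sentence: str, demonstrations: List[Demonstration],
                sep: str = " [SEP] ") -> str:
    """Preface the sentence to be tagged with labeled demonstrations.

    Each demonstration keeps its surrounding context (the example sentence)
    followed by one short statement per entity.
    """
    parts = []
    for demo_sentence, entities in demonstrations:
        labels = " ".join(f"{text} is {etype}." for text, etype in entities)
        parts.append(f"{demo_sentence} {labels}")
    parts.append(sentence)  # the actual input comes last
    return sep.join(parts)


demos = [("Obama visited Paris.", [("Obama", "PER"), ("Paris", "LOC")])]
print(build_input("Merkel flew to Rome.", demos))
# -> Obama visited Paris. Obama is PER. Paris is LOC. [SEP] Merkel flew to Rome.
```

The resulting string would be fed to a token-level tagger, with labels predicted only for the tokens of the final sentence.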

pdf bib
A Meta-framework for Spatiotemporal Quantity Extraction from Text
Qiang Ning | Ben Zhou | Hao Wu | Haoruo Peng | Chuchu Fan | Matt Gardner

News events are often associated with quantities (e.g., the number of COVID-19 patients or the number of arrests in a protest), and it is often important to extract their type, time, and location from unstructured text in order to analyze these quantity events. This paper thus formulates the NLP problem of spatiotemporal quantity extraction, and proposes the first meta-framework for solving it. This meta-framework contains a formalism that decomposes the problem into several information extraction tasks, a shareable crowdsourcing pipeline, and transformer-based baseline models. We demonstrate the meta-framework in three domains—the COVID-19 pandemic, Black Lives Matter protests, and 2020 California wildfires—to show that the formalism is general and extensible, the crowdsourcing pipeline facilitates fast and high-quality data annotation, and the baseline system can handle spatiotemporal quantity extraction well enough to be practically useful. We release all resources for future research on this topic at

pdf bib
Sequence-to-Sequence Knowledge Graph Completion and Question Answering
Apoorv Saxena | Adrian Kochsiek | Rainer Gemulla

Knowledge graph embedding (KGE) models represent each entity and relation of a knowledge graph (KG) with low-dimensional embedding vectors. These methods have recently been applied to KG link prediction and question answering over incomplete KGs (KGQA). KGEs typically create an embedding for each entity in the graph, which results in large model sizes on real-world graphs with millions of entities. For downstream tasks these atomic entity representations often need to be integrated into a multi-stage pipeline, limiting their utility. We show that an off-the-shelf encoder-decoder Transformer model can serve as a scalable and versatile KGE model obtaining state-of-the-art results for KG link prediction and incomplete KG question answering. We achieve this by posing KG link prediction as a sequence-to-sequence task and exchanging the triple scoring approach taken by prior KGE methods with autoregressive decoding. Such a simple but powerful method reduces the model size by up to 98% compared to conventional KGE models while keeping inference time tractable. After finetuning this model on the task of KGQA over incomplete KGs, our approach outperforms baselines on multiple large-scale datasets without extensive hyperparameter tuning.

pdf bib
FIBER: Fill-in-the-Blanks as a Challenging Video Understanding Evaluation Framework
Santiago Castro | Ruoyao Wang | Pingxuan Huang | Ian Stewart | Oana Ignat | Nan Liu | Jonathan Stroud | Rada Mihalcea

We propose fill-in-the-blanks as a video understanding evaluation framework and introduce FIBER – a novel dataset consisting of 28,000 videos and descriptions in support of this evaluation framework. The fill-in-the-blanks setting tests a model’s understanding of a video by requiring it to predict a masked noun phrase in the caption of the video, given the video and the surrounding text. The FIBER benchmark does not share the weaknesses of the current state-of-the-art language-informed video understanding tasks, namely: (1) video question answering using multiple-choice questions, where models perform relatively well because they exploit linguistic biases in the task formulation, thus making our framework challenging for the current state-of-the-art systems to solve; and (2) video captioning, which relies on an open-ended evaluation framework that is often inaccurate because system answers may be perceived as incorrect if they differ in form from the ground truth. The FIBER dataset and our code are available at

pdf bib
KenMeSH: Knowledge-enhanced End-to-end Biomedical Text Labelling
Xindi Wang | Robert Mercer | Frank Rudzicz

Currently, Medical Subject Headings (MeSH) are manually assigned to every biomedical article published and subsequently recorded in the PubMed database to facilitate retrieving relevant information. With the rapid growth of the PubMed database, large-scale biomedical document indexing becomes increasingly important. MeSH indexing is a challenging task for machine learning, as it needs to assign multiple labels to each article from an extremely large hierarchically organized collection. To address this challenge, we propose KenMeSH, an end-to-end model that combines new text features and a dynamic knowledge-enhanced mask attention that integrates document features with MeSH label hierarchy and journal correlation features to index MeSH terms. Experimental results show the proposed method achieves state-of-the-art performance on a number of measures.

pdf bib
A Taxonomy of Empathetic Questions in Social Dialogs
Ekaterina Svikhnushina | Iuliana Voinea | Anuradha Welivita | Pearl Pu

Effective question-asking is a crucial component of a successful conversational chatbot. It could help the bots manifest empathy and render the interaction more engaging by demonstrating attention to the speaker’s emotions. However, current dialog generation approaches do not model this subtle emotion regulation technique due to the lack of a taxonomy of questions and their purpose in social chitchat. To address this gap, we have developed an empathetic question taxonomy (EQT), with special attention paid to questions’ ability to capture communicative acts and their emotion-regulation intents. We further design a crowd-sourcing task to annotate a large subset of the EmpatheticDialogues dataset with the established labels. We use the crowd-annotated data to develop automatic labeling tools and produce labels for the whole dataset. Finally, we employ information visualization techniques to summarize co-occurrences of question acts and intents and their role in regulating interlocutors’ emotion. These results reveal important question-asking strategies in social dialogs. The EQT classification scheme can facilitate computational analysis of questions in datasets. More importantly, it can inform future efforts in empathetic question generation using neural or hybrid methods.

pdf bib
Enhanced Multi-Channel Graph Convolutional Network for Aspect Sentiment Triplet Extraction
Hao Chen | Zepeng Zhai | Fangxiang Feng | Ruifan Li | Xiaojie Wang

Aspect Sentiment Triplet Extraction (ASTE) is an emerging sentiment analysis task. Most of the existing studies focus on devising a new tagging scheme that enables the model to extract the sentiment triplets in an end-to-end fashion. However, these methods ignore the relations between words for the ASTE task. In this paper, we propose an Enhanced Multi-Channel Graph Convolutional Network model (EMC-GCN) to fully utilize the relations between words. Specifically, we first define ten types of relations for the ASTE task, and then adopt a biaffine attention module to embed these relations as an adjacent tensor between words in a sentence. After that, our EMC-GCN transforms the sentence into a multi-channel graph by treating words and the relation adjacent tensor as nodes and edges, respectively. Thus, relation-aware node representations can be learnt. Furthermore, we consider diverse linguistic features to enhance our EMC-GCN model. Finally, we design an effective refining strategy on EMC-GCN for word-pair representation refinement, which considers the implicit results of aspect and opinion extraction when determining whether word pairs match or not. Extensive experimental results on the benchmark datasets demonstrate the effectiveness and robustness of our proposed model, which outperforms state-of-the-art methods significantly.

pdf bib
Learned Incremental Representations for Parsing
Nikita Kitaev | Thomas Lu | Dan Klein

We present an incremental syntactic representation that consists of assigning a single discrete label to each word in a sentence, where the label is predicted using strictly incremental processing of a prefix of the sentence, and the sequence of labels for a sentence fully determines a parse tree. Our goal is to induce a syntactic representation that commits to syntactic choices only as they are incrementally revealed by the input, in contrast with standard representations that must make output choices such as attachments speculatively and later throw out conflicting analyses. Our learned representations achieve 93.72 F1 on the Penn Treebank with as few as bits per word, and at bits per word they achieve 94.97 F1, which is comparable with other state-of-the-art parsing models when using the same pre-trained embeddings. We also provide an analysis of the representations learned by our system, investigating properties such as the interpretable syntactic features captured by the system and mechanisms for deferred resolution of syntactic ambiguities.

pdf bib
Misinfo Reaction Frames: Reasoning about Readers’ Reactions to News Headlines
Saadia Gabriel | Skyler Hallinan | Maarten Sap | Pemi Nguyen | Franziska Roesner | Eunsol Choi | Yejin Choi

Even to a simple and short news headline, readers react in a multitude of ways: cognitively (e.g. inferring the writer’s intent), emotionally (e.g. feeling distrust), and behaviorally (e.g. sharing the news with their friends). Such reactions are instantaneous and yet complex, as they rely on factors that go beyond interpreting factual content of news. We propose Misinfo Reaction Frames (MRF), a pragmatic formalism for modeling how readers might react to a news headline. In contrast to categorical schema, our free-text dimensions provide a more nuanced way of understanding intent beyond being benign or malicious. We also introduce a Misinfo Reaction Frames corpus, a crowdsourced dataset of reactions to over 25k news headlines focusing on global crises: the Covid-19 pandemic, climate change, and cancer. Empirical results confirm that it is indeed possible for neural models to predict the prominent patterns of readers’ reactions to previously unseen news headlines. Additionally, our user study shows that displaying machine-generated MRF implications alongside news headlines to readers can increase their trust in real news while decreasing their trust in misinformation. Our work demonstrates the feasibility and importance of pragmatic inferences on news headlines to help enhance AI-guided misinformation detection and mitigation.

pdf bib
Achieving Conversational Goals with Unsupervised Post-hoc Knowledge Injection
Bodhisattwa Prasad Majumder | Harsh Jhamtani | Taylor Berg-Kirkpatrick | Julian McAuley

A limitation of current neural dialog models is that they tend to suffer from a lack of specificity and informativeness in generated responses, primarily due to dependence on training data that covers a limited variety of scenarios and conveys limited knowledge. One way to alleviate this issue is to extract relevant knowledge from external sources at decoding time and incorporate it into the dialog response. In this paper, we propose a post-hoc knowledge-injection technique where we first retrieve a diverse set of relevant knowledge snippets conditioned on both the dialog history and an initial response from an existing dialog model. We construct multiple candidate responses, individually injecting each retrieved snippet into the initial response using a gradient-based decoding method, and then select the final response with an unsupervised ranking step. Our experiments in goal-oriented and knowledge-grounded dialog settings demonstrate that human annotators judge the outputs from the proposed method to be more engaging and informative compared to responses from prior dialog systems. We further show that knowledge-augmentation promotes success in achieving conversational goals in both experimental settings.

pdf bib
Generated Knowledge Prompting for Commonsense Reasoning
Jiacheng Liu | Alisa Liu | Ximing Lu | Sean Welleck | Peter West | Ronan Le Bras | Yejin Choi | Hannaneh Hajishirzi

It remains an open question whether incorporating external knowledge benefits commonsense reasoning while maintaining the flexibility of pretrained sequence models. To investigate this question, we develop generated knowledge prompting, which consists of generating knowledge from a language model, then providing the knowledge as additional input when answering a question. Our method does not require task-specific supervision for knowledge integration, or access to a structured knowledge base, yet it improves performance of large-scale, state-of-the-art models on four commonsense reasoning tasks, achieving state-of-the-art results on numerical commonsense (NumerSense), general commonsense (CommonsenseQA 2.0), and scientific commonsense (QASC) benchmarks. Generated knowledge prompting highlights large-scale language models as flexible sources of external knowledge for improving commonsense reasoning. Our code is available at
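
The two-step procedure (generate knowledge, then answer with the knowledge as extra input) can be sketched with the language models abstracted as plain callables. The function names, the callable signatures, and the choice to pick the answer with the highest score across knowledge statements are assumptions made for this illustration.

```python
from typing import Callable, Iterable, List


def answer_with_knowledge(
    question: str,
    choices: List[str],
    generate_knowledge: Callable[[str], Iterable[str]],
    score: Callable[[str, str, str], float],
) -> str:
    """Pick the answer choice best supported by any generated statement.

    generate_knowledge(question) stands in for prompting a language model for
    knowledge statements; score(knowledge, question, choice) stands in for the
    answering model's confidence. Both are hypothetical interfaces.
    """
    # The empty string keeps the knowledge-free prediction as a fallback.
    statements = [""] + list(generate_knowledge(question))
    best_score, best_choice = max(
        ((score(k, question, c), c) for k in statements for c in choices),
        key=lambda pair: pair[0],
    )
    return best_choice


# Toy stand-ins for the two models, for demonstration only.
def toy_generate(question: str) -> List[str]:
    return ["Penguins are birds that cannot fly."]


def toy_score(knowledge: str, question: str, choice: str) -> float:
    if "cannot fly" in knowledge and choice == "no":
        return 0.9
    return 0.4 if choice == "yes" else 0.1


answer = answer_with_knowledge("Can penguins fly?", ["yes", "no"],
                               toy_generate, toy_score)
```

Without the generated statement the toy scorer prefers "yes"; with it, "no" receives the highest score, which mirrors how added knowledge can change a prediction.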

pdf bib
Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data
Shuohang Wang | Yichong Xu | Yuwei Fang | Yang Liu | Siqi Sun | Ruochen Xu | Chenguang Zhu | Michael Zeng

Retrieval-based methods have been shown to be effective in NLP tasks via introducing external knowledge. However, the indexing and retrieving of large-scale corpora bring considerable computational cost. Surprisingly, we found that REtrieving from the traINing datA (REINA) only can lead to significant gains on multiple NLG and NLU tasks. We retrieve the labeled training instances most similar to the input text and then concatenate them with the input to feed into the model to generate the output. Experimental results show that this simple method can achieve significantly better performance on a variety of NLU and NLG tasks, including summarization, machine translation, language modeling, and question answering tasks. For instance, our proposed method achieved state-of-the-art results on XSum, BigPatent, and CommonsenseQA. Our code is released.
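
The retrieve-then-concatenate step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: the similarity function is a simple token-overlap count chosen for brevity, and the `[SEP]` separator and `=>` formatting of retrieved pairs are hypothetical.

```python
from collections import Counter


def tokenize(text):
    return text.lower().split()


def overlap_score(query_tokens, doc_tokens):
    # Multiset token overlap; a stand-in for whatever retriever is actually used.
    return sum((Counter(query_tokens) & Counter(doc_tokens)).values())


def retrieve_and_concat(input_text, train_pairs, k=2, sep=" [SEP] "):
    """Append the k most similar labeled training instances to the input."""
    query = tokenize(input_text)
    ranked = sorted(train_pairs,
                    key=lambda pair: overlap_score(query, tokenize(pair[0])),
                    reverse=True)
    retrieved = [f"{source} => {label}" for source, label in ranked[:k]]
    return sep.join([input_text] + retrieved)


# Hypothetical labeled training data: (input text, output label).
train_pairs = [
    ("where do fish live", "water"),
    ("where do birds build nests", "trees"),
    ("what do cows eat", "grass"),
]
augmented = retrieve_and_concat("where do small fish live", train_pairs, k=1)
print(augmented)
```

The augmented string is what would be fed to the downstream model in place of the original input; only the training set is indexed, so no external corpus is needed.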

pdf bib
TruthfulQA: Measuring How Models Mimic Human Falsehoods
Stephanie Lin | Jacob Hilton | Owain Evans

We propose a benchmark to measure whether a language model is truthful in generating answers to questions. The benchmark comprises questions that span categories including health, law, finance and politics. We crafted questions that some humans would answer falsely due to a false belief or misconception. To perform well, models must avoid generating false answers learned from imitating human texts. We tested GPT-3, GPT-Neo/J, GPT-2 and a T5-based model. The best model was truthful on of questions while human performance was. Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans. The largest models were generally the least truthful. This contrasts with other NLP tasks, where performance improves with model size. However, this result is expected if false answers are learned from the training distribution. We suggest that scaling up models alone is less promising for improving truthfulness than fine-tuning using training objectives other than imitation of text from the web.

pdf bib
Right for the Right Reason: Evidence Extraction for Trustworthy Tabular Reasoning
Vivek Gupta | Shuo Zhang | Alakananda Vempala | Yujie He | Temma Choji | Vivek Srikumar

When pre-trained contextualized embedding-based models developed for unstructured data are adapted for structured tabular data, they perform admirably. However, recent probing studies show that these models use spurious correlations, and often predict inference labels by focusing on false evidence or ignoring it altogether. To study this issue, we introduce the task of Trustworthy Tabular Reasoning, where a model needs to extract evidence to be used for reasoning, in addition to predicting the label. As a case study, we propose a two-stage sequential prediction approach, which includes an evidence extraction and an inference stage. First, we crowdsource evidence row labels and develop several unsupervised and supervised evidence extraction strategies for InfoTabS, a tabular NLI benchmark. Our evidence extraction strategy outperforms earlier baselines. On the downstream tabular inference task, using only the automatically extracted evidence as the premise, our approach outperforms prior benchmarks.

pdf bib
Direct Speech-to-Speech Translation With Discrete Units
Ann Lee | Peng-Jen Chen | Changhan Wang | Jiatao Gu | Sravya Popuri | Xutai Ma | Adam Polyak | Yossi Adi | Qing He | Yun Tang | Juan Pino | Wei-Ning Hsu

We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech. When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass. Experiments on the Fisher Spanish-English dataset show that the proposed framework yields improvement of 6.7 BLEU compared with a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, our model performance is comparable to models that predict spectrograms and are trained with text supervision, showing the potential of our system for translation between unwritten languages.

pdf bib
Dataset Geography: Mapping Language Data to Language Users
Fahim Faisal | Yinkai Wang | Antonios Anastasopoulos

As language technologies become more ubiquitous, there are increasing efforts towards expanding the language diversity and coverage of natural language processing (NLP) systems. Arguably, the most important factor influencing the quality of modern NLP systems is data availability. In this work, we study the geographical representativeness of NLP datasets, aiming to quantify if and by how much NLP datasets match the expected needs of the language speakers. In doing so, we use entity recognition and linking systems, also making important observations about their cross-lingual consistency and giving suggestions for more robust evaluation. Last, we explore some geographical and economic factors that may explain the observed dataset distributions.

pdf bib
ILDAE: Instance-Level Difficulty Analysis of Evaluation Data
Neeraj Varshney | Swaroop Mishra | Chitta Baral

Knowledge of difficulty level of questions helps a teacher in several ways, such as estimating students’ potential quickly by asking carefully selected questions and improving quality of examination by modifying trivial and hard questions. Can we extract such benefits of instance difficulty in Natural Language Processing? To this end, we conduct Instance-Level Difficulty Analysis of Evaluation data (ILDAE) in a large-scale setup of 23 datasets and demonstrate its five novel applications: 1) conducting efficient-yet-accurate evaluations with fewer instances saving computational cost and time, 2) improving quality of existing evaluation datasets by repairing erroneous and trivial instances, 3) selecting the best model based on application requirements, 4) analyzing dataset characteristics for guiding future data creation, 5) estimating Out-of-Domain performance reliably. Comprehensive experiments for these applications lead to several interesting results, such as evaluation using just 5% instances (selected via ILDAE) achieves as high as 0.93 Kendall correlation with evaluation using complete dataset and computing weighted accuracy using difficulty scores leads to 5.2% higher correlation with Out-of-Domain performance. We release the difficulty scores and hope our work will encourage research in this important yet understudied field of leveraging instance difficulty in evaluations.
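
To make the idea of "weighted accuracy using difficulty scores" concrete, here is a minimal Python sketch. The abstract does not give the weighting formula, so the scheme below (each instance contributes its difficulty score as its weight) is an assumption for illustration, and the sample data are invented.

```python
def weighted_accuracy(correct, difficulty):
    """Accuracy where each instance counts in proportion to its difficulty score.

    correct:    list of booleans, one per evaluation instance
    difficulty: list of non-negative scores, same length (assumed weighting)
    """
    if len(correct) != len(difficulty):
        raise ValueError("correct and difficulty must have the same length")
    total = sum(difficulty)
    if total == 0:
        raise ValueError("difficulty scores must not all be zero")
    return sum(d for c, d in zip(correct, difficulty) if c) / total


# Invented example: the model misses one hard instance.
correct = [True, True, False, True]
difficulty = [0.1, 0.2, 0.9, 0.8]

plain = sum(correct) / len(correct)
weighted = weighted_accuracy(correct, difficulty)
print(f"plain accuracy: {plain:.2f}, difficulty-weighted: {weighted:.2f}")
```

In this toy case the weighted score is lower than plain accuracy because the single error falls on a high-difficulty instance, which is the kind of signal a difficulty-aware metric is meant to surface.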

pdf bib
How Do We Answer Complex Questions: Discourse Structure of Long-form Answers
Fangyuan Xu | Junyi Jessy Li | Eunsol Choi

Long-form answers, consisting of multiple sentences, can provide nuanced and comprehensive answers to a broader set of questions. To better understand this complex and understudied task, we study the functional structure of long-form answers collected from three datasets, ELI5, WebGPT and Natural Questions. Our main goal is to understand how humans organize information to craft complex answers. We develop an ontology of six sentence-level functional roles for long-form answers, and annotate 3.9k sentences in 640 answer paragraphs. Different answer collection methods manifest in different discourse structures. We further analyze model-generated answers – finding that annotators agree less with each other when annotating model-generated answers compared to annotating human-written answers. Our annotated data enables training a strong classifier that can be used for automatic analysis. We hope our work can inspire future research on discourse-level modeling and evaluation of long-form QA systems.

pdf bib
ConditionalQA: A Complex Reading Comprehension Dataset with Conditional Answers
Haitian Sun | William Cohen | Ruslan Salakhutdinov

We describe a Question Answering (QA) dataset that contains complex questions with conditional answers, i.e. the answers are only applicable when certain conditions apply. We call this dataset ConditionalQA. In addition to conditional answers, the dataset also features: long context documents with information that is related in logically complex ways; multi-hop questions that require compositional logical reasoning; a combination of extractive questions, yes/no questions, questions with multiple answers, and not-answerable questions; and questions asked without knowing the answers. We show that ConditionalQA is challenging for many of the existing QA models, especially in selecting answer conditions. We believe that this dataset will motivate further research in answering complex questions over long documents.

pdf bib
An Investigation of the (In)effectiveness of Counterfactually Augmented Data
Nitish Joshi | He He

While pretrained language models achieve excellent performance on natural language understanding benchmarks, they tend to rely on spurious correlations and generalize poorly to out-of-distribution (OOD) data. Recent work has explored using counterfactually-augmented data (CAD)—data generated by minimally perturbing examples to flip the ground-truth label—to identify robust features that are invariant under distribution shift. However, empirical results using CAD during training for OOD generalization have been mixed. To explain this discrepancy, through a toy theoretical example and empirical analysis on two crowdsourced CAD datasets, we show that: (a) while features perturbed in CAD are indeed robust features, it may prevent the model from learning unperturbed robust features; and (b) CAD may exacerbate existing spurious correlations in the data. Our results thus show that the lack of perturbation diversity limits CAD’s effectiveness on OOD generalization, calling for innovative crowdsourcing procedures to elicit diverse perturbation of examples.

pdf bib
Inducing Positive Perspectives with Text Reframing
Caleb Ziems | Minzhi Li | Anthony Zhang | Diyi Yang

Sentiment transfer is one popular example of a text style transfer task, where the goal is to reverse the sentiment polarity of a text. With a sentiment reversal comes also a reversal in meaning. We introduce a different but related task called positive reframing, in which we neutralize a negative point of view and generate a more positive perspective for the author without contradicting the original meaning. Our insistence on meaning preservation makes positive reframing a challenging and semantically rich task. To facilitate rapid progress, we introduce a large-scale benchmark, Positive Psychology Frames, with 8,349 sentence pairs and 12,755 structured annotations to explain positive reframing in terms of six theoretically motivated reframing strategies. Then we evaluate a set of state-of-the-art text style transfer models, and conclude by discussing key challenges and directions for future work.

pdf bib
The Moral Integrity Corpus: A Benchmark for Ethical Dialogue Systems
Caleb Ziems | Jane Yu | Yi-Chia Wang | Alon Halevy | Diyi Yang

Conversational agents have come increasingly closer to human competence in open-domain dialogue settings; however, such models can reflect insensitive, hurtful, or entirely incoherent viewpoints that erode a user’s trust in the moral integrity of the system. Moral deviations are difficult to mitigate because moral judgments are not universal, and there may be multiple competing judgments that apply to a situation simultaneously. In this work, we introduce a new resource, not to authoritatively resolve moral ambiguities, but instead to facilitate systematic understanding of the intuitions, values and moral judgments reflected in the utterances of dialogue systems. The Moral Integrity Corpus (MIC) is such a resource, which captures the moral assumptions of 38k prompt-reply pairs, using 99k distinct Rules of Thumb (RoTs). Each RoT reflects a particular moral conviction that can explain why a chatbot’s reply may appear acceptable or problematic. We further organize RoTs with a set of moral and social attributes and benchmark performance for attribute classification. Most importantly, we show that current neural language models can automatically generate new RoTs that reasonably describe previously unseen interactions, but they still struggle with certain scenarios. Our findings suggest that MIC will be a useful resource for understanding language models’ implicit moral assumptions and flexibly benchmarking the integrity of conversational agents. To download the data, see

pdf bib
Bag-of-Words vs. Graph vs. Sequence in Text Classification: Questioning the Necessity of Text-Graphs and the Surprising Strength of a Wide MLP
Lukas Galke | Ansgar Scherp

Graph neural networks have triggered a resurgence of graph-based text classification methods, defining today’s state of the art. We show that a wide multi-layer perceptron (MLP) using a Bag-of-Words (BoW) outperforms the recent graph-based models TextGCN and HeteGCN in an inductive text classification setting and is comparable with HyperGAT. Moreover, we fine-tune a sequence-based BERT and a lightweight DistilBERT model, which both outperform all state-of-the-art models. These results question the importance of synthetic graphs used in modern text classifiers. In terms of efficiency, DistilBERT is still twice as large as our BoW-based wide MLP, while graph-based models like TextGCN require setting up an $\mathcal{O}(N^2)$ graph, where $N$ is the vocabulary plus corpus size. Finally, since Transformers need to compute $\mathcal{O}(L^2)$ attention weights with sequence length $L$, the MLP models show higher training and inference speeds on datasets with long sequences.
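
As a rough illustration of the Bag-of-Words plus MLP setup this abstract refers to, the following pure-Python sketch builds BoW count vectors and runs them through a one-hidden-layer network. The layer sizes, initialization, and toy documents are assumptions for illustration; the weights are random and untrained, so the output probabilities carry no meaning beyond showing the data flow.

```python
import math
import random


def build_vocab(docs):
    vocab = {}
    for doc in docs:
        for token in doc.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def bow_vector(doc, vocab):
    # Word order is discarded: the input size depends on the vocabulary,
    # not on the sequence length.
    vec = [0.0] * len(vocab)
    for token in doc.lower().split():
        if token in vocab:
            vec[vocab[token]] += 1.0
    return vec


def init_layer(n_in, n_out, rng):
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(x, layer):
    weights, bias = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(x, hidden_layer, output_layer):
    hidden = [max(0.0, v) for v in linear(x, hidden_layer)]  # ReLU
    return softmax(linear(hidden, output_layer))


docs = ["the match ended in a draw", "the market fell sharply"]
vocab = build_vocab(docs)
rng = random.Random(0)
hidden_layer = init_layer(len(vocab), 16, rng)  # hidden width is an arbitrary choice
output_layer = init_layer(16, 2, rng)           # two hypothetical classes
features = bow_vector(docs[0], vocab)
probs = mlp_forward(features, hidden_layer, output_layer)
```

A single forward pass costs time proportional to the vocabulary size times the hidden width, with no pairwise term over tokens, which is the efficiency contrast with quadratic attention drawn in the abstract.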

pdf bib
Generative Pretraining for Paraphrase Evaluation
Jack Weston | Raphael Lenain | Udeepa Meepegama | Emil Fristed

We introduce ParaBLEU, a paraphrase representation learning model and evaluation metric for text generation. Unlike previous approaches, ParaBLEU learns to understand paraphrasis using generative conditioning as a pretraining objective. ParaBLEU correlates more strongly with human judgements than existing metrics, obtaining new state-of-the-art results on the WMT Metrics Shared Task. We show that our model is robust to data scarcity, exceeding previous state-of-the-art performance using only 50% of the available training data and surpassing BLEU, ROUGE and METEOR with only labelled examples. Finally, we demonstrate that ParaBLEU can be used to conditionally generate novel paraphrases from a single demonstration, which we use to confirm our hypothesis that it learns abstract, generalized paraphrase representations.

pdf bib
Word Segmentation as Unsupervised Constituency Parsing
Raquel G. Alhama

Word identification from continuous input is typically viewed as a segmentation task. Experiments with human adults suggest that familiarity with syntactic structures in their native language also influences word identification in artificial languages; however, the relation between syntactic processing and word identification is yet unclear. This work takes one step forward by exploring a radically different approach of word identification, in which segmentation of a continuous input is viewed as a process isomorphic to unsupervised constituency parsing. Besides formalizing the approach, this study reports simulations of human experiments with DIORA (Drozdov et al.), a neural unsupervised constituency parser. Results show that this model can reproduce human behavior in word identification experiments, suggesting that this is a viable approach to study word identification and its relation to syntactic processing.

pdf bib
SafetyKit: First Aid for Measuring Safety in Open-domain Conversational Systems
Emily Dinan | Gavin Abercrombie | A. Bergman | Shannon Spruit | Dirk Hovy | Y-Lan Boureau | Verena Rieser

The social impact of natural language processing and its applications has received increasing attention. In this position paper, we focus on the problem of safety for end-to-end conversational AI. We survey the problem landscape therein, introducing a taxonomy of three observed phenomena: the Instigator, Yea-Sayer, and Impostor effects. We then empirically assess the extent to which current tools can measure these effects and current systems display them. We release these tools as part of a “first aid kit” (SafetyKit) to quickly assess apparent safety concerns. Our results show that, while current tools are able to provide an estimate of the relative safety of systems in various settings, they still have several shortcomings. We suggest several future directions and discuss ethical considerations.

pdf bib
The Paradox of the Compositionality of Natural Language: A Neural Machine Translation Case Study
Verna Dankers | Elia Bruni | Dieuwke Hupkes

Obtaining human-like performance in NLP is often argued to require compositional generalisation. Whether neural networks exhibit this ability is usually studied by training models on highly compositional synthetic data. However, compositionality in natural language is much more complex than the rigid, arithmetic-like version such data adheres to, and artificial compositionality tests thus do not allow us to determine how neural models deal with more realistic forms of compositionality. In this work, we re-instantiate three compositionality tests from the literature and reformulate them for neural machine translation (NMT). Our results highlight that: (i) unfavourably, models trained on more data are more compositional; (ii) models are sometimes less compositional than expected, but sometimes more, exemplifying that different levels of compositionality are required, and models are not always able to modulate between them correctly; (iii) some of the non-compositional behaviours are mistakes, whereas others reflect the natural variation in data. Apart from an empirical study, our work is a call to action: we should rethink the evaluation of compositionality in neural networks and develop benchmarks using real data to evaluate compositionality on natural language, where composing meaning is not as straightforward as doing the math.

pdf bib
LexGLUE: A Benchmark Dataset for Legal Language Understanding in English
Ilias Chalkidis | Abhik Jana | Dirk Hartung | Michael Bommarito | Ion Androutsopoulos | Daniel Katz | Nikolaos Aletras

Laws and their interpretations, legal arguments and agreements are typically expressed in writing, leading to the production of vast corpora of legal text. Their analysis, which is at the center of legal practice, becomes increasingly elaborate as these collections grow in size. Natural language understanding (NLU) technologies can be a valuable tool to support legal practitioners in these endeavors. Their usefulness, however, largely depends on whether current state-of-the-art models can generalize across various tasks in the legal domain. To answer this currently open question, we introduce the Legal General Language Understanding Evaluation (LexGLUE) benchmark, a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks in a standardized way. We also provide an evaluation and analysis of several generic and legal-oriented models, demonstrating that the latter consistently offer performance improvements across multiple tasks.

pdf bib
SRL4E – Semantic Role Labeling for Emotions: A Unified Evaluation Framework
Cesare Campagnano | Simone Conia | Roberto Navigli

In the field of sentiment analysis, several studies have highlighted that a single sentence may express multiple, sometimes contrasting, sentiments and emotions, each with its own experiencer, target and/or cause. To this end, over the past few years researchers have started to collect and annotate data manually, in order to investigate the capabilities of automatic systems not only to distinguish between emotions, but also to capture their semantic constituents. However, currently available gold datasets are heterogeneous in size, domain, format, splits, emotion categories and role labels, making comparisons across different works difficult and hampering progress in the area. In this paper, we tackle this issue and present a unified evaluation framework focused on Semantic Role Labeling for Emotions (SRL4E), in which we unify several datasets tagged with emotions and semantic roles by using a common labeling scheme. We use SRL4E as a benchmark to evaluate how modern pretrained language models perform and analyze where we currently stand in this task, hoping to provide the tools to facilitate studies in this complex area.

pdf bib
Context Matters: A Pragmatic Study of PLMs’ Negation Understanding
Reto Gubelmann | Siegfried Handschuh

In linguistics, there are two main perspectives on negation: a semantic and a pragmatic view. So far, research in NLP on negation has almost exclusively adhered to the semantic view. In this article, we adopt the pragmatic paradigm to conduct a study of negation understanding focusing on transformer-based PLMs. Our results differ from previous, semantics-based studies and therefore help to contribute a more comprehensive – and, given the results, much more optimistic – picture of the PLMs’ negation understanding.

pdf bib
Identifying Moments of Change from Longitudinal User Text
Adam Tsakalidis | Federico Nanni | Anthony Hills | Jenny Chim | Jiayu Song | Maria Liakata

Identifying changes in individuals’ behaviour and mood, as observed via content shared on online platforms, is increasingly gaining importance. Most research to date on this topic focuses on either (a) identifying individuals at risk or with a certain mental health condition given a batch of posts or (b) providing equivalent labels at the post level. A disadvantage of such work is the lack of a strong temporal component and the inability to make longitudinal assessments following an individual’s trajectory and allowing timely interventions. Here we define a new task, that of identifying moments of change in individuals on the basis of their shared content online. The changes we consider are sudden shifts in mood (switches) or gradual mood progression (escalations). We have created detailed guidelines for capturing moments of change and a corpus of manually annotated user timelines (18.7K posts). We have developed a variety of baseline models drawing inspiration from related tasks and show that the best performance is obtained through context-aware sequential modelling. We also introduce new metrics for capturing rare events in temporal windows.

pdf bib
Semi-Supervised Formality Style Transfer with Consistency Training
Ao Liu | An Wang | Naoaki Okazaki

Formality style transfer (FST) is a task that involves paraphrasing an informal sentence into a formal one without altering its meaning. To address the data-scarcity problem of existing parallel datasets, previous studies tend to adopt a cycle-reconstruction scheme to utilize additional unlabeled data, where the FST model mainly benefits from target-side unlabeled sentences. In this work, we propose a simple yet effective semi-supervised framework to better utilize source-side unlabeled sentences based on consistency training. Specifically, our approach augments pseudo-parallel data obtained from a source-side informal sentence by enforcing the model to generate similar outputs for its perturbed version. Moreover, we empirically examined the effects of various data perturbation methods and propose effective data filtering strategies to improve our framework. Experimental results on the GYAFC benchmark demonstrate that our approach can achieve state-of-the-art results, even with less than 40% of the parallel data.

pdf bib
Rare and Zero-shot Word Sense Disambiguation using Z-Reweighting
Ying Su | Hongming Zhang | Yangqiu Song | Tong Zhang

Word sense disambiguation (WSD) is a crucial problem in the natural language processing (NLP) community. Current methods achieve decent performance by utilizing supervised learning and large pre-trained language models. However, the imbalanced training dataset leads to poor performance on rare senses and zero-shot senses. There are more training instances and senses for words with top frequency ranks than for those with low frequency ranks in the training dataset. We investigate the statistical relation between word frequency rank and word sense number distribution. Based on the relation, we propose a Z-reweighting method on the word level to adjust the training on the imbalanced dataset. The experiments show that the Z-reweighting strategy achieves performance gain on the standard English all-words WSD benchmark. Moreover, the strategy can help models generalize better on rare and zero-shot senses.

pdf bib
WikiDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types
Xuwu Wang | Junfeng Tian | Min Gui | Zhixu Li | Rui Wang | Ming Yan | Lihan Chen | Yanghua Xiao

Multimodal Entity Linking (MEL), which aims at linking mentions with multimodal contexts to the referent entities from a knowledge base (e.g., Wikipedia), is an essential task for many multimodal applications. Although much attention has been paid to MEL, the shortcomings of existing MEL datasets, including limited contextual topics and entity types, simplified mention ambiguity, and restricted availability, have caused great obstacles to the research and application of MEL. In this paper, we present WikiDiverse, a high-quality human-annotated MEL dataset with diversified contextual topics and entity types from Wikinews, which uses Wikipedia as the corresponding knowledge base. A well-tailored annotation procedure is adopted to ensure the quality of the dataset. Based on WikiDiverse, a sequence of well-designed MEL models with intra-modality and inter-modality attentions are implemented, which utilize the visual information of images more adequately than existing MEL models do. Extensive experimental analyses are conducted to investigate the contributions of different modalities in terms of MEL, facilitating the future research on this task.

pdf bib
DialogVED: A Pre-trained Latent Variable Encoder-Decoder Model for Dialog Response Generation
Wei Chen | Yeyun Gong | Song Wang | Bolun Yao | Weizhen Qi | Zhongyu Wei | Xiaowu Hu | Bartuer Zhou | Yi Mao | Weizhu Chen | Biao Cheng | Nan Duan

Dialog response generation in open domain is an important research topic where the main challenge is to generate relevant and diverse responses. In this paper, we propose a new dialog pre-training framework called DialogVED, which introduces continuous latent variables into the enhanced encoder-decoder pre-training framework to increase the relevance and diversity of responses. With the help of a large dialog corpus (Reddit), we pre-train the model using the following 4 tasks, used in training language models (LMs) and Variational Autoencoders (VAEs) literature: 1) masked language model; 2) response generation; 3) bag-of-words prediction; and 4) KL divergence reduction. We also add additional parameters to model the turn structure in dialogs to improve the performance of the pre-trained model. We conduct experiments on PersonaChat, DailyDialog, and DSTC7-AVSD benchmarks for response generation. Experimental results show that our model achieves the new state-of-the-art results on all these datasets.

pdf bib
Contextual Fine-to-Coarse Distillation for Coarse-grained Response Selection in Open-Domain Conversations
Wei Chen | Yeyun Gong | Can Xu | Huang Hu | Bolun Yao | Zhongyu Wei | Zhihao Fan | Xiaowu Hu | Bartuer Zhou | Biao Cheng | Daxin Jiang | Nan Duan

We study the problem of coarse-grained response selection in retrieval-based dialogue systems. The problem is as important as fine-grained response selection, but is less explored in existing literature. In this paper, we propose a Contextual Fine-to-Coarse (CFC) distilled model for coarse-grained response selection in open-domain conversations. In our CFC model, dense representations of query, candidate contexts and responses are learned based on the multi-tower architecture using contextual matching, and richer knowledge learned from the one-tower architecture (fine-grained) is distilled into the multi-tower architecture (coarse-grained) to enhance the performance of the retriever. To evaluate the performance of the proposed model, we construct two new datasets based on the Reddit comments dump and Twitter corpus. Extensive experimental results on the two datasets show that the proposed method achieves huge improvement over all evaluation metrics compared with traditional baseline methods.

pdf bib
Packed Levitated Marker for Entity and Relation Extraction
Deming Ye | Yankai Lin | Peng Li | Maosong Sun

Recent entity and relation extraction works focus on investigating how to obtain a better span representation from the pre-trained encoder. However, a major limitation of existing works is that they ignore the interrelation between spans (pairs). In this work, we propose a novel span representation approach, named Packed Levitated Markers (PL-Marker), to consider the interrelation between the spans (pairs) by strategically packing the markers in the encoder. In particular, we propose a neighborhood-oriented packing strategy, which considers the neighbor spans integrally to better model the entity boundary information. Furthermore, for those more complicated span pair classification tasks, we design a subject-oriented packing strategy, which packs each subject and all its objects to model the interrelation between the same-subject span pairs. The experimental results show that, with the enhanced marker feature, our model advances baselines on six NER benchmarks, and obtains a 4.1%-4.3% strict relation F1 improvement with higher speed over previous state-of-the-art models on ACE04 and ACE05. Our code and models are publicly available at

pdf bib
KG-FiD: Infusing Knowledge Graph in Fusion-in-Decoder for Open-Domain Question Answering
Donghan Yu | Chenguang Zhu | Yuwei Fang | Wenhao Yu | Shuohang Wang | Yichong Xu | Xiang Ren | Yiming Yang | Michael Zeng

Current Open-Domain Question Answering (ODQA) models typically include a retrieving module and a reading module, where the retriever selects potentially relevant passages from open-source documents for a given question, and the reader produces an answer based on the retrieved passages. The recently proposed Fusion-in-Decoder (FiD) framework is a representative example, which is built on top of a dense passage retriever and a generative reader, achieving the state-of-the-art performance. In this paper we further improve the FiD approach by introducing a knowledge-enhanced version, namely KG-FiD. Our new model uses a knowledge graph to establish the structural relationship among the retrieved passages, and a graph neural network (GNN) to re-rank the passages and select only a top few for further processing. Our experiments on common ODQA benchmark datasets (Natural Questions and TriviaQA) demonstrate that KG-FiD can achieve comparable or better performance in answer prediction than FiD, with less than 40% of the computation cost.

pdf bib
CICERO: A Dataset for Contextualized Commonsense Inference in Dialogues
Deepanway Ghosal | Siqi Shen | Navonil Majumder | Rada Mihalcea | Soujanya Poria

This paper addresses the problem of dialogue reasoning with contextualized commonsense inference. We curate CICERO, a dataset of dyadic conversations with five types of utterance-level reasoning-based inferences: cause, subsequent event, prerequisite, motivation, and emotional reaction. The dataset contains 53,105 of such inferences from 5,672 dialogues. We use this dataset to solve relevant generative and discriminative tasks: generation of cause and subsequent event; generation of prerequisite, motivation, and listener’s emotional reaction; and selection of plausible alternatives. Our results ascertain the value of such dialogue-centric commonsense knowledge datasets. It is our hope that CICERO will open new research avenues into commonsense-based dialogue reasoning.

pdf bib
A Comparative Study of Faithfulness Metrics for Model Interpretability Methods
Chun Sik Chan | Huanqi Kong | Liang Guanqing

Interpretable methods to reveal the internal reasoning processes behind machine learning models have attracted increasing attention in recent years. To quantify the extent to which the identified interpretations truly reflect the intrinsic decision-making mechanisms, various faithfulness evaluation metrics have been proposed. However, we find that different faithfulness metrics show conflicting preferences when comparing different interpretations. Motivated by this observation, we aim to conduct a comprehensive and comparative study of the widely adopted faithfulness metrics. In particular, we introduce two assessment dimensions, namely diagnosticity and complexity. Diagnosticity refers to the degree to which the faithfulness metric favors relatively faithful interpretations over randomly generated ones, and complexity is measured by the average number of model forward passes. According to the experimental results, we find that sufficiency and comprehensiveness metrics have higher diagnosticity and lower complexity than the other faithfulness metrics.
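The abstract does not define sufficiency and comprehensiveness. The sketch below assumes the formulation commonly used with the ERASER benchmark: comprehensiveness is the drop in predicted-class confidence when the rationale tokens are removed, and sufficiency is the drop when only the rationale tokens are kept. The `toy_model` classifier and the example tokens are hypothetical, and the paper's exact definitions may differ. Each metric here needs two forward passes, which is the quantity the complexity dimension counts.

```python
from typing import Callable, Sequence, Set

# A "model" here is any function returning P(predicted class | tokens).
Model = Callable[[Sequence[str]], float]


def comprehensiveness(model: Model, tokens: Sequence[str], rationale: Set[int]) -> float:
    """Confidence drop when the rationale tokens are removed from the input."""
    without_rationale = [t for i, t in enumerate(tokens) if i not in rationale]
    return model(tokens) - model(without_rationale)


def sufficiency(model: Model, tokens: Sequence[str], rationale: Set[int]) -> float:
    """Confidence drop when only the rationale tokens are kept."""
    only_rationale = [t for i, t in enumerate(tokens) if i in rationale]
    return model(tokens) - model(only_rationale)


def toy_model(tokens: Sequence[str]) -> float:
    # Hypothetical classifier: confidence grows with each occurrence of "great".
    hits = sum(t == "great" for t in tokens)
    return min(1.0, 0.5 + 0.25 * hits)


tokens = ["a", "great", "great", "movie"]
rationale = {1, 2}  # indices of the tokens an interpretation marked as important

print(comprehensiveness(toy_model, tokens, rationale))  # 0.5
print(sufficiency(toy_model, tokens, rationale))        # 0.0
```

Under this formulation, a faithful rationale tends to give high comprehensiveness (removing it hurts) and low sufficiency (keeping only it costs little).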

pdf bib
Pass off Fish Eyes for Pearls: Attacking Model Selection of Pre-trained Models
Biru Zhu | Yujia Qin | Fanchao Qi | Yangdong Deng | Zhiyuan Liu | Maosong Sun | Ming Gu

Selecting an appropriate pre-trained model (PTM) for a specific downstream task typically requires significant efforts of fine-tuning. To accelerate this process, researchers propose feature-based model selection (FMS) methods, which assess PTMs’ transferability to a specific task in a fast way without fine-tuning. In this work, we argue that current FMS methods are vulnerable, as the assessment mainly relies on the static features extracted from PTMs. However, such features are derived without training PTMs on downstream tasks, and are not necessarily reliable indicators for the PTM’s transferability. To validate our viewpoints, we design two methods to evaluate the robustness of FMS: (1) model disguise attack, which post-trains an inferior PTM with a contrastive objective, and (2) evaluation data selection, which selects a subset of the data points for FMS evaluation based on K-means clustering. Experimental results prove that both methods can successfully make FMS mistakenly judge the transferability of PTMs. Moreover, we find that these two methods can further be combined with the backdoor attack to misguide the FMS to select poisoned models. To the best of our knowledge, this is the first work to demonstrate the defects of current FMS algorithms and evaluate their potential security risks. By identifying previously unseen risks of FMS, our study indicates new directions for improving the robustness of FMS.

pdf bib
Educational Question Generation of Children Storybooks via Question Type Distribution Learning and Event-centric Summarization
Zhenjie Zhao | Yufang Hou | Dakuo Wang | Mo Yu | Chengzhong Liu | Xiaojuan Ma

Generating educational questions of fairytales or storybooks is vital for improving children’s literacy ability. However, it is challenging to generate questions that capture the interesting aspects of a fairytale story with educational meaningfulness. In this paper, we propose a novel question generation method that first learns the question type distribution of an input story paragraph, and then summarizes salient events which can be used to generate high-cognitive-demand questions. To train the event-centric summarizer, we finetune a pre-trained transformer-based sequence-to-sequence model using silver samples composed by educational question-answer pairs. On a newly proposed educational question-answering dataset FairytaleQA, we show good performance of our method on both automatic and human evaluation metrics. Our work indicates the necessity of decomposing question type distribution learning and event-centric summary generation for educational question generation.

pdf bib
A Neural Network Architecture for Program Understanding Inspired by Human Behaviors
Renyu Zhu | Lei Yuan | Xiang Li | Ming Gao | Wenyuan Cai

Program understanding is a fundamental task in program language processing. Despite the success, existing works fail to take human behaviors as reference in understanding programs. In this paper, we consider human behaviors and propose the PGNN-EK model that consists of two main components. On the one hand, inspired by the “divide-and-conquer” reading behaviors of humans, we present a partitioning-based graph neural network model PGNN on the upgraded AST of codes. On the other hand, to characterize human behaviors of resorting to other resources to help code comprehension, we transform raw codes with external knowledge and apply pre-training techniques for information extraction. Finally, we combine the two embeddings generated from the two components to output code embeddings. We conduct extensive experiments to show the superior performance of PGNN-EK on the code summarization and code clone detection tasks. In particular, to show the generalization ability of our model, we release a new dataset that is more challenging for code clone detection and could advance the development of the community. Our codes and data are publicly available at

pdf bib
Dynamic Prefix-Tuning for Generative Template-based Event Extraction
Xiao Liu | Heyan Huang | Ge Shi | Bo Wang

We consider event extraction in a generative manner with template-based conditional generation. Although there is a rising trend of casting the task of event extraction as a sequence generation problem with prompts, these generation-based methods have two significant challenges, including using suboptimal prompts and static event type information. In this paper, we propose a generative template-based event extraction method with dynamic prefix (GTEE-DynPref) by integrating context information with type-specific prefixes to learn a context-specific prefix for each context. Experimental results show that our model achieves competitive results with the state-of-the-art classification-based model OneIE on ACE 2005 and achieves the best performances on ERE. Additionally, our model is proven to be portable to new types of events effectively.

pdf bib
Noisy Channel Language Model Prompting for Few-Shot Text Classification
Sewon Min | Mike Lewis | Hannaneh Hajishirzi | Luke Zettlemoyer

We introduce a noisy channel approach for language model prompting in few-shot text classification. Instead of computing the likelihood of the label given the input (referred to as direct models), channel models compute the conditional probability of the input given the label, and are thereby required to explain every word in the input. We use channel models for recently proposed few-shot learning methods with no or very limited updates to the language model parameters, via either in-context demonstration or prompt tuning. Our experiments show that, for both methods, channel models significantly outperform their direct counterparts, which we attribute to their stability, i.e., lower variance and higher worst-case accuracy. We also present extensive ablations that provide recommendations for when to use channel prompt tuning instead of other competitive models (e.g., direct head tuning): channel prompt tuning is preferred when the number of training examples is small, labels in the training data are imbalanced, or generalization to unseen labels is required.
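The channel scoring rule described above can be illustrated with a deliberately small stand-in. The sketch below is not the authors' method: it replaces the pre-trained language model with a smoothed per-label unigram model, and the training texts, labels, and function names are hypothetical. It only shows the decision rule, which picks the label y maximizing log P(x | y) + log P(y), so that every input word contributes to the score.

```python
import math
from collections import Counter


def train_unigram(examples, vocab, alpha=1.0):
    """Add-alpha smoothed unigram log-probabilities log P(word | label)."""
    counts = Counter(w for text in examples for w in text.split())
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: math.log((counts[w] + alpha) / total) for w in vocab}


def channel_predict(text, label_models, log_prior):
    """Return argmax over labels of log P(x | y) + log P(y).

    Assumes every word of `text` is in the shared vocabulary.
    """
    def score(label):
        lm = label_models[label]
        return log_prior[label] + sum(lm[w] for w in text.split())

    return max(label_models, key=score)


# Hypothetical few-shot training data, two examples per label.
train = {
    "positive": ["great fun film", "great acting"],
    "negative": ["dull boring film", "boring plot"],
}
vocab = {w for texts in train.values() for t in texts for w in t.split()}
label_models = {y: train_unigram(texts, vocab) for y, texts in train.items()}
log_prior = {y: math.log(1.0 / len(train)) for y in train}

print(channel_predict("boring film", label_models, log_prior))  # negative
```

A direct model would instead score P(y | x) in a single step; the channel form has to assign probability to each input word under each candidate label.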

pdf bib
∞-former: Infinite Memory Transformer
Pedro Henrique Martins | Zita Marinho | Andre Martins

Transformers are unable to model long-term memories effectively, since the amount of computation they need to perform grows with the context length. While variations of efficient transformers have been proposed, they all have a finite memory capacity and are forced to drop old information. In this paper, we propose the ∞-former, which extends the vanilla transformer with an unbounded long-term memory. By making use of a continuous-space attention mechanism to attend over the long-term memory, the ∞-former’s attention complexity becomes independent of the context length, trading off memory length with precision. In order to control where precision is more important, ∞-former maintains “sticky memories,” being able to model arbitrarily long contexts while keeping the computation budget fixed. Experiments on a synthetic sorting task, language modeling, and document grounded dialogue generation demonstrate the ∞-former’s ability to retain information from long sequences.

pdf bib
Non-neural Models Matter: a Re-evaluation of Neural Referring Expression Generation Systems
Fahime Same | Guanyi Chen | Kees Van Deemter

In recent years, neural models have often outperformed rule-based and classic Machine Learning approaches in NLG. These classic approaches are now often disregarded, for example when new neural models are evaluated. We argue that they should not be overlooked, since, for some tasks, well-designed non-neural approaches achieve better performance than neural ones. In this paper, the task of generating referring expressions in linguistic context is used as an example. We examined two very different English datasets (WEBNLG and WSJ), and evaluated each algorithm using both automatic and human evaluations. Overall, the results of these evaluations suggest that rule-based systems with simple rule sets achieve on-par or better performance on both datasets compared to state-of-the-art neural REG systems. In the case of the more realistic dataset, WSJ, a machine learning-based system with well-designed linguistic features performed best. We hope that our work can encourage researchers to consider non-neural models in future.

pdf bib
Predicate-Argument Based Bi-Encoder for Paraphrase Identification
Qiwei Peng | David Weir | Julie Weeds | Yekun Chai

Paraphrase identification involves identifying whether a pair of sentences express the same or similar meanings. While cross-encoders have achieved high performances across several benchmarks, bi-encoders such as SBERT have been widely applied to sentence pair tasks. They exhibit substantially lower computation complexity and are better suited to symmetric tasks. In this work, we adopt a bi-encoder approach to the paraphrase identification task, and investigate the impact of explicitly incorporating predicate-argument information into SBERT through weighted aggregation. Experiments on six paraphrase identification datasets demonstrate that, with a minimal increase in parameters, the proposed model is able to outperform SBERT/SRoBERTa significantly. Further, ablation studies reveal that the predicate-argument based component plays a significant role in the performance gain.

pdf bib
Neural Machine Translation with Phrase-Level Universal Visual Representations
Qingkai Fang | Yang Feng

Multimodal machine translation (MMT) aims to improve neural machine translation (NMT) with additional visual information, but most existing MMT methods require paired input of source sentence and image, which makes them suffer from shortage of sentence-image pairs. In this paper, we propose a phrase-level retrieval-based method for MMT to get visual information for the source input from existing sentence-image data sets so that MMT can break the limitation of paired sentence-image input. Our method performs retrieval at the phrase level and hence learns visual information from pairs of source phrase and grounded region, which can mitigate data sparsity. Furthermore, our method employs the conditional variational auto-encoder to learn visual representations which can filter redundant visual information and only retain visual information related to the phrase. Experiments show that the proposed method significantly outperforms strong baselines on multiple MMT datasets, especially when the textual context is limited.

pdf bib
SpeechT5: Unified-Modal Encoder-Decoder Pre-Training for Spoken Language Processing
Junyi Ao | Rui Wang | Long Zhou | Chengyi Wang | Shuo Ren | Yu Wu | Shujie Liu | Tom Ko | Qing Li | Yu Zhang | Zhihua Wei | Yao Qian | Jinyu Li | Furu Wei

Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.

pdf bib
Unified Structure Generation for Universal Information Extraction
Yaojie Lu | Qing Liu | Dai Dai | Xinyan Xiao | Hongyu Lin | Xianpei Han | Le Sun | Hua Wu

Information extraction suffers from its varying targets, heterogeneous structures, and demand-specific schemas. In this paper, we propose a unified text-to-structure generation framework, namely UIE, which can universally model different IE tasks, adaptively generate targeted structures, and collaboratively learn general IE abilities from different knowledge sources. Specifically, UIE uniformly encodes different extraction structures via a structured extraction language, adaptively generates target extractions via a schema-based prompt mechanism (structural schema instructor), and captures the common IE abilities via a large-scale pretrained text-to-structure model. Experiments show that UIE achieved the state-of-the-art performance on IE tasks and datasets, and on all supervised, low-resource, and few-shot settings for a wide range of entity, relation, event, and sentiment extraction tasks and their unification. These results verified the effectiveness, universality, and transferability of UIE.

pdf bib
Pre-training to Match for Unified Low-shot Relation Extraction
Fangchao Liu | Hongyu Lin | Xianpei Han | Boxi Cao | Le Sun

Low-shot relation extraction (RE) aims to recognize novel relations with very few or even no samples, which is critical in real-world applications. Few-shot and zero-shot RE are two representative low-shot RE tasks, which seem to share a similar target but require totally different underlying abilities. In this paper, we propose Multi-Choice Matching Networks to unify low-shot relation extraction. To fill in the gap between zero-shot and few-shot RE, we propose the triplet-paraphrase meta-training, which leverages triplet paraphrase to pre-train zero-shot label matching ability and uses a meta-learning paradigm to learn few-shot instance summarizing ability. Experimental results on three different low-shot RE tasks show that the proposed method outperforms strong baselines by a large margin, and achieves the best performance on the few-shot RE leaderboard.

pdf bib
Feeding What You Need by Understanding What You Learned
Xiaoqiang Wang | Bang Liu | Fangli Xu | Bo Long | Siliang Tang | Lingfei Wu

Machine Reading Comprehension (MRC) reveals the ability to understand a given text passage and answer questions based on it. Existing research works in MRC rely heavily on large-size models and corpus to improve the performance evaluated by metrics such as Exact Match (EM) and F_1. However, such a paradigm lacks sufficient interpretation to model capability and cannot efficiently train a model with a large corpus. In this paper, we argue that a deep understanding of model capabilities and data properties can help us feed a model with appropriate training data based on its learning status. Specifically, we design an MRC capability assessment framework that assesses model capabilities in an explainable and multi-dimensional manner. Based on it, we further uncover and disentangle the connections between various data properties and model performance. Finally, to verify the effectiveness of the proposed MRC capability assessment framework, we incorporate it into a curriculum learning pipeline and devise a Capability Boundary Breakthrough Curriculum (CBBC) strategy, which performs a model capability-based training to maximize the data value and improve training efficiency. Extensive experiments demonstrate that our approach significantly improves performance, achieving up to an 11.22% / 8.71% improvement of EM / F_1 on MRC tasks.
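For readers unfamiliar with the two metrics named above, the following is a simplified sketch of Exact Match and token-level F_1 as they are commonly computed for extractive question answering. It assumes lowercasing and whitespace tokenization only; official evaluation scripts typically also strip punctuation and articles, and the abstract does not say which variant the authors used. The function names and example strings are illustrative.

```python
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (bag-of-words overlap)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("Paris", "paris"))                       # 1.0
print(round(token_f1("in the city of Paris", "Paris"), 4))  # 0.3333
```

EM rewards only exact answers, while F_1 gives partial credit for overlapping tokens, which is why the two are usually reported together.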

pdf bib
Probing Simile Knowledge from Pre-trained Language Models
Weijie Chen | Yongzhu Chang | Rongsheng Zhang | Jiashu Pu | Guandan Chen | Le Zhang | Yadong Xi | Yijiang Chen | Chang Su

Simile interpretation (SI) and simile generation (SG) are challenging tasks for NLP because models require adequate world knowledge to produce predictions. Previous works have employed many hand-crafted resources to bring the relevant knowledge into models, which is time-consuming and labor-intensive. In recent years, approaches based on pre-trained language models (PLMs) have become the de facto standard in NLP since they learn generic knowledge from a large corpus. The knowledge embedded in PLMs may be useful for SI and SG tasks. Nevertheless, few works have explored it. In this paper, we probe simile knowledge from PLMs to solve the SI and SG tasks in the unified framework of simile triple completion for the first time. The backbone of our framework is to construct masked sentences with manual patterns and then predict the candidate words in the masked position. In this framework, we adopt a secondary training process (Adjective-Noun mask Training) with the masked language model (MLM) loss to enhance the prediction diversity of candidate words in the masked position. Moreover, pattern ensemble (PE) and pattern search (PS) are applied to improve the quality of predicted words. Finally, automatic and human evaluations demonstrate the effectiveness of our framework in both SI and SG tasks.

pdf bib
Entailment Graph Learning with Textual Entailment and Soft Transitivity
Zhibin Chen | Yansong Feng | Dongyan Zhao

Typed entailment graphs try to learn the entailment relations between predicates from text and model them as edges between predicate nodes. The construction of entailment graphs usually suffers from severe sparsity and unreliability of distributional similarity. We propose a two-stage method, Entailment Graph with Textual Entailment and Transitivity (EGT2). EGT2 learns the local entailment relations by recognizing the textual entailment between template sentences formed by typed CCG-parsed predicates. Based on the generated local graph, EGT2 then uses three novel soft transitivity constraints to consider the logical transitivity in entailment structures. Experiments on benchmark datasets show that EGT2 can well model the transitivity in entailment graph to alleviate the sparsity, and leads to significant improvement over current state-of-the-art methods.

pdf bib
Continual Pre-training of Language Models for Math Problem Understanding with Syntax-Aware Memory Network
Zheng Gong | Kun Zhou | Xin Zhao | Jing Sha | Shijin Wang | Ji-Rong Wen

In this paper, we study how to continually pre-train language models for improving the understanding of math problems. Specifically, we focus on solving a fundamental challenge in modeling math problems: how to fuse the semantics of textual description and formulas, which are highly different in essence. To address this issue, we propose a new approach called COMUS to continually pre-train language models for math problem understanding with syntax-aware memory network. In this approach, we first construct the math syntax graph to model the structural semantic information, by combining the parsing trees of the text and formulas, and then design the syntax-aware memory networks to deeply fuse the features from the graph and text. With the help of syntax relations, we can model the interaction between the token from the text and its semantic-related nodes within the formulas, which is helpful to capture fine-grained semantic correlations between texts and formulas. Besides, we devise three continual pre-training tasks to further align and fuse the representations of the text and math syntax graph. Experimental results on four tasks in the math domain demonstrate the effectiveness of our approach. Our code and data are publicly available.

pdf bib
Pre-training and Fine-tuning Neural Topic Model: A Simple yet Effective Approach to Incorporating External Knowledge
Linhai Zhang | Xuemeng Hu | Boyu Wang | Deyu Zhou | Qian-Wen Zhang | Yunbo Cao

Recent years have witnessed growing interest in incorporating external knowledge such as pre-trained word embeddings (PWEs) or pre-trained language models (PLMs) into neural topic modeling. However, we found that employing PWEs and PLMs for topic modeling achieved only limited performance improvements while incurring huge computational overhead. In this paper, we propose a novel strategy to incorporate external knowledge into neural topic modeling where the neural topic model is pre-trained on a large corpus and then fine-tuned on the target dataset. Experiments have been conducted on three datasets and results show that the proposed approach significantly outperforms both current state-of-the-art neural topic models and some topic modeling approaches enhanced with PWEs or PLMs. Moreover, further study shows that the proposed approach greatly reduces the need for the huge size of training data.

pdf bib
Just Rank: Rethinking Evaluation with Word and Sentence Similarities
Bin Wang | C.-c. Kuo | Haizhou Li

Word and sentence embeddings are useful feature representations in natural language processing. However, intrinsic evaluation for embeddings lags far behind, and there has been no significant update since the past decade. Word and sentence similarity tasks have become the de facto evaluation method. It leads models to overfit to such evaluations, negatively impacting embedding models’ development. This paper first points out the problems using semantic similarity as the gold standard for word and sentence embedding evaluations. Further, we propose a new intrinsic evaluation method called EvalRank, which shows a much stronger correlation with downstream tasks. Extensive experiments are conducted based on models and popular datasets to certify our judgments. Finally, the practical evaluation toolkit is released for future benchmarking purposes.

pdf bib
CLIP Models are Few-Shot Learners: Empirical Studies on VQA and Visual Entailment
Haoyu Song | Li Dong | Weinan Zhang | Ting Liu | Furu Wei

CLIP has shown a remarkable zero-shot capability on a wide range of vision tasks. Previously, CLIP was regarded only as a powerful visual encoder. However, after being pre-trained by language supervision from a large amount of image-caption pairs, CLIP itself should also have acquired some few-shot abilities for vision-language tasks. In this work, we empirically show that CLIP can be a strong vision-language few-shot learner by leveraging the power of language. We first evaluate CLIP’s zero-shot performance on a typical visual question answering task and demonstrate a zero-shot cross-modality transfer capability of CLIP on the visual entailment task. Then we propose a parameter-efficient fine-tuning strategy to boost the few-shot performance on the VQA task. We achieve competitive zero/few-shot results on the visual question answering and visual entailment tasks without introducing any additional pre-training procedure.

pdf bib
SalesBot: Transitioning from Chit-Chat to Task-Oriented Dialogues
Ssu Chiu | Maolin Li | Yen-Ting Lin | Yun-Nung Chen

Dialogue systems are usually categorized into two types, open-domain and task-oriented. The first one focuses on chatting with users and keeping them engaged in the conversations, where selecting a proper topic to fit the dialogue context is essential for a successful dialogue. The other one focuses on a specific task instead of casual talks, e.g., finding a movie on Friday night, playing a song. These two directions have been studied separately due to their different purposes. However, how to smoothly transition from social chatting to task-oriented dialogues is important for triggering business opportunities, and there is no public data focusing on such scenarios. Hence, this paper focuses on investigating the conversations starting from open-domain social chatting and then gradually transitioning to task-oriented purposes, and releases a large-scale dataset with detailed annotations for encouraging this research direction. To achieve this goal, this paper proposes a framework to automatically generate many dialogues without human involvement, in which any powerful open-domain dialogue generation model can be easily leveraged. The human evaluation shows that our generated dialogue data has a natural flow at a reasonable quality, showing that our released data has a great potential of guiding future research directions and commercial activities. Furthermore, the released models allow researchers to automatically generate unlimited dialogues in the target scenarios, which can greatly benefit semi-supervised and unsupervised approaches.

pdf bib
ReACC: A Retrieval-Augmented Code Completion Framework
Shuai Lu | Nan Duan | Hojae Han | Daya Guo | Seung-won Hwang | Alexey Svyatkovskiy

Code completion, which aims to predict the following code token(s) according to the code context, can improve the productivity of software development. Recent work has proved that statistical language modeling with transformers can greatly improve the performance in the code completion task via learning from large-scale source code datasets. However, current approaches focus only on code context within the file or project, i.e., internal context. Our distinction is utilizing “external” context, inspired by human behaviors of copying from the related code snippets when writing code. Specifically, we propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval. We adopt a stage-wise training approach that combines a source code retriever and an auto-regressive language model for programming language. We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.

pdf bib
Does Recommend-Revise Produce Reliable Annotations? An Analysis on Missing Instances in DocRED
Quzhe Huang | Shibo Hao | Yuan Ye | Shengqi Zhu | Yansong Feng | Dongyan Zhao

DocRED is a widely used dataset for document-level relation extraction. In the large-scale annotation, a recommend-revise scheme is adopted to reduce the workload. Within this scheme, annotators are provided with candidate relation instances from distant supervision, and they then manually supplement and remove relational facts based on the recommendations. However, when comparing DocRED with a subset relabeled from scratch, we find that this scheme results in a considerable amount of false negative samples and an obvious bias towards popular entities and relations. Furthermore, we observe that the models trained on DocRED have low recall on our relabeled dataset and inherit the same bias in the training data. Through the analysis of annotators’ behaviors, we figure out the underlying reason for the problems above: the scheme actually discourages annotators from supplementing adequate instances in the revision phase. We appeal to future research to take into consideration the issues with the recommend-revise scheme when designing new models and annotation schemes. The relabeled dataset is released at, to serve as a more reliable test set of document RE models.

pdf bib
An Empirical Study of Memorization in NLP
Xiaosen Zheng | Jing Jiang

A recent study by Feldman proposed a long-tail theory to explain the memorization behavior of deep learning models. However, memorization has not been empirically verified in the context of NLP, a gap addressed by this work. In this paper, we use three different NLP tasks to check if the long-tail theory holds. Our experiments demonstrate that top-ranked memorized training instances are likely atypical, and removing the top-memorized training instances leads to a more serious drop in test accuracy compared with removing training instances randomly. Furthermore, we develop an attribution method to better understand why a training instance is memorized. We empirically show that our memorization attribution method is faithful, and share our interesting finding that the top-memorized parts of a training instance tend to be features negatively correlated with the class label.

pdf bib
Guided Attention Multimodal Multitask Financial Forecasting with Inter-Company Relationships and Global and Local News
Gary Ang | Ee-Peng Lim

Most works on financial forecasting use information directly associated with individual companies (e.g., stock prices, news on the company) to predict stock returns for trading. We refer to such company-specific information as local information. Stock returns may also be influenced by global information (e.g., news on the economy in general), and inter-company relationships. Capturing such diverse information is challenging due to the low signal-to-noise ratios, different time-scales, sparsity and distributions of global and local information from different modalities. In this paper, we propose a model that captures both global and local multimodal information for investment and risk management-related forecasting tasks. Our proposed Guided Attention Multimodal Multitask Network (GAME) model addresses these challenges by using novel attention modules to guide learning with global and local information from different modalities and dynamic inter-company relationship networks. Our extensive experiments show that GAME outperforms other state-of-the-art models in several forecasting tasks and important real-world application case studies.

pdf bib
Universal Conditional Masked Language Pre-training for Neural Machine Translation
Pengfei Li | Liangyou Li | Meng Zhang | Minghao Wu | Qun Liu

Pre-trained sequence-to-sequence models have significantly improved Neural Machine Translation (NMT). Different from prior works where pre-trained models usually adopt a unidirectional decoder, this paper demonstrates that pre-training a sequence-to-sequence model but with a bidirectional decoder can produce notable performance gains for both Autoregressive and Non-autoregressive NMT. Specifically, we propose CeMAT, a conditional masked language model pre-trained on large-scale bilingual and monolingual corpora in many languages. We also introduce two simple but effective methods to enhance CeMAT: aligned code-switching & masking and dynamic dual-masking. We conduct extensive experiments and show that our CeMAT can achieve significant performance improvement for all scenarios from low- to extremely high-resource languages, i.e., up to +14.4 BLEU on low resource and +7.9 BLEU improvements on average for Autoregressive NMT. For Non-autoregressive NMT, we demonstrate it can also produce consistent performance gains, i.e., up to +5.3 BLEU. To the best of our knowledge, this is the first work to pre-train a unified model for fine-tuning on both NMT tasks. Code, data, and pre-trained models are available at

pdf bib
Achieving Reliable Human Assessment of Open-Domain Dialogue Systems
Tianbo Ji | Yvette Graham | Gareth Jones | Chenyang Lyu | Qun Liu

Evaluation of open-domain dialogue systems is highly challenging and development of better techniques is highlighted time and again as desperately needed. Despite substantial efforts to carry out reliable live evaluation of systems in recent competitions, annotations have been abandoned and reported as too unreliable to yield sensible results. This is a serious problem since automatic metrics are not known to provide a good indication of what may or may not be a high-quality conversation. Answering the distress call of competitions that have emphasized the urgent need for better evaluation techniques in dialogue, we present the successful development of human evaluation that is highly reliable while still remaining feasible and low cost. Self-replication experiments reveal almost perfectly repeatable results with a correlation of r=0.969. Furthermore, due to the lack of appropriate methods of statistical significance testing, the likelihood of potential improvements to systems occurring due to chance is rarely taken into account in dialogue evaluation, and the evaluation we propose facilitates application of standard tests. Since we have developed a highly reliable evaluation method, new insights into system performance can be revealed. We therefore include a comparison of state-of-the-art models (i) with and without personas, to measure the contribution of personas to conversation quality, as well as (ii) prescribed versus freely chosen topics. Interestingly with respect to personas, results indicate that personas do not positively contribute to conversation quality as expected.

pdf bib
ASPECTNEWS: Aspect-Oriented Summarization of News Documents
Ojas Ahuja | Jiacheng Xu | Akshay Gupta | Kevin Horecka | Greg Durrett

Generic summaries try to cover an entire document and query-based summaries try to answer document-specific questions. But real users’ needs often fall in between these extremes and correspond to aspects, high-level topics discussed among similar types of documents. In this paper, we collect a dataset of realistic aspect-oriented summaries, AspectNews, which covers different subtopics about articles in news sub-domains. We annotate data across two domains of articles, earthquakes and fraud investigations, where each article is annotated with two distinct summaries focusing on different aspects for each domain. A system producing a single generic summary cannot concisely satisfy both aspects. Our focus in evaluation is how well existing techniques can generalize to these domains without seeing in-domain training data, so we turn to techniques to construct synthetic training data that have been used in query-focused summarization work. We compare several training schemes that differ in how strongly keywords are used and how oracle summaries are extracted. Our evaluation shows that our final approach yields (a) focused summaries, better than those from a generic summarization system or from keyword matching; (b) a system sensitive to the choice of keywords.

pdf bib
MemSum: Extractive Summarization of Long Documents Using Multi-Step Episodic Markov Decision Processes
Nianlong Gu | Elliott Ash | Richard Hahnloser

We introduce MemSum (Multi-step Episodic Markov decision process extractive SUMmarizer), a reinforcement-learning-based extractive summarizer enriched at each step with information on the current extraction history. When MemSum iteratively selects sentences into the summary, it considers a broad information set that would intuitively also be used by humans in this task: 1) the text content of the sentence, 2) the global text context of the rest of the document, and 3) the extraction history consisting of the set of sentences that have already been extracted. With a lightweight architecture, MemSum obtains state-of-the-art test-set performance (ROUGE) in summarizing long documents taken from PubMed, arXiv, and GovReport. Ablation studies demonstrate the importance of local, global, and history information. A human evaluation confirms the high quality and low redundancy of the generated summaries, stemming from MemSum’s awareness of extraction history.

pdf bib
Accurate Online Posterior Alignments for Principled Lexically-Constrained Decoding
Soumya Chatterjee | Sunita Sarawagi | Preethi Jyothi

Online alignment in machine translation refers to the task of aligning a target word to a source word when the target sequence has only been partially decoded. Good online alignments facilitate important applications such as lexically constrained translation, where user-defined dictionaries are used to inject lexical constraints into the translation model. We propose a novel posterior alignment technique that is truly online in its execution and superior in terms of alignment error rates compared to existing methods. Our proposed inference technique jointly considers alignment and token probabilities in a principled manner and can be seamlessly integrated within existing constrained beam-search decoding algorithms. On five language pairs, including two distant language pairs, we achieve a consistent drop in alignment error rates. When deployed on seven lexically constrained translation tasks, we achieve significant improvements in BLEU, specifically around the constrained positions.

pdf bib
Letters From the Past: Modeling Historical Sound Change Through Diachronic Character Embeddings
Sidsel Boldsen | Patrizia Paggio

While a great deal of work has been done on NLP approaches to lexical semantic change detection, other aspects of language change have received less attention from the NLP community. In this paper, we address the detection of sound change through historical spelling. We propose that a sound change can be captured by comparing the relative distance through time between the distributions of the characters involved before and after the change has taken place. We model these distributions using PPMI character embeddings. We verify this hypothesis in synthetic data and then test the method's ability to trace the well-known historical change of lenition of plosives in Danish historical sources. We show that the models are able to identify several of the changes under consideration and to uncover meaningful contexts in which they appeared. The methodology has the potential to contribute to the study of open questions such as the relative chronology of sound shifts and their geographical distribution.
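As an illustration of the PPMI character embeddings mentioned in this abstract, the following is a minimal sketch and not the authors' implementation. The function names, the symmetric context window of one character, the cosine distance, and the toy word lists are all assumptions made for the example; the abstract does not specify them.

```python
import math
from collections import Counter


def ppmi_vectors(words, window=1):
    """Build sparse PPMI vectors {char: {context_char: value}}.

    Contexts are the characters within `window` positions of the target
    character inside the same word (an assumption for this sketch).
    """
    pair_counts = Counter()
    for word in words:
        for i, ch in enumerate(word):
            lo, hi = max(0, i - window), min(len(word), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(ch, word[j])] += 1

    total = sum(pair_counts.values())
    char_counts, ctx_counts = Counter(), Counter()
    for (ch, ctx), n in pair_counts.items():
        char_counts[ch] += n
        ctx_counts[ctx] += n

    vectors = {}
    for (ch, ctx), n in pair_counts.items():
        # PMI = log2( p(ch, ctx) / (p(ch) * p(ctx)) ); PPMI clips negatives to 0.
        pmi = math.log2(n * total / (char_counts[ch] * ctx_counts[ctx]))
        vectors.setdefault(ch, {})[ctx] = max(0.0, pmi)
    return vectors


def cosine_distance(u, v):
    """Cosine distance between two sparse vectors stored as dicts."""
    dot = sum(u.get(k, 0.0) * v.get(k, 0.0) for k in set(u) | set(v))
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Invented toy spellings for two periods; not real historical data.
    early = ["bake", "take", "lake", "gade"]
    late = ["bage", "tage", "lage", "gade"]
    for name, corpus in (("early", early), ("late", late)):
        vecs = ppmi_vectors(corpus)
        dist = cosine_distance(vecs.get("k", {}), vecs.get("g", {}))
        print(name, "distance(k, g) =", round(dist, 3))
```

Comparing the distance between two characters in an earlier and a later sample is one simple way to read "relative distance through time"; the paper may define the comparison differently.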

pdf bib
Reducing Position Bias in Simultaneous Machine Translation with Length-Aware Framework
Shaolei Zhang | Yang Feng

Simultaneous machine translation (SiMT) starts translating while receiving the streaming source inputs, and hence the source sentence is always incomplete during translating. Different from the full-sentence MT using the conventional seq-to-seq architecture, SiMT often applies prefix-to-prefix architecture, which forces each target word to only align with a partial source prefix to adapt to the incomplete source in streaming inputs. However, the source words in the front positions are always illusorily considered more important since they appear in more prefixes, resulting in position bias, which makes the model pay more attention to the front source positions in testing. In this paper, we first analyze the phenomenon of position bias in SiMT, and develop a Length-Aware Framework to reduce the position bias by bridging the structural gap between SiMT and full-sentence MT. Specifically, given the streaming inputs, we first predict the full-sentence length and then fill the future source position with positional encoding, thereby turning the streaming inputs into a pseudo full-sentence. The proposed framework can be integrated into most existing SiMT methods to further improve performance. Experiments on two representative SiMT methods, including the state-of-the-art adaptive policy, show that our method successfully reduces the position bias and thereby achieves better SiMT performance.

pdf bib
ParaDetox: Detoxification with Parallel Data
Varvara Logacheva | Daryna Dementieva | Sergey Ustyantsev | Daniil Moskovskiy | David Dale | Irina Krotova | Nikita Semenov | Alexander Panchenko

We present a novel pipeline for the collection of parallel data for the detoxification task. We collect non-toxic paraphrases for over 10,000 English toxic sentences. We also show that this pipeline can be used to distill a large existing corpus of paraphrases to get toxic-neutral sentence pairs. We release two parallel corpora which can be used for the training of detoxification models. To the best of our knowledge, these are the first parallel datasets for this task. We describe our pipeline in detail to make it fast to set up for a new language or domain, thus contributing to faster and easier development of new parallel resources. We train several detoxification models on the collected data and compare them with several baselines and state-of-the-art unsupervised approaches. We conduct both automatic and manual evaluations. All models trained on parallel data outperform the state-of-the-art unsupervised models by a large margin. This suggests that our novel datasets can boost the performance of detoxification systems.

pdf bib
Language-Agnostic Meta-Learning for Low-Resource Text-to-Speech with Articulatory Features
Florian Lux | Thang Vu

While neural text-to-speech systems perform remarkably well in high-resource scenarios, they cannot be applied to the majority of the over 6,000 spoken languages in the world due to a lack of appropriate training data. In this work, we use embeddings derived from articulatory vectors rather than embeddings derived from phoneme identities to learn phoneme representations that hold across languages. In conjunction with language-agnostic meta-learning, this enables us to fine-tune a high-quality text-to-speech model on just 30 minutes of data in a previously unseen language spoken by a previously unseen speaker.

pdf bib
What Makes Reading Comprehension Questions Difficult?
Saku Sugawara | Nikita Nangia | Alex Warstadt | Samuel Bowman

For a natural language understanding benchmark to be useful in research, it has to consist of examples that are diverse and difficult enough to discriminate among current and near-future state-of-the-art systems. However, we do not yet know how best to select text sources to collect a variety of challenging examples. In this study, we crowdsource multiple-choice reading comprehension questions for passages taken from seven qualitatively distinct sources, analyzing what attributes of passages contribute to the difficulty and question types of the collected examples. To our surprise, we find that passage source, length, and readability measures do not significantly affect question difficulty. Through our manual annotation of seven reasoning types, we observe several trends between passage sources and reasoning types, e.g., logical reasoning is more often required in questions written for technical passages. These results suggest that when creating a new benchmark dataset, selecting a diverse set of passages can help ensure a diverse range of question types, but that passage difficulty need not be a priority.

pdf bib
Challenges and Strategies in Cross-Cultural NLP
Daniel Hershcovich | Stella Frank | Heather Lent | Miryam de Lhoneux | Mostafa Abdou | Stephanie Brandl | Emanuele Bugliarello | Laura Cabello Piqueras | Ilias Chalkidis | Ruixiang Cui | Constanza Fierro | Katerina Margatina | Phillip Rust | Anders Søgaard

Various efforts in the Natural Language Processing (NLP) community have been made to accommodate linguistic diversity and serve speakers of many different languages. However, it is important to acknowledge that speakers, and the content they produce and require, vary not just by language but also by culture. Although language and culture are tightly linked, there are important differences. Analogous to cross-lingual and multilingual NLP, cross-cultural and multicultural NLP considers these differences in order to better serve users of NLP systems. We propose a principled framework to frame these efforts, and survey existing and potential strategies.

pdf bib
Prototypical Verbalizer for Prompt-based Few-shot Tuning
Ganqu Cui | Shengding Hu | Ning Ding | Longtao Huang | Zhiyuan Liu

Prompt-based tuning for pre-trained language models (PLMs) has shown its effectiveness in few-shot learning. Typically, prompt-based tuning wraps the input text into a cloze question. To make predictions, the model maps the output words to labels via a verbalizer, which is either manually designed or automatically built. However, manual verbalizers heavily depend on domain-specific prior knowledge and human efforts, while finding appropriate label words automatically still remains challenging. In this work, we propose the prototypical verbalizer (ProtoVerb), which is built directly from training data. Specifically, ProtoVerb learns prototype vectors as verbalizers by contrastive learning. In this way, the prototypes summarize training instances and are able to enclose rich class-level semantics. We conduct experiments on both topic classification and entity typing tasks, and the results demonstrate that ProtoVerb significantly outperforms current automatic verbalizers, especially when training data is extremely scarce. More surprisingly, ProtoVerb consistently boosts prompt-based tuning even on untuned PLMs, indicating an elegant non-tuning way to utilize PLMs. Our code is available at

pdf bib
Clickbait Spoiling via Question Answering and Passage Retrieval
Matthias Hagen | Maik Fröbe | Artur Jurk | Martin Potthast

We introduce and study the task of clickbait spoiling: generating a short text that satisfies the curiosity induced by a clickbait post. Clickbait links to a web page and advertises its contents by arousing curiosity instead of providing an informative summary. Our contributions are approaches to classify the type of spoiler needed (i.e., a phrase or a passage), and to generate appropriate spoilers. A large-scale evaluation and error analysis on a new corpus of 5,000 manually spoiled clickbait posts (the Webis Clickbait Spoiling Corpus 2022) shows that our spoiler type classifier achieves an accuracy of 80%, while the question answering model DeBERTa-large outperforms all others in generating spoilers for both types.

pdf bib
Incorporating Hierarchy into Text Encoder: a Contrastive Learning Approach for Hierarchical Text Classification
Zihan Wang | Peiyi Wang | Lianzhe Huang | Xin Sun | Houfeng Wang

Hierarchical text classification is a challenging subtask of multi-label classification due to its complex label hierarchy. Existing methods encode text and label hierarchy separately and mix their representations for classification, where the hierarchy remains unchanged for all input text. Instead of modeling them separately, in this work, we propose Hierarchy-guided Contrastive Learning (HGCLR) to directly embed the hierarchy into a text encoder. During training, HGCLR constructs positive samples for input text under the guidance of the label hierarchy. By pulling together the input text and its positive sample, the text encoder can learn to generate the hierarchy-aware text representation independently. Therefore, after training, the HGCLR-enhanced text encoder can dispense with the redundant hierarchy. Extensive experiments on three benchmark datasets verify the effectiveness of HGCLR.

pdf bib
Hyperlink-induced Pre-training for Passage Retrieval in Open-domain Question Answering
Jiawei Zhou | Xiaoguang Li | Lifeng Shang | Lan Luo | Ke Zhan | Enrui Hu | Xinyu Zhang | Hao Jiang | Zhao Cao | Fan Yu | Xin Jiang | Qun Liu | Lei Chen

To alleviate the data scarcity problem in training question answering systems, recent works propose additional intermediate pre-training for dense passage retrieval (DPR). However, there still remains a large discrepancy between the provided upstream signals and the downstream question-passage relevance, which leads to less improvement. To bridge this gap, we propose the HyperLink-induced Pre-training (HLP), a method to pre-train the dense retriever with the text relevance induced by hyperlink-based topology within Web documents. We demonstrate that the hyperlink-based structures of dual-link and co-mention can provide effective relevance signals for large-scale pre-training that better facilitate downstream passage retrieval. We investigate the effectiveness of our approach across a wide range of open-domain QA datasets under zero-shot, few-shot, multi-hop, and out-of-domain scenarios. The experiments show our HLP outperforms the BM25 by up to 7 points as well as other pre-training methods by more than 10 points in terms of top-20 retrieval accuracy under the zero-shot scenario. Furthermore, HLP significantly outperforms other pre-training methods under the other scenarios.
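The top-20 retrieval accuracy reported in this abstract can be illustrated with a short sketch. This is an assumed, simplified formulation: a question counts as a hit when any of its top-k retrieved passage IDs appears in a set of gold passage IDs. The function name and the ID-based matching are hypothetical; the paper's evaluation may instead check whether a retrieved passage contains the answer string.

```python
def top_k_accuracy(ranked_passages, gold_passages, k=20):
    """Fraction of questions with at least one gold passage in the top k.

    ranked_passages: list of lists of passage IDs, best first, one per question.
    gold_passages:   list of sets of relevant passage IDs, one per question.
    """
    if len(ranked_passages) != len(gold_passages):
        raise ValueError("need one ranking and one gold set per question")
    if not ranked_passages:
        raise ValueError("no questions to evaluate")

    hits = 0
    for ranked, gold in zip(ranked_passages, gold_passages):
        # Only the first k retrieved passages are considered.
        if any(passage_id in gold for passage_id in ranked[:k]):
            hits += 1
    return hits / len(ranked_passages)


if __name__ == "__main__":
    # Invented IDs for illustration only.
    rankings = [["p1", "p2", "p3"], ["p7", "p8", "p9"]]
    gold = [{"p2"}, {"p4"}]
    print(top_k_accuracy(rankings, gold, k=2))
```

Under this definition, a gap of "7 points" between two retrievers means a difference of 0.07 in the returned fraction.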

pdf bib
CAMERO: Consistency Regularized Ensemble of Perturbed Language Models with Weight Sharing
Chen Liang | Pengcheng He | Yelong Shen | Weizhu Chen | Tuo Zhao

Model ensemble is a popular approach to produce a low-variance and well-generalized model. However, it induces large memory and inference costs, which are often not affordable for real-world deployment. Existing work has resorted to sharing weights among models. However, when increasing the proportion of the shared weights, the resulting models tend to be similar, and the benefits of using model ensemble diminish. To retain ensemble benefits while maintaining a low memory cost, we propose a consistency-regularized ensemble learning approach based on perturbed models, named CAMERO. Specifically, we share the weights of bottom layers across all models and apply different perturbations to the hidden representations for different models, which can effectively promote the model diversity. Meanwhile, we apply a prediction consistency regularizer across the perturbed models to control the variance due to the model diversity. Our experiments using large language models demonstrate that CAMERO significantly improves the generalization performance of the ensemble model. Specifically, CAMERO outperforms the standard ensemble of BERT-base models on the GLUE benchmark by 0.7 with a significantly smaller model size (114.2M vs. 880.6M).

pdf bib
Interpretability for Language Learners Using Example-Based Grammatical Error Correction
Masahiro Kaneko | Sho Takase | Ayana Niwa | Naoaki Okazaki

Grammatical Error Correction (GEC) should not focus only on high accuracy of corrections but also on interpretability for language learning. However, existing neural-based GEC models mainly aim at improving accuracy, and their interpretability has not been explored. A promising approach for improving interpretability is an example-based method, which uses similar retrieved examples to generate corrections. In addition, examples are beneficial in language learning, helping learners understand the basis of grammatically incorrect/correct texts and improve their confidence in writing. Therefore, we hypothesize that incorporating an example-based method into GEC can improve interpretability as well as support language learners. In this study, we introduce an Example-Based GEC (EB-GEC) that presents examples to language learners as a basis for a correction result. The examples consist of pairs of correct and incorrect sentences similar to a given input and its predicted correction. Experiments demonstrate that the examples presented by EB-GEC help language learners decide to accept or refuse suggestions from the GEC output. Furthermore, the experiments also show that retrieved examples improve the accuracy of corrections.

pdf bib
UniXcoder: Unified Cross-Modal Pre-training for Code Representation
Daya Guo | Shuai Lu | Nan Duan | Yanlin Wang | Ming Zhou | Jian Yin

Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models. However, such an encoder-decoder framework is sub-optimal for auto-regressive tasks, especially code completion, which requires a decoder-only manner for efficient inference. In this paper, we present UniXcoder, a unified cross-modal pre-trained model for programming languages. The model utilizes mask attention matrices with prefix adapters to control the behavior of the model and leverages cross-modal contents like AST and code comments to enhance code representation. To encode an AST, which is represented as a tree, in parallel, we propose a one-to-one mapping method to transform the AST into a sequence structure that retains all structural information from the tree. Furthermore, we propose to utilize multi-modal contents to learn representations of code fragments with contrastive learning, and then align representations among programming languages using a cross-modal generation task. We evaluate UniXcoder on five code-related tasks over nine datasets. To further evaluate the performance of code fragment representation, we also construct a dataset for a new task, called zero-shot code-to-code search. Results show that our model achieves state-of-the-art performance on most tasks, and analysis reveals that comments and AST can both enhance UniXcoder.

pdf bib
mLUKE: The Power of Entity Representations in Multilingual Pretrained Language Models
Ryokan Ri | Ikuya Yamada | Yoshimasa Tsuruoka

Recent studies have shown that multilingual pretrained language models can be effectively improved with cross-lingual alignment information from Wikipedia entities. However, existing methods only exploit entity information in pretraining and do not explicitly use entities in downstream tasks. In this study, we explore the effectiveness of leveraging entity representations for downstream cross-lingual tasks. We train a multilingual language model with 24 languages with entity representations and show that the model consistently outperforms word-based pretrained models in various cross-lingual transfer tasks. We also analyze the model, and the key insight is that incorporating entity representations into the input allows us to extract more language-agnostic features. We also evaluate the model with a multilingual cloze prompt task with the mLAMA dataset. We show that entity-based prompts are more likely to elicit correct factual knowledge than prompts using only word representations.

pdf bib
ABC: Attention with Bounded-memory Control
Hao Peng | Jungo Kasai | Nikolaos Pappas | Dani Yogatama | Zhaofeng Wu | Lingpeng Kong | Roy Schwartz | Noah Smith

Transformer architectures have achieved state-of-the-art results on a variety of natural language processing (NLP) tasks. However, their attention mechanism comes with a quadratic complexity in sequence lengths, making the computational overhead prohibitive, especially for long sequences. Attention context can be seen as a random-access memory with each token taking a slot. Under this perspective, the memory size grows linearly with the sequence length, and so does the overhead of reading from it. One way to improve the efficiency is to bound the memory size. We show that disparate approaches can be subsumed into one abstraction, attention with bounded-memory control (ABC), and they vary in their organization of the memory. ABC reveals new, unexplored possibilities. First, it connects several efficient attention variants that would otherwise seem apart. Second, this abstraction gives new insights: an established approach (Wang et al., 2020b), previously thought not to be applicable in causal attention, actually is. Last, we present a new instance of ABC, which draws inspiration from existing ABC approaches, but replaces their heuristic memory-organizing functions with a learned, contextualized one. Our experiments on language modeling, machine translation, and masked language model finetuning show that our approach outperforms previous efficient attention models; compared to the strong transformer baselines, it significantly improves the inference time and space efficiency with no or negligible accuracy loss.

pdf bib
Adapting Coreference Resolution Models through Active Learning
Michelle Yuan | Patrick Xia | Chandler May | Benjamin Van Durme | Jordan Boyd-Graber

Neural coreference resolution models trained on one dataset may not transfer to new, low-resource domains. Active learning mitigates this problem by sampling a small subset of data for annotators to label. While active learning is well-defined for classification tasks, its application to coreference resolution is neither well-defined nor fully understood. This paper explores how to actively label coreference, examining sources of model uncertainty and document reading costs. We compare uncertainty sampling strategies and their advantages through thorough error analysis. In both synthetic and human experiments, labeling spans within the same document is more effective than annotating spans across documents. The findings contribute to a more realistic development of coreference resolution models.

pdf bib
Overcoming a Theoretical Limitation of Self-Attention
David Chiang | Peter Cholak

Although transformers are remarkably effective for many tasks, there are some surprisingly easy-looking regular languages that they struggle with. Hahn shows that for languages where acceptance depends on a single input symbol, a transformer’s classification decisions get closer and closer to random guessing (that is, a cross-entropy of 1) as input strings get longer and longer. We examine this limitation using two languages: PARITY, the language of bit strings with an odd number of 1s, and FIRST, the language of bit strings starting with a 1. We demonstrate three ways of overcoming the limitation implied by Hahn’s lemma. First, we settle an open question by constructing a transformer that recognizes PARITY with perfect accuracy, and similarly for FIRST. Second, we use layer normalization to bring the cross-entropy of both models arbitrarily close to zero. Third, when transformers need to focus on a single position, as for FIRST, we find that they can fail to generalize to longer strings; we offer a simple remedy to this problem that also improves length generalization in machine translation.
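The two languages named in this abstract are easy to state as membership tests. The sketch below follows the definitions given in the text (PARITY: an odd number of 1s; FIRST: the string starts with a 1). The function names are hypothetical, and the helper `single_symbol_share` is only an illustration of the intuition behind the limitation, not the authors' construction: when a layer averages uniformly over n positions, any one symbol contributes a share of 1/n, which shrinks as strings grow.

```python
def in_parity(bits: str) -> bool:
    """PARITY: bit strings containing an odd number of 1s."""
    return bits.count("1") % 2 == 1


def in_first(bits: str) -> bool:
    """FIRST: bit strings whose first symbol is 1."""
    return bits.startswith("1")


def flip(bits: str, position: int) -> str:
    """Return a copy of `bits` with the symbol at `position` flipped."""
    flipped = "1" if bits[position] == "0" else "0"
    return bits[:position] + flipped + bits[position + 1:]


def single_symbol_share(length: int) -> float:
    """Share of one position in a uniform average over `length` positions."""
    return 1.0 / length


if __name__ == "__main__":
    s = "0110100"
    # Flipping any single bit changes PARITY membership.
    print(in_parity(s), in_parity(flip(s, 3)))
    # Flipping only the first bit changes FIRST membership.
    print(in_first(s), in_first(flip(s, 0)))
    for n in (10, 100, 1000):
        print(n, single_symbol_share(n))
```

In both languages a single flipped symbol changes acceptance, which is the property the abstract attributes to Hahn's lemma.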

pdf bib
Prediction Difference Regularization against Perturbation for Neural Machine Translation
Dengji Guo | Zhengrui Ma | Min Zhang | Yang Feng

Regularization methods applying input perturbation have drawn considerable attention and have been frequently explored for NMT tasks in recent years. Despite their simplicity and effectiveness, we argue that these methods are limited by the under-fitting of training data. In this paper, we utilize prediction difference for ground-truth tokens to analyze the fitting of token-level samples and find that under-fitting is almost as common as over-fitting. We introduce prediction difference regularization (PD-R), a simple and effective method that can reduce over-fitting and under-fitting at the same time. For all token-level samples, PD-R minimizes the prediction difference between the original pass and the input-perturbed pass, making the model less sensitive to small input changes, thus more robust to both perturbations and under-fitted training data. Experiments on three widely used WMT translation tasks show that our approach can significantly improve over existing perturbation regularization methods. On the WMT16 En-De task, our model achieves a 1.80 SacreBLEU improvement over the vanilla transformer.

pdf bib
Make the Best of Cross-lingual Transfer: Evidence from POS Tagging with over 100 Languages
Wietse de Vries | Martijn Wieling | Malvina Nissim

Cross-lingual transfer learning with large multilingual pre-trained models can be an effective approach for low-resource languages with no labeled training data. Existing evaluations of zero-shot cross-lingual generalisability of large pre-trained models use datasets with English training data, and test data in a selection of target languages. We explore a more extensive transfer learning setup with 65 different source languages and 105 target languages for part-of-speech tagging. Through our analysis, we show that pre-training of both source and target language, as well as matching language families, writing systems, word order systems, and lexical-phonetic distance significantly impact cross-lingual performance. The findings described in this paper can be used as indicators of which factors are important for effective zero-shot cross-lingual transfer to zero- and low-resource languages.

pdf bib
How Do Seq2Seq Models Perform on End-to-End Data-to-Text Generation?
Xunjian Yin | Xiaojun Wan

With the rapid development of deep learning, the Seq2Seq paradigm has become prevalent for end-to-end data-to-text generation, and the BLEU scores have been increasing in recent years. However, it is widely recognized that there is still a gap between the quality of the texts generated by models and the texts written by humans. In order to better understand the ability of Seq2Seq models, evaluate their performance and analyze the results, we choose to use Multidimensional Quality Metric (MQM) to evaluate several representative Seq2Seq models on end-to-end data-to-text generation. We annotate the outputs of five models on four datasets with eight error types and find that 1) the copy mechanism is helpful for the improvement in Omission and Inaccuracy Extrinsic errors but it increases other types of errors such as Addition; 2) pre-training techniques are highly effective, and pre-training strategy and model size are very significant; 3) the structure of the dataset also influences the model’s performance greatly; 4) some specific types of errors are generally challenging for Seq2Seq models.

pdf bib
LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding
Jiapeng Wang | Lianwen Jin | Kai Ding

Structured document understanding has attracted considerable attention and made significant progress recently, owing to its crucial role in intelligent document processing. However, most existing related models can only deal with the document data of specific language(s) (typically English) included in the pre-training collection, which is extremely limited. To address this issue, we propose a simple yet effective Language-independent Layout Transformer (LiLT) for structured document understanding. LiLT can be pre-trained on the structured documents of a single language and then directly fine-tuned on other languages with the corresponding off-the-shelf monolingual/multilingual pre-trained textual models. Experimental results on eight languages have shown that LiLT can achieve competitive or even superior performance on diverse widely-used downstream benchmarks, which enables language-independent benefit from the pre-training of document layout structure. Code and model are publicly available at

pdf bib
Can Unsupervised Knowledge Transfer from Social Discussions Help Argument Mining?
Subhabrata Dutta | Jeevesh Juneja | Dipankar Das | Tanmoy Chakraborty

Identifying argument components from unstructured texts and predicting the relationships expressed among them are two primary steps of argument mining. The intrinsic complexity of these tasks demands powerful learning models. While pretrained Transformer-based Language Models (LMs) have been shown to provide state-of-the-art results over different NLP tasks, the scarcity of manually annotated data and the highly domain-dependent nature of argumentation restrict the capabilities of such models. In this work, we propose a novel transfer learning strategy to overcome these challenges. We utilize argumentation-rich social discussions from the ChangeMyView subreddit as a source of unsupervised, argumentative discourse-aware knowledge by finetuning pretrained LMs on a selectively masked language modeling task. Furthermore, we introduce a novel prompt-based strategy for inter-component relation prediction that complements our proposed finetuning method while leveraging the discourse context. Exhaustive experiments show the generalization capability of our method on these two tasks over within-domain as well as out-of-domain datasets, outperforming several existing and employed strong baselines.

pdf bib
Entity-based Neural Local Coherence Modeling
Sungho Jeon | Michael Strube

In this paper, we propose an entity-based neural local coherence model which is linguistically more sound than previously proposed neural coherence models. Recent neural coherence models encode the input document using large-scale pretrained language models. Hence their basis for computing local coherence are words and even sub-words. The analysis of their output shows that these models frequently compute coherence on the basis of connections between (sub-)words which, from a linguistic perspective, should not play a role. Still, these models achieve state-of-the-art performance in several end applications. In contrast to these models, we compute coherence on the basis of entities by constraining the input to noun phrases and proper names. This provides us with an explicit representation of the most important items in sentences leading to the notion of focus. This brings our model linguistically in line with pre-neural models of computing coherence. It also gives us better insight into the behaviour of the model thus leading to better explainability. Our approach is also in accord with a recent study (O’Connor and Andreas, 2021), which shows that most usable information is captured by nouns and verbs in transformer-based language models. We evaluate our model on three downstream tasks showing that it is not only linguistically more sound than previous models but also that it outperforms them in end applications.

pdf bib
LinkBERT: Pretraining Language Models with Document Links
Michihiro Yasunaga | Jure Leskovec | Percy Liang

Language model (LM) pretraining captures various knowledge from text corpora, helping downstream tasks. However, existing methods such as BERT model a single document, and do not capture dependencies or knowledge that span across documents. In this work, we propose LinkBERT, an LM pretraining method that leverages links between documents, e.g., hyperlinks. Given a text corpus, we view it as a graph of documents and create LM inputs by placing linked documents in the same context. We then pretrain the LM with two joint self-supervised objectives: masked language modeling and our new proposal, document relation prediction. We show that LinkBERT outperforms BERT on various downstream tasks across two domains: the general domain (pretrained on Wikipedia with hyperlinks) and biomedical domain (pretrained on PubMed with citation links). LinkBERT is especially effective for multi-hop reasoning and few-shot QA (+5 absolute improvement on HotpotQA and TriviaQA), and our biomedical LinkBERT sets new states of the art on various BioNLP tasks (+7 on BioASQ and USMLE). We release our pretrained models, LinkBERT and BioLinkBERT, as well as code and data.

pdf bib
Situated Dialogue Learning through Procedural Environment Generation
Prithviraj Ammanabrolu | Renee Jia | Mark Riedl

We teach goal-driven agents to interactively act and speak in situated environments by training on generated curriculums. Our agents operate in LIGHT (Urbanek et al., 2019), a large-scale crowd-sourced fantasy text adventure game wherein an agent perceives and interacts with the world through textual natural language. Goals in this environment take the form of character-based quests, consisting of personas and motivations. We augment LIGHT by learning to procedurally generate additional novel textual worlds and quests to create a curriculum of steadily increasing difficulty for training agents to achieve such goals. In particular, we measure curriculum difficulty in terms of the rarity of the quest in the original training distribution: an easier environment is one that is more likely to have been found in the unaugmented dataset. An ablation study shows that this method of learning from the tail of a distribution results in significantly higher generalization abilities as measured by zero-shot performance on never-before-seen quests.

pdf bib
Program Transfer for Answering Complex Questions over Knowledge Bases
Shulin Cao | Jiaxin Shi | Zijun Yao | Xin Lv | Jifan Yu | Lei Hou | Juanzi Li | Zhiyuan Liu | Jinghui Xiao

Program induction for answering complex questions over knowledge bases (KBs) aims to decompose a question into a multi-step program, whose execution against the KB produces the final answer. Learning to induce programs relies on a large number of parallel question-program pairs for the given KB. However, for most KBs, the gold program annotations are usually lacking, making learning difficult. In this paper, we propose the approach of program transfer, which aims to leverage the valuable program annotations on the rich-resourced KBs as external supervision signals to aid program induction for the low-resourced KBs that lack program annotations. For program transfer, we design a novel two-stage parsing framework with an efficient ontology-guided pruning strategy. First, a sketch parser translates the question into a high-level program sketch, which is the composition of functions. Second, given the question and sketch, an argument parser searches the detailed arguments from the KB for functions. During the searching, we incorporate the KB ontology to prune the search space. The experiments on ComplexWebQuestions and WebQuestionSP show that our method outperforms SOTA methods significantly, demonstrating the effectiveness of program transfer and our framework. Our codes and datasets can be obtained from

pdf bib
PPT: Pre-trained Prompt Tuning for Few-shot Learning
Yuxian Gu | Xu Han | Zhiyuan Liu | Minlie Huang

Prompts for pre-trained language models (PLMs) have shown remarkable performance by bridging the gap between pre-training tasks and various downstream tasks. Among these methods, prompt tuning, which freezes PLMs and only tunes soft prompts, provides an efficient and effective solution for adapting large-scale PLMs to downstream tasks. However, prompt tuning is yet to be fully explored. In our pilot experiments, we find that prompt tuning performs comparably with conventional full-model tuning when downstream data are sufficient, whereas it is much worse under few-shot learning settings, which may hinder the application of prompt tuning. We attribute this low performance to the manner of initializing soft prompts. Therefore, in this work, we propose to pre-train prompts by adding soft prompts into the pre-training stage to obtain a better initialization. We name this Pre-trained Prompt Tuning framework “PPT”. To ensure the generalization of PPT, we formulate similar classification tasks into a unified task form and pre-train soft prompts for this unified task. Extensive experiments show that tuning pre-trained prompts for downstream tasks can reach or even outperform full-model fine-tuning under both full-data and few-shot settings. Our approach is effective and efficient for using large-scale PLMs in practice.

pdf bib
Deduplicating Training Data Makes Language Models Better
Katherine Lee | Daphne Ippolito | Andrew Nystrom | Chiyuan Zhang | Douglas Eck | Chris Callison-Burch | Nicholas Carlini

We find that existing language modeling datasets contain many near-duplicate examples and long repetitive substrings. As a result, over of the unprompted output of language models trained on these datasets is copied verbatim from the training data. We develop two tools that allow us to deduplicate training datasets, for example removing from C4 a single word English sentence that is repeated over 60,000 times. Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer training steps to achieve the same or better accuracy. We can also reduce train-test overlap, which affects over of the validation set of standard datasets, thus allowing for more accurate evaluation. Code for deduplication is released at
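The abstract above does not describe how its two deduplication tools work. As an illustration only, the following minimal sketch detects near-duplicate documents by Jaccard similarity over word 3-gram shingles. The function names, the shingle size, and the 0.8 threshold are assumptions made for this example and are not taken from the paper or its released code.

```python
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def shingles(text: str, n: int = 3) -> Set[Shingle]:
    """Return the set of word n-grams in `text` (lowercased, whitespace-split)."""
    words = text.lower().split()
    if len(words) < n:
        # Short texts yield a single shingle made of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(docs: List[str], threshold: float = 0.8) -> List[str]:
    """Keep a document only if it is not a near-duplicate of one already kept.

    The comparison is quadratic in the number of documents, so this sketch
    is only suitable for small inputs.
    """
    kept: List[str] = []
    kept_shingles: List[Set[Shingle]] = []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept
```

At corpus scale, the pairwise comparison would normally be replaced by an approximate scheme such as MinHash with locality-sensitive hashing; that substitution is a general practice and is not something the abstract itself states.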

pdf bib
Internet-Augmented Dialogue Generation
Mojtaba Komeili | Kurt Shuster | Jason Weston

The largest store of continually updating knowledge on our planet can be accessed via internet search. In this work we study giving access to this information to conversational agents. Large language models, even though they store an impressive amount of knowledge within their weights, are known to hallucinate facts when generating dialogue (Shuster et al., 2021); moreover, those facts are frozen in time at the point of model training. In contrast, we propose an approach that learns to generate an internet search query based on the context, and then conditions on the search results to finally generate a response, a method that can employ up-to-the-minute relevant information. We train and evaluate such models on a newly collected dataset of human-human conversations whereby one of the speakers is given access to internet search during knowledge-driven discussions in order to ground their responses. We find that search-query based access of the internet in conversation provides superior performance compared to existing approaches that either use no augmentation or FAISS-based retrieval (Lewis et al., 2020b).

pdf bib
Knowledge Neurons in Pretrained Transformers
Damai Dai | Li Dong | Yaru Hao | Zhifang Sui | Baobao Chang | Furu Wei

Large-scale pretrained language models are surprisingly good at recalling factual knowledge presented in the training corpus. In this paper, we present preliminary studies on how factual knowledge is stored in pretrained Transformers by introducing the concept of knowledge neurons. Specifically, we examine the fill-in-the-blank cloze task for BERT. Given a relational fact, we propose a knowledge attribution method to identify the neurons that express the fact. We find that the activation of such knowledge neurons is positively correlated to the expression of their corresponding facts. In our case studies, we attempt to leverage knowledge neurons to edit (such as update and erase) specific factual knowledge without fine-tuning. Our results shed light on understanding the storage of knowledge within pretrained Transformers.

pdf bib
Few-Shot Learning with Siamese Networks and Label Tuning
Thomas Müller | Guillermo Pérez-Torró | Marc Franco-Salvador

We study the problem of building text classifiers with little or no training data, commonly known as zero and few-shot text classification. In recent years, an approach based on neural textual entailment models has been found to give strong results on a diverse range of tasks. In this work, we show that with proper pre-training, Siamese Networks that embed texts and labels offer a competitive alternative. These models allow for a large reduction in inference cost: constant in the number of labels rather than linear. Furthermore, we introduce label tuning, a simple and computationally efficient approach that allows to adapt the models in a few-shot setup by only changing the label embeddings. While giving lower performance than model fine-tuning, this approach has the architectural advantage that a single encoder can be shared by many different tasks.
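The idea in the abstract above can be sketched as follows: a text is classified by scoring its embedding against one embedding per label, and label tuning updates only those label embeddings. This is a minimal sketch under stated assumptions: the dot-product score, the softmax cross-entropy loss, the learning rate, and all function names are chosen for illustration, and the text embeddings are assumed to come from a frozen encoder that is not shown.

```python
import math
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: List[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(text_emb: Vector, label_embs: List[Vector]) -> int:
    """Return the index of the label whose embedding scores highest."""
    scores = [dot(text_emb, label) for label in label_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def label_tuning_step(
    text_embs: List[Vector],
    gold: List[int],
    label_embs: List[Vector],
    lr: float = 0.5,
) -> List[Vector]:
    """One gradient step of softmax cross-entropy on the label embeddings.

    The text embeddings are treated as fixed encoder outputs; only the
    label embeddings change, which is the assumption behind "label tuning".
    """
    grads = [[0.0] * len(label) for label in label_embs]
    for x, y in zip(text_embs, gold):
        probs = softmax([dot(x, label) for label in label_embs])
        for k, p in enumerate(probs):
            coeff = p - (1.0 if k == y else 0.0)  # d(loss)/d(score_k)
            for d, xd in enumerate(x):
                grads[k][d] += coeff * xd
    n = len(text_embs)
    return [
        [w - lr * g / n for w, g in zip(label, grad)]
        for label, grad in zip(label_embs, grads)
    ]
```

Because the encoder runs once per text and the labels contribute only a dot product each, the expensive encoding cost does not grow with the number of labels, which matches the inference-cost argument made in the abstract.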

pdf bib
Generating Biographies on Wikipedia: The Impact of Gender Bias on the Retrieval-Based Generation of Women Biographies
Angela Fan | Claire Gardent

Generating factual, long-form text such as Wikipedia articles raises three key challenges: how to gather relevant evidence, how to structure information into well-formed text, and how to ensure that the generated text is factually correct. We address these by developing a model for English text that uses a retrieval mechanism to identify relevant supporting information on the web and a cache-based pre-trained encoder-decoder to generate long-form biographies section by section, including citation information. To assess the impact of available web evidence on the output text, we compare the performance of our approach when generating biographies about women (for which less information is available on the web) vs. biographies generally. To this end, we curate a dataset of 1,500 biographies about women. We analyze our generated text to understand how differences in available web evidence data affect generation. We evaluate the factuality, fluency, and quality of the generated texts using automatic metrics and human evaluation. We hope that these techniques can be used as a starting point for human writers, to aid in reducing the complexity inherent in the creation of long-form, factual text.

pdf bib
Uncertainty Determines the Adequacy of the Mode and the Tractability of Decoding in Sequence-to-Sequence Models
Felix Stahlberg | Ilia Kulikov | Shankar Kumar

In many natural language processing (NLP) tasks the same input (e.g. source sentence) can have multiple possible outputs (e.g. translations). To analyze how this ambiguity (also known as intrinsic uncertainty) shapes the distribution learned by neural sequence models we measure sentence-level uncertainty by computing the degree of overlap between references in multi-reference test sets from two different NLP tasks: machine translation (MT) and grammatical error correction (GEC). At both the sentence- and the task-level, intrinsic uncertainty has major implications for various aspects of search such as the inductive biases in beam search and the complexity of exact search. In particular, we show that well-known pathologies such as a high number of beam search errors, the inadequacy of the mode, and the drop in system performance with large beam sizes apply to tasks with high level of ambiguity such as MT but not to less uncertain tasks such as GEC. Furthermore, we propose a novel exact n-best search algorithm for neural sequence models, and show that intrinsic uncertainty affects model uncertainty as the model tends to overly spread out the probability mass for uncertain tasks and sentences.

pdf bib
FlipDA: Effective and Robust Data Augmentation for Few-Shot Learning
Jing Zhou | Yanan Zheng | Jie Tang | Li Jian | Zhilin Yang

Most previous methods for text data augmentation are limited to simple tasks and weak baselines. We explore data augmentation on hard tasks (i.e., few-shot natural language understanding) and strong baselines (i.e., pretrained models with over one billion parameters). Under this setting, we reproduced a large number of previous augmentation methods and found that these methods bring marginal gains at best and sometimes degrade the performance much. To address this challenge, we propose a novel data augmentation method FlipDA that jointly uses a generative model and a classifier to generate label-flipped data. Central to the idea of FlipDA is the discovery that generating label-flipped data is more crucial to the performance than generating label-preserved data. Experiments show that FlipDA achieves a good tradeoff between effectiveness and robustness—it substantially improves many tasks while not negatively affecting the others.

pdf bib
Bilingual alignment transfers to multilingual alignment for unsupervised parallel text mining
Chih-chan Tien | Shane Steinert-Threlkeld

This work presents methods for learning cross-lingual sentence representations using paired or unpaired bilingual texts. We hypothesize that the cross-lingual alignment strategy is transferable, and therefore a model trained to align only two languages can encode multilingually more aligned representations. We thus introduce dual-pivot transfer: training on one language pair and evaluating on other pairs. To study this theory, we design unsupervised models trained on unpaired sentences and single-pair supervised models trained on bitexts, both based on the unsupervised language model XLM-R with its parameters frozen. The experiments evaluate the models as universal sentence encoders on the task of unsupervised bitext mining on two datasets, where the unsupervised model reaches the state of the art of unsupervised retrieval, and the alternative single-pair supervised model approaches the performance of multilingually supervised models. The results suggest that bilingual training techniques as proposed can be applied to get sentence representations with multilingual alignment.

pdf bib
Pyramid-BERT: Reducing Complexity via Successive Core-set based Token Selection
Xin Huang | Ashish Khetan | Rene Bidart | Zohar Karnin

Transformer-based language models such as BERT (CITATION) have achieved the state-of-the-art performance on various NLP tasks, but are computationally prohibitive. A recent line of work uses various heuristics to successively shorten sequence length while transforming tokens through encoders, in tasks such as classification and ranking that require a single token embedding for prediction. We present a novel solution to this problem, called Pyramid-BERT, where we replace previously used heuristics with a core-set based token selection method justified by theoretical results. The core-set based token selection technique allows us to avoid expensive pre-training, gives a space-efficient fine tuning, and thus makes it suitable to handle longer sequence lengths. We provide extensive experiments establishing advantages of Pyramid-BERT over several baselines and existing works on the GLUE benchmarks and Long Range Arena (CITATION) datasets.
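The abstract above does not spell out the selection procedure. One standard way to approximate a core-set is greedy k-center (farthest-point) selection, sketched below over token embeddings. Treat this as an assumption for illustration rather than the paper's algorithm: the Euclidean distance, the choice to always keep index 0 (standing in for the single token used for prediction), and the function names are all hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def k_center_greedy(tokens: Sequence[Vector], k: int) -> List[int]:
    """Pick k token indices by farthest-point sampling, starting from index 0.

    Each round adds the unselected token that is farthest from its nearest
    already-selected token, so the kept tokens spread over the embedding space.
    """
    if not 1 <= k <= len(tokens):
        raise ValueError("k must be between 1 and the number of tokens")
    selected = [0]
    chosen = {0}
    # Distance from every token to its nearest selected token so far.
    min_dist = [euclidean(t, tokens[0]) for t in tokens]
    while len(selected) < k:
        candidates = [i for i in range(len(tokens)) if i not in chosen]
        nxt = max(candidates, key=min_dist.__getitem__)
        selected.append(nxt)
        chosen.add(nxt)
        for i, t in enumerate(tokens):
            min_dist[i] = min(min_dist[i], euclidean(t, tokens[nxt]))
    return sorted(selected)
```

Applying such a selection after successive encoder layers, with a smaller k each time, would give the shrinking "pyramid" of sequence lengths that the title suggests; how many tokens to keep per layer is left open here.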


pdf (full)
bib (full)
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

pdf bib
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Smaranda Muresan | Preslav Nakov | Aline Villavicencio

pdf bib
Investigating person-specific errors in chat-oriented dialogue systems
Koh Mitsuda | Ryuichiro Higashinaka | Tingxuan Li | Sen Yoshida

Creating chatbots to behave like real people is important in terms of believability. Errors in general chatbots and chatbots that follow a rough persona have been studied, but those in chatbots that behave like real people have not been thoroughly investigated. We collected a large amount of user interactions of a generation-based chatbot trained from large-scale dialogue data of a specific character, i.e., target person, and analyzed errors related to that person. We found that person-specific errors can be divided into two types: errors in attributes and those in relations, each of which can be divided into two levels: self and other. The correspondence with an existing taxonomy of errors was also investigated, and person-specific errors that should be addressed in the future were clarified.


pdf (full)
bib (full)
Findings of the Association for Computational Linguistics: ACL 2022

pdf bib
Findings of the Association for Computational Linguistics: ACL 2022
Smaranda Muresan | Preslav Nakov | Aline Villavicencio

pdf bib
RelationPrompt: Leveraging Prompts to Generate Synthetic Data for Zero-Shot Relation Triplet Extraction
Yew Ken Chia | Lidong Bing | Soujanya Poria | Luo Si

Despite the importance of relation extraction in building and representing knowledge, less research is focused on generalizing to unseen relations types. We introduce the task setting of Zero-Shot Relation Triplet Extraction (ZeroRTE) to encourage further research in low-resource relation extraction methods. Given an input sentence, each extracted triplet consists of the head entity, relation label, and tail entity where the relation label is not seen at the training stage. To solve ZeroRTE, we propose to synthesize relation examples by prompting language models to generate structured texts. Concretely, we unify language model prompts and structured text approaches to design a structured prompt template for generating synthetic relation samples when conditioning on relation label prompts (RelationPrompt). To overcome the limitation for extracting multiple relation triplets in a sentence, we design a novel Triplet Search Decoding method. Experiments on FewRel and Wiki-ZSL datasets show the efficacy of RelationPrompt for the ZeroRTE task and zero-shot relation classification. Our code and data are available at

pdf bib
Table-based Fact Verification with Self-adaptive Mixture of Experts
Yuxuan Zhou | Xien Liu | Kaiyin Zhou | Ji Wu

The table-based fact verification task has recently gained widespread attention and yet remains to be a very challenging problem. It inherently requires informative reasoning over natural language together with different numerical and logical reasoning on tables (e.g., count, superlative, comparative). Considering that, we exploit mixture-of-experts and present in this paper a new method: Self-adaptive Mixture-of-Experts Network (SaMoE). Specifically, we have developed a mixture-of-experts neural network to recognize and execute different types of reasoning—the network is composed of multiple experts, each handling a specific part of the semantics for reasoning, whereas a management module is applied to decide the contribution of each expert network to the verification result. A self-adaptive method is developed to teach the management module combining results of different experts more efficiently without external knowledge. The experimental results illustrate that our framework achieves 85.1% accuracy on the benchmark dataset TabFact, comparable with the previous state-of-the-art models. We hope our framework can serve as a new baseline for table-based verification. Our code is available at

pdf bib
LEVEN: A Large-Scale Chinese Legal Event Detection Dataset
Feng Yao | Chaojun Xiao | Xiaozhi Wang | Zhiyuan Liu | Lei Hou | Cunchao Tu | Juanzi Li | Yun Liu | Weixing Shen | Maosong Sun

Recognizing facts is the most fundamental step in making judgments, hence detecting events in the legal documents is important to legal case analysis tasks. However, existing Legal Event Detection (LED) datasets only concern incomprehensive event types and have limited annotated data, which restricts the development of LED methods and their downstream applications. To alleviate these issues, we present LEVEN, a large-scale Chinese LEgal eVENt detection dataset, with 8,116 legal documents and 150,977 human-annotated event mentions in 108 event types. Not only charge-related events, LEVEN also covers general events, which are critical for legal case understanding but neglected in existing LED datasets. To our knowledge, LEVEN is the largest LED dataset and has dozens of times the data scale of others, which shall significantly promote the training and evaluation of LED methods. The results of extensive experiments indicate that LED is challenging and needs further effort. Moreover, we simply utilize legal events as side information to promote downstream applications. The method achieves improvements of average 2.2 points precision in low-resource judgment prediction, and 1.5 points mean average precision in unsupervised case retrieval, which suggests the fundamentality of LED. The source code and dataset can be obtained from

pdf bib
RuCCoN: Clinical Concept Normalization in Russian
Alexandr Nesterov | Galina Zubkova | Zulfat Miftahutdinov | Vladimir Kokh | Elena Tutubalina | Artem Shelmanov | Anton Alekseev | Manvel Avetisian | Andrey Chertok | Sergey Nikolenko

We present RuCCoN, a new dataset for clinical concept normalization in Russian manually annotated by medical professionals. It contains over 16,028 entity mentions manually linked to over 2,409 unique concepts from the Russian language part of the UMLS ontology. We provide train/test splits for different settings (stratified, zero-shot, and CUI-less) and present strong baselines obtained with state-of-the-art models such as SapBERT. At present, Russian medical NLP is lacking in both datasets and trained models, and we view this work as an important step towards filling this gap. Our dataset and annotation guidelines are available at

pdf bib
Dynamically Refined Regularization for Improving Cross-corpora Hate Speech Detection
Tulika Bose | Nikolaos Aletras | Irina Illina | Dominique Fohr

Hate speech classifiers exhibit substantial performance degradation when evaluated on datasets different from the source. This is due to learning spurious correlations between words that are not necessarily relevant to hateful language, and hate speech labels from the training corpus. Previous work has attempted to mitigate this problem by regularizing specific terms from pre-defined static dictionaries. While this has been demonstrated to improve the generalizability of classifiers, the coverage of such methods is limited and the dictionaries require regular manual updates from human experts. In this paper, we propose to automatically identify and reduce spurious correlations using attribution methods with dynamic refinement of the list of terms that need to be regularized during training. Our approach is flexible and improves the cross-corpora performance over previous work independently and in combination with pre-defined dictionaries.

pdf bib
Visualizing the Relationship Between Encoded Linguistic Information and Task Performance
Jiannan Xiang | Huayang Li | Defu Lian | Guoping Huang | Taro Watanabe | Lemao Liu

Probing is popular to analyze whether linguistic information can be captured by a well-trained deep neural model, but it is hard to answer how the change of the encoded linguistic information will affect task performance. To this end, we study the dynamic relationship between the encoded linguistic information and task performance from the viewpoint of Pareto Optimality. Its key idea is to obtain a set of models which are Pareto-optimal in terms of both objectives. From this viewpoint, we propose a method to optimize the Pareto-optimal models by formalizing it as a multi-objective optimization problem. We conduct experiments on two popular NLP tasks, i.e., machine translation and language modeling, and investigate the relationship between several kinds of linguistic information and task performances. Experimental results demonstrate that the proposed method is better than a baseline method. Our empirical findings suggest that some syntactic information is helpful for NLP tasks, whereas encoding more syntactic information does not necessarily lead to better performance, because the model architecture is also an important factor.

pdf bib
Efficient Argument Structure Extraction with Transfer Learning and Active Learning
Xinyu Hua | Lu Wang

The automation of extracting argument structures faces a pair of challenges: encoding long-term contexts to facilitate comprehensive understanding, and improving data efficiency, since constructing high-quality argument structures is time-consuming. In this work, we propose a novel context-aware Transformer-based argument structure prediction model which, on five different domains, significantly outperforms models that rely on features or only encode limited contexts. To tackle the difficulty of data annotation, we examine two complementary methods: (i) transfer learning to leverage existing annotated data to boost model performance in a new target domain, and (ii) active learning to strategically identify a small amount of samples for annotation. We further propose model-independent sample acquisition strategies, which can be generalized to diverse domains. With extensive experiments, we show that our simple yet effective acquisition strategies yield competitive results against three strong comparisons. Combined with transfer learning, substantial F1 score boost can be further achieved during the early iterations of active learning across domains.

pdf bib
SyMCoM - Syntactic Measure of Code Mixing: A Study Of English-Hindi Code-Mixing
Prashant Kodali | Anmol Goel | Monojit Choudhury | Manish Shrivastava | Ponnurangam Kumaraguru

Code mixing is the linguistic phenomenon where bilingual speakers tend to switch between two or more languages in conversations. Recent work on code-mixing in computational settings has leveraged social media code mixed texts to train NLP models. For capturing the variety of code mixing in, and across corpus, Language ID (LID) tags based measures (CMI) have been proposed. Syntactical variety/patterns of code-mixing and their relationship vis-a-vis computational model’s performance is under explored. In this work, we investigate a collection of English(en)-Hindi(hi) code-mixed datasets from a syntactic lens to propose, SyMCoM, an indicator of syntactic variety in code-mixed text, with intuitive theoretical bounds. We train SoTA en-hi PoS tagger, accuracy of 93.4%, to reliably compute PoS tags on a corpus, and demonstrate the utility of SyMCoM by applying it on various syntactical categories on a collection of datasets, and compare datasets using the measure.

pdf bib
Classification without (Proper) Representation: Political Heterogeneity in Social Media and Its Implications for Classification and Behavioral Analysis
Kenan Alkiek | Bohan Zhang | David Jurgens

Reddit is home to a broad spectrum of political activity, and users signal their political affiliations in multiple ways—from self-declarations to community participation. Frequently, computational studies have treated political users as a single bloc, both in developing models to infer political leaning and in studying political behavior. Here, we test this assumption of political users and show that commonly-used political-inference models do not generalize, indicating heterogeneous types of political users. The models remain imprecise at best for most users, regardless of which sources of data or methods are used. Across a 14-year longitudinal analysis, we demonstrate that the choice in definition of a political user has significant implications for behavioral analysis. Controlling for multiple factors, political users are more toxic on the platform and inter-party interactions are even more toxic—but not all political users behave this way. Last, we identify a subset of political users who repeatedly flip affiliations, showing that these users are the most controversial of all, acting as provocateurs by more frequently bringing up politics, and are more likely to be banned, suspended, or deleted.

pdf bib
Hierarchical Inductive Transfer for Continual Dialogue Learning
Shaoxiong Feng | Xuancheng Ren | Kan Li | Xu Sun

Pre-trained models have achieved excellent performance on the dialogue task. However, for the continual increase of online chit-chat scenarios, directly fine-tuning these models for each of the new tasks not only explodes the capacity of the dialogue system on the embedded devices but also causes knowledge forgetting on pre-trained models and knowledge interference among diverse dialogue tasks. In this work, we propose a hierarchical inductive transfer framework to learn and deploy the dialogue skills continually and efficiently. First, we introduce the adapter module into pre-trained models for learning new dialogue tasks. As the only trainable module, it is beneficial for the dialogue system on the embedded devices to acquire new dialogue skills with negligible additional parameters. Then, for alleviating knowledge interference between tasks yet benefiting the regularization between them, we further design hierarchical inductive transfer that enables new tasks to use general knowledge in the base adapter without being misled by diverse knowledge in task-specific adapters. Empirical evaluation and analysis indicate that our framework obtains comparable performance under deployment-friendly model capacity.

pdf bib
A Simple yet Effective Relation Information Guided Approach for Few-Shot Relation Extraction
Yang Liu | Jinpeng Hu | Xiang Wan | Tsung-Hui Chang

Few-Shot Relation Extraction aims at predicting the relation for a pair of entities in a sentence by training with a few labelled examples in each relation. Some recent works have introduced relation information (i.e., relation labels or descriptions) to assist model learning based on Prototype Network. However, most of them constrain the prototypes of each relation class implicitly with relation information, generally through designing complex network structures, like generating hybrid features, combining with contrastive learning or attention networks. We argue that relation information can be introduced more explicitly and effectively into the model. Thus, this paper proposes a direct addition approach to introduce relation information. Specifically, for each relation class, the relation representation is first generated by concatenating two views of relations (i.e., [CLS] token embedding and the mean value of embeddings of all tokens) and then directly added to the original prototype for both train and prediction. Experimental results on the benchmark dataset FewRel 1.0 show significant improvements and achieve comparable results to the state-of-the-art, which demonstrates the effectiveness of our proposed approach. Besides, further analyses verify that the direct addition is a much more effective way to integrate the relation representations and the original prototypes.
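The direct addition step described in the abstract above can be sketched as follows: the relation representation is the concatenation of the [CLS] embedding and the mean of all token embeddings, and it is added element-wise to the class prototype. This is a minimal sketch under stated assumptions. The prototype is taken to be the mean of the support-instance embeddings, as in a standard Prototypical Network, and those instance embeddings are assumed to already have twice the encoder's hidden size so the dimensions match. All function names are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def relation_representation(cls_emb: Vector, token_embs: List[Vector]) -> Vector:
    """Concatenate the [CLS] embedding with the mean of all token embeddings."""
    return list(cls_emb) + mean_vector(token_embs)


def final_prototype(
    support_embs: List[Vector],
    cls_emb: Vector,
    token_embs: List[Vector],
) -> Vector:
    """Average the support embeddings, then add the relation representation."""
    proto = mean_vector(support_embs)
    rel = relation_representation(cls_emb, token_embs)
    if len(proto) != len(rel):
        raise ValueError("prototype and relation representation must share a dimension")
    return [p + r for p, r in zip(proto, rel)]
```

The appeal of this design, as the abstract argues, is that the relation information enters the prototype through a plain sum rather than through an extra network component.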

pdf bib
MIMICause: Representation and automatic extraction of causal relation types from clinical notes
Vivek Khetan | Md Imbesat Rizvi | Jessica Huber | Paige Bartusiak | Bogdan Sacaleanu | Andrew Fano

Understanding causal narratives communicated in clinical notes can help make strides towards personalized healthcare. Extracted causal information from clinical notes can be combined with structured EHR data such as patients’ demographics, diagnoses, and medications. This will enhance healthcare providers’ ability to identify aspects of a patient’s story communicated in the clinical notes and help make more informed decisions. In this work, we propose annotation guidelines, develop an annotated corpus and provide baseline scores to identify types and direction of causal relations between a pair of biomedical concepts in clinical notes; communicated implicitly or explicitly, identified either in a single sentence or across multiple sentences. We annotate a total of 2714 de-identified examples sampled from the 2018 n2c2 shared task dataset and train four different language model based architectures. Annotation based on our guidelines achieved a high inter-annotator agreement, i.e., a Fleiss’ kappa (κ) score of 0.72, and our model for identification of causal relations achieved a macro F1 score of 0.56 on the test data. The high inter-annotator agreement for clinical text shows the quality of our annotation guidelines while the provided baseline F1 score sets the direction for future research towards understanding narratives in clinical texts.
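
For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of the standard Fleiss' kappa computation. The function name and the input layout (one row per item, one column per category, each cell holding the number of annotators who chose that category) are illustrative assumptions; the data in the usage lines is made up and unrelated to the MIMICause corpus.

```python
import numpy as np

def fleiss_kappa(counts) -> float:
    """Fleiss' kappa for a (num_items, num_categories) matrix of rating counts.

    Assumes every item was rated by the same number of annotators.
    """
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts[0].sum()

    # Observed agreement: fraction of agreeing annotator pairs per item, averaged.
    p_i = (np.sum(counts ** 2, axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar = p_i.mean()

    # Chance agreement from the overall category proportions.
    p_j = counts.sum(axis=0) / (n_items * n_raters)
    p_e = np.sum(p_j ** 2)

    return float((p_bar - p_e) / (1.0 - p_e))

# Toy usage: three annotators, two categories, two items with full agreement.
toy_counts = [[3, 0], [0, 3]]
toy_kappa = fleiss_kappa(toy_counts)
```

Values near 1 indicate agreement well above chance, and values near 0 indicate agreement no better than chance.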

pdf bib
Fact-Tree Reasoning for N-ary Question Answering over Knowledge Graphs
Yao Zhang | Peiyao Li | Hongru Liang | Adam Jatowt | Zhenglu Yang

Current Question Answering over Knowledge Graphs (KGQA) task mainly focuses on performing answer reasoning upon KGs with binary facts. However, it neglects the n-ary facts, which contain more than two entities. In this work, we highlight a more challenging but under-explored task: n-ary KGQA, i.e., answering n-ary facts questions upon n-ary KGs. Nevertheless, the multi-hop reasoning framework popular in binary KGQA task is not directly applicable on n-ary KGQA. We propose two feasible improvements: 1) upgrade the basic reasoning unit from entity or relation to fact, and 2) upgrade the reasoning structure from chain to tree. Therefore, we propose a novel fact-tree reasoning framework, FacTree, which integrates the above two upgrades. FacTree transforms the question into a fact tree and performs iterative fact reasoning on the fact tree to infer the correct answer. Experimental results on the n-ary KGQA dataset we constructed and two binary KGQA benchmarks demonstrate the effectiveness of FacTree compared with state-of-the-art methods.

pdf bib
Mukayese: Turkish NLP Strikes Back
Ali Safaya | Emirhan Kurtuluş | Arda Goktogan | Deniz Yuret

Having sufficient resources for language X lifts it from the under-resourced languages class, but not necessarily from the under-researched class. In this paper, we address the problem of the absence of organized benchmarks in the Turkish language. We demonstrate that languages such as Turkish are left behind the state of the art in NLP applications. As a solution, we present Mukayese, a set of NLP benchmarks for the Turkish language that contains several NLP tasks. We work on one or more datasets for each benchmark and present two or more baselines. Moreover, we present four new benchmarking datasets in Turkish for language modeling, sentence segmentation, and spell checking. All datasets and baselines are available under

pdf bib
Distinguishing Non-natural from Natural Adversarial Samples for More Robust Pre-trained Language Model
Jiayi Wang | Rongzhou Bao | Zhuosheng Zhang | Hai Zhao

Recently, the problem of robustness of pre-trained language models (PrLMs) has received increasing research interest. Latest studies on adversarial attacks achieve high attack success rates against PrLMs, claiming that PrLMs are not robust. However, we find that the adversarial samples that PrLMs fail are mostly non-natural and do not appear in reality. We question the validity of the current evaluation of robustness of PrLMs based on these non-natural adversarial samples and propose an anomaly detector to evaluate the robustness of PrLMs with more natural adversarial samples. We also investigate two applications of the anomaly detector: (1) In data augmentation, we employ the anomaly detector to force generating augmented data that are distinguished as non-natural, which brings larger gains to the accuracy of PrLMs. (2) We apply the anomaly detector to a defense framework to enhance the robustness of PrLMs. It can be used to defend all types of attacks and achieves higher accuracy on both adversarial samples and compliant samples than other defense frameworks.

pdf bib
GRS: Combining Generation and Revision in Unsupervised Sentence Simplification
Mohammad Dehghan | Dhruv Kumar | Lukasz Golab

We propose GRS, an unsupervised approach to sentence simplification that combines text generation and text revision. We start with an iterative framework in which an input sentence is revised using explicit edit operations, and add paraphrasing as a new edit operation. This allows us to combine the advantages of generative and revision-based approaches: paraphrasing captures complex edit operations, and the use of explicit edit operations in an iterative manner provides controllability and interpretability. We demonstrate these advantages of GRS compared to existing methods on the Newsela and ASSET datasets.

pdf bib
Distributed NLI: Learning to Predict Human Opinion Distributions for Language Reasoning
Xiang Zhou | Yixin Nie | Mohit Bansal

We introduce distributed NLI, a new NLU task with a goal to predict the distribution of human judgements for natural language inference. We show that by applying additional distribution estimation methods, namely, Monte Carlo (MC) Dropout, Deep Ensemble, Re-Calibration, and Distribution Distillation, models can capture human judgement distribution more effectively than the softmax baseline. We show that MC Dropout is able to achieve decent performance without any distribution annotations, while Re-Calibration can give further improvements with extra distribution annotations, suggesting the value of multiple annotations for one example in modeling the distribution of human judgements. Despite these improvements, the best results are still far below the estimated human upper bound, indicating that predicting the distribution of human judgements is still an open, challenging problem with a large room for improvements. We showcase the common errors for MC Dropout and Re-Calibration. Finally, we give guidelines on the usage of these methods with different levels of data availability and encourage future work on modeling the human opinion distribution for language reasoning.

pdf bib
What to Learn, and How: Toward Effective Learning from Rationales
Samuel Carton | Surya Kanoria | Chenhao Tan

Learning from rationales seeks to augment model prediction accuracy using human-annotated rationales (i.e. subsets of input tokens) that justify their chosen labels, often in the form of intermediate or multitask supervision. While intuitive, this idea has proven elusive in practice. We make two observations about human rationales via empirical analyses: 1) maximizing rationale supervision accuracy is not necessarily the optimal objective for improving model accuracy; 2) human rationales vary in whether they provide sufficient information for the model to exploit for prediction. Building on these insights, we propose several novel loss functions and learning strategies, and evaluate their effectiveness on three datasets with human rationales. Our results demonstrate consistent improvements over baselines in both label and rationale accuracy, including a 3% accuracy improvement on MultiRC. Our work highlights the importance of understanding properties of human explanations and exploiting them accordingly in model training.

pdf bib
Learning to Robustly Aggregate Labeling Functions for Semi-supervised Data Programming
Ayush Maheshwari | Krishnateja Killamsetty | Ganesh Ramakrishnan | Rishabh Iyer | Marina Danilevsky | Lucian Popa

A critical bottleneck in supervised machine learning is the need for large amounts of labeled data which is expensive and time-consuming to obtain. Although a small amount of labeled data cannot be used to train a model, it can be used effectively for the generation of human-interpretable labeling functions (LFs). These LFs, in turn, have been used to generate a large amount of additional noisy labeled data in a paradigm that is now commonly referred to as data programming. Previous methods of generating LFs do not attempt to use the given labeled data further to train a model, thus missing opportunities for improving performance. Additionally, since the LFs are generated automatically, they are likely to be noisy, and naively aggregating these LFs can lead to suboptimal results. In this work, we propose an LF-based bi-level optimization framework WISDOM to solve these two critical limitations. WISDOM learns a joint model on the (same) labeled dataset used for LF induction along with any unlabeled data in a semi-supervised manner, and more critically, reweighs each LF according to its goodness, influencing its contribution to the semi-supervised loss using a robust bi-level optimization algorithm. We show that WISDOM significantly outperforms prior approaches on several text classification datasets.

pdf bib
Cross-lingual Inference with A Chinese Entailment Graph
Tianyi Li | Sabine Weber | Mohammad Javad Hosseini | Liane Guillou | Mark Steedman

Predicate entailment detection is a crucial task for question-answering from text, where previous work has explored unsupervised learning of entailment graphs from typed open relation triples. In this paper, we present the first pipeline for building Chinese entailment graphs, which involves a novel high-recall open relation extraction (ORE) method and the first Chinese fine-grained entity typing dataset under the FIGER type ontology. Through experiments on the Levy-Holt dataset, we verify the strength of our Chinese entailment graph, and reveal the cross-lingual complementarity: on the parallel Levy-Holt dataset, an ensemble of Chinese and English entailment graphs outperforms both monolingual graphs, and raises unsupervised SOTA by 4.7 AUC points.

pdf bib
Graph Neural Networks for Multiparallel Word Alignment
Ayyoob Imani | Lütfi Kerem Senel | Masoud Jalili Sabet | François Yvon | Hinrich Schuetze

After a period of decrease, interest in word alignments is increasing again for their usefulness in domains such as typological research, cross-lingual annotation projection, and machine translation. Generally, alignment algorithms only use bitext and do not make use of the fact that many parallel corpora are multiparallel. Here, we compute high-quality word alignments between multiple language pairs by considering all language pairs together. First, we create a multiparallel word alignment graph, joining all bilingual word alignment pairs in one graph. Next, we use graph neural networks (GNNs) to exploit the graph structure. Our GNN approach (i) utilizes information about the meaning, position, and language of the input words, (ii) incorporates information from multiple parallel sentences, (iii) adds and removes edges from the initial alignments, and (iv) yields a prediction model that can generalize beyond the training sentences. We show that community detection algorithms can provide valuable information for multiparallel word alignment. Our method outperforms previous work on three word alignment datasets and on a downstream task.

pdf bib
Sentiment Word Aware Multimodal Refinement for Multimodal Sentiment Analysis with ASR Errors
Yang Wu | Yanyan Zhao | Hao Yang | Song Chen | Bing Qin | Xiaohuan Cao | Wenting Zhao

Multimodal sentiment analysis has attracted increasing attention and lots of models have been proposed. However, the performance of the state-of-the-art models decreases sharply when they are deployed in the real world. We find that the main reason is that real-world applications can only access the text outputs by the automatic speech recognition (ASR) models, which may be with errors because of the limitation of model capacity. Through further analysis of the ASR outputs, we find that in some cases the sentiment words, the key sentiment elements in the textual modality, are recognized as other words, which makes the sentiment of the text change and hurts the performance of multimodal sentiment analysis models directly. To address this problem, we propose the sentiment word aware multimodal refinement model (SWRM), which can dynamically refine the erroneous sentiment words by leveraging multimodal sentiment clues. Specifically, we first use the sentiment word position detection module to obtain the most possible position of the sentiment word in the text and then utilize the multimodal sentiment word refinement module to dynamically refine the sentiment word embeddings. The refined embeddings are taken as the textual inputs of the multimodal feature fusion module to predict the sentiment labels. We conduct extensive experiments on the real-world datasets including MOSI-Speechbrain, MOSI-IBM, and MOSI-iFlytek, and the results demonstrate the effectiveness of our model, which surpasses the current state-of-the-art models on three datasets. Furthermore, our approach can be adapted for other multimodal feature fusion models easily.

pdf bib
End-to-End Speech Translation for Code Switched Speech
Orion Weller | Matthias Sperber | Telmo Pires | Hendra Setiawan | Christian Gollan | Dominic Telaar | Matthias Paulik

Code switching (CS) refers to the phenomenon of interchangeably using words and phrases from different languages. CS can pose significant accuracy challenges to NLP, due to the often monolingual nature of the underlying systems. In this work, we focus on CS in the context of English/Spanish conversations for the task of speech translation (ST), generating and evaluating both transcript and translation. To evaluate model performance on this task, we create a novel ST corpus derived from existing public data sets. We explore various ST architectures across two dimensions: cascaded (transcribe then translate) vs end-to-end (jointly transcribe and translate) and unidirectional (source -> target) vs bidirectional (source <-> target). We show that our ST architectures, and especially our bidirectional end-to-end architecture, perform well on CS speech, even when no CS training data is used.

pdf bib
Capture Human Disagreement Distributions by Calibrated Networks for Natural Language Inference
Yuxia Wang | Minghan Wang | Yimeng Chen | Shimin Tao | Jiaxin Guo | Chang Su | Min Zhang | Hao Yang

Natural Language Inference (NLI) datasets contain examples with highly ambiguous labels due to its subjectivity. Several recent efforts have been made to acknowledge and embrace the existence of ambiguity, and explore how to capture the human disagreement distribution. In contrast with directly learning from gold ambiguity labels, relying on special resource, we argue that the model has naturally captured the human ambiguity distribution as long as it is calibrated, i.e. the predictive probability can reflect the true correctness likelihood. Our experiments show that when the model is well calibrated, either by label smoothing or temperature scaling, it can obtain competitive performance as prior work, on both divergence scores between predictive probability and the true human opinion distribution, and the accuracy. This reveals the overhead of collecting gold ambiguity labels can be cut, by broadly solving how to calibrate the NLI network.
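
The two calibration techniques named in this abstract can be sketched as follows. This is a generic illustration of temperature scaling and label smoothing, not the authors' code: the function names, the choice of KL divergence as the divergence score, and all numeric values in the usage lines are assumptions.

```python
import numpy as np

def softmax(logits, temperature: float = 1.0) -> np.ndarray:
    """Softmax with temperature scaling; temperature > 1 flattens the distribution."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def smooth_labels(one_hot, alpha: float = 0.1) -> np.ndarray:
    """Label smoothing: move a fraction `alpha` of the mass to a uniform distribution."""
    one_hot = np.asarray(one_hot, dtype=float)
    num_classes = one_hot.shape[-1]
    return (1.0 - alpha) * one_hot + alpha / num_classes

def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """KL(p || q) between two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

# Hypothetical example: an over-confident model versus a split human opinion.
logits = [4.0, 1.0, 0.0]   # made-up logits for entailment / neutral / contradiction
human = [0.6, 0.3, 0.1]    # made-up human label distribution
kl_uncalibrated = kl_divergence(human, softmax(logits, temperature=1.0))
kl_calibrated = kl_divergence(human, softmax(logits, temperature=3.0))
```

In practice the temperature would be fitted on held-out data rather than fixed by hand; the value 3.0 above is only there to show the flattening effect.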

pdf bib
Efficient, Uncertainty-based Moderation of Neural Networks Text Classifiers
Jakob Smedegaard Andersen | Walid Maalej

To maximize the accuracy and increase the overall acceptance of text classifiers, we propose a framework for the efficient, in-operation moderation of classifiers’ output. Our framework focuses on use cases in which F1-scores of modern Neural Networks classifiers (ca. 90%) are still inapplicable in practice. We suggest a semi-automated approach that uses prediction uncertainties to pass unconfident, probably incorrect classifications to human moderators. To minimize the workload, we limit the human moderated data to the point where the accuracy gains saturate and further human effort does not lead to substantial improvements. A series of benchmarking experiments based on three different datasets and three state-of-the-art classifiers show that our framework can improve the classification F1-scores by 5.1 to 11.2% (up to approx. 98 to 99%), while reducing the moderation load up to 73.3% compared to a random moderation.
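
The moderation idea above can be sketched in a few lines. This is an illustrative simplification, not the authors' framework: it uses least-confidence (one minus the top softmax probability) as the uncertainty score and a fixed review budget, whereas the abstract describes choosing the stopping point where accuracy gains saturate. The function and parameter names are hypothetical.

```python
import numpy as np

def moderate(probs, human_labels, budget: float):
    """Route the least confident predictions to human moderators.

    probs:        (num_examples, num_classes) predicted class probabilities
    human_labels: labels a moderator would assign (only the reviewed ones are used)
    budget:       fraction of examples that humans are allowed to review
    Returns the final labels and the indices that were sent for review.
    """
    probs = np.asarray(probs, dtype=float)
    predictions = probs.argmax(axis=1)
    confidence = probs.max(axis=1)

    n_review = int(round(budget * len(predictions)))
    review_idx = np.argsort(confidence)[:n_review]  # lowest confidence first

    final = predictions.copy()
    final[review_idx] = np.asarray(human_labels)[review_idx]
    return final, review_idx
```

A random-moderation baseline, as mentioned in the abstract, would draw `review_idx` uniformly at random instead of sorting by confidence.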

pdf bib
Open Vocabulary Extreme Classification Using Generative Models
Daniel Simig | Fabio Petroni | Pouya Yanki | Kashyap Popat | Christina Du | Sebastian Riedel | Majid Yazdani

The extreme multi-label classification (XMC) task aims at tagging content with a subset of labels from an extremely large label set. The label vocabulary is typically defined in advance by domain experts and assumed to capture all necessary tags. However, in real-world scenarios this label set, although large, is often incomplete and experts frequently need to refine it. To develop systems that simplify this process, we introduce the task of open vocabulary XMC (OXMC): given a piece of content, predict a set of labels, some of which may be outside of the known tag set. Hence, in addition to not having training data for some labels, as is the case in zero-shot classification, models need to invent some labels on the fly. We propose GROOV, a fine-tuned seq2seq model for OXMC that generates the set of labels as a flat sequence and is trained using a novel loss independent of predicted label order. We show the efficacy of the approach, experimenting with popular XMC datasets for which GROOV is able to predict meaningful labels outside the given vocabulary while performing on par with state-of-the-art solutions for known labels.

pdf bib
Decomposed Meta-Learning for Few-Shot Named Entity Recognition
Tingting Ma | Huiqiang Jiang | Qianhui Wu | Tiejun Zhao | Chin-Yew Lin

Few-shot named entity recognition (NER) systems aim at recognizing novel-class named entities based on only a few labeled examples. In this paper, we present a decomposed meta-learning approach which addresses the problem of few-shot NER by sequentially tackling few-shot span detection and few-shot entity typing using meta-learning. In particular, we take the few-shot span detection as a sequence labeling problem and train the span detector by introducing the model-agnostic meta-learning (MAML) algorithm to find a good model parameter initialization that could fast adapt to new entity classes. For few-shot entity typing, we propose MAML-ProtoNet, i.e., MAML-enhanced prototypical networks to find a good embedding space that can better distinguish text span representations from different entity classes. Extensive experiments on various benchmarks show that our approach achieves superior performance over prior methods.

pdf bib
Logic-Driven Context Extension and Data Augmentation for Logical Reasoning of Text
Siyuan Wang | Wanjun Zhong | Duyu Tang | Zhongyu Wei | Zhihao Fan | Daxin Jiang | Ming Zhou | Nan Duan

Logical reasoning of text requires identifying critical logical structures in the text and performing inference over them. Existing methods for logical reasoning mainly focus on contextual semantics of text while struggling to explicitly model the logical inference process. In this paper, we not only put forward a logic-driven context extension framework but also propose a logic-driven data augmentation algorithm. The former follows a three-step reasoning paradigm, and each step is respectively to extract logical expressions as elementary reasoning units, symbolically infer the implicit expressions following equivalence laws and extend the context to validate the options. The latter augments literally similar but logically different instances and incorporates contrastive learning to better capture logical information, especially logical negative and conditional relationships. We conduct experiments on two benchmark datasets, ReClor and LogiQA. The results show that our method achieves state-of-the-art performance on both datasets, and even surpasses human performance on the ReClor dataset.

pdf bib
Document-Level Relation Extraction with Adaptive Focal Loss and Knowledge Distillation
Qingyu Tan | Ruidan He | Lidong Bing | Hwee Tou Ng

Document-level Relation Extraction (DocRE) is a more challenging task compared to its sentence-level counterpart. It aims to extract relations from multiple sentences at once. In this paper, we propose a semi-supervised framework for DocRE with three novel components. Firstly, we use an axial attention module for learning the interdependency among entity-pairs, which improves the performance on two-hop relations. Secondly, we propose an adaptive focal loss to tackle the class imbalance problem of DocRE. Lastly, we use knowledge distillation to overcome the differences between human annotated data and distantly supervised data. We conducted experiments on two DocRE datasets. Our model consistently outperforms strong baselines and its performance exceeds the previous SOTA by 1.36 F1 and 1.46 Ign_F1 score on the DocRED leaderboard.
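
As background for the class-imbalance component, here is a minimal sketch of the standard binary focal loss, which down-weights examples the model already classifies well. It is not the adaptive focal loss proposed in the paper, whose details are not given in this abstract; the function name and the default `gamma` are illustrative assumptions.

```python
import numpy as np

def focal_loss(probs, targets, gamma: float = 2.0, eps: float = 1e-12) -> float:
    """Standard binary focal loss averaged over examples.

    probs:   predicted probability of the positive class for each example
    targets: 0/1 gold labels
    gamma:   focusing parameter; gamma = 0 recovers ordinary cross-entropy
    """
    probs = np.asarray(probs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # Probability the model assigns to the correct class of each example.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    return float(np.mean(-((1.0 - p_t) ** gamma) * np.log(p_t + eps)))
```

With `gamma` set to zero the modulating factor is 1 and the function reduces to cross-entropy, which makes it easy to compare the two on the same predictions.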

pdf bib
How Pre-trained Language Models Capture Factual Knowledge? A Causal-Inspired Analysis
Shaobo Li | Xiaoguang Li | Lifeng Shang | Zhenhua Dong | Chengjie Sun | Bingquan Liu | Zhenzhou Ji | Xin Jiang | Qun Liu

Recently, there has been a trend to investigate the factual knowledge captured by Pre-trained Language Models (PLMs). Many works show the PLMs’ ability to fill in the missing factual words in cloze-style prompts such as “Dante was born in [MASK].” However, it is still a mystery how PLMs generate the results correctly: relying on effective clues or shortcut patterns? We try to answer this question by a causal-inspired analysis that quantitatively measures and evaluates the word-level patterns that PLMs depend on to generate the missing words. We check the words that have three typical associations with the missing words: knowledge-dependent, positionally close, and highly co-occurred. Our analysis shows: (1) PLMs generate the missing factual words more by the positionally close and highly co-occurred words than the knowledge-dependent words; (2) the dependence on the knowledge-dependent words is more effective than the positionally close and highly co-occurred words. Accordingly, we conclude that the PLMs capture the factual knowledge ineffectively because of depending on the inadequate associations.

pdf bib
Ranking-Constrained Learning with Rationales for Text Classification
Juanyan Wang | Manali Sharma | Mustafa Bilgic

We propose a novel approach that jointly utilizes the labels and elicited rationales for text classification to speed up the training of deep learning models with limited training data. We define and optimize a ranking-constrained loss function that combines cross-entropy loss with ranking losses as rationale constraints. We evaluate our proposed rationale-augmented learning approach on three human-annotated datasets, and show that our approach provides significant improvements over classification approaches that do not utilize rationales as well as other state-of-the-art rationale-augmented baselines.

pdf bib
The impact of lexical and grammatical processing on generating code from natural language
Nathanaël Beau | Benoit Crabbé

Considering the seq2seq architecture of Yin and Neubig for natural language to code translation, we identify four key components of importance: grammatical constraints, lexical preprocessing, input representations, and copy mechanisms. To study the impact of these components, we use a state-of-the-art architecture that relies on BERT encoder and a grammar-based decoder for which a formalization is provided. The paper highlights the importance of the lexical substitution component in the current natural language to code systems.

pdf bib
Your fairness may vary: Pretrained language model fairness in toxic text classification
Ioana Baldini | Dennis Wei | Karthikeyan Natesan Ramamurthy | Moninder Singh | Mikhail Yurochkin

The popularity of pretrained language models in natural language processing systems calls for a careful evaluation of such models in down-stream tasks, which have a higher potential for societal impact. The evaluation of such systems usually focuses on accuracy measures. Our findings in this paper call for attention to be paid to fairness measures as well. Through the analysis of more than a dozen pretrained language models of varying sizes on two toxic text classification tasks (English), we demonstrate that focusing on accuracy measures alone can lead to models with wide variation in fairness characteristics. Specifically, we observe that fairness can vary even more than accuracy with increasing training data size and different random initializations. At the same time, we find that little of the fairness variation is explained by model size, despite claims in the literature. To improve model fairness without retraining, we show that two post-processing methods developed for structured, tabular data can be successfully applied to a range of pretrained language models. Warning: This paper contains samples of offensive text.

pdf bib
Improving Neural Political Statement Classification with Class Hierarchical Information
Erenay Dayanik | Andre Blessing | Nico Blokker | Sebastian Haunss | Jonas Kuhn | Gabriella Lapesa | Sebastian Pado

Many tasks in text-based computational social science (CSS) involve the classification of political statements into categories based on a domain-specific codebook. In order to be useful for CSS analysis, these categories must be fine-grained. The typically skewed distribution of fine-grained categories, however, results in a challenging classification problem on the NLP side. This paper proposes to make use of the hierarchical relations among categories typically present in such codebooks: e.g., markets and taxation are both subcategories of economy, while borders is a subcategory of security. We use these ontological relations as prior knowledge to establish additional constraints on the learned model, thus improving performance overall and in particular for infrequent categories. We evaluate several lightweight variants of this intuition by extending state-of-the-art transformer-based text classifiers on two datasets and multiple languages. We find the most consistent improvement for an approach based on regularization.

pdf bib
Why don’t people use character-level machine translation?
Jindřich Libovický | Helmut Schmid | Alexander Fraser

We present a literature and empirical survey that critically assesses the state of the art in character-level modeling for machine translation (MT). Despite evidence in the literature that character-level systems are comparable with subword systems, they are virtually never used in competitive setups in WMT competitions. We empirically show that even with recent modeling innovations in character-level natural language processing, character-level MT systems still struggle to match their subword-based counterparts. Character-level MT systems show neither better domain robustness, nor better morphological generalization, despite being often so motivated. However, we are able to show robustness towards source side noise and that translation quality does not degrade with increasing beam size at decoding time.

pdf bib
Automatic Speech Recognition and Query By Example for Creole Languages Documentation
Cécile Macaire | Didier Schwab | Benjamin Lecouteux | Emmanuel Schang

We investigate the exploitation of self-supervised models for two Creole languages with few resources: Gwadloupéyen and Morisien. Automatic language processing tools are almost non-existent for these two languages. We propose to use about one hour of annotated data to design an automatic speech recognition system for each language. We evaluate how much data is needed to obtain a query-by-example system that is usable by linguists. Moreover, our experiments show that multilingual self-supervised models are not necessarily the most efficient for Creole languages.

pdf bib
Long Time No See! Open-Domain Conversation with Long-Term Persona Memory
Xinchao Xu | Zhibin Gou | Wenquan Wu | Zheng-Yu Niu | Hua Wu | Haifeng Wang | Shihang Wang

Most of the open-domain dialogue models tend to perform poorly in the setting of long-term human-bot conversations. The possible reason is that they lack the capability of understanding and memorizing long-term dialogue history information. To address this issue, we present a novel task of Long-term Memory Conversation (LeMon) and then build a new dialogue dataset DuLeMon and a dialogue generation framework with Long-Term Memory (LTM) mechanism (called PLATO-LTM). This LTM mechanism enables our system to accurately extract and continuously update long-term persona memory without requiring multiple-session dialogue datasets for model training. To our knowledge, this is the first attempt to conduct real-time dynamic management of persona information of both parties, including the user and the bot. Results on DuLeMon indicate that PLATO-LTM can significantly outperform baselines in terms of long-term dialogue consistency, leading to better dialogue engagingness.

pdf bib
Breaking Down Multilingual Machine Translation
Ting-Rui Chiang | Yi-Pei Chen | Yi-Ting Yeh | Graham Neubig

While multilingual training is now an essential ingredient in machine translation (MT) systems, recent work has demonstrated that it has different effects in different multilingual settings, such as many-to-one, one-to-many, and many-to-many learning. These training settings expose the encoder and the decoder in a machine translation model with different data distributions. In this paper, we examine how different varieties of multilingual training contribute to learning these two components of the MT model. Specifically, we compare bilingual models with encoders and/or decoders initialized by multilingual training. We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs). We further find the important attention heads for each language pair and compare their correlations during inference. Our analysis sheds light on how multilingual translation models work and also enables us to propose methods to improve performance by training with highly related languages. Our many-to-one models for high-resource languages and one-to-many models for LRL outperform the best results reported by Aharoni et al.

pdf bib
Improving Chinese Grammatical Error Detection via Data augmentation by Conditional Error Generation
Tianchi Yue | Shulin Liu | Huihui Cai | Tao Yang | Shengkang Song | TingHao Yu

Chinese Grammatical Error Detection (CGED) aims at detecting grammatical errors in Chinese texts. One of the main challenges for CGED is the lack of annotated data. To alleviate this problem, previous studies proposed various methods to automatically generate more training samples, which can be roughly categorized into rule-based methods and model-based methods. The rule-based methods construct erroneous sentences by directly introducing noises into original sentences. However, the introduced noises are usually context-independent, which are quite different from those made by humans. The model-based methods utilize generative models to imitate human errors. The generative model may bring too many changes to the original sentences and generate semantically ambiguous sentences, so it is difficult to detect grammatical errors in these generated sentences. In addition, generated sentences may be error-free and thus become noisy data. To handle these problems, we propose CNEG, a novel Conditional Non-Autoregressive Error Generation model for generating Chinese grammatical errors. Specifically, in order to generate a context-dependent error, we first mask a span in a correct text, then predict an erroneous span conditioned on both the masked text and the correct span. Furthermore, we filter out error-free spans by measuring their perplexities in the original sentences. Experimental results show that our proposed method achieves better performance than all compared data augmentation methods on the CGED-2018 and CGED-2020 benchmarks.

pdf bib
Improving Robustness of Language Models from a Geometry-aware Perspective
Bin Zhu | Zhaoquan Gu | Le Wang | Jinyin Chen | Qi Xuan

Recent studies have found that removing the norm-bounded projection and increasing search steps in adversarial training can significantly improve robustness. However, we observe that too many search steps can hurt accuracy. We aim to obtain strong robustness efficiently using fewer steps. Through a toy experiment, we find that perturbing the clean data to the decision boundary but not crossing it does not degrade the test accuracy. Inspired by this, we propose friendly adversarial data augmentation (FADA) to generate friendly adversarial data. On top of FADA, we propose geometry-aware adversarial training (GAT) to perform adversarial training on friendly adversarial data so that we can save a large number of search steps. Comprehensive experiments across two widely used datasets and three pre-trained language models demonstrate that GAT can obtain stronger robustness via fewer steps. In addition, we provide extensive empirical results and in-depth analyses on robustness to facilitate future studies.

pdf bib
UNIMO-2: End-to-End Unified Vision-Language Grounded Learning
Wei Li | Can Gao | Guocheng Niu | Xinyan Xiao | Hao Liu | Jiachen Liu | Hua Wu | Haifeng Wang

Vision-Language Pre-training (VLP) has achieved impressive performance on various cross-modal downstream tasks. However, most existing methods can only learn from aligned image-caption data and rely heavily on expensive regional features, which greatly limits their scalability and performance. In this paper, we propose an end-to-end unified-modal pre-training framework, namely UNIMO-2, for joint learning on both aligned image-caption data and unaligned image-only and text-only corpora. We build a unified Transformer model to jointly learn visual representations, textual representations and semantic alignment between images and texts. In particular, we propose to conduct grounded learning on both images and texts via a shared grounded space, which helps bridge unaligned images and texts, and align the visual and textual semantic spaces on different types of corpora. The experiments show that our grounded learning method can improve textual and visual semantic alignment, improving performance on various cross-modal tasks. Moreover, benefiting from effective joint modeling of different types of corpora, our model also achieves impressive performance on single-modal visual and textual tasks. Our code and models are public at the UNIMO project page.

pdf bib
Word-level Perturbation Considering Word Length and Compositional Subwords
Tatsuya Hiraoka | Sho Takase | Kei Uchiumi | Atsushi Keyaki | Naoaki Okazaki

We present two simple modifications for word-level perturbation: Word Replacement considering Length (WR-L) and Compositional Word Replacement (CWR). In conventional word replacement, a word in an input is replaced with a word sampled from the entire vocabulary, regardless of the length and context of the target word. WR-L considers the length of a target word by sampling words from the Poisson distribution. CWR considers the compositional candidates by restricting the source of sampling to related words that appear in subword regularization. Experimental results showed that the combination of WR-L and CWR improved the performance of text classification and machine translation.
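
The length-aware sampling idea behind WR-L can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes, purely for illustration, that the Poisson mean equals the character length of the target word and that a replacement is drawn uniformly from vocabulary words of the sampled length. The function names `sample_poisson` and `replace_word_by_length` are hypothetical.

```python
import math
import random
from collections import defaultdict


def sample_poisson(lam, rng):
    """Draw one sample from Poisson(lam) using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def replace_word_by_length(target, vocab, rng, max_tries=100):
    """Pick a replacement whose length is drawn from Poisson(len(target)).

    Assumption (not from the abstract): the Poisson mean is the target
    word's character length. If no word of a sampled length is found
    within max_tries draws, the target is returned unchanged.
    """
    by_length = defaultdict(list)
    for word in vocab:
        by_length[len(word)].append(word)
    for _ in range(max_tries):
        length = sample_poisson(len(target), rng)
        candidates = [w for w in by_length.get(length, []) if w != target]
        if candidates:
            return rng.choice(candidates)
    return target


rng = random.Random(0)
vocab = ["a", "to", "cat", "dog", "tree", "house", "planet"]
replacement = replace_word_by_length("cat", vocab, rng)
```

Under these assumptions, replacements for a three-letter word are most often two to four letters long, in contrast to uniform sampling from the whole vocabulary.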

pdf bib
Controlling the Focus of Pretrained Language Generation Models
Jiabao Ji | Yoon Kim | James Glass | Tianxing He

The finetuning of pretrained transformer-based language generation models is typically conducted in an end-to-end manner, where the model learns to attend to relevant parts of the input by itself. However, there does not exist a mechanism to directly control the model’s focus. This work aims to develop a control mechanism by which a user can select spans of context as “highlights” for the model to focus on and generate relevant output. To achieve this goal, we augment a pretrained model with trainable “focus vectors” that are directly applied to the model’s embeddings, while the model itself is kept fixed. These vectors, trained on automatic annotations derived from attribution methods, act as indicators for context importance. We test our approach on two core generation tasks: dialogue response generation and abstractive summarization. We also collect evaluation data where the highlight-generation pairs are annotated by humans. Our experiments show that the trained focus vectors are effective in steering the model to generate outputs that are relevant to user-selected highlights.

pdf bib
CUE Vectors: Modular Training of Language Models Conditioned on Diverse Contextual Signals
Scott Novotney | Sreeparna Mukherjee | Zeeshan Ahmed | Andreas Stolcke

We propose a framework to modularize the training of neural language models that use diverse forms of context by eliminating the need to jointly train context and within-sentence encoders. Our approach, contextual universal embeddings (CUE), trains LMs on one type of contextual data and adapts to novel context types. The model consists of a pretrained neural sentence LM, a BERT-based contextual encoder, and a masked transformer decoder that estimates LM probabilities using sentence-internal and contextual evidence. When contextually annotated data is unavailable, our model learns to combine contextual and sentence-internal information using noisy oracle unigram embeddings as a proxy. Real context data can be introduced later and used to adapt a small number of parameters that map contextual data into the decoder’s embedding space. We validate the CUE framework on a NYTimes text corpus with multiple metadata types, for which the LM perplexity can be lowered from 36.6 to 27.4 by conditioning on context. Bootstrapping a contextual LM with only a subset of the metadata during training retains of the achievable gain. Training the model initially with proxy context retains of the perplexity gain after adapting to real context. Furthermore, we can swap one type of pretrained sentence LM for another without retraining the context encoders, by only adapting the decoder model. Overall, we obtain a modular framework that allows incremental, scalable training of context-enhanced LMs.

pdf bib
Aligned Weight Regularizers for Pruning Pretrained Neural Networks
James O’Neill | Sourav Dutta | Haytham Assem

Pruning aims to reduce the number of parameters while maintaining performance close to the original network. This work proposes a novel self-distillation based pruning strategy, whereby the representational similarity between the pruned and unpruned versions of the same network is maximized. Unlike previous approaches that treat distillation and pruning separately, we use distillation to inform the pruning criteria, without requiring a separate student network as in knowledge distillation. We show that the proposed cross-correlation objective for self-distilled pruning implicitly encourages sparse solutions, naturally complementing magnitude-based pruning criteria. Experiments on the GLUE and XGLUE benchmarks show that self-distilled pruning increases mono- and cross-lingual language model performance. Self-distilled pruned models also outperform smaller Transformers with an equal number of parameters and are competitive against (6 times) larger distilled networks. We also observe that self-distillation (1) maximizes class separability, (2) increases the signal-to-noise ratio, and (3) converges faster after pruning steps, providing further insights into why self-distilled pruning improves generalization.

pdf bib
Consistent Representation Learning for Continual Relation Extraction
Kang Zhao | Hua Xu | Jiangong Yang | Kai Gao

Continual relation extraction (CRE) aims to continuously train a model on data with new relations while avoiding forgetting old ones. Some previous work has shown that storing a few typical samples of old relations and replaying them when learning new relations can effectively avoid forgetting. However, these memory-based methods tend to overfit the memory samples and perform poorly on imbalanced datasets. To solve these challenges, a consistent representation learning method is proposed, which maintains the stability of the relation embedding by adopting contrastive learning and knowledge distillation when replaying memory. Specifically, supervised contrastive learning based on a memory bank is first used to train each new task so that the model can effectively learn the relation representation. Then, contrastive replay is conducted on the samples in memory, and the model retains the knowledge of historical relations through memory knowledge distillation to prevent catastrophic forgetting of the old task. The proposed method can better learn consistent representations to alleviate forgetting effectively. Extensive experiments on the FewRel and TACRED datasets show that our method significantly outperforms state-of-the-art baselines and yields strong robustness on imbalanced datasets.

pdf bib
Comprehensive Multi-Modal Interactions for Referring Image Segmentation
Kanishk Jain | Vineet Gandhi

We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the natural language description. Addressing RIS efficiently requires considering the interactions happening across visual and linguistic modalities and the interactions within each modality. Existing methods are limited because they either compute different forms of interactions sequentially (leading to error propagation) or ignore intra-modal interactions. We address this limitation by performing all three interactions simultaneously through a Synchronous Multi-Modal Fusion Module (SFM). Moreover, to produce refined segmentation masks, we propose a novel Hierarchical Cross-Modal Aggregation Module (HCAM), where linguistic features facilitate the exchange of contextual information across the visual hierarchy. We present thorough ablation studies and validate our approach’s performance on four benchmark datasets, showing considerable performance gains over the existing state-of-the-art (SOTA) methods.

pdf bib
Improving Controllable Text Generation with Position-Aware Weighted Decoding
Yuxuan Gu | Xiaocheng Feng | Sicheng Ma | Jiaming Wu | Heng Gong | Bing Qin

Weighted decoding methods composed of the pretrained language model (LM) and the controller have achieved promising results for controllable text generation. However, these models often suffer from a control strength/fluency trade-off problem, as higher control strength is more likely to generate incoherent and repetitive text. In this paper, we show that this trade-off arises from the controller imposing the target attribute on the LM at improper positions. We propose a novel framework based on existing weighted decoding methods called CAT-PAW, which introduces a lightweight regulator to adjust bias signals from the controller at different decoding positions. Experiments on positive sentiment control, topic control, and language detoxification show the effectiveness of CAT-PAW on 4 SOTA models.

pdf bib
What does it take to bake a cake? The RecipeRef corpus and anaphora resolution in procedural text
Biaoyan Fang | Timothy Baldwin | Karin Verspoor

Procedural text contains rich anaphoric phenomena, yet has not received much attention in NLP. To fill this gap, we investigate the textual properties of two types of procedural text, recipes and chemical patents, and generalize an anaphora annotation framework developed for the chemical domain for modeling anaphoric phenomena in recipes. We apply this framework to annotate the RecipeRef corpus with both bridging and coreference relations. Through comparison to chemical patents, we show the complexity of anaphora resolution in recipes. We demonstrate empirically that transfer learning from the chemical domain improves resolution of anaphora in recipes, suggesting transferability of general procedural knowledge.

pdf bib
MERIt: Meta-Path Guided Contrastive Learning for Logical Reasoning
Fangkai Jiao | Yangyang Guo | Xuemeng Song | Liqiang Nie

Logical reasoning is of vital importance to natural language understanding. Previous studies either employ graph-based models to incorporate prior knowledge about logical relations, or introduce symbolic logic into neural models through data augmentation. These methods, however, heavily depend on annotated training data, and thus suffer from over-fitting and poor generalization problems due to the dataset sparsity. To address these two problems, in this paper, we propose MERIt, a MEta-path guided contrastive learning method for logical ReasonIng of text, to perform self-supervised pre-training on abundant unlabeled text data. Two novel strategies serve as indispensable components of our method. In particular, a strategy based on meta-path is devised to discover the logical structure in natural texts, followed by a counterfactual data augmentation strategy to eliminate the information shortcut induced by pre-training. The experimental results on two challenging logical reasoning benchmarks, i.e., ReClor and LogiQA, demonstrate that our method outperforms the SOTA baselines with significant improvements.

pdf bib
Incorporating Dynamic Semantics into Pre-Trained Language Model for Aspect-based Sentiment Analysis
Kai Zhang | Kun Zhang | Mengdi Zhang | Hongke Zhao | Qi Liu | Wei Wu | Enhong Chen

Aspect-based sentiment analysis (ABSA) predicts sentiment polarity towards a specific aspect in the given sentence. While pre-trained language models such as BERT have achieved great success, incorporating dynamic semantic changes into ABSA remains challenging. To this end, we propose Dynamic Re-weighting BERT (DR-BERT), a novel method designed to learn dynamic aspect-oriented semantics for ABSA. Specifically, we first take the Stack-BERT layers as a primary encoder to grasp the overall semantics of the sentence and then fine-tune it by incorporating a lightweight Dynamic Re-weighting Adapter (DRA). Note that the DRA can pay close attention to a small region of the sentence at each step and re-weigh the vitally important words for better aspect-aware sentiment understanding. Finally, experimental results on three benchmark datasets demonstrate the effectiveness and the rationality of our proposed model and provide good interpretable insights for future semantic modeling.

pdf bib
Addressing Resource and Privacy Constraints in Semantic Parsing Through Data Augmentation
Kevin Yang | Olivia Deng | Charles Chen | Richard Shin | Subhro Roy | Benjamin Van Durme

We introduce a novel setup for low-resource task-oriented semantic parsing which incorporates several constraints that may arise in real-world scenarios: lack of similar datasets/models from a related domain, inability to sample useful logical forms directly from a grammar, and privacy requirements for unlabeled natural utterances. Our goal is to improve a low-resource semantic parser using utterances collected through user interactions. In this highly challenging but realistic setting, we investigate data augmentation approaches involving generating a set of structured canonical utterances corresponding to logical forms, before simulating corresponding natural language and filtering the resulting pairs. We find that such approaches are effective despite our restrictive setup: in a low-resource setting on the complex SMCalFlow calendaring dataset (Andreas et al.), we observe relative improvement over a non-data-augmented baseline in top-1 match.

pdf bib
Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics
Daniel Deutsch | Dan Roth

Question answering-based summarization evaluation metrics must automatically determine whether the QA model’s prediction is correct or not, a task known as answer verification. In this work, we benchmark the lexical answer verification methods which have been used by current QA-based metrics as well as two more sophisticated text comparison methods, BERTScore and LERC. We find that LERC out-performs the other methods in some settings while remaining statistically indistinguishable from lexical overlap in others. However, our experiments reveal that improved verification performance does not necessarily translate to overall QA-based metric quality: In some scenarios, using a worse verification method — or using none at all — has comparable performance to using the best verification method, a result that we attribute to properties of the datasets.

pdf bib
Chinese Synesthesia Detection: New Dataset and Models
Xiaotong Jiang | Qingqing Zhao | Yunfei Long | Zhongqing Wang

In this paper, we introduce a new task called synesthesia detection, which aims to extract the sensory word of a sentence and to predict the original and synesthetic sensory modalities of the corresponding sensory word. Synesthesia refers to the description of perceptions in one sensory modality through concepts from other modalities. It involves not only a linguistic phenomenon but also a cognitive phenomenon structuring human thought and action, which makes it a bridge between figurative linguistic phenomena and abstract cognition, and thus helpful for understanding deep semantics. To address this, we construct a large-scale human-annotated Chinese synesthesia dataset, which contains 7,217 annotated sentences accompanied by sensory words. Based on this dataset, we propose a family of strong and representative baseline models. Upon these baselines, we further propose a radical-based neural network model to identify the boundary of the sensory word and to jointly detect the original and synesthetic sensory modalities for the word. Through extensive experiments, we observe that the importance of the proposed task and dataset can be verified by the statistics and progressive performances. In addition, our proposed model achieves state-of-the-art results on the synesthesia dataset.

pdf bib
Zero-Shot Dense Retrieval with Momentum Adversarial Domain Invariant Representations
Ji Xin | Chenyan Xiong | Ashwin Srinivasan | Ankita Sharma | Damien Jose | Paul Bennett

Dense retrieval (DR) methods conduct text retrieval by first encoding texts in the embedding space and then matching them by nearest neighbor search. This requires strong locality properties from the representation space, e.g., close allocations of each small group of relevant texts, which are hard to generalize to domains without sufficient training data. In this paper, we aim to improve the generalization ability of DR models from source training domains with rich supervision signals to target domains without any relevance label, in the zero-shot setting. To achieve that, we propose Momentum adversarial Domain Invariant Representation learning (MoDIR), which introduces a momentum method to train a domain classifier that distinguishes source versus target domains, and then adversarially updates the DR encoder to learn domain invariant representations. Our experiments show that MoDIR robustly outperforms its baselines on 10+ ranking datasets collected in the BEIR benchmark in the zero-shot setup, with more than 10% relative gains on datasets with enough sensitivity for DR models’ evaluation. Source code is available at
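
The encode-then-match pipeline described in the first sentence can be sketched in a few lines. This is only an illustration of nearest neighbor search over embeddings, not of MoDIR itself: the vectors below are hand-written stand-ins for encoder outputs, cosine similarity is an assumed choice of metric, and the names `cosine` and `retrieve` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, doc_vecs, k=2):
    """Return the ids of the k documents most similar to the query, best first."""
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )
    return ranked[:k]


# Toy, hand-written embeddings standing in for the output of a text encoder.
docs = {"d1": [0.9, 0.1, 0.0], "d2": [0.0, 1.0, 0.0], "d3": [0.7, 0.3, 0.1]}
query = [1.0, 0.0, 0.0]
top = retrieve(query, docs, k=2)
```

Because ranking depends entirely on where the encoder places texts in the space, relevant texts must end up close together, which is the locality property the abstract refers to.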

pdf bib
Attention as Grounding: Exploring Textual and Cross-Modal Attention on Entities and Relations in Language-and-Vision Transformer
Nikolai Ilinykh | Simon Dobnik

We explore how a multi-modal transformer trained for generation of longer image descriptions learns syntactic and semantic representations about entities and relations grounded in objects at the level of masked self-attention (text generation) and cross-modal attention (information fusion). We observe that cross-attention learns the visual grounding of noun phrases into objects and high-level semantic information about spatial relations, while text-to-text attention captures low-level syntactic knowledge between words. We conclude that language models in a multi-modal task learn different semantic information about objects and relations cross-modally and uni-modally (text-only). Our code is available here:

pdf bib
Structural Supervision for Word Alignment and Machine Translation
Lei Li | Kai Fan | Hongjia Li | Chun Yuan

Syntactic structure has long been argued to be potentially useful for enforcing accurate word alignment and improving generalization performance of machine translation. Unfortunately, existing wisdom demonstrates its significance by considering only the syntactic structure of source tokens, neglecting the rich structural information from target tokens and the structural similarity between the source and target sentences. In this work, we propose to incorporate the syntactic structure of both source and target tokens into the encoder-decoder framework, tightly correlating the internal logic of word alignment and machine translation for multi-task learning. Particularly, we won’t leverage any annotated syntactic graph of the target side during training, so we introduce Dynamic Graph Convolution Networks (DGCN) on observed target tokens to sequentially and simultaneously generate the target tokens and the corresponding syntactic graphs, and further guide the word alignment. On this basis, Hierarchical Graph Random Walks (HGRW) are performed on the syntactic graphs of both source and target sides, for incorporating structured constraints on machine translation outputs. Experiments on four publicly available language pairs verify that our method is highly effective in capturing syntactic structure in different languages, consistently outperforming baselines in alignment accuracy and demonstrating promising results in translation quality.

pdf bib
Should We Trust This Summary? Bayesian Abstractive Summarization to The Rescue
Alexios Gidiotis | Grigorios Tsoumakas

We explore the notion of uncertainty in the context of modern abstractive summarization models, using the tools of Bayesian Deep Learning. Our approach approximates Bayesian inference by first extending state-of-the-art summarization models with Monte Carlo dropout and then using them to perform multiple stochastic forward passes. Based on Bayesian inference, we are able to effectively quantify uncertainty at prediction time. Having a reliable uncertainty measure, we can improve the experience of the end user by filtering out generated summaries of high uncertainty. Furthermore, uncertainty estimation could be used as a criterion for selecting samples for annotation, and can be paired nicely with active learning and human-in-the-loop approaches. Finally, Bayesian inference enables us to find a Bayesian summary which performs better than a deterministic one and is more robust to uncertainty. In practice, we show that our Variational Bayesian equivalents of BART and PEGASUS can outperform their deterministic counterparts on multiple benchmark datasets.
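
The core mechanism, keeping dropout active at prediction time and aggregating several stochastic forward passes, can be sketched on a toy model. This is an illustrative assumption-laden example, not the paper's summarization setup: a tiny linear scorer replaces BART or PEGASUS, dropout is applied to the inputs, and the spread of the outputs serves as the uncertainty estimate. The names `forward` and `mc_dropout_predict` are hypothetical.

```python
import random
import statistics


def forward(features, weights, drop_prob, rng):
    """One stochastic pass of a toy linear scorer with inverted dropout on inputs."""
    keep = 1.0 - drop_prob
    total = 0.0
    for x, w in zip(features, weights):
        # Each input survives with probability `keep` and is rescaled by 1/keep.
        if rng.random() < keep:
            total += (x / keep) * w
    return total


def mc_dropout_predict(features, weights, drop_prob=0.1, passes=50, seed=0):
    """Run several stochastic passes; return (mean prediction, standard deviation)."""
    rng = random.Random(seed)
    outputs = [forward(features, weights, drop_prob, rng) for _ in range(passes)]
    return statistics.mean(outputs), statistics.pstdev(outputs)


mean_score, uncertainty = mc_dropout_predict(
    [1.0, 2.0, 3.0], [0.5, 0.25, 0.1], drop_prob=0.5, passes=200
)
```

In a filtering setup such as the one described above, predictions whose standard deviation exceeds a chosen threshold would be withheld from the end user; the threshold itself is a design choice not specified here.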

pdf bib
On the data requirements of probing
Zining Zhu | Jixuan Wang | Bai Li | Frank Rudzicz

As large and powerful neural language models are developed, researchers have been increasingly interested in developing diagnostic tools to probe them. There are many papers with conclusions of the form “observation X is found in model Y”, using their own datasets with varying sizes. Larger probing datasets bring more reliability, but are also expensive to collect. There is yet to be a quantitative method for estimating reasonable probing dataset sizes. We tackle this omission in the context of comparing two probing configurations: after we have collected a small dataset from a pilot study, how many additional data samples are sufficient to distinguish two different configurations? We present a novel method to estimate the required number of data samples in such experiments and, across several case studies, we verify that our estimations have sufficient statistical power. Our framework helps to systematically construct probing datasets to diagnose neural NLP models.

pdf bib
Translation Error Detection as Rationale Extraction
Marina Fomicheva | Lucia Specia | Nikolaos Aletras

Recent Quality Estimation (QE) models based on multilingual pre-trained representations have achieved very competitive results in predicting the overall quality of translated sentences. However, detecting specifically which translated words are incorrect is a more challenging task, especially when dealing with limited amounts of training data. We hypothesize that, not unlike humans, successful QE models rely on translation errors to predict overall sentence quality. By exploring a set of feature attribution methods that assign relevance scores to the inputs to explain model predictions, we study the behaviour of state-of-the-art sentence-level QE models and show that explanations (i.e. rationales) extracted from these models can indeed be used to detect translation errors. We therefore (i) introduce a novel semi-supervised method for word-level QE; and (ii) propose to use the QE task as a new benchmark for evaluating the plausibility of feature attribution, i.e. how interpretable model explanations are to humans.

pdf bib
On Length Divergence Bias in Textual Matching Models
Lan Jiang | Tianshu Lyu | Yankai Lin | Meng Chong | Xiaoyong Lyu | Dawei Yin

Despite the remarkable success deep models have achieved in Textual Matching (TM) tasks, it still remains unclear whether they truly understand language or measure the semantic similarity of texts by exploiting statistical bias in datasets. In this work, we provide a new perspective to study this issue: the length divergence bias. We find that the length divergence heuristic widely exists in prevalent TM datasets, providing direct cues for prediction. To determine whether TM models have adopted this heuristic, we introduce an adversarial evaluation scheme which invalidates the heuristic. In this adversarial setting, all TM models perform worse, indicating they have indeed adopted this heuristic. Through a well-designed probing experiment, we empirically validate that the bias of TM models can be attributed in part to extracting the text length information during training. To alleviate the length divergence bias, we propose an adversarial training method. The results demonstrate that we successfully improve the robustness and generalization ability of models at the same time.