Pre trained language models have shown stellar performance in various downstream tasks But this usually comes at the cost of high latency and computation hindering their usage in resource limited settings In this work we propose a novel approach for reducing the computational cost of BERT with minimal loss in downstream performance Our method dynamically eliminates less contributing tokens through layers resulting in shorter lengths and consequently lower computational cost To determine the importance of each token representation we train a Contribution Predictor for each layer using a gradient based saliency method Our experiments on several diverse classification tasks show speedups up to 22x during inference time without much sacrifice in performance We also validate the quality of the selected tokens in our method using human annotations in the ERASER benchmark In comparison to other widely used strategies for selecting important tokens such as saliency and attention our proposed method has a significantly lower false positive rate in generating rationales Our code is freely available at https://github.com/amodaresi/AdapLeR.
Cross-lingual named entity recognition task is one of the critical problems for evaluating the potential transfer learning techniques on low resource languages. Knowledge distillation using pre-trained multilingual language models between source and target languages have shown their superiority in transfer. However, existing cross-lingual distillation models merely consider the potential transferability between two identical single tasks across both domains. Other possible auxiliary tasks to improve the learning performance have not been fully investigated. In this study, based on the knowledge distillation framework and multi-task learning, we introduce the similarity metric model as an auxiliary task to improve the cross-lingual NER performance on the target domain. Specifically, an entity recognizer and a similarity evaluator are first trained in parallel as two teachers from the source domain. Then, two tasks in the student model are supervised by these teachers simultaneously. Empirical studies on the three datasets across 7 different languages confirm the effectiveness of the proposed model.
Although current state-of-the-art Transformer-based solutions succeeded in a wide range for single-document NLP tasks, they still struggle to address multi-input tasks such as multi-document summarization. Many solutions truncate the inputs, thus ignoring potential summary-relevant contents, which is unacceptable in the medical domain where each information can be vital. Others leverage linear model approximations to apply multi-input concatenation, worsening the results because all information is considered, even if it is conflicting or noisy with respect to a shared background. Despite the importance and social impact of medicine, there are no ad-hoc solutions for multi-document summarization. For this reason, we propose a novel discriminative marginalized probabilistic method (DAMEN) trained to discriminate critical information from a cluster of topic-related medical documents and generate a multi-document summary via token probability marginalization. Results prove we outperform the previous state-of-the-art on a biomedical dataset for multi-document summarization of systematic literature reviews. Moreover, we perform extensive ablation studies to motivate the design choices and prove the importance of each module of our method.
Applying existing methods to emotional support conversation—which provides valuable assistance to people who are in need—has two major limitations: (a) they generally employ a conversation-level emotion label, which is too coarse-grained to capture user’s instant mental state; (b) most of them focus on expressing empathy in the response(s) rather than gradually reducing user’s distress. To address the problems, we propose a novel model \\textbf{MISC}, which firstly infers the user’s fine-grained emotional status, and then responds skillfully using a mixture of strategy. Experimental results on the benchmark dataset demonstrate the effectiveness of our method and reveal the benefits of fine-grained emotion understanding as well as mixed-up strategy modeling.
Modelling prosody variation is critical for synthesizing natural and expressive speech in end-to-end text-to-speech (TTS) systems. In this paper, a cross-utterance conditional VAE (CUC-VAE) is proposed to estimate a posterior probability distribution of the latent prosody features for each phoneme by conditioning on acoustic features, speaker information, and text features obtained from both past and future sentences. At inference time, instead of the standard Gaussian distribution used by VAE, CUC-VAE allows sampling from an utterance-specific prior distribution conditioned on cross-utterance information, which allows the prosody features generated by the TTS system to be related to the context and is more similar to how humans naturally produce prosody. The performance of CUC-VAE is evaluated via a qualitative listening test for naturalness, intelligibility and quantitative measurements, including word error rates and the standard deviation of prosody attributes. Experimental results on LJ-Speech and LibriTTS data show that the proposed CUC-VAE TTS system improves naturalness and prosody diversity with clear margins.
Understanding causality has vital importance for various Natural Language Processing (NLP) applications. Beyond the labeled instances, conceptual explanations of the causality can provide deep understanding of the causal fact to facilitate the causal reasoning process. However, such explanation information still remains absent in existing causal reasoning resources. In this paper, we fill this gap by presenting a human-annotated explainable CAusal REasoning dataset (e-CARE), which contains over 20K causal reasoning questions, together with natural language formed explanations of the causal questions. Experimental results show that generating valid explanations for causal facts still remains especially challenging for the state-of-the-art models, and the explanation information can be helpful for promoting the accuracy and stability of causal reasoning models.
Building models of natural language processing (NLP) is challenging in low-resource scenarios where limited data are available. Optimization-based meta-learning algorithms achieve promising results in low-resource scenarios by adapting a well-generalized model initialization to handle new tasks. Nonetheless, these approaches suffer from the memorization overfitting issue, where the model tends to memorize the meta-training tasks while ignoring support sets when adapting to new tasks. To address this issue, we propose a memory imitation meta-learning (MemIML) method that enhances the model’s reliance on support sets for task adaptation. Specifically, we introduce a task-specific memory module to store support set information and construct an imitation module to force query sets to imitate the behaviors of support sets stored in the memory. A theoretical analysis is provided to prove the effectiveness of our method, and empirical results also demonstrate that our method outperforms competitive baselines on both text classification and generation tasks.
The goal of meta-learning is to learn to adapt to a new task with only a few labeled examples. Inspired by the recent progress in large language models, we propose \\textit{in-context tuning} (ICT), which recasts task adaptation and prediction as a simple sequence prediction problem: to form the input sequence, we concatenate the task instruction, labeled in-context examples, and the target input to predict; to meta-train the model to learn from in-context examples, we fine-tune a pre-trained language model (LM) to predict the target label given the input sequence on a collection of tasks.We benchmark our method on two collections of text classification tasks: LAMA and BinaryClfs. Compared to MAML which adapts the model through gradient descent, our method leverages the inductive bias of pre-trained LMs to perform pattern matching, and outperforms MAML by an absolute 6% average AUC-ROC score on BinaryClfs, gaining more advantage with increasing model size. Compared to non-fine-tuned in-context learning (i.e. prompting a raw LM), in-context tuning meta-trains the model to learn from in-context examples. On BinaryClfs, ICT improves the average AUC-ROC score by an absolute 10%, and reduces the variance due to example ordering by 6x and example choices by 2x.
Weakly-supervised learning (WSL) has shown promising results in addressing label scarcity on many NLP tasks, but manually designing a comprehensive, high-quality labeling rule set is tedious and difficult. We study interactive weakly-supervised learning—the problem of iteratively and automatically discovering novel labeling rules from data to improve the WSL model. Our proposed model, named PRBoost, achieves this goal via iterative prompt-based rule discovery and model boosting. It uses boosting to identify large-error instances and discovers candidate rules from them by prompting pre-trained LMs with rule templates. The candidate rules are judged by human experts, and the accepted rules are used to generate complementary weak labels and strengthen the current model. Experiments on four tasks show PRBoost outperforms state-of-the-art WSL baselines up to 7.1%, and bridges the gaps with fully supervised models.
Document structure is critical for efficient information consumption. However, it is challenging to encode it efficiently into the modern Transformer architecture. In this work, we present HIBRIDS, which injects Hierarchical Biases foR Incorporating Document Structure into attention score calculation. We further present a new task, hierarchical question-summary generation, for summarizing salient content in the source document into a hierarchy of questions and summaries, where each follow-up question inquires about the content of its parent question-summary pair. We also annotate a new dataset with 6,153 question-summary hierarchies labeled on government reports. Experiment results show that our model produces better question-summary hierarchies than comparisons on both hierarchy quality and content coverage, a finding also echoed by human judges. Additionally, our model improves the generation of long-form summaries from long government reports and Wikipedia articles, as measured by ROUGE scores.
Named entity recognition (NER) is a fundamental task to recognize specific types of entities from a given sentence. Depending on how the entities appear in the sentence, it can be divided into three subtasks, namely, Flat NER, Nested NER, and Discontinuous NER. Among the existing approaches, only the generative model can be uniformly adapted to these three subtasks. However, when the generative model is applied to NER, its optimization objective is not consistent with the task, which makes the model vulnerable to the incorrect biases. In this paper, we analyze the incorrect biases in the generation process from a causality perspective and attribute them to two confounders: pre-context confounder and entity-order confounder. Furthermore, we design Intra- and Inter-entity Deconfounding Data Augmentation methods to eliminate the above confounders according to the theory of backdoor adjustment. Experiments show that our method can improve the performance of the generative NER model in various datasets.
Multilingual pre-trained models are able to zero-shot transfer knowledge from rich-resource to low-resource languages in machine reading comprehension (MRC). However, inherent linguistic discrepancies in different languages could make answer spans predicted by zero-shot transfer violate syntactic constraints of the target language. In this paper, we propose a novel multilingual MRC framework equipped with a Siamese Semantic Disentanglement Model (S2DM) to disassociate semantics from syntax in representations learned by multilingual pre-trained models. To explicitly transfer only semantic knowledge to the target language, we propose two groups of losses tailored for semantic and syntactic encoding and disentanglement. Experimental results on three multilingual MRC datasets (i.e., XQuAD, MLQA, and TyDi QA) demonstrate the effectiveness of our proposed approach over models based on mBERT and XLM-100.
Tables are often created with hierarchies but existing works on table reasoning mainly focus on flat tables and neglect hierarchical tables Hierarchical tables challenge numerical reasoning by complex hierarchical indexing as well as implicit relationships of calculation and semantics We present a new dataset HiTab to study question answering QA and natural language generation NLG over hierarchical tables HiTab is a cross domain dataset constructed from a wealth of statistical reports and Wikipedia pages and has unique characteristics nearly all tables are hierarchical and QA pairs are not proposed by annotators from scratch but are revised from real and meaningful sentences authored by analysts to reveal complex numerical reasoning in statistical reports we provide fine grained annotations of quantity and entity alignment Experiments suggest that this HiTab presents a strong challenge for existing baselines and a valuable benchmark for future research Targeting hierarchical structure we devise a hierarchy aware logical form for symbolic reasoning over tables which shows high effectiveness Targeting table reasoning we leverage entity and quantity alignment to explore partially supervised training in QA and conditional generation in NLG and largely reduce spurious predictions in QA and produce better descriptions in NLG
Tables store rich numerical data, but numerical reasoning over tables is still a challenge. In this paper, we find that the spreadsheet formula, a commonly used language to perform computations on numerical values in spreadsheets, is a valuable supervision for numerical reasoning in tables. Considering large amounts of spreadsheets available on the web, we propose FORTAP, the first exploration to leverage spreadsheet formulas for table pretraining. Two novel self-supervised pretraining objectives are derived from formulas, numerical reference prediction (NRP) and numerical calculation prediction (NCP). While our proposed objectives are generic for encoders, to better capture spreadsheet table layouts and structures, FORTAP is built upon TUTA, the first transformer-based method for spreadsheet table pretraining with tree attention. FORTAP outperforms state-of-the-art methods by large margins on three representative datasets of formula prediction, question answering, and cell type classification, showing the great potential of leveraging formulas for table pretraining.
Pre-trained sequence-to-sequence language models have led to widespread success in many natural language generation tasks. However, there has been relatively less work on analyzing their ability to generate structured outputs such as graphs. Unlike natural language, graphs have distinct structural and semantic properties in the context of a downstream NLP task, e.g., generating a graph that is connected and acyclic can be attributed to its structural constraints, while the semantics of a graph can refer to how meaningfully an edge represents the relation between two node concepts. In this work, we study pre-trained language models that generate explanation graphs in an end-to-end manner and analyze their ability to learn the structural constraints and semantics of such graphs. We first show that with limited supervision, pre-trained language models often generate graphs that either violate these constraints or are semantically incoherent. Since curating large amount of human-annotated graphs is expensive and tedious, we propose simple yet effective ways of graph perturbations via node and edge edit operations that lead to structurally and semantically positive and negative graphs. Next, we leverage these graphs in different contrastive learning models with Max-Margin and InfoNCE losses. Our methods lead to significant improvements in both structural and semantic accuracy of explanation graphs and also generalize to other similar graph generation tasks. Lastly, we show that human errors are the best negatives for contrastive learning and also that automatically generating more such human-like negative graphs can lead to further improvements.
Sentence compression reduces the length of text by removing non-essential content while preserving important facts and grammaticality. Unsupervised objective driven methods for sentence compression can be used to create customized models without the need for ground-truth training data, while allowing flexibility in the objective function(s) that are used for learning and inference. Recent unsupervised sentence compression approaches use custom objectives to guide discrete search; however, guided search is expensive at inference time. In this work, we explore the use of reinforcement learning to train effective sentence compression models that are also fast when generating predictions. In particular, we cast the task as binary sequence labelling and fine-tune a pre-trained transformer using a simple policy gradient approach. Our approach outperforms other unsupervised models while also being more efficient at inference time.
Machine reading comprehension is a heavily-studied research and test field for evaluating new pre-trained language models (PrLMs) and fine-tuning strategies, and recent studies have enriched the pre-trained language models with syntactic, semantic and other linguistic information to improve the performance of the models. In this paper, we imitate the human reading process in connecting the anaphoric expressions and explicitly leverage the coreference information of the entities to enhance the word embeddings from the pre-trained language model, in order to highlight the coreference mentions of the entities that must be identified for coreference-intensive question answering in QUOREF, a relatively new dataset that is specifically designed to evaluate the coreference-related performance of a model. We use two strategies to fine-tune a pre-trained language model, namely, placing an additional encoder layer after a pre-trained language model to focus on the coreference mentions or constructing a relational graph convolutional network to model the coreference relations. We demonstrate that the explicit incorporation of coreference information in the fine-tuning stage performs better than the incorporation of the coreference information in pre-training a language model.
We contribute a new dataset for the task of automated fact checking and an evaluation of state of the art algorithms The dataset includes claims from speeches interviews social media and news articles review articles published by professional fact checkers and premise articles used by those professional fact checkers to support their review and verify the veracity of the claims An important challenge in the use of premise articles is the identification of relevant passages that will help to infer the veracity of a claim We show that transferring a dense passage retrieval model trained with review articles improves the retrieval quality of passages in premise articles We report results for the prediction of claim veracity by inference from premise articles
Machine Translation Quality Estimation QE aims to build predictive models to assess the quality of machine generated translations in the absence of reference translations While state of the art QE models have been shown to achieve good results they over rely on features that do not have a causal impact on the quality of a translation In particular there appears to be a partial input bias i.e. a tendency to assign high quality scores to translations that are fluent and grammatically correct even though they do not preserve the meaning of the source We analyse the partial input bias in further detail and evaluate four approaches to use auxiliary tasks for bias mitigation Two approaches use additional data to inform and support the main task while the other two are adversarial actively discouraging the model from learning the bias We compare the methods with respect to their ability to reduce the partial input bias while maintaining the overall performance We find that training a multitask architecture with an auxiliary binary classification task that utilises additional augmented data best achieves the desired effects and generalises well to different languages and quality metrics
Round trip Machine Translation MT is a popular choice for paraphrase generation which leverages readily available parallel corpora for supervision In this paper we formalize the implicit similarity function induced by this approach and show that it is susceptible to non paraphrase pairs sharing a single ambiguous translation Based on these insights we design an alternative similarity metric that mitigates this issue by requiring the entire translation distribution to match and implement a relaxation of it through the Information Bottleneck method Our approach incorporates an adversarial term into MT training in order to learn representations that encode as much information about the reference translation as possible while keeping as little information about the input as possible Paraphrases can be generated by decoding back to the source from this representation without having to generate pivot translations In addition to being more principled and efficient than round trip MT our approach offers an adjustable parameter to control the fidelity diversity trade off and obtains better results in our experiments
Fine-tuning the entire set of parameters of a large pretrained model has become the mainstream approach for transfer learning. To increase its efficiency and prevent catastrophic forgetting and interference, techniques like adapters and sparse fine-tuning have been developed. Adapters are modular, as they can be combined to adapt a model towards different facets of knowledge (e.g., dedicated language and/or task adapters). Sparse fine-tuning is expressive, as it controls the behavior of all model components. In this work, we introduce a new fine-tuning method with both these desirable properties. In particular, we learn sparse, real-valued masks based on a simple variant of the Lottery Ticket Hypothesis. Task-specific masks are obtained from annotated data in a source language, and language-specific masks from masked language modeling in a target language. Both these masks can then be composed with the pretrained model. Unlike adapter-based fine-tuning, this method neither increases the number of parameters at inference time nor alters the original model architecture. Most importantly, it outperforms adapters in zero-shot cross-lingual transfer by a large margin in a series of multilingual benchmarks, including Universal Dependencies, MasakhaNER, and AmericasNLI. Based on an in-depth analysis, we additionally find that sparsity is crucial to prevent both 1) interference between the fine-tunings to be composed and 2) overfitting. We release the code and models at https://github.com/cambridgeltl/composable-sft.
Crowdsourcing has emerged as a popular approach for collecting annotated data to train supervised machine learning models However annotator bias can lead to defective annotations Though there are a few works investigating individual annotator bias the group effects in annotators are largely overlooked In this work we reveal that annotators within the same demographic group tend to show consistent group bias in annotation tasks and thus we conduct an initial study on annotator group bias We first empirically verify the existence of annotator group bias in various real world crowdsourcing datasets Then we develop a novel probabilistic graphical framework GroupAnno to capture annotator group bias with an extended Expectation Maximization EM algorithm We conduct experiments on both synthetic and real world datasets Experimental results demonstrate the effectiveness of our model in modeling annotator group bias in label aggregation and model learning over competitive baselines
Interactive neural machine translation (INMT) is able to guarantee high-quality translations by taking human interactions into account. Existing IMT systems relying on lexical constrained decoding (LCD) enable humans to translate in a flexible translation order beyond the left-to-right. However, they typically suffer from two significant limitations in translation efficiency and quality due to the reliance on LCD. In this work, we propose a novel BiTIIMT system, Bilingual Text-Infilling for Interactive Neural Machine Translation. The key idea to BiTIIMT is Bilingual Text-infilling (BiTI) which aims to fill missing segments in a manually revised translation for a given source sentence. We propose a simple yet effective solution by casting this task as a sequence-to-sequence task. In this way, our system performs decoding without explicit constraints and makes full use of revised words for better translation prediction. Experiment results show that BiTiIMT performs significantly better and faster than state-of-the-art LCD-based IMT on three translation tasks.
Fine-grained Entity Typing (FET) has made great progress based on distant supervision but still suffers from label noise. Existing FET noise learning methods rely on prediction distributions in an instance-independent manner, which causes the problem of confirmation bias. In this work, we propose a clustering-based loss correction framework named Feature Cluster Loss Correction (FCLC), to address these two problems. FCLC first train a coarse backbone model as a feature extractor and noise estimator. Loss correction is then applied to each feature cluster, learning directly from the noisy labels. Experimental results on three public datasets show that FCLC achieves the best performance over existing competitive systems. Auxiliary experiments further demonstrate that FCLC is stable to hyperparameters and it does help mitigate confirmation bias. We also find that in the extreme case of no clean data, the FCLC framework still achieves competitive performance.
The robustness of Text-to-SQL parsers against adversarial perturbations plays a crucial role in delivering highly reliable applications. Previous studies along this line primarily focused on perturbations in the natural language question side, neglecting the variability of tables. Motivated by this, we propose the Adversarial Table Perturbation (ATP) as a new attacking paradigm to measure robustness of Text-to-SQL models. Following this proposition, we curate ADVETA, the first robustness evaluation benchmark featuring natural and realistic ATPs. All tested state-of-the-art models experience dramatic performance drops on ADVETA, revealing significant room of improvement. To defense against ATP, we build a systematic adversarial training example generation framework tailored for better contextualization of tabular data. Experiments show that our approach brings models best robustness improvement against ATP, while also substantially boost model robustness against NL-side perturbations. We will release ADVETA and code to facilitate future research.
Human languages are full of metaphorical expressions. Metaphors help people understand the world by connecting new concepts and domains to more familiar ones. Large pre-trained language models (PLMs) are therefore assumed to encode metaphorical knowledge useful for NLP systems. In this paper, we investigate this hypothesis for PLMs, by probing metaphoricity information in their encodings, and by measuring the cross-lingual and cross-dataset generalization of this information. We present studies in multiple metaphor detection datasets and in four languages (i.e., English, Spanish, Russian, and Farsi). Our extensive experiments suggest that contextual representations in PLMs do encode metaphorical knowledge, and mostly in their middle layers. The knowledge is transferable between languages and datasets, especially when the annotation is consistent across training and testing sets. Our findings give helpful insights for both cognitive and NLP scientists.
In recent years, researchers tend to pre-train ever-larger language models to explore the upper limit of deep models. However, large language model pre-training costs intensive computational resources, and most of the models are trained from scratch without reusing the existing pre-trained models, which is wasteful. In this paper, we propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model through parameter initialization and significantly improve the pre-training efficiency of the large model. Specifically, we extend the previous function-preserving method proposed in computer vision on the Transformer-based language model, and further improve it by proposing a novel method, advanced knowledge for large model’s initialization. In addition, a two-stage learning method is proposed to further accelerate the pre-training. We conduct extensive experiments on representative PLMs (e.g., BERT and GPT) and demonstrate that (1) our method can save a significant amount of training cost compared with baselines including learning from scratch, StackBERT and MSLT; (2) our method is generic and applicable to different types of pre-trained models. In particular, bert2BERT saves about 45% and 47% computational cost of pre-training BERT_{\\rm BASE} and GPT_{\\rm BASE} by reusing the models of almost their half sizes.
Hedges have an important role in the management of rapport. In peer-tutoring, they are notably used by tutors in dyads experiencing low rapport to tone down the impact of instructions and negative feedback.Pursuing the objective of building a tutoring agent that manages rapport with teenagers in order to improve learning, we used a multimodal peer-tutoring dataset to construct a computational framework for identifying hedges. We compared approaches relying on pre-trained resources with others that integrate insights from the social science literature. Our best performance involved a hybrid approach that outperforms the existing baseline while being easier to interpret. We employ a model explainability tool to explore the features that characterize hedges in peer-tutoring conversations, and we identify some novel features, and the benefits of a such a hybrid model approach.
\n k-Nearest-Neighbor Machine Translation (kNN-MT) has been recently proposed as a non-parametric solution for domain adaptation in neural machine translation (NMT). It aims to alleviate the performance degradation of advanced MT systems in translating out-of-domain sentences by coordinating with an additional token-level feature-based retrieval module constructed from in-domain data. Previous studies (Khandelwal et al., 2021; Zheng et al., 2021) have already demonstrated that non-parametric NMT is even superior to models fine-tuned on out-of-domain data. In spite of this success, kNN retrieval is at the expense of high latency, in particular for large datastores. To make it practical, in this paper, we explore a more efficient kNN-MT and propose to use clustering to improve the retrieval efficiency. Concretely, we first propose a cluster-based Compact Network for feature reduction in a contrastive learning manner to compress context features into 90+% lower dimensional vectors. We then suggest a cluster-based pruning solution to filter out 10% 40% redundant nodes in large datastores while retaining translation quality. Our proposed methods achieve better or comparable performance while reducing up to 57% inference latency against the advanced non-parametric MT model on several machine translation benchmarks. Experimental results indicate that the proposed methods maintain the most useful information of the original datastore and the Compact Network shows good generalization on unseen domains. Codes are available at https://github.com/tjunlp-lab/PCKMT.
We propose a new method for projective dependency parsing based on headed spans. In a projective dependency tree, the largest subtree rooted at each word covers a contiguous sequence (i.e., a span) in the surface order. We call such a span marked by a root word headed span. A projective dependency tree can be represented as a collection of headed spans. We decompose the score of a dependency tree into the scores of the headed spans and design a novel O(n^3) dynamic programming algorithm to enable global training and exact inference. Our model achieves state-of-the-art or competitive results on PTB, CTB, and UD
Recent works on Lottery Ticket Hypothesis have shown that pre-trained language models (PLMs) contain smaller matching subnetworks(winning tickets) which are capable of reaching accuracy comparable to the original models. However, these tickets are proved to be notrobust to adversarial examples, and even worse than their PLM counterparts. To address this problem, we propose a novel method based on learning binary weight masks to identify robust tickets hidden in the original PLMs. Since the loss is not differentiable for the binary mask, we assign the hard concrete distribution to the masks and encourage their sparsity using a smoothing approximation of L0 regularization.Furthermore, we design an adversarial loss objective to guide the search for robust tickets and ensure that the tickets perform well bothin accuracy and robustness. Experimental results show the significant improvement of the proposed method over previous work on adversarial robustness evaluation.
Traditionally, a debate usually requires a manual preparation process, including reading plenty of articles, selecting the claims, identifying the stances of the claims, seeking the evidence for the claims, etc. As the AI debate attracts more attention these years, it is worth exploring the methods to automate the tedious process involved in the debating system. In this work, we introduce a comprehensive and large dataset named IAM, which can be applied to a series of argument mining tasks, including claim extraction, stance classification, evidence extraction, etc. Our dataset is collected from over 1k articles related to 123 topics. Near 70k sentences in the dataset are fully annotated based on their argument properties (e.g., claims, stances, evidence, etc.). We further propose two new integrated argument mining tasks associated with the debate preparation process: (1) claim extraction with stance classification (CESC) and (2) claim-evidence pair extraction (CEPE). We adopt a pipeline approach and an end-to-end method for each integrated task separately. Promising experimental results are reported to show the values and challenges of our proposed tasks, and motivate future research on argument mining.
Existing reference-free metrics have obvious limitations for evaluating controlled text generation models. Unsupervised metrics can only provide a task-agnostic evaluation result which correlates weakly with human judgments, whereas supervised ones may overfit task-specific data with poor generalization ability to other datasets. In this paper, we propose an unsupervised reference-free metric called CTRLEval, which evaluates controlled text generation from different aspects by formulating each aspect into multiple text infilling tasks. On top of these tasks, the metric assembles the generation probabilities from a pre-trained language model without any model training. Experimental results show that our metric has higher correlations with human judgments than other baselines, while obtaining better generalization of evaluating generated texts from different models and with different qualities.
Knowledge distillation (KD) is the preliminary step for training non-autoregressive translation (NAT) models, which eases the training of NAT models at the cost of losing important information for translating low-frequency words. In this work, we provide an appealing alternative for NAT – monolingual KD, which trains NAT student on external monolingual data with AT teacher trained on the original bilingual data. Monolingual KD is able to transfer both the knowledge of the original bilingual data (implicitly encoded in the trained AT teacher model) and that of the new monolingual data to the NAT student model. Extensive experiments on eight WMT benchmarks over two advanced NAT models show that monolingual KD consistently outperforms the standard KD by improving low-frequency word translation, without introducing any computational cost. Monolingual KD enjoys desirable expandability, which can be further enhanced (when given more computational budget) by combining with the standard KD, a reverse monolingual KD, or enlarging the scale of monolingual data. Extensive analyses demonstrate that these techniques can be used together profitably to further recall the useful information lost in the standard KD. Encouragingly, combining with standard KD, our approach achieves 30.4 and 34.1 BLEU points on the WMT14 English-German and German-English datasets, respectively. Our code and trained models are freely available at https://github.com/alphadl/RLFW-NAT.mono.
Progress with supervised Open Information Extraction (OpenIE) has been primarily limited to English due to the scarcity of training data in other languages. In this paper, we explore techniques to automatically convert English text for training OpenIE systems in other languages. We introduce the Alignment-Augmented Constrained Translation (AACTrans) model to translate English sentences and their corresponding extractions consistently with each other — with no changes to vocabulary or semantic meaning which may result from independent translations. Using the data generated with AACTrans, we train a novel two-stage generative OpenIE model, which we call Gen2OIE, that outputs for each sentence: 1) relations in the first stage and 2) all extractions containing the relation in the second stage. Gen2OIE increases relation coverage using a training data transformation technique that is generalizable to multiple languages, in contrast to existing models that use an English-specific training loss. Evaluations on 5 languages — Spanish, Portuguese, Chinese, Hindi and Telugu — show that the Gen2OIE with AACTrans data outperforms prior systems by a margin of 6-25% in F1.
We study a new problem setting of information extraction (IE), referred to as text-to-table. In text-to-table, given a text, one creates a table or several tables expressing the main content of the text, while the model is learned from text-table pair data. The problem setting differs from those of the existing methods for IE. First, the extraction can be carried out from long texts to large tables with complex structures. Second, the extraction is entirely data-driven, and there is no need to explicitly define the schemas. As far as we know, there has been no previous work that studies the problem. In this work, we formalize text-to-table as a sequence-to-sequence (seq2seq) problem. We first employ a seq2seq model fine-tuned from a pre-trained language model to perform the task. We also develop a new method within the seq2seq approach, exploiting two additional techniques in table generation: table constraint and table relation embeddings. We consider text-to-table as an inverse problem of the well-studied table-to-text, and make use of four existing table-to-text datasets in our experiments on text-to-table. Experimental results show that the vanilla seq2seq model can outperform the baseline methods of using relation extraction and named entity extraction. The results also show that our method can further boost the performances of the vanilla seq2seq model. We further discuss the main challenges of the proposed task. The code and data are available at https://github.com/shirley-wu/text_to_table.
Code search is to search reusable code snippets from source code corpus based on natural languages queries Deep learning based methods on code search have shown promising results However previous methods focus on retrieval accuracy but lacked attention to the efficiency of the retrieval process We propose a novel method CoSHC to accelerate code search with deep hashing and code classification aiming to perform efficient code search without sacrificing too much accuracy To evaluate the effectiveness of CoSHC we apply our method \n on five code search models Extensive experimental results indicate that compared with previous code search baselines CoSHC can save more than of retrieval time meanwhile preserving at least of retrieval accuracy
When working with textual data a natural application of disentangled representations is the fair classification where the goal is to make predictions without being biased or influenced by sensible attributes that may be present in the data e.g. age gender or race Dominant approaches to disentangle a sensitive attribute from textual representations rely on learning simultaneously a penalization term that involves either an adversary loss e.g. a discriminator or an information measure e.g. mutual information However these methods require the training of a deep neural network with several parameter updates for each update of the representation model As a matter of fact the resulting nested optimization loop is both times consuming adding complexity to the optimization dynamic and requires a fine hyperparameter selection e.g. learning rates architecture In this work we introduce a family of regularizers for learning disentangled representations that do not require training These regularizers are based on statistical measures of similarity between the conditional probability distributions with respect to the sensible attributes Our novel regularizers do not require additional training are faster and do not involve additional tuning while achieving better results both when combined with pretrained and randomly initialized text encoders
Due to high data demands of current methods, attention to zero-shot cross-lingual spoken language understanding (SLU) has grown, as such approaches greatly reduce human annotation effort. However, existing models solely rely on shared parameters, which can only perform implicit alignment across languages. We present Global-Local Contrastive Learning Framework (GL-CLeF) to address this shortcoming. Specifically, we employ contrastive learning, leveraging bilingual dictionaries to construct multilingual views of the same utterance, then encourage their representations to be more similar than negative example pairs, which achieves to explicitly align representations of similar sentences across languages. In addition, a key step in GL-CLeF is a proposed Local and Global component, which achieves a fine-grained cross-lingual transfer (i.e., sentence-level Local intent transfer, token-level Local slot transfer, and semantic-level Global transfer across intent and slot). Experiments on MultiATIS++ show that GL-CLeF achieves the best performance and successfully pulls representations of similar sentences across languages closer.
Recent advances in prompt-based learning have shown strong results on few-shot text classification by using cloze-style templates.Similar attempts have been made on named entity recognition (NER) which manually design templates to predict entity types for every text span in a sentence. However, such methods may suffer from error propagation induced by entity span detection, high cost due to enumeration of all possible text spans, and omission of inter-dependencies among token labels in a sentence. Here we present a simple demonstration-based learning method for NER, which lets the input be prefaced by task demonstrations for in-context learning. We perform a systematic study on demonstration strategy regarding what to include (entity examples, with or without surrounding context), how to select the examples, and what templates to use. Results on in-domain learning and domain adaptation show that the model’s performance in low-resource settings can be largely improved with a suitable demonstration strategy (e.g., a 4-17% improvement on 25 train instances). We also find that good demonstration can save many labeled examples and consistency in demonstration contributes to better performance.
News events are often associated with quantities (e.g., the number of COVID-19 patients or the number of arrests in a protest), and it is often important to extract their type, time, and location from unstructured text in order to analyze these quantity events. This paper thus formulates the NLP problem of spatiotemporal quantity extraction, and proposes the first meta-framework for solving it. This meta-framework contains a formalism that decomposes the problem into several information extraction tasks, a shareable crowdsourcing pipeline, and transformer-based baseline models. We demonstrate the meta-framework in three domains—the COVID-19 pandemic, Black Lives Matter protests, and 2020 California wildfires—to show that the formalism is general and extensible, the crowdsourcing pipeline facilitates fast and high-quality data annotation, and the baseline system can handle spatiotemporal quantity extraction well enough to be practically useful. We release all resources for future research on this topic at https://github.com/steqe.
Knowledge graph embedding (KGE) models represent each entity and relation of a knowledge graph (KG) with low-dimensional embedding vectors. These methods have recently been applied to KG link prediction and question answering over incomplete KGs (KGQA). KGEs typically create an embedding for each entity in the graph, which results in large model sizes on real-world graphs with millions of entities. For downstream tasks these atomic entity representations often need to be integrated into a multi stage pipeline, limiting their utility. We show that an off-the-shelf encoder-decoder Transformer model can serve as a scalable and versatile KGE model obtaining state-of-the-art results for KG link prediction and incomplete KG question answering. We achieve this by posing KG link prediction as a sequence-to-sequence task and exchange the triple scoring approach taken by prior KGE methods with autoregressive decoding. Such a simple but powerful method reduces the model size up to 98% compared to conventional KGE models while keeping inference time tractable. After finetuning this model on the task of KGQA over incomplete KGs, our approach outperforms baselines on multiple large-scale datasets without extensive hyperparameter tuning.
We propose fill-in-the-blanks as a video understanding evaluation framework and introduce FIBER – a novel dataset consisting of 28,000 videos and descriptions in support of this evaluation framework. The fill-in-the-blanks setting tests a model’s understanding of a video by requiring it to predict a masked noun phrase in the caption of the video, given the video and the surrounding text. The FIBER benchmark does not share the weaknesses of the current state-of-the-art language-informed video understanding tasks, namely: (1) video question answering using multiple-choice questions, where models perform relatively well because they exploit linguistic biases in the task formulation, thus making our framework challenging for the current state-of-the-art systems to solve; and (2) video captioning, which relies on an open-ended evaluation framework that is often inaccurate because system answers may be perceived as incorrect if they differ in form from the ground truth. The FIBER dataset and our code are available at https://lit.eecs.umich.edu/fiber/.
Currently, Medical Subject Headings (MeSH) are manually assigned to every biomedical article published and subsequently recorded in the PubMed database to facilitate retrieving relevant information. With the rapid growth of the PubMed database, large-scale biomedical document indexing becomes increasingly important. MeSH indexing is a challenging task for machine learning, as it needs to assign multiple labels to each article from an extremely large hierachically organized collection. To address this challenge, we propose KenMeSH, an end-to-end model that combines new text features and a dynamic knowledge-enhanced mask attention that integrates document features with MeSH label hierarchy and journal correlation features to index MeSH terms. Experimental results show the proposed method achieves state-of-the-art performance on a number of measures.
Effective question asking is a crucial component of a successful conversational chatbot It could help the bots manifest empathy and render the interaction more engaging by demonstrating attention to the speaker’s emotions However current dialog generation approaches do not model this subtle emotion regulation technique due to the lack of a taxonomy of questions and their purpose in social chitchat To address this gap we have developed an empathetic question taxonomy EQT with special attention paid to questions ability to capture communicative acts and their emotion regulation intents We further design a crowd sourcing task to annotate a large subset of the EmpatheticDialogues dataset with the established labels We use the crowd annotated data to develop automatic labeling tools and produce labels for the whole dataset Finally we employ information visualization techniques to summarize co occurrences of question acts and intents and their role in regulating interlocutors emotion These results reveal important question asking strategies in social dialogs The EQT classification scheme can facilitate computational analysis of questions in datasets More importantly it can inform future efforts in empathetic question generation using neural or hybrid methods
Aspect Sentiment Triplet Extraction (ASTE) is an emerging sentiment analysis task. Most of the existing studies focus on devising a new tagging scheme that enables the model to extract the sentiment triplets in an end-to-end fashion. However, these methods ignore the relations between words for ASTE task. In this paper, we propose an Enhanced Multi-Channel Graph Convolutional Network model (EMC-GCN) to fully utilize the relations between words. Specifically, we first define ten types of relations for ASTE task, and then adopt a biaffine attention module to embed these relations as an adjacent tensor between words in a sentence. After that, our EMC-GCN transforms the sentence into a multi-channel graph by treating words and the relation adjacent tensor as nodes and edges, respectively. Thus, relation-aware node representations can be learnt. Furthermore, we consider diverse linguistic features to enhance our EMC-GCN model. Finally, we design an effective refining strategy on EMC-GCN for word-pair representation refinement, which considers the implicit results of aspect and opinion extraction when determining whether word pairs match or not. Extensive experimental results on the benchmark datasets demonstrate that the effectiveness and robustness of our proposed model, which outperforms state-of-the-art methods significantly.
We present an incremental syntactic representation that consists of assigning a single discrete label to each word in a sentence where the label is predicted using strictly incremental processing of a prefix of the sentence and the sequence of labels for a sentence fully determines a parse tree Our goal is to induce a syntactic representation that commits to syntactic choices only as they are incrementally revealed by the input in contrast with standard representations that must make output choices such as attachments speculatively and later throw out conflicting analyses Our learned representations achieve 93.72 F1 on the Penn Treebank with as few as bits per word and at bits per word they achieve 94.97 F1 which is comparable with other state of the art parsing models when using the same pre trained embeddings We also provide an analysis of the representations learned by our system investigating properties such as the interpretable syntactic features captured by the system and mechanisms for deferred resolution of syntactic ambiguities
Even to a simple and short news headline, readers react in a multitude of ways: cognitively (e.g. inferring the writer’s intent), emotionally (e.g. feeling distrust), and behaviorally (e.g. sharing the news with their friends). Such reactions are instantaneous and yet complex, as they rely on factors that go beyond interpreting factual content of news.We propose Misinfo Reaction Frames (MRF), a pragmatic formalism for modeling how readers might react to a news headline. In contrast to categorical schema, our free-text dimensions provide a more nuanced way of understanding intent beyond being benign or malicious. We also introduce a Misinfo Reaction Frames corpus, a crowdsourced dataset of reactions to over 25k news headlines focusing on global crises: the Covid-19 pandemic, climate change, and cancer. Empirical results confirm that it is indeed possible for neural models to predict the prominent patterns of readers’ reactions to previously unseen news headlines. Additionally, our user study shows that displaying machine-generated MRF implications alongside news headlines to readers can increase their trust in real news while decreasing their trust in misinformation. Our work demonstrates the feasibility and importance of pragmatic inferences on news headlines to help enhance AI-guided misinformation detection and mitigation.
A limitation of current neural dialog models is that they tend to suffer from a lack of specificity and informativeness in generated responses, primarily due to dependence on training data that covers a limited variety of scenarios and conveys limited knowledge. One way to alleviate this issue is to extract relevant knowledge from external sources at decoding time and incorporate it into the dialog response. In this paper, we propose a post-hoc knowledge-injection technique where we first retrieve a diverse set of relevant knowledge snippets conditioned on both the dialog history and an initial response from an existing dialog model. We construct multiple candidate responses, individually injecting each retrieved snippet into the initial response using a gradient-based decoding method, and then select the final response with an unsupervised ranking step. Our experiments in goal-oriented and knowledge-grounded dialog settings demonstrate that human annotators judge the outputs from the proposed method to be more engaging and informative compared to responses from prior dialog systems. We further show that knowledge-augmentation promotes success in achieving conversational goals in both experimental settings.
It remains an open question whether incorporating external knowledge benefits commonsense reasoning while maintaining the flexibility of pretrained sequence models To investigate this question we develop generated knowledge prompting which consists of generating knowledge from a language model then providing the knowledge as additional input when answering a question Our method does not require task specific supervision for knowledge integration or access to a structured knowledge base yet it improves performance of large scale state of the art models on four commonsense reasoning tasks achieving state of the art results on numerical commonsense NumerSense general commonsense CommonsenseQA 2.0 and scientific commonsense QASC benchmarks Generated knowledge prompting highlights large scale language models as flexible sources of external knowledge for improving commonsense reasoning Our code is available at \\urlgithub.com/liujch1998/GKP\n
Retrieval based methods have been shown to be effective in NLP tasks via introducing external knowledge However the indexing and retrieving of large scale corpora bring considerable computational cost Surprisingly we found that REtrieving from the traINing datA REINA only can lead to significant gains on multiple NLG and NLU tasks We retrieve the labeled training instances most similar to the input text and then concatenate them with the input to feed into the model to generate the output Experimental results show that this simple method can achieve significantly better performance on a variety of NLU and NLG tasks including summarization machine translation language modeling and question answering tasks For instance our proposed method achieved state of the art results on XSum BigPatent and CommonsenseQA Our code is released https://github.com/microsoft/REINA
We propose a benchmark to measure whether a language model is truthful in generating answers to questions The benchmark comprises questions that span categories including health law finance and politics We crafted questions that some humans would answer falsely due to a false belief or misconception To perform well models must avoid generating false answers learned from imitating human texts We tested GPT-3 GPT Neo J GPT-2 and a T5 based model The best model was truthful on of questions while human performance was Models generated many false answers that mimic popular misconceptions and have the potential to deceive humans The largest models were generally the least truthful This contrasts with other NLP tasks where performance improves with model size However this result is expected if false answers are learned from the training distribution We suggest that scaling up models alone is less promising for improving truthfulness than fine tuning using training objectives other than imitation of text from the web
When pre trained contextualized embedding based models developed for unstructured data are adapted for structured tabular data they perform admirably However recent probing studies show that these models use spurious correlations and often predict inference labels by focusing on false evidence or ignoring it altogether To study this issue we introduce the task of Trustworthy Tabular Reasoning where a model needs to extract evidence to be used for reasoning in addition to predicting the label As a case study we propose a two stage sequential prediction approach which includes an evidence extraction and an inference stage First we crowdsource evidence row labels and develop several unsupervised and supervised evidence extraction strategies for InfoTabS a tabular NLI benchmark Our evidence extraction strategy outperforms earlier baselines On the downstream tabular inference task using only the automatically extracted evidence as the premise our approach outperforms prior benchmarks
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation. We tackle the problem by first applying a self-supervised discrete speech encoder on the target speech and then training a sequence-to-sequence speech-to-unit translation (S2UT) model to predict the discrete representations of the target speech. When target text transcripts are available, we design a joint speech and text training framework that enables the model to generate dual modality output (speech and text) simultaneously in the same inference pass. Experiments on the Fisher Spanish-English dataset show that the proposed framework yields improvement of 6.7 BLEU compared with a baseline direct S2ST model that predicts spectrogram features. When trained without any text transcripts, our model performance is comparable to models that predict spectrograms and are trained with text supervision, showing the potential of our system for translation between unwritten languages.
As language technologies become more ubiquitous there are increasing efforts towards expanding the language diversity and coverage of natural language processing NLP systems Arguably the most important factor influencing the quality of modern NLP systems is data availability In this work we study the geographical representativeness of NLP datasets aiming to quantify if and by how much do NLP datasets match the expected needs of the language speakers In doing so we use entity recognition and linking systems also making important observations about their cross lingual consistency and giving suggestions for more robust evaluation Last we explore some geographical and economic factors that may explain the observed dataset distributions
Knowledge of difficulty level of questions helps a teacher in several ways, such as estimating students’ potential quickly by asking carefully selected questions and improving quality of examination by modifying trivial and hard questions. Can we extract such benefits of instance difficulty in Natural Language Processing? To this end, we conduct Instance-Level Difficulty Analysis of Evaluation data (ILDAE) in a large-scale setup of 23 datasets and demonstrate its five novel applications: 1) conducting efficient-yet-accurate evaluations with fewer instances saving computational cost and time, 2) improving quality of existing evaluation datasets by repairing erroneous and trivial instances, 3) selecting the best model based on application requirements, 4) analyzing dataset characteristics for guiding future data creation, 5) estimating Out-of-Domain performance reliably. Comprehensive experiments for these applications lead to several interesting results, such as evaluation using just 5% instances (selected via ILDAE) achieves as high as 0.93 Kendall correlation with evaluation using complete dataset and computing weighted accuracy using difficulty scores leads to 5.2% higher correlation with Out-of-Domain performance. We release the difficulty scores and hope our work will encourage research in this important yet understudied field of leveraging instance difficulty in evaluations.
Long-form answers, consisting of multiple sentences, can provide nuanced and comprehensive answers to a broader set of questions. To better understand this complex and understudied task, we study the functional structure of long-form answers collected from three datasets, ELI5, WebGPT and Natural Questions. Our main goal is to understand how humans organize information to craft complex answers. We develop an ontology of six sentence-level functional roles for long-form answers, and annotate 3.9k sentences in 640 answer paragraphs. Different answer collection methods manifest in different discourse structures. We further analyze model-generated answers – finding that annotators agree less with each other when annotating model-generated answers compared to annotating human-written answers. Our annotated data enables training a strong classifier that can be used for automatic analysis. We hope our work can inspire future research on discourse-level modeling and evaluation of long-form QA systems.
We describe a Question Answering QA dataset that contains complex questions with conditional answers i.e. the answers are only applicable when certain conditions apply We call this dataset ConditionalQA In addition to conditional answers the dataset also features \n long context documents with information that is related in logically complex ways \n multi hop questions that require compositional logical reasoning \n a combination of extractive questions yes no questions questions with multiple answers and not answerable questions \n questions asked without knowing the answers We show that ConditionalQA is challenging for many of the existing QA models especially in selecting answer conditions We believe that this dataset will motivate further research in answering complex questions over long documents
While pretrained language models achieve excellent performance on natural language understanding benchmarks, they tend to rely on spurious correlations and generalize poorly to out-of-distribution (OOD) data. Recent work has explored using counterfactually-augmented data (CAD)—data generated by minimally perturbing examples to flip the ground-truth label—to identify robust features that are invariant under distribution shift. However, empirical results using CAD during training for OOD generalization have been mixed. To explain this discrepancy, through a toy theoretical example and empirical analysis on two crowdsourced CAD datasets, we show that: (a) while features perturbed in CAD are indeed robust features, it may prevent the model from learning unperturbed robust features; and (b) CAD may exacerbate existing spurious correlations in the data. Our results thus show that the lack of perturbation diversity limits CAD’s effectiveness on OOD generalization, calling for innovative crowdsourcing procedures to elicit diverse perturbation of examples.
Sentiment transfer is one popular example of a text style transfer task where the goal is to reverse the sentiment polarity of a text With a sentiment reversal comes also a reversal in meaning We introduce a different but related task called positive reframing in which we neutralize a negative point of view and generate a more positive perspective for the author without contradicting the original meaning Our insistence on meaning preservation makes positive reframing a challenging and semantically rich task To facilitate rapid progress we introduce a large scale benchmark Positive Psychology Frames with 8,349 sentence pairs and 12,755 structured annotations to explain positive reframing in terms of six theoretically motivated reframing strategies Then we evaluate a set of state of the art text style transfer models and conclude by discussing key challenges and directions for future work
Conversational agents have come increasingly closer to human competence in open domain dialogue settings however such models can reflect insensitive hurtful or entirely incoherent viewpoints that erode a user’s trust in the moral integrity of the system Moral deviations are difficult to mitigate because moral judgments are not universal and there may be multiple competing judgments that apply to a situation simultaneously In this work we introduce a new resource not to authoritatively resolve moral ambiguities but instead to facilitate systematic understanding of the intuitions values and moral judgments reflected in the utterances of dialogue systems The Moral Integrity Corpus MIC is such a resource which captures the moral assumptions of 38k prompt reply pairs using 99k distinct Rules of Thumb RoTs Each RoT reflects a particular moral conviction that can explain why a chatbot’s reply may appear acceptable or problematic We further organize RoTs with a set of moral and social attributes and benchmark performance for attribute classification Most importantly we show that current neural language models can automatically generate new RoTs that reasonably describe previously unseen interactions but they still struggle with certain scenarios Our findings suggest that MIC will be a useful resource for understanding and language models implicit moral assumptions and flexibly benchmarking the integrity of conversational agents To download the data see https://github.com/GT-SALT/mic
Graph neural networks have triggered a resurgence of graph-based text classification methods, defining today’s state of the art. We show that a wide multi-layer perceptron (MLP) using a Bag-of-Words (BoW) outperforms the recent graph-based models TextGCN and HeteGCN in an inductive text classification setting and is comparable with HyperGAT. Moreover, we fine-tune a sequence-based BERT and a lightweight DistilBERT model, which both outperform all state-of-the-art models. These results question the importance of synthetic graphs used in modern text classifiers. In terms of efficiency, DistilBERT is still twice as large as our BoW-based wide MLP, while graph-based models like TextGCN require setting up an \\mathcal{O}(N^2) graph, where N is the vocabulary plus corpus size. Finally, since Transformers need to compute \\mathcal{O}(L^2) attention weights with sequence length L, the MLP models show higher training and inference speeds on datasets with long sequences.
We introduce ParaBLEU a paraphrase representation learning model and evaluation metric for text generation Unlike previous approaches ParaBLEU learns to understand paraphrasis using generative conditioning as a pretraining objective ParaBLEU correlates more strongly with human judgements than existing metrics obtaining new state of the art results on the WMT Metrics Shared Task We show that our model is robust to data scarcity exceeding previous state of the art performance using only 50\\%$ of the available training data and surpassing BLEU ROUGE and METEOR with only labelled examples Finally we demonstrate that ParaBLEU can be used to conditionally generate novel paraphrases from a single demonstration which we use to confirm our hypothesis that it learns abstract generalized paraphrase representations
Word identification from continuous input is typically viewed as a segmentation task Experiments with human adults suggest that familiarity with syntactic structures in their native language also influences word identification in artificial languages however the relation between syntactic processing and word identification is yet unclear This work takes one step forward by exploring a radically different approach of word identification in which segmentation of a continuous input is viewed as a process isomorphic to unsupervised constituency parsing Besides formalizing the approach this study reports simulations of human experiments with DIORA Drozdov et al a neural unsupervised constituency parser Results show that this model can reproduce human behavior in word identification experiments suggesting that this is a viable approach to study word identification and its relation to syntactic processing
The social impact of natural language processing and its applications has received increasing attention. In this position paper, we focus on the problem of safety for end-to-end conversational AI. We survey the problem landscape therein, introducing a taxonomy of three observed phenomena: the Instigator, Yea-Sayer, and Impostor effects. We then empirically assess the extent to which current tools can measure these effects and current systems display them. We release these tools as part of a “first aid kit” (SafetyKit) to quickly assess apparent safety concerns. Our results show that, while current tools are able to provide an estimate of the relative safety of systems in various settings, they still have several shortcomings. We suggest several future directions and discuss ethical considerations.
Obtaining human like performance in NLP is often argued to require compositional generalisation Whether neural networks exhibit this ability is usually studied by training models on highly compositional synthetic data However compositionality in natural language is much more complex than the rigid arithmetic like version such data adheres to and artificial compositionality tests thus do not allow us to determine how neural models deal with more realistic forms of compositionality In this work we re instantiate three compositionality tests from the literature and reformulate them for neural machine translation NMT Our results highlight that i unfavourably models trained on more data are more compositional ii models are sometimes less compositional than expected but sometimes more exemplifying that different levels of compositionality are required and models are not always able to modulate between them correctly iii some of the non compositional behaviours are mistakes whereas others reflect the natural variation in data Apart from an empirical study our work is a call to action we should rethink the evaluation of compositionality in neural networks and develop benchmarks using real data to evaluate compositionality on natural language where composing meaning is not as straightforward as doing the math
Laws and their interpretations legal arguments and agreements are typically expressed in writing leading to the production of vast corpora of legal text Their analysis which is at the center of legal practice becomes increasingly elaborate as these collections grow in size Natural language understanding NLU technologies can be a valuable tool to support legal practitioners in these endeavors Their usefulness however largely depends on whether current state of the art models can generalize across various tasks in the legal domain To answer this currently open question we introduce the Legal General Language Understanding Evaluation LexGLUE benchmark a collection of datasets for evaluating model performance across a diverse set of legal NLU tasks in a standardized way We also provide an evaluation and analysis of several generic and legal oriented models demonstrating that the latter consistently offer performance improvements across multiple tasks
In the field of sentiment analysis, several studies have highlighted that a single sentence may express multiple, sometimes contrasting, sentiments and emotions, each with its own experiencer, target and/or cause. To this end, over the past few years researchers have started to collect and annotate data manually, in order to investigate the capabilities of automatic systems not only to distinguish between emotions, but also to capture their semantic constituents. However, currently available gold datasets are heterogeneous in size, domain, format, splits, emotion categories and role labels, making comparisons across different works difficult and hampering progress in the area. In this paper, we tackle this issue and present a unified evaluation framework focused on Semantic Role Labeling for Emotions (SRL4E), in which we unify several datasets tagged with emotions and semantic roles by using a common labeling scheme. We use SRL4E as a benchmark to evaluate how modern pretrained language models perform and analyze where we currently stand in this task, hoping to provide the tools to facilitate studies in this complex area.
In linguistics, there are two main perspectives on negation: a semantic and a pragmatic view. So far, research in NLP on negation has almost exclusively adhered to the semantic view. In this article, we adopt the pragmatic paradigm to conduct a study of negation understanding focusing on transformer-based PLMs. Our results differ from previous, semantics-based studies and therefore help to contribute a more comprehensive – and, given the results, much more optimistic – picture of the PLMs’ negation understanding.
Identifying changes in individuals behaviour and mood as observed via content shared on online platforms is increasingly gaining importance Most research to date on this topic focuses on either a identifying individuals at risk or with a certain mental health condition given a batch of posts or b providing equivalent labels at the post level A disadvantage of such work is the lack of a strong temporal component and the inability to make longitudinal assessments following an individual’s trajectory and allowing timely interventions Here we define a new task that of identifying moments of change in individuals on the basis of their shared content online The changes we consider are sudden shifts in mood switches or gradual mood progression escalations We have created detailed guidelines for capturing moments of change and a corpus of manually annotated user timelines 18.7 K posts We have developed a variety of baseline models drawing inspiration from related tasks and show that the best performance is obtained through context aware sequential modelling We also introduce new metrics for capturing rare events in temporal windows
Formality style transfer (FST) is a task that involves paraphrasing an informal sentence into a formal one without altering its meaning. To address the data-scarcity problem of existing parallel datasets, previous studies tend to adopt a cycle-reconstruction scheme to utilize additional unlabeled data, where the FST model mainly benefits from target-side unlabeled sentences. In this work, we propose a simple yet effective semi-supervised framework to better utilize source-side unlabeled sentences based on consistency training. Specifically, our approach augments pseudo-parallel data obtained from a source-side informal sentence by enforcing the model to generate similar outputs for its perturbed version. Moreover, we empirically examined the effects of various data perturbation methods and propose effective data filtering strategies to improve our framework. Experimental results on the GYAFC benchmark demonstrate that our approach can achieve state-of-the-art results, even with less than 40% of the parallel data.
Word sense disambiguation (WSD) is a crucial problem in the natural language processing (NLP) community. Current methods achieve decent performance by utilizing supervised learning and large pre-trained language models. However, the imbalanced training dataset leads to poor performance on rare senses and zero-shot senses. There are more training instances and senses for words with top frequency ranks than those with low frequency ranks in the training dataset. We investigate the statistical relation between word frequency rank and word sense number distribution. Based on the relation, we propose a Z-reweighting method on the word level to adjust the training on the imbalanced dataset. The experiments show that the Z-reweighting strategy achieves performance gain on the standard English all words WSD benchmark. Moreover, the strategy can help models generalize better on rare and zero-shot senses.
Multimodal Entity Linking MEL which aims at linking mentions with multimodal contexts to the referent entities from a knowledge base e.g. Wikipedia is an essential task for many multimodal applications Although much attention has been paid to MEL the shortcomings of existing MEL datasets including limited contextual topics and entity types simplified mention ambiguity and restricted availability have caused great obstacles to the research and application of MEL In this paper we present WikiDiverse a high quality human annotated MEL dataset with diversified contextual topics and entity types from Wikinews which uses Wikipedia as the corresponding knowledge base A well tailored annotation procedure is adopted to ensure the quality of the dataset Based on WikiDiverse a sequence of well designed MEL models with intra modality and inter modality attentions are implemented which utilize the visual information of images more adequately than existing MEL models do Extensive experimental analyses are conducted to investigate the contributions of different modalities in terms of MEL facilitating the future research on this task
Dialog response generation in open domain is an important research topic where the main challenge is to generate relevant and diverse responses. In this paper, we propose a new dialog pre-training framework called DialogVED, which introduces continuous latent variables into the enhanced encoder-decoder pre-training framework to increase the relevance and diversity of responses. With the help of a large dialog corpus (Reddit), we pre-train the model using the following 4 tasks, used in training language models (LMs) and Variational Autoencoders (VAEs) literature: 1) masked language model; 2) response generation; 3) bag-of-words prediction; and 4) KL divergence reduction. We also add additional parameters to model the turn structure in dialogs to improve the performance of the pre-trained model. We conduct experiments on PersonaChat, DailyDialog, and DSTC7-AVSD benchmarks for response generation. Experimental results show that our model achieves the new state-of-the-art results on all these datasets.
We study the problem of coarse-grained response selection in retrieval-based dialogue systems. The problem is equally important with fine-grained response selection, but is less explored in existing literature. In this paper, we propose a Contextual Fine-to-Coarse (CFC) distilled model for coarse-grained response selection in open-domain conversations. In our CFC model, dense representations of query, candidate contexts and responses is learned based on the multi-tower architecture using contextual matching, and richer knowledge learned from the one-tower architecture (fine-grained) is distilled into the multi-tower architecture (coarse-grained) to enhance the performance of the retriever. To evaluate the performance of the proposed model, we construct two new datasets based on the Reddit comments dump and Twitter corpus. Extensive experimental results on the two datasets show that the proposed method achieves huge improvement over all evaluation metrics compared with traditional baseline methods.
Recent entity and relation extraction works focus on investigating how to obtain a better span representation from the pre trained encoder However a major limitation of existing works is that they ignore the interrelation between spans pairs In this work we propose a novel span representation approach named Packed Levitated Markers PL Marker to consider the interrelation between the spans pairs by strategically packing the markers in the encoder In particular we propose a neighborhood oriented packing strategy which considers the neighbor spans integrally to better model the entity boundary information Furthermore for those more complicated span pair classification tasks we design a subject oriented packing strategy which packs each subject and all its objects to model the interrelation between the same subject span pairs The experimental results show that with the enhanced marker feature our model advances baselines on six NER benchmarks and obtains a 4.1%-4.3 strict relation F1 improvement with higher speed over previous state of the art models on ACE04 and ACE05 Our code and models are publicly available at https://github.com/thunlp/PL-Marker
Current Open-Domain Question Answering (ODQA) models typically include a retrieving module and a reading module, where the retriever selects potentially relevant passages from open-source documents for a given question, and the reader produces an answer based on the retrieved passages. The recently proposed Fusion-in-Decoder (FiD) framework is a representative example, which is built on top of a dense passage retriever and a generative reader, achieving the state-of-the-art performance. In this paper we further improve the FiD approach by introducing a knowledge-enhanced version, namely KG-FiD. Our new model uses a knowledge graph to establish the structural relationship among the retrieved passages, and a graph neural network (GNN) to re-rank the passages and select only a top few for further processing. Our experiments on common ODQA benchmark datasets (Natural Questions and TriviaQA) demonstrate that KG-FiD can achieve comparable or better performance in answer prediction than FiD, with less than 40% of the computation cost.
This paper addresses the problem of dialogue reasoning with contextualized commonsense inference We curate CICERO a dataset of dyadic conversations with five types of utterance level reasoning based inferences cause subsequent event prerequisite motivation and emotional reaction The dataset contains 53,105 of such inferences from 5,672 dialogues We use this dataset to solve relevant generative and discriminative tasks generation of cause and subsequent event generation of prerequisite motivation and listener’s emotional reaction and selection of plausible alternatives Our results ascertain the value of such dialogue centric commonsense knowledge datasets It is our hope that CICERO will open new research avenues into commonsense based dialogue reasoning
Interpretable methods to reveal the internal reasoning processes behind machine learning models have attracted increasing attention in recent years To quantify the extent to which the identified interpretations truly reflect the intrinsic decision making mechanisms various faithfulness evaluation metrics have been proposed However we find that different faithfulness metrics show conflicting preferences when comparing different interpretations Motivated by this observation we aim to conduct a comprehensive and comparative study of the widely adopted faithfulness metrics In particular we introduce two assessment dimensions namely diagnosticity and complexity Diagnosticity refers to the degree to which the faithfulness metric favors relatively faithful interpretations over randomly generated ones and complexity is measured by the average number of model forward passes According to the experimental results we find that sufficiency and comprehensiveness metrics have higher diagnosticity and lower complexity than the other faithfulness metrics
Selecting an appropriate pre-trained model (PTM) for a specific downstream task typically requires significant efforts of fine-tuning. To accelerate this process, researchers propose feature-based model selection (FMS) methods, which assess PTMs’ transferability to a specific task in a fast way without fine-tuning. In this work, we argue that current FMS methods are vulnerable, as the assessment mainly relies on the static features extracted from PTMs. However, such features are derived without training PTMs on downstream tasks, and are not necessarily reliable indicators for the PTM’s transferability. To validate our viewpoints, we design two methods to evaluate the robustness of FMS: (1) model disguise attack, which post-trains an inferior PTM with a contrastive objective, and (2) evaluation data selection, which selects a subset of the data points for FMS evaluation based on K-means clustering. Experimental results prove that both methods can successfully make FMS mistakenly judge the transferability of PTMs. Moreover, we find that these two methods can further be combined with the backdoor attack to misguide the FMS to select poisoned models. To the best of our knowledge, this is the first work to demonstrate the defects of current FMS algorithms and evaluate their potential security risks. By identifying previously unseen risks of FMS, our study indicates new directions for improving the robustness of FMS.
Generating educational questions of fairytales or storybooks is vital for improving children’s literacy ability. However, it is challenging to generate questions that capture the interesting aspects of a fairytale story with educational meaningfulness. In this paper, we propose a novel question generation method that first learns the question type distribution of an input story paragraph, and then summarizes salient events which can be used to generate high-cognitive-demand questions. To train the event-centric summarizer, we finetune a pre-trained transformer-based sequence-to-sequence model using silver samples composed by educational question-answer pairs. On a newly proposed educational question-answering dataset FairytaleQA, we show good performance of our method on both automatic and human evaluation metrics. Our work indicates the necessity of decomposing question type distribution learning and event-centric summary generation for educational question generation.
Program understanding is a fundamental task in program language processing Despite the success existing works fail to take human behaviors as reference in understanding programs In this paper we consider human behaviors and propose the PGNN EK model that consists of two main components On the one hand inspired by the divide and conquer reading behaviors of humans we present a partitioning based graph neural network model PGNN on the upgraded AST of codes On the other hand to characterize human behaviors of resorting to other resources to help code comprehension we transform raw codes with external knowledge and apply pre training techniques for information extraction Finally we combine the two embeddings generated from the two components to output code embeddings We conduct extensive experiments to show the superior performance of PGNN EK on the code summarization and code clone detection tasks In particular to show the generalization ability of our model we release a new dataset that is more challenging for code clone detection and could advance the development of the community Our codes and data are publicly available at https://github.com/RecklessRonan/PGNN-EK
We consider event extraction in a generative manner with template-based conditional generation.Although there is a rising trend of casting the task of event extraction as a sequence generation problem with prompts, these generation-based methods have two significant challenges, including using suboptimal prompts and static event type information.In this paper, we propose a generative template-based event extraction method with dynamic prefix (GTEE-DynPref) by integrating context information with type-specific prefixes to learn a context-specific prefix for each context.Experimental results show that our model achieves competitive results with the state-of-the-art classification-based model OneIE on ACE 2005 and achieves the best performances on ERE.Additionally, our model is proven to be portable to new types of events effectively.
We introduce a noisy channel approach for language model prompting in few-shot text classification. Instead of computing the likelihood of the label given the input (referred as direct models), channel models compute the conditional probability of the input given the label, and are thereby required to explain every word in the input. We use channel models for recently proposed few-shot learning methods with no or very limited updates to the language model parameters, via either in-context demonstration or prompt tuning. Our experiments show that, for both methods, channel models significantly outperform their direct counterparts, which we attribute to their stability, i.e., lower variance and higher worst-case accuracy. We also present extensive ablations that provide recommendations for when to use channel prompt tuning instead of other competitive models (e.g., direct head tuning): channel prompt tuning is preferred when the number of training examples is small, labels in the training data are imbalanced, or generalization to unseen labels is required.
Transformers are unable to model long-term memories effectively, since the amount of computation they need to perform grows with the context length. While variations of efficient transformers have been proposed, they all have a finite memory capacity and are forced to drop old information. In this paper, we propose the \\infty-former, which extends the vanilla transformer with an unbounded long-term memory. By making use of a continuous-space attention mechanism to attend over the long-term memory, the \\infty-former’s attention complexity becomes independent of the context length, trading off memory length with precision.In order to control where precision is more important, \\infty-former maintains “sticky memories,” being able to model arbitrarily long contexts while keeping the computation budget fixed.Experiments on a synthetic sorting task, language modeling, and document grounded dialogue generation demonstrate the \\infty-former’s ability to retain information from long sequences.
In recent years, neural models have often outperformed rule-based and classic Machine Learning approaches in NLG. These classic approaches are now often disregarded, for example when new neural models are evaluated. We argue that they should not be overlooked, since, for some tasks, well-designed non-neural approaches achieve better performance than neural ones. In this paper, the task of generating referring expressions in linguistic context is used as an example. We examined two very different English datasets (WEBNLG and WSJ), and evaluated each algorithm using both automatic and human evaluations.Overall, the results of these evaluations suggest that rule-based systems with simple rule sets achieve on-par or better performance on both datasets compared to state-of-the-art neural REG systems. In the case of the more realistic dataset, WSJ, a machine learning-based system with well-designed linguistic features performed best. We hope that our work can encourage researchers to consider non-neural models in future.
Paraphrase identification involves identifying whether a pair of sentences express the same or similar meanings. While cross-encoders have achieved high performances across several benchmarks, bi-encoders such as SBERT have been widely applied to sentence pair tasks. They exhibit substantially lower computation complexity and are better suited to symmetric tasks. In this work, we adopt a bi-encoder approach to the paraphrase identification task, and investigate the impact of explicitly incorporating predicate-argument information into SBERT through weighted aggregation. Experiments on six paraphrase identification datasets demonstrate that, with a minimal increase in parameters, the proposed model is able to outperform SBERT/SRoBERTa significantly. Further, ablation studies reveal that the predicate-argument based component plays a significant role in the performance gain.
Multimodal machine translation (MMT) aims to improve neural machine translation (NMT) with additional visual information, but most existing MMT methods require paired input of source sentence and image, which makes them suffer from shortage of sentence-image pairs. In this paper, we propose a phrase-level retrieval-based method for MMT to get visual information for the source input from existing sentence-image data sets so that MMT can break the limitation of paired sentence-image input. Our method performs retrieval at the phrase level and hence learns visual information from pairs of source phrase and grounded region, which can mitigate data sparsity. Furthermore, our method employs the conditional variational auto-encoder to learn visual representations which can filter redundant visual information and only retain visual information related to the phrase. Experiments show that the proposed method significantly outperforms strong baselines on multiple MMT datasets, especially when the textual context is limited.
Motivated by the success of T5 (Text-To-Text Transfer Transformer) in pre-trained natural language processing models, we propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning. The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets. After preprocessing the input speech/text through the pre-nets, the shared encoder-decoder network models the sequence-to-sequence transformation, and then the post-nets generate the output in the speech/text modality based on the output of the decoder. Leveraging large-scale unlabeled speech and text data, we pre-train SpeechT5 to learn a unified-modal representation, hoping to improve the modeling capability for both speech and text. To align the textual and speech information into this unified semantic space, we propose a cross-modal vector quantization approach that randomly mixes up speech/text states with latent units as the interface between encoder and decoder. Extensive evaluations show the superiority of the proposed SpeechT5 framework on a wide variety of spoken language processing tasks, including automatic speech recognition, speech synthesis, speech translation, voice conversion, speech enhancement, and speaker identification.
Information extraction suffers from its varying targets heterogeneous structures and demand specific schemas In this paper we propose a unified text to structure generation framework namely UIE which can universally model different IE tasks adaptively generate targeted structures and collaboratively learn general IE abilities from different knowledge sources Specifically UIE uniformly encodes different extraction structures via a structured extraction language adaptively generates target extractions via a schema based prompt mechanism structural schema instructor and captures the common IE abilities via a large scale pretrained text to structure model Experiments show that UIE achieved the state of the art performance on IE tasks datasets and on all supervised low resource and few shot settings for a wide range of entity relation event and sentiment extraction tasks and their unification These results verified the effectiveness universality and transferability of UIE
Low-shot relation extraction (RE) aims to recognize novel relations with very few or even no samples, which is critical in real scenario application. Few-shot and zero-shot RE are two representative low-shot RE tasks, which seem to be with similar target but require totally different underlying abilities. In this paper, we propose Multi-Choice Matching Networks to unify low-shot relation extraction. To fill in the gap between zero-shot and few-shot RE, we propose the triplet-paraphrase meta-training, which leverages triplet paraphrase to pre-train zero-shot label matching ability and uses meta-learning paradigm to learn few-shot instance summarizing ability. Experimental results on three different low-shot RE tasks show that the proposed method outperforms strong baselines by a large margin, and achieve the best performance on few-shot RE leaderboard.
Machine Reading Comprehension MRC reveals the ability to understand a given text passage and answer questions based on it Existing research works in MRC rely heavily on large size models and corpus to improve the performance evaluated by metrics such as Exact Match EM$ and F_1$. However such a paradigm lacks sufficient interpretation to model capability and can not efficiently train a model with a large corpus In this paper we argue that a deep understanding of model capabilities and data properties can help us feed a model with appropriate training data based on its learning status Specifically we design an MRC capability assessment framework that assesses model capabilities in an explainable and multi dimensional manner Based on it we further uncover and disentangle the connections between various data properties and model performance Finally to verify the effectiveness of the proposed MRC capability assessment framework we incorporate it into a curriculum learning pipeline and devise a Capability Boundary Breakthrough Curriculum CBBC strategy which performs a model capability based training to maximize the data value and improve training efficiency Extensive experiments demonstrate that our approach significantly improves performance achieving up to an 11.22\\% 8.71\\% improvement of EM$ F_1 on MRC tasksEM) and F_1. However, such a paradigm lacks sufficient interpretation to model capability and can not efficiently train a model with a large corpus. In this paper, we argue that a deep understanding of model capabilities and data properties can help us feed a model with appropriate training data based on its learning status. Specifically, we design an MRC capability assessment framework that assesses model capabilities in an explainable and multi-dimensional manner. Based on it, we further uncover and disentangle the connections between various data properties and model performance. Finally, to verify the effectiveness of the proposed MRC capability assessment framework, we incorporate it into a curriculum learning pipeline and devise a Capability Boundary Breakthrough Curriculum (CBBC) strategy, which performs a model capability-based training to maximize the data value and improve training efficiency. Extensive experiments demonstrate that our approach significantly improves performance, achieving up to an 11.22% / 8.71% improvement of EM / F_1 on MRC tasks.
Simile interpretation (SI) and simile generation (SG) are challenging tasks for NLP because models require adequate world knowledge to produce predictions. Previous works have employed many hand-crafted resources to bring knowledge-related into models, which is time-consuming and labor-intensive. In recent years, pre-trained language models (PLMs) based approaches have become the de-facto standard in NLP since they learn generic knowledge from a large corpus. The knowledge embedded in PLMs may be useful for SI and SG tasks. Nevertheless, there are few works to explore it. In this paper, we probe simile knowledge from PLMs to solve the SI and SG tasks in the unified framework of simile triple completion for the first time. The backbone of our framework is to construct masked sentences with manual patterns and then predict the candidate words in the masked position. In this framework, we adopt a secondary training process (Adjective-Noun mask Training) with the masked language model (MLM) loss to enhance the prediction diversity of candidate words in the masked position. Moreover, pattern ensemble (PE) and pattern search (PS) are applied to improve the quality of predicted words. Finally, automatic and human evaluations demonstrate the effectiveness of our framework in both SI and SG tasks.
Typed entailment graphs try to learn the entailment relations between predicates from text and model them as edges between predicate nodes The construction of entailment graphs usually suffers from severe sparsity and unreliability of distributional similarity We propose a two stage method Entailment Graph with Textual Entailment and Transitivity EGT2 EGT2 learns the local entailment relations by recognizing the textual entailment between template sentences formed by typed CCG parsed predicates Based on the generated local graph EGT2 then uses three novel soft transitivity constraints to consider the logical transitivity in entailment structures Experiments on benchmark datasets show that EGT2 can well model the transitivity in entailment graph to alleviate the sparsity and leads to signifcant improvement over current state of the art methods
In this paper, we study how to continually pre-train language models for improving the understanding of math problems. Specifically, we focus on solving a fundamental challenge in modeling math problems, how to fuse the semantics of textual description and formulas, which are highly different in essence. To address this issue, we propose a new approach called COMUS to continually pre-train language models for math problem understanding with syntax-aware memory network. In this approach, we first construct the math syntax graph to model the structural semantic information, by combining the parsing trees of the text and formulas, and then design the syntax-aware memory networks to deeply fuse the features from the graph and text. With the help of syntax relations, we can model the interaction between the token from the text and its semantic-related nodes within the formulas, which is helpful to capture fine-grained semantic correlations between texts and formulas. Besides, we devise three continual pre-training tasks to further align and fuse the representations of the text and math syntax graph. Experimental results on four tasks in the math domain demonstrate the effectiveness of our approach. Our code and data are publicly available at the link: bluehttps://github.com/RUCAIBox/COMUS.
Recent years have witnessed growing interests in incorporating external knowledge such as pre-trained word embeddings (PWEs) or pre-trained language models (PLMs) into neural topic modeling. However, we found that employing PWEs and PLMs for topic modeling only achieved limited performance improvements but with huge computational overhead. In this paper, we propose a novel strategy to incorporate external knowledge into neural topic modeling where the neural topic model is pre-trained on a large corpus and then fine-tuned on the target dataset. Experiments have been conducted on three datasets and results show that the proposed approach significantly outperforms both current state-of-the-art neural topic models and some topic modeling approaches enhanced with PWEs or PLMs. Moreover, further study shows that the proposed approach greatly reduces the need for the huge size of training data.
Word and sentence embeddings are useful feature representations in natural language processing However intrinsic evaluation for embeddings lags far behind and there has been no significant update since the past decade Word and sentence similarity tasks have become the de facto evaluation method It leads models to overfit to such evaluations negatively impacting embedding models development This paper first points out the problems using semantic similarity as the gold standard for word and sentence embedding evaluations Further we propose a new intrinsic evaluation method called EvalRank which shows a much stronger correlation with downstream tasks Extensive experiments are conducted based on models and popular datasets to certify our judgments Finally the practical evaluation toolkit is released for future benchmarking purposes
CLIP has shown a remarkable zero-shot capability on a wide range of vision tasks. Previously, CLIP is only regarded as a powerful visual encoder. However, after being pre-trained by language supervision from a large amount of image-caption pairs, CLIP itself should also have acquired some few-shot abilities for vision-language tasks. In this work, we empirically show that CLIP can be a strong vision-language few-shot learner by leveraging the power of language. We first evaluate CLIP’s zero-shot performance on a typical visual question answering task and demonstrate a zero-shot cross-modality transfer capability of CLIP on the visual entailment task. Then we propose a parameter-efficient fine-tuning strategy to boost the few-shot performance on the vqa task. We achieve competitive zero/few-shot results on the visual question answering and visual entailment tasks without introducing any additional pre-training procedure.
Dialogue systems are usually categorized into two types, open-domain and task-oriented. The first one focuses on chatting with users and making them engage in the conversations, where selecting a proper topic to fit the dialogue context is essential for a successful dialogue. The other one focuses on a specific task instead of casual talks, e.g., finding a movie on Friday night, playing a song. These two directions have been studied separately due to their different purposes. However, how to smoothly transition from social chatting to task-oriented dialogues is important for triggering the business opportunities, and there is no any public data focusing on such scenarios. Hence, this paper focuses on investigating the conversations starting from open-domain social chatting and then gradually transitioning to task-oriented purposes, and releases a large-scale dataset with detailed annotations for encouraging this research direction. To achieve this goal, this paper proposes a framework to automatically generate many dialogues without human involvement, in which any powerful open-domain dialogue generation model can be easily leveraged. The human evaluation shows that our generated dialogue data has a natural flow at a reasonable quality, showing that our released data has a great potential of guiding future research directions and commercial activities. Furthermore, the released models allow researchers to automatically generate unlimited dialogues in the target scenarios, which can greatly benefit semi-supervised and unsupervised approaches.
Code completion, which aims to predict the following code token(s) according to the code context, can improve the productivity of software development. Recent work has proved that statistical language modeling with transformers can greatly improve the performance in the code completion task via learning from large-scale source code datasets. However, current approaches focus only on code context within the file or project, i.e. internal context. Our distinction is utilizing ”external” context, inspired by human behaviors of copying from the related code snippets when writing code. Specifically, we propose a retrieval-augmented code completion framework, leveraging both lexical copying and referring to code with similar semantics by retrieval. We adopt a stage-wise training approach that combines a source code retriever and an auto-regressive language model for programming language. We evaluate our approach in the code completion task in Python and Java programming languages, achieving a state-of-the-art performance on CodeXGLUE benchmark.
DocRED is a widely used dataset for document-level relation extraction. In the large-scale annotation, a recommend-revise scheme is adopted to reduce the workload. Within this scheme, annotators are provided with candidate relation instances from distant supervision, and they then manually supplement and remove relational facts based on the recommendations. However, when comparing DocRED with a subset relabeled from scratch, we find that this scheme results in a considerable amount of false negative samples and an obvious bias towards popular entities and relations. Furthermore, we observe that the models trained on DocRED have low recall on our relabeled dataset and inherit the same bias in the training data. Through the analysis of annotators’ behaviors, we figure out the underlying reason for the problems above: the scheme actually discourages annotators from supplementing adequate instances in the revision phase. We appeal to future research to take into consideration the issues with the recommend-revise scheme when designing new models and annotation schemes. The relabeled dataset is released at https://github.com/AndrewZhe/Revisit-DocRED, to serve as a more reliable test set of document RE models.
A recent study by Feldman proposed a long tail theory to explain the memorization behavior of deep learning models However memorization has not been empirically verified in the context of NLP a gap addressed by this work In this paper we use three different NLP tasks to check if the long tail theory holds Our experiments demonstrate that top ranked memorized training instances are likely atypical and removing the top memorized training instances leads to a more serious drop in test accuracy compared with removing training instances randomly Furthermore we develop an attribution method to better understand why a training instance is memorized We empirically show that our memorization attribution method is faithful and share our interesting finding that the top memorized parts of a training instance tend to be features negatively correlated with the class label
Most works on financial forecasting use information directly associated with individual companies (e.g., stock prices, news on the company) to predict stock returns for trading. We refer to such company-specific information as local information. Stock returns may also be influenced by global information (e.g., news on the economy in general), and inter-company relationships. Capturing such diverse information is challenging due to the low signal-to-noise ratios, different time-scales, sparsity and distributions of global and local information from different modalities. In this paper, we propose a model that captures both global and local multimodal information for investment and risk management-related forecasting tasks. Our proposed Guided Attention Multimodal Multitask Network (GAME) model addresses these challenges by using novel attention modules to guide learning with global and local information from different modalities and dynamic inter-company relationship networks. Our extensive experiments show that GAME outperforms other state-of-the-art models in several forecasting tasks and important real-world application case studies.
Pre-trained sequence-to-sequence models have significantly improved Neural Machine Translation (NMT). Different from prior works where pre-trained models usually adopt an unidirectional decoder, this paper demonstrates that pre-training a sequence-to-sequence model but with a bidirectional decoder can produce notable performance gains for both Autoregressive and Non-autoregressive NMT. Specifically, we propose CeMAT, a conditional masked language model pre-trained on large-scale bilingual and monolingual corpora in many languages. We also introduce two simple but effective methods to enhance the CeMAT, aligned code-switching & masking and dynamic dual-masking. We conduct extensive experiments and show that our CeMAT can achieve significant performance improvement for all scenarios from low- to extremely high-resource languages, i.e., up to +14.4 BLEU on low resource and +7.9 BLEU improvements on average for Autoregressive NMT. For Non-autoregressive NMT, we demonstrate it can also produce consistent performance gains, i.e., up to +5.3 BLEU. To the best of our knowledge, this is the first work to pre-train a unified model for fine-tuning on both NMT tasks. Code, data, and pre-trained models are available at https://github.com/huawei-noah/Pretrained-Language-Model/CeMAT
Evaluation of open-domain dialogue systems is highly challenging and development of better techniques is highlighted time and again as desperately needed. Despite substantial efforts to carry out reliable live evaluation of systems in recent competitions, annotations have been abandoned and reported as too unreliable to yield sensible results. This is a serious problem since automatic metrics are not known to provide a good indication of what may or may not be a high-quality conversation. Answering the distress call of competitions that have emphasized the urgent need for better evaluation techniques in dialogue, we present the successful development of human evaluation that is highly reliable while still remaining feasible and low cost. Self-replication experiments reveal almost perfectly repeatable results with a correlation of r=0.969. Furthermore, due to the lack of appropriate methods of statistical significance testing, the likelihood of potential improvements to systems occurring due to chance is rarely taken into account in dialogue evaluation, and the evaluation we propose facilitates application of standard tests. Since we have developed a highly reliable evaluation method, new insights into system performance can be revealed. We therefore include a comparison of state-of-the-art models (i) with and without personas, to measure the contribution of personas to conversation quality, as well as (ii) prescribed versus freely chosen topics. Interestingly with respect to personas, results indicate that personas do not positively contribute to conversation quality as expected.
Generic summaries try to cover an entire document and query-based summaries try to answer document-specific questions. But real users’ needs often fall in between these extremes and correspond to aspects, high-level topics discussed among similar types of documents. In this paper, we collect a dataset of realistic aspect-oriented summaries, AspectNews, which covers different subtopics about articles in news sub-domains. We annotate data across two domains of articles, earthquakes and fraud investigations, where each article is annotated with two distinct summaries focusing on different aspects for each domain. A system producing a single generic summary cannot concisely satisfy both aspects. Our focus in evaluation is how well existing techniques can generalize to these domains without seeing in-domain training data, so we turn to techniques to construct synthetic training data that have been used in query-focused summarization work. We compare several training schemes that differ in how strongly keywords are used and how oracle summaries are extracted. Our evaluation shows that our final approach yields (a) focused summaries, better than those from a generic summarization system or from keyword matching; (b) a system sensitive to the choice of keywords.
We introduce MemSum (Multi-step Episodic Markov decision process extractive SUMmarizer), a reinforcement-learning-based extractive summarizer enriched at each step with information on the current extraction history. When MemSum iteratively selects sentences into the summary, it considers a broad information set that would intuitively also be used by humans in this task: 1) the text content of the sentence, 2) the global text context of the rest of the document, and 3) the extraction history consisting of the set of sentences that have already been extracted. With a lightweight architecture, MemSum obtains state-of-the-art test-set performance (ROUGE) in summarizing long documents taken from PubMed, arXiv, and GovReport. Ablation studies demonstrate the importance of local, global, and history information. A human evaluation confirms the high quality and low redundancy of the generated summaries, stemming from MemSum’s awareness of extraction history.
Online alignment in machine translation refers to the task of aligning a target word to a source word when the target sequence has only been partially decoded. Good online alignments facilitate important applications such as lexically constrained translation where user-defined dictionaries are used to inject lexical constraints into the translation model. We propose a novel posterior alignment technique that is truly online in its execution and superior in terms of alignment error rates compared to existing methods. Our proposed inference technique jointly considers alignment and token probabilities in a principled manner and can be seamlessly integrated within existing constrained beam-search decoding algorithms. On five language pairs, including two distant language pairs, we achieve consistent drop in alignment error rates. When deployed on seven lexically constrained translation tasks, we achieve significant improvements in BLEU specifically around the constrained positions.
While a great deal of work has been done on NLP approaches to lexical semantic change detection other aspects of language change have received less attention from the NLP community In this paper we address the detection of sound change through historical spelling We propose that a sound change can be captured by comparing the relative distance through time between the distributions of the characters involved before and after the change has taken place We model these distributions using PPMI character embeddings We verify this hypothesis in synthetic data and then test the methods ability to trace the well known historical change of lenition of plosives in Danish historical sources We show that the models are able to identify several of the changes under consideration and to uncover meaningful contexts in which they appeared The methodology has the potential to contribute to the study of open questions such as the relative chronology of sound shifts and their geographical distribution
Simultaneous machine translation (SiMT) starts translating while receiving the streaming source inputs, and hence the source sentence is always incomplete during translating. Different from the full-sentence MT using the conventional seq-to-seq architecture, SiMT often applies prefix-to-prefix architecture, which forces each target word to only align with a partial source prefix to adapt to the incomplete source in streaming inputs. However, the source words in the front positions are always illusoryly considered more important since they appear in more prefixes, resulting in position bias, which makes the model pay more attention on the front source positions in testing. In this paper, we first analyze the phenomenon of position bias in SiMT, and develop a Length-Aware Framework to reduce the position bias by bridging the structural gap between SiMT and full-sentence MT. Specifically, given the streaming inputs, we first predict the full-sentence length and then fill the future source position with positional encoding, thereby turning the streaming inputs into a pseudo full-sentence. The proposed framework can be integrated into most existing SiMT methods to further improve performance. Experiments on two representative SiMT methods, including the state-of-the-art adaptive policy, show that our method successfully reduces the position bias and thereby achieves better SiMT performance.
We present a novel pipeline for the collection of parallel data for the detoxification task We collect non toxic paraphrases for over 10,000 English toxic sentences We also show that this pipeline can be used to distill a large existing corpus of paraphrases to get toxic neutral sentence pairs We release two parallel corpora which can be used for the training of detoxification models To the best of our knowledge these are the first parallel datasets for this task We describe our pipeline in detail to make it fast to set up for a new language or domain thus contributing to faster and easier development of new parallel resources We train several detoxification models on the collected data and compare them with several baselines and state of the art unsupervised approaches We conduct both automatic and manual evaluations All models trained on parallel data outperform the state of the art unsupervised models by a large margin This suggests that our novel datasets can boost the performance of detoxification systems
While neural text-to-speech systems perform remarkably well in high-resource scenarios, they cannot be applied to the majority of the over 6,000 spoken languages in the world due to a lack of appropriate training data. In this work, we use embeddings derived from articulatory vectors rather than embeddings derived from phoneme identities to learn phoneme representations that hold across languages. In conjunction with language agnostic meta learning, this enables us to fine-tune a high-quality text-to-speech model on just 30 minutes of data in a previously unseen language spoken by a previously unseen speaker.
For a natural language understanding benchmark to be useful in research it has to consist of examples that are diverse and difficult enough to discriminate among current and near future state of the art systems However we do not yet know how best to select text sources to collect a variety of challenging examples In this study we crowdsource multiple choice reading comprehension questions for passages taken from seven qualitatively distinct sources analyzing what attributes of passages contribute to the difficulty and question types of the collected examples To our surprise we find that passage source length and readability measures do not significantly affect question difficulty Through our manual annotation of seven reasoning types we observe several trends between passage sources and reasoning types e.g. logical reasoning is more often required in questions written for technical passages These results suggest that when creating a new benchmark dataset selecting a diverse set of passages can help ensure a diverse range of question types but that passage difficulty need not be a priority
Various efforts in the Natural Language Processing (NLP) community have been made to accommodate linguistic diversity and serve speakers of many different languages. However, it is important to acknowledge that speakers and the content they produce and require, vary not just by language, but also by culture. Although language and culture are tightly linked, there are important differences. Analogous to cross-lingual and multilingual NLP, cross-cultural and multicultural NLP considers these differences in order to better serve users of NLP systems. We propose a principled framework to frame these efforts, and survey existing and potential strategies.
Prompt-based tuning for pre-trained language models (PLMs) has shown its effectiveness in few-shot learning. Typically, prompt-based tuning wraps the input text into a cloze question. To make predictions, the model maps the output words to labels via a verbalizer, which is either manually designed or automatically built. However, manual verbalizers heavily depend on domain-specific prior knowledge and human efforts, while finding appropriate label words automatically still remains challenging.In this work, we propose the prototypical verbalizer (ProtoVerb) which is built directly from training data. Specifically, ProtoVerb learns prototype vectors as verbalizers by contrastive learning. In this way, the prototypes summarize training instances and are able to enclose rich class-level semantics. We conduct experiments on both topic classification and entity typing tasks, and the results demonstrate that ProtoVerb significantly outperforms current automatic verbalizers, especially when training data is extremely scarce. More surprisingly, ProtoVerb consistently boosts prompt-based tuning even on untuned PLMs, indicating an elegant non-tuning way to utilize PLMs. Our codes are avaliable at https://github.com/thunlp/OpenPrompt.
We introduce and study the task of clickbait spoiling generating a short text that satisfies the curiosity induced by a clickbait post Clickbait links to a web page and advertises its contents by arousing curiosity instead of providing an informative summary Our contributions are approaches to classify the type of spoiler needed i.e. a phrase or a passage and to generate appropriate spoilers A large scale evaluation and error analysis on a new corpus of 5,000 manually spoiled clickbait posts --- the Webis Clickbait Spoiling Corpus~2022 --shows that our spoiler type classifier achieves an accuracy of~80\\% while the question answering model DeBERTa large outperforms all others in generating spoilers for both types
Hierarchical text classification is a challenging subtask of multi label classification due to its complex label hierarchy Existing methods encode text and label hierarchy separately and mix their representations for classification where the hierarchy remains unchanged for all input text Instead of modeling them separately in this work we propose Hierarchy guided Contrastive Learning HGCLR to directly embed the hierarchy into a text encoder During training HGCLR constructs positive samples for input text under the guidance of the label hierarchy By pulling together the input text and its positive sample the text encoder can learn to generate the hierarchy aware text representation independently Therefore after training the HGCLR enhanced text encoder can dispense with the redundant hierarchy Extensive experiments on three benchmark datasets verify the effectiveness of HGCLR
To alleviate the data scarcity problem in training question answering systems, recent works propose additional intermediate pre-training for dense passage retrieval (DPR). However, there still remains a large discrepancy between the provided upstream signals and the downstream question-passage relevance, which leads to less improvement. To bridge this gap, we propose the HyperLink-induced Pre-training (HLP), a method to pre-train the dense retriever with the text relevance induced by hyperlink-based topology within Web documents. We demonstrate that the hyperlink-based structures of dual-link and co-mention can provide effective relevance signals for large-scale pre-training that better facilitate downstream passage retrieval. We investigate the effectiveness of our approach across a wide range of open-domain QA datasets under zero-shot, few-shot, multi-hop, and out-of-domain scenarios. The experiments show our HLP outperforms the BM25 by up to 7 points as well as other pre-training methods by more than 10 points in terms of top-20 retrieval accuracy under the zero-shot scenario. Furthermore, HLP significantly outperforms other pre-training methods under the other scenarios.
Model ensemble is a popular approach to produce a low variance and well generalized model However it induces large memory and inference costs which is often not affordable for real world deployment Existing work has resorted to sharing weights among models However when increasing the proportion of the shared weights the resulting models tend to be similar and the benefits of using model ensemble diminish To retain ensemble benefits while maintaining a low memory cost we propose a consistency regularized ensemble learning approach based on perturbed models named CAMERO Specifically we share the weights of bottom layers across all models and apply different perturbations to the hidden representations for different models which can effectively promote the model diversity Meanwhile we apply a prediction consistency regularizer across the perturbed models to control the variance due to the model diversity Our experiments using large language models demonstrate that CAMERO significantly improves the generalization performance of the ensemble model Specifically CAMERO outperforms the standard ensemble of BERT base models on the GLUE benchmark by 0.7 with a significantly smaller model size 114.2$M vs. 880.6$M
Grammatical Error Correction (GEC) should not focus only on high accuracy of corrections but also on interpretability for language learning.However, existing neural-based GEC models mainly aim at improving accuracy, and their interpretability has not been explored.A promising approach for improving interpretability is an example-based method, which uses similar retrieved examples to generate corrections. In addition, examples are beneficial in language learning, helping learners understand the basis of grammatically incorrect/correct texts and improve their confidence in writing.Therefore, we hypothesize that incorporating an example-based method into GEC can improve interpretability as well as support language learners.In this study, we introduce an Example-Based GEC (EB-GEC) that presents examples to language learners as a basis for a correction result.The examples consist of pairs of correct and incorrect sentences similar to a given input and its predicted correction.Experiments demonstrate that the examples presented by EB-GEC help language learners decide to accept or refuse suggestions from the GEC output.Furthermore, the experiments also show that retrieved examples improve the accuracy of corrections.
Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models. However, such encoder-decoder framework is sub-optimal for auto-regressive tasks, especially code completion that requires a decoder-only manner for efficient inference. In this paper, we present UniXcoder, a unified cross-modal pre-trained model for programming language. The model utilizes mask attention matrices with prefix adapters to control the behavior of the model and leverages cross-modal contents like AST and code comment to enhance code representation. To encode AST that is represented as a tree in parallel, we propose a one-to-one mapping method to transform AST in a sequence structure that retains all structural information from the tree. Furthermore, we propose to utilize multi-modal contents to learn representation of code fragment with contrastive learning, and then align representations among programming languages using a cross-modal generation task. We evaluate UniXcoder on five code-related tasks over nine datasets. To further evaluate the performance of code fragment representation, we also construct a dataset for a new task, called zero-shot code-to-code search. Results show that our model achieves state-of-the-art performance on most tasks and analysis reveals that comment and AST can both enhance UniXcoder.
Recent studies have shown that multilingual pretrained language models can be effectively improved with cross-lingual alignment information from Wikipedia entities.However, existing methods only exploit entity information in pretraining and do not explicitly use entities in downstream tasks.In this study, we explore the effectiveness of leveraging entity representations for downstream cross-lingual tasks.We train a multilingual language model with 24 languages with entity representations and showthe model consistently outperforms word-based pretrained models in various cross-lingual transfer tasks.We also analyze the model and the key insight is that incorporating entity representations into the input allows us to extract more language-agnostic features.We also evaluate the model with a multilingual cloze prompt task with the mLAMA dataset.We show that entity-based prompt elicits correct factual knowledge more likely than using only word representations.
Transformer architectures have achieved state- of-the-art results on a variety of natural language processing (NLP) tasks. However, their attention mechanism comes with a quadratic complexity in sequence lengths, making the computational overhead prohibitive, especially for long sequences. Attention context can be seen as a random-access memory with each token taking a slot. Under this perspective, the memory size grows linearly with the sequence length, and so does the overhead of reading from it. One way to improve the efficiency is to bound the memory size. We show that disparate approaches can be subsumed into one abstraction, attention with bounded-memory control (ABC), and they vary in their organization of the memory. ABC reveals new, unexplored possibilities. First, it connects several efficient attention variants that would otherwise seem apart. Second, this abstraction gives new insights—an established approach (Wang et al., 2020b) previously thought to not be applicable in causal attention, actually is. Last, we present a new instance of ABC, which draws inspiration from existing ABC approaches, but replaces their heuristic memory-organizing functions with a learned, contextualized one. Our experiments on language modeling, machine translation, and masked language model finetuning show that our approach outperforms previous efficient attention models; compared to the strong transformer baselines, it significantly improves the inference time and space efficiency with no or negligible accuracy loss.
Neural coreference resolution models trained on one dataset may not transfer to new low resource domains Active learning mitigates this problem by sampling a small subset of data for annotators to label While active learning is well defined for classification tasks its application to coreference resolution is neither well defined nor fully understood This paper explores how to actively label coreference examining sources of model uncertainty and document reading costs We compare uncertainty sampling strategies and their advantages through thorough error analysis In both synthetic and human experiments labeling spans within the same document is more effective than annotating spans across documents The findings contribute to a more realistic development of coreference resolution models
Although transformers are remarkably effective for many tasks, there are some surprisingly easy-looking regular languages that they struggle with. Hahn shows that for languages where acceptance depends on a single input symbol, a transformer’s classification decisions get closer and closer to random guessing (that is, a cross-entropy of 1) as input strings get longer and longer. We examine this limitation using two languages: PARITY, the language of bit strings with an odd number of 1s, and FIRST, the language of bit strings starting with a 1. We demonstrate three ways of overcoming the limitation implied by Hahn’s lemma. First, we settle an open question by constructing a transformer that recognizes PARITY with perfect accuracy, and similarly for FIRST. Second, we use layer normalization to bring the cross-entropy of both models arbitrarily close to zero. Third, when transformers need to focus on a single position, as for FIRST, we find that they can fail to generalize to longer strings; we offer a simple remedy to this problem that also improves length generalization in machine translation.
Regularization methods applying input perturbation have drawn considerable attention and have been frequently explored for NMT tasks in recent years Despite their simplicity and effectiveness we argue that these methods are limited by the under fitting of training data In this paper we utilize prediction difference for ground truth tokens to analyze the fitting of token level samples and find that under fitting is almost as common as over fitting We introduce prediction difference regularization PD R a simple and effective method that can reduce over fitting and under fitting at the same time For all token level samples PD R minimizes the prediction difference between the original pass and the input perturbed pass making the model less sensitive to small input changes thus more robust to both perturbations and under fitted training data Experiments on three widely used WMT translation tasks show that our approach can significantly improve over existing perturbation regularization methods On WMT16 En De task our model achieves 1.80 SacreBLEU improvement over vanilla transformer
Cross-lingual transfer learning with large multilingual pre-trained models can be an effective approach for low-resource languages with no labeled training data. Existing evaluations of zero-shot cross-lingual generalisability of large pre-trained models use datasets with English training data, and test data in a selection of target languages. We explore a more extensive transfer learning setup with 65 different source languages and 105 target languages for part-of-speech tagging. Through our analysis, we show that pre-training of both source and target language, as well as matching language families, writing systems, word order systems, and lexical-phonetic distance significantly impact cross-lingual performance. The findings described in this paper can be used as indicators of which factors are important for effective zero-shot cross-lingual transfer to zero- and low-resource languages.
With the rapid development of deep learning, Seq2Seq paradigm has become prevalent for end-to-end data-to-text generation, and the BLEU scores have been increasing in recent years. However, it is widely recognized that there is still a gap between the quality of the texts generated by models and the texts written by human. In order to better understand the ability of Seq2Seq models, evaluate their performance and analyze the results, we choose to use Multidimensional Quality Metric(MQM) to evaluate several representative Seq2Seq models on end-to-end data-to-text generation. We annotate the outputs of five models on four datasets with eight error types and find that 1) copy mechanism is helpful for the improvement in Omission and Inaccuracy Extrinsic errors but it increases other types of errors such as Addition; 2) pre-training techniques are highly effective, and pre-training strategy and model size are very significant; 3) the structure of the dataset also influences the model’s performance greatly; 4) some specific types of errors are generally challenging for seq2seq models.
Structured document understanding has attracted considerable attention and made significant progress recently, owing to its crucial role in intelligent document processing. However, most existing related models can only deal with the document data of specific language(s) (typically English) included in the pre-training collection, which is extremely limited. To address this issue, we propose a simple yet effective Language-independent Layout Transformer (LiLT) for structured document understanding. LiLT can be pre-trained on the structured documents of a single language and then directly fine-tuned on other languages with the corresponding off-the-shelf monolingual/multilingual pre-trained textual models. Experimental results on eight languages have shown that LiLT can achieve competitive or even superior performance on diverse widely-used downstream benchmarks, which enables language-independent benefit from the pre-training of document layout structure. Code and model are publicly available at https://github.com/jpWang/LiLT.
Identifying argument components from unstructured texts and predicting the relationships expressed among them are two primary steps of argument mining The intrinsic complexity of these tasks demands powerful learning models While pretrained Transformer based Language Models LM have been shown to provide state of the art results over different NLP tasks the scarcity of manually annotated data and the highly domain dependent nature of argumentation restrict the capabilities of such models In this work we propose a novel transfer learning strategy to overcome these challenges We utilize argumentation rich social discussions from the subreddit as a source of unsupervised argumentative discourse aware knowledge by finetuning pretrained LMs on a selectively masked language modeling task Furthermore we introduce a novel prompt based strategy for inter component relation prediction that compliments our proposed finetuning method while leveraging on the discourse context Exhaustive experiments show the generalization capability of our method on these two tasks over within domain as well as out of domain datasets outperforming several existing and employed strong baselinesChangeMyView subreddit as a source of unsupervised, argumentative discourse-aware knowledge by finetuning pretrained LMs on a selectively masked language modeling task. Furthermore, we introduce a novel prompt-based strategy for inter-component relation prediction that compliments our proposed finetuning method while leveraging on the discourse context. Exhaustive experiments show the generalization capability of our method on these two tasks over within-domain as well as out-of-domain datasets, outperforming several existing and employed strong baselines.
In this paper, we propose an entity-based neural local coherence model which is linguistically more sound than previously proposed neural coherence models. Recent neural coherence models encode the input document using large-scale pretrained language models. Hence their basis for computing local coherence are words and even sub-words. The analysis of their output shows that these models frequently compute coherence on the basis of connections between (sub-)words which, from a linguistic perspective, should not play a role. Still, these models achieve state-of-the-art performance in several end applications. In contrast to these models, we compute coherence on the basis of entities by constraining the input to noun phrases and proper names. This provides us with an explicit representation of the most important items in sentences leading to the notion of focus. This brings our model linguistically in line with pre-neural models of computing coherence. It also gives us better insight into the behaviour of the model thus leading to better explainability. Our approach is also in accord with a recent study (O’Connor and Andreas, 2021), which shows that most usable information is captured by nouns and verbs in transformer-based language models. We evaluate our model on three downstream tasks showing that it is not only linguistically more sound than previous models but also that it outperforms them in end applications.
Language model LM pretraining captures various knowledge from text corpora helping downstream tasks However existing methods such as BERT model a single document and do not capture dependencies or knowledge that span across documents In this work we propose LinkBERT an LM pretraining method that leverages links between documents e.g. hyperlinks Given a text corpus we view it as a graph of documents and create LM inputs by placing linked documents in the same context We then pretrain the LM with two joint self supervised objectives masked language modeling and our new proposal document relation prediction We show that LinkBERT outperforms BERT on various downstream tasks across two domains the general domain pretrained on Wikipedia with hyperlinks and biomedical domain pretrained on PubMed with citation links LinkBERT is especially effective for multi hop reasoning and few shot QA +5 absolute improvement on HotpotQA and TriviaQA and our biomedical LinkBERT sets new states of the art on various BioNLP tasks +7 on BioASQ and USMLE We release our pretrained models LinkBERT and BioLinkBERT as well as code and data
We teach goal driven agents to interactively act and speak in situated environments by training on generated curriculums Our agents operate in LIGHT Urbanek et al 2019)---a large scale crowd sourced fantasy text adventure game wherein an agent perceives and interacts with the world through textual natural language Goals in this environment take the form of character based quests consisting of personas and motivations We augment LIGHT by learning to procedurally generate additional novel textual worlds and quests to create a curriculum of steadily increasing difficulty for training agents to achieve such goals In particular we measure curriculum difficulty in terms of the rarity of the quest in the original training distribution --- an easier environment is one that is more likely to have been found in the unaugmented dataset An ablation study shows that this method of learning from the tail of a distribution results in significantly higher generalization abilities as measured by zero shot performance on never before seen quests
Program induction for answering complex questions over knowledge bases KBs aims to decompose a question into a multi step program whose execution against the KB produces the final answer Learning to induce programs relies on a large number of parallel question program pairs for the given KB However for most KBs the gold program annotations are usually lacking making learning difficult In this paper we propose the approach of program transfer which aims to leverage the valuable program annotations on the rich resourced KBs as external supervision signals to aid program induction for the low resourced KBs that lack program annotations For program transfer we design a novel two stage parsing framework with an efficient ontology guided pruning strategy First a sketch parser translates the question into a high level program sketch which is the composition of functions Second given the question and sketch an argument parser searches the detailed arguments from the KB for functions During the searching we incorporate the KB ontology to prune the search space The experiments on ComplexWebQuestions and WebQuestionSP show that our method outperforms SOTA methods significantly demonstrating the effectiveness of program transfer and our framework Our codes and datasets can be obtained from https://github.com/THU-KEG/ProgramTransfer
Prompts for pre-trained language models (PLMs) have shown remarkable performance by bridging the gap between pre-training tasks and various downstream tasks. Among these methods, prompt tuning, which freezes PLMs and only tunes soft prompts, provides an efficient and effective solution for adapting large-scale PLMs to downstream tasks. However, prompt tuning is yet to be fully explored. In our pilot experiments, we find that prompt tuning performs comparably with conventional full-model tuning when downstream data are sufficient, whereas it is much worse under few-shot learning settings, which may hinder the application of prompt tuning. We attribute this low performance to the manner of initializing soft prompts. Therefore, in this work, we propose to pre-train prompts by adding soft prompts into the pre-training stage to obtain a better initialization. We name this Pre-trained Prompt Tuning framework “PPT”. To ensure the generalization of PPT, we formulate similar classification tasks into a unified task form and pre-train soft prompts for this unified task. Extensive experiments show that tuning pre-trained prompts for downstream tasks can reach or even outperform full-model fine-tuning under both full-data and few-shot settings. Our approach is effective and efficient for using large-scale PLMs in practice.
We find that existing language modeling datasets contain many near duplicate examples and long repetitive substrings As a result over of the unprompted output of language models trained on these datasets is copied verbatim from the training data We develop two tools that allow us to deduplicate training datasets --- for example removing from C4 a single word English sentence that is repeated over 60,000 times Deduplication allows us to train models that emit memorized text ten times less frequently and require fewer training steps to achieve the same or better accuracy We can also reduce train test overlap which affects over of the validation set of standard datasets thus allowing for more accurate evaluation Code for deduplication is released at https://github.com/google-research/deduplicate-text-datasets
The largest store of continually updating knowledge on our planet can be accessed via internet search. In this work we study giving access to this information to conversational agents. Large language models, even though they store an impressive amount of knowledge within their weights, are known to hallucinate facts when generating dialogue (Shuster et al., 2021); moreover, those facts are frozen in time at the point of model training. In contrast, we propose an approach that learns to generate an internet search query based on the context, and then conditions on the search results to finally generate a response, a method that can employ up-to-the-minute relevant information. We train and evaluate such models on a newly collected dataset of human-human conversations whereby one of the speakers is given access to internet search during knowledgedriven discussions in order to ground their responses. We find that search-query based access of the internet in conversation provides superior performance compared to existing approaches that either use no augmentation or FAISS-based retrieval (Lewis et al., 2020b).
Large scale pretrained language models are surprisingly good at recalling factual knowledge presented in the training corpus In this paper we present preliminary studies on how factual knowledge is stored in pretrained Transformers by introducing the concept of knowledge neurons Specifically we examine the fill in the blank cloze task for BERT Given a relational fact we propose a knowledge attribution method to identify the neurons that express the fact We find that the activation of such knowledge neurons is positively correlated to the expression of their corresponding facts In our case studies we attempt to leverage knowledge neurons to edit such as update and erase specific factual knowledge without fine tuning Our results shed light on understanding the storage of knowledge within pretrained Transformers
We study the problem of building text classifiers with little or no training data, commonly known as zero and few-shot text classification. In recent years, an approach based on neural textual entailment models has been found to give strong results on a diverse range of tasks. In this work, we show that with proper pre-training, Siamese Networks that embed texts and labels offer a competitive alternative. These models allow for a large reduction in inference cost: constant in the number of labels rather than linear. Furthermore, we introduce label tuning, a simple and computationally efficient approach that allows to adapt the models in a few-shot setup by only changing the label embeddings. While giving lower performance than model fine-tuning, this approach has the architectural advantage that a single encoder can be shared by many different tasks.
Generating factual, long-form text such as Wikipedia articles raises three key challenges: how to gather relevant evidence, how to structure information into well-formed text, and how to ensure that the generated text is factually correct. We address these by developing a model for English text that uses a retrieval mechanism to identify relevant supporting information on the web and a cache-based pre-trained encoder-decoder to generate long-form biographies section by section, including citation information. To assess the impact of available web evidence on the output text, we compare the performance of our approach when generating biographies about women (for which less information is available on the web) vs. biographies generally. To this end, we curate a dataset of 1,500 biographies about women. We analyze our generated text to understand how differences in available web evidence data affect generation. We evaluate the factuality, fluency, and quality of the generated texts using automatic metrics and human evaluation. We hope that these techniques can be used as a starting point for human writers, to aid in reducing the complexity inherent in the creation of long-form, factual text.
In many natural language processing (NLP) tasks the same input (e.g. source sentence) can have multiple possible outputs (e.g. translations). To analyze how this ambiguity (also known as intrinsic uncertainty) shapes the distribution learned by neural sequence models we measure sentence-level uncertainty by computing the degree of overlap between references in multi-reference test sets from two different NLP tasks: machine translation (MT) and grammatical error correction (GEC). At both the sentence- and the task-level, intrinsic uncertainty has major implications for various aspects of search such as the inductive biases in beam search and the complexity of exact search. In particular, we show that well-known pathologies such as a high number of beam search errors, the inadequacy of the mode, and the drop in system performance with large beam sizes apply to tasks with high level of ambiguity such as MT but not to less uncertain tasks such as GEC. Furthermore, we propose a novel exact n-best search algorithm for neural sequence models, and show that intrinsic uncertainty affects model uncertainty as the model tends to overly spread out the probability mass for uncertain tasks and sentences.
Most previous methods for text data augmentation are limited to simple tasks and weak baselines. We explore data augmentation on hard tasks (i.e., few-shot natural language understanding) and strong baselines (i.e., pretrained models with over one billion parameters). Under this setting, we reproduced a large number of previous augmentation methods and found that these methods bring marginal gains at best and sometimes degrade the performance much. To address this challenge, we propose a novel data augmentation method FlipDA that jointly uses a generative model and a classifier to generate label-flipped data. Central to the idea of FlipDA is the discovery that generating label-flipped data is more crucial to the performance than generating label-preserved data. Experiments show that FlipDA achieves a good tradeoff between effectiveness and robustness—it substantially improves many tasks while not negatively affecting the others.
This work presents methods for learning cross lingual sentence representations using paired or unpaired bilingual texts We hypothesize that the cross lingual alignment strategy is transferable and therefore a model trained to align only two languages can encode multilingually more aligned representations We thus introduce dual pivot transfer training on one language pair and evaluating on other pairs To study this theory we design unsupervised models trained on unpaired sentences and single pair supervised models trained on bitexts both based on the unsupervised language model XLM R with its parameters frozen The experiments evaluate the models as universal sentence encoders on the task of unsupervised bitext mining on two datasets where the unsupervised model reaches the state of the art of unsupervised retrieval and the alternative single pair supervised model approaches the performance of multilingually supervised models The results suggest that bilingual training techniques as proposed can be applied to get sentence representations with multilingual alignment
Transformer-based language models such as BERT (CITATION) have achieved the state-of-the-art performance on various NLP tasks, but are computationally prohibitive. A recent line of works use various heuristics to successively shorten sequence length while transforming tokens through encoders, in tasks such as classification and ranking that require a single token embedding for prediction.We present a novel solution to this problem, called Pyramid-BERT where we replace previously used heuristics with a core-set based token selection method justified by theoretical results. The core-set based token selection technique allows us to avoid expensive pre-training, gives a space-efficient fine tuning, and thus makes it suitable to handle longer sequence lengths. We provide extensive experiments establishing advantages of pyramid BERT over several baselines and existing works on the GLUE benchmarks and Long Range Arena (CITATION) datasets.