Despite the importance of relation extraction in building and representing knowledge, little research has focused on generalizing to unseen relation types. We introduce the task setting of Zero-Shot Relation Triplet Extraction (ZeroRTE) to encourage further research in low-resource relation extraction methods. Given an input sentence, each extracted triplet consists of the head entity, relation label, and tail entity, where the relation label is not seen at the training stage. To solve ZeroRTE, we propose to synthesize relation examples by prompting language models to generate structured texts. Concretely, we unify language model prompts and structured text approaches to design a structured prompt template for generating synthetic relation samples when conditioning on relation label prompts (RelationPrompt). To overcome the limitation in extracting multiple relation triplets from a sentence, we design a novel Triplet Search Decoding method. Experiments on the FewRel and Wiki-ZSL datasets show the efficacy of RelationPrompt for the ZeroRTE task and zero-shot relation classification. Our code and data are available at github.com/declare-lab/RelationPrompt.
The table-based fact verification task has recently gained widespread attention, yet it remains a very challenging problem. It inherently requires informative reasoning over natural language together with different numerical and logical reasoning on tables (e.g., count, superlative, comparative). Considering that, we exploit mixture-of-experts and present in this paper a new method: Self-adaptive Mixture-of-Experts Network (SaMoE). Specifically, we have developed a mixture-of-experts neural network to recognize and execute different types of reasoning: the network is composed of multiple experts, each handling a specific part of the semantics for reasoning, whereas a management module is applied to decide the contribution of each expert network to the verification result. A self-adaptive method is developed to teach the management module to combine the results of different experts more efficiently without external knowledge. The experimental results illustrate that our framework achieves 85.1% accuracy on the benchmark dataset TabFact, comparable with the previous state-of-the-art models. We hope our framework can serve as a new baseline for table-based verification. Our code is available at https://github.com/THUMLP/SaMoE.
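The role of the management module described above can be illustrated with a minimal sketch, assuming the common mixture-of-experts formulation in which gate scores are normalized with a softmax and used to weight each expert's class distribution. The function names, the two-class setup, and all numbers are hypothetical and are not taken from the SaMoE code.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine_experts(expert_probs, gate_scores):
    """Weight each expert's class distribution by its gate weight and sum."""
    weights = softmax(gate_scores)
    n_classes = len(expert_probs[0])
    return [
        sum(w * probs[c] for w, probs in zip(weights, expert_probs))
        for c in range(n_classes)
    ]

# Hypothetical outputs of three experts over (entailed, refuted).
expert_probs = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]
# Hypothetical gate scores: the manager trusts the first expert most.
gate_scores = [2.0, 0.0, 0.0]
mixed = combine_experts(expert_probs, gate_scores)
print(mixed)
```

With equal gate scores the result reduces to a plain average of the experts, so any gain must come from the manager learning which expert to trust for a given statement.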
Recognizing facts is the most fundamental step in making judgments, hence detecting events in legal documents is important to legal case analysis tasks. However, existing Legal Event Detection (LED) datasets cover only an incomplete set of event types and have limited annotated data, which restricts the development of LED methods and their downstream applications. To alleviate these issues, we present LEVEN, a large-scale Chinese LEgal eVENt detection dataset, with 8,116 legal documents and 150,977 human-annotated event mentions in 108 event types. In addition to charge-related events, LEVEN also covers general events, which are critical for legal case understanding but neglected in existing LED datasets. To our knowledge, LEVEN is the largest LED dataset and has dozens of times the data scale of others, which shall significantly promote the training and evaluation of LED methods. The results of extensive experiments indicate that LED is challenging and needs further effort. Moreover, we simply utilize legal events as side information to promote downstream applications. The method achieves average improvements of 2.2 points in precision in low-resource judgment prediction and 1.5 points in mean average precision in unsupervised case retrieval, which suggests the fundamental role of LED. The source code and dataset can be obtained from https://github.com/thunlp/LEVEN.
We present RuCCoN, a new dataset for clinical concept normalization in Russian manually annotated by medical professionals. It contains over 16,028 entity mentions manually linked to over 2,409 unique concepts from the Russian language part of the UMLS ontology. We provide train/test splits for different settings (stratified, zero-shot, and CUI-less) and present strong baselines obtained with state-of-the-art models such as SapBERT. At present, Russian medical NLP is lacking in both datasets and trained models, and we view this work as an important step towards filling this gap. Our dataset and annotation guidelines are available at https://github.com/sberbank-ai-lab/RuCCoN.
Hate speech classifiers exhibit substantial performance degradation when evaluated on datasets different from the source. This is due to learning spurious correlations between words that are not necessarily relevant to hateful language, and hate speech labels from the training corpus. Previous work has attempted to mitigate this problem by regularizing specific terms from pre-defined static dictionaries. While this has been demonstrated to improve the generalizability of classifiers, the coverage of such methods is limited and the dictionaries require regular manual updates from human experts. In this paper, we propose to automatically identify and reduce spurious correlations using attribution methods with dynamic refinement of the list of terms that need to be regularized during training. Our approach is flexible and improves the cross-corpora performance over previous work independently and in combination with pre-defined dictionaries.
Probing is a popular way to analyze whether linguistic information can be captured by a well-trained deep neural model, but it is hard to answer how a change in the encoded linguistic information will affect task performance. To this end, we study the dynamic relationship between the encoded linguistic information and task performance from the viewpoint of Pareto Optimality. Its key idea is to obtain a set of models which are Pareto-optimal in terms of both objectives. From this viewpoint, we propose a method to optimize the Pareto-optimal models by formalizing it as a multi-objective optimization problem. We conduct experiments on two popular NLP tasks, i.e., machine translation and language modeling, and investigate the relationship between several kinds of linguistic information and task performances. Experimental results demonstrate that the proposed method is better than a baseline method. Our empirical findings suggest that some syntactic information is helpful for NLP tasks, whereas encoding more syntactic information does not necessarily lead to better performance, because the model architecture is also an important factor.
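The notion of Pareto optimality over two objectives can be made concrete with a small sketch. It only filters a given set of candidate models down to the non-dominated ones; it is not the multi-objective optimization method the abstract proposes. The score pairs (task performance, probing accuracy) are hypothetical, and higher is assumed to be better on both.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly
    better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Hypothetical (task performance, probing accuracy) pairs for five models.
models = [(0.30, 0.80), (0.35, 0.70), (0.28, 0.75), (0.35, 0.65), (0.20, 0.90)]
front = pareto_front(models)
print(front)
```

Each model on the front trades one objective against the other; the remaining models are worse than some front model on both.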
The automation of extracting argument structures faces a pair of challenges: (1) encoding long-term contexts to facilitate comprehensive understanding, and (2) improving data efficiency, since constructing high-quality argument structures is time-consuming. In this work, we propose a novel context-aware Transformer-based argument structure prediction model which, on five different domains, significantly outperforms models that rely on features or only encode limited contexts. To tackle the difficulty of data annotation, we examine two complementary methods: (i) transfer learning to leverage existing annotated data to boost model performance in a new target domain, and (ii) active learning to strategically identify a small number of samples for annotation. We further propose model-independent sample acquisition strategies, which can be generalized to diverse domains. With extensive experiments, we show that our simple yet effective acquisition strategies yield competitive results against three strong comparisons. Combined with transfer learning, a substantial F1 score boost can be further achieved during the early iterations of active learning across domains.
Code mixing is the linguistic phenomenon where bilingual speakers tend to switch between two or more languages in conversations. Recent work on code-mixing in computational settings has leveraged social media code-mixed texts to train NLP models. To capture the variety of code mixing within and across corpora, Language ID (LID) tag-based measures (CMI) have been proposed. The syntactic variety and patterns of code-mixing, and their relationship to the performance of computational models, are underexplored. In this work, we investigate a collection of English (en)-Hindi (hi) code-mixed datasets through a syntactic lens to propose SyMCoM, an indicator of syntactic variety in code-mixed text, with intuitive theoretical bounds. We train a SoTA en-hi PoS tagger, with an accuracy of 93.4%, to reliably compute PoS tags on a corpus, and demonstrate the utility of SyMCoM by applying it to various syntactic categories on a collection of datasets and comparing datasets using the measure.
Reddit is home to a broad spectrum of political activity, and users signal their political affiliations in multiple ways—from self-declarations to community participation. Frequently, computational studies have treated political users as a single bloc, both in developing models to infer political leaning and in studying political behavior. Here, we test this assumption of political users and show that commonly-used political-inference models do not generalize, indicating heterogeneous types of political users. The models remain imprecise at best for most users, regardless of which sources of data or methods are used. Across a 14-year longitudinal analysis, we demonstrate that the choice in definition of a political user has significant implications for behavioral analysis. Controlling for multiple factors, political users are more toxic on the platform and inter-party interactions are even more toxic—but not all political users behave this way. Last, we identify a subset of political users who repeatedly flip affiliations, showing that these users are the most controversial of all, acting as provocateurs by more frequently bringing up politics, and are more likely to be banned, suspended, or deleted.
Pre-trained models have achieved excellent performance on the dialogue task. However, for the continual increase of online chit-chat scenarios, directly fine-tuning these models for each of the new tasks not only explodes the capacity of the dialogue system on the embedded devices but also causes knowledge forgetting on pre-trained models and knowledge interference among diverse dialogue tasks. In this work, we propose a hierarchical inductive transfer framework to learn and deploy the dialogue skills continually and efficiently. First, we introduce the adapter module into pre-trained models for learning new dialogue tasks. As the only trainable module, it is beneficial for the dialogue system on the embedded devices to acquire new dialogue skills with negligible additional parameters. Then, for alleviating knowledge interference between tasks yet benefiting the regularization between them, we further design hierarchical inductive transfer that enables new tasks to use general knowledge in the base adapter without being misled by diverse knowledge in task-specific adapters. Empirical evaluation and analysis indicate that our framework obtains comparable performance under deployment-friendly model capacity.
Few-Shot Relation Extraction aims at predicting the relation for a pair of entities in a sentence by training with a few labelled examples in each relation. Some recent works have introduced relation information (i.e., relation labels or descriptions) to assist model learning based on Prototype Network. However, most of them constrain the prototypes of each relation class implicitly with relation information, generally through designing complex network structures, like generating hybrid features, combining with contrastive learning or attention networks. We argue that relation information can be introduced more explicitly and effectively into the model. Thus, this paper proposes a direct addition approach to introduce relation information. Specifically, for each relation class, the relation representation is first generated by concatenating two views of relations (i.e., [CLS] token embedding and the mean value of embeddings of all tokens) and then directly added to the original prototype for both training and prediction. Experimental results on the benchmark dataset FewRel 1.0 show significant improvements and achieve comparable results to the state-of-the-art, which demonstrates the effectiveness of our proposed approach. Besides, further analyses verify that the direct addition is a much more effective way to integrate the relation representations and the original prototypes.
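The direct addition step can be sketched in a few lines. This sketch assumes the usual prototypical-network definition of a prototype (the mean of the support-example embeddings) and assumes that each support embedding has twice the encoder hidden size, so that it matches the concatenated relation representation. The vectors and function names are hypothetical toy values, not output of a real encoder.

```python
def mean_vector(vectors):
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def relation_representation(cls_embedding, token_embeddings):
    """Concatenate the two views: [CLS] embedding and mean token embedding."""
    return cls_embedding + mean_vector(token_embeddings)

def enhanced_prototype(support_embeddings, cls_embedding, token_embeddings):
    """Add the relation representation directly to the class prototype."""
    prototype = mean_vector(support_embeddings)  # assumed: mean of support set
    relation = relation_representation(cls_embedding, token_embeddings)
    if len(prototype) != len(relation):
        raise ValueError("prototype and relation representation must match in size")
    return [p + r for p, r in zip(prototype, relation)]

# Toy example with hidden size 2, so support embeddings have size 4.
support = [[1.0, 0.0, 0.0, 1.0], [3.0, 2.0, 0.0, 1.0]]
cls_emb = [0.5, 0.5]
tokens = [[1.0, 1.0], [3.0, 1.0]]
print(enhanced_prototype(support, cls_emb, tokens))
```

Because the combination is a plain sum, it introduces no extra parameters beyond the encoder that produces the embeddings.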
Understanding causal narratives communicated in clinical notes can help make strides towards personalized healthcare. Extracted causal information from clinical notes can be combined with structured EHR data such as patients' demographics, diagnoses, and medications. This will enhance healthcare providers' ability to identify aspects of a patient's story communicated in the clinical notes and help make more informed decisions. In this work, we propose annotation guidelines, develop an annotated corpus, and provide baseline scores to identify types and direction of causal relations between a pair of biomedical concepts in clinical notes, whether communicated implicitly or explicitly, and whether identified in a single sentence or across multiple sentences. We annotate a total of 2714 de-identified examples sampled from the 2018 n2c2 shared task dataset and train four different language model based architectures. Annotation based on our guidelines achieved a high inter-annotator agreement, i.e., a Fleiss' kappa (κ) score of 0.72, and our model for identification of causal relations achieved a macro F1 score of 0.56 on the test data. The high inter-annotator agreement for clinical text shows the quality of our annotation guidelines, while the provided baseline F1 score sets the direction for future research towards understanding narratives in clinical texts.
The current Question Answering over Knowledge Graphs (KGQA) task mainly focuses on performing answer reasoning upon KGs with binary facts. However, it neglects the n-ary facts, which contain more than two entities. In this work, we highlight a more challenging but under-explored task: n-ary KGQA, i.e., answering n-ary facts questions upon n-ary KGs. Nevertheless, the multi-hop reasoning framework popular in the binary KGQA task is not directly applicable to n-ary KGQA. We propose two feasible improvements: 1) upgrade the basic reasoning unit from entity or relation to fact, and 2) upgrade the reasoning structure from chain to tree. Therefore, we propose a novel fact-tree reasoning framework, FacTree, which integrates the above two upgrades. FacTree transforms the question into a fact tree and performs iterative fact reasoning on the fact tree to infer the correct answer. Experimental results on the n-ary KGQA dataset we constructed and two binary KGQA benchmarks demonstrate the effectiveness of FacTree compared with state-of-the-art methods.
Having sufficient resources for language X lifts it from the under-resourced languages class, but not necessarily from the under-researched class. In this paper, we address the problem of the absence of organized benchmarks in the Turkish language. We demonstrate that languages such as Turkish are left behind the state-of-the-art in NLP applications. As a solution, we present Mukayese, a set of NLP benchmarks for the Turkish language that contains several NLP tasks. We work on one or more datasets for each benchmark and present two or more baselines. Moreover, we present four new benchmarking datasets in Turkish for language modeling, sentence segmentation, and spell checking. All datasets and baselines are available under https://github.com/alisafaya/mukayese
Recently, the problem of robustness of pre-trained language models (PrLMs) has received increasing research interest. The latest studies on adversarial attacks achieve high attack success rates against PrLMs, claiming that PrLMs are not robust. However, we find that the adversarial samples on which PrLMs fail are mostly non-natural and do not appear in reality. We question the validity of the current evaluation of robustness of PrLMs based on these non-natural adversarial samples and propose an anomaly detector to evaluate the robustness of PrLMs with more natural adversarial samples. We also investigate two applications of the anomaly detector: (1) In data augmentation, we employ the anomaly detector to force generating augmented data that are distinguished as non-natural, which brings larger gains to the accuracy of PrLMs. (2) We apply the anomaly detector to a defense framework to enhance the robustness of PrLMs. It can be used to defend all types of attacks and achieves higher accuracy on both adversarial samples and compliant samples than other defense frameworks.
We propose GRS, an unsupervised approach to sentence simplification that combines text generation and text revision. We start with an iterative framework in which an input sentence is revised using explicit edit operations, and add paraphrasing as a new edit operation. This allows us to combine the advantages of generative and revision-based approaches: paraphrasing captures complex edit operations, and the use of explicit edit operations in an iterative manner provides controllability and interpretability. We demonstrate these advantages of GRS compared to existing methods on the Newsela and ASSET datasets.
We introduce distributed NLI, a new NLU task with a goal to predict the distribution of human judgements for natural language inference. We show that by applying additional distribution estimation methods, namely Monte Carlo (MC) Dropout, Deep Ensemble, Re-Calibration, and Distribution Distillation, models can capture human judgement distribution more effectively than the softmax baseline. We show that MC Dropout is able to achieve decent performance without any distribution annotations, while Re-Calibration can give further improvements with extra distribution annotations, suggesting the value of multiple annotations for one example in modeling the distribution of human judgements. Despite these improvements, the best results are still far below the estimated human upper bound, indicating that predicting the distribution of human judgements is still an open, challenging problem with a large room for improvements. We showcase the common errors for MC Dropout and Re-Calibration. Finally, we give guidelines on the usage of these methods with different levels of data availability and encourage future work on modeling the human opinion distribution for language reasoning.
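As a rough illustration of MC Dropout, the sketch below keeps dropout active at prediction time and averages the softmax outputs of many stochastic forward passes. The single linear layer, the weights, and the features are hypothetical stand-ins for a real NLI model; the three outputs are assumed to correspond to entailment, neutral, and contradiction.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward_with_dropout(features, weights, drop_prob, rng):
    """One stochastic pass: inverted dropout on the inputs, then a linear layer."""
    scale = 1.0 / (1.0 - drop_prob)  # requires drop_prob < 1
    kept = [0.0 if rng.random() < drop_prob else x * scale for x in features]
    logits = [sum(w * x for w, x in zip(row, kept)) for row in weights]
    return softmax(logits)

def mc_dropout_distribution(features, weights, drop_prob=0.5, n_samples=200, seed=0):
    """Average the predicted distributions over n_samples stochastic passes."""
    rng = random.Random(seed)
    samples = [forward_with_dropout(features, weights, drop_prob, rng)
               for _ in range(n_samples)]
    return [sum(s[c] for s in samples) / n_samples for c in range(len(weights))]

# Hypothetical features and weights (one weight row per label).
features = [1.0, 2.0, 0.5]
weights = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]
dist = mc_dropout_distribution(features, weights)
print(dist)
```

The averaged distribution is typically flatter than a single deterministic softmax output, which is the property that makes it a candidate estimate of annotator disagreement.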
Learning from rationales seeks to augment model prediction accuracy using human-annotated rationales (i.e., subsets of input tokens) that justify their chosen labels, often in the form of intermediate or multitask supervision. While intuitive, this idea has proven elusive in practice. We make two observations about human rationales via empirical analyses: 1) maximizing rationale supervision accuracy is not necessarily the optimal objective for improving model accuracy; 2) human rationales vary in whether they provide sufficient information for the model to exploit for prediction. Building on these insights, we propose several novel loss functions and learning strategies, and evaluate their effectiveness on three datasets with human rationales. Our results demonstrate consistent improvements over baselines in both label and rationale accuracy, including a 3% accuracy improvement on MultiRC. Our work highlights the importance of understanding properties of human explanations and exploiting them accordingly in model training.
A critical bottleneck in supervised machine learning is the need for large amounts of labeled data, which is expensive and time-consuming to obtain. Although a small amount of labeled data cannot be used to train a model, it can be used effectively for the generation of human-interpretable labeling functions (LFs). These LFs, in turn, have been used to generate a large amount of additional noisy labeled data in a paradigm that is now commonly referred to as data programming. Previous methods of generating LFs do not attempt to use the given labeled data further to train a model, thus missing opportunities for improving performance. Additionally, since the LFs are generated automatically, they are likely to be noisy, and naively aggregating these LFs can lead to suboptimal results. In this work, we propose an LF-based bi-level optimization framework WISDOM to solve these two critical limitations. WISDOM learns a joint model on the (same) labeled dataset used for LF induction along with any unlabeled data in a semi-supervised manner, and more critically, reweighs each LF according to its goodness, influencing its contribution to the semi-supervised loss using a robust bi-level optimization algorithm. We show that WISDOM significantly outperforms prior approaches on several text classification datasets.
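To illustrate why reweighting labeling functions matters, here is a minimal sketch of a weighted vote over LFs. The three keyword-based LFs and the fixed weights are hypothetical; WISDOM learns the weights through bi-level optimization, which this sketch does not attempt.

```python
ABSTAIN = None

# Hypothetical labeling functions for spam (1) vs. not spam (0).
def lf_contains_free(text):
    return 1 if "free" in text.lower() else ABSTAIN

def lf_contains_meeting(text):
    return 0 if "meeting" in text.lower() else ABSTAIN

def lf_many_exclamations(text):
    return 1 if text.count("!") >= 2 else ABSTAIN

def weighted_vote(text, lfs, weights, n_classes=2):
    """Sum the weights of the LFs that fire for each class; abstain if none fire."""
    scores = [0.0] * n_classes
    for lf, weight in zip(lfs, weights):
        label = lf(text)
        if label is not ABSTAIN:
            scores[label] += weight
    if sum(scores) == 0:
        return ABSTAIN
    return max(range(n_classes), key=scores.__getitem__)

lfs = [lf_contains_free, lf_contains_meeting, lf_many_exclamations]
text = "Free prizes at the meeting"
print(weighted_vote(text, lfs, [0.2, 0.9, 0.5]))  # trusts the 'meeting' LF more
print(weighted_vote(text, lfs, [0.9, 0.2, 0.5]))  # trusts the 'free' LF more
```

The same text receives different labels under the two weightings, which shows how much the aggregate depends on how each noisy LF is weighted.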
Predicate entailment detection is a crucial task for question-answering from text, where previous work has explored unsupervised learning of entailment graphs from typed open relation triples. In this paper, we present the first pipeline for building Chinese entailment graphs, which involves a novel high-recall open relation extraction (ORE) method and the first Chinese fine-grained entity typing dataset under the FIGER type ontology. Through experiments on the Levy-Holt dataset, we verify the strength of our Chinese entailment graph, and reveal the cross-lingual complementarity: on the parallel Levy-Holt dataset, an ensemble of Chinese and English entailment graphs outperforms both monolingual graphs, and raises unsupervised SOTA by 4.7 AUC points.
After a period of decrease, interest in word alignments is increasing again for their usefulness in domains such as typological research, cross-lingual annotation projection, and machine translation. Generally, alignment algorithms only use bitext and do not make use of the fact that many parallel corpora are multiparallel. Here, we compute high-quality word alignments between multiple language pairs by considering all language pairs together. First, we create a multiparallel word alignment graph, joining all bilingual word alignment pairs in one graph. Next, we use graph neural networks (GNNs) to exploit the graph structure. Our GNN approach (i) utilizes information about the meaning, position, and language of the input words, (ii) incorporates information from multiple parallel sentences, (iii) adds and removes edges from the initial alignments, and (iv) yields a prediction model that can generalize beyond the training sentences. We show that community detection algorithms can provide valuable information for multiparallel word alignment. Our method outperforms previous work on three word alignment datasets and on a downstream task.
Multimodal sentiment analysis has attracted increasing attention, and lots of models have been proposed. However, the performance of the state-of-the-art models decreases sharply when they are deployed in the real world. We find that the main reason is that real-world applications can only access the text outputs by the automatic speech recognition (ASR) models, which may contain errors because of the limitation of model capacity. Through further analysis of the ASR outputs, we find that in some cases the sentiment words, the key sentiment elements in the textual modality, are recognized as other words, which makes the sentiment of the text change and hurts the performance of multimodal sentiment analysis models directly. To address this problem, we propose the sentiment word aware multimodal refinement model (SWRM), which can dynamically refine the erroneous sentiment words by leveraging multimodal sentiment clues. Specifically, we first use the sentiment word position detection module to obtain the most likely position of the sentiment word in the text and then utilize the multimodal sentiment word refinement module to dynamically refine the sentiment word embeddings. The refined embeddings are taken as the textual inputs of the multimodal feature fusion module to predict the sentiment labels. We conduct extensive experiments on the real-world datasets including MOSI-Speechbrain, MOSI-IBM, and MOSI-iFlytek, and the results demonstrate the effectiveness of our model, which surpasses the current state-of-the-art models on three datasets. Furthermore, our approach can be adapted for other multimodal feature fusion models easily.
Code switching (CS) refers to the phenomenon of interchangeably using words and phrases from different languages. CS can pose significant accuracy challenges to NLP, due to the often monolingual nature of the underlying systems. In this work, we focus on CS in the context of English/Spanish conversations for the task of speech translation (ST), generating and evaluating both transcript and translation. To evaluate model performance on this task, we create a novel ST corpus derived from existing public data sets. We explore various ST architectures across two dimensions: cascaded (transcribe then translate) vs end-to-end (jointly transcribe and translate) and unidirectional (source -> target) vs bidirectional (source <-> target). We show that our ST architectures, and especially our bidirectional end-to-end architecture, perform well on CS speech, even when no CS training data is used.
Natural Language Inference (NLI) datasets contain examples with highly ambiguous labels due to their subjectivity. Several recent efforts have been made to acknowledge and embrace the existence of ambiguity, and explore how to capture the human disagreement distribution. In contrast with directly learning from gold ambiguity labels, relying on special resource, we argue that the model has naturally captured the human ambiguity distribution as long as it is calibrated, i.e., the predictive probability can reflect the true correctness likelihood. Our experiments show that when the model is well calibrated, either by label smoothing or temperature scaling, it can obtain performance competitive with prior work, on both divergence scores between predictive probability and the true human opinion distribution, and the accuracy. This reveals that the overhead of collecting gold ambiguity labels can be cut by broadly solving how to calibrate the NLI network.
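Temperature scaling, one of the two calibration methods named above, divides the logits by a scalar T before the softmax. The sketch below uses hypothetical logits and a hypothetical human label distribution to show how a temperature above 1 can bring an over-confident prediction closer to that distribution, measured by KL divergence. In practice T would be fitted on held-out data rather than chosen by hand.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature > 1 flattens the output."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical values: over-confident logits and annotator label proportions.
logits = [4.0, 1.0, 0.0]
human = [0.6, 0.3, 0.1]

uncalibrated = softmax_with_temperature(logits, temperature=1.0)
calibrated = softmax_with_temperature(logits, temperature=3.0)
print(kl_divergence(human, uncalibrated), kl_divergence(human, calibrated))
```

Dividing by a positive T never changes the arg-max label, so accuracy is unaffected while the predicted probabilities are softened.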
To maximize the accuracy and increase the overall acceptance of text classifiers, we propose a framework for the efficient, in-operation moderation of classifiers’ output. Our framework focuses on use cases in which F1-scores of modern Neural Networks classifiers (ca. 90%) are still inapplicable in practice. We suggest a semi-automated approach that uses prediction uncertainties to pass unconfident, probably incorrect classifications to human moderators. To minimize the workload, we limit the human moderated data to the point where the accuracy gains saturate and further human effort does not lead to substantial improvements. A series of benchmarking experiments based on three different datasets and three state-of-the-art classifiers show that our framework can improve the classification F1-scores by 5.1 to 11.2% (up to approx. 98 to 99%), while reducing the moderation load up to 73.3% compared to a random moderation.
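The semi-automated idea can be sketched as follows: rank predictions by confidence, hand the least confident ones to a moderator up to a fixed budget, and keep the rest. The sketch assumes that the moderator always returns the correct label and uses the classifier's confidence score as the uncertainty signal; both are simplifying assumptions, and the data are hypothetical.

```python
def moderate(predictions, confidences, gold_labels, budget):
    """Replace the `budget` least confident predictions with the moderator's
    label (assumed here to equal the gold label)."""
    order = sorted(range(len(predictions)), key=lambda i: confidences[i])
    moderated = set(order[:budget])
    return [gold_labels[i] if i in moderated else predictions[i]
            for i in range(len(predictions))]

def accuracy(predicted, gold):
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

# Hypothetical classifier outputs on six documents.
predictions = [1, 0, 1, 1, 0, 0]
confidences = [0.95, 0.55, 0.60, 0.90, 0.99, 0.52]
gold = [1, 1, 0, 1, 0, 1]

for budget in range(0, 5):
    final = moderate(predictions, confidences, gold, budget)
    print(budget, accuracy(final, gold))
```

In this toy data the accuracy stops improving once the three low-confidence errors are moderated, which mirrors the saturation point at which further human effort no longer pays off.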
The extreme multi-label classification (XMC) task aims at tagging content with a subset of labels from an extremely large label set. The label vocabulary is typically defined in advance by domain experts and assumed to capture all necessary tags. However, in real-world scenarios this label set, although large, is often incomplete and experts frequently need to refine it. To develop systems that simplify this process, we introduce the task of open-vocabulary XMC (OXMC): given a piece of content, predict a set of labels, some of which may be outside of the known tag set. Hence, in addition to not having training data for some labels, as is the case in zero-shot classification, models need to invent some labels on the fly. We propose GROOV, a fine-tuned seq2seq model for OXMC that generates the set of labels as a flat sequence and is trained using a novel loss independent of predicted label order. We show the efficacy of the approach, experimenting with popular XMC datasets for which GROOV is able to predict meaningful labels outside the given vocabulary while performing on par with state-of-the-art solutions for known labels.
Few-shot named entity recognition (NER) systems aim at recognizing novel-class named entities based on only a few labeled examples. In this paper, we present a decomposed meta-learning approach which addresses the problem of few-shot NER by sequentially tackling few-shot span detection and few-shot entity typing using meta-learning. In particular, we take the few-shot span detection as a sequence labeling problem and train the span detector by introducing the model-agnostic meta-learning (MAML) algorithm to find a good model parameter initialization that could fast adapt to new entity classes. For few-shot entity typing, we propose MAML-ProtoNet, i.e., MAML-enhanced prototypical networks to find a good embedding space that can better distinguish text span representations from different entity classes. Extensive experiments on various benchmarks show that our approach achieves superior performance over prior methods.
Logical reasoning of text requires identifying critical logical structures in the text and performing inference over them. Existing methods for logical reasoning mainly focus on contextual semantics of text while struggling to explicitly model the logical inference process. In this paper, we not only put forward a logic-driven context extension framework but also propose a logic-driven data augmentation algorithm. The former follows a three-step reasoning paradigm, and each step is respectively to extract logical expressions as elementary reasoning units, symbolically infer the implicit expressions following equivalence laws and extend the context to validate the options. The latter augments literally similar but logically different instances and incorporates contrastive learning to better capture logical information, especially logical negative and conditional relationships. We conduct experiments on two benchmark datasets, ReClor and LogiQA. The results show that our method achieves state-of-the-art performance on both datasets, and even surpasses human performance on the ReClor dataset.
Document-level Relation Extraction (DocRE) is a more challenging task compared to its sentence-level counterpart. It aims to extract relations from multiple sentences at once. In this paper, we propose a semi-supervised framework for DocRE with three novel components. Firstly, we use an axial attention module for learning the interdependency among entity-pairs, which improves the performance on two-hop relations. Secondly, we propose an adaptive focal loss to tackle the class imbalance problem of DocRE. Lastly, we use knowledge distillation to overcome the differences between human annotated data and distantly supervised data. We conducted experiments on two DocRE datasets. Our model consistently outperforms strong baselines and its performance exceeds the previous SOTA by 1.36 F1 and 1.46 Ign_F1 score on the DocRED leaderboard.
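For background on the second component, the sketch below implements the standard focal loss for a single example, which down-weights examples the model already classifies confidently. This is the well-known baseline formulation, not the adaptive focal loss proposed in the abstract, whose details are not given here; the gamma value and probabilities are illustrative.

```python
import math

def focal_loss(prob_true_class, gamma=2.0):
    """Standard focal loss: -(1 - p)^gamma * log(p), with p the probability
    assigned to the true class. gamma = 0 recovers cross-entropy."""
    return -((1.0 - prob_true_class) ** gamma) * math.log(prob_true_class)

easy, hard = 0.9, 0.1  # hypothetical probabilities of the true class
for gamma in (0.0, 2.0):
    print(gamma, focal_loss(easy, gamma), focal_loss(hard, gamma))
```

With gamma = 2 the loss on the easy example shrinks by a factor of 100 while the hard example keeps most of its weight, which is why this family of losses is used when frequent classes would otherwise dominate training.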
Recently, there has been a trend to investigate the factual knowledge captured by Pre-trained Language Models (PLMs). Many works show the PLMs' ability to fill in the missing factual words in cloze-style prompts such as "Dante was born in [MASK]." However, it is still a mystery how PLMs generate the results correctly: relying on effective clues or shortcut patterns? We try to answer this question by a causal-inspired analysis that quantitatively measures and evaluates the word-level patterns that PLMs depend on to generate the missing words. We check the words that have three typical associations with the missing words: knowledge-dependent, positionally close, and highly co-occurred. Our analysis shows: (1) PLMs generate the missing factual words more by the positionally close and highly co-occurred words than the knowledge-dependent words; (2) the dependence on the knowledge-dependent words is more effective than the positionally close and highly co-occurred words. Accordingly, we conclude that the PLMs capture the factual knowledge ineffectively because they depend on inadequate associations.
We propose a novel approach that jointly utilizes the labels and elicited rationales for text classification to speed up the training of deep learning models with limited training data. We define and optimize a ranking-constrained loss function that combines cross-entropy loss with ranking losses as rationale constraints. We evaluate our proposed rationale-augmented learning approach on three human-annotated datasets, and show that our approach provides significant improvements over classification approaches that do not utilize rationales as well as other state-of-the-art rationale-augmented baselines.
Considering the seq2seq architecture of Yin and Neubig for natural language to code translation, we identify four key components of importance: grammatical constraints, lexical preprocessing, input representations, and copy mechanisms. To study the impact of these components, we use a state-of-the-art architecture that relies on a BERT encoder and a grammar-based decoder, for which a formalization is provided. The paper highlights the importance of the lexical substitution component in the current natural language to code systems.
The popularity of pretrained language models in natural language processing systems calls for a careful evaluation of such models in downstream tasks, which have a higher potential for societal impact. The evaluation of such systems usually focuses on accuracy measures. Our findings in this paper call for attention to be paid to fairness measures as well. Through the analysis of more than a dozen pretrained language models of varying sizes on two toxic text classification tasks (English), we demonstrate that focusing on accuracy measures alone can lead to models with wide variation in fairness characteristics. Specifically, we observe that fairness can vary even more than accuracy with increasing training data size and different random initializations. At the same time, we find that little of the fairness variation is explained by model size, despite claims in the literature. To improve model fairness without retraining, we show that two post-processing methods developed for structured, tabular data can be successfully applied to a range of pretrained language models. Warning: This paper contains samples of offensive text.
Many tasks in text-based computational social science (CSS) involve the classification of political statements into categories based on a domain-specific codebook. In order to be useful for CSS analysis, these categories must be fine-grained. The typically skewed distribution of fine-grained categories, however, results in a challenging classification problem on the NLP side. This paper proposes to make use of the hierarchical relations among categories typically present in such codebooks: e.g., markets and taxation are both subcategories of economy, while borders is a subcategory of security. We use these ontological relations as prior knowledge to establish additional constraints on the learned model, thus improving performance overall and in particular for infrequent categories. We evaluate several lightweight variants of this intuition by extending state-of-the-art transformer-based text classifiers on two datasets and multiple languages. We find the most consistent improvement for an approach based on regularization.
We present a literature and empirical survey that critically assesses the state of the art in character-level modeling for machine translation (MT). Despite evidence in the literature that character-level systems are comparable with subword systems, they are virtually never used in competitive setups in WMT competitions. We empirically show that even with recent modeling innovations in character-level natural language processing, character-level MT systems still struggle to match their subword-based counterparts. Character-level MT systems show neither better domain robustness, nor better morphological generalization, despite being often so motivated. However, we are able to show robustness towards source side noise and that translation quality does not degrade with increasing beam size at decoding time.
We investigate the exploitation of self-supervised models for two Creole languages with few resources: Gwadloupyen and Morisien. Automatic language processing tools are almost non-existent for these two languages. We propose to use about one hour of annotated data to design an automatic speech recognition system for each language. We evaluate how much data is needed to obtain a query-by-example system that is usable by linguists. Moreover, our experiments show that multilingual self-supervised models are not necessarily the most efficient for Creole languages.
Most of the open-domain dialogue models tend to perform poorly in the setting of long-term human-bot conversations. The possible reason is that they lack the capability of understanding and memorizing long-term dialogue history information. To address this issue, we present a novel task of Long-term Memory Conversation (LeMon) and then build a new dialogue dataset DuLeMon and a dialogue generation framework with Long-Term Memory (LTM) mechanism (called PLATO-LTM). This LTM mechanism enables our system to accurately extract and continuously update long-term persona memory without requiring multiple-session dialogue datasets for model training. To our knowledge, this is the first attempt to conduct real-time dynamic management of persona information of both parties, including the user and the bot. Results on DuLeMon indicate that PLATO-LTM can significantly outperform baselines in terms of long-term dialogue consistency, leading to better dialogue engagingness.
While multilingual training is now an essential ingredient in machine translation (MT) systems, recent work has demonstrated that it has different effects in different multilingual settings, such as many-to-one, one-to-many, and many-to-many learning. These training settings expose the encoder and the decoder in a machine translation model to different data distributions. In this paper, we examine how different varieties of multilingual training contribute to learning these two components of the MT model. Specifically, we compare bilingual models with encoders and/or decoders initialized by multilingual training. We show that multilingual training is beneficial to encoders in general, while it only benefits decoders for low-resource languages (LRLs). We further find the important attention heads for each language pair and compare their correlations during inference. Our analysis sheds light on how multilingual translation models work and also enables us to propose methods to improve performance by training with highly related languages. Our many-to-one models for high-resource languages and one-to-many models for LRLs outperform the best results reported by Aharoni et al.
Chinese Grammatical Error Detection (CGED) aims at detecting grammatical errors in Chinese texts. One of the main challenges for CGED is the lack of annotated data. To alleviate this problem, previous studies proposed various methods to automatically generate more training samples, which can be roughly categorized into rule-based methods and model-based methods. The rule-based methods construct erroneous sentences by directly introducing noise into original sentences. However, the introduced noise is usually context-independent, which is quite different from the errors made by humans. The model-based methods utilize generative models to imitate human errors. The generative model may bring too many changes to the original sentences and generate semantically ambiguous sentences, so it is difficult to detect grammatical errors in these generated sentences. In addition, generated sentences may be error-free and thus become noisy data. To handle these problems, we propose CNEG, a novel Conditional Non-Autoregressive Error Generation model for generating Chinese grammatical errors. Specifically, in order to generate a context-dependent error, we first mask a span in a correct text, then predict an erroneous span conditioned on both the masked text and the correct span. Furthermore, we filter out error-free spans by measuring their perplexities in the original sentences. Experimental results show that our proposed method achieves better performance than all compared data augmentation methods on the CGED-2018 and CGED-2020 benchmarks.
Recent studies have found that removing the norm-bounded projection and increasing search steps in adversarial training can significantly improve robustness. However, we observe that a too large number of search steps can hurt accuracy. We aim to obtain strong robustness efficiently using fewer steps. Through a toy experiment, we find that perturbing the clean data to the decision boundary but not crossing it does not degrade the test accuracy. Inspired by this, we propose friendly adversarial data augmentation (FADA) to generate friendly adversarial data. On top of FADA, we propose geometry-aware adversarial training (GAT) to perform adversarial training on friendly adversarial data so that we can save a large number of search steps. Comprehensive experiments across two widely used datasets and three pre-trained language models demonstrate that GAT can obtain stronger robustness via fewer steps. In addition, we provide extensive empirical results and in-depth analyses on robustness to facilitate future studies.
Vision-Language Pre-training (VLP) has achieved impressive performance on various cross-modal downstream tasks. However, most existing methods can only learn from aligned image-caption data and rely heavily on expensive regional features, which greatly limits their scalability and performance. In this paper, we propose an end-to-end unified-modal pre-training framework, namely UNIMO-2, for joint learning on both aligned image-caption data and unaligned image-only and text-only corpus. We build a unified Transformer model to jointly learn visual representations, textual representations and semantic alignment between images and texts. In particular, we propose to conduct grounded learning on both images and texts via a sharing grounded space, which helps bridge unaligned images and texts, and align the visual and textual semantic spaces on different types of corpora. The experiments show that our grounded learning method can improve textual and visual semantic alignment for improving performance on various cross-modal tasks. Moreover, benefiting from effective joint modeling of different types of corpora, our model also achieves impressive performance on single-modal visual and textual tasks. Our code and models are public at the UNIMO project page https://unimo-ptm.github.io/.
We present two simple modifications for word-level perturbation: Word Replacement considering Length (WR-L) and Compositional Word Replacement (CWR). In conventional word replacement, a word in an input is replaced with a word sampled from the entire vocabulary, regardless of the length and context of the target word. WR-L considers the length of a target word by sampling words from the Poisson distribution. CWR considers the compositional candidates by restricting the source of sampling to related words that appear in subword regularization. Experimental results showed that the combination of WR-L and CWR improved the performance of text classification and machine translation.
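One possible reading of length-aware replacement can be sketched as follows. The sketch assumes that the replacement length is drawn from a Poisson distribution whose mean is the target word's length, and that a word of the nearest available length is then picked uniformly from a non-empty vocabulary; both choices, and the helper names `sample_poisson` and `replace_word_by_length`, are assumptions for illustration rather than details given in the abstract.

```python
import math
import random

def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k

def replace_word_by_length(word: str, vocab: list[str], rng: random.Random) -> str:
    """Pick a replacement whose length is drawn from Poisson(len(word))."""
    # Group the (non-empty) vocabulary by word length.
    by_length: dict[int, list[str]] = {}
    for candidate in vocab:
        by_length.setdefault(len(candidate), []).append(candidate)
    sampled = sample_poisson(len(word), rng)
    # Fall back to the closest available length if no word has the sampled length.
    nearest = min(by_length, key=lambda n: (abs(n - sampled), n))
    return rng.choice(by_length[nearest])

rng = random.Random(0)
vocab = ["a", "to", "cat", "door", "house", "window", "library"]
replacement = replace_word_by_length("table", vocab, rng)
```

Compared with uniform sampling over the whole vocabulary, this keeps replacements close in length to the word they replace while still allowing some variation.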
The finetuning of pretrained transformer-based language generation models is typically conducted in an end-to-end manner, where the model learns to attend to relevant parts of the input by itself. However, there does not exist a mechanism to directly control the model’s focus. This work aims to develop a control mechanism by which a user can select spans of context as “highlights” for the model to focus on and generate relevant output. To achieve this goal, we augment a pretrained model with trainable “focus vectors” that are directly applied to the model’s embeddings, while the model itself is kept fixed. These vectors, trained on automatic annotations derived from attribution methods, act as indicators for context importance. We test our approach on two core generation tasks: dialogue response generation and abstractive summarization. We also collect evaluation data where the highlight-generation pairs are annotated by humans. Our experiments show that the trained focus vectors are effective in steering the model to generate outputs that are relevant to user-selected highlights.
We propose a framework to modularize the training of neural language models that use diverse forms of context by eliminating the need to jointly train context and within-sentence encoders. Our approach, contextual universal embeddings (CUE), trains LMs on one type of contextual data and adapts to novel context types. The model consists of a pretrained neural sentence LM, a BERT-based contextual encoder, and a masked transformer decoder that estimates LM probabilities using sentence-internal and contextual evidence. When contextually annotated data is unavailable, our model learns to combine contextual and sentence-internal information using noisy oracle unigram embeddings as a proxy. Real context data can be introduced later and used to adapt a small number of parameters that map contextual data into the decoder’s embedding space. We validate the CUE framework on a NYTimes text corpus with multiple metadata types, for which the LM perplexity can be lowered from 36.6 to 27.4 by conditioning on context. Bootstrapping a contextual LM with only a subset of the metadata during training retains of the achievable gain. Training the model initially with proxy context retains of the perplexity gain after adapting to real context. Furthermore, we can swap one type of pretrained sentence LM for another without retraining the context encoders, by only adapting the decoder model. Overall, we obtain a modular framework that allows incremental, scalable training of context-enhanced LMs.
Pruning aims to reduce the number of parameters while maintaining performance close to the original network. This work proposes a novel self-distillation based pruning strategy, whereby the representational similarity between the pruned and unpruned versions of the same network is maximized. Unlike previous approaches that treat distillation and pruning separately, we use distillation to inform the pruning criteria, without requiring a separate student network as in knowledge distillation. We show that the proposed cross-correlation objective for self-distilled pruning implicitly encourages sparse solutions, naturally complementing magnitude-based pruning criteria. Experiments on the GLUE and XGLUE benchmarks show that self-distilled pruning increases mono- and cross-lingual language model performance. Self-distilled pruned models also outperform smaller Transformers with an equal number of parameters and are competitive against (6 times) larger distilled networks. We also observe that self-distillation (1) maximizes class separability, (2) increases the signal-to-noise ratio, and (3) converges faster after pruning steps, providing further insights into why self-distilled pruning improves generalization.
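The magnitude-based pruning criterion that the abstract above refers to can be illustrated with a minimal sketch. It operates on a flat list of weights and ignores the self-distillation objective entirely; the function name `magnitude_prune` and the `sparsity` parameter are assumptions made for this example.

```python
def magnitude_prune(weights: list[float], sparsity: float) -> list[float]:
    """Zero out the `sparsity` fraction of weights with the smallest absolute value."""
    if not 0.0 <= sparsity <= 1.0:
        raise ValueError("sparsity must be in [0, 1]")
    n_prune = int(len(weights) * sparsity)
    # Indices sorted by ascending magnitude; the first n_prune are removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned

weights = [0.8, -0.05, 0.3, -1.2, 0.01, 0.4]
pruned = magnitude_prune(weights, sparsity=0.5)
# pruned == [0.8, 0.0, 0.0, -1.2, 0.0, 0.4]
```

In practice such a criterion is applied per layer or globally across tensors and is interleaved with further training steps; the sketch only shows the selection rule.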
Continual relation extraction (CRE) aims to continuously train a model on data with new relations while avoiding forgetting old ones. Some previous work has proved that storing a few typical samples of old relations and replaying them when learning new relations can effectively avoid forgetting. However, these memory-based methods tend to overfit the memory samples and perform poorly on imbalanced datasets. To solve these challenges, a consistent representation learning method is proposed, which maintains the stability of the relation embedding by adopting contrastive learning and knowledge distillation when replaying memory. Specifically, supervised contrastive learning based on a memory bank is first used to train each new task so that the model can effectively learn the relation representation. Then, contrastive replay is conducted on the samples in memory, and the model retains the knowledge of historical relations through memory knowledge distillation to prevent the catastrophic forgetting of the old task. The proposed method can better learn consistent representations to alleviate forgetting effectively. Extensive experiments on FewRel and TACRED datasets show that our method significantly outperforms state-of-the-art baselines and yields strong robustness on imbalanced datasets.
We investigate Referring Image Segmentation (RIS), which outputs a segmentation map corresponding to the natural language description. Addressing RIS efficiently requires considering the interactions happening across visual and linguistic modalities and the interactions within each modality. Existing methods are limited because they either compute different forms of interactions sequentially (leading to error propagation) or ignore intra-modal interactions. We address this limitation by performing all three interactions simultaneously through a Synchronous Multi-Modal Fusion Module (SFM). Moreover, to produce refined segmentation masks, we propose a novel Hierarchical Cross-Modal Aggregation Module (HCAM), where linguistic features facilitate the exchange of contextual information across the visual hierarchy. We present thorough ablation studies and validate our approach’s performance on four benchmark datasets, showing considerable performance gains over the existing state-of-the-art (SOTA) methods.
Weighted decoding methods composed of the pretrained language model (LM) and the controller have achieved promising results for controllable text generation. However, these models often suffer from a control strength/fluency trade-off problem, as higher control strength is more likely to generate incoherent and repetitive text. In this paper, we show that this trade-off arises from the controller imposing the target attribute on the LM at improper positions. We propose a novel framework based on existing weighted decoding methods, called CAT-PAW, which introduces a lightweight regulator to adjust bias signals from the controller at different decoding positions. Experiments on positive sentiment control, topic control, and language detoxification show the effectiveness of our CAT-PAW upon 4 SOTA models.
Procedural text contains rich anaphoric phenomena, yet has not received much attention in NLP. To fill this gap, we investigate the textual properties of two types of procedural text, recipes and chemical patents, and generalize an anaphora annotation framework developed for the chemical domain for modeling anaphoric phenomena in recipes. We apply this framework to annotate the RecipeRef corpus with both bridging and coreference relations. Through comparison to chemical patents, we show the complexity of anaphora resolution in recipes. We demonstrate empirically that transfer learning from the chemical domain improves resolution of anaphora in recipes, suggesting transferability of general procedural knowledge.
Logical reasoning is of vital importance to natural language understanding. Previous studies either employ graph-based models to incorporate prior knowledge about logical relations, or introduce symbolic logic into neural models through data augmentation. These methods, however, heavily depend on annotated training data, and thus suffer from over-fitting and poor generalization problems due to the dataset sparsity. To address these two problems, in this paper, we propose MERIt, a MEta-path guided contrastive learning method for logical ReasonIng of text, to perform self-supervised pre-training on abundant unlabeled text data. Two novel strategies serve as indispensable components of our method. In particular, a strategy based on meta-path is devised to discover the logical structure in natural texts, followed by a counterfactual data augmentation strategy to eliminate the information shortcut induced by pre-training. The experimental results on two challenging logical reasoning benchmarks, i.e., ReClor and LogiQA, demonstrate that our method outperforms the SOTA baselines with significant improvements.
Aspect-based sentiment analysis (ABSA) predicts sentiment polarity towards a specific aspect in the given sentence. While pre-trained language models such as BERT have achieved great success, incorporating dynamic semantic changes into ABSA remains challenging. To this end, in this paper, we propose to address this problem by Dynamic Re-weighting BERT (DR-BERT), a novel method designed to learn dynamic aspect-oriented semantics for ABSA. Specifically, we first take the Stack-BERT layers as a primary encoder to grasp the overall semantic of the sentence and then fine-tune it by incorporating a lightweight Dynamic Re-weighting Adapter (DRA). Note that the DRA can pay close attention to a small region of the sentences at each step and re-weigh the vitally important words for better aspect-aware sentiment understanding. Finally, experimental results on three benchmark datasets demonstrate the effectiveness and the rationality of our proposed model and provide good interpretable insights for future semantic modeling.
We introduce a novel setup for low-resource task-oriented semantic parsing which incorporates several constraints that may arise in real-world scenarios: (1) lack of similar datasets or models from a related domain, (2) inability to sample useful logical forms directly from a grammar, and (3) privacy requirements for unlabeled natural utterances. Our goal is to improve a low-resource semantic parser using utterances collected through user interactions. In this highly challenging but realistic setting, we investigate data augmentation approaches involving generating a set of structured canonical utterances corresponding to logical forms, before simulating corresponding natural language and filtering the resulting pairs. We find that such approaches are effective despite our restrictive setup: in a low-resource setting on the complex SMCalFlow calendaring dataset (Andreas et al.), we observe relative improvement over a non-data-augmented baseline in top-1 match.
Question answering-based summarization evaluation metrics must automatically determine whether the QA model’s prediction is correct or not, a task known as answer verification. In this work, we benchmark the lexical answer verification methods which have been used by current QA-based metrics as well as two more sophisticated text comparison methods, BERTScore and LERC. We find that LERC out-performs the other methods in some settings while remaining statistically indistinguishable from lexical overlap in others. However, our experiments reveal that improved verification performance does not necessarily translate to overall QA-based metric quality: In some scenarios, using a worse verification method — or using none at all — has comparable performance to using the best verification method, a result that we attribute to properties of the datasets.
In this paper, we introduce a new task called synesthesia detection, which aims to extract the sensory word of a sentence and to predict the original and synesthetic sensory modalities of the corresponding sensory word. Synesthesia refers to the description of perceptions in one sensory modality through concepts from other modalities. It involves not only a linguistic phenomenon but also a cognitive phenomenon structuring human thought and action, which makes it a bridge between figurative linguistic phenomena and abstract cognition, and thus helpful for understanding deep semantics. To address this, we construct a large-scale human-annotated Chinese synesthesia dataset, which contains 7,217 annotated sentences accompanied by sensory words. Based on this dataset, we propose a family of strong and representative baseline models. Upon these baselines, we further propose a radical-based neural network model to identify the boundary of the sensory word and to jointly detect the original and synesthetic sensory modalities for the word. Through extensive experiments, we observe that the importance of the proposed task and dataset can be verified by the statistics and progressive performances. In addition, our proposed model achieves state-of-the-art results on the synesthesia dataset.
Dense retrieval (DR) methods conduct text retrieval by first encoding texts in the embedding space and then matching them by nearest neighbor search. This requires strong locality properties from the representation space, e.g., close allocations of each small group of relevant texts, which are hard to generalize to domains without sufficient training data. In this paper, we aim to improve the generalization ability of DR models from source training domains with rich supervision signals to target domains without any relevance label, in the zero-shot setting. To achieve that, we propose Momentum adversarial Domain Invariant Representation learning (MoDIR), which introduces a momentum method to train a domain classifier that distinguishes source versus target domains, and then adversarially updates the DR encoder to learn domain invariant representations. Our experiments show that MoDIR robustly outperforms its baselines on 10+ ranking datasets collected in the BEIR benchmark in the zero-shot setup, with more than 10% relative gains on datasets with enough sensitivity for DR models’ evaluation. Source code is available at https://github.com/ji-xin/modir.
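The retrieval step described above, matching encoded texts by nearest neighbor search, can be sketched with a brute-force cosine search. The embeddings below are hypothetical stand-ins for encoder outputs, and `cosine` and `retrieve` are illustrative helper names; the domain-adversarial training of MoDIR is not shown.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return ids of the k documents closest to the query embedding."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]

# Toy embeddings standing in for encoder outputs (hypothetical values).
docs = {"d1": [1.0, 0.0], "d2": [0.7, 0.7], "d3": [0.0, 1.0]}
top = retrieve([0.9, 0.1], docs, k=2)
# top == ["d1", "d2"]
```

Real systems replace the exhaustive sort with an approximate nearest neighbor index, but the ranking criterion is the same, which is why the locality of the embedding space matters so much for generalization.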
We explore how a multi-modal transformer trained for generation of longer image descriptions learns syntactic and semantic representations about entities and relations grounded in objects at the level of masked self-attention (text generation) and cross-modal attention (information fusion). We observe that cross-attention learns the visual grounding of noun phrases into objects and high-level semantic information about spatial relations, while text-to-text attention captures low-level syntactic knowledge between words. We conclude that language models in a multi-modal task learn different semantic information about objects and relations cross-modally and uni-modally (text-only). Our code is available here: https://github.com/GU-CLASP/attention-as-grounding.
Syntactic structure has long been argued to be potentially useful for enforcing accurate word alignment and improving generalization performance of machine translation. Unfortunately, existing wisdom demonstrates its significance by considering only the syntactic structure of source tokens, neglecting the rich structural information from target tokens and the structural similarity between the source and target sentences. In this work, we propose to incorporate the syntactic structure of both source and target tokens into the encoder-decoder framework, tightly correlating the internal logic of word alignment and machine translation for multi-task learning. Particularly, we won’t leverage any annotated syntactic graph of the target side during training, so we introduce Dynamic Graph Convolution Networks (DGCN) on observed target tokens to sequentially and simultaneously generate the target tokens and the corresponding syntactic graphs, and further guide the word alignment. On this basis, Hierarchical Graph Random Walks (HGRW) are performed on the syntactic graphs of both source and target sides, for incorporating structured constraints on machine translation outputs. Experiments on four publicly available language pairs verify that our method is highly effective in capturing syntactic structure in different languages, consistently outperforming baselines in alignment accuracy and demonstrating promising results in translation quality.
We explore the notion of uncertainty in the context of modern abstractive summarization models, using the tools of Bayesian Deep Learning. Our approach approximates Bayesian inference by first extending state-of-the-art summarization models with Monte Carlo dropout and then using them to perform multiple stochastic forward passes. Based on Bayesian inference, we are able to effectively quantify uncertainty at prediction time. Having a reliable uncertainty measure, we can improve the experience of the end user by filtering out generated summaries of high uncertainty. Furthermore, uncertainty estimation could be used as a criterion for selecting samples for annotation, and can be paired nicely with active learning and human-in-the-loop approaches. Finally, Bayesian inference enables us to find a Bayesian summary which performs better than a deterministic one and is more robust to uncertainty. In practice, we show that our Variational Bayesian equivalents of BART and PEGASUS can outperform their deterministic counterparts on multiple benchmark datasets.
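The idea of keeping dropout active at prediction time and aggregating several stochastic forward passes can be sketched on a toy linear model. This is only an illustration under stated assumptions: the model, the dropout rate, the number of passes, and the function names are hypothetical, and a real summarizer would aggregate generated summaries rather than scalar outputs.

```python
import random
import statistics

def stochastic_forward(x, weights, drop_prob, rng):
    """One forward pass of a toy linear model with dropout kept active."""
    keep = 1.0 - drop_prob
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:
            total += xi * wi / keep  # inverted dropout scaling
    return total

def mc_dropout_predict(x, weights, drop_prob=0.2, passes=200, seed=0):
    """Mean prediction and variance over several stochastic passes."""
    rng = random.Random(seed)
    outputs = [stochastic_forward(x, weights, drop_prob, rng) for _ in range(passes)]
    return statistics.mean(outputs), statistics.pvariance(outputs)

x, w = [1.0, 2.0, 3.0], [0.5, -0.2, 0.1]
mean, variance = mc_dropout_predict(x, w)
_, no_dropout_variance = mc_dropout_predict(x, w, drop_prob=0.0)
```

The variance across passes serves as the uncertainty estimate: it is zero when dropout is disabled and grows when the prediction depends heavily on units that are sometimes dropped, which is the signal one could use to filter out high-uncertainty outputs.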
As large and powerful neural language models are developed, researchers have been increasingly interested in developing diagnostic tools to probe them. There are many papers with conclusions of the form “observation X is found in model Y”, using their own datasets with varying sizes. Larger probing datasets bring more reliability, but are also expensive to collect. There is yet to be a quantitative method for estimating reasonable probing dataset sizes. We tackle this omission in the context of comparing two probing configurations: after we have collected a small dataset from a pilot study, how many additional data samples are sufficient to distinguish two different configurations? We present a novel method to estimate the required number of data samples in such experiments and, across several case studies, we verify that our estimations have sufficient statistical power. Our framework helps to systematically construct probing datasets to diagnose neural NLP models.
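To make the question of "how many samples are enough" concrete, the sketch below uses the textbook sample-size formula for a two-sided, two-sample z-test. It is not the estimation method proposed in the abstract above; the effect size and standard deviation are hypothetical pilot-study values, and `samples_per_group` is an illustrative name.

```python
import math
from statistics import NormalDist

def samples_per_group(effect: float, std: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Samples per configuration for a two-sided, two-sample z-test
    (textbook normal approximation, equal variances assumed)."""
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    n = 2.0 * ((z_alpha + z_power) * std / effect) ** 2
    return math.ceil(n)

# Hypothetical pilot estimates: a 0.02 accuracy gap with per-sample std 0.10.
n_needed = samples_per_group(effect=0.02, std=0.10)
```

Because the required size scales with the inverse square of the effect, halving the gap between two configurations roughly quadruples the number of samples needed to separate them at the same power.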
Recent Quality Estimation (QE) models based on multilingual pre-trained representations have achieved very competitive results in predicting the overall quality of translated sentences. However, detecting specifically which translated words are incorrect is a more challenging task, especially when dealing with limited amounts of training data. We hypothesize that, not unlike humans, successful QE models rely on translation errors to predict overall sentence quality. By exploring a set of feature attribution methods that assign relevance scores to the inputs to explain model predictions, we study the behaviour of state-of-the-art sentence-level QE models and show that explanations (i.e., rationales) extracted from these models can indeed be used to detect translation errors. We therefore (i) introduce a novel semi-supervised method for word-level QE; and (ii) propose to use the QE task as a new benchmark for evaluating the plausibility of feature attribution, i.e., how interpretable model explanations are to humans.
Despite the remarkable success deep models have achieved in Textual Matching (TM) tasks, it still remains unclear whether they truly understand language or measure the semantic similarity of texts by exploiting statistical bias in datasets. In this work, we provide a new perspective to study this issue: the length divergence bias. We find the length divergence heuristic widely exists in prevalent TM datasets, providing direct cues for prediction. To determine whether TM models have adopted such a heuristic, we introduce an adversarial evaluation scheme which invalidates the heuristic. In this adversarial setting, all TM models perform worse, indicating they have indeed adopted this heuristic. Through a well-designed probing experiment, we empirically validate that the bias of TM models can be attributed in part to extracting the text length information during training. To alleviate the length divergence bias, we propose an adversarial training method. The results demonstrate we successfully improve the robustness and generalization ability of models at the same time.