Andrew McCallum


2021

Modeling Fine-Grained Entity Types with Box Embeddings
Yasumasa Onoe | Michael Boratko | Andrew McCallum | Greg Durrett
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Neural entity typing models typically represent fine-grained entity types as vectors in a high-dimensional space, but such spaces are not well-suited to modeling these types’ complex interdependencies. We study the ability of box embeddings, which embed concepts as d-dimensional hyperrectangles, to capture hierarchies of types even when these relationships are not defined explicitly in the ontology. Our model represents both types and entity mentions as boxes. Each mention and its context are fed into a BERT-based model to embed that mention in our box space; essentially, this model leverages typological clues present in the surface text to hypothesize a type representation for the mention. Box containment can then be used to derive both the posterior probability of a mention exhibiting a given type and the conditional probability relations between types themselves. We compare our approach with a vector-based typing model and observe state-of-the-art performance on several entity typing benchmarks. In addition to competitive typing performance, our box-based model shows better performance in prediction consistency (predicting a supertype and a subtype together) and confidence (i.e., calibration), demonstrating that the box-based model captures the latent type hierarchies better than the vector-based model does.
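
As an illustration of how box containment yields conditional probabilities, here is a minimal sketch with hard axis-aligned boxes. This is a simplification, not the paper's model: the function names and the example boxes (`person`, `artist`, `mention`) are hypothetical, and the learned boxes in the paper come from a trained encoder rather than being set by hand.

```python
from typing import List, Tuple

# A box is a (min corner, max corner) pair of equal-length coordinate lists.
Box = Tuple[List[float], List[float]]


def volume(box: Box) -> float:
    """Product of side lengths; an empty box has volume 0."""
    lo, hi = box
    vol = 1.0
    for a, b in zip(lo, hi):
        vol *= max(b - a, 0.0)
    return vol


def intersect(x: Box, y: Box) -> Box:
    """Axis-aligned intersection: max of the mins, min of the maxes."""
    lo = [max(a, b) for a, b in zip(x[0], y[0])]
    hi = [min(a, b) for a, b in zip(x[1], y[1])]
    return lo, hi


def conditional_prob(target: Box, given: Box) -> float:
    """P(target | given) = Vol(target ∩ given) / Vol(given)."""
    denom = volume(given)
    if denom == 0.0:
        return 0.0
    return volume(intersect(target, given)) / denom


# Hypothetical hand-set boxes: `artist` lies entirely inside `person`.
person: Box = ([0.0, 0.0], [4.0, 4.0])
artist: Box = ([1.0, 1.0], [3.0, 3.0])
mention: Box = ([1.5, 1.5], [2.5, 3.5])

p_person_given_artist = conditional_prob(person, artist)
p_artist_given_person = conditional_prob(artist, person)
p_artist_given_mention = conditional_prob(artist, mention)
```

Because `artist` is contained in `person`, the first probability is 1 while the reverse is smaller, which is the asymmetry that lets containment encode a type hierarchy.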

Energy-Based Reranking: Improving Neural Machine Translation Using Energy-Based Models
Sumanta Bhattacharyya | Amirmohammad Rooshenas | Subhajit Naskar | Simeng Sun | Mohit Iyyer | Andrew McCallum
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

The discrepancy between maximum likelihood estimation (MLE) and task measures such as BLEU score has been studied before for autoregressive neural machine translation (NMT) and resulted in alternative training algorithms (Ranzato et al., 2016; Norouzi et al., 2016; Shen et al., 2016; Wu et al., 2018). However, MLE training remains the de facto approach for autoregressive NMT because of its computational efficiency and stability. Despite this mismatch between the training objective and task measure, we notice that the samples drawn from an MLE-based trained NMT support the desired distribution: there are samples with much higher BLEU score compared to the beam decoding output. To benefit from this observation, we train an energy-based model to mimic the behavior of the task measure (i.e., the energy-based model assigns lower energy to samples with higher BLEU score), which results in a re-ranking algorithm based on the samples drawn from NMT: energy-based re-ranking (EBR). We use both marginal energy models (over target sentence) and joint energy models (over both source and target sentences). Our EBR with the joint energy model consistently improves the performance of the Transformer-based NMT: +3.7 BLEU points on IWSLT’14 German-English, +3.37 BLEU points on Sinhala-English, +1.4 BLEU points on WMT’16 English-German tasks.
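
The re-ranking step itself can be sketched in a few lines: given candidate translations sampled from an NMT model and an energy function, return the candidate with the lowest energy. Everything below is illustrative. `rerank_by_energy`, the candidate strings, and the fixed energy table are hypothetical stand-ins; in the paper the energy comes from a trained model.

```python
from typing import Callable, List, Sequence


def rerank_by_energy(
    samples: Sequence[str], energy: Callable[[str], float]
) -> List[str]:
    """Sort candidates from lowest to highest energy (lower is better)."""
    return sorted(samples, key=energy)


# Hypothetical candidates with made-up energies standing in for the
# output of a trained energy-based model.
toy_energies = {
    "the house is small": -1.2,
    "the home is little": 0.3,
    "house small the is": 0.9,
}
candidates = list(toy_energies)

ranked = rerank_by_energy(candidates, toy_energies.__getitem__)
best = ranked[0]
```

A marginal energy model would score only the target sentence, as here; a joint model would take the source sentence as a second argument.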

Benchmarking Scalable Methods for Streaming Cross Document Entity Coreference
Robert L Logan IV | Andrew McCallum | Sameer Singh | Dan Bikel
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Streaming cross document entity coreference (CDC) systems disambiguate mentions of named entities in a scalable manner via incremental clustering. Unlike other approaches for named entity disambiguation (e.g., entity linking), streaming CDC allows for the disambiguation of entities that are unknown at inference time. Thus, it is well-suited for processing streams of data where new entities are frequently introduced. Despite these benefits, this task is currently difficult to study, as existing approaches are either evaluated on datasets that are no longer available, or omit other crucial details needed to ensure fair comparison. In this work, we address this issue by compiling a large benchmark adapted from existing free datasets, and performing a comprehensive evaluation of a number of novel and existing baseline models. We investigate: how to best encode mentions, which clustering algorithms are most effective for grouping mentions, how models transfer to different domains, and how bounding the number of mentions tracked during inference impacts performance. Our results show that the relative performance of neural and feature-based mention encoders varies across different domains, and in most cases the best performance is achieved using a combination of both approaches. We also find that performance is minimally impacted by limiting the number of tracked mentions.

Long Document Summarization in a Low Resource Setting using Pretrained Language Models
Ahsaas Bajaj | Pavitra Dangati | Kalpesh Krishna | Pradhiksha Ashok Kumar | Rheeya Uppaal | Bradford Windsor | Eliot Brenner | Dominic Dotterrer | Rajarshi Das | Andrew McCallum
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

Abstractive summarization is the task of compressing a long document into a coherent short document while retaining salient information. Modern abstractive summarization methods are based on deep neural networks which often require large training datasets. Since collecting summarization datasets is an expensive and time-consuming task, practical industrial settings are usually low-resource. In this paper, we study a challenging low-resource setting of summarizing long legal briefs with an average source document length of 4268 words and only 120 available (document, summary) pairs. To account for data scarcity, we used a modern pre-trained abstractive summarizer BART, which only achieves 17.9 ROUGE-L as it struggles with long documents. We thus attempt to compress these long documents by identifying salient sentences in the source which best ground the summary, using a novel algorithm based on GPT-2 language model perplexity scores, that operates within the low resource regime. On feeding the compressed documents to BART, we observe a 6.0 ROUGE-L improvement. Our method also beats several competitive salience detection baselines. Furthermore, the identified salient sentences tend to agree with independent human labeling by domain experts.

Improved Latent Tree Induction with Distant Supervision via Span Constraints
Zhiyang Xu | Andrew Drozdov | Jay Yoon Lee | Tim O’Gorman | Subendhu Rongali | Dylan Finkbeiner | Shilpa Suresh | Mohit Iyyer | Andrew McCallum
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

For over thirty years, researchers have developed and analyzed methods for latent tree induction as an approach for unsupervised syntactic parsing. Nonetheless, modern systems still do not perform well enough compared to their supervised counterparts to have any practical use as structural annotation of text. In this work, we present a technique that uses distant supervision in the form of span constraints (i.e. phrase bracketing) to improve performance in unsupervised constituency parsing. Using a relatively small number of span constraints we can substantially improve the output from DIORA, an already competitive unsupervised parsing system. Compared with full parse tree annotation, span constraints can be acquired with minimal effort, such as with a lexicon derived from Wikipedia, to find exact text matches. Our experiments show span constraints based on entities improve constituency parsing on English WSJ Penn Treebank by more than 5 F1. Furthermore, our method extends to any domain where span constraints are easily attainable, and as a case study we demonstrate its effectiveness by parsing biomedical text from the CRAFT dataset.

Case-based Reasoning for Natural Language Queries over Knowledge Bases
Rajarshi Das | Manzil Zaheer | Dung Thai | Ameya Godbole | Ethan Perez | Jay Yoon Lee | Lizhen Tan | Lazaros Polymenakos | Andrew McCallum
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

It is often challenging to solve a complex problem from scratch, but much easier if we can access other similar problems with their solutions, a paradigm known as case-based reasoning (CBR). We propose a neuro-symbolic CBR approach (CBR-KBQA) for question answering over large knowledge bases. CBR-KBQA consists of a nonparametric memory that stores cases (question and logical forms) and a parametric model that can generate a logical form for a new question by retrieving cases that are relevant to it. On several KBQA datasets that contain complex questions, CBR-KBQA achieves competitive performance. For example, on the CWQ dataset, CBR-KBQA outperforms the current state of the art by 11% on accuracy. Furthermore, we show that CBR-KBQA is capable of using new cases without any further training: by incorporating a few human-labeled examples in the case memory, CBR-KBQA is able to successfully generate logical forms containing unseen KB entities as well as relations.

Probabilistic Box Embeddings for Uncertain Knowledge Graph Reasoning
Xuelu Chen | Michael Boratko | Muhao Chen | Shib Sankar Dasgupta | Xiang Lorraine Li | Andrew McCallum
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Knowledge bases often consist of facts which are harvested from a variety of sources, many of which are noisy and some of which conflict, resulting in a level of uncertainty for each triple. Knowledge bases are also often incomplete, prompting the use of embedding methods to generalize from known facts. However, existing embedding methods only model triple-level uncertainty, and reasoning results lack global consistency. To address these shortcomings, we propose BEUrRE, a novel uncertain knowledge graph embedding method with calibrated probabilistic semantics. BEUrRE models each entity as a box (i.e. axis-aligned hyperrectangle) and relations between two entities as affine transforms on the head and tail entity boxes. The geometry of the boxes allows for efficient calculation of intersections and volumes, endowing the model with calibrated probabilistic semantics and facilitating the incorporation of relational constraints. Extensive experiments on two benchmark datasets show that BEUrRE consistently outperforms baselines on confidence prediction and fact ranking due to its probabilistic calibration and ability to capture high-order dependencies among facts.

Clustering-based Inference for Biomedical Entity Linking
Rico Angell | Nicholas Monath | Sunil Mohan | Nishant Yadav | Andrew McCallum
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Due to the large number of entities in biomedical knowledge bases, only a small fraction of entities have corresponding labelled training data. This necessitates entity linking models which are able to link mentions of unseen entities using learned representations of entities. Previous approaches link each mention independently, ignoring the relationships within and across documents between the entity mentions. These relations can be very useful for linking mentions in biomedical text where linking decisions are often difficult due to mentions having a generic or a highly specialized form. In this paper, we introduce a model in which linking decisions can be made not merely by linking to a knowledge base entity but also by grouping multiple mentions together via clustering and jointly making linking predictions. In experiments on the largest publicly available biomedical dataset, we improve the best independent prediction for entity linking by 3.0 points of accuracy, and our clustering-based inference model further improves entity linking by 2.3 points.

2020

Self-Supervised Meta-Learning for Few-Shot Natural Language Classification Tasks
Trapit Bansal | Rishikesh Jha | Tsendsuren Munkhdalai | Andrew McCallum
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Self-supervised pre-training of transformer models has revolutionized NLP applications. Such pre-training with language modeling objectives provides a useful initial point for parameters that generalize well to new tasks with fine-tuning. However, fine-tuning is still data inefficient: when there are few labeled examples, accuracy can be low. Data efficiency can be improved by optimizing pre-training directly for future fine-tuning with few examples; this can be treated as a meta-learning problem. However, standard meta-learning techniques require many training tasks in order to generalize; unfortunately, finding a diverse set of such supervised tasks is usually difficult. This paper proposes a self-supervised approach to generate a large, rich, meta-learning task distribution from unlabeled text. This is achieved using a cloze-style objective, but creating separate multi-class classification tasks by gathering tokens-to-be blanked from among only a handful of vocabulary terms. This yields as many unique meta-training tasks as the number of subsets of vocabulary terms. We meta-train a transformer model on this distribution of tasks using a recent meta-learning framework. On 17 NLP tasks, we show that this meta-training leads to better few-shot generalization than language-model pre-training followed by finetuning. Furthermore, we show how the self-supervised tasks can be combined with supervised tasks for meta-learning, providing substantial accuracy gains over previous supervised meta-learning.
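
The cloze-style task construction described above can be sketched as follows. This is a minimal illustration under simplifying assumptions (whitespace tokenization, sentences containing exactly one of the chosen words); the function name `make_cloze_task`, the `[MASK]` string, and the toy sentences are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple


def make_cloze_task(
    sentences: Sequence[str], words: Sequence[str], mask: str = "[MASK]"
) -> List[Tuple[str, int]]:
    """Build one multi-class task from a small set of vocabulary words.

    Each sentence containing exactly one of the chosen words becomes an
    example: the word is blanked out and its index in `words` is the label.
    """
    label_of = {w: i for i, w in enumerate(words)}
    examples = []
    for sentence in sentences:
        tokens = sentence.split()
        present = [w for w in words if w in tokens]
        if len(present) != 1:
            continue  # skip sentences with none or several of the words
        target = present[0]
        masked = [mask if t == target else t for t in tokens]
        examples.append((" ".join(masked), label_of[target]))
    return examples


# Toy unlabeled corpus; a different subset of words gives a different task.
corpus = [
    "the cat sat on the mat",
    "a dog barked loudly",
    "the bird sang",
    "cat and dog played",
]
task = make_cloze_task(corpus, ["cat", "dog"])
```

Choosing another subset of words, for example `["bird", "mat"]`, yields a different classification task from the same text, which is how a large task distribution can arise from unlabeled data.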

Unsupervised Parsing with S-DIORA: Single Tree Encoding for Deep Inside-Outside Recursive Autoencoders
Andrew Drozdov | Subendhu Rongali | Yi-Pei Chen | Tim O’Gorman | Mohit Iyyer | Andrew McCallum
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The deep inside-outside recursive autoencoder (DIORA; Drozdov et al. 2019) is a self-supervised neural model that learns to induce syntactic tree structures for input sentences without access to labeled training data. In this paper, we discover that while DIORA exhaustively encodes all possible binary trees of a sentence with a soft dynamic program, its vector averaging approach is locally greedy and cannot recover from errors when computing the highest scoring parse tree in bottom-up chart parsing. To fix this issue, we introduce S-DIORA, an improved variant of DIORA that encodes a single tree rather than a softly-weighted mixture of trees by employing a hard argmax operation and a beam at each cell in the chart. Our experiments show that through fine-tuning a pre-trained DIORA with our new algorithm, we improve the state of the art in unsupervised constituency parsing on the English WSJ Penn Treebank by 2.2-6% F1, depending on the data used for fine-tuning.

An Instance Level Approach for Shallow Semantic Parsing in Scientific Procedural Text
Daivik Swarup | Ahsaas Bajaj | Sheshera Mysore | Tim O’Gorman | Rajarshi Das | Andrew McCallum
Findings of the Association for Computational Linguistics: EMNLP 2020

In specific domains, such as procedural scientific text, human labeled data for shallow semantic parsing is especially limited and expensive to create. Fortunately, such specific domains often use rather formulaic writing, such that the different ways of expressing relations in a small number of grammatically similar labeled sentences may provide high coverage of semantic structures in the corpus, through an appropriately rich similarity metric. In light of this opportunity, this paper explores an instance-based approach to the relation prediction sub-task within shallow semantic parsing, in which semantic labels from structurally similar sentences in the training set are copied to test sentences. Candidate similar sentences are retrieved using SciBERT embeddings. For labels where it is possible to copy from a similar sentence we employ an instance level copy network; when this is not possible, a globally shared parametric model is employed. Experiments show our approach outperforms both baseline and prior methods by 0.75 to 3 F1 absolute in the Wet Lab Protocol Corpus and 1 F1 absolute in the Materials Science Procedural Text Corpus.

2019

Unsupervised Labeled Parsing with Deep Inside-Outside Recursive Autoencoders
Andrew Drozdov | Patrick Verga | Yi-Pei Chen | Mohit Iyyer | Andrew McCallum
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Understanding text often requires identifying meaningful constituent spans such as noun phrases and verb phrases. In this work, we show that we can effectively recover these types of labels using the learned phrase vectors from deep inside-outside recursive autoencoders (DIORA). Specifically, we cluster span representations to induce span labels. Additionally, we improve the model’s labeling accuracy by integrating latent code learning into the training procedure. We evaluate this approach empirically through unsupervised labeled constituency parsing. Our method outperforms ELMo and BERT on two versions of the Wall Street Journal (WSJ) dataset and is competitive with prior work that requires additional human annotations, improving over a previous state-of-the-art system that depends on ground-truth part-of-speech tags by 5 absolute F1 points (19% relative error reduction).

Chains-of-Reasoning at TextGraphs 2019 Shared Task: Reasoning over Chains of Facts for Explainable Multi-hop Inference
Rajarshi Das | Ameya Godbole | Manzil Zaheer | Shehzaad Dhuliawala | Andrew McCallum
Proceedings of the Thirteenth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-13)

This paper describes our submission to the shared task on Multi-hop Inference Explanation Regeneration at the TextGraphs workshop at EMNLP 2019 (Jansen and Ustalov, 2019). Our system identifies chains of facts relevant to explain an answer to an elementary science examination question. To counter the problem of ‘spurious chains’ leading to ‘semantic drifts’, we train a ranker that uses contextualized representation of facts to score their relevance for explaining an answer to a question. Our system was ranked first w.r.t. the mean average precision (MAP) metric, outperforming the second best system by 14.95 points.

Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications
Vivi Nastase | Benjamin Roth | Laura Dietz | Andrew McCallum
Proceedings of the Workshop on Extracting Structured Knowledge from Scientific Publications

Unsupervised Latent Tree Induction with Deep Inside-Outside Recursive Auto-Encoders
Andrew Drozdov | Patrick Verga | Mohit Yadav | Mohit Iyyer | Andrew McCallum
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

We introduce the deep inside-outside recursive autoencoder (DIORA), a fully-unsupervised method for discovering syntax that simultaneously learns representations for constituents within the induced tree. Our approach predicts each word in an input sentence conditioned on the rest of the sentence. During training we use dynamic programming to consider all possible binary trees over the sentence, and for inference we use the CKY algorithm to extract the highest scoring parse. DIORA outperforms previously reported results for unsupervised binary constituency parsing on the benchmark WSJ dataset.
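
The inference step mentioned above, extracting the highest-scoring binary tree with CKY, can be sketched as below. This is a generic CKY over span scores, not DIORA itself: `span_score` is a hypothetical callable standing in for scores that DIORA would compute from its learned representations, and the toy score table is made up.

```python
from typing import Callable, Dict, Tuple, Union

Tree = Union[int, Tuple["Tree", "Tree"]]


def cky_best_tree(
    n: int, span_score: Callable[[int, int], float]
) -> Tuple[Tree, float]:
    """Return the binary tree over n >= 1 tokens maximizing summed span scores.

    Spans are half-open intervals (i, j); leaves are token indices.
    """
    best: Dict[Tuple[int, int], float] = {}
    back: Dict[Tuple[int, int], int] = {}
    for i in range(n):
        best[(i, i + 1)] = span_score(i, i + 1)
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            k_best, s_best = i + 1, float("-inf")
            for k in range(i + 1, j):  # try every split point
                s = best[(i, k)] + best[(k, j)]
                if s > s_best:
                    k_best, s_best = k, s
            best[(i, j)] = s_best + span_score(i, j)
            back[(i, j)] = k_best

    def build(i: int, j: int) -> Tree:
        if j - i == 1:
            return i
        k = back[(i, j)]
        return (build(i, k), build(k, j))

    return build(0, n), best[(0, n)]


# Toy scores for a 3-token sentence: the span covering tokens 0-1 scores
# highest, so the best tree groups them first.
toy_scores = {(0, 2): 2.0, (1, 3): 0.5}
tree, total = cky_best_tree(3, lambda i, j: toy_scores.get((i, j), 0.0))
```

The chart is filled bottom-up in O(n^3) time, and the backpointers recover the tree in a single top-down pass.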

Energy and Policy Considerations for Deep Learning in NLP
Emma Strubell | Ananya Ganesh | Andrew McCallum
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Recent progress in hardware and methodology for training neural networks has ushered in a new generation of large networks trained on abundant data. These models have obtained notable gains in accuracy across many NLP tasks. However, these accuracy improvements depend on the availability of exceptionally large computational resources that necessitate similarly substantial energy consumption. As a result these models are costly to train and develop, both financially, due to the cost of hardware and electricity or cloud compute time, and environmentally, due to the carbon footprint required to fuel modern tensor processing hardware. In this paper we bring this issue to the attention of NLP researchers by quantifying the approximate financial and environmental costs of training a variety of recently successful neural network models for NLP. Based on these findings, we propose actionable recommendations to reduce costs and improve equity in NLP research and practice.

A2N: Attending to Neighbors for Knowledge Graph Inference
Trapit Bansal | Da-Cheng Juan | Sujith Ravi | Andrew McCallum
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

State-of-the-art models for knowledge graph completion aim at learning a fixed embedding representation of entities in a multi-relational graph which can generalize to infer unseen entity relationships at test time. This can be sub-optimal as it requires memorizing and generalizing to all possible entity relationships using these fixed representations. We thus propose a novel attention-based method to learn query-dependent representation of entities which adaptively combines the relevant graph neighborhood of an entity leading to more accurate KG completion. The proposed method is evaluated on two benchmark datasets for knowledge graph completion, and experimental results show that the proposed model performs competitively or better than existing state-of-the-art, including recent methods for explicit multi-hop reasoning. Qualitative probing offers insight into how the model can reason about facts involving multiple hops in the knowledge graph, through the use of neighborhood attention.
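
To make the idea of a query-dependent entity representation concrete, here is a minimal sketch of dot-product attention over neighbor embeddings. It is a simplification for illustration only: the paper's scoring function and how the query is formed from the relation are not reproduced here, and the names `attend_to_neighbors`, `query`, and `neighbors` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def attend_to_neighbors(
    query: Sequence[float], neighbors: Sequence[Sequence[float]]
) -> Tuple[List[float], List[float]]:
    """Softmax attention: weight each neighbor by its dot product with the query.

    Returns (attention weights, weighted average of neighbor embeddings).
    Assumes `neighbors` is non-empty and all vectors share one dimension.
    """
    scores = [sum(q * x for q, x in zip(query, nb)) for nb in neighbors]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    pooled = [
        sum(w * nb[d] for w, nb in zip(weights, neighbors))
        for d in range(len(query))
    ]
    return weights, pooled


# Hand-set toy embeddings: the first neighbor is most aligned with the query.
query = [1.0, 0.0]
neighbors = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]
weights, pooled = attend_to_neighbors(query, neighbors)
```

Changing `query` changes the weights, so the same entity neighborhood yields a different pooled representation for each query.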

2018

Marginal Likelihood Training of BiLSTM-CRF for Biomedical Named Entity Recognition from Disjoint Label Sets
Nathan Greenberg | Trapit Bansal | Patrick Verga | Andrew McCallum
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Extracting typed entity mentions from text is a fundamental component to language understanding and reasoning. While there exist substantial labeled text datasets for multiple subsets of biomedical entity types (such as genes and proteins, or chemicals and diseases), it is rare to find large labeled datasets containing labels for all desired entity types together. This paper presents a method for training a single CRF extractor from multiple datasets with disjoint or partially overlapping sets of entity types. Our approach employs marginal likelihood training to insist on labels that are present in the data, while filling in missing labels. This allows us to leverage all the available data within a single model. In experimental results on the Biocreative V CDR (chemicals / diseases), Biocreative VI ChemProt (chemicals / proteins) and MedMentions (19 entity types) datasets, we show that joint training on multiple datasets improves NER F1 over training in isolation, and our methods achieve state-of-the-art results.

Linguistically-Informed Self-Attention for Semantic Role Labeling
Emma Strubell | Patrick Verga | Daniel Andor | David Weiss | Andrew McCallum
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Current state-of-the-art semantic role labeling (SRL) uses a deep neural network with no explicit linguistic features. However, prior work has shown that gold syntax trees can dramatically improve SRL decoding, suggesting the possibility of increased accuracy from explicit modeling of syntax. In this work, we present linguistically-informed self-attention (LISA): a neural network model that combines multi-head self-attention with multi-task learning across dependency parsing, part-of-speech tagging, predicate detection and SRL. Unlike previous models which require significant pre-processing to prepare linguistic features, LISA can incorporate syntax using merely raw tokens as input, encoding the sequence only once to simultaneously perform parsing, predicate detection and role labeling for all predicates. Syntax is incorporated by training one attention head to attend to syntactic parents for each token. Moreover, if a high-quality syntactic parse is already available, it can be beneficially injected at test time without re-training our SRL model. In experiments on CoNLL-2005 SRL, LISA achieves new state-of-the-art performance for a model using predicted predicates and standard word embeddings, attaining 2.5 F1 absolute higher than the previous state-of-the-art on newswire and more than 3.5 F1 on out-of-domain data, nearly 10% reduction in error. On CoNLL-2012 English SRL we also show an improvement of more than 2.5 F1.

An Interface for Annotating Science Questions
Michael Boratko | Harshit Padigela | Divyendra Mikkilineni | Pritish Yuvraj | Rajarshi Das | Andrew McCallum | Maria Chang | Achille Fokoue | Pavan Kapanipathi | Nicholas Mattei | Ryan Musa | Kartik Talamadupula | Michael Witbrock
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Recent work introduces the AI2 Reasoning Challenge (ARC) and the associated ARC dataset that partitions open domain, complex science questions into an Easy Set and a Challenge Set. That work includes an analysis of 100 questions with respect to the types of knowledge and reasoning required to answer them. However, it does not include clear definitions of these types, nor does it offer information about the quality of the labels or the annotation process used. In this paper, we introduce a novel interface for human annotation of science question-answer pairs with their respective knowledge and reasoning types, in order that the classification of new questions may be improved. We build on the classification schema proposed by prior work on the ARC dataset, and evaluate the effectiveness of our interface with a preliminary study involving 10 participants.

Efficient Graph-based Word Sense Induction by Distributional Inclusion Vector Embeddings
Haw-Shiuan Chang | Amol Agrawal | Ananya Ganesh | Anirudha Desai | Vinayak Mathur | Alfred Hough | Andrew McCallum
Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12)

Word sense induction (WSI), which addresses polysemy by unsupervised discovery of multiple word senses, resolves ambiguities for downstream NLP tasks and also makes word representations more interpretable. This paper proposes an accurate and efficient graph-based method for WSI that builds a global non-negative vector embedding basis (which are interpretable like topics) and clusters the basis indexes in the ego network of each polysemous word. By adopting distributional inclusion vector embeddings as our basis formation model, we avoid the expensive step of nearest neighbor search that plagues other graph-based methods without sacrificing the quality of sense clusters. Experiments on three datasets show that our proposed method produces similar or better sense clusters and embeddings compared with previous state-of-the-art methods while being significantly more efficient.

Distributional Inclusion Vector Embedding for Unsupervised Hypernymy Detection
Haw-Shiuan Chang | Ziyun Wang | Luke Vilnis | Andrew McCallum
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

Modeling hypernymy, such as poodle is-a dog, is an important generalization aid to many NLP tasks, such as entailment, relation extraction, and question answering. Supervised learning from labeled hypernym sources, such as WordNet, limits the coverage of these models, which can be addressed by learning hypernyms from unlabeled text. Existing unsupervised methods either do not scale to large vocabularies or yield unacceptably poor accuracy. This paper introduces distributional inclusion vector embedding (DIVE), a simple-to-implement unsupervised method of hypernym discovery via per-word non-negative vector embeddings which preserve the inclusion property of word contexts. In experimental evaluations more comprehensive than any previous literature of which we are aware (evaluating on 11 datasets using multiple existing as well as newly proposed scoring functions), we find that our method provides up to double the precision of previous unsupervised methods, and the highest average performance, using a much more compact word representation, and yielding many new state-of-the-art results.

Embedded-State Latent Conditional Random Fields for Sequence Labeling
Dung Thai | Sree Harsha Ramesh | Shikhar Murty | Luke Vilnis | Andrew McCallum
Proceedings of the 22nd Conference on Computational Natural Language Learning

Complex textual information extraction tasks are often posed as sequence labeling or shallow parsing, where fields are extracted using local labels made consistent through probabilistic inference in a graphical model with constrained transitions. Recently, it has become common to locally parametrize these models using rich features extracted by recurrent neural networks (such as LSTM), while enforcing consistent outputs through a simple linear-chain model, representing Markovian dependencies between successive labels. However, the simple graphical model structure belies the often complex non-local constraints between output labels. For example, many fields, such as a first name, can only occur a fixed number of times, or in the presence of other fields. While RNNs have provided increasingly powerful context-aware local features for sequence tagging, they have yet to be integrated with a global graphical model of similar expressivity in the output distribution. Our model goes beyond the linear chain CRF to incorporate multiple hidden states per output label, but parametrizes them parsimoniously with low-rank log-potential scoring matrices, effectively learning an embedding space for hidden states. This augmented latent space of inference variables complements the rich feature representation of the RNN, and allows exact global inference obeying complex, learned non-local output constraints. We experiment with several datasets and show that the model outperforms baseline CRF+RNN models when global output constraints are necessary at inference-time, and explore the interpretable latent structure.

pdf bib
Probabilistic Embedding of Knowledge Graphs with Box Lattice Measures
Luke Vilnis | Xiang Li | Shikhar Murty | Andrew McCallum
Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Embedding methods which enforce a partial order or lattice structure over the concept space, such as Order Embeddings (OE), are a natural way to model transitive relational data (e.g. entailment graphs). However, OE learns a deterministic knowledge base, limiting expressiveness of queries and the ability to use uncertainty for both prediction and learning (e.g. learning from expectations). Probabilistic extensions of OE have provided the ability to somewhat calibrate these denotational probabilities while retaining the consistency and inductive bias of ordered models, but lack the ability to model the negative correlations found in real-world knowledge. In this work we show that a broad class of models that assign probability measures to OE can never capture negative correlation, which motivates our construction of a novel box lattice and accompanying probability measure to capture anti-correlation and even disjoint concepts, while still providing the benefits of probabilistic modeling, such as the ability to perform rich joint and conditional queries over arbitrary sets of concepts, and both learning from and predicting calibrated uncertainty. We show improvements over previous approaches in modeling the Flickr and WordNet entailment graphs, and investigate the power of the model.

2017

pdf bib
Question Answering on Knowledge Bases and Text using Universal Schema and Memory Networks
Rajarshi Das | Manzil Zaheer | Siva Reddy | Andrew McCallum
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Existing question answering methods infer answers either from a knowledge base or from raw text. While knowledge base (KB) methods are good at answering compositional questions, their performance is often affected by the incompleteness of the KB. Au contraire, web text contains millions of facts that are absent in the KB, however in an unstructured form. Universal schema can support reasoning on the union of both structured KBs and unstructured text by aligning them in a common embedded space. In this paper we extend universal schema to natural language question answering, employing Memory networks to attend to the large body of facts in the combination of text and KB. Our models can be trained in an end-to-end fashion on question-answer pairs. Evaluation results on Spades fill-in-the-blank question answering dataset show that exploiting universal schema for question answering is better than using either a KB or text alone. This model also outperforms the current state-of-the-art by 8.5 F1 points.

pdf bib
Dependency Parsing with Dilated Iterated Graph CNNs
Emma Strubell | Andrew McCallum
Proceedings of the 2nd Workshop on Structured Prediction for Natural Language Processing

Dependency parses are an effective way to inject linguistic knowledge into many downstream tasks, and many practitioners wish to efficiently parse sentences at scale. While recent advances in GPU hardware have enabled neural networks to achieve significant gains over the previous best models, these models still fail to leverage GPUs’ capability for massive parallelism due to their requirement of sequential processing of the sentence. In response, we propose Dilated Iterated Graph Convolutional Neural Networks (DIG-CNNs) for graph-based dependency parsing, a graph convolutional architecture that allows for efficient end-to-end GPU parsing. In experiments on the English Penn TreeBank benchmark, we show that DIG-CNNs perform on par with some of the best neural network parsers.

pdf bib
SemEval 2017 Task 10: ScienceIE - Extracting Keyphrases and Relations from Scientific Publications
Isabelle Augenstein | Mrinal Das | Sebastian Riedel | Lakshmi Vikraman | Andrew McCallum
Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017)

We describe the SemEval task of extracting keyphrases and relations between them from scientific documents, which is crucial for understanding which publications describe which processes, tasks and materials. Although this was a new task, we had a total of 26 submissions across 3 evaluation scenarios. We expect the task and the findings reported in this paper to be relevant for researchers working on understanding scientific content, as well as the broader knowledge base population and information extraction communities.

pdf bib
Fast and Accurate Entity Recognition with Iterated Dilated Convolutions
Emma Strubell | Patrick Verga | David Belanger | Andrew McCallum
Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

Today when many practitioners run basic NLP on the entire web and large-volume traffic, faster methods are paramount to saving time and energy costs. Recent advances in GPU hardware have led to the emergence of bi-directional LSTMs as a standard method for obtaining per-token vector representations serving as input to labeling tasks such as NER (often followed by prediction in a linear-chain CRF). Though expressive and accurate, these models fail to fully exploit GPU parallelism, limiting their computational efficiency. This paper proposes a faster alternative to Bi-LSTMs for NER: Iterated Dilated Convolutional Neural Networks (ID-CNNs), which have better capacity than traditional CNNs for large context and structured prediction. Unlike LSTMs whose sequential processing on sentences of length N requires O(N) time even in the face of parallelism, ID-CNNs permit fixed-depth convolutions to run in parallel across entire documents. We describe a distinct combination of network structure, parameter sharing and training procedures that enable dramatic 14-20x test-time speedups while retaining accuracy comparable to the Bi-LSTM-CRF. Moreover, ID-CNNs trained to aggregate context from the entire document are more accurate than Bi-LSTM-CRFs while attaining 8x faster test time speeds.

pdf bib
Chains of Reasoning over Entities, Relations, and Text using Recurrent Neural Networks
Rajarshi Das | Arvind Neelakantan | David Belanger | Andrew McCallum
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Our goal is to combine the rich multi-step inference of symbolic logical reasoning with the generalization capabilities of neural networks. We are particularly interested in complex reasoning about entities and relations in text and large-scale knowledge bases (KBs). Neelakantan et al. (2015) use RNNs to compose the distributed semantics of multi-hop paths in KBs; however for multiple reasons, the approach lacks accuracy and practicality. This paper proposes three significant modeling advances: (1) we learn to jointly reason about relations, entities, and entity-types; (2) we use neural attention modeling to incorporate multiple paths; (3) we learn to share strength in a single RNN that represents logical composition across all relations. On a large-scale Freebase+ClueWeb prediction task, we achieve 25% error reduction, and a 53% error reduction on sparse relations due to shared strength. On chains of reasoning in WordNet we reduce error in mean quantile by 84% versus previous state-of-the-art.

pdf bib
Generalizing to Unseen Entities and Entity Pairs with Row-less Universal Schema
Patrick Verga | Arvind Neelakantan | Andrew McCallum
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Universal schema predicts the types of entities and relations in a knowledge base (KB) by jointly embedding the union of all available schema types: not only types from multiple structured databases (such as Freebase or Wikipedia infoboxes), but also types expressed as textual patterns from raw text. This prediction is typically modeled as a matrix completion problem, with one type per column, and either one or two entities per row (in the case of entity types or binary relation types, respectively). Factorizing this sparsely observed matrix yields a learned vector embedding for each row and each column. In this paper we explore the problem of making predictions for entities or entity-pairs unseen at training time (and hence without a pre-learned row embedding). We propose an approach having no per-row parameters at all; rather we produce a row vector on the fly using a learned aggregation function of the vectors of the observed columns for that row. We experiment with various aggregation functions, including neural network attention models. Our approach can be understood as a natural language database, in that questions about KB entities are answered by attending to textual or database evidence. In experiments predicting both relations and entity types, we demonstrate that despite having an order of magnitude fewer parameters than traditional universal schema, we can match the accuracy of the traditional model, and more importantly, we can now make predictions about unseen rows with nearly the same accuracy as rows available at training time.