Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12)

Goran Glavaš, Swapna Somasundaran, Martin Riedl, Eduard Hovy (Editors)

New Orleans, Louisiana, USA
NAACL | TextGraphs | WS
Association for Computational Linguistics

Multi-hop Inference for Sentence-level TextGraphs: How Challenging is Meaningfully Combining Information for Science Question Answering?
Peter Jansen

Question Answering for complex questions is often modelled as a graph construction or traversal task, where a solver must build or traverse a graph of facts that answer and explain a given question. This multi-hop inference has been shown to be extremely challenging, with few models able to aggregate more than two facts before being overwhelmed by semantic drift, or the tendency for long chains of facts to quickly drift off topic. This is a major barrier to current inference models, as even elementary science questions require an average of 4 to 6 facts to answer and explain. In this work we empirically characterize the difficulty of building or traversing a graph of sentences connected by lexical overlap, by evaluating chance sentence aggregation quality through 9,784 manually-annotated judgements across knowledge graphs built from three free-text corpora (including study guides and Simple Wikipedia). We demonstrate semantic drift tends to be high and aggregation quality low, at between 0.04 and 3, and highlight scenarios that maximize the likelihood of meaningfully combining information.
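
To make the idea of "a graph of sentences connected by lexical overlap" concrete, the following is a minimal sketch and not the paper's actual pipeline. The stop-word list, the tokenization, and the function names `content_words` and `build_overlap_graph` are assumptions made for illustration. Two sentences are linked whenever they share at least one content word.

```python
from itertools import combinations

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "that", "it"}


def content_words(sentence):
    """Lowercase the tokens, strip surrounding punctuation, drop stop words."""
    tokens = (tok.strip(".,;:!?()\"'").lower() for tok in sentence.split())
    return {tok for tok in tokens if tok and tok not in STOP_WORDS}


def build_overlap_graph(sentences):
    """Map each sentence pair (i, j), i < j, to the content words they share.

    Pairs with no shared content word get no edge.
    """
    words = [content_words(s) for s in sentences]
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        shared = words[i] & words[j]
        if shared:
            edges[(i, j)] = shared
    return edges


# Toy sentences invented for this example (not taken from the paper's corpora).
sentences = [
    "A plant needs sunlight to grow.",
    "Sunlight is a kind of energy.",
    "A battery stores energy.",
]
edges = build_overlap_graph(sentences)
for (i, j), shared in sorted(edges.items()):
    print(i, j, sorted(shared))
```

In this toy graph, sentence 0 reaches sentence 2 in two hops (via "sunlight", then "energy"), even though a fact about batteries has little to do with plant growth. That is a small-scale picture of the semantic drift the abstract describes.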

Multi-Sentence Compression with Word Vertex-Labeled Graphs and Integer Linear Programming
Elvys Linhares Pontes | Stéphane Huet | Thiago Gouveia da Silva | Andréa Carneiro Linhares | Juan-Manuel Torres-Moreno

Multi-Sentence Compression (MSC) aims to generate a short sentence with key information from a cluster of closely related sentences. MSC enables summarization and question-answering systems to generate outputs combining fully formed sentences from one or several documents. This paper describes a new Integer Linear Programming method for MSC using a vertex-labeled graph to select different keywords, and novel 3-gram scores to generate more informative sentences while maintaining their grammaticality. Our system is of good quality and outperforms the state of the art in evaluations conducted on news datasets. We conducted both automatic and manual evaluations to determine the informativeness and the grammaticality of compressions for each dataset. Additional tests, which take advantage of the fact that the length of compressions can be modulated, still improve ROUGE scores with shorter output sentences.

Large-scale spectral clustering using diffusion coordinates on landmark-based bipartite graphs
Khiem Pham | Guangliang Chen

Spectral clustering has received a lot of attention due to its ability to separate nonconvex, non-intersecting manifolds, but its high computational complexity has significantly limited its applicability. Motivated by the document-term co-clustering framework by Dhillon (2001), we propose a landmark-based scalable spectral clustering approach in which we first use the selected landmark set and the given data to form a bipartite graph and then run a diffusion process on it to obtain a family of diffusion coordinates for clustering. We show that our proposed algorithm can be implemented based on very efficient operations on the affinity matrix between the given data and selected landmarks, thus capable of handling large data. Finally, we demonstrate the excellent performance of our method by comparing with the state-of-the-art scalable algorithms on several benchmark data sets.

Efficient Graph-based Word Sense Induction by Distributional Inclusion Vector Embeddings
Haw-Shiuan Chang | Amol Agrawal | Ananya Ganesh | Anirudha Desai | Vinayak Mathur | Alfred Hough | Andrew McCallum

Word sense induction (WSI), which addresses polysemy by unsupervised discovery of multiple word senses, resolves ambiguities for downstream NLP tasks and also makes word representations more interpretable. This paper proposes an accurate and efficient graph-based method for WSI that builds a global non-negative vector embedding basis (which are interpretable like topics) and clusters the basis indexes in the ego network of each polysemous word. By adopting distributional inclusion vector embeddings as our basis formation model, we avoid the expensive step of nearest neighbor search that plagues other graph-based methods without sacrificing the quality of sense clusters. Experiments on three datasets show that our proposed method produces similar or better sense clusters and embeddings compared with previous state-of-the-art methods while being significantly more efficient.