Proceedings of the Second Workshop on Scholarly Document Processing

Iz Beltagy, Arman Cohan, Guy Feigenblat, Dayne Freitag, Tirthankar Ghosal, Keith Hall, Drahomira Herrmannova, Petr Knoth, Kyle Lo, Philipp Mayr, Robert M. Patton, Michal Shmueli-Scheuer, Anita de Waard, Kuansan Wang, Lucy Lu Wang (Editors)

Anthology ID:
NAACL | sdp
Association for Computational Linguistics
Bib Export formats:

pdf bib
Proceedings of the Second Workshop on Scholarly Document Processing
Iz Beltagy | Arman Cohan | Guy Feigenblat | Dayne Freitag | Tirthankar Ghosal | Keith Hall | Drahomira Herrmannova | Petr Knoth | Kyle Lo | Philipp Mayr | Robert M. Patton | Michal Shmueli-Scheuer | Anita de Waard | Kuansan Wang | Lucy Lu Wang

pdf bib
Keyphrase Extraction from Scientific Articles via Extractive Summarization
Chrysovalantis Giorgos Kontoulis | Eirini Papagiannopoulou | Grigorios Tsoumakas

Automatically extracting keyphrases from scholarly documents leads to a valuable concise representation that humans can understand and machines can process for tasks, such as information retrieval, article clustering and article classification. This paper is concerned with the parts of a scientific article that should be given as input to keyphrase extraction methods. Recent deep learning methods take titles and abstracts as input due to the increased computational complexity in processing long sequences, whereas traditional approaches can also work with full-texts. Titles and abstracts are dense in keyphrases, but often miss important aspects of the articles, while full-texts on the other hand are richer in keyphrases but much noisier. To address this trade-off, we propose the use of extractive summarization models on the full-texts of scholarly documents. Our empirical study on 3 article collections using 3 keyphrase extraction methods shows promising results.

pdf bib
The Effect of Pretraining on Extractive Summarization for Scientific Documents
Yash Gupta | Pawan Sasanka Ammanamanchi | Shikha Bordia | Arjun Manoharan | Deepak Mittal | Ramakanth Pasunuru | Manish Shrivastava | Maneesh Singh | Mohit Bansal | Preethi Jyothi

Large pretrained models have seen enormous success in extractive summarization tasks. In this work, we investigate the influence of pretraining on a BERT-based extractive summarization system for scientific documents. We derive significant performance improvements using an intermediate pretraining step that leverages existing summarization datasets and report state-of-the-art results on a recently released scientific summarization dataset, SciTLDR. We systematically analyze the intermediate pretraining step by varying the size and domain of the pretraining corpus, changing the length of the input sequence in the target task and varying target tasks. We also investigate how intermediate pretraining interacts with contextualized word embeddings trained on different domains.

pdf bib
Finding Pragmatic Differences Between Disciplines
Lee Kezar | Jay Pujara

Scholarly documents have a great degree of variation, both in terms of content (semantics) and structure (pragmatics). Prior work in scholarly document understanding emphasizes semantics through document summarization and corpus topic modeling but tends to omit pragmatics such as document organization and flow. Using a corpus of scholarly documents across 19 disciplines and state-of-the-art language modeling techniques, we learn a fixed set of domain-agnostic descriptors for document sections and retrofit the corpus to these descriptors (also referred to as normalization). Then, we analyze the position and ordering of these descriptors across documents to understand the relationship between discipline and structure. We report within-discipline structural archetypes, variability, and between-discipline comparisons, supporting the hypothesis that scholarly communities, despite their size, diversity, and breadth, share similar avenues for expressing their work. Our findings lay the foundation for future work in assessing research quality, domain style transfer, and further pragmatic analysis.

pdf bib
Extractive Research Slide Generation Using Windowed Labeling Ranking
Athar Sefid | Prasenjit Mitra | Jian Wu | C Lee Giles

Presentation slides generated from original research papers provide an efficient form to present research innovations. Manually generating presentation slides is labor-intensive. We propose a method to automatically generates slides for scientific articles based on a corpus of 5000 paper-slide pairs compiled from conference proceedings websites. The sentence labeling module of our method is based on SummaRuNNer, a neural sequence model for extractive summarization. Instead of ranking sentences based on semantic similarities in the whole document, our algorithm measures the importance and novelty of sentences by combining semantic and lexical features within a sentence window. Our method outperforms several baseline methods including SummaRuNNer by a significant margin in terms of ROUGE score.

pdf bib
Unsupervised document summarization using pre-trained sentence embeddings and graph centrality
Juan Ramirez-Orta | Evangelos Milios

This paper describes our submission for the LongSumm task in SDP 2021. We propose a method for incorporating sentence embeddings produced by deep language models into extractive summarization techniques based on graph centrality in an unsupervised manner. The proposed method is simple, fast, can summarize any kind of document of any size and can satisfy any length constraints for the summaries produced. The method offers competitive performance to more sophisticated supervised methods and can serve as a proxy for abstractive summarization techniques

pdf bib
QMUL-SDS at SCIVER : Step-by-Step Binary Classification for Scientific Claim VerificationQMUL-SDS at SCIVER: Step-by-Step Binary Classification for Scientific Claim Verification
Xia Zeng | Arkaitz Zubiaga

Scientific claim verification is a unique challenge that is attracting increasing interest. The SCIVER shared task offers a benchmark scenario to test and compare claim verification approaches by participating teams and consists in three steps : relevant abstract selection, rationale selection and label prediction. In this paper, we present team QMUL-SDS’s participation in the shared task. We propose an approach that performs scientific claim verification by doing binary classifications step-by-step. We trained a BioBERT-large classifier to select abstracts based on pairwise relevance assessments for each claim, title of the abstract and continued to train it to select rationales out of each retrieved abstract based on claim, sentence. We then propose a two-step setting for label prediction, i.e. first predicting NOT_ENOUGH_INFO or ENOUGH_INFO, then label those marked as ENOUGH_INFO as either SUPPORT or CONTRADICT. Compared to the baseline system, we achieve substantial improvements on the dev set. As a result, our team is the No. 4 team on the leaderboard.