Proceedings of the Ninth International Workshop on Health Text Mining and Information Analysis

Alberto Lavelli, Anne-Lyse Minard, Fabio Rinaldi (Editors)


Anthology ID: W18-56
Month: October
Year: 2018
Address: Brussels, Belgium
Venues: EMNLP | Louhi | WS
Publisher: Association for Computational Linguistics
URL: https://aclanthology.org/W18-56
PDF: https://aclanthology.org/W18-56.pdf

Supervised Machine Learning for Extractive Query Based Summarisation of Biomedical Data
Mandeep Kaur | Diego Mollá

The automation of text summarisation of biomedical publications is a pressing need due to the plethora of information available online. This paper explores the impact of several supervised machine learning approaches for extracting multi-document summaries for given queries. In particular, we compare classification and regression approaches for query-based extractive summarisation using data provided by the BioASQ Challenge. We tackled the problem of annotating sentences for training classification systems and show that a simple annotation approach outperforms regression-based summarisation.

Comparing CNN and LSTM character-level embeddings in BiLSTM-CRF models for chemical and disease named entity recognition
Zenan Zhai | Dat Quoc Nguyen | Karin Verspoor

We compare the use of LSTM-based and CNN-based character-level word embeddings in BiLSTM-CRF models to approach chemical and disease named entity recognition (NER) tasks. Empirical results over the BioCreative V CDR corpus show that the use of either type of character-level word embeddings in conjunction with the BiLSTM-CRF models leads to comparable state-of-the-art performance. However, the models using CNN-based character-level word embeddings have a computational performance advantage, increasing training time over word-based models by 25% while the LSTM-based character-level word embeddings more than double the required training time.

Investigating the Challenges of Temporal Relation Extraction from Clinical Text
Diana Galvan | Naoaki Okazaki | Koji Matsuda | Kentaro Inui

Temporal reasoning remains an unsolved task for Natural Language Processing (NLP), as is particularly evident in the clinical domain. The complexity of temporal representation in language is evident, as results of the 2016 Clinical TempEval challenge indicate: the current state-of-the-art systems perform well in solving mention-identification tasks of event and time expressions but poorly in temporal relation extraction, showing a gap of around 0.25 points below human performance. We explore adapting the tree-based LSTM-RNN model proposed by Miwa and Bansal (2016) to temporal relation extraction from clinical text, obtaining a five-point improvement over the best 2016 Clinical TempEval system and two points over the state of the art. We deliver a deep analysis of the results and discuss the next step towards human-like temporal reasoning.

De-identifying Free Text of Japanese Dummy Electronic Health Records
Kohei Kajiyama | Hiromasa Horiguchi | Takashi Okumura | Mizuki Morita | Yoshinobu Kano

A new law was established in Japan to promote the utilization of EHRs for research and development, while de-identification is required to use EHRs. However, studies of automatic de-identification in the healthcare domain are not active for the Japanese language, and, as far as we know, no de-identification tool with practical performance is available for Japanese medical domains. Previous work shows that rule-based methods are still effective, while deep learning methods have recently been reported to perform better. In order to implement and evaluate a de-identification tool at a practical level, we implemented three methods: rule-based, CRF, and LSTM. We prepared three datasets of pseudo EHRs with manually annotated de-identification tags. These datasets are derived from shared task data, to compare with previous work, and from our new data, to increase the training data. Our results show that our LSTM-based method is better and more robust, which leads to our future work of applying our system to actual de-identification tasks in hospitals.

Unsupervised Identification of Study Descriptors in Toxicology Research: An Experimental Study
Drahomira Herrmannova | Steven Young | Robert Patton | Christopher Stahl | Nicole Kleinstreuer | Mary Wolfe

Identifying and extracting data elements such as study descriptors in publication full texts is a critical yet manual and labor-intensive step required in a number of tasks. In this paper we address the question of identifying data elements in an unsupervised manner. Specifically, provided a set of criteria describing specific study parameters, such as species, route of administration, and dosing regimen, we develop an unsupervised approach to identify text segments (sentences) relevant to the criteria. A binary classifier trained to identify publications that met the criteria performs better when trained on the candidate sentences than when trained on sentences randomly picked from the text, supporting the intuition that our method is able to accurately identify study descriptors.

Iterative development of family history annotation guidelines using a synthetic corpus of clinical text
Taraka Rama | Pål Brekke | Øystein Nytrø | Lilja Øvrelid

In this article, we describe the development of annotation guidelines for family history information in Norwegian clinical text. We make use of incrementally developed synthetic clinical text describing patients’ family history relating to cases of cardiac disease and present a general methodology which integrates the synthetically produced clinical statements and guideline development. We analyze inter-annotator agreement based on the developed guidelines and present results from experiments aimed at evaluating the validity and applicability of the annotated corpus using machine learning techniques. The resulting annotated corpus contains 477 sentences and 6030 tokens. Both the annotation guidelines and the annotated corpus are made freely available and as such constitute the first publicly available resource of Norwegian clinical text.

Analysis of Risk Factor Domains in Psychosis Patient Health Records
Eben Holderness | Nicholas Miller | Kirsten Bolton | Philip Cawkwell | Marie Meteer | James Pustejovsky | Mei Hua-Hall

Readmission after discharge from a hospital is disruptive and costly, regardless of the reason. However, it can be particularly problematic for psychiatric patients, so predicting which patients may be readmitted is critically important but also very difficult. Clinical narratives in psychiatric electronic health records (EHRs) span a wide range of topics and vocabulary; therefore, a psychiatric readmission prediction model must begin with a robust and interpretable topic extraction component. We created a data pipeline for using document vector similarity metrics to perform topic extraction on psychiatric EHR data in service of our long-term goal of creating a readmission risk classifier. We show initial results for our topic extraction model and identify additional features we will be incorporating in the future.

Patient Risk Assessment and Warning Symptom Detection Using Deep Attention-Based Neural Networks
Ivan Girardi | Pengfei Ji | An-phi Nguyen | Nora Hollenstein | Adam Ivankay | Lorenz Kuhn | Chiara Marchiori | Ce Zhang

We present an operational component of a real-world patient triage system. Given a specific patient presentation, the system is able to assess the level of medical urgency and issue the most appropriate recommendation in terms of best point of care and time to treat. We use an attention-based convolutional neural network architecture trained on 600,000 doctor notes in German. We compare two approaches, one that uses the full text of the medical notes and one that uses only a selected list of medical entities extracted from the text. These approaches achieve 79% and 66% precision, respectively, but on a confidence threshold of 0.6, precision increases to 85% and 75%, respectively. In addition, a method to detect warning symptoms is implemented to render the classification task transparent from a medical perspective. The method is based on the learning of attention scores and a method of automatic validation using the same data.

Syntax-based Transfer Learning for the Task of Biomedical Relation Extraction
Joël Legrand | Yannick Toussaint | Chedy Raïssi | Adrien Coulet

Transfer learning (TL) proposes to enhance machine learning performance on a problem, by reusing labeled data originally designed for a related problem. In particular, domain adaptation consists, for a specific task, in reusing training data developed for the same task but a distinct domain. This is particularly relevant to the applications of deep learning in Natural Language Processing, because those usually require large annotated corpora that may not exist for the targeted domain, but exist for side domains. In this paper, we experiment with TL for the task of Relation Extraction (RE) from biomedical texts, using the TreeLSTM model. We empirically show the impact of TreeLSTM alone and with domain adaptation by obtaining better performances than the state of the art on two biomedical RE tasks and equal performances for two others, for which few annotated data are available. Furthermore, we propose an analysis of the role that syntactic features may play in TL for RE.

In-domain Context-aware Token Embeddings Improve Biomedical Named Entity Recognition
Golnar Sheikhshabbafghi | Inanc Birol | Anoop Sarkar

The rapidly expanding volume of publications in the biomedical domain makes timely evaluation of the latest literature increasingly difficult. That, along with a push for automated evaluation of clinical reports, presents opportunities for effective natural language processing methods. In this study we target the problem of named entity recognition, where texts are processed to annotate terms that are relevant for biomedical studies. Terms of interest in the domain include gene and protein names, and cell lines and types. Here we report on a pipeline built on Embeddings from Language Models (ELMo) and a deep learning package for natural language processing (AllenNLP). We trained context-aware token embeddings on a dataset of biomedical papers using ELMo, and incorporated these embeddings in the LSTM-CRF model used by AllenNLP for named entity recognition. We show these representations improve named entity recognition for different types of biomedical named entities. We also achieve a new state of the art in gene mention detection on the BioCreative II gene mention shared task.

Self-training improves Recurrent Neural Networks performance for Temporal Relation Extraction
Chen Lin | Timothy Miller | Dmitriy Dligach | Hadi Amiri | Steven Bethard | Guergana Savova

Neural network models are oftentimes restricted by limited labeled instances and resort to advanced architectures and features for cutting-edge performance. We propose to build a recurrent neural network with multiple semantically heterogeneous embeddings within a self-training framework. Our framework makes use of labeled, unlabeled, and social media data, operates on basic features, and is scalable and generalizable. With this method, we establish state-of-the-art results, both in-domain and cross-domain, for a clinical temporal relation extraction task.

Listwise temporal ordering of events in clinical notes
Serena Jeblee | Graeme Hirst

We present metrics for listwise temporal ordering of events in clinical notes, as well as a baseline listwise temporal ranking model that generates a timeline of events that can be used in downstream medical natural language processing tasks.

Evaluation of a Sequence Tagging Tool for Biomedical Texts
Julien Tourille | Matthieu Doutreligne | Olivier Ferret | Aurélie Névéol | Nicolas Paris | Xavier Tannier

Many applications in biomedical natural language processing rely on sequence tagging as an initial step to perform more complex analysis. To support text analysis in the biomedical domain, we introduce Yet Another SEquence Tagger (YASET), an open-source multi-purpose sequence tagger that implements state-of-the-art deep learning algorithms for sequence tagging. Herein, we evaluate YASET on part-of-speech tagging and named entity recognition in a variety of text genres including articles from the biomedical literature in English and clinical narratives in French. To further characterize performance, we report distributions over 30 runs and different sizes of training datasets. YASET provides state-of-the-art performance on the CoNLL 2003 NER dataset (F1=0.87), MEDPOST corpus (F1=0.97), MERLoT corpus (F1=0.99) and NCBI disease corpus (F1=0.81). We believe that YASET is a versatile and efficient tool that can be used for sequence tagging in biomedical and clinical texts.