Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)

Eben Holderness, Antonio Jimeno Yepes, Alberto Lavelli, Anne-Lyse Minard, James Pustejovsky, Fabio Rinaldi (Editors)


Anthology ID:
D19-62
Month:
November
Year:
2019
Address:
Hong Kong
Venues:
EMNLP | Louhi | WS
SIG:
Publisher:
Association for Computational Linguistics
URL:
https://aclanthology.org/D19-62
DOI:
Bib Export formats:
BibTeX MODS XML EndNote
PDF:
https://aclanthology.org/D19-62.pdf

pdf bib
Proceedings of the Tenth International Workshop on Health Text Mining and Information Analysis (LOUHI 2019)
Eben Holderness | Antonio Jimeno Yepes | Alberto Lavelli | Anne-Lyse Minard | James Pustejovsky | Fabio Rinaldi

pdf bib
On the Effectiveness of the Pooling Methods for Biomedical Relation Extraction with Deep Learning
Tuan Ngo Nguyen | Franck Dernoncourt | Thien Huu Nguyen

Deep learning models have achieved state-of-the-art performances on many relation extraction datasets. A common element in these deep learning models involves the pooling mechanisms where a sequence of hidden vectors is aggregated to generate a single representation vector, serving as the features to perform prediction for RE. Unfortunately, the models in the literature tend to employ different strategies to perform pooling for RE, leading to the challenge to determine the best pooling mechanism for this problem, especially in the biomedical domain. In order to answer this question, in this work, we conduct a comprehensive study to evaluate the effectiveness of different pooling mechanisms for the deep learning models in biomedical RE. The experimental results suggest that dependency-based pooling is the best pooling strategy for RE in the biomedical domain, yielding the state-of-the-art performance on two benchmark datasets for this problem.

pdf bib
Experiments with ad hoc ambiguous abbreviation expansion
Agnieszka Mykowiecka | Malgorzata Marciniak

The paper addresses experiments to expand ad hoc ambiguous abbreviations in medical notes on the basis of morphologically annotated texts, without using additional domain resources. We work on Polish data but the described approaches can be used for other languages too. We test two methods to select candidates for word abbreviation expansions. The first one automatically selects all words in text which might be an expansion of an abbreviation according to the language rules. The second method uses clustering of abbreviation occurrences to select representative elements which are manually annotated to determine lists of potential expansions. We then train a classifier to assign expansions to abbreviations based on three training sets : automatically obtained, consisting of manual annotation, and concatenation of the two previous ones. The results obtained for the manually annotated training data significantly outperform automatically obtained training data. Adding the automatically obtained training data to the manually annotated data improves the results, in particular for less frequent abbreviations. In this context the proposed a priori data driven selection of possible extensions turned out to be crucial.

pdf bib
Extracting relevant information from physician-patient dialogues for automated clinical note taking
Serena Jeblee | Faiza Khan Khattak | Noah Crampton | Muhammad Mamdani | Frank Rudzicz

We present a system for automatically extracting pertinent medical information from dialogues between clinicians and patients. The system parses each dialogue and extracts entities such as medications and symptoms, using context to predict which entities are relevant. We also classify the primary diagnosis for each conversation. In addition, we extract topic information and identify relevant utterances. This serves as a baseline for a system that extracts information from dialogues and automatically generates a patient note, which can be reviewed and edited by the clinician.

pdf bib
What does the language of foods say about us?
Hoang Van | Ahmad Musa | Hang Chen | Stephen Kobourov | Mihai Surdeanu

In this work we investigate the signal contained in the language of food on social media. We experiment with a dataset of 24 million food-related tweets, and make several observations. First, thelanguageoffoodhaspredictive power. We are able to predict if states in the United States (US) are above the medianratesfortype2diabetesmellitus(T2DM), income, poverty, and education outperforming previous work by 418 %. Second, we investigate the effect of socioeconomic factors (income, poverty, and education) on predicting state-level T2DM rates. Socioeconomic factors do improve T2DM prediction, with the greatestimprovementcomingfrompovertyinformation(6%),but, importantly, thelanguage of food adds distinct information that is not captured by socioeconomics. Third, we analyze how the language of food has changed over a five-year period (2013 2017), which is indicative of the shift in eating habits in the US during that period. We find several food trends, and that the language of food is used differently by different groups such as differentgenders. Last, weprovideanonlinevisualization tool for real-time queries and semantic analysis.

pdf bib
Dreaddit : A Reddit Dataset for Stress Analysis in Social MediaDreaddit: A Reddit Dataset for Stress Analysis in Social Media
Elsbeth Turcan | Kathy McKeown

Stress is a nigh-universal human experience, particularly in the online world. While stress can be a motivator, too much stress is associated with many negative health outcomes, making its identification useful across a range of domains. However, existing computational research typically only studies stress in domains such as speech, or in short genres such as Twitter. We present Dreaddit, a new text corpus of lengthy multi-domain social media data for the identification of stress. Our dataset consists of 190 K posts from five different categories of Reddit communities ; we additionally label 3.5 K total segments taken from 3 K posts using Amazon Mechanical Turk. We present preliminary supervised learning methods for identifying stress, both neural and traditional, and analyze the complexity and diversity of the data and characteristics of each category.

pdf bib
Towards Understanding of Medical Randomized Controlled Trials by Conclusion Generation
Alexander Te-Wei Shieh | Yung-Sung Chuang | Shang-Yu Su | Yun-Nung Chen

Randomized controlled trials (RCTs) represent the paramount evidence of clinical medicine. Using machines to interpret the massive amount of RCTs has the potential of aiding clinical decision-making. We propose a RCT conclusion generation task from the PubMed 200k RCT sentence classification dataset to examine the effectiveness of sequence-to-sequence models on understanding RCTs. We first build a pointer-generator baseline model for conclusion generation. Then we fine-tune the state-of-the-art GPT-2 language model, which is pre-trained with general domain data, for this new medical domain task. Both automatic and human evaluation show that our GPT-2 fine-tuned models achieve improved quality and correctness in the generated conclusions compared to the baseline pointer-generator model. Further inspection points out the limitations of this current approach and future directions to explore.

pdf bib
Dilated LSTM with attention for Classification of Suicide NotesLSTM with attention for Classification of Suicide Notes
Annika M Schoene | George Lacey | Alexander P Turner | Nina Dethlefs

In this paper we present a dilated LSTM with attention mechanism for document-level classification of suicide notes, last statements and depressed notes. We achieve an accuracy of 87.34 % compared to competitive baselines of 80.35 % (Logistic Model Tree) and 82.27 % (Bi-directional LSTM with Attention). Furthermore, we provide an analysis of both the grammatical and thematic content of suicide notes, last statements and depressed notes. We find that the use of personal pronouns, cognitive processes and references to loved ones are most important. Finally, we show through visualisations of attention weights that the Dilated LSTM with attention is able to identify the same distinguishing features across documents as the linguistic analysis.

pdf bib
Writing habits and telltale neighbors : analyzing clinical concept usage patterns with sublanguage embeddings
Denis Newman-Griffis | Eric Fosler-Lussier

Natural language processing techniques are being applied to increasingly diverse types of electronic health records, and can benefit from in-depth understanding of the distinguishing characteristics of medical document types. We present a method for characterizing the usage patterns of clinical concepts among different document types, in order to capture semantic differences beyond the lexical level. By training concept embeddings on clinical documents of different types and measuring the differences in their nearest neighborhood structures, we are able to measure divergences in concept usage while correcting for noise in embedding learning. Experiments on the MIMIC-III corpus demonstrate that our approach captures clinically-relevant differences in concept usage and provides an intuitive way to explore semantic characteristics of clinical document collections.