
August 2, Thursday
10:00 – 11:00

Domain Adaptation for Medical Language Processing
Computer Science seminar
Lecturer : Raphael Cohen
Affiliation : CS, BGU
Location : 202/37
Host : Dr. Aryeh Kontorovich
The increasing availability of Electronic Health Record (EHR) data, and specifically of free-text patient notes, presents opportunities for the large-scale extraction of phenotypes, treatments and treatment outcomes. These data can make a significant contribution to basic science in many fields that require detailed phenotypic information, such as linking phenotypes to genetic variation. General-purpose text processing tools perform poorly on medical text, because medical language uses domain-specific words, word distributions and syntactic constructs. I present three domain-adaptation techniques that help adapt existing text processing tools to the medical domain, with specific attention to Hebrew:

a) Medical text uses many technical terms to refer to anatomy, biology or diseases. Most medical NLP tools rely on the UMLS, a medical vocabulary with over 300K unique terms and more than 1M synonyms. We present a method for automatically creating a Hebrew-UMLS lexicon. We show that acquiring this resource reduces the error rate on the NLP tasks of segmentation and part-of-speech (POS) tagging. We examine the impact of this improvement on a classification task: identifying patients with epilepsy from the notes of the Children's Neurology Unit at Soroka, improving F1 from 92% to 96%.
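
To make the lexicon idea concrete, here is a minimal, illustrative sketch of how a Hebrew-UMLS lexicon could be used for longest-match term tagging in a tokenized note. The file format, function names and greedy matching strategy are assumptions for illustration, not the actual resource or pipeline presented in the talk.

```python
# Minimal sketch (illustrative only): longest-match lookup of medical terms
# against an assumed tab-separated Hebrew-UMLS lexicon file.
from typing import Dict, List, Tuple

def load_lexicon(path: str) -> Dict[str, str]:
    """Read tab-separated lines of '<Hebrew term>\t<UMLS CUI>' (assumed format)."""
    lexicon = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, cui = line.rstrip("\n").split("\t")
            lexicon[term] = cui
    return lexicon

def tag_terms(tokens: List[str], lexicon: Dict[str, str],
              max_len: int = 5) -> List[Tuple[int, int, str]]:
    """Greedy longest-match: return (start, end, CUI) spans found in a note."""
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in lexicon:
                spans.append((i, i + n, lexicon[candidate]))
                i += n
                break
        else:
            i += 1
    return spans
```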

b) EHR text is characterized by a high level of copy-paste redundancy across the notes of the same patient. We show quantitatively that this type of redundant word distribution is highly prevalent in both Hebrew and English notes, and demonstrate empirically that this characteristic of medical note collections has a deleterious effect on classical NLP algorithms. We present a novel algorithm for topic modeling with Latent Dirichlet Allocation (LDA) that is immune to this redundancy noise. The algorithm also outperforms the baseline on clusters of redundant news reports.
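
The redundancy-immune LDA algorithm itself is not reproduced here; as a rough illustration of the problem setting, the following sketch shows a crude baseline mitigation: dropping verbatim-duplicated sentences before fitting a standard gensim LDA model. Data layout and names are assumptions.

```python
# Illustrative sketch: remove exact copy-paste duplicates before standard LDA.
# This is a blunt stand-in for the redundancy-aware model described in the talk.
import hashlib
from gensim import corpora, models

def dedup_sentences(notes):
    """notes: list of notes, each note a list of sentences (token lists).
    Keep only the first occurrence of each sentence across the collection."""
    seen, docs = set(), []
    for note in notes:
        kept = []
        for sent in note:
            key = hashlib.md5(" ".join(sent).encode("utf-8")).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.extend(sent)
        docs.append(kept)
    return docs

def fit_lda(notes, num_topics=20):
    docs = dedup_sentences(notes)
    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]
    return models.LdaModel(bow, num_topics=num_topics,
                           id2word=dictionary, passes=5)
```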

c) Syntactic dependency parsing is a useful technique for information extraction and is widely used in the biomedical domain. However, syntactic parsers suffer a major drop in accuracy when applied to a domain different from their training data. We present a method that uses selectional preferences, the affinity between word pairs or triplets, modeled with LDA over un-annotated data in the target domain, to significantly improve dependency parsing.
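
For intuition, here is a minimal sketch of one common way to model selectional preferences with LDA: each head word becomes a pseudo-document of the dependents observed for it in automatically parsed target-domain text, and the fitted topics yield a soft attachment score that a parser could use as an extra feature. This formulation and all names are assumptions for illustration, not necessarily the method presented in the talk.

```python
# Illustrative sketch: LDA over (head, dependent) pairs as a selectional
# preference model, scoring P(dep | head) = sum_z P(z | head) * P(dep | z).
from collections import defaultdict
from gensim import corpora, models

def selectional_preference_lda(dependency_pairs, num_topics=50):
    """dependency_pairs: (head, dependent) tuples harvested from auto-parsed,
    un-annotated target-domain text."""
    pseudo_docs = defaultdict(list)
    for head, dep in dependency_pairs:
        pseudo_docs[head].append(dep)          # head word -> bag of dependents
    heads = sorted(pseudo_docs)
    docs = [pseudo_docs[h] for h in heads]

    dictionary = corpora.Dictionary(docs)
    bow = [dictionary.doc2bow(d) for d in docs]
    lda = models.LdaModel(bow, num_topics=num_topics,
                          id2word=dictionary, passes=5)

    term_topics = lda.get_topics()             # shape: (num_topics, vocab size)
    head_topics = {h: dict(lda.get_document_topics(b, minimum_probability=0.0))
                   for h, b in zip(heads, bow)}

    def score(head, dep):
        """Soft preference of 'head' for 'dep'; 0.0 for unseen words."""
        dep_id = dictionary.token2id.get(dep)
        if head not in head_topics or dep_id is None:
            return 0.0
        return sum(p * term_topics[z, dep_id]
                   for z, p in head_topics[head].items())

    return score
```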

Taken together, these techniques provide an infrastructure that allows practical processing of medical text in Hebrew. We make available a first set of language resources for Hebrew medical text processing (treebank, lexicon, part-of-speech tagger, syntactic parser, topic modeling toolkit). This infrastructure has been applied to practical text mining of hospital patient reports.