BGU NLP - Medical Hebrew Machine Learning Suite

Raphael Cohen and Michael Elhadad

January 2013

This is a package for applying machine learning to medical notes in Hebrew.


  1. License
  2. Download
  3. Usage

This package extends the Hebrew Medical Entity Linking / Language Processing tool.

To use this package you will need to first use the aforementioned Entity Linking tool to annotate all of the notes. The default setting creates decision trees based on labeling of the medical notes (for example "Epilepsy" vs. "Not Epilepsy").

License

BGU NLP - Medical Hebrew Machine Learning Suite is distributed under a GPL license.

    BGU NLP - Medical Hebrew Machine Learning Suite is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    See the GNU General Public License.

Download

BGU NLP - Medical Hebrew Machine Learning Suite.
The prerequisites are java 1.6 and python (it has been tested with Python 2.7) including scikit-learn.

Use


The classification suite requires 4 easy steps:
  • 1. Annotate the notes with our annotation service
  • 2. Extract Features - Part One. Produce features and do feature selection.
  • 3. Extract Features - Part Two (produce feature matrix from features list)
  • 4. Classify
    We provide a toy example for classifying "lung cancer" vs. "leukemia" physician answers from an online forum. Download here.

  • Generate the Entity, POS and Chunking input.
    This can be done in 3 ways:
    1. Use the  Entity Linking tool to annotate each note using the plain-text output
    2. Use it as a web service (we do not keep any logs) with the annotateMedicalWithWebService.py script.
    3. If you cannot use the service, contact us for a local installation / secure service password.
    
    The output of a file looks like this:
    אני	:PRONOUN-MF,S,1:	הוא	B-NP	O
    סובל	:VERB-M,S,A,BEINONI:	סבל	O	suffering-C0683278
    מכאב	PREPOSITION:NOUN-M,S,CONST:	כאב	B-NP	pain-C0030193
    ראש	:NOUN-M,S,ABS:	ראש	I-NP	head-C0018670
    ,	:PUNCUATION:	,	O	O
    קוצר נשימה	:TERM:	קוצר נשימה	O	shortness of breath-C0013404
    ומטופל	CONJ:NOUN-M,S,ABS:	מטופל	I-NP	patient-C0030705
    באקמול	PREPOSITION:NOUN-M,S,ABS:	אקמול	B-NP	O
    ,	:PUNCUATION:	,	O	O
    ונטולין	:PROPERNAME:	ונטולין	B-NP	salbutamol-C0001927
    וtylenol	CONJ:PROPERNAME:	tylenol	B-NP	rxconso-202433-tylenol
    .	:PUNCUATION:	.	O	O
    
    For our classification example:
    % python annotateMedicalWithWebService.py mlSample mlSampleAnnotated
    

  • Extract Features - Part One
    This script extracts the top 300 common noun phrases and terms from a directory of annotated notes (constructed above).
    % python phenotypeExtractorStep1.py [directory-name]
    
    The result is a file with possibe features named: [directory-name]-possibleFeatures
    For our classification example:
    % python phenotypeExtractorStep1.py mlSampleAnnotated
    
    Will create a feature file names mlSampleAnnotated-possibleFeatures
    The feature file can be manually edited to remove noisy or meaningless features.
    In our example we removed 28 noisy features such as physician name (you can return to this stage later once you have reviewed the result).
    *To change the number of extracted features edit the configuration file "phenotypeExtractor.cfg".

  • Extract Features - Part Two
    After editing the features file, we create a feature matrix:
    % python phenotypeExtractorStep2.py [directory-name]
    
    This creates 2 files: [directory-name]-featuresData and [directory-name]-featuresData-labels
    *This script assumes that class labels are contained in the file name, change "classLabel" in the configuration file (phenotypeExtractor.cfg).
    For our classification example:
    % python phenotypeExtractorStep2.py mlSampleAnnotated
    
    This creates: mlSampleAnnotated-featuresData and mlSampleAnnotated-featuresData-labels .
    Note that the word ריאה is used in phenotypeExtractor.cfg, this marks the lung cancer patients as the negative examples for the classification output.

  • Classify
    This script used Sci-Kit learn to classify your data.
    The default setting is decision tree.
    % python classifyWithSK.py [directory-name]-featuresData
    
    To see the resulting decision trees you can use "dot"
    %dot graph0 -Tpng -o graph0.png
    
    For our classification example:
    % python classifyWithSK.py mlSampleAnnotated-featuresData
                 precision    recall  f1-score   support
    
             -1       0.93      0.60      0.73        72
              1       0.75      0.97      0.85        92
    
    avg / total       0.83      0.80      0.80       164
    
       ... (the default is running the classifier on 10 different data splits) 
       
    			 precision    recall  f1-score   support
    
             -1       0.97      0.66      0.78        87
              1       0.71      0.97      0.82        77
    
    avg / total       0.85      0.80      0.80       164
    
    0.831200365875
    
    % dot graph9 -Tpng -o graph9.png
    
    0.831200365875 is the average F-measure.
    See the resulting decision tree here

    You can change the classifier to SVM in the configuration file (the "classifier" variable).
    Try doing this with our example - the F-measure should rise to 0.88 (note that no visualization is available for SVM).


    Last modified Jan 13, 2013