BGU NLP - LemLDA: an LDA Package for Hebrew

Meni Adler, Raphael Cohen, Yoav Goldberg and Michael Elhadad

July 2011

Last Updated Jan 2019 - Ported to Python 3.7 (Michael Elhadad)

LemLDA is a topic modeling software package that implements Latent Dirichlet Allocation (LDA) for Hebrew.


  1. License
  2. Download
  3. Installation
  4. Topic and Corpus Exploration
  5. Test Data

The package is based on Heinrich's Java implementation of collapsed Gibbs sampling, extended with an extra variable to model the generative nature of lemmas in Hebrew.

LemLDA relies on preprocessing with the Morphological Disambiguator (Tagger) by Adler et al.

The tagger itself relies on the lexicon and morphological analyzer from the Mila Knowledge Center.

License

LemLDA is distributed under a GPL license.

    LemLDA is free software: you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation, either version 3 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    See the GNU General Public License.

Download

The following archive includes all the software needed to use LemLDA: LemLDA (125MB, about 500MB unzipped). The prerequisites are Java and Python (the package has been tested with Java 1.8 and Python 3.7).
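
Both tools must be reachable from the command line before running the steps below. A minimal sketch of such a check in Python follows. The helper name missing_tools is hypothetical and not part of LemLDA, and the check only confirms that an executable is found on the PATH, not that its version matches the tested ones.

```python
import shutil


def missing_tools(names, which=shutil.which):
    """Return the entries of `names` with no executable found on the PATH.

    `missing_tools` is a hypothetical helper, not part of the LemLDA package.
    `which` can be replaced by another lookup function, e.g. for testing.
    """
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools(["java", "python"])
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("java and python are both on the PATH.")
```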

Installation

These instructions were tested on Linux and Windows.
  1. Install LemLDA
    wget http://www.cs.bgu.ac.il/~elhadad/nlpproj/lemlda.zip
    unzip lemlda.zip
    
  2. Tag your corpus
    % cd tagger
    % tag 
    
    Usage:    tag [corpusIn] [corpusOut]
    Example:  tag ..\corpus ..\corpusTagged
    -
        This will iterate over text files found in [corpusIn]
        and perform morphological tagging on each file.
        The tagged files will be placed in [corpusOut] in the format
        of one word per line followed by a number which is a compact encoding as
        a bitmask of the tag.  The tag includes the Part of speech of the word,
        its segmentation (in case of prefixes like ha- and bklm, mshvklv) and
        its morphological properties (number, gender, tense, person etc).
        The tagged files are encoded in UTF8.
    -
    java must be in your path.
    
    % tag ..\corpus ..\corpusTagged
    
  3. NOTE ON TOKENIZATION: the tag command may fail on Hebrew text that contains parentheses and non-Hebrew characters. The failure shows up as errors of this type:
    java.lang.StringIndexOutOfBoundsException: String index out of range: -83
    
    If this happens, pre-process your text with the hebtokenizer tokenizer before running the tag command.
  4. Generate the LDA input: At this stage, we filter the words in the corpus based on their part of speech and add vocalization (nikud) to the words to ease topic-based document exploration.
    % dot
    Usage:    dot [corpusTagged] [corpusTaggedDotted] [model:-token/-word/-lemma]
    Example:  dot ..\corpusTagged ..\corpusTaggedDotted -lemma
    -
        This will iterate over text files found in [corpusTagged]
        and add vocalization (nikud) to each tagged word in each file.
        The dotted files will be placed in [corpusTaggedDotted] in the format
        of one word per line followed by its vocalized version.
        The dotted files are encoded in UTF8.
    -
        The files in corpusTagged must have been produced by tag.
    -
        The model can be either: -token, -word or -lemma
        If your text can be reasonably well handled by the tagger use -lemma, 
        otherwise choose -token
    -
    java must be in your path.
    
    % dot ..\corpusTagged ..\corpusTaggedDotted -token
    
  5. Build LDA model
    % cd lemlda-0.1
    
    [Linux]
    % make train TOPICS=<topic number> DOCS_DIR=<in directory> DOCS_PAT=<input files pattern> OUT=<out>
    
    [Windows]
    % train 
    
    Usage:    train [topics] [docsDir] [outPath]
    Example:  train 5 ..\corpusTaggedDotted ..\lda\model
    - java and python must be in your path.
    - topics is the desired number of topics to be learned by the LDA algorithm.
    - docsDir must be the full path of the directory that contains the documents, encoded in UTF8.
      train will iterate over files with extension txt.
      The files must have been tagged and dotted using tag and dot.
    - outPath must be a full path to a file (for example model).
      train will generate several files in this folder whose name starts with the prefix given.
      for example model.dat, model.doc etc.
    -
    
    % train 5 ..\corpusTaggedDotted ..\lda\model
    
    
    where (for the Linux make command):
    1. <topic number> Number of topics to be extracted from the text
    2. <in directory> The input directory generated by dot above
    3. <input files pattern> The pattern of the input file names - default *.txt
    4. <out> Path to an output model (default 'model')
    Other parameters are defined in the makefile or bat file. In particular, you can adjust alpha, beta and gamma: see the comments in those files.
  6. Launch a Web server to explore topics:
    
    [Linux]
    % python python/result_browser.py 8080 <model>
    
    [Windows]
    % browse 
    Usage:    browse [modelPath] [port]
    Example:  browse ..\lda\model 8080
    -
        This launches an HTTP server that allows you to browse the topics
        and the documents in your corpus by topic.
    -
    - modelPath refers to a model created by "train".
      It must point to a path with the prefix name of the model.
    - port is the TCP port on which the web server will listen.
    -
    python must be in your path.
    -
    % browse ..\lda\model 8080
    Consult http://localhost:8080/topics/
    http://0.0.0.0:8080/
    
(Note that the trailing "/" must be present in the URL.)
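
The Windows commands in the steps above chain together through their output directories: the output of tag is the input of dot, and the output of dot is the input of train. The following Python sketch only assembles the four argument lists so that this chaining is explicit. It does not run anything. The function name pipeline_commands and the directory names created under work_dir are assumptions chosen for illustration. Only the argument order is taken from the usage messages. Each script still has to be launched from the directory indicated in the corresponding step, and whether the output directories must exist beforehand is not covered here.

```python
import os


def pipeline_commands(corpus_dir, work_dir, topics, model="-lemma", port=8080):
    """Build argument lists for the tag, dot, train and browse scripts.

    Hypothetical helper, not part of LemLDA. The sub-directory names under
    `work_dir` are illustrative assumptions. Full paths are used throughout
    because train asks for full paths.
    """
    if model not in ("-token", "-word", "-lemma"):
        raise ValueError("model must be -token, -word or -lemma")
    corpus = os.path.abspath(corpus_dir)
    work = os.path.abspath(work_dir)
    tagged = os.path.join(work, "corpusTagged")
    dotted = os.path.join(work, "corpusTaggedDotted")
    model_path = os.path.join(work, "lda", "model")
    return [
        ["tag", corpus, tagged],                       # tag [corpusIn] [corpusOut]
        ["dot", tagged, dotted, model],                # dot [in] [out] [model]
        ["train", str(topics), dotted, model_path],    # train [topics] [docsDir] [outPath]
        ["browse", model_path, str(port)],             # browse [modelPath] [port]
    ]


if __name__ == "__main__":
    for argv in pipeline_commands("corpus", "work", topics=5):
        print(" ".join(argv))
```

Passing an unsupported model flag raises ValueError, which catches a mistyped option before any long-running step is started.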

Topic and Corpus Exploration

Once the server is running, open http://localhost:8080/topics/ in a Web browser to explore the generated topics. To retrieve the documents relevant to a query, append the query term to http://localhost:8080/query/ as in this query on דם: http://localhost:8080/query/דם
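
When the query URL is built by a script rather than typed into a browser, the Hebrew term usually has to be percent-encoded as UTF-8, which is what browsers do behind the scenes. The sketch below shows one way to do this in Python. The function name query_url is hypothetical, and it is an assumption that the LemLDA server decodes percent-encoded paths in the same way it handles requests coming from a browser.

```python
from urllib.parse import quote


def query_url(term, host="localhost", port=8080):
    """Return the query URL for `term`, percent-encoding the term as UTF-8.

    Hypothetical helper, not part of LemLDA.
    """
    return "http://{}:{}/query/{}".format(host, port, quote(term, safe=""))


if __name__ == "__main__":
    print(query_url("דם"))
```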

The topic exploration proceeds as follows:

  1. Browse the list of learned topics.

  2. List the documents in which a topic is activated.

  3. View the topics associated with a specific document.

    Note that when viewing the document, you only see the words that have been kept for indexing by the LDA model (when running the dot script above). In this example, we only see words tagged as nouns.
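
To check which words were kept for a given document, the corresponding file produced by dot can be inspected directly. The sketch below counts the first field of each line. It assumes that every non-empty line of a dotted file holds a word followed by its vocalized version, separated by whitespace. The exact separator is not documented here, so treat it as an assumption. The function name kept_words and the sample lines are made up for illustration.

```python
from collections import Counter


def kept_words(lines):
    """Count the words found in the first field of each non-empty line.

    Hypothetical helper, not part of LemLDA. Assumes whitespace-separated
    fields, with the word first and its vocalized version after it.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:
            counts[fields[0]] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample lines; a real run would read a dotted file instead,
    # e.g. open(path, encoding="utf-8").
    sample = ["דם דָּם", "לב לֵב", "", "דם דָּם"]
    for word, count in kept_words(sample).most_common():
        print(count, word)
```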

Test Data

The package contains about 50 small text files in Hebrew from the Infomed corpus in the eHealth domain.


Last modified 14 Jan 2019