Morphological Tagging of Hebrew
Hebrew has complex morphology and ambiguous word formation rules.
The main source of complexity is the agglutination of prefix letters (known by the mnemonic "משה וכלב") to the words that follow them.
These letters correspond to the definite article, prepositions and conjunctions. When they are attached to the following word,
the segmentation of the word becomes ambiguous. For example, the word "ברק" can be read as the prefix "ב" followed by "רק" (be rak) or as a single word (barak).
This ambiguity makes it difficult to tokenize Hebrew text into a sequence of words.
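To make the ambiguity concrete, here is a minimal Python sketch that lists the purely orthographic segmentation candidates of a word. It is not part of the tagger: the function name `candidate_segmentations` is hypothetical, and the sketch consults no lexicon, so it over-generates. Choosing among such candidates in context is the job of the disambiguation system described below.

```python
# Letters that can act as attached prefixes (the mnemonic "משה וכלב").
PREFIX_LETTERS = set("משהוכלב")


def candidate_segmentations(word):
    """Return (prefix_string, remainder) pairs for each leading run of prefix letters.

    Purely orthographic: no lexicon is consulted, so many of the
    returned candidates are not valid Hebrew readings.
    """
    candidates = [("", word)]  # the unsegmented reading is always a candidate
    for i in range(1, len(word)):
        if word[i - 1] not in PREFIX_LETTERS:
            break  # the run of possible prefix letters has ended
        candidates.append((word[:i], word[i:]))
    return candidates
```

For "ברק" this yields two candidates: the whole word, and "ב" followed by "רק".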
Meni Adler's thesis, "Hebrew Morphological Disambiguation: An Unsupervised Stochastic Word-based Approach" (2007),
describes a method for disambiguating Hebrew morphology. This document describes how to use the Java implementation of that system.
An online demo of the system is available here. Experiment with it to get a feel for its performance.
Installing the system
Prerequisites:
- Make sure you have Java 6 or higher available.
- Make sure you have Python available.
- Download tagger.zip
- Download hebtokenizer.py to tokenize your input text into separate word tokens.
- Download bitmasks_to_tags.py to translate the bitmask codes returned by Meni Adler's tagger to readable tags.
Operation
All tools operate over Hebrew text encoded in UTF-8, as for example this Wikipedia article on David Ben-Gurion.
We first tokenize the raw text into space separated tokens:
python hebtokenizer.py < ben-gurion.txt > ben-gurion-tokenized.txt
This produces a tokenized file like this one.
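We next run the tagger on the tokenized file. The command below uses Windows syntax (";" as the classpath separator and ".\" for the current directory); on Linux or macOS, use ":" and "./" instead: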
java -Xmx1G -XX:MaxPermSize=256m -cp trove-2.0.2.jar;morphAnalyzer.jar;opennlp.jar;gnu.jar;chunker.jar;splitsvm.jar;duck1.jar;tagger.jar vohmm.application.BasicTagger .\ ben-gurion-tokenized.txt ben-gurion.pos -bWST
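If the tagger is driven from a script, the same invocation can be assembled in a platform-independent way. The sketch below is an illustration only: `build_tagger_command` is a hypothetical helper, the jar names, main class and `-bWST` flag are copied from the command above, and treating the first positional argument as the tagger directory is an assumption based on the `.\` in that command.

```python
import os

# Jar names copied from the command line above.
JARS = [
    "trove-2.0.2.jar", "morphAnalyzer.jar", "opennlp.jar", "gnu.jar",
    "chunker.jar", "splitsvm.jar", "duck1.jar", "tagger.jar",
]


def build_tagger_command(tagger_dir, tokenized_path, output_path):
    """Build the argument list for the java invocation shown above."""
    classpath = os.pathsep.join(JARS)  # ";" on Windows, ":" elsewhere
    return [
        "java", "-Xmx1G", "-XX:MaxPermSize=256m",
        "-cp", classpath,
        "vohmm.application.BasicTagger",
        # Assumption: the first positional argument is the tagger directory.
        tagger_dir, tokenized_path, output_path, "-bWST",
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` from the directory that contains the jars, for example with `os.curdir + os.sep` as `tagger_dir` to keep the trailing separator used in the original command.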
This produces the file ben-gurion.pos, which contains the analyzed text: each word is written on its own line, followed by its analysis.
The analysis is encoded as a compact bitmask that records each morphological feature of the word. This compact form is useful if you want to store analyses efficiently, for example in a database,
but it is not human-readable. To decode the bitmask codes, use the following script:
python bitmasks_to_tags.py ben-gurion.pos > ben-gurion.pos.analyzed
This produces a file with easy-to-read tag descriptions: ben-gurion.pos.analyzed.
The tags encode the part of speech of each word, as well as its prefixes, suffixes and morphological features (number, person, gender, construct state for nouns, etc.).
In particular, prefixes are encoded as follows:
למניין PREPOSITION:NOUN-M,S,CONST:
שבהם REL-SUBCONJ:PREPOSITION:PRONOMINAL-M,P,3
which indicates this segmentation:
ל מניין PREPOSITION:NOUN-M,S,CONST:
ש ב הם REL-SUBCONJ:PREPOSITION:PRONOMINAL-M,P,3
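The correspondence between segments and tag fields can be sketched in Python as follows. This is an illustration inferred only from the two examples above, not code from bitmasks_to_tags.py: the function name `align_segments` is hypothetical, and discarding the empty field left by a trailing ":" is an assumption.

```python
def align_segments(segmented_word, tag_string):
    """Pair each space-separated segment with its colon-separated tag field."""
    segments = segmented_word.split()
    # Assumption: empty fields (e.g. after a trailing ":") carry no segment.
    fields = [field for field in tag_string.split(":") if field]
    if len(segments) != len(fields):
        raise ValueError("segment and tag counts differ")
    return list(zip(segments, fields))
```

Applied to the first example, it pairs "ל" with PREPOSITION and "מניין" with NOUN-M,S,CONST.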
Efficiency and Adaptation Notes
The tagger starts by loading its statistical models into RAM.
This loading step is slow (it can take up to 2 minutes),
so load the tagger only once and apply it in batch to many texts, rather than starting it separately for each text.
The Python scripts should be adapted to the domain. For example, when tagging Wikipedia text, it is useful to treat Wikitext annotations such as "[[" and "]]" as single tokens.
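One possible adaptation for Wikitext is a small preprocessing step applied to each line before tokenization. The sketch below is not part of hebtokenizer.py, and the function name `isolate_wiki_brackets` is hypothetical. It pads the link delimiters with spaces, so a tokenizer that splits on whitespace will see them as separate tokens.

```python
import re

# Matches the Wikitext link delimiters "[[" and "]]".
WIKI_BRACKETS = re.compile(r"\[\[|\]\]")


def isolate_wiki_brackets(line):
    """Surround each "[[" and "]]" in one line of text with spaces."""
    padded = WIKI_BRACKETS.sub(lambda m: " " + m.group(0) + " ", line)
    return " ".join(padded.split())  # collapse runs of whitespace
```

Because the last step collapses all whitespace, the function is meant to be called on one line at a time so that line breaks in the input are preserved by the caller.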
Bitmask decoding in Java is demonstrated by the sample program NewDemo.java, included in the tagger.zip archive. The relevant excerpt follows:
    // Excerpt: anal, token and out are defined elsewhere in NewDemo.java (not shown here).
    AnalysisInterface bitmaskResolver =
        new BitmaskResolver(anal.getTag().getBitmask(), token.getOrigStr(), false);

    // Features of the word itself.
    out.println("\tPOS: " + bitmaskResolver.getPOS());
    out.println("\tPOS type: " + bitmaskResolver.getPOSType()); // the type of participle is "noun/adjective" or "verb"
    out.println("\tGender: " + bitmaskResolver.getGender());
    out.println("\tNumber: " + bitmaskResolver.getNumber());
    out.println("\tPerson: " + bitmaskResolver.getPerson());
    out.println("\tStatus: " + bitmaskResolver.getStatus());
    out.println("\tTense: " + bitmaskResolver.getTense());
    out.println("\tPolarity: " + bitmaskResolver.getPolarity());
    out.println("\tDefiniteness: " + bitmaskResolver.isDefinite());

    // Prefixes: print each prefix string with its decoded tag.
    if (bitmaskResolver.hasPrefix()) {
        out.print("\tPrefixes: ");
        List<AffixInterface> prefixes = bitmaskResolver.getPrefixes();
        if (prefixes != null) {
            for (AffixInterface prefix : prefixes) {
                out.print(prefix.getStr() + " " + Tag.toString(prefix.getBitmask(), true) + " ");
            }
        }
        out.print("\n");
    } else {
        out.println("\tPrefixes: None");
    }

    // Suffix: function and agreement features of a pronominal suffix, if any.
    if (bitmaskResolver.hasSuffix()) {
        out.println("\tSuffix Function: " + bitmaskResolver.getSuffixFunction());
        out.println("\tSuffix Gender: " + bitmaskResolver.getSuffixGender());
        out.println("\tSuffix Number: " + bitmaskResolver.getSuffixNumber());
        out.println("\tSuffix Person: " + bitmaskResolver.getSuffixPerson());
    } else {
        out.println("\tSuffix: None");
    }
Last modified July 26th, 2012