Hebrew OCR with Nikud

Adi Oz and Vered Shani

Dec 2012

Presentation on the Project

Introduction

This site presents our BSc project. Our goal is to write a program that takes as input a Hebrew text file (without Nikud) and returns a Hebrew text file with the correct Nikud. Our program uses a Hidden Markov Model (HMM) to achieve this, where the model's states are Nikud characters (Shva, Patah, Kamatz, …) and the model's observations are Hebrew letters (Alef, Beit, Gimel, …).
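As a sketch of the idea, here is a toy Viterbi decoder over a made-up three-mark, three-letter model. All names and probabilities below are invented placeholders for illustration, not values trained from our data:

```python
# Toy HMM for Nikud restoration: hidden states are Nikud marks,
# observations are Hebrew letters. All probabilities are made up.

STATES = ["shva", "patah", "kamatz"]      # toy subset of Nikud marks

start_p = {"shva": 0.2, "patah": 0.5, "kamatz": 0.3}
trans_p = {
    "shva":   {"shva": 0.1, "patah": 0.5, "kamatz": 0.4},
    "patah":  {"shva": 0.4, "patah": 0.2, "kamatz": 0.4},
    "kamatz": {"shva": 0.5, "patah": 0.3, "kamatz": 0.2},
}
emit_p = {
    "shva":   {"alef": 0.2, "bet": 0.5, "gimel": 0.3},
    "patah":  {"alef": 0.5, "bet": 0.2, "gimel": 0.3},
    "kamatz": {"alef": 0.4, "bet": 0.3, "gimel": 0.3},
}

def viterbi(observations):
    """Return the most probable Nikud sequence for a letter sequence."""
    # V[t][state] = (best probability so far, best path ending in state)
    V = [{s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in STATES}]
    for obs in observations[1:]:
        row = {}
        for s in STATES:
            prob, path = max(
                (V[-1][prev][0] * trans_p[prev][s] * emit_p[s][obs],
                 V[-1][prev][1]) for prev in STATES)
            row[s] = (prob, path + [s])
        V.append(row)
    best_prob, best_path = max(V[-1].values())
    return best_path

print(viterbi(["alef", "bet", "gimel"]))
```

The real program would estimate the transition and emission probabilities from the Nikud-annotated text produced by the trained OCR.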

We used a scan of an old “Britannica Lanoar”, given to us by “Matah”, as training data for creating the HMM, which raised the need for a Hebrew OCR that also recognizes Nikud. We had trouble finding a Hebrew OCR with reasonable performance, so we decided to train a Hebrew OCR to recognize Nikud ourselves. We used a well-known tool called Tesseract-OCR, which can be trained for new fonts and new languages, and used our scanned “Britannica” as training data for the OCR.

After finishing the training of a Hebrew OCR with Nikud, we obtained a large amount of Hebrew text with Nikud, which can be used to train the HMM for our main program.

Installation & running instructions

1. MoshPyTT

MoshPyTT is a program that opens and displays Tesseract training files (an image and its box file) side by side, so that the box files can be corrected. This is a vital step in training Tesseract on new text.

It is written using Python and PyGTK so it can be run on different platforms.

Autotrain.py is a utility script that takes box-file/image pairs and distills them down to a traineddata file, which is otherwise a long and tedious process.

To use MoshPyTT, a 32-bit Python interpreter is required.

Next, download PyGTK:

Finally, download the latest moshpytt-x.x.tar.gz file and decompress it (for using Autotrain.py see Tesseract download instructions).

Command Line:
MoshPyTT:        python moshpytt.py [-i optional image name]

autotrain.py: From the directory containing the boxfile/image pairs, run: 
				"...../moshpytt/autotrain.py"
This is because the Tesseract tools only produce the files in the current working directory. You may need to run with elevated privileges in order for the script to move the generated files to the "tessdata" directory if that directory is in a protected area (eg /usr/).

2. Tesseract: The OCR Engine

Download: tesseract

You need at least one of these or Tesseract will not work. Note that tesseract-x.xx.tar.gz unpacks to the tesseract-ocr directory, while tesseract-x.xx.<lang>.tar.gz unpacks to the tessdata directory, which belongs inside your tesseract-ocr directory. It is therefore best to download them into your tesseract-x.xx directory, so you can use “unpack here” or the equivalent. You can unpack as many of the language packs as you like.

For Windows, use the installer (for 3.00 and above). Tesseract is a library with a command-line interface. For a GUI, check the AddOns wiki page.

For Unix: you have to tell Tesseract, through a standard Unix mechanism, where to find its data directory. You must either run:

$ ./autogen.sh
$ ./configure
$ make
$ make install
$ sudo ldconfig
to move the data files to the standard place, or:
$ export TESSDATA_PREFIX="directory in which your tessdata resides/" 

More information on tesseract download: README

Command-Line:
tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]

Note: After training and creating the “lang.traineddata” file (lang is the language abbreviation for the trained language), the “lang.traineddata” file should be copied to the “tessdata” directory inside the “tesseract-ocr” directory created when Tesseract was installed. You may need to run with elevated privileges in order to move the generated files to the "tessdata" directory if that directory is in a protected area (eg /usr/).

Running Example: input picture (scanned page, not shown in this text version).

Output:

לָפוּמְבְּדִיתָא, שֶׁהָיְתָה מֶרְכָּז יְהוּדִי חָשׁוּב מִיָמָיו שֶׁל הַתַנָא מַר
שְׁמוּאֵל, בַּר-הַפְּ לֶ גְתָא שֻׁ ל רַבִּי יְהוּדָה הַנָשִׂיא עוֹרֵךְ הַמִשְׁנָה.
מַר שְׁמוּאֵל הָיָה נֶאֱמָן לַפַּרְסִים וּפָסַק כִּי בְּעִנְיָנִים אֶזְרָחִיִים
מְחַיְבִים חֻקֵי הַמְדִינָה שֶׁבָּהּ יוֹשְׁבִים הַיְהוּדִים מֵ מָשׁ כְּאִלוּ הָיוּ
חֻקֵי הַתוֹרָה. הוּא קָבַע אֶת הַכְּלָל: "דִינָא דְמַלְכוּתָא - דִינָא".
בִּתְקוּפָה זוֹ, לְאַחַר חֲתִימַת הַמִשְׁנָה עַל ידֵי יְהוּדָה הַנָשְׂיא,
פָּעֲלוּ בִּישִׁיבוֹת בָּבֶל הָאָמוֹרָאִים חַכְמֵי הַתַלְמוּדִ.

3. Creating 'box' files

Tesseract needs a 'box' file to go with each training image. The box file is a text file that lists the characters in the training image, in order, one per line, with the coordinates of the bounding box around each character.
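As a sketch, a box-file line can be parsed as follows. The helper name and the sample line are invented for illustration; a Tesseract 3.x box line holds the glyph followed by left, bottom, right, and top pixel coordinates, plus a trailing page number:

```python
# Parse one line of a Tesseract box file:
#   <glyph> <left> <bottom> <right> <top> [<page>]
# (the page field was added in Tesseract 3.x; older files omit it).

def parse_box_line(line):
    parts = line.strip().split(" ")
    char = parts[0]
    left, bottom, right, top = map(int, parts[1:5])
    page = int(parts[5]) if len(parts) > 5 else 0
    return {"char": char, "box": (left, bottom, right, top), "page": page}

sample = "ש 24 108 52 140 0"   # an invented example line
entry = parse_box_line(sample)
print(entry["char"], entry["box"])
```

Such a parser is useful for sanity-checking box files in bulk before feeding them back into the training step.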

Run Tesseract on each of your training images using this command line:

$ tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox
  e.g.:
$ tesseract eng.timesitalic.exp0.tif eng.timesitalic.exp0 batch.nochop makebox
Now the hard part: you have to edit the file [lang].[fontname].exp[num].box and put the correct UTF-8 character at the start of each line, in place of any incorrect character put there by Tesseract.

In order to make this process easier we can use MoshPyTT. You simply run it using the following:

$ ./moshpytt.py -i <image-name>
  e.g.
$ ./moshpytt.py -i eng.example.001.tif
If you place your cursor into the text pane, the box described by that line will be drawn on top of the image in the image pane. If you select a range of boxes with your mouse, all the selected boxes will be shown.

You can edit boxes in several ways:

  1. Direct editing: You can directly edit the boxfile's text in the text pane, and the new box will be shown immediately in the image pane. This is the only way to edit the text in the box, but you can also edit the box's dimensions this way.
  2. Keyboard shortcuts for editing the dimensions of boxes. You can stretch, shrink, move, merge, split and delete boxes using shortcut keys. The shortcut keys are arranged to use the number-pad for ease of remembering, but the "normal" number keys are also usable if you don't have a numpad.
    1. Directions: 8 - Up, 4 - Left, 6 - Right, 2 - Down, 5 - All
    2. Ctrl-direction: Stretch box in direction
    3. Ctrl-shift-direction: Shrink box in direction
    4. Alt-direction: Move box in direction
    5. Ctrl-0: Delete selected boxes
    6. Ctrl-1: Merge selected boxes
    7. Ctrl-3: Split selected boxes
    8. Ctrl-Z: Undo change
    9. Ctrl-Y: Redo change
    The merge, split and delete options are also accessible by the "Edit" menu.
  3. Text attributes: You can add text attributes (bold, italic and underline) by clicking the checkboxes in the bottom left corner.
  4. Finding text: You can use the text box in the bottom right corner to find a given string in the boxfile, which can be very useful when looking for a box with an incorrect string which is upsetting the training process.
  5. Saving boxfiles: You can save boxfiles by pressing Ctrl-S or choosing "File>Save". If you use "Save As", the boxfile will be saved with a new name, and the image file will be copied to match it. If you make changes and close the program, you will be prompted to save your changes. moshPyTT autosaves your work regularly; if the program exits improperly, the autosave file will remain, and you can continue near where you left off. If moshPyTT exits properly, whether or not you choose to save your changes, the autosave file will be removed.

4. Example Box File

5. Training Tesseract

After creating box files as instructed above, we have boxfile/image pairs and two options:
  1. Use autotrain.py downloaded with MoshPyTT (see Installation & running instructions).
  2. Do manually the process that autotrain.py automates. The advantage of this option is that at the beginning of the training, Tesseract makes many mistakes when trying to identify the characters in the boxing stage (using the 'makebox' command, see Creating 'box' files). Therefore, when creating all box files at once, we will have many mistakes to repair manually before the training process. In contrast, when making a box file and training Tesseract for each image separately, Tesseract will “learn” from its mistakes, and every box file will be more accurate than its predecessors.
The stages are:
  1. Generate Training Images: The first step is to determine the full character set to be used, and prepare a text or word processor file containing a set of examples. The most important points to bear in mind when creating a training file are:
    1. Make sure there are a minimum number of samples of each character. 10 is good, but 5 is OK for rare characters.
    2. There should be more samples of the more frequent characters - at least 20.
    3. Don't make the mistake of grouping all the non-letters together. Make the text more realistic. For example,
      The quick brown fox jumps over the lazy dog. 0123456789 !@#$%^&(),.{}<>/?   
      		
      is terrible. Much better is:
      The (quick) brown {fox} jumps! over the $3,456.78 <lazy> #90 dog & duck/goose, as 12.5% of E-mail 
      from spammer@website.com is spam?
      		
      This gives the textline finding code a much better chance of getting sensible baseline metrics for the special characters.
    4. It is ABSOLUTELY VITAL to space out the text a bit when printing, so up the inter-character and inter-line spacing in your word processor.
    5. The training data should be grouped by font. Ideally, all samples of a single font should go in a single tiff file, so the total training data in a single font may be many pages and many 10s of thousands of characters, allowing training for large-character-set languages.
    6. There is no need to train with multiple sizes. 10 point will do.
    7. DO NOT MIX FONTS IN AN IMAGE FILE.
    8. It is best to create a mix of fonts and styles (but in separate files), including italic and bold.
    9. Clarification for large amounts of training data: The 32 images limit is for the number of FONTS.
    10. Training image names must be of the form [lang].font-name.exp[i].tif, with lang being an ISO 639 three-letter abbreviation for your language, and i a natural number (0, 1, 2, …).
  2. Create 'box' files (see Creating 'box' files above). You'll now have a file called font-name.exp0.box, and you'll need to open it in a box-file editor (MoshPyTT). Once you've opened it, go through every letter, and make sure it was detected correctly. If a letter was skipped, add it as a row to the box file. Similarly, if two letters were detected as one, break them up into two lines.
  3. Feed the box file back into tesseract. Run the command:
    	$ tesseract [lang].font-name.exp[i].tif [lang].font-name.exp[i] nobatch box.train.stderr
    	
    This will generate the [lang].font-name.exp[i].tr file which contains the features of each character of the training page. [lang].[fontname].exp[num].txt will also be written with a single newline and no text. There is no need to edit the content of the [lang].[fontname].exp[num].tr file.
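The sample-count guidelines above (at least 5-10 occurrences per character, around 20 for frequent ones) can be checked with a short script before printing a training page. The helper name `undersampled` is our invention; the example string is taken from the guideline text:

```python
# Count character occurrences in a candidate training text and report
# characters that fall below the recommended minimum sample count.
from collections import Counter

def undersampled(text, minimum=10):
    """Return a dict of characters appearing fewer than `minimum` times."""
    counts = Counter(c for c in text if not c.isspace())
    return {c: n for c, n in counts.items() if n < minimum}

sample_text = ("The (quick) brown {fox} jumps! over the $3,456.78 <lazy> "
               "#90 dog & duck/goose, as 12.5% of E-mail from "
               "spammer@website.com is spam?")
rare = undersampled(sample_text, minimum=5)
print(sorted(rare))   # characters that need more samples on the page
```

Running such a check before printing saves a round of rescanning and re-boxing when a rare character turns out to be undersampled.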


6. Compute the Character Set

Tesseract needs to know the set of possible characters it can output. To generate the unicharset data file, use the unicharset_extractor program on the box files generated so far:
$ unicharset_extractor [lang].font-name.exp0.box [lang].font-name.exp1.box ...


7. The font_properties File

The purpose of this text file is to provide font style information that will appear in the output when the font is recognized. Please make sure that the file does not have BOM markers. Each line of the font_properties file is formatted as follows:
  <fontname> <italic> <bold> <fixed> <serif> <fraktur> 

where <fontname> is a string naming the font (no spaces allowed!), and <italic>, <bold>, <fixed>, <serif> and <fraktur> are all simple 0 or 1 flags indicating whether the font has the named property. When running mftraining, each .tr filename must match an entry in the font_properties file, or mftraining will abort. The name of the .tr file may be either fontname.tr or [lang].[fontname].exp[num].tr.
Example:
  font_properties file:
  timesitalic 1 0 0 1 0
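A malformed font_properties line (a space in the font name, a missing flag) makes mftraining abort, so it is worth validating the file. This is a small sketch of such a check; the function name is our invention:

```python
# Validate a font_properties line of the form:
#   <fontname> <italic> <bold> <fixed> <serif> <fraktur>
# The font name may not contain spaces, and every flag must be 0 or 1.

def valid_font_properties_line(line):
    parts = line.split()
    return len(parts) == 6 and all(flag in ("0", "1") for flag in parts[1:])

print(valid_font_properties_line("timesitalic 1 0 0 1 0"))   # the example above
print(valid_font_properties_line("times italic 1 0 0 1 0"))  # space in name
```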

Putting it all Together

That is all there is to it! All you need to do now is collect together all the files (shapetable, normproto, inttemp, pffmtable, plus the unicharset from step 6) and rename them with a lang. prefix, where lang is the 3-letter code for your language taken from http://en.wikipedia.org/wiki/List_of_ISO_639-2_codes, and then run combine_tessdata on them as follows:
combine_tessdata lang. 
NOTE: Don't forget the dot at the end! The resulting lang.traineddata goes in your tessdata directory. Tesseract can then recognize text in your language (in theory) with the following command:
tesseract image.tif output -l lang 
Actually, you can use any string you like for the language code, but if you want anybody else to be able to use it easily, ISO 639 is the way to go. Additional material on training Tesseract:
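The collect-and-rename step above can be sketched as follows. The helper name is our invention, and the mapping is only computed (no files are touched), so the listed file set is an assumption matching the text plus the unicharset from step 6:

```python
# Map each clustering output file to its 'lang.'-prefixed name, as
# expected by combine_tessdata (e.g. shapetable -> heb.shapetable).

OUTPUT_FILES = ["shapetable", "normproto", "inttemp", "pffmtable", "unicharset"]

def prefixed_names(lang):
    """Return {original name: lang-prefixed name} for the training outputs."""
    return {name: "%s.%s" % (lang, name) for name in OUTPUT_FILES}

print(prefixed_names("heb"))
```

After renaming the real files accordingly, `combine_tessdata heb.` would pack them into heb.traineddata.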

The Tesseract Model

Static Character Classifier

This is how Tesseract models character recognition:
  1. Features: During training, segments of a polygonal approximation [2] are used for features, but in recognition, features of a small, fixed length (in normalized units) are extracted from the outline and matched many-to-one against the clustered prototype features of the training data. The features extracted from the unknown are 3-dimensional (x, y position, angle), with typically 50-100 features in a character, and the prototype features are 4-dimensional (x, y position, angle, length), with typically 10-20 features in a prototype configuration.
  2. Classification: Classification proceeds as a two-step process. In the first step, a class pruner creates a shortlist of character classes that the unknown might match. Each feature fetches, from a coarsely quantized 3-dimensional lookup table, a bit-vector of classes that it might match, and the bit-vectors are summed over all the features. The classes with the highest counts (after correcting for expected number of features) become the short-list for the next step. Each feature of the unknown looks up a bit vector of prototypes of the given class that it might match, and then the actual similarity between them is computed. Each prototype character class is represented by a logical sum-of-product expression with each term called a configuration, so the distance calculation process keeps a record of the total similarity evidence of each feature in each configuration, as well as of each prototype. The best combined distance, which is calculated from the summed feature and prototype evidences, is the best over all the stored configurations of the class.
  3. Training Data: In fact, the classifier was trained on a mere 20 samples of 94 characters from 8 fonts in a single size, but with 4 attributes (normal, bold, italic, bold italic), making a total of 60160 training samples. Tesseract contains relatively little linguistic analysis. Whenever the word recognition module is considering a new segmentation, the linguistic module (mis-named the permuter) chooses the best available word string in each of the following categories:
    Top frequent word, Top dictionary word, Top numeric word, Top UPPER case word, Top lower case word (with optional initial upper), Top classifier choice word.
    The final decision for a given segmentation is simply the word with the lowest total distance rating, where each of the above categories is multiplied by a different constant. Words from different segmentations may have different numbers of characters in them. It is hard to compare these words directly, even where a classifier claims to be producing probabilities, which Tesseract does not. This problem is solved in Tesseract by generating two numbers for each character classification. The first, called the confidence, is minus the normalized distance from the prototype. This enables it to be a “confidence” in the sense that greater numbers are better, but still a distance, as, the farther from zero, the greater the distance.

    The second output, called the rating, multiplies the normalized distance from the prototype by the total outline length in the unknown character. Ratings for characters within a word can be summed meaningfully, since the total outline length for all characters within a word is always the same.

    Tesseract also uses a font-sensitive adaptive classifier that is trained by the output of the static classifier; it is therefore commonly used to obtain greater discrimination within each document, where the number of fonts is limited.

    Tesseract does not employ a template classifier, but uses the same features and classifier as the static classifier. The only significant difference between the static classifier and the adaptive classifier, apart from the training data, is that the adaptive classifier uses isotropic baseline/x-height normalization, whereas the static classifier normalizes characters by the centroid (first moments) for position and second moments for anisotropic size normalization. The baseline/x-height normalization makes it easier to distinguish upper and lower case characters as well as improving immunity to noise specks. The main benefit of character moment normalization is removal of font aspect ratio and some degree of font stroke width. It also makes recognition of sub and superscripts simpler, but requires an additional classifier feature to distinguish some upper and lower case characters.
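The confidence/rating arithmetic described above can be illustrated with made-up numbers. The distances and outline lengths below are invented, not real Tesseract output; only the relationships (confidence is minus the normalized distance, rating scales the distance by outline length so it can be summed over a word) come from the text:

```python
# Per-character scores as described in the text:
#  - confidence: minus the normalized distance from the prototype
#    (larger is better, but still a distance).
#  - rating: normalized distance times the character's total outline
#    length, so ratings of the characters in a word sum meaningfully.

def confidence(norm_distance):
    return -norm_distance

def rating(norm_distance, outline_length):
    return norm_distance * outline_length

# Invented (distance, outline length) pairs for a three-letter word:
chars = [(0.12, 40.0), (0.30, 55.0), (0.08, 25.0)]
word_rating = sum(rating(d, l) for d, l in chars)
print(word_rating)
```

Summing ratings is what lets Tesseract compare candidate words from different segmentations, even when they contain different numbers of characters.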

The Tesseract Documentation provides more information on how Tesseract processes images.

Code Exploration

The following sequence of functions is particularly significant in Tesseract's execution:
File path Function name
Tesseract-ocr\api\baseapi.cpp TessBaseAPI::TesseractRectUNLV
Tesseract-ocr\api\baseapi.cpp TessBaseAPI::Recognize
Tesseract-ocr\ccmain\control.cpp Tesseract::recog_all_words
Tesseract-ocr\ccmain\control.cpp Tesseract::classify_word_pass1
Tesseract-ocr\ccmain\tessbox.cpp Tesseract::tess_segment_pass1
Tesseract-ocr\ccmain\tfacepp.cpp Tesseract::recog_word
Tesseract-ocr\ccmain\tfacepp.cpp Tesseract::recog_word_recursive
Tesseract-ocr\ccmain\tface.cpp Wordrec::cc_recog
Tesseract-ocr\wordrec\chopper.cpp Wordrec::chop_word_main
Tesseract-ocr\wordrec\wordclass.cpp Wordrec::classify_blob:
Attempt to recognize a blob as a character. The recognition rating will be returned (statistics).
Tesseract-ocr\classify\adaptmatch.cpp Classify::AdaptiveClassifier:
This routine calls the adaptive matcher which returns (in an array) the class id of each class matched. It also returns the number of classes matched. For each class matched it places the best rating found for that class into the Ratings array. Bad matches are then removed so that they don't need to be sorted. The remaining good matches are then sorted and converted to choices.
Tesseract-ocr\classify\adaptmatch.cpp Classify::DoAdaptiveMatch: This routine performs an adaptive classification. If we have not yet adapted to enough classes, a simple classification to the pre-trained templates is performed. Otherwise, we match the blob against the adapted templates. If the adapted templates do not match well, we try a match against the pre-trained templates. If an adapted template match is found, we do a match to any pre-trained templates which could be ambiguous. The results from all of these classifications are merged together.
More information: http://tesseract-ocr.repairfaq.org/.

Tesseract and Bidi

Tesseract - Bidi - set RTL as default: Tesseract 3.02 can order its output according to the language's direction, but we built our new training file without a dictionary, so if you don't use a script for reversing the output, it will be displayed LTR. We found more material about Tesseract & Bidi online: