Hebrew Automatic Vocalization


The resources and code presented here were developed during the M.Sc. research of Eran Tomer, advised by Prof. Michael Elhadad, at the Department of Computer Science, Ben-Gurion University.
This work could not have been completed without the kind help and advice of Prof. Jacob Ben Tulila, Dr. Meni Adler and Dr. Yoav Goldberg.

The Research

Written Hebrew has an exceptionally high ambiguity rate because the vast majority of modern Hebrew texts omit vowel markings. This research aims to build tools and resources that can enhance many Hebrew NLP systems, such as automatic vowel restoration and text-to-speech. The work focuses on verbs, which have the most complex morphology and vocalization mechanisms of all parts of speech in Hebrew.
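The ambiguity described above can be illustrated with a minimal Python sketch. It is not part of the project's code: the function name `strip_niqqud` and the three example words are chosen here for illustration only. The sketch removes Unicode combining marks from three differently vocalized words and shows that they collapse to the same unvocalized letter sequence.

```python
import unicodedata


def strip_niqqud(word: str) -> str:
    """Drop nonspacing combining marks (vowel points, dagesh, cantillation)."""
    return "".join(ch for ch in word if unicodedata.category(ch) != "Mn")


# Three vocalized words, written with escapes so the code points are explicit.
# All three share the letters samekh (05E1), pe (05E4), resh (05E8).
VOCALIZED = {
    "sefer (book)": "\u05e1\u05b5\u05e4\u05b6\u05e8",
    "safar (he counted)": "\u05e1\u05b8\u05e4\u05b7\u05e8",
    "sapar (barber)": "\u05e1\u05b7\u05e4\u05b8\u05bc\u05e8",
}

# Map each gloss to its unvocalized spelling.
UNVOCALIZED = {gloss: strip_niqqud(word) for gloss, word in VOCALIZED.items()}

for gloss, bare in UNVOCALIZED.items():
    print(f"{gloss}: {bare}")
```

Every entry prints the same three letters, so a reader (or a program) seeing only the unvocalized text has to recover the intended vowels from context.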

During this work we developed a comprehensive generation mechanism that produces vocalized, morphologically tagged Hebrew verbs given a non-vocalized verb in base form and an indication of which pattern the verb follows. Next, we addressed the task of automatically segmenting vocalized words into syllables, implementing two algorithms that represent different approaches. Finally, we implemented two classification mechanisms: the first classifies base forms into their corresponding Hebrew patterns, and the second classifies base forms into a corresponding, specific inflection table.

Enter a non-vocalized Hebrew verb: