Hebrew Named Entity Recognition


HebrewNER is a named entity recognition (NER) package for Hebrew. Named entity recognition is a form of information extraction that classifies every word in a document as a person name, organization, location, date, time, monetary value, percentage, or "none of the above". It serves as a foundation for more complex information extraction tasks.
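
To make the task definition concrete, here is a minimal Java sketch of per-word labeling. It is not HebrewNER's API: the names NerLabelingSketch, EntityType, LabeledToken, wordsOfType and example are all hypothetical, and the labels on the sample sentence are assigned by hand, not produced by the system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class NerLabelingSketch {
    // The categories named in the task description; NONE stands for "none of the above".
    enum EntityType { PERSON, ORGANIZATION, LOCATION, DATE, TIME, MONEY, PERCENT, NONE }

    // One word of the document paired with the category assigned to it.
    static final class LabeledToken {
        final String word;
        final EntityType type;

        LabeledToken(String word, EntityType type) {
            this.word = word;
            this.type = type;
        }

        @Override
        public String toString() {
            return word + "/" + type;
        }
    }

    // Hypothetical helper: collects the words tagged with the given category.
    static List<String> wordsOfType(List<LabeledToken> tokens, EntityType type) {
        List<String> out = new ArrayList<>();
        for (LabeledToken t : tokens) {
            if (t.type == type) {
                out.add(t.word);
            }
        }
        return out;
    }

    // A hand-labeled sentence (illustrative data, not system output).
    static List<LabeledToken> example() {
        return Arrays.asList(
            new LabeledToken("Dana", EntityType.PERSON),
            new LabeledToken("joined", EntityType.NONE),
            new LabeledToken("Example", EntityType.ORGANIZATION),
            new LabeledToken("Bank", EntityType.ORGANIZATION),
            new LabeledToken("in", EntityType.NONE),
            new LabeledToken("January", EntityType.DATE));
    }

    public static void main(String[] args) {
        System.out.println(example());
        System.out.println(wordsOfType(example(), EntityType.ORGANIZATION));
    }
}
```

Every word receives exactly one label, so a multi-word entity such as an organization name appears as a run of adjacent words that share the same category.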

In European languages the problem seems, at least on the surface, less complicated, since most named entities start with a capital letter. Hebrew has no capital letters, so this cue is unavailable. Hebrew also presents a great deal of morphological ambiguity, which makes automatic processing difficult. Other properties of the Hebrew language and culture may further influence named entity recognition, for example "smichut" (the construct state, in which adjacent nouns form a compound) and agglutination (prefixes such as prepositions and conjunctions attach directly to the following word).
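
The missing capitalization cue can be shown with a small Java sketch. It is a naive baseline written for illustration only, not part of HebrewNER; the class name CapitalizationCue, the method capitalizedTokens and both sample sentences are assumptions made here.

```java
import java.util.ArrayList;
import java.util.List;

class CapitalizationCue {
    // Returns the whitespace-separated tokens whose first character is uppercase.
    static List<String> capitalizedTokens(String sentence) {
        List<String> hits = new ArrayList<>();
        for (String token : sentence.trim().split("\\s+")) {
            if (!token.isEmpty() && Character.isUpperCase(token.codePointAt(0))) {
                hits.add(token);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // English: the cue finds both names (plus the sentence-initial word).
        System.out.println(capitalizedTokens("Yesterday David visited Haifa"));
        // prints [Yesterday, David, Haifa]

        // Hebrew rendering of the same sentence: the script has no letter case.
        System.out.println(capitalizedTokens("אתמול דוד ביקר בחיפה"));
        // prints []
    }
}
```

In the Hebrew sentence the last token also carries the attached preposition "in" as a prefix on the place name, so the location is not even a separate whitespace-delimited word.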

This system was developed as part of the M.Sc. research of Naama Ben-Mordecai, advised by Dr. Michael Elhadad, at the Department of Computer Science, Ben-Gurion University. The system is based on statistical models: a Maximum Entropy model and a Hidden Markov Model (HMM). It is written in Java and includes several third-party components, such as a sentence detector and a part-of-speech tagger.