Hebrew Paraphrase Identification
Paraphrase identification is the ability to
recognize that two text fragments that differ in wording
convey the same meaning to a human listener.
The task of paraphrase identification has proven challenging
from both the linguistic and the
computational points of view.
The ability of a computational model to recognize paraphrases
has implications for various fields of natural language processing,
such as automatic text generation, automatic summarization,
and automatic thesaurus construction.
In recent years, considerable research effort has been devoted to
the task in English
(as well as in a few other languages).
This research has introduced a variety of techniques to
model and identify paraphrases within natural text.
For Hebrew, however, very little research has been done.
This work provides a linguistic definition of Hebrew paraphrasing
in view of the language's unique features, such as
word agglutination, high morphological ambiguity,
and free word order, and under various linguistic schools.
Following this definition, a system for identifying paraphrases
is proposed.
The proposed system defines a new concept of parse tree matching (which was proven to be NP-complete),
and applies it to the identification of Hebrew paraphrases.
The system is a variation of an
English paraphrase detection algorithm
proposed by Socher et al. (2011), which employs several learning
methods, including neural networks, deep learning,
and auto-encoding.
Several resources were created in order to complete the system.
These include Hebrew word embeddings
and the first Hebrew paraphrase corpus, which was collected
and manually tagged. These resources, along with the code, models and results,
can be downloaded from this page.
This system was developed as part of the M.Sc. research of Gabriel Stanovsky, advised by
Prof. Michael Elhadad at the
Department of Computer Science,
Ben-Gurion University.
As part of the system, a mapping of the Hebrew dictionary onto a real-valued vector space was needed. Such a resource is
available for English (Turian, 2010), and was generated
for Hebrew for the first time in the course of this work. To validate the applicability of these embeddings,
they were tested on POS tagging with a CRF classifier, and were shown to improve performance when plugged into an existing system.
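To make the idea of plugging embeddings into an existing tagger concrete, here is a minimal sketch of how a word-to-vector mapping could be turned into per-token features. It is an illustration only, not the code released with this work: the transliterated words, the 3-dimensional vectors, the zero-vector fallback for unknown words, and the names `embed` and `token_features` are all assumptions made for this example.

```python
from typing import Dict, List

# Toy embeddings for transliterated Hebrew words. Both the words and the
# values are invented for illustration; real embeddings are learned from text.
EMBEDDINGS: Dict[str, List[float]] = {
    "yeled": [0.2, -0.1, 0.5],
    "halakh": [-0.3, 0.4, 0.1],
    "habayta": [0.0, 0.3, -0.2],
}
DIM = 3
UNKNOWN: List[float] = [0.0] * DIM  # assumed fallback for out-of-vocabulary words


def embed(word: str) -> List[float]:
    """Return the vector for `word`, or the fallback vector if it is unknown."""
    return EMBEDDINGS.get(word, UNKNOWN)


def token_features(sentence: List[str], i: int, window: int = 1) -> List[float]:
    """Concatenate the embeddings of token i and of its neighbours within `window`.

    Positions that fall outside the sentence are padded with the fallback vector,
    so every token gets a feature list of the same length.
    """
    features: List[float] = []
    for j in range(i - window, i + window + 1):
        if 0 <= j < len(sentence):
            features.extend(embed(sentence[j]))
        else:
            features.extend(UNKNOWN)
    return features


sentence = ["yeled", "halakh", "habayta"]
features = token_features(sentence, 1)  # features for the middle token
```

A sequence tagger that accepts real-valued features could consume such lists alongside its existing features; how the actual system integrates the embeddings is described in the thesis, not here.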
A Hebrew paraphrase corpus was collected from Hebrew news sites and hand-tagged by human judges. This corpus can serve as a Hebrew
equivalent of the MSRP corpus
and as a reference corpus for comparing other paraphrase identification algorithms.
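As a rough illustration of how a tagged corpus can serve as a reference, the sketch below scores a classifier's predictions against gold paraphrase labels. The choice of accuracy and F1 as metrics is an assumption (they are commonly reported for paraphrase corpora), and the function name `evaluate` and the 0/1 label encoding are hypothetical; this page does not specify the evaluation protocol used in the thesis.

```python
from typing import List, Tuple


def evaluate(gold: List[int], predicted: List[int]) -> Tuple[float, float]:
    """Return (accuracy, F1) for binary labels, where 1 means "paraphrase"."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")

    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g == 1 and p == 1)
    false_pos = sum(1 for g, p in pairs if g == 0 and p == 1)
    false_neg = sum(1 for g, p in pairs if g == 1 and p == 0)
    correct = sum(1 for g, p in pairs if g == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, f1


# Made-up labels for five sentence pairs, for illustration only.
gold_labels = [1, 1, 0, 0, 1]
system_output = [1, 0, 0, 1, 1]
accuracy, f1 = evaluate(gold_labels, system_output)
```

Any algorithm that outputs one label per sentence pair can be compared this way against the same gold labels.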
A first Hebrew paraphrase identification system was created as part of this work. The model can be downloaded and used
in other NLP applications, and other algorithms can be compared against the baseline results it has set.
Results
Download Thesis
Download Resources
Download Code and Models