Hebrew Paraphrase Identification


Abstract

Paraphrase identification is defined as the ability to recognize that two text fragments which differ in their articulation convey the same meaning to a human listener. The task has proven challenging from both the linguistic and the computational points of view. The ability of a computational model to recognize paraphrases has implications for various fields of natural language processing, such as automatic text generation, automatic summarization, and automatic thesaurus construction.

In recent years, considerable research effort has been devoted to this task in English (as well as in a few other languages). This research introduced a variety of techniques to model and identify paraphrases within natural text. In Hebrew, however, very little research has been done. This work provides a linguistic definition of Hebrew paraphrasing, in view of the language's unique features, such as word agglutination, high morphological ambiguity, and free word order, and under various linguistic schools.

Following this definition, a system for identifying paraphrases is proposed. The proposed system defines a new concept of parse tree matching (which was proven to be NP-complete) and applies it to identifying Hebrew paraphrases. The system is a variation of the English paraphrase detection algorithm proposed by Socher et al. (2011), which employs several learning methods, among them neural networks, deep learning, and autoencoding.
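To make the autoencoding step more concrete, the following is a minimal sketch of a recursive autoencoder over a binary parse tree, in the spirit of Socher et al. (2011). It is an illustration only and not the code of this system: the class name, the vector size, the English toy sentence, and the random untrained weights and embeddings are all assumptions made for the example. Each internal node is encoded from its two children, so every node of the tree ends up with a vector of the same size as a word vector.

```python
import math
import random

DIM = 4  # toy vector size (assumption; real embeddings are much larger)


def make_matrix(rows, cols, rng):
    """Small random matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class RecursiveAutoencoder:
    """Hypothetical, untrained recursive autoencoder (illustration only)."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        self.w_enc = make_matrix(dim, 2 * dim, rng)  # [left; right] -> parent
        self.w_dec = make_matrix(2 * dim, dim, rng)  # parent -> [left'; right']

    def encode(self, left, right):
        """Compress two child vectors into one parent vector."""
        return [math.tanh(z) for z in matvec(self.w_enc, left + right)]

    def reconstruction_error(self, left, right):
        """Squared error between the children and their decoded versions."""
        parent = self.encode(left, right)
        decoded = matvec(self.w_dec, parent)
        return sum((d - t) ** 2 for d, t in zip(decoded, left + right))

    def encode_tree(self, tree, embeddings):
        """Return the vectors of all nodes in post-order; the root is last.

        A tree is either a word (str) or a pair (left_subtree, right_subtree).
        """
        if isinstance(tree, str):
            return [embeddings[tree]]
        left_nodes = self.encode_tree(tree[0], embeddings)
        right_nodes = self.encode_tree(tree[1], embeddings)
        parent = self.encode(left_nodes[-1], right_nodes[-1])
        return left_nodes + right_nodes + [parent]


# Toy random embeddings standing in for real word feature embeddings.
rng = random.Random(1)
embeddings = {w: [rng.uniform(-1, 1) for _ in range(DIM)]
              for w in ("the", "cat", "sat")}

rae = RecursiveAutoencoder(DIM)
nodes = rae.encode_tree((("the", "cat"), "sat"), embeddings)
print(len(nodes), "node vectors; root:", nodes[-1])
```

Training would adjust the weights to minimize the reconstruction error over a corpus of parsed sentences. In Socher et al. (2011), the node vectors of two sentences are then compared pairwise and the resulting similarity matrix is pooled to a fixed size before classification; that stage is omitted here.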

Several resources were created in order to complete the system. These include Hebrew word feature embeddings and the collection and manual tagging of a first Hebrew paraphrase corpus. These resources, along with the code, models, and results, can be downloaded from this page.

This system was developed as part of the M.Sc. research of Gabriel Stanovsky, advised by Prof. Michael Elhadad, at the Department of Computer Science, Ben-Gurion University.


Possible Match

Results


Embedding's improvement


Download Thesis


Download Resources


Download Code and Models

Tree Autoencoding