Hebrew Paraphrase Identification



Abstract

Paraphrase identification is the ability to recognize that two text fragments that differ in wording convey the same meaning to a human listener. The task has proven challenging from both the linguistic and the computational points of view. The ability of a computational model to recognize paraphrases has implications for various fields of natural language processing, such as automatic text generation, automatic summarization, and automatic thesaurus construction.

In recent years, considerable research effort has been devoted to this task for English (as well as a few other languages). This research introduced a variety of techniques to model and identify paraphrases in natural text. For Hebrew, however, very little research has been done. This work provides a linguistic definition of Hebrew paraphrasing, in view of the language's unique features, such as word agglutination, high morphological ambiguity, and free word order, and under various linguistic schools.

Following this definition, a system for identifying paraphrases is proposed. The proposed system defines a new concept of parse tree matching (which was proven to be NP-complete) and applies it to identifying Hebrew paraphrases. The system is a variation of the English paraphrase detection algorithm proposed by Socher et al. (2011), which employs various learning methods, including neural networks, deep learning, and auto-encoding.

Several resources were created in order to complete the system. These include Hebrew word feature embeddings and a first Hebrew paraphrase corpus, which was collected and manually tagged. These resources, along with the code, models, and results, can be downloaded from this page.

This system was developed during the M.Sc. research of Gabriel Stanovsky, advised by Prof. Michael Elhadad, at the Department of Computer Science, Ben-Gurion University.

Results

Hebrew Word Embeddings
As part of the system, a mapping from a Hebrew dictionary onto a real-valued vector space was needed. Such a resource is available for English (Turian, 2010) and was generated for Hebrew for the first time in the course of this work. To validate the applicability of these embeddings, they were tested as features in CRF-based POS tagging and were shown to improve performance when plugged into an existing system.
Tagged Hebrew Paraphrase Corpus
A Hebrew paraphrase corpus was collected from Hebrew news sites and hand-tagged by human judges. This corpus can serve as a Hebrew equivalent of the MSRP corpus and as a reference corpus for comparing paraphrase identification algorithms.
Paraphrase Identification System
A first Hebrew paraphrase identification system was created as part of this work. The model can be downloaded and used in other NLP applications, and other algorithms can be compared against the baseline results it has set.
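
For anyone comparing a new algorithm against this baseline, results on paraphrase corpora such as MSRP are commonly reported as accuracy and F1 over the binary paraphrase / non-paraphrase labels. The following is a minimal sketch of that computation in Python. It is not taken from the thesis code: the function name accuracy_and_f1 is hypothetical, and since the corpus file format is not described on this page, the sketch works on plain lists of labels (1 for paraphrase, 0 otherwise).

```python
def accuracy_and_f1(gold, predicted):
    """Return (accuracy, F1) for binary labels; 1 = paraphrase, 0 = not.

    Hypothetical helper for illustration, not part of the released code.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    true_pos = sum(1 for g, p in pairs if g and p)
    false_pos = sum(1 for g, p in pairs if not g and p)
    false_neg = sum(1 for g, p in pairs if g and not p)
    correct = sum(1 for g, p in pairs if bool(g) == bool(p))

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, f1


# Toy labels, invented for this example only.
gold_labels = [1, 1, 0, 0, 1]
system_labels = [1, 0, 0, 1, 1]
acc, f1 = accuracy_and_f1(gold_labels, system_labels)
```

Reporting both numbers matters when the corpus is unbalanced, because a system that labels every pair as a paraphrase can reach a high accuracy without being useful.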

Download Thesis

Download Resources

Hebrew word feature embeddings (Space separated, encoded in CP1255)
Paraphrase Corpus
- Corpus
  - README
  - Download
- Tagging
CRF embedding experiment for improvement of POS tagging (run using the CRFSuite package)
- README
- Download files
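
The word embeddings file listed above is described as space separated and encoded in CP1255. A minimal sketch of reading such a file in Python follows. It assumes that each line holds a word followed by its vector components, which is not confirmed on this page, so check the file itself before relying on it. The function names load_embeddings and load_embeddings_file are hypothetical.

```python
import io


def load_embeddings(stream):
    """Parse lines of the assumed form '<word> <v1> <v2> ...' into a dict."""
    embeddings = {}
    for line in stream:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        embeddings[parts[0]] = [float(value) for value in parts[1:]]
    return embeddings


def load_embeddings_file(path, encoding="cp1255"):
    """Open a CP1255-encoded embeddings file and parse it."""
    with open(path, encoding=encoding) as handle:
        return load_embeddings(handle)


# In-memory demo with two invented entries, encoded as CP1255 bytes.
word_a = "\u05e9\u05dc\u05d5\u05dd"  # Hebrew word "shalom"
word_b = "\u05e2\u05d5\u05dc\u05dd"  # Hebrew word "olam"
raw = (word_a + " 0.1 -0.2 0.3\n" + word_b + " 0.4 0.5 -0.6\n").encode("cp1255")
sample = load_embeddings(io.TextIOWrapper(io.BytesIO(raw), encoding="cp1255"))
```

Passing the encoding explicitly matters here: opening the file with the platform default (often UTF-8) would either fail or produce garbled Hebrew keys.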

Download Code and Models