Modern Hebrew Dependency Treebank v.1

Download

hebdeptb.tgz

What

This is the Modern Hebrew Dependency Treebank, which was created for and used in my PhD thesis.

Description

The Dependency Treebank is converted from MILA's Hebrew Constituency Treebank (v2). The guidelines and conversion process are discussed in Chapter 5 of the thesis. The tagset is detailed in Chapter 4. The part-of-speech tags are automatically assigned using the Ben-Gurion Hebrew POS-tagger by Meni Adler.

Data Format

The files are based on the CoNLL format.

The columns are:

      INDEX WORD LEMMA(empty) CPOS POS MORPH PARENT PREL empty empty
      
where CPOS and POS are the coarse-grained and fine-grained part-of-speech tags, MORPH is a |-separated list of extra features, PARENT is the index of the parent token, and PREL is the label of the parent relation. The LEMMA field is always empty.
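
As an illustration of the column layout above, here is a minimal Python sketch that reads one line into named fields. It is not part of the distribution. It assumes that columns are tab-separated and that an empty field is either an empty string or "_", neither of which is stated above. The name parse_line, the Word type, and the tag and feature values in the sample line are hypothetical.

```python
from typing import List, NamedTuple


class Word(NamedTuple):
    index: str        # kept as a string, e.g. "2.2"
    word: str
    cpos: str         # coarse-grained part-of-speech tag
    pos: str          # fine-grained part-of-speech tag
    morph: List[str]  # extra features, split on "|"
    parent: str       # index of the parent, same notation as index
    prel: str         # label of the parent relation


def parse_line(line: str) -> Word:
    """Parse one treebank line (assumed tab-separated) into a Word."""
    cols = line.rstrip("\n").split("\t")
    # The LEMMA column is always empty, and the last two columns are unused.
    index, word, _lemma, cpos, pos, morph, parent, prel = cols[:8]
    # Assumption: an empty feature list is written as "" or "_".
    feats = [] if morph in ("", "_") else morph.split("|")
    return Word(index, word, cpos, pos, feats, parent, prel)


# Hypothetical sample line; the tags and features are made up.
sample = "2.2\tבית\t_\tNN\tNN\tgen=M|num=S\t2.0\tpobj\t_\t_"
w = parse_line(sample)
```

INDEX and PARENT are kept as strings here so that the original notation in the file is preserved exactly.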

One notable difference from the standard CoNLL format is that the index and parent columns are not integers but floating-point numbers.

The floating-point representation is needed in order to encode token-segmentation information. An index of the form 3.0 indicates that the word is the first part of the third token, and 3.2 indicates that the word is the third part of the third token. For example, the sentence יצאתי מהבית is represented as:

      1.0 יצאתי
      2.0 מ
      2.1 ה
      2.2 בית
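
The index notation in the example above can be read back into a (token, part) pair with a few lines of Python. This is only a sketch: the names split_index and group_by_token are hypothetical. The index is split as a string rather than converted to a float, so the token number and the part number stay exact integers.

```python
from itertools import groupby
from typing import List, Tuple


def split_index(index: str) -> Tuple[int, int]:
    """Split an index such as "3.2" into (token number, part number)."""
    token, _, part = index.partition(".")
    # A missing fractional part is treated as part 0 (an assumption).
    return int(token), int(part or 0)


def group_by_token(rows: List[Tuple[str, str]]) -> List[List[str]]:
    """Group consecutive (index, word) rows that belong to the same token."""
    return [
        [word for _, word in group]
        for _, group in groupby(rows, key=lambda r: split_index(r[0])[0])
    ]


# The example sentence from above.
rows = [("1.0", "יצאתי"), ("2.0", "מ"), ("2.1", "ה"), ("2.2", "בית")]
grouped = group_by_token(rows)
```

For the example sentence, grouped holds one list with the single part of token 1 and one list with the three parts of token 2.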
      

The script renumber.py can be used to convert the data to integer-based indexing. This makes the data compatible with the CoNLL format, at the cost of losing the original token-segmentation information. (usage: renumber.py < in > out)
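
To make the idea of integer-based renumbering concrete, here is a small Python sketch. It is not the renumber.py script shipped with the treebank and may differ from it. It assumes that words are numbered 1, 2, 3, ... in sentence order and that any parent index that does not belong to a word in the sentence (for example a root marker such as 0.0) maps to 0. Both are assumptions, and the parent values in the example are made up.

```python
from typing import List, Tuple


def renumber(indices: List[str], parents: List[str]) -> Tuple[List[int], List[int]]:
    """Map the indices of one sentence to consecutive integers."""
    # Each original index gets its 1-based position in the sentence.
    mapping = {idx: i for i, idx in enumerate(indices, start=1)}
    # Assumption: a parent that is not a word in the sentence is the root (0).
    new_parents = [mapping.get(p, 0) for p in parents]
    return list(range(1, len(indices) + 1)), new_parents


# Indices from the example sentence; the parent values are hypothetical.
ids, heads = renumber(
    ["1.0", "2.0", "2.1", "2.2"],
    ["0.0", "1.0", "2.2", "2.0"],
)
```

After this mapping, the original indices 2.0, 2.1 and 2.2 become 2, 3 and 4, so the output no longer records that these three words came from the same token.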

Files

The Treebank files are in the data directory.

License

The files are distributed under the GPL v3 license.

Citing

If you use this software in an academic publication, it would be nice of you to cite my PhD thesis "Automatic Syntactic Processing of Modern Hebrew", Yoav Goldberg, 2011.


Yoav Goldberg, 2011.