This is the Modern Hebrew Dependency Treebank which was created and used in my PhD thesis.
The Dependency Treebank was converted from MILA's Hebrew Constituency Treebank (v2). The guidelines and conversion process are discussed in Chapter 5 of the thesis, and the tagset is detailed in Chapter 4. The part-of-speech tags were automatically assigned using the Ben-Gurion Hebrew POS-tagger by Meni Adler.
The files are based on the CoNLL format.
The columns are:
INDEX WORD LEMMA(empty) CPOS POS MORPH PARENT PREL empty empty
where CPOS and POS are coarse-grained and fine-grained parts-of-speech tags, MORPH is a |-separated list of extra features, PARENT is the index of the parent token and PREL is the label of the parent relation. The LEMMA field is always empty.
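As a rough illustration of the column layout, the sketch below reads one line into named fields. It assumes the columns are tab-separated and that empty fields are written as "_" or as an empty string; neither detail is stated here, so check the data files first. The names Row, parse_row, unused1 and unused2 are hypothetical, and the sample line uses placeholder values rather than real tags from the tagset.

```python
from collections import namedtuple

# Field names mirror the column list; "unused1"/"unused2" are hypothetical
# names for the two trailing empty columns.
Row = namedtuple(
    "Row", "index word lemma cpos pos morph parent prel unused1 unused2"
)


def parse_row(line, sep="\t"):
    """Parse one treebank line into a Row (assumes tab-separated columns)."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != len(Row._fields):
        raise ValueError("expected 10 columns, got %d" % len(fields))
    row = Row(*fields)
    # MORPH is a |-separated list; "" or "_" is taken to mean "no features"
    # (an assumption about how empty fields are written).
    features = [] if row.morph in ("", "_") else row.morph.split("|")
    return row._replace(morph=features)


# Placeholder values only: these are not real tags from the treebank's tagset.
sample = "\t".join(
    ["2.2", "בית", "_", "CPOS", "POS", "feat1|feat2", "2.0", "PREL", "_", "_"]
)
row = parse_row(sample)
print(row.index, row.morph, row.parent)
```

INDEX and PARENT are deliberately kept as strings here, so that no information in the original index is lost before it is interpreted.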
One notable difference from the standard CoNLL format is that the index and parent columns are not integers but floating point numbers.
The floating point representation is needed in order to encode token-segmentation information. An index of the form 3.0 indicates that the word is the first part of the third token, and 3.2 indicates that the word is the third part of the third token. For example, the sentence יצאתי מהבית will be represented as:
1.0 יצאתי
2.0 מ
2.1 ה
2.2 בית
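The index scheme can be decoded with a small helper. This is a minimal sketch: split_index and group_by_token are hypothetical names, and the input is just the INDEX and WORD columns of the example sentence above.

```python
def split_index(index):
    """Split an index such as "3.2" into (token number, part number)."""
    token, _, part = index.partition(".")
    return int(token), int(part or 0)


def group_by_token(words):
    """Group (index, word) pairs into {token number: [parts in order]}."""
    tokens = {}
    for index, word in words:
        token, _part = split_index(index)
        tokens.setdefault(token, []).append(word)
    return tokens


# The example sentence above, as (index, word) pairs.
words = [("1.0", "יצאתי"), ("2.0", "מ"), ("2.1", "ה"), ("2.2", "בית")]
tokens = group_by_token(words)
# tokens[1] holds one part; tokens[2] holds the three parts of the second token.
```

The index is split as a string on the "." rather than parsed as a float, so the part number is read as an integer and not as a decimal fraction.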
The script renumber.py can be used to convert the data to integer-based indexing, making the data compatible with the CoNLL format while losing the original token information. (usage: renumber.py < in > out)
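For readers who want to see what integer renumbering involves, here is a minimal sketch of the idea. It is not the bundled renumber.py, whose exact behavior may differ. It assumes each sentence is available as an ordered list of (INDEX, PARENT) string pairs, and that any parent value not matching a word in the sentence (for example a root marker) should become 0; both are assumptions, and the attachments in the example are made up for illustration.

```python
def renumber(rows):
    """Map floating point indices to consecutive integers within one sentence.

    rows: ordered list of (index, parent) string pairs, e.g. ("2.1", "2.2").
    Returns a list of (new_index, new_parent) integer pairs.
    """
    # Words are numbered 1..n in their original order.
    mapping = {index: new for new, (index, _parent) in enumerate(rows, start=1)}
    # Assumption: a parent that is not a word of this sentence is the root (0).
    return [(mapping[index], mapping.get(parent, 0)) for index, parent in rows]


# Hypothetical attachments for a four-word sentence; "0" marks the root here.
rows = [("1.0", "0"), ("2.0", "1.0"), ("2.1", "2.2"), ("2.2", "2.0")]
print(renumber(rows))  # [(1, 0), (2, 1), (3, 4), (4, 2)]
```

After this mapping, the three parts of the second token become ordinary words 2, 3 and 4, which is why the original token boundaries can no longer be recovered from the indices.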
The Treebank files are in the data directory.