Corpus Description
The wiki-segmentation corpus consists of sentences from the Hebrew Wikipedia in which some words in every sentence are tagged for word segmentation. The segmentation was obtained automatically by extracting information from Wikipedia links.
For details, see the short article.
Word segmentation refers to stating the prefixes of a word, if any exist, or stating that it has no prefixes. The letter H (the definite article) is not treated as a prefix.
The corpus contains 363,090 sentences in which 523,599 words are segmented. Note that only words that are possibly ambiguous are tagged.
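To illustrate the ambiguity, the Hebrew formative letters ו, ב, כ, ל, מ, and ש can attach to the start of a word as prefixes (per the corpus convention, ה, the definite article, is excluded). The Python sketch below naively enumerates candidate prefix splits of a token; it is illustrative only and is not the method used to build the corpus, which derives its segmentations from Wikipedia links.

    # Illustrative only: naively enumerate candidate prefix segmentations of
    # a Hebrew token. The corpus itself derives its segmentations from
    # Wikipedia links, not from enumeration like this.
    PREFIX_LETTERS = set("ובכלמש")  # ה (the definite article) is excluded

    def candidate_segmentations(word):
        """Yield (prefix, base) splits whose prefix uses only PREFIX_LETTERS."""
        yield "", word  # the no-prefix reading
        for i in range(1, len(word)):
            if word[i - 1] in PREFIX_LETTERS:
                yield word[:i], word[i:]
            else:
                break

    # Example: the token שבבית has the candidate readings
    # (no prefix), ש+בבית, and שב+בית, among others.
    for prefix, base in candidate_segmentations("שבבית"):
        print(prefix or "-", base)

A token like this is "possibly ambiguous" precisely because more than one such reading is plausible, which is why only such words carry segmentation tags.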
Tokenization was done using the tokenizer provided by MILA, the Knowledge Center for Processing Hebrew.
Format
The corpus is given in XML format. An XSD schema for the corpus is available.
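As a rough illustration of reading the XML, the sketch below uses Python's standard library. The element and attribute names used here (sentence, word, segmentation), like the file name, are assumptions made for the example, not the corpus's actual schema; consult the XSD for the real names.

    import xml.etree.ElementTree as ET

    # Minimal sketch of iterating over the corpus. The file name and the
    # element/attribute names ("sentence", "word", "segmentation") are
    # hypothetical placeholders; the XSD defines the actual schema.
    tree = ET.parse("wiki-segmentation.xml")
    for sentence in tree.getroot().iter("sentence"):
        for word in sentence.iter("word"):
            seg = word.get("segmentation")  # only tagged words carry this
            if seg is not None:
                print(word.text, seg)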
Copyrights
The texts are taken from Wikipedia. Permission is given to use the corpus in any way under the terms of the GFDL (GNU Free Documentation License).
Download
Other Hebrew Corpora
For other corpora in Hebrew, see: http://www.mila.cs.technion.ac.il/english/resources/corpora/index.html
Authors
Ziv Ben-Eliahu (Ben-Elia@cs.bgu.ac.il)
David Gabay (gabayd@cs.bgu.ac.il)