Corpus Description

The wiki-segmentation corpus is a collection of sentences from the Hebrew Wikipedia in which some words in every sentence are tagged for word segmentation. The segmentation was obtained automatically by extracting information from Wikipedia links.

For details, see the accompanying short article.

Word segmentation here means identifying the prefixes of a word, if any exist, or stating that the word has no prefixes. H (the definite article) is not treated as a prefix.

The corpus contains 363,090 sentences, in which 523,599 words are segmented. Note that only potentially ambiguous words are tagged.
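The ambiguity mentioned above arises because several Hebrew letters can serve as prefixes, so a surface form often admits more than one split into prefixes and a base word. The sketch below is purely illustrative and is not the method used to build the corpus (which relied on Wikipedia links); the prefix-letter set and the enumeration rule are simplifying assumptions.

```python
# Illustrative sketch only: enumerate candidate prefix/base splits of a Hebrew
# word, assuming a simplified set of single-letter prefixes. The real corpus
# annotation was derived from Wikipedia links, not from this rule.
PREFIX_LETTERS = set("ובהשכלמ")  # common Hebrew prefix letters (assumption)

def candidate_segmentations(word):
    """Return all (prefix, base) splits whose prefix part uses only
    known prefix letters; the empty prefix (no segmentation) is included."""
    splits = []
    for i in range(len(word)):          # base must be non-empty
        prefix, base = word[:i], word[i:]
        if all(ch in PREFIX_LETTERS for ch in prefix):
            splits.append((prefix, base))
    return splits

# For a word like שבבית, several splits are possible, e.g. שב + בית,
# which is why an unambiguous gold segmentation is valuable.
```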

Tokenization was done using the tokenizer provided by MILA, the Knowledge Center for Processing Hebrew.

Format

The corpus is provided in XML format; an XSD schema for the corpus is also available.
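Since the corpus is plain XML, it can be read with a standard XML library. The sketch below uses Python's xml.etree.ElementTree; note that the element and attribute names (sentence, token, prefix, base) are illustrative assumptions only, as the actual names are defined by the corpus XSD and are not reproduced here.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the spirit of the corpus; the real element and
# attribute names are defined by the corpus XSD, not by this sketch.
SAMPLE = """<corpus>
  <sentence id="1">
    <token surface="שבבית" prefix="שב" base="בית"/>
    <token surface="ירושלים"/>
  </sentence>
</corpus>"""

def segmented_tokens(xml_text):
    """Return (surface, prefix, base) for every token that carries
    segmentation information; untagged tokens are skipped."""
    root = ET.fromstring(xml_text)
    out = []
    for tok in root.iter("token"):
        if "prefix" in tok.attrib:
            out.append((tok.get("surface"), tok.get("prefix"), tok.get("base")))
    return out
```

For the full corpus one would parse each downloaded file the same way, replacing the sample string with the file contents.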

Copyrights

The texts are taken from Wikipedia. Permission is granted to use the corpus in any way under the GNU Free Documentation License (GFDL).

Download

Corpora (zipped)

Other Hebrew Corpora

For other corpora in Hebrew, see: http://www.mila.cs.technion.ac.il/english/resources/corpora/index.html

Authors

Ziv Ben-Eliahu Ben-Elia@cs.bgu.ac.il

David Gabay gabayd@cs.bgu.ac.il