Hebrew Dependency Parser
========================
[Yoav Goldberg, September 2011]

The technical details:
~~~~~~~~~~~~~~~~~~~~~~
The parser is based on my EasyFirst parsing algorithm (Goldberg and
Elhadad 2010a), enhanced with morphology-based features (Goldberg and
Elhadad 2010b). Labeling is performed with a perceptron-based
edge-labeler. The models are both accurate and relatively small,
thanks to training with a sparse-perceptron trick described in
(Goldberg and Elhadad 2011).

The parser is trained on the Hebrew Dependency Treebank, which I
converted from the Hebrew Constituency Treebank (Milea et al 2009).
The dependency treebank, the conversion process, the parsing model,
the morphology features and the edge-labeler are also described in my
PhD thesis (Goldberg 2011).

The parser relies on the output of a POS-tagger (bundled in this
package). I use the (great!) tagger by Meni Adler (Adler and Elhadad
2006, Goldberg et al 2008, Adler 2009).

The parser is trained on 5200 sentences from the Hebrew Dependency
Treebank. It achieves a labeled dependency accuracy of about 80% when
evaluated on a test set of about 500 sentences in the news domain.

Requirements/prerequisites:
~~~~~~~~~~~~~~~~~~~~~~~~~~~
In order to run the parsing package, you will need:

- python version 2.7 (though 2.6 should also work)
  (use "python -V" to verify)
- java version 6 or above; 'java' is expected to be on your path
  (needed for the tagger)
  (use "java -version" to verify)

In addition:

- For speed, the parser uses a compiled python module. I supply a
  compiled binary for python2.7 32-bit Linux (ELF) and python2.7
  32-bit Windows. For other platforms, you will have to compile the
  module yourself. See the instructions in code/ml/INSTALL . If you do
  not compile the extension module, the parser will still work, but
  much slower, as it will fall back to a pure-python module.

Memory:

- The tagger is great, but uses a lot of memory. You may want to run
  it on a separate machine (see below).
The code was developed and tested on Linux, but should work on OSX and
Windows as well.

Parsing text:
~~~~~~~~~~~~~
Assuming you have the prerequisites installed, you should be able to
parse Hebrew text by:

1) Running the tagger server:

      On Unix:    bash ./run_tagger_server.sh
      On Windows: run_tagger_server.cmd

   This will run an instance of the POS-tagger on your machine,
   listening on port 8080. (Edit the file to change the port.) The
   parser will communicate with the tagger as needed when parsing.

   Loading the tagger data into memory takes some time. Wait for it to
   load completely before running the parser (you should see a line
   similar to
   "2011-09-01 05:32:42.495::INFO: Started SocketConnector @ 0.0.0.0:8080").

   The tagger needs a lot of memory (~1.2GB on a 32-bit machine). You
   may want to run it on a dedicated machine (see below). If you
   cannot start with 1.2GB, you can reduce it to 800m; it will still
   run with most reasonably sized documents.

2) Once the tagger is running, parsing text by:

      python ./parse.py < sample.txt > parsed.txt

   The parser expects one sentence per line. Sentence-splitting is
   best handled at your application level. If you do not have your own
   sentence splitter, you can use the very basic one distributed with
   the parser:

      cat sample.txt | python code/utils/sentences.py | python parse.py > parsed.txt

   The parser does its own tokenization. If you want to perform your
   own tokenization, run it with the 'pretok' option:

      python ./parse.py pretok < tokenized.txt > parsed.txt

The input and output are in UTF-8 encoding. The output is in the CoNLL
2006 format.
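In the CoNLL 2006 (CoNLL-X) format, each line holds one token as
tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD,
DEPREL, plus two unused columns), and sentences are separated by blank
lines. A minimal sketch of reading such output downstream (the sample
tokens and labels here are illustrative placeholders, not actual
parser output, which contains Hebrew word forms and this parser's own
tag and label inventory):

```python
# Minimal reader for CoNLL 2006 (CoNLL-X) formatted output.
# The sample data below is illustrative; real output from parse.py
# contains UTF-8 Hebrew tokens.
from collections import namedtuple

Token = namedtuple("Token", "id form lemma cpos pos feats head deprel")

def read_conll(lines):
    """Yield one sentence (a list of Token) per blank-line-separated block."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:                    # a blank line ends the sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        # CoNLL-X has 10 columns; the last two (PHEAD, PDEPREL) are unused.
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              cols[4], cols[5], int(cols[6]), cols[7]))
    if sentence:
        yield sentence

sample = [
    "1\tJohn\tjohn\tNN\tNN\t_\t2\tsubj\t_\t_",
    "2\tsleeps\tsleep\tVB\tVB\t_\t0\tROOT\t_\t_",
    "",
]
for sent in read_conll(sample):
    for tok in sent:
        print("%s -> head %d (%s)" % (tok.form, tok.head, tok.deprel))
```

A HEAD value of 0 marks the root of the sentence; other HEAD values
are the ID of the governing token.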
To verify that things are working as intended, run:

   python ./parse.py < sample.txt > parsed.txt

The content of parsed.txt should be identical to expected_output.txt .

The complete pipe (including sentence splitting and tokenization) is
available as:

   On Unix:    parse.sh < sample.txt > sample_parsed.txt
   On Windows: parse.cmd < sample.txt > sample_parsed.txt

Viewing the output files:
~~~~~~~~~~~~~~~~~~~~~~~~~
I like to use the viewer by Federico Sangati:
http://staff.science.uva.nl/~fsangati/Viewers/DepTreeeViewer_17_06_10.jar

   On Windows: tree_viewer.cmd

Running the web-interface:
~~~~~~~~~~~~~~~~~~~~~~~~~~
You can run the web-interface using:

   webinterface/parse_server.py 8081

   On Windows: web_interface.cmd

You can then point your browser to http://localhost:8081 and parse
sentences there. (Advanced: you can programmatically POST your
sentences to "/parse" and get responses in plaintext.)

Running the tagger on a different machine:
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The tagger needs a lot of memory (~1.2GB on a 32-bit machine). You may
want to run it on a dedicated machine. To do that, copy
run_tagger_server.sh and the 'tagger' directory to the other machine,
and run ./run_tagger_server.sh there. Then, on the local machine
(where the parser runs), edit the file tagger/tag.py and change the
line

   f = urllib.urlopen("http://localhost:8080/bm", params)

to reflect the host and port of the tagger server.

Parser not working well for your domain?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The parser is trained on the news domain, and is most suitable for
parsing news-like text. It should work quite well for related genres
as well, but performance is expected to degrade the further you go
from the news domain. Depending on the source of the difficulty in
your specific domain, I might be able to suggest a solution. I am not
promising anything, but it won't hurt to drop me an email.
(yoav.goldberg / gmail)

Bug-reports:
~~~~~~~~~~~~
Email me.
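The web-interface's "/parse" endpoint described above can also be used
from code. A minimal client sketch follows; note that the form field
name "text" is an assumption (check webinterface/parse_server.py for
the field the server actually expects), and the import dance covers
both python 2 and python 3:

```python
# Sketch of a programmatic client for the web-interface's "/parse"
# endpoint. ASSUMPTION: the form field name "text" is a guess; consult
# webinterface/parse_server.py for the real field name.
try:                                    # python 2 (as used by the parser)
    from urllib import urlencode
    from urllib2 import urlopen
except ImportError:                     # python 3
    from urllib.parse import urlencode
    from urllib.request import urlopen

def build_request(sentence, host="localhost", port=8081):
    """Build (url, form-encoded body) for a POST to /parse."""
    body = urlencode({"text": sentence.encode("utf-8")})
    if not isinstance(body, bytes):     # python 3: urlopen wants bytes
        body = body.encode("ascii")
    return "http://%s:%d/parse" % (host, port), body

def parse_remote(sentence, host="localhost", port=8081):
    """POST one sentence; the server replies with plaintext."""
    url, body = build_request(sentence, host, port)
    return urlopen(url, body).read().decode("utf-8")  # data makes it a POST
```

With the server running on the default port, parse_remote(u"...") would
return the parsed sentence as a plaintext string.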
License and copyright:
~~~~~~~~~~~~~~~~~~~~~~
The tagger is the work of Meni Adler (meni.adler / gmail.com). It is
distributed under the GPL v3. The tagger is bundled with statistics
and lexical information derived from the MILA lexicon. The MILA
lexicon data belongs to the Knowledge Center for Processing Hebrew,
and is distributed under the GPL.

The web interface uses bottle.py, which is under the MIT license. The
tree-drawing code is based on "free source" code by Emilio Cortegoso
Lobato (2006).

The rest of the code was developed by Yoav Goldberg (copyright
2010-2011), and is distributed under the GPL v3 license. It is free
for academic use (please contact Yoav Goldberg regarding commercial
usage).

Citing:
~~~~~~~
If you use the parser in your academic work, it would be nice of you
to cite my PhD thesis.

References:
~~~~~~~~~~~
(Adler and Elhadad 2006) "An Unsupervised Morpheme-Based HMM for
Hebrew Morphological Disambiguation". Meni Adler and Michael Elhadad,
in Proceedings of COLING-ACL 2006.

(Adler 2009) "Hebrew Morphological Disambiguation: An Unsupervised
Stochastic Word-based Approach". Meni Adler. PhD thesis, Ben Gurion
University.

(Goldberg et al 2008) "EM Can Find Pretty Good HMM POS-Taggers (When
Given a Good Start)". Yoav Goldberg, Meni Adler and Michael Elhadad,
in Proceedings of ACL 2008.

(Goldberg and Elhadad 2010a) "An Efficient Algorithm for Easy-First
Non-Directional Dependency Parsing". Yoav Goldberg and Michael
Elhadad, in Proceedings of NAACL 2010.

(Goldberg and Elhadad 2010b) "Easy-First Dependency Parsing of Modern
Hebrew". Yoav Goldberg and Michael Elhadad, in Proceedings of the
NAACL 2010 workshop on Statistical Parsing of Morphologically Rich
Languages.

(Goldberg and Elhadad 2011) "Learning Sparser Perceptron Models".
Yoav Goldberg and Michael Elhadad. Tech report, Ben Gurion University.

(Goldberg 2011) "Automatic Syntactic Processing of Modern Hebrew".
Yoav Goldberg. PhD thesis, Ben Gurion University.
(Milea et al 2009) "Automatic Annotation of Morpho-Syntactic
Dependencies in a Modern Hebrew Treebank". Adi Milea, Noemie Guthmann,
Yuval Krymolowski and Yoad Winter, in Proceedings of TLT.