Hebrew Constituency Parser
==========================

The technical details:
~~~~~~~~~~~~~~~~~~~~~~
The parser is based on a PCFG-LA model (Petrov et al 2006) with lattice-based
inference (Chappelier et al 1999) for joint token-segmentation and parsing.
It has wide lexical coverage based on the MILA morphological analyzer, with
tagging probabilities obtained using EM-HMM training (Adler and Elhadad 2006,
Adler 2009, Goldberg et al 2008).

The details are discussed in my PhD thesis (Goldberg 2011) and a forthcoming
journal paper (TBA). The various components were introduced in earlier
publications (most notably Goldberg and Tsarfaty 2008, Goldberg et al 2009,
and Goldberg and Elhadad 2011).

Requirements/prerequisites:
~~~~~~~~~~~~~~~~~~~~~~~~~~~
In order to run the parsing package, you will need:

- python version 2.7 (though 2.6 should also work)
- java version 6 or above. 'java' is expected to be on your path.

The code was developed and tested on Linux, but should work on Windows as well.

Parsing text:
~~~~~~~~~~~~~
Assuming you have the prerequisites installed, you should be able to run:

   python ./parse.py < sample.txt > parsed.txt

This takes a while, because the parser needs to load some resources. However,
once things are loaded, it should parse a couple of reasonably long sentences
per second. It is best to put all the text you want to parse in a single file
and let the parser run. The constituency parser is not blazingly fast. For a
much faster parser, try my dependency parser.

The parser expects one sentence per line. Sentence splitting is best handled
at the application level. If you do not have your own sentence splitter, you
can use the very basic one distributed with the parser:

   cat sample.txt | python utils/sentences.py | python parse.py > parsed.txt

The parser does its own tokenization. If you want to perform your own
tokenization, run it with the 'pretok' option:

   python ./parse.py pretok < tokenized.txt > parsed.txt

The input and output are in UTF-8 encoding.

To verify things are working as intended, run:

   python ./parse.py < sample.txt > parsed.txt

The content of parsed.txt should be identical to expected_output.txt.

Viewing the output files:
~~~~~~~~~~~~~~~~~~~~~~~~~
I like to use the viewer by Federico Sangati:

   http://staff.science.uva.nl/~fsangati/Viewers/ConstTreeViewer_13_05_10.jar

Running the web-interface:
~~~~~~~~~~~~~~~~~~~~~~~~~~
You can run the web interface using:

   webinterface/parse_server.py 8080

You can then point your browser to http://localhost:8080 and parse sentences.

(Advanced: you can programmatically POST your sentences to "/parse" and get
responses in plain text.)
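For example, the following is a minimal client sketch, assuming the server is
running locally on port 8080 and that the POSTed form field holding the
sentence is named 'text' (this field name is an assumption; check
webinterface/parse_server.py for the name the bottle.py route actually reads):

   # -*- coding: utf-8 -*-
   # Minimal sketch of POSTing a sentence to the parse server (Python 2.7).
   # The field name 'text' is an assumption; adjust it to whatever
   # webinterface/parse_server.py actually reads from the request.
   import urllib
   import urllib2

   def parse_sentence(sentence, url="http://localhost:8080/parse"):
       # The parser expects and returns UTF-8 encoded text.
       data = urllib.urlencode({"text": sentence.encode("utf-8")})
       return urllib2.urlopen(url, data).read().decode("utf-8")

   if __name__ == "__main__":
       print parse_sentence(u"...").encode("utf-8")  # put a Hebrew sentence here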
Parser not working well for your domain?
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The parser is trained on the news domain, and is most suitable for parsing
news-like text. It should work quite well for related genres too, but
performance is expected to degrade the further you get from the news domain.
Depending on the source of the difficulty in your specific domain, I might be
able to suggest a solution. I am not promising anything, but it won't hurt to
drop me an email. (yoav.goldberg / gmail)

Bug-reports:
~~~~~~~~~~~~
Email me.

License and copyright:
~~~~~~~~~~~~~~~~~~~~~~
The MILA lexicon data belongs to the Knowledge Center for Processing Hebrew,
and is distributed under the GPL. The distribution uses statistics derived
from this lexicon.

The PCFG-LA implementation is based on Slav Petrov's implementation of the
Berkeley parser, which I modified to support lattice parsing. The Berkeley
parser is distributed under the GPL v2.

The web interface uses bottle.py, which is under the MIT license.

The tree-drawing code is based on free source code by Emilio Cortegoso Lobato
(2006).

The rest of the code was developed by Yoav Goldberg (copyright 2010-2011) and
is distributed under the GPL v3 license. It is free for academic use (please
contact Yoav Goldberg regarding commercial usage).

Citing:
~~~~~~~
If you use the parser in your academic work, it would be nice of you to cite
my PhD thesis or the journal paper.

References:
~~~~~~~~~~~
(Adler and Elhadad 2006) "An Unsupervised Morpheme-Based HMM for Hebrew
Morphological Disambiguation". Meni Adler and Michael Elhadad, in Proceedings
of COLING-ACL 2006.

(Adler 2009) "Hebrew Morphological Disambiguation: An Unsupervised Stochastic
Word-based Approach". Meni Adler. PhD Thesis, Ben Gurion University.

(Chappelier et al 1999) "Lattice Parsing for Speech Recognition".
J.C. Chappelier, M. Rajman, R. Arages and A. Rozenknop.

(Goldberg et al 2008) "EM Can Find Pretty Good HMM POS-Taggers (When Given a
Good Start)". Yoav Goldberg, Meni Adler and Michael Elhadad, in Proceedings
of ACL 2008.

(Goldberg and Tsarfaty 2008) "A Single Generative Model for Joint
Morphological Segmentation and Syntactic Parsing". Yoav Goldberg and Reut
Tsarfaty, in Proceedings of ACL 2008.

(Goldberg et al 2009) "Enhancing Unlexicalized Parsing Performance using a
Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical
Probabilities". Yoav Goldberg, Reut Tsarfaty, Meni Adler and Michael Elhadad,
in Proceedings of EACL 2009.

(Goldberg and Elhadad 2011) "Joint Hebrew Segmentation and Parsing using a
PCFG-LA Lattice Parser". Yoav Goldberg and Michael Elhadad, in Proceedings of
ACL 2011.

(Goldberg 2011) "Automatic Syntactic Processing of Modern Hebrew". Yoav
Goldberg. PhD Thesis, Ben Gurion University.

(Petrov et al 2006) "Learning Accurate, Compact, and Interpretable Tree
Annotation". Slav Petrov, Leon Barrett, Romain Thibaux, and Dan Klein, in
Proceedings of COLING-ACL 2006.