Constituent-based Syntactic Parsing with NLTK

http://www.cs.bgu.ac.il/~elhadad/nlp21.html

NLTK contains classes to work with PCFGs. This document reviews existing building blocks in NLTK 3.4.

  1. Treebank
  2. Trees
  3. CFG
  4. PCFG
  5. Parsers

Treebank

NLTK includes a 5% fragment of the Penn Treebank corpus (about 4,000 sentences). You can observe the parsed trees using the treebank corpus reader:
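A minimal sketch of reading the sample, assuming the corpus data has been installed with nltk.download('treebank'). The file id 'wsj_0001.mrg' is the first file of the sample; the try/except is only there so that the snippet fails gracefully when the data is missing.

```python
import nltk
from nltk.corpus import treebank

first_tree = None
try:
    # Each file id names one Wall Street Journal file of the sample.
    print(treebank.fileids()[:3])
    # parsed_sents() returns the sentences of a file as nltk.Tree objects.
    first_tree = treebank.parsed_sents("wsj_0001.mrg")[0]
    print(first_tree)            # bracketed Penn Treebank notation
    print(first_tree.leaves())   # the tokens of the sentence
except LookupError:
    # Raised when the corpus data has not been downloaded yet.
    print("Treebank sample not found; run nltk.download('treebank') first.")
```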

The draw() method produces a graphical rendering of the trees (in a separate window):
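As a hedged sketch, the helper below (show is a hypothetical name, not part of NLTK) calls draw() only on request, because draw() needs a display and blocks until the window is closed. Otherwise it falls back to pretty_print(), which renders the tree as text. The toy sentence is an assumption used for illustration.

```python
from nltk import Tree

t = Tree.fromstring("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")

def show(tree, graphical=False):
    """Render a tree in a Tk window if graphical, else as text on stdout."""
    if graphical:
        tree.draw()          # opens a separate window; needs a display
    else:
        tree.pretty_print()  # text rendering, works in any terminal

show(t)
```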

To get a tree displayed in a Jupyter notebook, use the following method:

Trees

NLTK includes a tree class, nltk.tree.Tree.

The class has accessors for root and children, parsers to read a parenthesized representation of trees, mutators to change nodes within a tree, and tree transformations to turn a tree into a Chomsky-Normal-Form (CNF) tree. There is also a method to draw trees in a graphical manner. The code in nltk.tree.demo() demonstrates how this class can be used.

Tree Accessors
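The accessors can be sketched on a small hand-written tree (the sentence is a toy example, not taken from the treebank):

```python
from nltk import Tree

# Read a tree from its parenthesized representation.
t = Tree.fromstring("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")

print(t.label())     # label of the root node
print(len(t))        # number of children of the root
print(t[0])          # first child, itself a Tree
print(t[0, 1])       # a tuple index follows a path: second child of the first child
print(t[0, 1, 0])    # following the path down to a leaf gives a plain string
print(t.leaves())    # all the words, left to right
print(t.height())    # a pre-terminal such as (NN cat) has height 2
print(t.pos())       # (word, tag) pairs read off the pre-terminals

# subtrees() walks the non-terminal nodes, optionally filtered by a predicate.
nps = list(t.subtrees(lambda s: s.label() == "NP"))
print(nps)
```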

Tree Mutators
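Since Tree is a subclass of list, nodes can be changed in place. A short sketch on a toy tree (the labels VP-PAST and ADVP are arbitrary choices made for illustration):

```python
from nltk import Tree

t = Tree.fromstring("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")

t[0, 1, 0] = "dog"           # replace a leaf, addressed by its tree position
t[1].set_label("VP-PAST")    # rename a node
t[1].append(Tree("ADVP", [Tree("RB", ["quietly"])]))  # add a new child
del t[0][0]                  # remove the determiner from the NP

print(t)
```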

Tree Transformations
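The following sketch shows the CNF transformation on a toy tree with a ternary NP. Note that chomsky_normal_form() works in place and only binarizes the tree; unary chains are left alone unless collapse_unary() is also called. The tree is copied first so that the original stays available for comparison.

```python
from nltk import Tree

t = Tree.fromstring("(S (NP (DT the) (JJ big) (NN cat)) (VP (VBD sat)))")

# Binarize a deep copy: nodes with more than two children are split,
# introducing artificial nodes whose labels contain the '|' character.
cnf = t.copy(deep=True)
cnf.chomsky_normal_form()
print(cnf)

# The transformation is reversible: the artificial nodes are removed again.
restored = cnf.copy(deep=True)
restored.un_chomsky_normal_form()
print(restored)
```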

Probabilistic Trees
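A ProbabilisticTree is a Tree that also carries a probability. A minimal sketch, with probability values invented for illustration:

```python
from nltk.tree import ProbabilisticTree

# Each node is given its own probability through the prob keyword.
dt = ProbabilisticTree("DT", ["the"], prob=1.0)
nn = ProbabilisticTree("NN", ["cat"], prob=0.5)
np_tree = ProbabilisticTree("NP", [dt, nn], prob=0.5)

print(np_tree)            # the printed form includes the probability
print(np_tree.prob())     # probability stored on the NP node
print(np_tree.logprob())  # base-2 logarithm of the probability
```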

CFG

NLTK includes several classes to encode CFG and PCFG grammars in the nltk.grammar module:

  1. Nonterminal
  2. Production (for rules)
  3. ProbabilisticProduction (for rules in a PCFG; called WeightedProduction in NLTK 2)
  4. CFG (called ContextFreeGrammar in NLTK 2)
  5. PCFG

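The classes include convenient parsers to convert strings into grammars (CFG.fromstring() and PCFG.fromstring()). The method grammar.check_coverage(ws) checks that every word in the token list ws is covered by a lexical rule of the grammar, and raises a ValueError if some word is not covered.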

The function nltk.grammar.cfg_demo() illustrates how these classes are used.
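As a small sketch of these classes, the toy grammar below (invented for illustration) is read from a string, queried for its productions, and checked for lexical coverage:

```python
from nltk import CFG, Nonterminal

grammar = CFG.fromstring("""
    S -> NP VP
    NP -> DT NN
    VP -> VBD
    DT -> 'the'
    NN -> 'cat' | 'dog'
    VBD -> 'sat'
""")

print(grammar.start())  # the LHS of the first rule is the start symbol

# productions() can be filtered by left-hand side.
nn_rules = grammar.productions(lhs=Nonterminal("NN"))
for p in nn_rules:
    print(p.lhs(), "->", p.rhs())

# check_coverage() returns silently when every word is known...
grammar.check_coverage("the cat sat".split())

# ...and raises ValueError when a word has no lexical rule.
try:
    grammar.check_coverage("the bird sat".split())
    covered = True
except ValueError as err:
    covered = False
    print(err)
```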

PCFG

PCFGs are very similar to CFGs: each production carries an additional probability. For a given left-hand-side non-terminal, the probabilities of all its productions must sum to 1.0.
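The constraint can be checked on a toy PCFG (rules and probabilities invented for illustration). The sketch sums the probabilities per left-hand side, then shows that NLTK rejects a grammar whose probabilities do not sum to 1.0:

```python
from collections import defaultdict
from nltk import PCFG

pcfg = PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> DT NN [0.7] | NN [0.3]
    VP -> VBD [1.0]
    DT -> 'the' [1.0]
    NN -> 'cat' [0.6] | 'dog' [0.4]
    VBD -> 'sat' [1.0]
""")

# Sum the probabilities of the productions that share a left-hand side.
totals = defaultdict(float)
for p in pcfg.productions():
    totals[p.lhs()] += p.prob()
for lhs, total in totals.items():
    print(lhs, total)

# A grammar that violates the constraint is rejected at construction time.
try:
    PCFG.fromstring("S -> 'a' [0.5] | 'b' [0.4]")
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```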

The function nltk.grammar.pcfg_demo() illustrates how PCFGs can be constructed and manipulated. The most important function is induce_pcfg(start, productions), which induces a PCFG from the productions observed in the trees of a treebank: the probability of each production is its count divided by the count of all productions with the same LHS. Before we induce the PCFG, we take care of transforming the trees into CNF. Note how tree.productions() returns the list of CFG rules that "explain" the tree: for each non-terminal node in the tree, tree.productions() will return a production with the parent node as LHS and the children as RHS.
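The induction steps can be sketched as follows. To keep the example independent of the corpus data, two hand-written toy trees stand in for the treebank; with the real corpus the same loop would run over treebank.parsed_sents().

```python
from nltk import Tree
from nltk.grammar import Nonterminal, induce_pcfg

bracketed = [
    "(S (NP (DT the) (NN cat)) (VP (VBD sat)))",
    "(S (NP (DT the) (JJ big) (NN dog)) (VP (VBD barked)))",
]

productions = []
for s in bracketed:
    tree = Tree.fromstring(s)
    tree.chomsky_normal_form()        # binarize in place before counting rules
    productions += tree.productions() # one production per non-terminal node

# Probabilities are relative frequencies among productions with the same LHS.
grammar = induce_pcfg(Nonterminal("S"), productions)

np_rules = grammar.productions(lhs=Nonterminal("NP"))
for p in np_rules:
    print(p)
```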

Parsers

There are different types of parsers implemented in NLTK. One that uses Viterbi-style bottom-up dynamic programming (as in CKY) to find the most likely parse over a PCFG is available in the nltk.parse.viterbi module (class ViterbiParser).

The parsers are defined in module nltk.parse.
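A minimal sketch of the Viterbi parser on a toy PCFG (grammar, probabilities and sentence are invented for illustration). The sentence has two analyses, depending on where the prepositional phrase attaches, and the parser returns the more probable one as a ProbabilisticTree:

```python
from nltk import PCFG
from nltk.parse import ViterbiParser

grammar = PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> DT NN [0.6] | NP PP [0.4]
    VP -> VBD NP [0.7] | VP PP [0.3]
    PP -> IN NP [1.0]
    DT -> 'the' [1.0]
    NN -> 'man' [0.4] | 'dog' [0.3] | 'telescope' [0.3]
    VBD -> 'saw' [1.0]
    IN -> 'with' [1.0]
""")

parser = ViterbiParser(grammar)
tokens = "the man saw the dog with the telescope".split()

# parse() returns an iterator; its first item is the most likely tree.
best = next(parser.parse(tokens))
print(best)
print(best.prob())  # probability of the whole parse
```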
