Michael Elhadad

Constituent-based Syntactic Parsing with NLTK

NLTK contains classes to work with PCFGs. This document reviews existing building blocks in NLTK 2.0b9.
  1. Treebank
  2. Trees
  3. CFG
  4. PCFG
  5. Parsers

Treebank

NLTK includes a 5% fragment of the Penn Treebank corpus (about 1,600 sentences). You can observe the parsed trees using the treebank corpus reader:
>>> print nltk.corpus.treebank.parsed_sents('wsj_0001.mrg')[0]
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
>>> nltk.corpus.treebank.parsed_sents('wsj_0001.mrg')[0].draw()
>>> (draws a graphical version of the tree)

Trees

NLTK includes a tree class. The class has accessors for root and children, parsers to read a parenthesized representation of trees, mutators to change nodes within a tree, and tree transformations to turn a tree into a Chomsky-Normal-Form (CNF) tree. There is also a method to draw trees in a graphical manner. The code in nltk.tree.demo() demonstrates how this class can be used:
    from nltk import tree

    # Parse a tree from a string with parentheses.
    s = '(S (NP (DT the) (NN cat)) (VP (VBD ate) (NP (DT a) (NN cookie))))'
    t = Tree(s)
    print "Convert bracketed string into tree:"
    print t
    print t.__repr__()

    print "Display tree properties:"
    print t.node           # tree's constituent type
    print t[0]             # tree's first child
    print t[1]             # tree's second child
    print t.height()
    print t.leaves()
    print t[1]
    print t[1,1]
    print t[1,1,0]

    # Demonstrate tree modification.
    the_cat = t[0]
    the_cat.insert(1, tree.bracket_parse('(JJ big)'))
    print "Tree modification:"
    print t
    t[1,1,1] = tree.bracket_parse('(NN cake)')
    print t
    print

    # Tree transforms
    print "Collapse unary:"
    t.collapse_unary()
    print t
    print "Chomsky normal form:"
    t.chomsky_normal_form()
    print t
    print

    # Demonstrate probabilistic trees.
    pt = tree.ProbabilisticTree('x', ['y', 'z'], prob=0.5)
    print "Probabilistic Tree:"
    print pt
    print

    # Demonstrate parsing of treebank output format.
    t = tree.bracket_parse(t.pprint())
    print "Convert tree to bracketed string and back again:"
    print t
    print

    # Demonstrate LaTeX output
    print "LaTeX output:"
    print t.pprint_latex_qtree()
    print

    # Demonstrate Productions
    print "Production output:"
    print t.productions()
    print

    # Demonstrate tree nodes containing objects other than strings
    t.node = ('test', 3)
    print t

CFG

NLTK includes several classes to encode CFG and PCFG grammars in the nltk.grammar module:
  1. Nonterminal
  2. Production (for rules)
  3. WeightedProduction (for rules in a PCFG)
  4. ContextFreeGrammar
  5. WeightedGrammar (for PCFG)
The classes include convenient parsers to convert strings into grammars (see nltk.grammar.parse_cfg() and parse_pcfg()). The method grammar.covers(ws) determines whether the words that appear in ws can ever be parsed by the grammar. The function nltk.grammar.cfg_demo() illustrates how these classes are used:
    from nltk import nonterminals, Production, parse_cfg

    # Create some nonterminals
    S, NP, VP, PP = nonterminals('S, NP, VP, PP')
    N, V, P, Det = nonterminals('N, V, P, Det')
    VP_slash_NP = VP/NP

    print 'Some nonterminals:', [S, NP, VP, PP, N, V, P, Det, VP/NP]
    print '    S.symbol() =>', `S.symbol()`
    print

    print Production(S, [NP])

    # Create some Grammar Productions
    grammar = parse_cfg("""
      S -> NP VP
      PP -> P NP
      NP -> Det N | NP PP
      VP -> V NP | VP PP
      Det -> 'a' | 'the'
      N -> 'dog' | 'cat'
      V -> 'chased' | 'sat'
      P -> 'on' | 'in'
    """)

    print 'A Grammar:', `grammar`
    print '    grammar.start()       =>', `grammar.start()`
    print '    grammar.productions() =>',
    # Use string.replace(...) is to line-wrap the output.
    print `grammar.productions()`.replace(',', ',\n'+' '*25)
    print
    
    print 'Coverage of input words by a grammar:'
    print grammar.covers(['a','dog'])
    print grammar.covers(['a','toy'])

PCFG

PCFGs are very similar to CFGs - they just have an additional probability for each production. For a given left-hand-side non-terminal, the sum of the probabilities must be 1.0:
toy_pcfg1 = parse_pcfg("""
    S -> NP VP [1.0]
    NP -> Det N [0.5] | NP PP [0.25] | 'John' [0.1] | 'I' [0.15]
    Det -> 'the' [0.8] | 'my' [0.2]
    N -> 'man' [0.5] | 'telescope' [0.5]
    VP -> VP PP [0.1] | V NP [0.7] | V [0.2]
    V -> 'ate' [0.35] | 'saw' [0.65]
    PP -> P NP [1.0]
    P -> 'with' [0.61] | 'under' [0.39]
    """)
The function nltk.grammar.pcfg_demo() illustrates how PCFGs can be constructed and manipulated. The most important method consists of inducing a PCFG from trees in a treebank (induce_pcfg()). Before we induce the PCFG, we take care of transforming the trees into CNF. Note how tree.productions() returns the list of CFG rules that "explain" the tree: for each non-terminal node in the tree, tree.productions() will return a production with the parent node as LHS and the children as RHS.
    from nltk.corpus import treebank
    from nltk import treetransforms
    from nltk import induce_pcfg
    from nltk.parse import pchart

    pcfg_prods = toy_pcfg1.productions()

    pcfg_prod = pcfg_prods[2]
    print 'A PCFG production:', `pcfg_prod`
    print '    pcfg_prod.lhs()  =>', `pcfg_prod.lhs()`
    print '    pcfg_prod.rhs()  =>', `pcfg_prod.rhs()`
    print '    pcfg_prod.prob() =>', `pcfg_prod.prob()`
    print

    # extract productions from three trees and induce the PCFG
    print "Induce PCFG grammar from treebank data:"

    productions = []
    for item in treebank.items[:2]:
        for tree in treebank.parsed_sents(item):
            # perform optional tree transformations, e.g.:
            tree.collapse_unary(collapsePOS = False)    # Remove branches A-B-C into A-B+C
            tree.chomsky_normal_form(horzMarkov = 2)    # Remove A->(B,C,D) into A->B,C+D->D

            productions += tree.productions()

    S = Nonterminal('S')
    grammar = induce_pcfg(S, productions)
    print grammar
    print

Parsers

There are different types of parsers implemented in NLTK. One that implements the Viterbi CKY n-best parses over a PCFG is available in nltk.grammar.pchar.Viterbi
    print "Parse sentence using induced grammar:"
    parser = pchart.InsideChartParser(grammar)
    parser.trace(3)
    sent = treebank.parsed_sents('wsj_0001.mrg')[0].leaves()
    print sent
    for parse in parser.nbest_parse(sent):
        print parse

The parsers are defined in module nltk.parse.
    import sys, time
    from nltk import tokenize, toy_pcfg1
    from nltk.parse import pchart

    demos = [('I saw John with my telescope', toy_pcfg1)]
    sent, grammar = demos[choice]

    # Tokenize the sentence.
    tokens = sent.split()

    # Define a list of parsers.  We'll use all parsers.
    parsers = [
        ViterbiParser(grammar),
        pchart.InsideChartParser(grammar),
        pchart.RandomChartParser(grammar),
        pchart.UnsortedChartParser(grammar),
        pchart.LongestChartParser(grammar),
        pchart.InsideChartParser(grammar, beam_size = len(tokens)+1)
        ]

    # Run the parsers on the tokenized sentence.
    times = []
    average_p = []
    num_parses = []
    all_parses = {}
    for parser in parsers:
        print '\ns: %s\nparser: %s\ngrammar: %s' % (sent,parser,grammar)
        parser.trace(3)
        t = time.time()
        parses = parser.nbest_parse(tokens)
        times.append(time.time()-t)
        if parses: p = reduce(lambda a,b:a+b.prob(), parses, 0)/len(parses)
        else: p = 0
        average_p.append(p)
        num_parses.append(len(parses))
        for p in parses: all_parses[p.freeze()] = 1

    # Print summary statistics
    print
    print '       Parser      Beam | Time (secs)   # Parses   Average P(parse)'
    print '------------------------+------------------------------------------'
    for i in range(len(parsers)):
        print '%18s %4d |%11.4f%11d%19.14f' % (parsers[i].__class__.__name__,
                                             parsers[i].beam_size,
                                             times[i],num_parses[i],average_p[i])
    parses = all_parses.keys()
    if parses: p = reduce(lambda a,b:a+b.prob(), parses, 0)/len(parses)
    else: p = 0
    print '------------------------+------------------------------------------'
    print '%18s      |%11s%11d%19.14f' % ('(All Parses)', 'n/a', len(parses), p)

    for parse in parses:
        print parse


Last modified Apr 6th, 2011 Michael Elhadad