Constituent-based Syntactic Parsing with NLTK

NLTK contains classes to work with PCFGs. This document reviews existing building blocks in NLTK 2.0b9.

Treebank
Trees
CFG
PCFG
Parsers

Treebank

NLTK includes a 5% fragment of the Penn Treebank corpus (about 1,600 sentences). You can observe the parsed trees using the treebank corpus reader:

>>> print nltk.corpus.treebank.parsed_sents('wsj_0001.mrg')[0]
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
>>> nltk.corpus.treebank.parsed_sents('wsj_0001.mrg')[0].draw()
>>> (draws a graphical version of the tree)

Trees

NLTK includes a tree class. The class has accessors for root and children, parsers to read a parenthesized representation of trees, mutators to change nodes within a tree, and tree transformations to turn a tree into a Chomsky-Normal-Form (CNF) tree. There is also a method to draw trees in a graphical manner. The code in nltk.tree.demo() demonstrates how this class can be used:

    from nltk import tree

    # Parse a tree from a string with parentheses.
    s = '(S (NP (DT the) (NN cat)) (VP (VBD ate) (NP (DT a) (NN cookie))))'
    t = Tree(s)
    print "Convert bracketed string into tree:"
    print t
    print t.__repr__()

    print "Display tree properties:"
    print t.node           # tree's constituent type
    print t[0]             # tree's first child
    print t[1]             # tree's second child
    print t.height()
    print t.leaves()
    print t[1]
    print t[1,1]
    print t[1,1,0]

    # Demonstrate tree modification.
    the_cat = t[0]
    the_cat.insert(1, tree.bracket_parse('(JJ big)'))
    print "Tree modification:"
    print t
    t[1,1,1] = tree.bracket_parse('(NN cake)')
    print t
    print

    # Tree transforms
    print "Collapse unary:"
    t.collapse_unary()
    print t
    print "Chomsky normal form:"
    t.chomsky_normal_form()
    print t
    print

    # Demonstrate probabilistic trees.
    pt = tree.ProbabilisticTree('x', ['y', 'z'], prob=0.5)
    print "Probabilistic Tree:"
    print pt
    print

    # Demonstrate parsing of treebank output format.
    t = tree.bracket_parse(t.pprint())
    print "Convert tree to bracketed string and back again:"
    print t
    print

    # Demonstrate LaTeX output
    print "LaTeX output:"
    print t.pprint_latex_qtree()
    print

    # Demonstrate Productions
    print "Production output:"
    print t.productions()
    print

    # Demonstrate tree nodes containing objects other than strings
    t.node = ('test', 3)
    print t

CFG

NLTK includes several classes to encode CFG and PCFG grammars in the nltk.grammar module:

Nonterminal
Production (for rules)
WeightedProduction (for rules in a PCFG)
ContextFreeGrammar
WeightedGrammar (for PCFG)

The classes include convenient parsers to convert strings into grammars (see nltk.grammar.parse_cfg() and parse_pcfg()). The method grammar.covers(ws) determines whether the words that appear in ws can ever be parsed by the grammar. The function nltk.grammar.cfg_demo() illustrates how these classes are used:

    from nltk import nonterminals, Production, parse_cfg

    # Create some nonterminals
    S, NP, VP, PP = nonterminals('S, NP, VP, PP')
    N, V, P, Det = nonterminals('N, V, P, Det')
    VP_slash_NP = VP/NP

    print 'Some nonterminals:', [S, NP, VP, PP, N, V, P, Det, VP/NP]
    print '    S.symbol() =>', `S.symbol()`
    print

    print Production(S, [NP])

    # Create some Grammar Productions
    grammar = parse_cfg("""
      S -> NP VP
      PP -> P NP
      NP -> Det N | NP PP
      VP -> V NP | VP PP
      Det -> 'a' | 'the'
      N -> 'dog' | 'cat'
      V -> 'chased' | 'sat'
      P -> 'on' | 'in'
    """)

    print 'A Grammar:', `grammar`
    print '    grammar.start()       =>', `grammar.start()`
    print '    grammar.productions() =>',
    # Use string.replace(...) is to line-wrap the output.
    print `grammar.productions()`.replace(',', ',\n'+' '*25)
    print
    
    print 'Coverage of input words by a grammar:'
    print grammar.covers(['a','dog'])
    print grammar.covers(['a','toy'])

PCFG

PCFGs are very similar to CFGs - they just have an additional probability for each production. For a given left-hand-side non-terminal, the sum of the probabilities must be 1.0:

toy_pcfg1 = parse_pcfg("""
    S -> NP VP [1.0]
    NP -> Det N [0.5] | NP PP [0.25] | 'John' [0.1] | 'I' [0.15]
    Det -> 'the' [0.8] | 'my' [0.2]
    N -> 'man' [0.5] | 'telescope' [0.5]
    VP -> VP PP [0.1] | V NP [0.7] | V [0.2]
    V -> 'ate' [0.35] | 'saw' [0.65]
    PP -> P NP [1.0]
    P -> 'with' [0.61] | 'under' [0.39]
    """)

The function nltk.grammar.pcfg_demo() illustrates how PCFGs can be constructed and manipulated. The most important method consists of inducing a PCFG from trees in a treebank (induce_pcfg()). Before we induce the PCFG, we take care of transforming the trees into CNF. Note how tree.productions() returns the list of CFG rules that "explain" the tree: for each non-terminal node in the tree, tree.productions() will return a production with the parent node as LHS and the children as RHS.

    from nltk.corpus import treebank
    from nltk import treetransforms
    from nltk import induce_pcfg
    from nltk.parse import pchart

    pcfg_prods = toy_pcfg1.productions()

    pcfg_prod = pcfg_prods[2]
    print 'A PCFG production:', `pcfg_prod`
    print '    pcfg_prod.lhs()  =>', `pcfg_prod.lhs()`
    print '    pcfg_prod.rhs()  =>', `pcfg_prod.rhs()`
    print '    pcfg_prod.prob() =>', `pcfg_prod.prob()`
    print

    # extract productions from three trees and induce the PCFG
    print "Induce PCFG grammar from treebank data:"

    productions = []
    for item in treebank.items[:2]:
        for tree in treebank.parsed_sents(item):
            # perform optional tree transformations, e.g.:
            tree.collapse_unary(collapsePOS = False)    # Remove branches A-B-C into A-B+C
            tree.chomsky_normal_form(horzMarkov = 2)    # Remove A->(B,C,D) into A->B,C+D->D

            productions += tree.productions()

    S = Nonterminal('S')
    grammar = induce_pcfg(S, productions)
    print grammar
    print

Parsers

There are different types of parsers implemented in NLTK. One that implements the Viterbi CKY n-best parses over a PCFG is available in nltk.grammar.pchar.Viterbi

    print "Parse sentence using induced grammar:"
    parser = pchart.InsideChartParser(grammar)
    parser.trace(3)
    sent = treebank.parsed_sents('wsj_0001.mrg')[0].leaves()
    print sent
    for parse in parser.nbest_parse(sent):
        print parse

The parsers are defined in module nltk.parse.

    import sys, time
    from nltk import tokenize, toy_pcfg1
    from nltk.parse import pchart

    demos = [('I saw John with my telescope', toy_pcfg1)]
    sent, grammar = demos[choice]

    # Tokenize the sentence.
    tokens = sent.split()

    # Define a list of parsers.  We'll use all parsers.
    parsers = [
        ViterbiParser(grammar),
        pchart.InsideChartParser(grammar),
        pchart.RandomChartParser(grammar),
        pchart.UnsortedChartParser(grammar),
        pchart.LongestChartParser(grammar),
        pchart.InsideChartParser(grammar, beam_size = len(tokens)+1)
        ]

    # Run the parsers on the tokenized sentence.
    times = []
    average_p = []
    num_parses = []
    all_parses = {}
    for parser in parsers:
        print '\ns: %s\nparser: %s\ngrammar: %s' % (sent,parser,grammar)
        parser.trace(3)
        t = time.time()
        parses = parser.nbest_parse(tokens)
        times.append(time.time()-t)
        if parses: p = reduce(lambda a,b:a+b.prob(), parses, 0)/len(parses)
        else: p = 0
        average_p.append(p)
        num_parses.append(len(parses))
        for p in parses: all_parses[p.freeze()] = 1

    # Print summary statistics
    print
    print '       Parser      Beam | Time (secs)   # Parses   Average P(parse)'
    print '------------------------+------------------------------------------'
    for i in range(len(parsers)):
        print '%18s %4d |%11.4f%11d%19.14f' % (parsers[i].__class__.__name__,
                                             parsers[i].beam_size,
                                             times[i],num_parses[i],average_p[i])
    parses = all_parses.keys()
    if parses: p = reduce(lambda a,b:a+b.prob(), parses, 0)/len(parses)
    else: p = 0
    print '------------------------+------------------------------------------'
    print '%18s      |%11s%11d%19.14f' % ('(All Parses)', 'n/a', len(parses), p)

    for parse in parses:
        print parse

Last modified Apr 6th, 2011 Michael Elhadad