Michael Elhadad
Constituent-based Syntactic Parsing with NLTK
NLTK contains classes to work with PCFGs. This document reviews existing building blocks in NLTK 2.0b9.
- Treebank
- Trees
- CFG
- PCFG
- Parsers
Treebank
NLTK includes a 5% fragment of the Penn Treebank corpus (about 1,600 sentences).
You can observe the parsed trees using the treebank corpus reader:
>>> import nltk
>>> print nltk.corpus.treebank.parsed_sents('wsj_0001.mrg')[0]
(S
(NP-SBJ
(NP (NNP Pierre) (NNP Vinken))
(, ,)
(ADJP (NP (CD 61) (NNS years)) (JJ old))
(, ,))
(VP
(MD will)
(VP
(VB join)
(NP (DT the) (NN board))
(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
(NP-TMP (NNP Nov.) (CD 29))))
(. .))
>>> nltk.corpus.treebank.parsed_sents('wsj_0001.mrg')[0].draw()
(draws a graphical version of the tree)
Trees
NLTK includes a tree class.
The class has accessors for root and children, parsers to read a parenthesized representation of trees,
mutators to change nodes within a tree, and tree transformations to turn a tree into a Chomsky-Normal-Form (CNF)
tree. There is also a method to draw trees in a graphical manner.
The code in nltk.tree.demo() demonstrates how this class can be used:
from nltk import tree
# Parse a tree from a string with parentheses.
s = '(S (NP (DT the) (NN cat)) (VP (VBD ate) (NP (DT a) (NN cookie))))'
t = tree.bracket_parse(s)
print "Convert bracketed string into tree:"
print t
print t.__repr__()
print "Display tree properties:"
print t.node # tree's constituent type
print t[0] # tree's first child
print t[1] # tree's second child
print t.height()
print t.leaves()
print t[1]
print t[1,1]
print t[1,1,0]
# Demonstrate tree modification.
the_cat = t[0]
the_cat.insert(1, tree.bracket_parse('(JJ big)'))
print "Tree modification:"
print t
t[1,1,1] = tree.bracket_parse('(NN cake)')
print t
print
# Tree transforms
print "Collapse unary:"
t.collapse_unary()
print t
print "Chomsky normal form:"
t.chomsky_normal_form()
print t
print
# Demonstrate probabilistic trees.
pt = tree.ProbabilisticTree('x', ['y', 'z'], prob=0.5)
print "Probabilistic Tree:"
print pt
print
# Demonstrate parsing of treebank output format.
t = tree.bracket_parse(t.pprint())
print "Convert tree to bracketed string and back again:"
print t
print
# Demonstrate LaTeX output
print "LaTeX output:"
print t.pprint_latex_qtree()
print
# Demonstrate Productions
print "Production output:"
print t.productions()
print
# Demonstrate tree nodes containing objects other than strings
t.node = ('test', 3)
print t
CFG
NLTK includes several classes to encode CFG and PCFG grammars in the nltk.grammar module:
- Nonterminal
- Production (for rules)
- WeightedProduction (for rules in a PCFG)
- ContextFreeGrammar
- WeightedGrammar (for PCFG)
The classes include convenient parsers to convert strings into grammars (see nltk.grammar.parse_cfg() and parse_pcfg()).
The method grammar.covers(ws) checks whether every word in ws appears as a terminal in the grammar -- a necessary condition for the sentence to be parsable.
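Conceptually, the coverage test just checks that every input word is a known terminal of the grammar. A minimal NLTK-free sketch (the terminal set is copied from the toy CFG defined below):

```python
# Sketch of a coverage check: a sentence is covered iff every word
# appears as a terminal of the grammar. The terminal set below is
# copied by hand from the toy CFG used later in this document.
def covers(terminals, words):
    return all(w in terminals for w in words)

terminals = set(['a', 'the', 'dog', 'cat', 'chased', 'sat', 'on', 'in'])
assert covers(terminals, ['a', 'dog'])      # all words known
assert not covers(terminals, ['a', 'toy'])  # 'toy' is not a terminal
```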
The function nltk.grammar.cfg_demo() illustrates how these classes are used:
from nltk import nonterminals, Production, parse_cfg
# Create some nonterminals
S, NP, VP, PP = nonterminals('S, NP, VP, PP')
N, V, P, Det = nonterminals('N, V, P, Det')
VP_slash_NP = VP/NP
print 'Some nonterminals:', [S, NP, VP, PP, N, V, P, Det, VP/NP]
print ' S.symbol() =>', `S.symbol()`
print
print Production(S, [NP])
# Create some Grammar Productions
grammar = parse_cfg("""
S -> NP VP
PP -> P NP
NP -> Det N | NP PP
VP -> V NP | VP PP
Det -> 'a' | 'the'
N -> 'dog' | 'cat'
V -> 'chased' | 'sat'
P -> 'on' | 'in'
""")
print 'A Grammar:', `grammar`
print ' grammar.start() =>', `grammar.start()`
print ' grammar.productions() =>',
# Use string.replace(...) to line-wrap the output.
print `grammar.productions()`.replace(',', ',\n'+' '*25)
print
print 'Coverage of input words by a grammar:'
print grammar.covers(['a','dog'])
print grammar.covers(['a','toy'])
PCFG
PCFGs are very similar to CFGs - they just have an additional probability for each production.
For a given left-hand-side non-terminal, the sum of the probabilities must be 1.0:
toy_pcfg1 = parse_pcfg("""
S -> NP VP [1.0]
NP -> Det N [0.5] | NP PP [0.25] | 'John' [0.1] | 'I' [0.15]
Det -> 'the' [0.8] | 'my' [0.2]
N -> 'man' [0.5] | 'telescope' [0.5]
VP -> VP PP [0.1] | V NP [0.7] | V [0.2]
V -> 'ate' [0.35] | 'saw' [0.65]
PP -> P NP [1.0]
P -> 'with' [0.61] | 'under' [0.39]
""")
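The constraint can be checked mechanically. The pairs below copy the (LHS, probability) of each rule in toy_pcfg1 and verify that each left-hand side sums to 1.0:

```python
# (LHS, probability) pairs copied from toy_pcfg1 above.
rules = [
    ('S', 1.0),
    ('NP', 0.5), ('NP', 0.25), ('NP', 0.1), ('NP', 0.15),
    ('Det', 0.8), ('Det', 0.2),
    ('N', 0.5), ('N', 0.5),
    ('VP', 0.1), ('VP', 0.7), ('VP', 0.2),
    ('V', 0.35), ('V', 0.65),
    ('PP', 1.0),
    ('P', 0.61), ('P', 0.39),
]
totals = {}
for lhs, p in rules:
    totals[lhs] = totals.get(lhs, 0.0) + p
# Every LHS must sum to 1.0 (up to floating-point rounding).
assert all(abs(t - 1.0) < 1e-9 for t in totals.values())
```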
The function nltk.grammar.pcfg_demo() illustrates how PCFGs can be constructed and manipulated.
The most important method consists of inducing a PCFG from trees in a treebank (induce_pcfg()).
Before we induce the PCFG, we take care of transforming the trees into CNF.
Note how tree.productions() returns the list of CFG rules that "explain" the tree:
for each non-terminal node in the tree, tree.productions() will return a production with the
parent node as LHS and the children as RHS.
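The behavior of tree.productions() can be sketched as a simple recursion over the tree; nested lists stand in for NLTK Tree objects here:

```python
# Sketch of what tree.productions() computes, on a nested-list tree:
# each non-leaf node contributes one rule (parent label -> child labels).
def productions(t):
    if isinstance(t, str):   # a leaf (a word): no rule
        return []
    label, children = t[0], t[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    rules = [(label, rhs)]
    for c in children:
        rules += productions(c)
    return rules

t = ['S', ['NP', ['DT', 'the'], ['NN', 'cat']],
          ['VP', ['VBD', 'ate']]]
assert productions(t)[0] == ('S', ('NP', 'VP'))
```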
from nltk.corpus import treebank
from nltk import treetransforms
from nltk import induce_pcfg, Nonterminal
from nltk.parse import pchart
pcfg_prods = toy_pcfg1.productions()
pcfg_prod = pcfg_prods[2]
print 'A PCFG production:', `pcfg_prod`
print ' pcfg_prod.lhs() =>', `pcfg_prod.lhs()`
print ' pcfg_prod.rhs() =>', `pcfg_prod.rhs()`
print ' pcfg_prod.prob() =>', `pcfg_prod.prob()`
print
# extract productions from the first two treebank files and induce the PCFG
print "Induce PCFG grammar from treebank data:"
productions = []
for item in treebank.items[:2]:
for tree in treebank.parsed_sents(item):
# perform optional tree transformations, e.g.:
tree.collapse_unary(collapsePOS = False) # Collapse unary chains A->B->C into A->B+C
tree.chomsky_normal_form(horzMarkov = 2) # Binarize A->(B,C,D) into A->(B,C+D), C+D->(C,D)
productions += tree.productions()
S = Nonterminal('S')
grammar = induce_pcfg(S, productions)
print grammar
print
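Under the hood, induce_pcfg performs maximum-likelihood estimation: the probability assigned to a rule A -> beta is count(A -> beta) / count(A). A minimal sketch on plain (lhs, rhs) tuples, not NLTK objects:

```python
# Maximum-likelihood PCFG induction: P(A -> beta) = count(A -> beta) / count(A).
def induce(rule_list):
    rule_counts, lhs_counts = {}, {}
    for rule in rule_list:
        lhs = rule[0]
        rule_counts[rule] = rule_counts.get(rule, 0) + 1
        lhs_counts[lhs] = lhs_counts.get(lhs, 0) + 1
    return dict((r, c / float(lhs_counts[r[0]]))
                for r, c in rule_counts.items())

# NP was rewritten as (DT, NN) twice and as (PRP,) once:
rules = [('NP', ('DT', 'NN')), ('NP', ('DT', 'NN')), ('NP', ('PRP',))]
probs = induce(rules)
assert abs(probs[('NP', ('DT', 'NN'))] - 2.0/3.0) < 1e-9
```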
Parsers
There are different types of parsers implemented in NLTK.
A parser that computes the Viterbi CKY best parse over a PCFG is available as
nltk.parse.ViterbiParser; probabilistic chart parsers (e.g., InsideChartParser) are in nltk.parse.pchart.
print "Parse sentence using induced grammar:"
parser = pchart.InsideChartParser(grammar)
parser.trace(3)
sent = treebank.parsed_sents('wsj_0001.mrg')[0].leaves()
print sent
for parse in parser.nbest_parse(sent):
print parse
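For reference, the Viterbi CKY idea can be sketched independently of NLTK. The grammar below is in CNF (binary rules plus lexical rules); the rules and probabilities are illustrative, not taken from the library:

```python
# Sketch of Viterbi CKY over a CNF PCFG: chart[(i, j)][A] holds the best
# probability of deriving words[i:j] from nonterminal A.
def viterbi_cky(words, lexical, binary, start='S'):
    n = len(words)
    chart = {}
    # Initialize length-1 spans from lexical rules A -> word.
    for i, w in enumerate(words):
        chart[(i, i + 1)] = dict((A, p) for (A, word), p in lexical.items()
                                 if word == w)
    # Fill longer spans bottom-up, trying every split point k.
    for span in range(2, n + 1):
        for i in range(0, n - span + 1):
            j = i + span
            cell = {}
            for k in range(i + 1, j):
                for (A, B, C), p in binary.items():
                    score = p * chart[(i, k)].get(B, 0.0) * chart[(k, j)].get(C, 0.0)
                    if score > cell.get(A, 0.0):
                        cell[A] = score
            chart[(i, j)] = cell
    return chart[(0, n)].get(start, 0.0)

# Illustrative toy grammar (probabilities chosen for the example).
lexical = {('Det', 'the'): 1.0, ('N', 'cat'): 0.5, ('V', 'chased'): 0.5}
binary = {('S', 'NP', 'VP'): 1.0, ('NP', 'Det', 'N'): 1.0, ('VP', 'V', 'NP'): 1.0}
words = 'the cat chased the cat'.split()
# Best derivation: S(1.0) * NP(0.5) * VP(0.5 * 0.5) = 0.125
assert abs(viterbi_cky(words, lexical, binary) - 0.125) < 1e-9
```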
The parsers are defined in module nltk.parse.
import sys, time
from nltk import toy_pcfg1
from nltk.parse import pchart, ViterbiParser
demos = [('I saw John with my telescope', toy_pcfg1)]
sent, grammar = demos[0]
# Tokenize the sentence.
tokens = sent.split()
# Define a list of parsers. We'll use all parsers.
parsers = [
ViterbiParser(grammar),
pchart.InsideChartParser(grammar),
pchart.RandomChartParser(grammar),
pchart.UnsortedChartParser(grammar),
pchart.LongestChartParser(grammar),
pchart.InsideChartParser(grammar, beam_size = len(tokens)+1)
]
# Run the parsers on the tokenized sentence.
times = []
average_p = []
num_parses = []
all_parses = {}
for parser in parsers:
print '\ns: %s\nparser: %s\ngrammar: %s' % (sent,parser,grammar)
parser.trace(3)
t = time.time()
parses = parser.nbest_parse(tokens)
times.append(time.time()-t)
if parses: p = reduce(lambda a,b:a+b.prob(), parses, 0)/len(parses)
else: p = 0
average_p.append(p)
num_parses.append(len(parses))
for p in parses: all_parses[p.freeze()] = 1
# Print summary statistics
print
print ' Parser Beam | Time (secs) # Parses Average P(parse)'
print '------------------------+------------------------------------------'
for i in range(len(parsers)):
print '%18s %4d |%11.4f%11d%19.14f' % (parsers[i].__class__.__name__,
parsers[i].beam_size,
times[i],num_parses[i],average_p[i])
parses = all_parses.keys()
if parses: p = reduce(lambda a,b:a+b.prob(), parses, 0)/len(parses)
else: p = 0
print '------------------------+------------------------------------------'
print '%18s |%11s%11d%19.14f' % ('(All Parses)', 'n/a', len(parses), p)
for parse in parses:
print parse
Last modified Apr 6th, 2011
Michael Elhadad