Assignment 3
Due: Tue 11 Jan 2016 Midnight
Natural Language Processing - Fall 2016 Michael Elhadad
This assignment covers the topic of syntactic parsing.
Submit your solution in the form of an iPython notebook (ipynb) file.
The following notes on PCFG in NLTK (notebook source) contain useful code to start this assignment.
Content
- Q1: Learning a PCFG from a Treebank
- Q1.1 Random PCFG Generation
- Q1.2 Learn a PCFG from a Treebank
- Q1.3 Induce a PCFG in Chomsky Normal Form
- Q1.4 Test CFG Independence Assumptions
- Q2: Building and Evaluating a Simple PCFG Parser
- Q2.1 Build a Parser
- Q2.2 Evaluate the Parser
- Q2.3 Accuracy per Distance
- Q2.4 Accuracy per Label
- Q2.5 Using the Universal Tagset
Question 1: Learning a PCFG from a Treebank
Question 1.1: Random PCFG Generation
Consider the PCFG datatype implementation provided in NLTK.
Consider the module nltk.parse.generate.
It contains a function generate(grammar) which, given a CFG grammar, enumerates the sentences the grammar can produce (up to an optional limit n).
Naturally, a recursive grammar can produce an infinite number of sentences.
import itertools
from nltk.grammar import CFG
from nltk.parse import generate
demo_grammar = """
S -> NP VP
NP -> Det N
PP -> P NP
VP -> 'slept' | 'saw' NP | 'walked' PP
Det -> 'the' | 'a'
N -> 'man' | 'park' | 'dog'
P -> 'in' | 'with'
"""
grammar = CFG.fromstring(demo_grammar)
for n, sent in enumerate(generate.generate(grammar, n=10), 1):
    print('%3d. %s' % (n, ' '.join(sent)))
1. the man slept
2. the man saw the man
3. the man saw the park
4. the man saw the dog
5. the man saw a man
6. the man saw a park
7. the man saw a dog
8. the man walked in the man
9. the man walked in the park
10. the man walked in the dog
Write a new function pcfg_generate(grammar)
which generates a random tree from a PCFG according to its generative
probability model. In other words, your task is to write a method that
samples trees from the distribution defined by the PCFG.
Generation works top down: pick the start symbol of
the grammar, then randomly choose a production starting from the start
symbol, and add the RHS as children of the root node, then expand each
node until you obtain terminals. The result should be a tree and not
a sentence.
The generated parse trees should follow the distributions captured in the PCFG
(that is, the generation process should sample each rule according to its weight,
conditioned on the non-terminal being expanded).
Note: you must generate according to the PCFG distribution - not uniformly at random.
To this end, you should use the method generate()
from the
NLTK ProbDistI interface.
Specifically, you should use the
nltk.probability.DictionaryProbDist
class to represent the probability distribution of mapping an LHS non-terminal to one of the possible RHS.
from nltk.grammar import Nonterminal
from nltk.grammar import toy_pcfg2
from nltk.probability import DictionaryProbDist

# Get all productions with LHS=NP
np_productions = toy_pcfg2.productions(Nonterminal('NP'))
rhs_probs = {}
for pr in np_productions:
    rhs_probs[pr.rhs()] = pr.prob()
np_probDist = DictionaryProbDist(rhs_probs)
# Each call draws a new random sample
print(np_probDist.generate())
(Det, N)
print(np_probDist.generate())
(Name,)
print(np_probDist.generate())
(Name,)
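Putting these pieces together, one possible sketch of pcfg_generate looks as follows. This is not the required implementation, just an illustration of the sampling idea: for each non-terminal, build a DictionaryProbDist over its RHS alternatives, draw one, and recurse. The helper name sample_production is my own choice, not part of NLTK.

```python
from nltk.grammar import Nonterminal, toy_pcfg2
from nltk.probability import DictionaryProbDist
from nltk import Tree

def sample_production(grammar, lhs):
    # Distribution over the RHS alternatives of this LHS; draw one sample.
    rhs_probs = {prod.rhs(): prod.prob() for prod in grammar.productions(lhs)}
    return DictionaryProbDist(rhs_probs).generate()

def pcfg_generate(grammar):
    # Expand top-down from the start symbol: sample an RHS, then expand each
    # non-terminal child recursively; terminals are copied as-is.
    def expand(symbol):
        rhs = sample_production(grammar, symbol)
        children = [expand(s) if isinstance(s, Nonterminal) else s for s in rhs]
        return Tree(symbol.symbol(), children)
    return expand(grammar.start())

tree = pcfg_generate(toy_pcfg2)
print(tree)  # a random tree rooted in S, different on each run
```

Rebuilding a DictionaryProbDist on every expansion is wasteful; caching one distribution per LHS is an easy improvement.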
Function Specification
pcfg_generate(grammar) -- return a tree sampled from the language described by the PCFG grammar
Validation
To validate the generate function, perform the following steps:
- Generate 1,000 random trees using nltk.grammar.toy_pcfg2 - store the resulting trees in a file "toy_pcfg2.gen".
- Compute the frequency distribution of each non-terminal and pre-terminal in the generated corpus.
You can look at the code of nltk.induce_pcfg(start, productions) to see how this can be done.
You should construct one distribution per non-terminal.
- For each distribution, compute the KL-divergence between the MLE estimation of the probability
distribution constructed on your test corpus and toy_pcfg2.
The MLE estimation is obtained by applying the
MLEProbDist estimator
to the observed empirical frequency distribution.
KL-divergence (Kullback-Leibler divergence
also known as relative entropy) estimates the difference between two distributions.
Read these notes on handling diverging KL computations when the supports of the two distributions p and q are not identical.
- Explain your observations.
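For the last two steps, a minimal KL-divergence sketch in plain Python (distributions as dicts mapping events to probabilities) could look like this. The epsilon smoothing for events missing from q is one possible way to keep the sum finite when the supports differ; it is an assumption of this sketch, not the only valid choice:

```python
import math

def kl_divergence(p, q, epsilon=1e-9):
    # D(p || q) = sum over x of p(x) * log2(p(x) / q(x)).
    # Events with q(x) = 0 get a tiny epsilon mass so the sum stays finite
    # when the supports of p and q are not identical.
    total = 0.0
    for x in set(p) | set(q):
        px = p.get(x, 0.0)
        if px == 0.0:
            continue  # 0 * log(0 / q) is taken as 0
        qx = q.get(x, epsilon) or epsilon
        total += px * math.log2(px / qx)
    return total

p = {'a': 0.5, 'b': 0.5}
print(kl_divergence(p, p))            # identical distributions -> 0.0
print(kl_divergence(p, {'a': 1.0}))  # 'b' missing from q -> large but finite
```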
Question 1.2: Learn a PCFG from a Treebank
In this question, we will learn a PCFG from a treebank with different
types of tree annotations. We start from a subset of 200 trees from
the Penn Treebank (which is distributed by NLTK
in the corpus named "treebank" - the NLTK version contains about 4,000 trees).
In this question, we want to induce a PCFG from the treebank and
investigate its properties. For reference, NLTK provides a function
called induce_pcfg
in the nltk.grammar module. It starts from a set of productions and constructs a weighted grammar with the MLE
estimation of the production distributions.
Function specification
At this stage, we learn the PCFG "as is" -- without Chomsky Form Normalization -- but with "simplified" tags. That is, we must use a function to simplify all tags in the tree (non-terminal and terminal):
from nltk.corpus import LazyCorpusLoader, BracketParseCorpusReader
from nltk.corpus import LazyCorpusLoader, BracketParseCorpusReader

def simplify_functional_tag(tag):
    if '-' in tag:
        tag = tag.split('-')[0]
    return tag

treebank = LazyCorpusLoader('treebank/combined', BracketParseCorpusReader, r'wsj_.*\.mrg')
print(treebank.parsed_sents()[:1])
The printed tree shows that we need to transform the trees to remove -NONE- tags and simplify the remaining tags.
We want all tags of the form "NP-SBJ-2" to be simplified into "NP". We also want to filter out "-NONE-" elements from the treebank.
You must find the best way to perform this filtering on the trees. (-NONE- elements appear in the original Penn Treebank, for example, in the case of relative clauses.)
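One possible way to filter -NONE- subtrees (a sketch, not the required solution) is a recursive copy that drops -NONE- nodes and prunes any interior node left without children as a result:

```python
from nltk import Tree

def filter_none(tree):
    # Return a copy of tree without -NONE- subtrees; prune interior nodes
    # that end up empty (e.g. an NP dominating only a trace).
    if not isinstance(tree, Tree):
        return tree
    if tree.label() == '-NONE-':
        return None
    children = [c for c in (filter_none(child) for child in tree) if c is not None]
    if not children:
        return None
    return Tree(tree.label(), children)

t = Tree.fromstring("(S (NP-SBJ (-NONE- *T*-1)) (VP (VBD fell)))")
print(filter_none(t))  # -> (S (VP (VBD fell)))
```

Note that the simplification of functional tags must be applied after (or independently of) this filtering, since simplify_functional_tag would turn "-NONE-" into the empty string.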
pcfg_learn(treebank, n)
-- treebank is the nltk.corpus.treebank lazy corpus reader
-- n indicates the number of trees to read
-- return an nltk.PCFG instance
Use the following code as a starting point:
from nltk.grammar import Production
from nltk import Tree, Nonterminal

def get_tag(tree):
    if isinstance(tree, Tree):
        return Nonterminal(simplify_functional_tag(tree.label()))
    else:
        return tree

def tree_to_production(tree):
    return Production(get_tag(tree), [get_tag(child) for child in tree])

def tree_to_productions(tree):
    yield tree_to_production(tree)
    for child in tree:
        if isinstance(child, Tree):
            for prod in tree_to_productions(child):
                yield prod
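Building on that starting point, a self-contained sketch of the induction step might look like this (the helpers are repeated so the snippet runs on its own; for brevity it takes a list of trees rather than the (treebank, n) pair, and -NONE- filtering is omitted):

```python
from nltk import Tree, Nonterminal, induce_pcfg
from nltk.grammar import Production

def simplify_functional_tag(tag):
    return tag.split('-')[0] if '-' in tag else tag

def get_tag(t):
    return Nonterminal(simplify_functional_tag(t.label())) if isinstance(t, Tree) else t

def tree_to_productions(tree):
    # One production per interior node, depth-first.
    yield Production(get_tag(tree), [get_tag(child) for child in tree])
    for child in tree:
        if isinstance(child, Tree):
            yield from tree_to_productions(child)

def pcfg_learn(trees):
    # Collect every production, then let induce_pcfg compute the MLE weights.
    # With the real corpus, pass treebank.parsed_sents()[:n].
    productions = [p for t in trees for p in tree_to_productions(t)]
    return induce_pcfg(Nonterminal('S'), productions)

trees = [Tree.fromstring("(S (NP-SBJ (NNP John)) (VP (VBD slept)))")]
grammar = pcfg_learn(trees)
print(grammar)  # 5 productions, e.g. S -> NP VP [1.0]
```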
Data Exploration and Validation
Explore the following properties of the learned PCFG:
- How many productions are learned from the trees? How many interior nodes were in the treebank?
- Draw a plot of the distribution of productions according to their frequency (number of rules with frequency 1, 2, ...). What do you observe?
- If you induce a PCFG from 200 trees, and another one from 400 trees - compare the distribution of the rules you learn.
Question 1.3: Induce a PCFG in Chomsky Normal Form
We now want to learn a PCFG in Chomsky Normal Form from the treebank, with simplified tags, and with filtered NONE elements.
The strategy is to convert the trees in the treebank into CNF, then to induce the PCFG from the transformed trees.
Use the function chomsky_normal_form in nltk.treetransforms
to convert the trees. Pay attention to NONE filtering. Use horizontal Markov annotation and parent annotation:
chomsky_normal_form(tree, factor='right', horzMarkov=1, vertMarkov=1, childChar='|', parentChar='^')
pcfg_cnf_learn(treebank, n)
-- treebank is the nltk.corpus.treebank lazy corpus reader (simplified tags)
-- n indicates the number of trees to read
-- return an nltk.PCFG in CNF
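A possible sketch of this function (again taking a list of trees for brevity, and leaving out the simplification and -NONE- filtering you must perform on the real treebank):

```python
from nltk import Tree, Nonterminal, induce_pcfg
from nltk.treetransforms import chomsky_normal_form

def pcfg_cnf_learn(trees):
    # Convert a copy of each tree to CNF with horizontal Markov and parent
    # annotation, then induce the PCFG from the transformed trees.
    productions = []
    for t in trees:
        t = t.copy(deep=True)  # chomsky_normal_form mutates the tree in place
        chomsky_normal_form(t, factor='right', horzMarkov=1, vertMarkov=1,
                            childChar='|', parentChar='^')
        productions += t.productions()
    return induce_pcfg(Nonterminal('S'), productions)

trees = [Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)) (. .))")]
grammar = pcfg_cnf_learn(trees)
for p in grammar.productions():
    print(p)  # binarized rules with labels such as NP^<S> and S|<...>
```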
- How many productions are learned from the CNF trees? How many interior nodes were in the original treebank, and in the CNF treebank?
- Compare the CNF and non-CNF figures (ratios). What do you conclude?
Question 1.4: Test CFG Independence Assumptions
We want to test the CFG hypothesis that a node expansion is independent from its location within a tree.
For the treebank before CNF transformation, report the following statistics, for the expansions of the NP category:
- Compute the distribution of the RHS given the NP LHS for all NPs in the treebank. Draw a histogram plot.
- Compute the same distribution only for NPs that appear directly below a S node (any variant of S in the Penn treebank tagset). Draw a histogram plot.
- Compute the same distribution only for NPs that appear directly below a VP node (any variant of VP in the Penn Treebank tagset). Draw a histogram plot.
- Compare the distributions above. Use KL-divergence.
- Conclude: does the data in the treebank confirm the CFG hypothesis for each configuration (original trees, annotated CNF trees)? (How do you calibrate the values of the KL-divergence to decide that 2 values are similar?)
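One way to collect these conditional distributions is to walk each tree while remembering the parent label; a sketch (the function name is my own, and on the real treebank you would match any variant of S or VP after tag simplification):

```python
from nltk import Tree
from nltk.probability import FreqDist, MLEProbDist

def np_rhs_distribution(trees, parent=None):
    # MLE distribution over the RHS of NP nodes; if parent is given,
    # restrict to NPs whose parent has that label.
    fd = FreqDist()
    def walk(t, parent_label):
        if not isinstance(t, Tree):
            return
        if t.label() == 'NP' and (parent is None or parent_label == parent):
            rhs = tuple(c.label() if isinstance(c, Tree) else c for c in t)
            fd[rhs] += 1
        for child in t:
            walk(child, t.label())
    for t in trees:
        walk(t, None)
    return MLEProbDist(fd)

trees = [Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (NNP John))))")]
dist = np_rhs_distribution(trees, parent='S')
print(dist.prob(('DT', 'NN')))  # only the subject NP is counted -> 1.0
```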
Question 2: Building and Evaluating a Simple PCFG Parser
In this question, we will construct a Viterbi parser for the PCFG induced in question 1
and perform evaluation of this statistical parser.
Question 2.1: Build a Parser
Split the NLTK treebank corpus into 80% training (about 3,200 trees) and 20% (about 800 trees) testing sets.
Learn a PCFG over the Chomsky Normal Form version of this treebank.
Construct a ViterbiParser using this PCFG grammar.
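The overall pipeline can be sketched as follows, here with a two-tree toy stand-in for the treebank (with the real corpus you would split treebank.parsed_sents() 80/20 and apply the CNF transformation from Q1.3 first):

```python
from nltk import Tree, Nonterminal, induce_pcfg
from nltk.parse import ViterbiParser

# Toy stand-in for the training portion of the treebank.
train = [Tree.fromstring("(S (NP (NNP John)) (VP (VBD slept)))"),
         Tree.fromstring("(S (NP (NNP Mary)) (VP (VBD slept)))")]

# Induce the PCFG from the training trees' productions.
productions = [p for t in train for p in t.productions()]
grammar = induce_pcfg(Nonterminal('S'), productions)

# Build the Viterbi (probabilistic CKY) parser from the grammar.
parser = ViterbiParser(grammar)
for tree in parser.parse(['John', 'slept']):
    print(tree)  # the most probable parse, with its probability
```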
Question 2.2: Evaluate the Parser
Your task is to compute the ParsEval metric for the constituent-based parse trees produced by your parser on our treebank.
Report the following metrics:
- Precision
- Recall
- F-measure
- Labelled Precision
- Labelled Recall
- Labelled F-Measure
Where precision and recall are computed over "constituents" between
the parsed trees and the treebank trees: a constituent is a triplet
(interior-node, first-index, last-index) - where first-index is the
index of the first word in the sentence covered by the interior node,
and last-index that of the last word (that is, [first-index,
last-index] is the yield of the interior node).
In labelled precision and recall, two constituents match only if they also
have the same label on the interior node. For unlabelled metrics,
we compare only the first and last indexes. Make sure you compare the trees from
the treebank after they have been simplified: for example, S-TPC-1 must be compared with S,
ADVP-TMP with ADVP, and -NONE- nodes must have been eliminated.
You can read more on this metric in the following references:
- An assignment on parsing evaluation by Scott Farrar includes detailed explanation and examples on how the metric is computed.
- Evalb software: this is the C program used in all standard evaluations of constituent parsers (by Satoshi Sekine and Michael Collins).
- Evalb in Java for those who prefer reading Java over C - is part of the Stanford CoreNLP package.
- Evalb-GUI in Java: in addition to computing the metrics, this package shows the trees that are compared and highlights
graphically the differences in each pair.
Question 2.3: Accuracy per Distance
In general, parsers do a much better job on short constituents than
long ones. Draw a plot of the accuracy of constituents per
constituent length. The length of a constituent (node, first, last)
is last-first+1. For a given constituent length X, accuracy is the
number of constituents of length X in the parsed tree that are
accurate divided by the total number of constituents of length X.
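Given constituent sets in the (label, first, last) form of Q2.2, the per-length bookkeeping is a few lines (a sketch; the function name is my own):

```python
from collections import defaultdict

def accuracy_per_length(gold_constituents, parsed_constituents):
    # Both arguments are sets of (label, first, last) triplets.
    # Accuracy for length X = correct parsed constituents of span X / all
    # parsed constituents of span X.
    total, correct = defaultdict(int), defaultdict(int)
    for c in parsed_constituents:
        _, first, last = c
        length = last - first + 1
        total[length] += 1
        if c in gold_constituents:
            correct[length] += 1
    return {x: correct[x] / total[x] for x in total}

gold = {('NP', 0, 1), ('VP', 2, 2), ('S', 0, 2)}
parsed = {('NP', 0, 1), ('VP', 1, 2), ('S', 0, 2)}
print(accuracy_per_length(gold, parsed))  # -> {2: 0.5, 3: 1.0}
```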
Question 2.4: Accuracy per Label
Report accuracy of constituents per label type (S, SBAR, NP, VP, PP etc).
For each node type, report number of occurrences and accuracy.
Note: report accuracy per node type WITHOUT Chomsky Normal Form modification.
For example, if the CNF of the tree generated a non-terminal NP^S, we should count this as NP
for this question.
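Mapping a CNF-decorated label back to its original category amounts to stripping everything after the childChar and parentChar markers used in the transformation, e.g.:

```python
def original_label(label, childChar='|', parentChar='^'):
    # Strip the decorations added by chomsky_normal_form: "NP^<S>" and
    # "VP|<VBD-NP>" both map back to their original category.
    return label.split(childChar)[0].split(parentChar)[0]

print(original_label('NP^<S>'))       # -> NP
print(original_label('VP|<VBD-NP>'))  # -> VP
print(original_label('NP'))           # -> NP
```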
Question 2.5: Using the Universal Tagset [Optional]
Create a variant of the Viterbi parser induced from the same treebank, but using the Universal Tagset for
pre-terminals (the nodes just above the terminal nodes). Similarly, suggest a way to simplify the set of non-terminals used
for interior nodes based on the universal tagset, so that the treebank uses fewer non-terminal tags altogether.
Report on the number of rules induced with your new annotation scheme before and after CNF transformation.
Use parent and Markov annotations (chomsky_normal_form(tree, factor='right', horzMarkov=2, vertMarkov=2, childChar='|', parentChar='^')).
Report the accuracy of the learned Viterbi parser with the Universal annotations.
Compare your results with the data provided in Improved PCKY
(by Scott Farrar).
Last modified 28 Dec 2015