Assignment 3
Due: Tue 11 Jan 2016 Midnight
Natural Language Processing - Fall 2016 Michael Elhadad
This assignment covers the topic of syntactic parsing.
Submit your solution in the form of an iPython notebook (ipynb) file.
The following notes on PCFG in NLTK (notebook source) contain useful code to start this assignment.
Content
- Q1: Learning a PCFG from a Treebank
- Q1.1 Random PCFG Generation
- Q1.2 Learn a PCFG from a Treebank
- Q1.3 Induce a PCFG in Chomsky Normal Form
- Q1.4 Test CFG Independence Assumptions
- Q2: Building and Evaluating a Simple PCFG Parser
- Q2.1 Build a Parser
- Q2.2 Evaluate the Parser
- Q2.3 Accuracy per Distance
- Q2.4 Accuracy per Label
- Q2.5 Using the Universal Tagset
Question 1: Learning a PCFG from a Treebank
Question 1.1: Random PCFG Generation
Consider the PCFG datatype implementation provided in NLTK.
Consider the module nltk.parse.generate.
It contains a function generate(grammar) which, given a CFG grammar, enumerates the sentences the grammar can produce (up to an optional limit n).
Naturally, a recursive grammar can produce an infinite number of sentences.
import itertools
from nltk.grammar import CFG
from nltk.parse import generate
demo_grammar = """
S -> NP VP
NP -> Det N
PP -> P NP
VP -> 'slept' | 'saw' NP | 'walked' PP
Det -> 'the' | 'a'
N -> 'man' | 'park' | 'dog'
P -> 'in' | 'with'
"""
grammar = CFG.fromstring(demo_grammar)
for n, sent in enumerate(generate.generate(grammar, n=10), 1):
    print('%3d. %s' % (n, ' '.join(sent)))
1. the man slept
2. the man saw the man
3. the man saw the park
4. the man saw the dog
5. the man saw a man
6. the man saw a park
7. the man saw a dog
8. the man walked in the man
9. the man walked in the park
10. the man walked in the dog
Write a new function pcfg_generate(grammar)
which generates a random tree from a PCFG according to its generative
probability model. In other words, your task is to write a method that
samples trees from the distribution defined by the PCFG.
Generation works top down: pick the start symbol of
the grammar, then randomly choose a production starting from the start
symbol, and add the RHS as children of the root node, then expand each
node until you obtain terminals. The result should be a tree and not
a sentence.
The generated parse trees should follow the distributions captured in the PCFG
(that is, the generation process should sample each rule according to its weight,
conditioned on the non-terminal being expanded).
Note: you must generate according to the PCFG distribution - not uniformly at random.
To this end, you should use the method generate()
from the
NLTK ProbDistI interface.
Specifically, you should use the
nltk.probability.DictionaryProbDist
class to represent the probability distribution of mapping an LHS non-terminal to one of the possible RHS.
from nltk.grammar import Nonterminal
from nltk.grammar import toy_pcfg2
from nltk.probability import DictionaryProbDist

# Get all productions with LHS=NP
np_productions = toy_pcfg2.productions(Nonterminal('NP'))
rhs_probs = {}
for pr in np_productions:
    rhs_probs[pr.rhs()] = pr.prob()
np_probDist = DictionaryProbDist(rhs_probs)
# Each call draws a new random sample
print(np_probDist.generate())
(Det, N)
print(np_probDist.generate())
(Name,)
print(np_probDist.generate())
(Name,)
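Putting these pieces together, one possible sketch of pcfg_generate looks as follows. This is not the required implementation, just an illustration of the sampling idea: for each non-terminal, build a DictionaryProbDist over its RHS alternatives, draw one, and recurse. The helper name sample_production is my own choice, not part of NLTK.

```python
from nltk.grammar import Nonterminal, toy_pcfg2
from nltk.probability import DictionaryProbDist
from nltk import Tree

def sample_production(grammar, lhs):
    # Distribution over the RHS alternatives of this LHS; draw one sample.
    rhs_probs = {prod.rhs(): prod.prob() for prod in grammar.productions(lhs)}
    return DictionaryProbDist(rhs_probs).generate()

def pcfg_generate(grammar):
    # Expand top-down from the start symbol: sample an RHS, then expand each
    # non-terminal child recursively; terminals are copied as-is.
    def expand(symbol):
        rhs = sample_production(grammar, symbol)
        children = [expand(s) if isinstance(s, Nonterminal) else s for s in rhs]
        return Tree(symbol.symbol(), children)
    return expand(grammar.start())

tree = pcfg_generate(toy_pcfg2)
print(tree)  # a random tree rooted in S, different on each run
```

Rebuilding a DictionaryProbDist on every expansion is wasteful; caching one distribution per LHS is an easy improvement.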
Function Specification
pcfg_generate(grammar) -- return a tree sampled from the language described by the PCFG grammar
Validation
To validate the generate function, perform the following steps:
- Generate 1,000 random trees using nltk.grammar.toy_pcfg2 - store the resulting trees in a file "toy_pcfg2.gen".
- Compute the frequency distribution of each non-terminal and pre-terminal in the generated corpus.
You can look at the code of nltk.induce_pcfg(start, productions) to see how this can be done.
You should construct one distribution per non-terminal.
- For each distribution, compute the KL-divergence between the MLE estimation of the probability
distribution constructed on your test corpus and toy_pcfg2.
The MLE estimation is obtained by applying the
MLEProbDist estimator
to the observed empirical frequency distribution.
KL-divergence (Kullback-Leibler divergence
also known as relative entropy) estimates the difference between two distributions.
Read these notes on handling diverging KL computations when the supports of the two distributions p and q are not identical.
- Explain your observations.
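For the last two steps, a minimal KL-divergence sketch in plain Python (distributions as dicts mapping events to probabilities) could look like this. The epsilon smoothing for events missing from q is one possible way to keep the sum finite when the supports differ; it is an assumption of this sketch, not the only valid choice:

```python
import math

def kl_divergence(p, q, epsilon=1e-9):
    # D(p || q) = sum over x of p(x) * log2(p(x) / q(x)).
    # Events with q(x) = 0 get a tiny epsilon mass so the sum stays finite
    # when the supports of p and q are not identical.
    total = 0.0
    for x in set(p) | set(q):
        px = p.get(x, 0.0)
        if px == 0.0:
            continue  # 0 * log(0 / q) is taken as 0
        qx = q.get(x, epsilon) or epsilon
        total += px * math.log2(px / qx)
    return total

p = {'a': 0.5, 'b': 0.5}
print(kl_divergence(p, p))            # identical distributions -> 0.0
print(kl_divergence(p, {'a': 1.0}))  # 'b' missing from q -> large but finite
```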
Question 1.2: Learn a PCFG from a Treebank
In this question, we will learn a PCFG from a treebank with different
types of tree annotations. We start from a subset of 200 trees from
the Penn Treebank (which is distributed by NLTK
in the corpus named "treebank" - the NLTK version contains about 4,000 trees).
In this question, we want to induce a PCFG from the treebank and
investigate its properties. For reference, NLTK provides a function
called induce_pcfg
in the nltk.grammar module. It starts from a set of productions and constructs a weighted grammar with the MLE
estimation of the production distributions.
Function specification
At this stage, we learn the PCFG "as is" -- without Chomsky Form Normalization -- but with "simplified" tags. That is, we must use a function to simplify all tags in the tree (non-terminal and terminal):
from nltk.corpus import LazyCorpusLoader, BracketParseCorpusReader
from nltk.corpus import LazyCorpusLoader, BracketParseCorpusReader

def simplify_functional_tag(tag):
    if '-' in tag:
        tag = tag.split('-')[0]
    return tag

treebank = LazyCorpusLoader('treebank/combined', BracketParseCorpusReader, r'wsj_.*\.mrg')
print(treebank.parsed_sents()[:1])
The printed tree shows that we need to transform the trees to remove -NONE- tags and simplify the remaining tags.
We want all tags of the form "NP-SBJ-2" to be simplified into "NP". We also want to filter out "-NONE-" elements from the treebank.
You must find the best way to perform this filtering on the trees. (-NONE- elements appear in the original Penn Treebank, for example, in the case of relative clauses.)
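One possible way to filter -NONE- subtrees (a sketch, not the required solution) is a recursive copy that drops -NONE- nodes and prunes any interior node left without children as a result:

```python
from nltk import Tree

def filter_none(tree):
    # Return a copy of tree without -NONE- subtrees; prune interior nodes
    # that end up empty (e.g. an NP dominating only a trace).
    if not isinstance(tree, Tree):
        return tree
    if tree.label() == '-NONE-':
        return None
    children = [c for c in (filter_none(child) for child in tree) if c is not None]
    if not children:
        return None
    return Tree(tree.label(), children)

t = Tree.fromstring("(S (NP-SBJ (-NONE- *T*-1)) (VP (VBD fell)))")
print(filter_none(t))  # -> (S (VP (VBD fell)))
```

Note that the simplification of functional tags must be applied after (or independently of) this filtering, since simplify_functional_tag would turn "-NONE-" into the empty string.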
pcfg_learn(treebank, n)
-- treebank is the nltk.corpus.treebank lazy corpus reader
-- n indicates the number of trees to read
-- return an nltk.PCFG instance
Use the following code as a starting point:
from nltk.grammar import Production
from nltk import Tree, Nonterminal

def get_tag(tree):
    if isinstance(tree, Tree):
        return Nonterminal(simplify_functional_tag(tree.label()))
    else:
        return tree

def tree_to_production(tree):
    return Production(get_tag(tree), [get_tag(child) for child in tree])

def tree_to_productions(tree):
    yield tree_to_production(tree)
    for child in tree:
        if isinstance(child, Tree):
            for prod in tree_to_productions(child):
                yield prod
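Building on that starting point, a self-contained sketch of the induction step might look like this (the helpers are repeated so the snippet runs on its own; for brevity it takes a list of trees rather than the (treebank, n) pair, and -NONE- filtering is omitted):

```python
from nltk import Tree, Nonterminal, induce_pcfg
from nltk.grammar import Production

def simplify_functional_tag(tag):
    return tag.split('-')[0] if '-' in tag else tag

def get_tag(t):
    return Nonterminal(simplify_functional_tag(t.label())) if isinstance(t, Tree) else t

def tree_to_productions(tree):
    # One production per interior node, depth-first.
    yield Production(get_tag(tree), [get_tag(child) for child in tree])
    for child in tree:
        if isinstance(child, Tree):
            yield from tree_to_productions(child)

def pcfg_learn(trees):
    # Collect every production, then let induce_pcfg compute the MLE weights.
    # With the real corpus, pass treebank.parsed_sents()[:n].
    productions = [p for t in trees for p in tree_to_productions(t)]
    return induce_pcfg(Nonterminal('S'), productions)

trees = [Tree.fromstring("(S (NP-SBJ (NNP John)) (VP (VBD slept)))")]
grammar = pcfg_learn(trees)
print(grammar)  # 5 productions, e.g. S -> NP VP [1.0]
```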
Data Exploration and Validation
Explore the following properties of the learned PCFG:
- How many productions are learned from the trees? How many interior nodes were in the treebank?
- Draw a plot of the distribution of productions according to their frequency (number of rules with frequency 1, 2, ...). What do you observe?
- If you induce a PCFG from 200 trees, and another one from 400 trees - compare the distribution of the rules you learn.
Question 1.3: Induce a PCFG in Chomsky Normal Form
We now want to learn a PCFG in Chomsky Normal Form from the treebank, with simplified tags, and with filtered NONE elements.
The strategy is to convert the trees in the treebank into CNF, then to induce the PCFG from the transformed trees.
Use the function chomsky_normal_form in nltk.treetransforms
to convert the trees. Pay attention to NONE filtering. Use horizontal Markov annotation and parent annotation:
chomsky_normal_form(tree, factor='right', horzMarkov=1, vertMarkov=1, childChar='|', parentChar='^')
pcfg_cnf_learn(treebank, n)
-- treebank is the nltk.corpus.treebank lazy corpus reader (simplified tags)
-- n indicates the number of trees to read
-- return an nltk.PCFG in CNF
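A possible sketch of this function (again taking a list of trees for brevity, and leaving out the simplification and -NONE- filtering you must perform on the real treebank):

```python
from nltk import Tree, Nonterminal, induce_pcfg
from nltk.treetransforms import chomsky_normal_form

def pcfg_cnf_learn(trees):
    # Convert a copy of each tree to CNF with horizontal Markov and parent
    # annotation, then induce the PCFG from the transformed trees.
    productions = []
    for t in trees:
        t = t.copy(deep=True)  # chomsky_normal_form mutates the tree in place
        chomsky_normal_form(t, factor='right', horzMarkov=1, vertMarkov=1,
                            childChar='|', parentChar='^')
        productions += t.productions()
    return induce_pcfg(Nonterminal('S'), productions)

trees = [Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD barked)) (. .))")]
grammar = pcfg_cnf_learn(trees)
for p in grammar.productions():
    print(p)  # binarized rules with labels such as NP^<S> and S|<...>
```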
- How many productions are learned from the CNF trees? How many interior nodes were in the original treebank, and in the CNF treebank?
- Compare the CNF and non-CNF figures (ratios). What do you conclude?
Question 1.4: Test CFG Independence Assumptions
We want to test the CFG hypothesis that a node expansion is independent from its location within a tree.
For the treebank before CNF transformation, report the following statistics, for the expansions of the NP category:
- Compute the distribution of the RHS given the NP LHS for all NPs in the treebank. Draw a histogram plot.
- Compute the same distribution only for NPs that appear directly below a S node (any variant of S in the Penn treebank tagset). Draw a histogram plot.
- Compute the same distribution only for NPs that appear directly below a VP node (any variant of VP in the Penn Treebank tagset). Draw a histogram plot.
- Compare the distributions above. Use KL-divergence.
- Conclude: does the data in the treebank confirm the CFG hypothesis for each configuration (original trees, annotated CNF trees)? (How do you calibrate the values of the KL-divergence to decide that 2 values are similar?)
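One way to collect these conditional distributions is to walk each tree while remembering the parent label; a sketch (the function name is my own, and on the real treebank you would match any variant of S or VP after tag simplification):

```python
from nltk import Tree
from nltk.probability import FreqDist, MLEProbDist

def np_rhs_distribution(trees, parent=None):
    # MLE distribution over the RHS of NP nodes; if parent is given,
    # restrict to NPs whose parent has that label.
    fd = FreqDist()
    def walk(t, parent_label):
        if not isinstance(t, Tree):
            return
        if t.label() == 'NP' and (parent is None or parent_label == parent):
            rhs = tuple(c.label() if isinstance(c, Tree) else c for c in t)
            fd[rhs] += 1
        for child in t:
            walk(child, t.label())
    for t in trees:
        walk(t, None)
    return MLEProbDist(fd)

trees = [Tree.fromstring("(S (NP (DT the) (NN dog)) (VP (VBD saw) (NP (NNP John))))")]
dist = np_rhs_distribution(trees, parent='S')
print(dist.prob(('DT', 'NN')))  # only the subject NP is counted -> 1.0
```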
Question 2: Building and Evaluating a Simple PCFG Parser
In this question, we will construct a Viterbi parser for the PCFG induced in question 1
and perform evaluation of this statistical parser.
Question 2.1: Build a Parser
Split the NLTK treebank corpus into 80% training (about 3,200 trees) and 20% (about 800 trees) testing sets.
Learn a PCFG over the Chomsky Normal Form version of this treebank.
Construct a ViterbiParser using this PCFG grammar.
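The overall pipeline can be sketched as follows, here with a two-tree toy stand-in for the treebank (with the real corpus you would split treebank.parsed_sents() 80/20 and apply the CNF transformation from Q1.3 first):

```python
from nltk import Tree, Nonterminal, induce_pcfg
from nltk.parse import ViterbiParser

# Toy stand-in for the training portion of the treebank.
train = [Tree.fromstring("(S (NP (NNP John)) (VP (VBD slept)))"),
         Tree.fromstring("(S (NP (NNP Mary)) (VP (VBD slept)))")]

# Induce the PCFG from the training trees' productions.
productions = [p for t in train for p in t.productions()]
grammar = induce_pcfg(Nonterminal('S'), productions)

# Build the Viterbi (probabilistic CKY) parser from the grammar.
parser = ViterbiParser(grammar)
for tree in parser.parse(['John', 'slept']):
    print(tree)  # the most probable parse, with its probability
```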
Question 2.2: Evaluate the Parser
Your task is to compute the ParsEval metric for the constituent-based parse trees produced by your parser on our treebank.
Report the following metrics:
- Precision
- Recall
- F-measure
- Labelled Precision
- Labelled Recall
- Labelled F-Measure
Where precision and recall are computed over "constituents" between
the parsed trees and the treebank trees: a constituent is a triplet
(interior-node, first-index, last-index) - where first-index is the
index of the first word in the sentence covered by the interior node,
and last-index that of the last word (that is, [first-index,
last-index] is the yield of the interior node).
In labelled precision and recall, two constituents match only if they also
have the same label on the interior node. For unlabelled metrics,
we compare only the first and last indexes. Make sure you compare the trees from
the treebank after they have been simplified: for example, S-TPC-1 must be compared with S,
ADVP-TMP with ADVP, and -NONE- nodes must have been eliminated.
You can read more on this metric in the following references:
- An assignment on parsing evaluation by Scott Farrar includes detailed explanation and examples on how the metric is computed.
- Evalb software: this is the C program used in all standard evaluations of constituent parsers (by Satoshi Sekine and Michael Collins).
- Evalb in Java for those who prefer reading Java over C - is part of the Stanford CoreNLP package.
- Evalb-GUI in Java: in addition to computing the metrics, this package shows the trees that are compared and highlights
graphically the differences in each pair.
Question 2.3: Accuracy per Distance
In general, parsers do a much better job on short constituents than
long ones. Draw a plot of the accuracy of constituents per
constituent length. The length of a constituent (node, first, last)
is last-first+1. For a given constituent length X, accuracy is the
number of constituents of length X in the parsed tree that are
accurate divided by the total number of constituents of length X.
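Given constituent sets in the (label, first, last) form of Q2.2, the per-length bookkeeping is a few lines (a sketch; the function name is my own):

```python
from collections import defaultdict

def accuracy_per_length(gold_constituents, parsed_constituents):
    # Both arguments are sets of (label, first, last) triplets.
    # Accuracy for length X = correct parsed constituents of span X / all
    # parsed constituents of span X.
    total, correct = defaultdict(int), defaultdict(int)
    for c in parsed_constituents:
        _, first, last = c
        length = last - first + 1
        total[length] += 1
        if c in gold_constituents:
            correct[length] += 1
    return {x: correct[x] / total[x] for x in total}

gold = {('NP', 0, 1), ('VP', 2, 2), ('S', 0, 2)}
parsed = {('NP', 0, 1), ('VP', 1, 2), ('S', 0, 2)}
print(accuracy_per_length(gold, parsed))  # -> {2: 0.5, 3: 1.0}
```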
Question 2.4: Accuracy per Label
Report accuracy of constituents per label type (S, SBAR, NP, VP, PP etc).
For each node type, report number of occurrences and accuracy.
Note: report accuracy per node type WITHOUT Chomsky Normal Form modification.
For example, if the CNF of the tree generated a non-terminal NP^S, we should count this as NP
for this question.
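Mapping a CNF-decorated label back to its original category amounts to stripping everything after the childChar and parentChar markers used in the transformation, e.g.:

```python
def original_label(label, childChar='|', parentChar='^'):
    # Strip the decorations added by chomsky_normal_form: "NP^<S>" and
    # "VP|<VBD-NP>" both map back to their original category.
    return label.split(childChar)[0].split(parentChar)[0]

print(original_label('NP^<S>'))       # -> NP
print(original_label('VP|<VBD-NP>'))  # -> VP
print(original_label('NP'))           # -> NP
```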
Question 2.5: Using the Universal Tagset [Optional]
Create a variant of the Viterbi parser induced from the same treebank, but using the Universal Tagset for
pre-terminals (the nodes just above the terminal nodes). Similarly, suggest a way to simplify the set of non-terminals used
for interior nodes based on the universal tagset, so that the treebank uses fewer non-terminal tags altogether.
Report on the number of rules induced with your new annotation scheme before and after CNF transformation.
Use parent and Markov annotations (chomsky_normal_form(tree, factor='right', horzMarkov=2, vertMarkov=2, childChar='|', parentChar='^')).
Report the accuracy of the learned Viterbi parser with the Universal annotations.
Compare your results with the data provided in Improved PCKY
(by Scott Farrar).
Last modified 28 Dec 2015