Assignment 2
Due: Monday 01 May 2011 Midnight
Natural Language Processing - Spring 2011 Michael Elhadad
This assignment investigates context-free grammars (CFGs), probabilistic CFGs (PCFGs) and ngram language models.
It uses a small sample of parsed trees from the Penn Treebank, obtained from the NLTK
distribution.
For all questions, you should use the existing NLTK code covering these topics; the NLTK API documentation describes all the classes you should need.
This assignment covers the following topics:
- Question 1: Generate from a PCFG
- Question 2: Learn a PCFG from a Treebank
- Question 3: Test CFG Independence Assumptions
- Question 4: Learn a bigram Language Model
- Question 5: Validation of PCFG Generated text using an n-gram Model
- Question 6: Recognize subjects in constituent trees
Submit your solution in the form of an HTML file, using the same CSS as this page
with code inside <pre> tags. Images should be submitted as PNG files.
Question 1: Random Generation from PCFGs
Consider the PCFG datatype implementation provided in NLTK.
Consider the module nltk.parse.generate.
It contains a function generate(grammar) which, given a CFG, enumerates the sentences the grammar can produce.
Naturally, this enumeration may be infinite if the grammar is recursive.
>>> from nltk.parse import generate
>>> for sentence in generate.generate(grammar):
...     print ' '.join(sentence)
the man saw the man
the man saw the park
the man saw the dog
the man saw the telescope
the man saw a man
the man saw a park
the man saw a dog
the man saw a telescope
the man walked the man
the man walked the park
the man walked the dog
the man walked the telescope
...
Write a new function pcfg_generate(grammar)
which will generate a random
tree from a PCFG. Generation works top down: pick the start symbol of the grammar,
then randomly choose a production starting from the start symbol, and add the RHS as children
of the root node, then expand each node until you obtain terminals. The result should be a tree
and not a sentence.
The generated parse trees should exhibit the distributions over rule
expansions captured in the PCFG (that is, the generation process should
sample each non-terminal's rules according to their weights in the grammar).
Note: you must generate according to the PCFG distribution - not just randomly.
To this end, you should use the method generate()
from the
NLTK ProbDistI interface.
Specifically, you should use the
nltk.probability.DictionaryProbDist
class to represent the probability distribution of mapping an LHS non-terminal to one of the possible RHS.
>>> from nltk.grammar import Nonterminal, toy_pcfg2
>>> from nltk.probability import DictionaryProbDist
>>> productions = toy_pcfg2.productions()
# Get all productions with LHS=NP
>>> np_productions = toy_pcfg2.productions(Nonterminal('NP'))
>>> np_productions
[NP -> Det N [0.41], NP -> Name [0.28], NP -> NP PP [0.31]]
>>> rhs_probs = {}
>>> for pr in np_productions: rhs_probs[pr.rhs()] = pr.prob()
>>> np_probDist = DictionaryProbDist(rhs_probs)
# Each call to generate() draws a random RHS sample from np_probDist
>>> np_probDist.generate()
(Det, N)
>>> np_probDist.generate()
(Name,)
>>> np_probDist.generate()
(Name,)
>>> np_probDist.generate()
(Name,)
>>> np_probDist.generate()
(NP, PP)
>>> np_probDist.generate()
(Det, N)
Function specification
pcfg_generate(grammar) -- return a tree sampled from the language described by the grammar
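The top-down sampling procedure can be sketched as follows. This is a minimal illustration using plain Python dictionaries in place of NLTK's PCFG and DictionaryProbDist classes; the toy grammar, the tuple-based tree representation, and all helper names are illustrative assumptions, not part of the required API.

```python
import random

# Toy PCFG as a dict: LHS -> list of (RHS, probability) pairs.
# Symbols absent from the dict's keys are treated as terminals.
toy_pcfg = {
    'S':    [(('NP', 'VP'), 1.0)],
    'NP':   [(('Det', 'N'), 0.5), (('Name',), 0.5)],
    'VP':   [(('V', 'NP'), 1.0)],
    'Det':  [(('the',), 1.0)],
    'N':    [(('dog',), 0.5), (('man',), 0.5)],
    'Name': [(('John',), 1.0)],
    'V':    [(('saw',), 1.0)],
}

def sample_rhs(grammar, lhs):
    """Draw an RHS for lhs according to the rule probabilities."""
    r = random.random()
    total = 0.0
    for rhs, prob in grammar[lhs]:
        total += prob
        if r <= total:
            return rhs
    return grammar[lhs][-1][0]   # guard against floating-point rounding

def pcfg_generate(grammar, symbol='S'):
    """Return a tree as a nested (label, child, child, ...) tuple,
    expanding each non-terminal top down until terminals are reached."""
    if symbol not in grammar:    # terminal: return the word itself
        return symbol
    rhs = sample_rhs(grammar, symbol)
    return (symbol,) + tuple(pcfg_generate(grammar, s) for s in rhs)
```

Note that the recursion stops at terminals, so the result is a full parse tree rather than a sentence; the sentence is recovered by reading off the leaves.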
Validation
To validate the generate function, perform the following steps:
- Generate 1,000 random trees using nltk.grammar.toy_pcfg2
- Compute the frequency distribution of each non-terminal and pre-terminal in the generated corpus.
You can look at the code of nltk.grammar.induce_pcfg(start, productions) to see how this can be done.
You should construct one distribution per non-terminal.
- For each distribution, compute the KL-divergence between the MLE estimation of the probability
distribution constructed on your test corpus and toy_pcfg2.
The MLE estimation is obtained by applying the
MLEProbDist estimator
to the observed empirical frequency distribution.
KL-divergence (Kullback-Leibler divergence,
also known as relative entropy) measures the difference between two distributions.
Read the notes on handling diverging KL computations (cases where an outcome has zero probability in one of the two distributions).
- Explain your observations.
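As a reference for the validation step, the KL-divergence between two discrete distributions can be computed directly from its definition. The sketch below represents distributions as plain dicts mapping outcomes to probabilities (an assumption for illustration; in the assignment you would read the probabilities off MLEProbDist objects and toy_pcfg2):

```python
from math import log

def kl_divergence(p, q):
    """D(P || Q) = sum_x p(x) * log2(p(x) / q(x)).
    Outcomes with p(x) == 0 contribute nothing; if q(x) == 0 while
    p(x) > 0 the divergence is infinite, which we signal explicitly."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            raise ValueError('diverging KL: outcome %r unseen in Q' % (x,))
        total += px * log(px / qx, 2)
    return total
```

D(P || Q) is zero exactly when the two distributions agree on every outcome, and grows as the empirical distribution drifts from the grammar's.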
Question 2: Learn a PCFG from a Treebank
In this question, we will learn a PCFG from a treebank.
We start from a subset (of about 5%) of the trees of the Penn Treebank (distributed by NLTK
in the corpus named "treebank").
In this question, we want to induce a PCFG from the treebank and
investigate its properties. For reference, NLTK provides a function
called induce_pcfg
in the nltk.grammar module. It starts from a set of productions and constructs a weighted grammar with the MLE
estimation of the production distributions.
Function specification
At this stage, we learn the PCFG "as is" -- without Chomsky Form Normalization -- but with "simplified" tags.
That is, we use the option: treebank = LazyCorpusLoader('treebank/combined', BracketParseCorpusReader, r'wsj_.*\.mrg', tag_mapping_function=simplify_wsj_tag)
so that all tags of the form "NP-SBJ-2" are simplified into "NP". We also want to filter out "NONE" elements from the treebank. You must find the
best way to perform NONE filtering on the trees.
pcfg_learn(treebank, n)
-- treebank is the nltk.corpus.treebank lazy corpus reader
-- n indicates the number of trees to read
-- return an nltk.WeightedGrammar
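The core of the induction is simple counting followed by normalization. The following sketch shows the MLE estimation over a flat list of (lhs, rhs) production pairs, as they would be read off the treebank trees with tree.productions(); the dict-of-dicts output format is an illustrative simplification of what induce_pcfg builds:

```python
from collections import defaultdict

def induce_mle(productions):
    """MLE estimation: P(lhs -> rhs) = count(lhs -> rhs) / count(lhs).
    productions is a list of (lhs, rhs) pairs; returns a mapping
    {lhs: {rhs: probability}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for lhs, rhs in productions:
        counts[lhs][rhs] += 1
    grammar = {}
    for lhs, rhs_counts in counts.items():
        total = float(sum(rhs_counts.values()))
        grammar[lhs] = dict((rhs, c / total)
                            for rhs, c in rhs_counts.items())
    return grammar
```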
Data Exploration and Validation
Explore the following properties of the learned PCFG:
- How many productions are learned from the trees? How many interior nodes were in the treebank?
- Draw a plot of the distribution of productions according to their frequency (number of rules with frequency 1, 2, ...). What do you observe?
- Assume we "cut" the tail of the learned PCFG, that is we remove the least frequent rules, so that we keep only the F
most frequent rules out of the N rules in the PCFG. We want to assess the "harm" made by this reduction in the size of the grammar on its capacity to
recognize the trees of the treebank. Write a function that determines whether a tree can be parsed by a grammar (cover_tree(grammar, tree)).
(This function does not actually parse anything, it just tests that a given tree can be produced by a grammar.)
Draw a plot that indicates the number of trees "missed" as the number of rules is reduced (sample every 10% of the size of the grammar).
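Since cover_tree does no parsing, it reduces to a membership test: read every production used in the tree and check it against the grammar's rule set. A sketch, again on nested (label, children...) tuples with the rule set given as a plain Python set of (lhs, rhs) pairs (both representational choices are assumptions for illustration):

```python
def tree_productions(tree):
    """Yield (lhs, rhs) for every interior node of a nested-tuple tree."""
    if isinstance(tree, str):   # leaf: no production
        return
    lhs = tree[0]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in tree[1:])
    yield (lhs, rhs)
    for child in tree[1:]:
        for prod in tree_productions(child):
            yield prod

def cover_tree(rule_set, tree):
    """True iff every production used in the tree appears in rule_set,
    i.e. the grammar can produce this tree."""
    return all(prod in rule_set for prod in tree_productions(tree))
```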
CNF PCFG
We now want to learn a PCFG in Chomsky Normal Form from the treebank, with simplified tags, and with filtered NONE elements.
The strategy is to convert the trees in the treebank into CNF, then to induce the PCFG from the transformed trees.
Use the function chomsky_normal_form in nltk.treetransforms
to convert the trees. Pay attention to NONE filtering.
pcfg_cnf_learn(treebank, n)
-- treebank is the nltk.corpus.treebank lazy corpus reader (simplified tags)
-- n indicates the number of trees to read
-- return an nltk.WeightedGrammar in CNF
- How many productions are learned from the CNF trees? How many interior nodes were in the original treebank, and in the CNF treebank?
- Compare the CNF and non-CNF figures.
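To build intuition for what chomsky_normal_form does to the rule counts, here is a sketch of right-factoring a single n-ary rule into binary rules. The intermediate label scheme 'A|<B-C>' imitates NLTK's convention but is only illustrative; use the NLTK function itself in the assignment:

```python
def binarize(lhs, rhs):
    """Right-factor one rule A -> B C D ... into a chain of binary
    rules, introducing intermediate symbols like 'A|<C-D>'."""
    rules = []
    while len(rhs) > 2:
        new_sym = '%s|<%s>' % (lhs, '-'.join(rhs[1:]))
        rules.append((lhs, (rhs[0], new_sym)))
        lhs, rhs = new_sym, rhs[1:]
    rules.append((lhs, tuple(rhs)))
    return rules
```

Each rule with k > 2 symbols on the RHS becomes k - 1 binary rules, which is why the CNF grammar has more productions and more interior nodes than the original.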
Question 3: Test CFG Independence Assumptions
We want to test the CFG hypothesis that a node expansion is independent from its location within a tree.
Report the following statistics, for the 5 most likely expansions of the NP category (in original form, simplified tags, NONE filtered):
- Compute the distribution of each expansion (RHS) given the NP LHS for all NPs in the treebank. Draw a histogram plot.
- Compute the same distribution only for NPs that appear directly below a S node. Draw a histogram plot.
- Compute the same distribution only for NPs that appear directly below a VP node. Draw a histogram plot.
- Compare the distributions above. Which metric will provide a good measure of this comparison? (Use KL-divergence.)
- Conclude: does the data in the treebank confirm the CFG hypothesis?
We will most likely observe that the CFG independence assumption does not hold for NPs in this experiment.
We now want to verify whether this lack of independence hurts when using the learned PCFG.
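Gathering the conditional distributions for this experiment amounts to counting each NP expansion keyed by the label of its parent node. A sketch over nested (label, children...) tuple trees (the representation and function name are illustrative assumptions):

```python
from collections import defaultdict

def np_expansions_by_parent(trees):
    """For every NP node, record its expansion (RHS) keyed by the
    label of its parent node (None for a root NP).
    Returns {parent_label: {rhs: count}}."""
    dists = defaultdict(lambda: defaultdict(int))

    def walk(node, parent):
        if isinstance(node, str):   # leaf
            return
        label = node[0]
        if label == 'NP':
            rhs = tuple(c if isinstance(c, str) else c[0]
                        for c in node[1:])
            dists[parent][rhs] += 1
        for child in node[1:]:
            walk(child, label)

    for tree in trees:
        walk(tree, None)
    return dists
```

Normalizing the counts under each parent key gives the conditional distributions P(RHS | NP, parent) that you can then compare with KL-divergence.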
Question 4: Learn a bigram Language Model
To test the quality of the language generated by the learned PCFG, we will evaluate a sample
of generated data with a bigram language model. We first need to learn such a bigram model.
A bigram model is learned as the estimation of a bigram probability distribution derived from
a frequency distribution of bigrams. We read sentence by sentence the linearized sentences
from our Penn Treebank subset (use tree.leaves()), and add the dummy START symbol
(the dot symbol already appears as an END symbol in our dataset).
N-gram models are implemented in the nltk.model.ngram module.
Function specification
bigram_learn(corpus, n, estimator)
-- n is the number of trees from the corpus to use for training
-- estimator is a function that takes a ConditionalFreqDist and returns a ConditionalProbDist
-- by default, use None - meaning, use the MLEProbDist estimator
-- return a bigram model acquired from the corpus
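The MLE case of the estimation above can be sketched without NLTK's classes: count bigrams over the sentences (with the dummy START symbol prepended) and normalize per conditioning word. The dict-of-dicts model format and the '<s>' START token are illustrative assumptions:

```python
from collections import defaultdict

START = '<s>'   # dummy start-of-sentence symbol (assumed name)

def bigram_learn(sentences):
    """MLE bigram model: P(w | prev) = count(prev, w) / count(prev).
    sentences is a list of token lists (e.g. from tree.leaves());
    returns {prev: {w: probability}}."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        prev = START
        for w in sent:
            counts[prev][w] += 1
            prev = w
    model = {}
    for prev, next_counts in counts.items():
        total = float(sum(next_counts.values()))
        model[prev] = dict((w, c / total)
                           for w, c in next_counts.items())
    return model
```

With a smoothed estimator such as LidstoneProbDist, only the normalization step changes; the counting is identical.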
Validation
To validate the bigram model, perform the following steps:
- Train your bigram model on 80% of the treebank text -- once with the MLEProbDist estimator and once
with the LidstoneProbDist estimator.
(See ngramModel.demo() for details on how the estimator should be used.)
- Compute the entropy of the held out 20% of the treebank text using each of the 2 bigram models.
- Generate 50 random sentences from the better model (see the method generate() in the ngramModel class).
- Observe the generated sentences and qualitatively assess their quality.
Question 5: Validation of PCFG Generated text using an n-gram Model
We want to test the hypothesis that, because the independence
assumptions of the PCFG do not hold, the language generated by the learned PCFG
is a poor match for the n-gram model learned on the same sample of textual data.
To this end, construct a sample of 1,000 sentences randomly generated by the PCFG.
Estimate the likelihood of this corpus using the learned bigram model.
Compare the estimated likelihood with the ngram estimated likelihood on the testing set
from real sentences. What are your conclusions?
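The comparison in this question boils down to computing a per-word cross-entropy of a corpus under the bigram model; the lower the cross-entropy, the better the model "expects" the text. A sketch, using the same dict-of-dicts model shape assumed above, and assuming every bigram in the corpus has nonzero probability under the model (true for MLE on the training data itself, otherwise use a smoothed estimator):

```python
from math import log

def bigram_entropy(model, sentences, start='<s>'):
    """Per-word cross-entropy: -(1/N) * sum_i log2 P(w_i | w_{i-1}),
    where model is {prev: {w: probability}} and each sentence is a
    list of tokens with `start` implicitly prepended."""
    log_prob, n = 0.0, 0
    for sent in sentences:
        prev = start
        for w in sent:
            log_prob += log(model[prev][w], 2)
            prev = w
            n += 1
    return -log_prob / n
```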
Question 6: Recognize subjects in constituent trees
Consider the Penn Treebank corpus delivered with NLTK.
You can observe the parsed trees using the treebank corpus reader:
>>> print nltk.corpus.treebank.parsed_sents('wsj_0001.mrg')[0]
(S
(NP-SBJ
(NP (NNP Pierre) (NNP Vinken))
(, ,)
(ADJP (NP (CD 61) (NNS years)) (JJ old))
(, ,))
(VP
(MD will)
(VP
(VB join)
(NP (DT the) (NN board))
(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
(NP-TMP (NNP Nov.) (CD 29))))
(. .))
>>> nltk.corpus.treebank.parsed_sents('wsj_0001.mrg')[0].draw()
>>>
When discussing the difference between constituency-based syntactic descriptions and dependency-based descriptions,
we stressed that the main differences are:
- Constituency-based descriptions have non-terminal nodes labelled with phrasal categories (NP, S, VP etc) while dependency representations have only word labels
(internal nodes are words that operate as syntactic heads of phrasal constituents).
- Dependency structures have labelled edges - the labels on the edges indicate the syntactic function of the children. For example, the SUBJECT label
outgoing from a verb would indicate which dependent constituent is the subject of the verb.
Observe that some of this syntactic function information is explicitly captured in the Penn Treebank, in the form of functional tag annotations. For example, the
label NP-SBJ indicates that the NP constituent "Pierre Vinken" plays the role of SUBJECT in its parent S constituent.
Proponents of constituent-based descriptions believe that tree configurations identify syntactic function in a non-ambiguous manner.
For example, the configuration (S (NP #1) (VP #2)) identifies the node #1 as a subject.
Your task in this question is to test this hypothesis for the SUBJECT function on the nltk Penn Treebank dataset fragment (about 1600 sentences).
You must develop a classifier that determines whether an NP node in a tree is a SUBJECT.
- Observe the treebank full tags (not simplified), and explain how you will set up your experiment: how do you derive the training set and the test set?
- What knowledge sources do you expect to play a role in the decision SBJ/non-SBJ?
- Use the nltk.classify module to implement the classifier
based on the knowledge sources you identified: implement a feature extractor and use the NaiveBayesClassifier to learn a classifier.
- Report on accuracy, and perform error analysis (train on 80%, test on 20% of the dataset).
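A feature extractor for nltk.classify returns a dict of feature names to values for each instance. The sketch below extracts a few configurational features for an NP child of a given parent node, on nested (label, children...) tuple trees; the feature names and tree representation are illustrative assumptions, and you should replace or extend them with the knowledge sources you identified:

```python
def extract_features(parent, index):
    """Features for the NP child at position `index` among the
    children of `parent` (a nested (label, children...) tuple).
    Returns a feature dict in the shape nltk.classify expects."""
    np_node = parent[1 + index]
    first_child = np_node[1]
    return {
        'parent_label': parent[0],                 # e.g. S vs VP
        'position': index,                         # position among siblings
        'first_in_parent': index == 0,             # leftmost child?
        'np_first_child': (first_child if isinstance(first_child, str)
                           else first_child[0]),   # e.g. DT, NNP, PRP
    }
```

Pairing each such feature dict with a SBJ/non-SBJ label (read off the full, unsimplified tags) yields the labeled instances to feed to NaiveBayesClassifier.train.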
Last modified April 8th, 2011