Assignment 3

Due: Thu 10 Jan 2018 Midnight

Natural Language Processing - Fall 2019 Michael Elhadad

This assignment covers the topic of syntactic parsing. Submit your solution in the form of an iPython notebook (ipynb) file. The following notes on PCFG in NLTK (notebook source) contain useful code to start this assignment.

Content

Q1: Designing CFGs for NLP

Q1.1: Extend a CFG to support Number agreement, Pronouns and Dative Constructions
Q1.2: Extend a CFG to support Coordination and Prepositional Phrases

Q1: Learning a PCFG from a Treebank

Q2.1 Random PCFG Generation
Q2.2 Learn a PCFG from a Treebank
Q2.3 Induce a PCFG in Chomsky Normal Form
Q2.4 Test CFG Independence Assumptions

Q3: Building and Evaluating a Simple PCFG Parser

Q3.1 Build a Parser
Q3.2 Evaluate the Parser
Q3.3 Accuracy per Distance
Q3.4 Accuracy per Label

Question 1: Designing CFGs for NLP

NLTK provides library support to read CFGs from string representation, and parse sentences given a CFG using different parsing algorithms (either top-down or bottom-up). In this question, we manually develop a grammar to support increasingly complex phenomena in English syntax. The following code provides the starting point:

sg = """
S -> NP VP
VP -> IV | TV NP
NP -> 'John' | "bread"
IV -> 'left'
TV -> 'eats'
"""
g = CFG.fromstring(sg)

# Bottom-up  parser
sr_parser = nltk.ShiftReduceParser(g, trace=2)

# Parse sentences and observe the behavior of the parser
def parse_sentence(sent):
    tokens = sent.split()
    trees = sr_parser.parse(tokens)
    for tree in trees:
        print(tree)

parse_sentence("John left")
parse_sentence("John eats bread")

Question 1.1: Extend a CFG to support Number agreement, Pronouns and Dative Constructions

1.1.1 Extend the CFG so that the following sentences can be parsed:

    John left
    John loves Mary
    They love Mary
    They love her
    She loves them
    Everybody loves John
    A boy loves Mary
    John gave Mary a heavy book
    John gave it to Mary

NOTE: pronouns in English are characterized by the following morphological attributes:

Number: singular / plural (e.g., he/they)
Gender: masculine / feminine / neutral (e.g., he/she/it)
Case: nominative / accusative (e.g., he/him)

Make sure the appropriate form of pronouns are used in each of the possible contexts. Do you need to encode gender in the grammar? Explain why.

1.1.2 Check whether your grammar overgenerates by giving an example of ungrammatical sentence that it recognizes as grammatical.

Question 1.2: Extend a CFG to support Coordination and Prepositional Phrases

1.2.1 Extend the grammar so that it can parse the following sentences:

John saw a man with a telescope
John saw a man on the hill with a telescope

Mary knows men and women
Mary knows men, children and women

John and Mary eat bread
John and Mary eat bread with cheese

1.2.2 What is the number of a coordination such as "John and Mary"? What is the number of a coordination such as "John or Mary"? "John or the children"?

Propose (without implementing) ways to support such variations.

1.2.3 Demonstrate ways in which your grammar over-generates. Explain your observations.

Question 2: Learning a PCFG from a Treebank

Question 2.1: Random PCFG Generation

Consider the PCFG datatype implementation provided in NLTK.

Consider the module nltk.parse.generate. It contains a function generate(grammar) which given a CFG grammar produces all the sentences the grammar can produce. Naturally, this may produce an infinite number of sentences if the grammar is recursive.

import itertools
from nltk.grammar import CFG
from nltk.parse import generate

demo_grammar = """
  S -> NP VP
  NP -> Det N
  PP -> P NP
  VP -> 'slept' | 'saw' NP | 'walked' PP
  Det -> 'the' | 'a'
  N -> 'man' | 'park' | 'dog'
  P -> 'in' | 'with'
"""
grammar = CFG.fromstring(demo_grammar)

for n, sent in enumerate(generate.generate(grammar, n=10), 1):
        print('%3d. %s' % (n, ' '.join(sent)))
		
  1. the man slept
  2. the man saw the man
  3. the man saw the park
  4. the man saw the dog
  5. the man saw a man
  6. the man saw a park
  7. the man saw a dog
  8. the man walked in the man
  9. the man walked in the park
 10. the man walked in the dog

Write a new function pcfg_generate(grammar) which will generate a random tree from a PCFG according to its generative probability model. In other words, your task is to write a sample method from the PCFG distribution.

Generation works top down: pick the start symbol of the grammar, then randomly choose a production starting from the start symbol, and add the RHS as children of the root node, then expand each node until you obtain terminals. The result should be a tree and not a sentence.

The generated parse trees should exhibit the distribution of non-terminal distributions captured in the PCFG (that is, the generation process should sample rules according to the weights of each non-terminal in the grammar).

Note: you must generate according to the PCFG distribution - not just randomly. To this end, you should use the method generate() from the NTLK ProbDistI interface. Specifically, you should use the nltk.probability.DictionaryProbDist class to represent the probability distribution of mapping an LHS non-terminal to one of the possible RHS.

from nltk.grammar import Nonterminal
from nltk.grammar import toy_pcfg2
from nltk.probability import DictionaryProbDist

productions = toy_pcfg2.productions()

# Get all productions with LHS=NP
np_productions = toy_pcfg2.productions(Nonterminal('NP'))

dict = {}
for pr in np_productions: dict[pr.rhs()] = pr.prob()
np_probDist = DictionaryProbDist(dict)

# Each time you call, you get a random sample
print(np_probDist.generate())
(Det, N)

print(np_probDist.generate())
(Name,)

print(np_probDist.generate())
(Name,)

Function Specification

pcfg_generate(grammar) -- return a tree sampled from the language described by the PCFG grammar

Validation

To validate the generate function, perform the following steps:

2.1.1 Generate 1,000 random trees using nltk.grammar.toy_pcfg2 - store the resulting trees in a file "toy_pcfg2.gen".
2.1.2 Compute the frequency distribution of each non-terminal and pre-terminal in the generated corpus. You can look at the code of nltk.induce_pcfg(root, productions) to see how this can be done. You should construct one distribution per non-terminal.
2.1.3 For each distribution, compute the KL-divergence between the MLE estimation of the probability distribution constructed on your test corpus and toy_pcfg2. The MLE estimation is obtained by applying the MLEProbDist estimator to the observed empirical frequency distribution. KL-divergence (Kullback-Leibler divergence also known as relative entropy) estimates the difference between two distributions.
Read some precisions on handling diverging KL computations when the support of the 2 distributions p and q are not identical.
2.1.4 Explain your observations.

Question 2.2: Learn a PCFG from a Treebank

In this question, we will learn a PCFG from a treebank with different types of tree annotations. We start from a subset of 200 trees from the Penn Treebank (which is distributed by NLTK in the corpus named "treebank" - the NLTK version contains about 4,000 trees).

In this question, we want to induce a PCFG from the treebank and investigate its properties. For reference, NLTK provides a function called induce_pcfg in the nltk.grammar module. It starts from a set of productions and constructs a weighted grammar with the MLE estimation of the production distributions.

2.2.1 Function specification

At this stage, we learn the PCFG "as is" -- without Chomsky Form Normalization -- but with "simplified" tags. That is, we must use a function to simplify all tags in the tree (non-terminal and terminal):

from nltk.corpus import LazyCorpusLoader, BracketParseCorpusReader

def simplify_functional_tag(tag):
    if '-' in tag:
        tag = tag.split('-')[0]
    return tag

treebank = LazyCorpusLoader('treebank/combined', BracketParseCorpusReader, r'wsj_.*\.mrg')

# Raw form
print(treebank.parsed_sents()[:1])

# Pretty print
print(treebank.parsed_sents()[0])

# we need to transform the tree to remove NONE tags and simplify tags.

We want all tags of the form "NP-SBJ-2" to be simplified into "NP". We also want to filter out "NONE" elements from the treebank. You must find the best way to perform NONE filtering on the trees. (NONE elements appear in the original Penn Treebank for example in the case of relative clauses.)

pcfg_learn(treebank, n)
-- treebank is the nltk.corpus.treebank lazy corpus reader
-- n indicates the number of trees to read
-- return an nltk.PCFG instance

Use the following code as a starting point:

from nltk.grammar import Production
from nltk import Tree, Nonterminal

def get_tag(tree):
    if isinstance(tree, Tree):
        return Nonterminal(simplify_functional_tag(tree.label()))
    else:
        return tree

def tree_to_production(tree):
    return Production(get_tag(tree), [get_tag(child) for child in tree])

def tree_to_productions(tree):
    yield tree_to_production(tree)
    for child in tree:
        if isinstance(child, Tree):
            for prod in tree_to_productions(child):
                yield prod

Data Exploration and Validation

Explore the following properties of the learned PCFG:

2.2.1 How many productions are learned from the trees? How many interior nodes were in the treebank?
2.2.2 Draw a plot of the distribution of productions according to their frequency (number of rules with frequency 1, 2, ...). What do you observe?
2.2.3 Induce a PCFG from 200 trees, and another one from 400 trees - compare the distribution of the rules you learn.

Question 2.3: Induce a PCFG in Chomsky Normal Form

We now want to learn a PCFG in Chomsky Normal Form from the treebank, with simplified tags, and with filtered NONE elements. The strategy is to convert the trees in the treebank into CNF, then to induce the PCFG from the transformed trees. Use the function chomsky_normal_form in nltk.treetransforms to convert the trees. Pay attention to NONE filtering. Use horizontal Markov annotation and parent annotation:


chomsky_normal_form(tree, factor='right', horzMarkov=1, vertMarkov=1, childChar='|', parentChar='^')

pcfg_cnf_learn(treebank, n)
-- treebank is the nltk.corpus.treebank lazy corpus reader (simplified tags)
-- n indicates the number of trees to read
-- return an nltk.PCFG in CNF

2.3.1 Implement pcfg_cnf_learn
2.3.2 How many productions are learned from the CNF trees? How many interior nodes were in the original treebank, and in the CNF treebank?
2.3.3 Compare the CNF and non-CNF figures (ratios). What do you conclude?

Question 2.4: Test CFG Independence Assumptions

We want to test the CFG hypothesis that a node expansion is independent from its location within a tree.

For the treebank before CNF transformation, report the following statistics, for the expansions of the NP category:

2.4.1 Compute the distribution of the RHS given the NP LHS for all NPs in the treebank. Draw a histogram plot.
2.4.2 Compute the same distribution only for NPs that appear directly below a S node (any variant of S in the Penn treebank tagset). Draw a histogram plot.
2.4.3 Compute the same distribution only for NPs that appear directly below a VP node (any variant of VP in the Penn Treebank tagset). Draw a histogram plot.
2.4.4 Compare the distributions above. Use KL-divergence.
2.4.5 Conclude: does the data in the treebank confirm the CFG hypothesis for each configuration (original trees, annotated CNF trees)? (How do you calibrate the values of the KL-divergence to decide that 2 values are similar?)

Question 3: Building and Evaluating a Simple PCFG Parser

In this question, we will construct a Viterbi parser for the PCFG induced in Question 2 and perform evaluation of this statistical parser.

Question 3.1: Build a Parser

3.1.1 Split the NLTK treebank corpus into 80% training (about 3,200 trees) and 20% (about 800 trees) testing sets.

3.1.2 Learn a PCFG over the Chomsky Normal Form version of this treebank.

3.1.3 Construct a ViterbiParser using this PCFG grammar. Test the parser on a few sentences.

Question 3.2: Evaluate the Parser

Your task is to compute the ParsEval metric for the constituent-based parse trees on the test dataset of our treebank. Report the following metrics:

Precision
Recall
F-measure
Labelled Precision
Labelled Recall
Labelled F-Measure

Where precision and recall are computed over "constituents" between the parsed trees and the treebank trees: a constituent is a triplet (interior_node, first_index, last_index) - where first_index is the index of the first word in the sentence covered by the interior node, and last_index that of the last word (that is, [first_index, last_index] is the yield of the interior node).

In labelled precision and recall we count that 2 constituents match if they have the same label for the interior node. For unlabelled metrics, we just compare the first and last indexes. Make sure you compare the trees from the treebank after they have been simplified: for example, S-TPC-1 must be compared with S and ADVP-TMP with ADVP, and -NONE- nodes have been eliminated.

You can read more on this metric in the following references:

An assignment on parsing evaluation by Scott Farrar includes detailed explanation and examples on how the metric is computed.
Evalb software: this is the C program used in all standard evaluations of constituent parsers (by Satoshi Sekine and Michael Collins).
Evalb in Java for those who prefer reading Java over C - is part of the Stanford CoreNLP package.

3.2.1 Generate a list of labelled constituents of the form [label first_index last_index] given a parse tree.
3.2.2 Compute unlabelled Precision, Recall and F-measure for your parser on the test dataset
3.2.3 Compute labelled Precision, Recall and F-measure for your parser on the test dataset

Question 3.3: Accuracy per Distance

In general, parsers do a much better job on short constituents than long ones. Draw a plot of the accuracy of constituents per constituent length. The length of a constituent (node, first, last) is last-first+1. For a given constituent length X, accuracy is the number of constituents of length X in the parsed tree that are accurate divided by the total number of constituents of length X.

Question 3.4: Accuracy per Label

Report accuracy of constituents per label type (S, SBAR, NP, VP, PP etc). For each node type, report number of occurrences and accuracy. Note: report accuracy per node type WITHOUT Chomsky Normal Form modification. For example, if the CNF of the tree generated a non-terminal NP^S, we should count this as NP for this question.

Last modified 24 Dec 2018