Assignment 3

Due: Sunday 27 Jan 2013 Midnight

Natural Language Processing - Fall 2013 Michael Elhadad

This assignment covers syntactic parsing and classification. Submit your solution as an HTML file, using the same CSS as this page, with code inside <pre> tags. Images should be submitted as PNG files.


Question 1: PCFG Parsing

Question 1.1: Random PCFG Generation

Consider the PCFG datatype implementation provided in NLTK and the module nltk.parse.generate. It contains a function generate(grammar) which, given a CFG, produces all the sentences the grammar can generate. Naturally, this may produce an infinite number of sentences if the grammar is recursive.
>>> from nltk.parse import generate
>>> for sentence in generate.generate(grammar):   # one possible invocation; grammar is a CFG
...     print ' '.join(sentence)
the man saw the man
the man saw the park
the man saw the dog
the man saw the telescope
the man saw a man
the man saw a park
the man saw a dog
the man saw a telescope
the man walked the man
the man walked the park
the man walked the dog
the man walked the telescope
...
Write a new function pcfg_generate(grammar) which generates a random tree from a PCFG according to its generative probability model. Generation works top down: start from the grammar's start symbol, randomly choose a production that expands this symbol, add the RHS symbols as children of the root node, and then recursively expand each non-terminal node until only terminals remain. The result should be a tree, not a sentence. The generated parse trees should exhibit the distribution over node expansions captured in the PCFG; that is, the generation process must sample productions according to the weights attached to each non-terminal in the grammar, not uniformly at random. To this end, you should use the method generate() of the NLTK ProbDistI interface. Specifically, use the nltk.probability.DictionaryProbDist class to represent the probability distribution mapping an LHS non-terminal to its possible RHS expansions.
>>> from nltk.grammar import Nonterminal, toy_pcfg2
>>> from nltk.probability import DictionaryProbDist

>>> productions = toy_pcfg2.productions()

# Get all productions with LHS=NP
>>> np_productions = toy_pcfg2.productions(Nonterminal('NP'))
>>> np_productions
[NP -> Det N [0.41], NP -> Name [0.28], NP -> NP PP [0.31]]

>>> probs = {}
>>> for pr in np_productions: probs[pr.rhs()] = pr.prob()
>>> np_probDist = DictionaryProbDist(probs)

# Each call to generate() draws a random RHS sample from np_probDist
>>> np_probDist.generate()
(Det, N)
>>> np_probDist.generate()
(Name,)
>>> np_probDist.generate()
(Name,)
>>> np_probDist.generate()
(Name,)
>>> np_probDist.generate()
(NP, PP)
>>> np_probDist.generate()
(Det, N)

Function specification

pcfg_generate(grammar) -- return a tree sampled from the language described by the grammar
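Here is a minimal sketch of how such a sampler could be written, assuming the NLTK 2 tree API (Tree and tree.node). It rebuilds the DictionaryProbDist at each expansion; caching one distribution per LHS would be a natural optimization.

from nltk import Tree
from nltk.grammar import Nonterminal
from nltk.probability import DictionaryProbDist

def pcfg_generate(grammar):
    """Sample one tree from the distribution described by the PCFG."""
    def sample_rhs(lhs):
        # Probability distribution over the possible RHS expansions of lhs
        dist = dict((prod.rhs(), prod.prob()) for prod in grammar.productions(lhs))
        return DictionaryProbDist(dist).generate()

    def expand(symbol):
        if not isinstance(symbol, Nonterminal):
            return symbol                          # a terminal, i.e. a word
        children = [expand(child) for child in sample_rhs(symbol)]
        return Tree(symbol.symbol(), children)

    return expand(grammar.start())

Since the grammar is recursive, the sampled trees can occasionally grow very deep; a depth cutoff with resampling is a reasonable safeguard.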

Validation

To validate the generate function, perform the following steps:

Question 1.2: Learn a PCFG from a Treebank

In this question, we will learn a PCFG from a treebank with different types of tree annotations. We start from a subset of 200 trees from the Penn Treebank (which is distributed by NLTK in the corpus named "treebank").

We want to induce a PCFG from this treebank and investigate its properties. For reference, NLTK provides a function called induce_pcfg in the nltk.grammar module: it starts from a set of productions and constructs a weighted grammar with the MLE estimate of the production distributions.

Function specification

At this stage, we learn the PCFG "as is" -- without transforming the trees to Chomsky Normal Form -- but with "simplified" tags. That is, we must use a function to simplify all the tags in the trees (non-terminal and pre-terminal):
from nltk.corpus import LazyCorpusLoader, BracketParseCorpusReader

def simplify_functional_tag(tag):
    if '-' in tag:
        tag = tag.split('-')[0]
    return tag

treebank = LazyCorpusLoader('treebank/combined', BracketParseCorpusReader, r'wsj_.*\.mrg')

print treebank.parsed_sents()[:1]
We need to transform the trees to simplify the tags and to remove NONE elements. All tags of the form "NP-SBJ-2" should be simplified into "NP", and "-NONE-" elements should be filtered out of the treebank. You must find the best way to perform NONE filtering on the trees. (NONE elements appear in the original Penn Treebank, for example, in the case of relative clauses.)
pcfg_learn(treebank, n)
-- treebank is the nltk.corpus.treebank lazy corpus reader
-- n indicates the number of trees to read
-- return an nltk.WeightedGrammar
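A possible sketch of both steps, assuming the NLTK 2 tree API (tree.node) and 'S' as the start symbol passed to induce_pcfg:

from nltk import Tree
from nltk.grammar import Nonterminal, induce_pcfg

def filter_tree(tree):
    """Copy a tree, dropping -NONE- subtrees and simplifying functional tags."""
    if not isinstance(tree, Tree):
        return tree                               # a leaf (word)
    if tree.node == '-NONE-':
        return None
    children = [filter_tree(child) for child in tree]
    children = [c for c in children if c is not None]
    if not children:
        return None                               # this node covered only empty elements
    return Tree(simplify_functional_tag(tree.node), children)

def pcfg_learn(treebank, n):
    productions = []
    for tree in treebank.parsed_sents()[:n]:
        tree = filter_tree(tree)
        productions += tree.productions()
    return induce_pcfg(Nonterminal('S'), productions)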

Data Exploration and Validation

Explore the following properties of the learned PCFG:
  1. How many productions are learned from the trees? How many interior nodes are there in the treebank trees?
  2. Draw a plot of the distribution of productions according to their frequency (number of productions with frequency 1, 2, ...); see the sketch below. What do you observe?
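One way to produce the plot, assuming the list of productions collected as in the sketch above, NLTK's FreqDist for counting, and matplotlib for the PNG output:

from nltk import FreqDist
import matplotlib.pyplot as plt

fdist = FreqDist(productions)                # count the occurrences of each production
freq_of_freq = FreqDist(fdist.values())      # number of productions with frequency 1, 2, ...
frequencies = sorted(freq_of_freq.keys())
plt.plot(frequencies, [freq_of_freq[f] for f in frequencies])
plt.xlabel('production frequency')
plt.ylabel('number of productions')
plt.savefig('production-frequencies.png')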

CNF PCFG

We now want to learn a PCFG in Chomsky Normal Form from the treebank, with simplified tags, and with filtered NONE elements. The strategy is to convert the trees in the treebank into CNF, then to induce the PCFG from the transformed trees. Use the function chomsky_normal_form in nltk.treetransforms to convert the trees. Pay attention to NONE filtering. Use horizontal Markov annotation and parent annotation: chomsky_normal_form(tree, factor='right', horzMarkov=1, vertMarkov=1, childChar='|', parentChar='^')
pcfg_cnf_learn(treebank, n)
-- treebank is the nltk.corpus.treebank lazy corpus reader (simplified tags)
-- n indicates the number of trees to read
-- return an nltk.WeightedGrammar in CNF (a sketch follows the questions below)
  1. How many productions are learned from the CNF trees? How many interior nodes were in the original treebank, and in the CNF treebank?
  2. Compare the CNF and non-CNF figures (ratios). What do you conclude?
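A sketch of this variant, reusing filter_tree from the sketch in Question 1.2 (chomsky_normal_form from nltk.treetransforms modifies the tree in place):

from nltk.treetransforms import chomsky_normal_form

def pcfg_cnf_learn(treebank, n):
    productions = []
    for tree in treebank.parsed_sents()[:n]:
        tree = filter_tree(tree)              # simplified tags, -NONE- removed
        chomsky_normal_form(tree, factor='right', horzMarkov=1, vertMarkov=1,
                            childChar='|', parentChar='^')
        productions += tree.productions()
    return induce_pcfg(Nonterminal('S'), productions)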

Question 1.3: Test CFG Independence Assumptions

We want to test the CFG hypothesis that a node expansion is independent of its location within a tree.

For each of the two treebanks -- before CNF transformation and after CNF transformation with the parameters set in Question 1.2 -- report the following statistics for the expansions of the NP category:
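As one possible illustration of such a test (the choice of statistic here is an assumption, not the required list), the following sketch tabulates the frequency of each NP expansion overall and conditioned on the label of the NP node's parent; under the CFG independence hypothesis, the conditional distributions should all resemble the overall one:

from nltk import Tree
from nltk.probability import ConditionalFreqDist

def np_expansions(trees):
    """Frequency of each NP expansion, overall and per parent label."""
    cfdist = ConditionalFreqDist()
    for tree in trees:
        for parent in tree.subtrees():
            for child in parent:
                if isinstance(child, Tree) and child.node == 'NP':
                    rhs = tuple(c.node if isinstance(c, Tree) else c for c in child)
                    cfdist['ALL'].inc(rhs)          # overall distribution
                    cfdist[parent.node].inc(rhs)    # conditioned on the NP's location
    return cfdist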

Question 1.4: Model Selection: Comparing 2 Bigram Language Models

We have induced a PCFG from an annotated treebank in the hope of obtaining a grammar for which the PCFG independence assumption holds better. We want to compare the two learned PCFGs, with and without annotations. We first address the issue of comparing two simple probabilistic models by comparing their entropy on heldout data. To practice, we will perform model selection on two variants of a bigram language model.

A bigram model is learned as the estimation of a bigram probability distribution derived from a frequency distribution of bigrams. We read the linearized sentences from our Penn Treebank subset one sentence at a time (use tree.leaves()) and add a dummy START symbol at the beginning of each sentence (the dot symbol already appears as an END symbol in our dataset).

N-gram models are implemented in the nltk.model.ngram module.

Function Specification

bigram_learn(corpus, n, estimator)
-- n is the number of trees from the corpus to use for training
-- estimator is a function that takes a ConditionalFreqDist and returns a ConditionalProbDist
-- by default, estimator is None, meaning use the MLEProbDist estimator
-- return a bigram model acquired from the corpus
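A sketch matching this specification, assuming NLTK 2 probability classes (FreqDist.inc) and the START convention described above:

from nltk.probability import ConditionalFreqDist, ConditionalProbDist, MLEProbDist

def bigram_learn(corpus, n, estimator=None):
    """Acquire a bigram model from the first n trees of the corpus."""
    if estimator is None:
        estimator = lambda cfd: ConditionalProbDist(cfd, MLEProbDist)
    cfdist = ConditionalFreqDist()
    for tree in corpus.parsed_sents()[:n]:
        words = ['START'] + tree.leaves()
        for w1, w2 in zip(words, words[1:]):
            cfdist[w1].inc(w2)                # count the bigram (w1, w2)
    return estimator(cfdist)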

Model Selection: Heldout Entropy

To validate the bigram model, perform the following steps:
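For example, the per-word cross-entropy of a model on heldout trees could be computed as in the sketch below (assuming bigram_learn returns a ConditionalProbDist as above). Note that with the MLE estimator, unseen heldout bigrams receive probability 0 and unbounded log-probability, which is exactly why smoothed estimators are worth comparing:

def bigram_entropy(model, corpus, start, end):
    """Average -log2 P(w2|w1) of the model over heldout trees [start:end]."""
    logprob, count = 0.0, 0
    for tree in corpus.parsed_sents()[start:end]:
        words = ['START'] + tree.leaves()
        for w1, w2 in zip(words, words[1:]):
            logprob += -model[w1].logprob(w2)   # logprob is log base 2 in NLTK
            count += 1
    return logprob / count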

Question 1.5: Model Selection: Annotated PCFG vs. Original PCFG

We now turn to the task of learning a PCFG from an annotated treebank and comparing the PCFG learned with annotations to the one learned without annotations. We will compare the two PCFGs learned in Question 1.2.

Perform the following steps:


Question 2: Error Analysis of a Parser

In this question, we will perform evaluation and error analysis of an existing statistical parser. We will assess the behavior of the Stanford CoreNLP parser on our treebank sample. (Note that this will give us optimistic results, since we will test on data that was used to train this parser in the first place.)

This parser is implemented in Java and can return either constituency or dependency syntactic descriptions of sentences. It includes sentence segmentation, word tokenization, POS tagging, parsing, and additional basic services, and it provides close to state-of-the-art results.

The execution of the CoreNLP process requires a 64-bit machine with at least 4GB of RAM. You will find the results of the CoreNLP parsing of the whole NLTK treebank sample in this archive. If you want, you can also download the parser and execute the parses on your own, but this is not necessary to answer this question. If you want to run the parser under Windows, the following batch file may be useful.

Question 2.1: Parseval

The CoreNLP parsed files appear in the following form, in the same order as in the merged files in the treebank:
Sentence #1 (18 tokens):
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
[Text=Pierre CharacterOffsetBegin=0 CharacterOffsetEnd=6 PartOfSpeech=NNP Lemma=Pierre NamedEntityTag=PERSON] [Text=Vinken CharacterOffsetBegin=7 CharacterOffsetEnd=13 PartOfSpeech=NNP Lemma=Vinken NamedEntityTag=PERSON] [Text=, CharacterOffsetBegin=13 CharacterOffsetEnd=14 PartOfSpeech=, Lemma=, NamedEntityTag=O] [Text=61 CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=CD Lemma=61 NamedEntityTag=DURATION NormalizedNamedEntityTag=61.0 Timex=61 years old] [Text=years CharacterOffsetBegin=18 CharacterOffsetEnd=23 PartOfSpeech=NNS Lemma=year NamedEntityTag=NUMBER NormalizedNamedEntityTag=0.0 Timex=61 years old] [Text=old CharacterOffsetBegin=24 CharacterOffsetEnd=27 PartOfSpeech=JJ Lemma=old NamedEntityTag=DURATION Timex=61 years old] [Text=, CharacterOffsetBegin=27 CharacterOffsetEnd=28 PartOfSpeech=, Lemma=, NamedEntityTag=O] [Text=will CharacterOffsetBegin=29 CharacterOffsetEnd=33 PartOfSpeech=MD Lemma=will NamedEntityTag=O] [Text=join CharacterOffsetBegin=34 CharacterOffsetEnd=38 PartOfSpeech=VB Lemma=join NamedEntityTag=O] [Text=the CharacterOffsetBegin=39 CharacterOffsetEnd=42 PartOfSpeech=DT Lemma=the NamedEntityTag=O] [Text=board CharacterOffsetBegin=43 CharacterOffsetEnd=48 PartOfSpeech=NN Lemma=board NamedEntityTag=O] [Text=as CharacterOffsetBegin=49 CharacterOffsetEnd=51 PartOfSpeech=IN Lemma=as NamedEntityTag=O] [Text=a CharacterOffsetBegin=52 CharacterOffsetEnd=53 PartOfSpeech=DT Lemma=a NamedEntityTag=O] [Text=nonexecutive CharacterOffsetBegin=54 CharacterOffsetEnd=66 PartOfSpeech=JJ Lemma=nonexecutive NamedEntityTag=O] [Text=director CharacterOffsetBegin=67 CharacterOffsetEnd=75 PartOfSpeech=NN Lemma=director NamedEntityTag=O] [Text=Nov. CharacterOffsetBegin=76 CharacterOffsetEnd=80 PartOfSpeech=NNP Lemma=Nov. NamedEntityTag=DATE NormalizedNamedEntityTag=XXXX-11-29 Timex=Nov. 29] [Text=29 CharacterOffsetBegin=81 CharacterOffsetEnd=83 PartOfSpeech=CD Lemma=29 NamedEntityTag=DATE NormalizedNamedEntityTag=XXXX-11-29 Timex=Nov. 29] [Text=. CharacterOffsetBegin=83 CharacterOffsetEnd=84 PartOfSpeech=. Lemma=. NamedEntityTag=O]
(ROOT
  (S
    (NP
      (NP (NNP Pierre) (NNP Vinken))
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years))
        (JJ old))
      (, ,))
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board))
        (PP (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director)))
        (NP-TMP (NNP Nov.) (CD 29))))
    (. .)))

nn(Vinken-2, Pierre-1)
nsubj(join-9, Vinken-2)
num(years-5, 61-4)
npadvmod(old-6, years-5)
amod(Vinken-2, old-6)
aux(join-9, will-8)
root(ROOT-0, join-9)
det(board-11, the-10)
dobj(join-9, board-11)
det(director-15, a-13)
amod(director-15, nonexecutive-14)
prep_as(join-9, director-15)
tmod(join-9, Nov.-16)
num(Nov.-16, 29-17)
This includes for each sentence the following information:
  1. The sentence number within the file (starting at #1) and the number of tokens in the sentence.
  2. The raw sentence (reprinted after tokenization).
  3. A token-by-token representation of the sentence, where each token is annotated in [] brackets with character offsets, POS, lemma, and named-entity tags.
  4. A constituent-based parse tree of the sentence (starting with a ROOT constituent).
  5. A dependency-based parse tree of the sentence represented as label(head-index, child-index) triplets.
Your task is to compute the Parseval metrics for the constituent-based parse trees produced by CoreNLP on our treebank. Report the following metrics:
  1. Precision
  2. Recall
  3. F-measure
  4. Labeled Precision
  5. Labeled Recall
  6. Labeled F-Measure
Precision and recall are computed over "constituents" shared between the CoreNLP trees and the treebank trees: a constituent is a triplet (interior-node, first-index, last-index), where first-index is the index of the first word in the sentence covered by the interior node and last-index is the index of the last word (that is, [first-index, last-index] is the yield of the interior node).

In labeled precision and recall, two constituents match only if they also have the same label for the interior node. For unlabeled metrics, we compare just the first and last indexes. Make sure you compare the trees from the treebank after they have been simplified: for example, S-TPC-1 must be compared with S, ADVP-TMP with ADVP, and -NONE- nodes must have been eliminated.
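A sketch of the constituent extraction and of the per-sentence Parseval computation, assuming both trees have already been simplified and NONE-filtered; corpus-level figures should sum matched and total constituent counts over all sentences rather than average per-sentence ratios:

from nltk import Tree

def constituents(tree):
    """The set of (label, first-index, last-index) triplets of a tree's interior nodes."""
    result = set()
    def walk(t, first):
        if not isinstance(t, Tree):
            return first + 1                  # a word covers one position
        index = first
        for child in t:
            index = walk(child, index)
        result.add((t.node, first, index - 1))
        return index
    walk(tree, 0)
    return result

def parseval(gold_tree, test_tree, labeled=True):
    gold, test = constituents(gold_tree), constituents(test_tree)
    if not labeled:                           # keep only the (first, last) spans
        gold = set(c[1:] for c in gold)
        test = set(c[1:] for c in test)
    hits = len(gold & test)
    precision = hits / float(len(test))
    recall = hits / float(len(gold))
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure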

You can read more on this metric in the following references:

  1. Notes on parsing evaluation by M. Dickinson (Currently not available)
  2. An assignment on parsing evaluation by Scott Farrar includes detailed explanation and examples on how the metric is computed.
  3. Evalb software: this is the C program used in all standard evaluations of constituent parsers (by Satoshi Sekine and Michael Collins).
  4. Evalb in Java for those who prefer reading Java over C - is part of the Stanford CoreNLP package.
  5. Evalc-GUI in Java: in addition to computing the metrics, this package shows the trees that are compared and highlights graphically the differences in each pair.
Note: CoreNLP can label some NPs as NP-TMP (that is, an NP that denotes a temporal expression). In your comparison, you should use a version of the tag simplification which preserves this tag, for example:
def simplify_corenlp_tag(tag):
    if tag[:6] == 'NP-TMP':
        return 'NP-TMP'
    if '-' in tag:
        tag = tag.split('-')[0]
    if '=' in tag:
        tag = tag.split('=')[0]
    return tag
For reference, here is a treebank tree that contains functional tags (S-TPC-1, ADVP-TMP, PP-LOC) and -NONE- elements:
( (S
    (S-TPC-1
      (NP-SBJ
        (NP
          (NP (DT A) (NN form) )
          (PP (IN of)
            (NP (NN asbestos) )))
        (RRC
          (ADVP-TMP (RB once) )
          (VP (VBN used)
            (NP (-NONE- *) )
            (S-CLR
              (NP-SBJ (-NONE- *) )
              (VP (TO to)
                (VP (VB make)
                  (NP (NNP Kent) (NN cigarette) (NNS filters) )))))))
      (VP (VBZ has)
        (VP (VBN caused)
          (NP
            (NP (DT a) (JJ high) (NN percentage) )
            (PP (IN of)
              (NP (NN cancer) (NNS deaths) ))
            (PP-LOC (IN among)
              (NP
                (NP (DT a) (NN group) )
                (PP (IN of)
                  (NP
                    (NP (NNS workers) )
                    (RRC
                      (VP (VBN exposed)
                        (NP (-NONE- *) )
                        (PP-CLR (TO to)
                          (NP (PRP it) ))
                        (ADVP-TMP
                          (NP
                            (QP (RBR more) (IN than) (CD 30) )
                            (NNS years) )
                          (IN ago) )))))))))))
    (, ,)
    (NP-SBJ (NNS researchers) )
    (VP (VBD reported)
      (SBAR (-NONE- 0)
        (S (-NONE- *T*-1) )))
    (. .) ))

Question 2.2: Accuracy per Distance

In general, parsers do a much better job on short constituents than on long ones. Draw a plot of the accuracy of constituents per constituent length. The length of a constituent (node, first, last) is last - first + 1. For a given constituent length X, accuracy is the number of correct constituents of length X in the parsed tree divided by the total number of constituents of length X.
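A sketch of this computation, reusing the constituents function sketched in Question 2.1 and assuming pairs is a list of (gold-tree, parsed-tree) pairs; the same loop keyed on the node label instead of the length answers Question 2.3:

from collections import defaultdict

def accuracy_per_length(pairs):
    """Fraction of correct parsed constituents, grouped by constituent length."""
    correct, total = defaultdict(int), defaultdict(int)
    for gold_tree, parsed_tree in pairs:
        gold = constituents(gold_tree)
        for node, first, last in constituents(parsed_tree):
            length = last - first + 1
            total[length] += 1
            if (node, first, last) in gold:
                correct[length] += 1
    return dict((l, correct[l] / float(total[l])) for l in total)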

Question 2.3: Accuracy per Label

Report accuracy of constituents per label type (S, SBAR, NP, VP, PP etc). For each node type, report number of occurrences and accuracy.

Question 3: Fine-grained Labeling of Parse Trees (Optional for Bonus Credit)

We have observed that the original Penn Treebank uses fine-grained annotations on some of the labels used in the trees. These annotations indicate the type of grammatical function fulfilled by each constituent. The guidelines to add these functional annotations are summarized below (from Penn Treebank II Tags):
  1. Grammatical role:
  2. Adverbials: Adverbials are generally VP adjuncts.
Parsers such as CoreNLP do not recover these fine-grained labels, but only the simplified ones. (The main reason for this tradition of ignoring fine-grained labels is that these labels are widely believed to be inconsistent across Penn Treebank human annotators - see for example Handling noisy training and testing data.)

In this question, we attempt to recover the fine-grained labels PP-TMP and PP-LOC (these two tags are the most frequent annotated tags in the treebank). This is a classification task: your objective is, given a parse tree containing only simplified labels (as produced by CoreNLP), to recover the tags PP-TMP and PP-LOC as they appear in the original Treebank.

To this end, you must define features which can be helpful predictors of the label PP-TMP, PP-LOC or PP with no annotation. You can use the additional information returned by CoreNLP to extract useful features:

  1. The full parse tree
  2. The dependency parse tree
  3. The chunked version with recognized named-entities.
Train your classifier on 80% of the PP nodes and test it on the remaining 20%. You should propose an encoding of structural features and choose an appropriate classifier. Evaluate your classifier, report per-label precision, recall, and F-measure, and discuss your observations. Useful features include the following (a sketch follows the list):
  1. Label of the constituent, of its parent, of its left and right siblings.
  2. Lexical term of the preposition
  3. Lexical head of the NP object of the PP (best obtained by using the dependency representation of the sentence)
  4. Word representations of the lexical heads of the NP object of the PP, as provided by Word Representations for NLP.
  5. Whether a named-entity is tagged as part of the NP object of the PP and the type of this NE.
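A sketch of one possible feature encoding and training loop, using nltk.NaiveBayesClassifier as one reasonable classifier choice; the feature names, the node_label helper, and the examples list are illustrative assumptions:

import nltk
from nltk import Tree

def node_label(t):
    return t.node if isinstance(t, Tree) else t      # a sibling may be a bare word

def pp_features(parent, i):
    """Features of the PP at position i among the children of parent."""
    pp = parent[i]
    return {
        'parent-label': parent.node,
        'left-sibling': node_label(parent[i - 1]) if i > 0 else 'START',
        'right-sibling': node_label(parent[i + 1]) if i < len(parent) - 1 else 'END',
        'preposition': pp.leaves()[0].lower(),       # first word of the PP
    }

# examples: list of (feature-dict, label) pairs, label in {'PP-TMP', 'PP-LOC', 'PP'}
cut = int(0.8 * len(examples))
classifier = nltk.NaiveBayesClassifier.train(examples[:cut])
print nltk.classify.accuracy(classifier, examples[cut:])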


Last modified 07 Jan 2013