Assignment 3
Due: Sunday 27 Jan 2013 Midnight
Natural Language Processing - Fall 2013 Michael Elhadad
This assignment covers the topic of syntactic parsing and classification.
Submit your solution in the form of an HTML file, using the same CSS as this page
with code inside <pre> tags. Images should be submitted as PNG files.
Question 1: PCFG Parsing
Question 1.1: Random PCFG Generation
Consider the PCFG datatype implementation provided in NLTK.
Consider the module nltk.parse.generate.
It contains a function generate(grammar) which, given a CFG grammar, produces all the sentences the grammar can derive.
Naturally, this may produce an infinite number of sentences if the grammar is recursive.
>>> from nltk.parse import generate
>>> for sentence in generate.generate(grammar): print ' '.join(sentence)
the man saw the man
the man saw the park
the man saw the dog
the man saw the telescope
the man saw a man
the man saw a park
the man saw a dog
the man saw a telescope
the man walked the man
the man walked the park
the man walked the dog
the man walked the telescope
...
Write a new function pcfg_generate(grammar)
which will generate a random tree from a PCFG according to its generative
probability model. Generation works top-down: start from the grammar's
start symbol, randomly choose a production whose LHS is that symbol, add
the RHS symbols as children of the root node, and recursively expand each
child node until you reach terminals. The result should be a tree, not
a sentence.
The generated parse trees should exhibit the distribution of expansions
for each non-terminal captured in the PCFG (that is, the generation process
should sample rules according to their weights for each non-terminal in the grammar).
Note: you must generate according to the PCFG distribution - not just randomly.
To this end, you should use the method generate() from the NLTK ProbDistI interface.
Specifically, you should use the
nltk.probability.DictionaryProbDist
class to represent the probability distribution of mapping an LHS non-terminal to one of the possible RHS.
>>> from nltk.grammar import Nonterminal
>>> from nltk.grammar import toy_pcfg2
>>> from nltk.probability import DictionaryProbDist
>>> productions = toy_pcfg2.productions()
# Get all productions with LHS=NP
>>> np_productions = toy_pcfg2.productions(Nonterminal('NP'))
>>> np_productions
[NP -> Det N [0.41], NP -> Name [0.28], NP -> NP PP [0.31]]
>>> probs = {}
>>> for pr in np_productions: probs[pr.rhs()] = pr.prob()
>>> np_probDist = DictionaryProbDist(probs)
# Generate a random RHS from np_probDist
>>> np_probDist.generate()
(Det, N)
# Each time you call generate(), you get a new random sample
>>> np_probDist.generate()
(Name,)
>>> np_probDist.generate()
(Name,)
>>> np_probDist.generate()
(NP, PP)
>>> np_probDist.generate()
(Det, N)
Function specification
pcfg_generate(grammar) -- return a tree sampled from the language described by the grammar
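As a minimal sketch of the top-down sampling idea, here is a pure-Python version that uses a plain dictionary in place of the NLTK grammar datatypes; the toy grammar below is a hypothetical stand-in for nltk.grammar.toy_pcfg2, and a real solution should sample via nltk.probability.DictionaryProbDist.generate() as shown above:

```python
import random

# Hypothetical toy PCFG: LHS -> list of (RHS, probability) pairs.
# Symbols absent from the dict are terminals.
GRAMMAR = {
    'S':    [(('NP', 'VP'), 1.0)],
    'NP':   [(('Det', 'N'), 0.6), (('Name',), 0.4)],
    'VP':   [(('V', 'NP'), 1.0)],
    'Det':  [(('the',), 1.0)],
    'N':    [(('man',), 0.5), (('dog',), 0.5)],
    'V':    [(('saw',), 1.0)],
    'Name': [(('John',), 1.0)],
}

def sample_rhs(productions):
    """Pick one RHS according to the rule probabilities."""
    r = random.random()
    for rhs, p in productions:
        r -= p
        if r <= 0:
            return rhs
    return productions[-1][0]  # guard against floating-point rounding

def pcfg_generate(grammar, symbol='S'):
    """Sample a tree top-down; a tree is a nested list [label, child, ...]."""
    if symbol not in grammar:      # terminal: return the word itself
        return symbol
    rhs = sample_rhs(grammar[symbol])
    return [symbol] + [pcfg_generate(grammar, child) for child in rhs]
```

The recursion mirrors the derivation process: each call expands exactly one non-terminal by one sampled rule.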
Validation
To validate the generate function, perform the following steps:
- Generate 1,000 random trees using nltk.grammar.toy_pcfg2 - store the resulting trees in a file "toy_pcfg2.gen".
- Compute the frequency distribution of each non-terminal and pre-terminal in the generated corpus.
You can look at the code of nltk.grammar.induce_pcfg(start, productions) to see how this can be done.
You should construct one distribution per non-terminal.
- For each distribution, compute the KL-divergence between the MLE estimation of the probability
distribution constructed on your test corpus and toy_pcfg2.
The MLE estimation is obtained by applying the
MLEProbDist estimator
to the observed empirical frequency distribution.
KL-divergence (Kullback-Leibler divergence
also known as relative entropy) estimates the difference between two distributions.
Read the notes on handling diverging KL computations.
- Explain your observations.
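For distributions represented as plain dicts mapping outcomes to probabilities, the divergence can be computed directly (a sketch; the handling of zero-probability outcomes follows the definition of relative entropy):

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum over x of p(x) * log2(p(x) / q(x)).
    Terms with p(x) == 0 contribute nothing; the divergence is
    infinite when q(x) == 0 for some x with p(x) > 0."""
    total = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return float('inf')
        total += px * math.log(px / qx, 2)
    return total
```

Here p would be the MLE distribution estimated from your generated corpus and q the corresponding rule distribution in toy_pcfg2, one pair per non-terminal.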
Question 1.2: Learn a PCFG from a Treebank
In this question, we will learn a PCFG from a treebank with different
types of tree annotations. We start from a subset of 200 trees from
the Penn Treebank (which is distributed by NLTK
in the corpus named "treebank").
In this question, we want to induce a PCFG from the treebank and
investigate its properties. For reference, NLTK provides a function
called induce_pcfg
in the nltk.grammar module. It starts from a set of productions and constructs a weighted grammar with the MLE
estimation of the production distributions.
Function specification
At this stage, we learn the PCFG "as is" -- without Chomsky Form Normalization -- but with "simplified" tags. That is, we must use a function to simplify all tags in the tree (non-terminal and terminal):
from nltk.corpus import LazyCorpusLoader, BracketParseCorpusReader
def simplify_functional_tag(tag):
    if '-' in tag:
        tag = tag.split('-')[0]
    return tag
treebank = LazyCorpusLoader('treebank/combined', BracketParseCorpusReader, r'wsj_.*\.mrg')
print treebank.parsed_sents()[:1]
-- we need to transform the tree to remove NONE tags and simplify tags.
We want all tags of the form "NP-SBJ-2" to be simplified into "NP". We also want to filter out "NONE" elements from the treebank. You must find the best way to perform NONE filtering on the trees. (NONE elements appear in the original Penn Treebank for example in the case of relative clauses.)
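One possible way to do the combined simplification and -NONE- filtering, sketched on nested-list trees rather than nltk.Tree objects (note that a node whose children have all been filtered out must itself be removed):

```python
def simplify_functional_tag(tag):
    # Keep tags such as -NONE- or -LRB- intact; strip functional
    # suffixes, e.g. NP-SBJ-2 -> NP.
    if '-' in tag and not tag.startswith('-'):
        return tag.split('-')[0]
    return tag

def filter_tree(tree):
    """Return a simplified copy of the tree, or None if the whole
    subtree dominates only -NONE- material."""
    if isinstance(tree, str):        # a leaf word
        return tree
    tag, children = tree[0], tree[1:]
    if tag == '-NONE-':
        return None
    kept = [c for c in (filter_tree(ch) for ch in children) if c is not None]
    if not kept:                     # everything below was -NONE-
        return None
    return [simplify_functional_tag(tag)] + kept
```

The same logic carries over to nltk.Tree by replacing list construction with Tree construction.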
pcfg_learn(treebank, n)
-- treebank is the nltk.corpus.treebank lazy corpus reader
-- n indicates the number of trees to read
-- return an nltk.WeightedGrammar
Data Exploration and Validation
Explore the following properties of the learned PCFG:
- How many productions are learned from the trees? How many interior nodes were in the treebank?
- Draw a plot of the distribution of productions according to their frequency (number of rules with frequency 1, 2, ...). What do you observe?
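The frequency-of-frequencies histogram can be computed with two Counter passes; the productions list below is a hypothetical stand-in for the (lhs, rhs) pairs collected from the trees:

```python
from collections import Counter

# Hypothetical productions collected from the treebank trees.
productions = [('NP', ('DT', 'NN')), ('NP', ('DT', 'NN')),
               ('VP', ('VB', 'NP')), ('NP', ('NNP',))]

rule_counts = Counter(productions)            # frequency of each rule
freq_of_freq = Counter(rule_counts.values())  # how many rules occur k times
# freq_of_freq[1] is the number of rules observed exactly once, etc.
```

Plotting freq_of_freq (e.g. with matplotlib) gives the requested distribution of productions by frequency.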
CNF PCFG
We now want to learn a PCFG in Chomsky Normal Form from the treebank, with simplified tags, and with filtered NONE elements.
The strategy is to convert the trees in the treebank into CNF, then to induce the PCFG from the transformed trees.
Use the function chomsky_normal_form in nltk.treetransforms
to convert the trees. Pay attention to NONE filtering. Use horizontal Markov annotation and parent annotation:
chomsky_normal_form(tree, factor='right', horzMarkov=1, vertMarkov=1, childChar='|', parentChar='^')
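The core of the CNF transformation is right-factoring n-ary nodes into binary ones. A stripped-down sketch on nested-list trees, without the horizontal-Markov and parent annotations that chomsky_normal_form adds via its horzMarkov and vertMarkov parameters:

```python
def binarize(tree):
    """Right-factor a nested-list tree so every node has at most
    two children; intermediate nodes are labelled with a '|' suffix."""
    if isinstance(tree, str):
        return tree
    tag, children = tree[0], [binarize(c) for c in tree[1:]]
    while len(children) > 2:
        # fold the last two children under a new intermediate node
        children = children[:-2] + [[tag + '|', children[-2], children[-1]]]
    return [tag] + children
```

In the NLTK version, the intermediate labels additionally record sibling context (horzMarkov) and the parent label (vertMarkov), which is exactly what makes the annotated grammar different in Question 1.3.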
pcfg_cnf_learn(treebank, n)
-- treebank is the nltk.corpus.treebank lazy corpus reader (simplified tags)
-- n indicates the number of trees to read
-- return an nltk.WeightedGrammar in CNF
- How many productions are learned from the CNF trees? How many interior nodes were in the original treebank, and in the CNF treebank?
- Compare the CNF and non-CNF figures (ratios). What do you conclude?
Question 1.3: Test CFG Independence Assumptions
We want to test the CFG hypothesis that a node expansion is independent of its location within a tree.
For each of the 2 treebanks -- before CNF transformation and after CNF transformation with the parameters set in
Question 1.2, report the following statistics, for the expansions of the NP category:
- Compute the distribution of each expansion (RHS) given the NP LHS for all NPs in the treebank. Draw a histogram plot.
- Compute the same distribution only for NPs that appear directly below a S node. Draw a histogram plot.
- Compute the same distribution only for NPs that appear directly below a VP node. Draw a histogram plot.
- Compare the distributions above. Which metric will provide a good measure of this comparison? (Use KL-divergence.)
- Conclude: does the data in the treebank confirm the CFG hypothesis for each configuration (original trees, annotated CNF trees)?
- Do you observe that annotated trees better fit the CFG independence assumption than the original trees?
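The per-context expansion counts above can be gathered in one traversal; a sketch on nested-list trees (normalize each Counter to obtain the conditional distributions, then compare them with KL-divergence):

```python
from collections import Counter

def np_expansions(tree, parent=None, counts=None):
    """Count RHS expansions of NP nodes, keyed by the parent label."""
    if counts is None:
        counts = {}
    if isinstance(tree, str):        # leaf word: nothing to count
        return counts
    tag, children = tree[0], tree[1:]
    if tag == 'NP':
        # The RHS is the sequence of child labels.
        rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
        counts.setdefault(parent, Counter())[rhs] += 1
    for child in children:
        np_expansions(child, tag, counts)
    return counts
```

counts[None] collects all NPs regardless of context; counts['S'] and counts['VP'] give the two conditioned distributions asked for above.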
Question 1.4: Model Selection: Comparing 2 Bigram Language Models
We have induced a PCFG from an annotated treebank in the hope of
obtaining a grammar where the PCFG independence assumption holds
better. We will want to compare the 2 learned PCFGs, without and with
annotations. We first address the issue of comparing two simple
probabilistic models by comparing their entropy on heldout data. To
practice, we will execute model selection on two variants of a bigram
language model.
A bigram model is learned as the estimation of a bigram probability
distribution derived from a frequency distribution of bigrams. We
read sentence by sentence the linearized sentences from our Penn
Treebank subset (use tree.leaves()), and add the dummy START symbol
(the dot symbol already appears as an END symbol in our dataset).
N-gram models are implemented in the nltk.model.ngram module.
Function Specification
bigram_learn(corpus, n, estimator)
-- return a bigram model acquired from the corpus
-- n is the number of trees from the corpus to use for training
-- estimator is a function that takes a ConditionalFreqDist and returns a ConditionalProbDist.
-- By default, estimator is None, meaning the MLEProbDist estimator is used.
Model Selection: Heldout Entropy
To validate the bigram model, perform the following steps:
- Train your bigram model on 80% of the raw treebank text -- once with the MLEProbDist estimator and once
with the LidstoneProbDist estimator.
(See ngramModel.demo() for details on how the estimator should be used.)
- Compute the entropy of the held out 20% of the treebank text using each of the 2 bigram models.
- Which bigram model is better? Explain.
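The whole pipeline can be sketched in plain Python: bigram counts plus Lidstone smoothing, where alpha=0 reproduces the MLE model (which assigns probability zero, hence infinite entropy, to unseen bigrams). Function and symbol names here are hypothetical, not the nltk.model API:

```python
import math
from collections import Counter, defaultdict

def bigram_counts(sentences):
    """Bigram frequency table, with a dummy <START> symbol per sentence."""
    counts = defaultdict(Counter)
    for sent in sentences:
        prev = '<START>'
        for word in sent:
            counts[prev][word] += 1
            prev = word
    return counts

def cross_entropy(counts, sentences, alpha=0.0):
    """Per-word cross-entropy (bits) of heldout sentences under a
    Lidstone(alpha)-smoothed bigram model; alpha=0 is the MLE model."""
    vocab = set(w for c in counts.values() for w in c)
    total, n = 0.0, 0
    for sent in sentences:
        prev = '<START>'
        for word in sent:
            ctx = counts[prev]
            p = (ctx[word] + alpha) / (sum(ctx.values()) + alpha * len(vocab))
            total -= math.log(p, 2)  # raises ValueError for unseen MLE bigrams
            n += 1
            prev = word
    return total / n
```

Comparing cross_entropy with alpha=0 and alpha>0 on heldout text is exactly the model-selection experiment described above.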
Question 1.5: Model Selection: Annotated PCFG vs. Original PCFG
We now turn to the task of learning a PCFG on an annotated treebank and compare the PCFG learned with annotations and that without annotations.
We will compare the 2 PCFGs learned in Question 1.2.
Perform the following steps:
- Train your PCFG models (original trees and annotated CNF trees) on 80% of the treebank.
- Compute the entropy of the held out 20% of the treebank using each of the 2 PCFG models.
- Which PCFG model is better? Explain.
- Can you compare the heldout entropy of the bigram models and that of the PCFG models? Explain.
Question 2: Error Analysis of a Parser
In this question, we will perform evaluation and error analysis of an
existing statistical parser. We will assess the behavior of
the Stanford
CoreNLP parser on our treebank sample. (Note that this will give
us optimistic results, since we will test on data that was used to
train this parser in the first place.)
This parser is implemented in Java and can return either constituency
or dependency syntactic descriptions of sentences. It includes sentence
segmentation, word tokenization, POS tagging, parsing, and additional
basic services. It provides close to state-of-the-art results.
The execution of the CoreNLP process requires a 64-bit machine with at least 4GB of RAM.
You will find the results of the CoreNLP parsing of the whole NLTK treebank sample in
this archive. If you want, you can also download the
parser and execute the parses on your own, but this is not necessary to answer this question.
If you want to run the parser under Windows, the following batch file may be useful.
Question 2.1: Parseval
The CoreNLP parsed files appear in the following form, in the same order as in the merged files in the treebank:
Sentence #1 (18 tokens):
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
[Text=Pierre CharacterOffsetBegin=0 CharacterOffsetEnd=6 PartOfSpeech=NNP Lemma=Pierre NamedEntityTag=PERSON] [Text=Vinken CharacterOffsetBegin=7 CharacterOffsetEnd=13 PartOfSpeech=NNP Lemma=Vinken NamedEntityTag=PERSON] [Text=, CharacterOffsetBegin=13 CharacterOffsetEnd=14 PartOfSpeech=, Lemma=, NamedEntityTag=O] [Text=61 CharacterOffsetBegin=15 CharacterOffsetEnd=17 PartOfSpeech=CD Lemma=61 NamedEntityTag=DURATION NormalizedNamedEntityTag=61.0 Timex=61 years old] [Text=years CharacterOffsetBegin=18 CharacterOffsetEnd=23 PartOfSpeech=NNS Lemma=year NamedEntityTag=NUMBER NormalizedNamedEntityTag=0.0 Timex=61 years old] [Text=old CharacterOffsetBegin=24 CharacterOffsetEnd=27 PartOfSpeech=JJ Lemma=old NamedEntityTag=DURATION Timex=61 years old] [Text=, CharacterOffsetBegin=27 CharacterOffsetEnd=28 PartOfSpeech=, Lemma=, NamedEntityTag=O] [Text=will CharacterOffsetBegin=29 CharacterOffsetEnd=33 PartOfSpeech=MD Lemma=will NamedEntityTag=O] [Text=join CharacterOffsetBegin=34 CharacterOffsetEnd=38 PartOfSpeech=VB Lemma=join NamedEntityTag=O] [Text=the CharacterOffsetBegin=39 CharacterOffsetEnd=42 PartOfSpeech=DT Lemma=the NamedEntityTag=O] [Text=board CharacterOffsetBegin=43 CharacterOffsetEnd=48 PartOfSpeech=NN Lemma=board NamedEntityTag=O] [Text=as CharacterOffsetBegin=49 CharacterOffsetEnd=51 PartOfSpeech=IN Lemma=as NamedEntityTag=O] [Text=a CharacterOffsetBegin=52 CharacterOffsetEnd=53 PartOfSpeech=DT Lemma=a NamedEntityTag=O] [Text=nonexecutive CharacterOffsetBegin=54 CharacterOffsetEnd=66 PartOfSpeech=JJ Lemma=nonexecutive NamedEntityTag=O] [Text=director CharacterOffsetBegin=67 CharacterOffsetEnd=75 PartOfSpeech=NN Lemma=director NamedEntityTag=O] [Text=Nov. CharacterOffsetBegin=76 CharacterOffsetEnd=80 PartOfSpeech=NNP Lemma=Nov. NamedEntityTag=DATE NormalizedNamedEntityTag=XXXX-11-29 Timex=Nov. 29] [Text=29 CharacterOffsetBegin=81 CharacterOffsetEnd=83 PartOfSpeech=CD Lemma=29 NamedEntityTag=DATE NormalizedNamedEntityTag=XXXX-11-29 Timex=Nov. 29] [Text=. CharacterOffsetBegin=83 CharacterOffsetEnd=84 PartOfSpeech=. Lemma=. NamedEntityTag=O]
(ROOT
(S
(NP
(NP (NNP Pierre) (NNP Vinken))
(, ,)
(ADJP
(NP (CD 61) (NNS years))
(JJ old))
(, ,))
(VP (MD will)
(VP (VB join)
(NP (DT the) (NN board))
(PP (IN as)
(NP (DT a) (JJ nonexecutive) (NN director)))
(NP-TMP (NNP Nov.) (CD 29))))
(. .)))
nn(Vinken-2, Pierre-1)
nsubj(join-9, Vinken-2)
num(years-5, 61-4)
npadvmod(old-6, years-5)
amod(Vinken-2, old-6)
aux(join-9, will-8)
root(ROOT-0, join-9)
det(board-11, the-10)
dobj(join-9, board-11)
det(director-15, a-13)
amod(director-15, nonexecutive-14)
prep_as(join-9, director-15)
tmod(join-9, Nov.-16)
num(Nov.-16, 29-17)
This includes for each sentence the following information:
- The sentence number within the file (starting at #1) and the number of tokens in the sentence.
- The raw sentence (reprinted after tokenization).
- A chunked representation of the sentence with named entities marked with [] brackets.
- A constituent-based parse tree of the sentence (starting with a ROOT constituent).
- A dependency-based parse tree of the sentence represented as label(head-index, child-index) triplets.
Your task is to compute the Parseval metric for the constituent-based parse trees of CoreNLP on our treebank.
Report the following metrics:
- Precision
- Recall
- F-measure
- Labeled Precision
- Labeled Recall
- Labeled F-Measure
Where precision and recall are computed over "constituents" between
the CoreNLP trees and the treebank trees: a constituent is a triplet
(interior-node, first-index, last-index) - where first-index is the
index of the first word in the sentence covered by the interior node,
and last-index that of the last word (that is, [first-index,
last-index] is the yield of the interior node).
In labeled precision and recall, two constituents match only if they also
have the same interior-node label. For unlabeled metrics, we compare only
the first and last indexes. Make sure you compare the trees from the
treebank after they have been simplified: for example, S-TPC-1 must be
compared as S and ADVP-TMP as ADVP, and -NONE- nodes must have been eliminated.
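A sketch of the metric on nested-list trees (a stand-in for nltk.Tree); following the standard convention, preterminal nodes, i.e. a POS tag over a single word, are not counted as constituents:

```python
from collections import Counter

def spans(tree, start=0):
    """Return ([(label, first, last), ...], next_word_index);
    preterminals are skipped, as is standard in Parseval."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return [], start + 1          # preterminal consumes one word
    out, i = [], start
    for child in children:
        child_spans, i = spans(child, i)
        out.extend(child_spans)
    out.append((label, start, i - 1))
    return out, i

def parseval(gold, test, labeled=True):
    """(precision, recall, F1) over matched constituents."""
    g, t = spans(gold)[0], spans(test)[0]
    if not labeled:
        g = [(f, l) for _, f, l in g]
        t = [(f, l) for _, f, l in t]
    matched = sum((Counter(g) & Counter(t)).values())   # multiset match
    p, r = matched / len(t), matched / len(g)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```

Summing matched, gold, and test counts over all sentence pairs (rather than averaging per-sentence scores) gives the corpus-level figures that evalb reports.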
You can read more on this metric in the following references:
- Notes on parsing evaluation by M. Dickinson (Currently not available)
- An assignment on parsing evaluation by Scott Farrar includes detailed explanation and examples on how the metric is computed.
- Evalb software: this is the C program used in all standard evaluations of constituent parsers (by Satoshi Sekine and Michael Collins).
- Evalb in Java for those who prefer reading Java over C - is part of the Stanford CoreNLP package.
- Evalc-GUI in Java: in addition to computing the metrics, this package shows the trees that are compared and highlights
graphically the differences in each pair.
Note: CoreNLP can label some NPs as NP-TMP (that is, an NP that denotes a temporal expression).
In your comparison, you should use a version of the tag simplification which keeps these labels, for example:
def simplify_corenlp_tag(tag):
    if tag[:6] == 'NP-TMP':
        return 'NP-TMP'
    if '-' in tag:
        tag = tag.split('-')[0]
    if '=' in tag:
        tag = tag.split('=')[0]
    return tag
( (S
(S-TPC-1
(NP-SBJ
(NP
(NP (DT A) (NN form) )
(PP (IN of)
(NP (NN asbestos) )))
(RRC
(ADVP-TMP (RB once) )
(VP (VBN used)
(NP (-NONE- *) )
(S-CLR
(NP-SBJ (-NONE- *) )
(VP (TO to)
(VP (VB make)
(NP (NNP Kent) (NN cigarette) (NNS filters) )))))))
(VP (VBZ has)
(VP (VBN caused)
(NP
(NP (DT a) (JJ high) (NN percentage) )
(PP (IN of)
(NP (NN cancer) (NNS deaths) ))
(PP-LOC (IN among)
(NP
(NP (DT a) (NN group) )
(PP (IN of)
(NP
(NP (NNS workers) )
(RRC
(VP (VBN exposed)
(NP (-NONE- *) )
(PP-CLR (TO to)
(NP (PRP it) ))
(ADVP-TMP
(NP
(QP (RBR more) (IN than) (CD 30) )
(NNS years) )
(IN ago) )))))))))))
(, ,)
(NP-SBJ (NNS researchers) )
(VP (VBD reported)
(SBAR (-NONE- 0)
(S (-NONE- *T*-1) )))
(. .) ))
Question 2.2: Accuracy per Distance
In general, parsers do a much better job on short constituents than
long ones. Draw a plot of the accuracy of constituents per
constituent length. The length of a constituent (node, first, last)
is last-first+1. For a given constituent length X, accuracy is the
number of constituents of length X in the parsed tree that also appear
in the gold treebank tree, divided by the total number of constituents
of length X.
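Given the gold and parsed constituent lists from the Parseval computation of Question 2.1, the per-length accuracy is a small bookkeeping exercise (spans here are (label, first, last) triplets):

```python
from collections import Counter

def accuracy_by_length(gold_spans, test_spans):
    """Fraction of parsed constituents of each length that also occur
    in the gold tree (multiset matching)."""
    gold = Counter(gold_spans)
    correct, total = Counter(), Counter()
    for span in test_spans:
        length = span[2] - span[1] + 1    # last - first + 1
        total[length] += 1
        if gold[span] > 0:                # consume one gold copy
            gold[span] -= 1
            correct[length] += 1
    return {n: correct[n] / total[n] for n in total}
```

The per-label accuracy of Question 2.3 is the same computation keyed on span[0] instead of the length.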
Question 2.3: Accuracy per Label
Report accuracy of constituents per label type (S, SBAR, NP, VP, PP etc).
For each node type, report number of occurrences and accuracy.
Question 3: Fine-grained Labeling of Parse Trees (Optional for Bonus Credit)
We have observed that the original Penn Treebank uses fine-grained
annotations on some of the labels used in the trees. These
annotations indicate the type of grammatical function fulfilled by
each constituent. The guidelines to add these functional annotations
are summarized below
(from Penn Treebank II Tags):
- Grammatical role:
- -DTV (dative) - marks the dative object in the unshifted form of the double object construction.
- If the preposition introducing the "dative" object is for, it is considered benefactive (-BNF).
- -DTV (and -BNF) is only used after verbs that can undergo dative shift.
- -LGS (logical subject) - is used to mark the logical subject in passives. It attaches to the NP object of by and not to the PP node itself.
- -PRD (predicate) - marks any predicate that is not VP. In the do so construction, the so is annotated as a predicate.
- -PUT - marks the locative complement of put.
- -SBJ (surface subject) - marks the structural surface subject of both matrix and embedded clauses, including those with null subjects.
- -TPC ("topicalized") - marks elements that appear before the subject in a topicalized declarative sentence, but in two cases only:
- if the front element is associated with a *T* in the position of the gap.
- if the fronted element is left-dislocated (i.e. it is associated with a resumptive pronoun in the position of the gap).
- -VOC (vocative) - marks nouns of address, regardless of their position in the sentence. It is not coindexed with the subject and does not receive -TPC when it is sentence-initial.
- Adverbials: Adverbials are generally VP adjuncts.
- -BNF (benefactive) - marks the beneficiary of an action (attaches to NP or PP).
- This tag is used only when (1) the verb can undergo dative shift and (2) the prepositional variant (with the same meaning) uses for.
- The prepositional objects of dative-shifting verbs with other prepositions than for (such as to or of) are annotated -DTV.
- -DIR (direction) - marks adverbials that answer the questions "from where?" and "to where?"
- It implies motion, which can be metaphorical as in "...rose 5 pts. to 57-1/2" or "increased 70% to 5.8 billion yen"
- -DIR is most often used with verbs of motion/transit and financial verbs.
- -EXT (extent) - marks adverbial phrases that describe the spatial extent of an activity.
- -EXT was incorporated primarily for cases of movement in financial space, but is also used in analogous situations elsewhere.
- Obligatory complements do not receive -EXT.
- Words such as fully and completely are absolutes and do not receive -EXT.
- -LOC (locative) - marks adverbials that indicate place/setting of the event.
- -LOC may also indicate metaphorical location. There is likely to be some variation in the use of -LOC due to differing annotator interpretations.
- In cases where the annotator is faced with a choice between -LOC or -TMP, the default is -LOC.
- In cases involving SBAR, SBAR should not receive -LOC.
- -LOC has some uses that are not adverbial, such as with place names that are adjoined to other NPs and NAC-LOC premodifiers of NPs.
- The special tag -PUT is used for the locative argument of put.
- -MNR (manner) - marks adverbials that indicate manner, including instrument phrases.
- -PRP (purpose or reason) - marks purpose or reason clauses and PPs.
- -TMP (temporal) - marks temporal or aspectual adverbials that answer the questions when, how often, or how long.
- It has some uses that are not strictly adverbial, such as with dates that modify other NPs at S- or VP-level.
- In cases of apposition involving SBAR, the SBAR should not be labeled -TMP.
- Only in "financialspeak," and only when the dominating PP is a PP-DIR, may temporal modifiers be put at PP object level.
- Note that -TMP is not used in possessive phrases.
Parsers such as CoreNLP do not recover these fine-grained labels, but
only the simplified ones. (The main reason for this tradition of
ignoring fine-grained labels is that it is widely believed that these
labels are not consistent across Penn Treebank human annotators - see
for
example Handling
noisy training and testing data.)
In this question, we attempt to recover the fine-grained labels of
PP-TMP and PP-LOC (these two tags are the most frequent annotated tags
in the treebank). This task is a classification task. Your objective
is, given a parse tree containing only simplified labels (as
produced by CoreNLP), to recover the tags PP-TMP and PP-LOC as they
appear in the original Treebank.
To this end, you must define features which can be helpful predictors of the label PP-TMP, PP-LOC or PP with no annotation.
You can use the additional information returned by CoreNLP to extract useful features:
- The full parse tree
- The dependency parse tree
- The chunked version with recognized named-entities.
Train your classifier on 80% of the PP nodes and test it on the remaining 20%. You should propose an encoding of structural features and choose
an appropriate classifier.
Evaluate your classifier and report per-label precision, recall, F-measure. Discuss your observations.
Useful features include:
- Label of the constituent, of its parent, of its left and right siblings.
- Lexical term of the preposition
- Lexical head of the NP object of the PP (best obtained by using the dependency representation of the sentence)
- Word representations of the lexical heads of the NP object of the PP as provided by Word Representations for NLP.
- Whether a named-entity is tagged as part of the NP object of the PP and the type of this NE.
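For instance, one PP occurrence could be encoded as a flat feature dict, suitable for a classifier such as nltk.NaiveBayesClassifier; all the argument names below are hypothetical, and the values would be extracted from the CoreNLP constituency tree, dependency triplets, and NE chunks:

```python
def pp_features(prep, np_head, parent, left, right, ne_type):
    """Encode one PP node as a feature dictionary."""
    return {
        'prep': prep.lower(),          # lexical preposition
        'np_head': np_head.lower(),    # head of the NP object of the PP
        'parent': parent,              # label of the parent constituent
        'left': left or '<NONE>',      # left sibling label, if any
        'right': right or '<NONE>',    # right sibling label, if any
        'ne': ne_type or 'O',          # NE type inside the NP object
    }
```

For the PP-LOC example "among a group of workers" above, this might yield {'prep': 'among', 'np_head': 'group', 'parent': 'NP', ...}.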
Last modified 07 Jan 2013