Assignment 3

Due: Monday 3 July 2012 Midnight

Natural Language Processing - Spring 2012 Michael Elhadad

This assignment covers 2 topics:

  1. Machine Learning methods (SVM) for Named Entity Recognition
  2. CFG Parsing

Submit your solution in the form of an HTML file, using the same CSS as this page with code inside <pre> tags. Images should be submitted as PNG files.


Question 1: Named Entity Recognition with SVM

Named Entity Recognition

The task of Named Entity Recognition (NER) involves the recognition of names of persons, locations, organizations, and dates in free text. For example, the following sentence is tagged with sub-sequences indicating PER (for persons), LOC (for locations) and ORG (for organizations):
Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid.

[PER Wolff ] , currently a journalist in [LOC Argentina ] , played with [PER Del Bosque ] in the final years 
of the seventies in [ORG Real Madrid ] .
NER involves 2 sub-tasks: identifying the boundaries of such expressions (the open and close brackets) and labeling the expressions (with tags such as PER, LOC or ORG). As for the task of chunking, this sequence labeling task is mapped to a classification task, using a BIO encoding of the data:
        Wolff B-PER
            , O
    currently O
            a O
   journalist O
           in O
    Argentina B-LOC
            , O
       played O
         with O
          Del B-PER
       Bosque I-PER
           in O
          the O
        final O
        years O
           of O
          the O
    seventies O
           in O
         Real B-ORG
       Madrid I-ORG
            . O

Dataset

The dataset we will use for this question is derived from the CoNLL 2002 shared task - which is about NER in Spanish and Dutch. The dataset is included in the NLTK distribution. Explanations on the dataset are provided in the CoNLL 2002 page.

To access the data in Python, do:

from nltk.corpus import conll2002

conll2002.chunked_sents('esp.train') # In Spanish
conll2002.chunked_sents('esp.testa') # In Spanish
conll2002.chunked_sents('esp.testb') # In Spanish

conll2002.chunked_sents('ned.train') # In Dutch
conll2002.chunked_sents('ned.testa') # In Dutch
conll2002.chunked_sents('ned.testb') # In Dutch
The data consists of three files per language (Spanish and Dutch): one training file and two test files testa and testb. The first test file is to be used in the development phase for optimizing parameters for the learning system. The second test file will be used for the final evaluation.
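
The chunked_sents accessor returns chunk trees; the reader also exposes the BIO tags directly through iob_sents, which is convenient for the feature extraction described below:

from nltk.corpus import conll2002

# Each sentence is a list of (word, POS, BIO-tag) triples.
train_sents = conll2002.iob_sents('esp.train')
dev_sents = conll2002.iob_sents('esp.testa')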

SVM Models

Support Vector Machines (SVMs) are a machine learning classification method that is well adapted to sequence labeling and can combine very large sets of features.

We will use the libsvm implementation of SVM in this question. libsvm is written in C++ and includes a Python interface. The download comes with a pre-compiled Windows version and a makefile for Linux - including the Python wrapper.

A Practical Guide to Support Vector Classification (Hsu, Chang and Lin 2009) presents a detailed procedure on how to run libsvm on practical datasets.

libsvm takes as input ASCII files encoding the training data and learns a model. Test data, also encoded in ASCII files, is then submitted to the trained model. The format of training and test data files is:

<label> <index1>:<value1> <index2>:<value2> ...
.
.
Each line contains an instance and is ended by a '\n' character. For classification, <label> is an integer indicating the class label.

NOTE: libsvm supports multi-class classification (not only binary classification). In our case, we must use one class for each BIO tag (B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, O). You should assign integer values for each class in the SVM encoding.

<index>:<value> gives a feature (attribute) value. <index> is an integer starting from 1 and <value> is a real number. In our case, we will only use binary features, so the value should only be 1 for present features. All absent features are assumed to be 0. (Do not encode 0 values explicitly otherwise the RAM usage will become unbearable.) Indices must be in ASCENDING order.

To check if your data is in a correct form, use `tools/checkdata.py' (details in `tools/README').

Type `svm-train ner', and the program will read the training data and output the model file `ner.model'. Put test data in the right format in a file called ner.t, then type `svm-predict ner.t ner.model output' to see the prediction accuracy. The `output' file contains the predicted class labels.

Alternatively, you can use the Python wrapper of libsvm to work directly on SVM models and parameters. The libsvm package includes a version pre-compiled under Windows for Python 2.7.

Here is a sample session in Python showing the basic functionality of the Python wrapper:

>>> from svm import *
# First parameter is a list of classes
# Second parameter is a list of vectors in the training set
>>> p = svm_problem([1,-1], [[1,0,1],[-1,0,-1]])  
>>> param = svm_parameter()
>>> param.kernel_type = LINEAR
>>> param.C = 10
>>> m = svm_model(p, param)
optimization finished, #iter = 1
nu = 0.025000
obj = -0.250000, rho = 0.000000
nSV = 2, nBSV = 0
Total nSV = 2
>>> m.predict([1,1,1])
1.0
>>> m.save('test.model')
>>> n = svm_model('test.model')
>>> n.predict([1,1,1])
1.0
If you have any trouble using the Python wrapper, you can use the regular command line tools on ASCII files as described above. Here is a sample training file a.train:
+1 1:1 10:1
+2 2:1 10:1
+3 3:1 11:1
The corresponding training run:
\libsvm-3.12>windows\svm-train.exe -t 0 -c 10 a.train
*
optimization finished, #iter = 1
nu = 0.100000
obj = -1.000000, rho = 0.000000
nSV = 2, nBSV = 0
*
optimization finished, #iter = 1
nu = 0.050000
obj = -0.500000, rho = 0.000000
nSV = 2, nBSV = 0
*
optimization finished, #iter = 1
nu = 0.050000
obj = -0.500000, rho = 0.000000
nSV = 2, nBSV = 0
Total nSV = 3
This creates a file a.train.model. The corresponding test data is stored in the file a.test (note that test instances also begin with a label, which svm-predict uses to compute accuracy):
+2 2:1 10:1
+3 2:1 11:1
To test this data, use svm-predict, which outputs its predictions in the a.out file:
\libsvm-3.12>windows\svm-predict.exe a.test a.train.model a.out
Accuracy = 50% (1/2) (classification)

Your Task

Your task when using SVM consists of:
  1. Choose the right features for encoding the problem.
  2. Encode your training dataset in the ASCII format expected by libsvm (that is, extract features).
  3. Select the optimal parameters when running SVM. The most important parameters are the kernel type and the C parameter.
  4. Train and test the model.
  5. Perform error analysis.
Here is a list of features that have been found appropriate for NER in previous work:
  1. The word form (the string as it appears in the sentence)
  2. The POS of the word
  3. ORT - a feature that captures the orthographic (letter) structure of the word. It can have any of the following values: number, contains-digit, contains-hyphen, capitalized, all-capitals, URL, punctuation, regular.
  4. prefix1: first letter of the word
  5. prefix2: first two letters of the word
  6. prefix3: first three letters of the word
  7. suffix1: last letter of the word
  8. suffix2: last two letters of the word
  9. suffix3: last three letters of the word
To encode sentences into feature vectors, you must first pre-process the training set and assign numbers to each observed feature. You must therefore write Python code that will maintain dictionaries of each feature, and be able to compute vectors representing each word in test data.

For example, given the following toy training data, the encoding of the features would be:

        Wolff NP  B-PER
            , ,   O
    currently RB  O
            a AT  O
   journalist NN  O
           in IN  O
    Argentina NP  B-LOC
            , ,   O
       played VBD O
         with IN  O
          Del NP  B-PER
       Bosque NP  I-PER
           in IN  O
          the AT  O
        final JJ  O
        years NNS O
           of IN  O
          the AT  O
    seventies NNS O
           in IN  O
         Real NP  B-ORG
       Madrid NP  I-ORG
            . .   O

Classes
1 B-PER
2 I-PER
3 B-LOC
4 I-LOC
5 B-ORG
6 I-ORG
7 O

Feature WORD-FORM:
1 Wolff
2 ,
3 currently
4 a
5 journalist
6 in
7 Argentina
8 played
9 with
10 Del
11 Bosque
12 the
13 final
14 years
15 of
16 seventies
17 Real
18 Madrid
19 .

Feature POS
20 NP
21 ,
22 RB
23 AT
24 NN
25 VBD
26 JJ
27 NNS
28 .

Feature ORT
29 number
30 contains-digit
31 contains-hyphen
32 capitalized
33 all-capitals
34 URL
35 punctuation
36 regular

Feature Prefix1
37 W
38 ,
39 c
40 a
41 j
42 i
43 A
44 p
45 w
46 D
47 B
48 t
49 f
50 y
51 o
52 s
53 .
Given this encoding, we can compute the vector representing the first word "Wolff NP B-PER" as:
# Class: B-PER=1
# Word-form: Wolff=1
# POS: NP=20
# ORT: capitalized=32
# prefix1: W=37
1 1:1 20:1 32:1 37:1
When you encode the test dataset, some of the word forms will be unknown (not seen in the training dataset). You should therefore reserve a special "unknown" value for each feature type where unseen values are expected.
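
As an illustration, here is a minimal sketch of such an encoder (all names are ours, not prescribed): it maintains one index dictionary over (feature-type, value) pairs, reserves an '<UNK>' index per feature type for test-time fallback, and emits one libsvm line per token.

CLASSES = {'B-PER': 1, 'I-PER': 2, 'B-LOC': 3, 'I-LOC': 4,
           'B-ORG': 5, 'I-ORG': 6, 'O': 7}
TYPES = ['form', 'pos', 'pre1', 'pre2', 'pre3', 'suf1', 'suf2', 'suf3']

# Reserve one '<UNK>' index per feature type for unseen test values.
feature_index = dict(((t, '<UNK>'), i + 1) for i, t in enumerate(TYPES))

def features(word, pos):
    # The feature templates listed above (the ORT feature is omitted
    # for brevity; it would be one more (type, value) pair).
    return [('form', word), ('pos', pos),
            ('pre1', word[:1]), ('pre2', word[:2]), ('pre3', word[:3]),
            ('suf1', word[-1:]), ('suf2', word[-2:]), ('suf3', word[-3:])]

def encode_sent(sent, training=False):
    """Yield one libsvm line per (word, pos, bio) triple in sent."""
    for word, pos, bio in sent:
        idxs = set()
        for key in features(word, pos):
            if key not in feature_index:
                if training:                  # new feature: allocate an index
                    feature_index[key] = len(feature_index) + 1
                else:                         # unseen at test time
                    key = (key[0], '<UNK>')
            idxs.add(feature_index[key])
        # libsvm requires indices in ascending order
        yield '%d %s' % (CLASSES[bio],
                         ' '.join('%d:1' % i for i in sorted(idxs)))

Writing the lines produced by encode_sent(sent, training=True) for every training sentence to a file yields training data in the libsvm format shown above.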

Your task is to encode the training and test datasets, experiment with good SVM parameters, and present accuracy and error analysis: precision and recall per tag, and a confusion matrix.
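
For the error analysis, here is a minimal sketch (the function name and layout are ours) that computes precision and recall per tag and a confusion matrix from parallel lists of gold and predicted labels, e.g. read back from libsvm's output file:

from collections import defaultdict

def evaluate(gold, predicted, labels):
    """Per-tag precision/recall and a gold-by-predicted confusion matrix."""
    confusion = defaultdict(int)
    for g, p in zip(gold, predicted):
        confusion[(g, p)] += 1
    for t in labels:
        tp = confusion[(t, t)]
        fp = sum(confusion[(g, t)] for g in labels if g != t)
        fn = sum(confusion[(t, p)] for p in labels if p != t)
        prec = tp / float(tp + fp) if tp + fp else 0.0
        rec = tp / float(tp + fn) if tp + fn else 0.0
        print '%-6s precision=%.3f recall=%.3f' % (t, prec, rec)
    print '\t' + '\t'.join(labels)            # confusion matrix header
    for g in labels:
        print g + '\t' + '\t'.join(str(confusion[(g, p)]) for p in labels)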



Question 2: PCFG Parsing

Question 2.1: Random PCFG Generation

Consider the PCFG datatype implementation provided in NLTK, and the module nltk.parse.generate. It contains a function generate(grammar) which, given a CFG grammar, produces all the sentences the grammar can produce. Naturally, this may produce an infinite number of sentences if the grammar is recursive.
>>> from nltk.parse import generate
>>> for sent in generate.generate(grammar):   # grammar: a small toy CFG
...     print ' '.join(sent)
the man saw the man
the man saw the park
the man saw the dog
the man saw the telescope
the man saw a man
the man saw a park
the man saw a dog
the man saw a telescope
the man walked the man
the man walked the park
the man walked the dog
the man walked the telescope
...
Write a new function pcfg_generate(grammar) which generates a random tree from a PCFG. Generation works top down: pick the start symbol of the grammar, randomly choose a production starting from the start symbol, add the RHS as children of the root node, then expand each node until you obtain terminals. The result should be a tree, not a sentence. The generated parse trees should exhibit the distributions over productions captured in the PCFG (that is, the generation process should sample rules according to the weights of each production in the grammar). Note: you must generate according to the PCFG distribution - not just uniformly at random. To this end, use the method generate() from the NLTK ProbDistI interface. Specifically, use the nltk.probability.DictionaryProbDist class to represent the probability distribution mapping an LHS non-terminal to one of its possible RHS.
>>> from nltk.grammar import Nonterminal, toy_pcfg2
>>> from nltk.probability import DictionaryProbDist

>>> productions = toy_pcfg2.productions()

# Get all productions with LHS=NP
>>> np_productions = toy_pcfg2.productions(Nonterminal('NP'))
>>> np_productions
[NP -> Det N [0.41], NP -> Name [0.28], NP -> NP PP [0.31]]

>>> probs = {}
>>> for pr in np_productions: probs[pr.rhs()] = pr.prob()
>>> np_probDist = DictionaryProbDist(probs)

# Each call to generate() draws a random RHS from np_probDist
>>> np_probDist.generate()
(Det, N)
>>> np_probDist.generate()
(Name,)
>>> np_probDist.generate()
(Name,)
>>> np_probDist.generate()
(NP, PP)
>>> np_probDist.generate()
(Det, N)

Function specification

 
pcfg_generate(grammar) -- return a tree sampled from the language described by the grammar
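
Here is a minimal sketch of one way to implement it with DictionaryProbDist, as suggested above (no depth limit; in practice you would cache one distribution per LHS instead of rebuilding it at every expansion):

from nltk import Tree
from nltk.grammar import Nonterminal
from nltk.probability import DictionaryProbDist

def sample_rhs(grammar, lhs):
    # Distribution over the possible right-hand sides of lhs
    probs = {}
    for prod in grammar.productions(lhs):
        probs[prod.rhs()] = prod.prob()
    return DictionaryProbDist(probs).generate()

def expand(grammar, symbol):
    if not isinstance(symbol, Nonterminal):
        return symbol                        # a terminal becomes a leaf
    children = [expand(grammar, s) for s in sample_rhs(grammar, symbol)]
    return Tree(symbol.symbol(), children)

def pcfg_generate(grammar):
    """Return a tree sampled from the language described by grammar."""
    return expand(grammar, grammar.start())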

Validation

To validate the generate function, perform the following steps:

Question 2.2: Learn a PCFG from a Treebank

In this question, we will learn a PCFG from a treebank. We start from a subset (of about 5%) of the trees of the Penn Treebank (distributed by NLTK in the corpus named "treebank").

In this question, we want to induce a PCFG from the treebank and investigate its properties. For reference, NLTK provides a function called induce_pcfg in the nltk.grammar module. It starts from a set of productions and constructs a weighted grammar with the MLE estimation of the production distributions.

Function specification

At this stage, we learn the PCFG "as is" -- without Chomsky Normal Form conversion -- but with "simplified" tags. That is, we load the treebank with:

treebank = LazyCorpusLoader('treebank/combined', BracketParseCorpusReader, r'wsj_.*\.mrg', tag_mapping_function=simplify_wsj_tag)

so that all tags of the form "NP-SBJ-2" are simplified into "NP". We also want to filter out "NONE" elements from the treebank. You must find the best way to perform NONE filtering on the trees.
 
pcfg_learn(treebank, n) 
-- treebank is the nltk.corpus.treebank lazy corpus reader
-- n indicates the number of trees to read
-- return an nltk.WeightedGrammar
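
Here is one possible sketch. Two assumptions are ours: NONE filtering is done by pruning -NONE- subtrees (and any interior nodes they leave childless), and 'S' is taken as the start symbol; the filter_none helper is illustrative, not prescribed.

from nltk import Tree
from nltk.grammar import Nonterminal, induce_pcfg

def filter_none(tree):
    """Copy tree without -NONE- subtrees and without interior nodes
    emptied by the filtering; return None if nothing remains."""
    if not isinstance(tree, Tree):
        return tree                      # a leaf (word) is kept as is
    if tree.node == '-NONE-':            # NLTK 2 API: tree.node is the label
        return None
    children = [c for c in map(filter_none, tree) if c is not None]
    return Tree(tree.node, children) if children else None

def pcfg_learn(treebank, n):
    productions = []
    for tree in treebank.parsed_sents()[:n]:
        tree = filter_none(tree)
        if tree is not None:
            productions += tree.productions()
    return induce_pcfg(Nonterminal('S'), productions)   # assuming start 'S'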

Data Exploration and Validation

Explore the following properties of the learned PCFG:
  1. How many productions are learned from the trees? How many interior nodes were in the treebank?
  2. Draw a plot of the distribution of productions according to their frequency (number of rules with frequency 1, 2, ...). What do you observe?
  3. Assume we "cut" the tail of the learned PCFG, that is, we remove the least frequent rules so that we keep only the F most frequent rules out of the N rules in the PCFG. We want to assess the "harm" this reduction in the size of the grammar does to its capacity to recognize the trees of the treebank. Write a function cover_tree(grammar, tree) that determines whether a tree can be produced by a grammar. (This function does not actually parse anything; it only tests that a given tree can be produced by the grammar; a sketch follows this list.) Draw a plot that indicates the number of trees "missed" as the number of rules is reduced (sample every 10% of the size of the grammar).
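
A minimal sketch of cover_tree; it compares (lhs, rhs) pairs so that the grammar's weighted productions match the tree's plain productions:

def cover_tree(grammar, tree):
    """True iff every production used in tree appears in grammar."""
    known = set((p.lhs(), p.rhs()) for p in grammar.productions())
    return all((p.lhs(), p.rhs()) in known
               for p in tree.productions())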

CNF PCFG

We now want to learn a PCFG in Chomsky Normal Form from the treebank, with simplified tags, and with filtered NONE elements. The strategy is to convert the trees in the treebank into CNF, then to induce the PCFG from the transformed trees. Use the function chomsky_normal_form in nltk.treetransforms to convert the trees. Pay attention to NONE filtering.
pcfg_cnf_learn(treebank, n)
-- treebank is the nltk.corpus.treebank lazy corpus reader (simplified tags)
-- n indicates the number of trees to read
-- return an nltk.WeightedGrammar in CNF
  1. How many productions are learned from the CNF trees? How many interior nodes were in the original treebank, and in the CNF treebank?
  2. Compare the CNF and non-CNF figures.
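
A possible sketch of pcfg_cnf_learn, reusing the hypothetical filter_none helper sketched in Question 2.2 above:

from nltk.grammar import Nonterminal, induce_pcfg
from nltk.treetransforms import chomsky_normal_form

def pcfg_cnf_learn(treebank, n):
    productions = []
    for tree in treebank.parsed_sents()[:n]:
        tree = filter_none(tree)      # hypothetical helper from Question 2.2
        if tree is None:
            continue
        chomsky_normal_form(tree)     # transforms the tree in place
        productions += tree.productions()
    return induce_pcfg(Nonterminal('S'), productions)   # assuming start 'S'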

Question 2.3: Test CFG Independence Assumptions

We want to test the CFG hypothesis that a node expansion is independent of its location within a tree.

Report the following statistics, for the 5 most likely expansions of the NP category (in original form, simplified tags, NONE filtered):

We will most likely observe that the CFG independence assumption does not hold for NPs in this experiment. We now want to verify whether this lack of independence hurts when using the learned PCFG.

Question 2.4: Learn a bigram Language Model

To test the quality of the language generated by the learned PCFG, we will evaluate a sample of generated data with a bigram language model. We first need to learn such a bigram model.

A bigram model is learned by estimating a bigram probability distribution derived from a frequency distribution of bigrams. We read the linearized sentences one by one from our Penn Treebank subset (use tree.leaves()) and add a dummy START symbol (the dot symbol already appears as an END symbol in our dataset).

N-gram models are implemented in the nltk.model.ngram module.

Function specification

 
bigram_learn(corpus, n, estimator)
-- n is the number of trees from the corpus to use for training
-- estimator is a function that takes a ConditionalFreqDist and returns a ConditionalProbDist
-- by default (estimator=None), use the MLEProbDist estimator
-- return a bigram model acquired from the corpus
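
A minimal sketch under the NLTK 2 API used in this assignment, representing the bigram model as a ConditionalProbDist:

from nltk.probability import (ConditionalFreqDist, ConditionalProbDist,
                              MLEProbDist)

def bigram_learn(corpus, n, estimator=None):
    """Return a bigram model (a ConditionalProbDist) acquired from the
    first n trees of the corpus."""
    if estimator is None:
        estimator = lambda cfd: ConditionalProbDist(cfd, MLEProbDist)
    cfd = ConditionalFreqDist()
    for tree in corpus.parsed_sents()[:n]:
        words = ['START'] + tree.leaves()   # '.' already acts as END
        for w1, w2 in zip(words, words[1:]):
            cfd[w1].inc(w2)                 # NLTK 2 API: count the bigram
    return estimator(cfd)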

Validation

To validate the bigram model, perform the following steps:

Question 2.5: Validation of PCFG Generated text using an n-gram Model

We want to test the hypothesis that the language generated by the learned PCFG is a poor match for the n-gram model learned on the same textual data (because the independence assumptions of the PCFG do not hold).

To this end, construct a sample of 1,000 sentences randomly generated by the PCFG. Estimate the likelihood of this corpus using the learned bigram model, and compare it with the likelihood the bigram model assigns to a test set of real sentences. What are your conclusions?
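
One possible sketch of the comparison, reusing the hypothetical pcfg_generate and bigram_learn sketches from above (grammar and bigram_model stand for the PCFG and bigram model learned earlier):

def corpus_logprob(model, sentences):
    """Average per-token log2-probability of sentences (lists of words)
    under a bigram ConditionalProbDist. Note: with the MLE estimator,
    bigrams unseen in training get probability 0 (log-probability -inf),
    so a smoothed estimator is advisable for this comparison."""
    logprob, tokens = 0.0, 0
    for words in sentences:
        words = ['START'] + words
        for w1, w2 in zip(words, words[1:]):
            logprob += model[w1].logprob(w2)   # log2 P(w2 | w1)
            tokens += 1
    return logprob / tokens

generated = [pcfg_generate(grammar).leaves() for i in range(1000)]
print corpus_logprob(bigram_model, generated)   # compare with the same figure on real test sentences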


Last modified 14 Jun 2012