Assignment 2

Due: Monday 06 July 2009 Midnight

Natural Language Processing - Spring 2009 Michael Elhadad

This assignment covers 3 topics:

  1. Machine Learning methods (HMM) for Chunking
  2. Machine Learning methods (SVM) for Named Entity Recognition
  3. Tree Analysis

Submit your solution in the form of an HTML file, using the same CSS as this page, with code inside <pre> tags. Images should be submitted as PNG files.

Chunking with HMM

Chunking

The task of syntactic chunking is presented in the CoNLL 2000 shared task description. Text chunking consists of dividing a sequence of words into non-recursive chunks corresponding to syntactic phrases such as Noun Phrases, Verb Phrases and more. In this assignment, we will only identify NP and VP chunks.

Chapter 6 of the NLTK book presents a thorough introduction to the task of chunking using a simple regular expression chunker, evaluation methods, simple baselines, and an n-gram chunker, similar to the n-gram tagger we reviewed in the POS chapter.

Chunking in general is easier than full parsing because:

  1. Chunks are non-recursive: a chunk does not contain another chunk of the same type.
  2. Chunks can be identified mostly from local evidence, such as the words and POS tags in a small window around the decision point.
  3. No long-distance attachment decisions (for example, prepositional phrase attachment) have to be made.

Dataset

The dataset we will use for the assignment is the one from CoNLL 2000, which is constructed from sentences from the Wall Street Journal tagged with POS information. This dataset is included in the set of corpora of NLTK. It contains about 270K words. The split between train and test is part of the shared task specification. The data is encoded according to the BIO encoding - for each word, the POS of the word is provided and an additional tag indicating whether the word is the Beginning of a chunk of type X (B-X) or Inside a chunk of type X (I-X) or Outside any chunk (O). Thus, if we work over a set of N chunk types, there will be 2N+1 BIO tags in the encoding. For example:
Confidence NN B-NP
in IN B-PP
the DT B-NP
pound NN I-NP
is VBZ B-VP
widely RB I-VP
expected VBN I-VP
to TO I-VP
take VB I-VP
another DT B-NP
sharp JJ I-NP
dive NN I-NP
if IN B-SBAR
trade NN B-NP
figures NNS I-NP
for IN B-PP
September NNP B-NP
, , O
due JJ B-ADJP
for IN B-PP
release NN B-NP
tomorrow NN B-NP
, , O
fail VB B-VP
to TO I-VP
show VB I-VP
a DT B-NP
substantial JJ I-NP
improvement NN I-NP
from IN B-PP
July NNP B-NP
and CC I-NP
August NNP I-NP
's POS B-NP
near-record JJ I-NP
deficits NNS I-NP
. . O
To read the dataset in Python, use this corpus reader:
from nltk.corpus import conll2000

print conll2000.chunked_sents('train.txt')[99]  # Print the 100th sentence of the training set

print conll2000.chunked_sents('train.txt', chunk_types=('NP', 'VP'))[99]  # Only read the NP and VP chunks
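To train a tagger over this data, it is convenient to flatten each chunk tree into a flat sequence of BIO tags. A minimal sketch, assuming NLTK's tree2conlltags helper (from nltk.chunk):

from nltk.corpus import conll2000
from nltk.chunk import tree2conlltags

# Flatten each chunk tree into a list of (word, POS, BIO-tag) triples.
train_sents = conll2000.chunked_sents('train.txt', chunk_types=('NP', 'VP'))
train_bio = [tree2conlltags(sent) for sent in train_sents]

print train_bio[99][:5]  # The first few (word, POS, BIO-tag) triples of the 100th training sentence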

HMM Models

First read a high-level introduction to HMM models: Jan Hajic's lecture on HMMs introduces the basic concepts.

NLTK includes a Python implementation of HMM models. Go over the Python source of the HMM module: you should understand how to use the supervised trainer, and run the module's POS tagger example on the Brown corpus.

Important Note: The version of hmm.py included in NLTK 0.9.9 has a bug. Please make sure to download an updated version of hmm.py from http://nltk.googlecode.com/svn/trunk/nltk/nltk/tag/hmm.py. To check that you have the fixed version, run this test:

>>> import nltk
>>> nltk.tag.hmm.demo_pos()

HMM POS tagging demo

Training HMM...
Testing...
Test: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ 
recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS
took/VBD place/NN ./.

Untagged: the fulton county grand jury said friday an investigation of atlanta's recent primary election 
produced `` no evidence '' that any irregularities took place .

HMM-tagged: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN 
atlanta's/NP$ recent/JJ primary/NN election/NN produced/. ``/`` no/AT evidence/NN ''/'' that/CS any/AT 
irregularities/NNS took/TO place/VB ./.

Entropy: 55.219788385

------------------------------------------------------------
An application of HMM to the task of chunking is described in this paper: Shallow Parsing using Specialized HMMs by Antonio Molina and Ferran Pla, Journal of Machine Learning Research, 2002. The paper provides a detailed description of the modeling of chunking using HMM models.

Your Task

Your task is to develop a chunker for NP and VP chunks on the CoNLL 2000 dataset using a supervised HMM model. Train the model on the train.txt part of the dataset and test it on the test.txt part.
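A minimal training sketch, assuming NLTK's HiddenMarkovModelTrainer, using the word as the observation and the BIO tag as the hidden state (you may prefer richer states, e.g. combining the POS and BIO tags, as in the Molina and Pla paper), with Lidstone smoothing to handle unseen words:

from nltk.corpus import conll2000
from nltk.chunk import tree2conlltags
from nltk.tag.hmm import HiddenMarkovModelTrainer
from nltk.probability import LidstoneProbDist

def to_hmm_sequences(chunked_sents):
    # Each training sequence is a list of (observation, state) pairs:
    # the observation is the word, the state is the BIO chunk tag.
    return [[(word, bio) for (word, pos, bio) in tree2conlltags(sent)]
            for sent in chunked_sents]

train = to_hmm_sequences(conll2000.chunked_sents('train.txt', chunk_types=('NP', 'VP')))
test = to_hmm_sequences(conll2000.chunked_sents('test.txt', chunk_types=('NP', 'VP')))

trainer = HiddenMarkovModelTrainer()
estimator = lambda fd, bins: LidstoneProbDist(fd, 0.1, bins)  # smoothing for unseen events
chunker = trainer.train_supervised(train, estimator=estimator)

# Tag the words of the first test sentence with BIO chunk tags.
print chunker.tag([word for (word, bio) in test[0]])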

Report per-tag Precision, Recall, and F-measure (for the 5 tags B-NP, I-NP, B-VP, I-VP, O), as well as overall accuracy.

Report a per-chunk accuracy measure: for each chunk type (NP and VP), how often a full chunk is correctly identified.

Perform error analysis per chunk: how many predicted chunks are too short, and how many are too long, compared to the gold chunks.
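A possible sketch of the per-tag evaluation, assuming gold_tags and pred_tags are flat lists of BIO tags aligned word by word over the whole test set (names are illustrative):

from collections import defaultdict

def per_tag_scores(gold_tags, pred_tags, tags=('B-NP', 'I-NP', 'B-VP', 'I-VP', 'O')):
    # Count true positives, false positives and false negatives per tag.
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold_tags, pred_tags):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    for t in tags:
        prec = tp[t] / float(tp[t] + fp[t]) if tp[t] + fp[t] else 0.0
        rec = tp[t] / float(tp[t] + fn[t]) if tp[t] + fn[t] else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        print '%-5s Precision=%.3f Recall=%.3f F=%.3f' % (t, prec, rec, f)
    print 'Accuracy = %.3f' % (sum(tp.values()) / float(len(gold_tags)))

For the per-chunk measures, extract (start, end, type) spans from the gold and predicted BIO sequences and compare the two sets of spans; a predicted span that starts inside a gold span but ends before it is "too short", and one that extends past it is "too long".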


Named Entity Recognition with SVM

Named Entity Recognition

The task of Named Entity Recognition (NER) involves the recognition of names of persons, locations, organizations, and dates in free text. For example, the following sentence is tagged with sub-sequences indicating PER (for persons), LOC (for locations) and ORG (for organizations):
Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid.

[PER Wolff ] , currently a journalist in [LOC Argentina ] , played with [PER Del Bosque ] in the final years 
of the seventies in [ORG Real Madrid ] .
NER involves 2 sub-tasks: identifying the boundaries of such expressions (the open and close brackets) and labeling the expressions (with tags such as PER, LOC or ORG). As for the task of chunking, this sequence labeling task is mapped to a per-word classification task, using a BIO encoding of the data:
        Wolff B-PER
            , O
    currently O
            a O
   journalist O
           in O
    Argentina B-LOC
            , O
       played O
         with O
          Del B-PER
       Bosque I-PER
           in O
          the O
        final O
        years O
           of O
          the O
    seventies O
           in O
         Real B-ORG
       Madrid I-ORG
            . O

Dataset

The dataset we will use for this question is derived from the CoNLL 2002 shared task, which addressed NER in Spanish and Dutch. The dataset is included in the NLTK distribution. Explanations of the dataset are provided on the CoNLL 2002 page.

To access the data in Python, do:

from nltk.corpus import conll2002

conll2002.chunked_sents('esp.train') # In Spanish
conll2002.chunked_sents('esp.testa') # In Spanish
conll2002.chunked_sents('esp.testb') # In Spanish

conll2002.chunked_sents('ned.train') # In Dutch
conll2002.chunked_sents('ned.testa') # In Dutch
conll2002.chunked_sents('ned.testb') # In Dutch
The data consists of three files per language (Spanish and Dutch): one training file and two test files, testa and testb. The first test file is to be used in the development phase to find good parameters for the learning system. The second test file will be used for the final evaluation.
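For feature extraction it is often easier to work with flat BIO-tagged sentences than with chunk trees. A small sketch, assuming the corpus reader's iob_sents() method (part of NLTK's CoNLL corpus reader), which yields sentences as lists of (word, POS, BIO-tag) triples:

from nltk.corpus import conll2002

esp_train = conll2002.iob_sents('esp.train')
esp_testa = conll2002.iob_sents('esp.testa')  # development set
esp_testb = conll2002.iob_sents('esp.testb')  # final evaluation set

print esp_train[0]  # A list of (word, POS, BIO-tag) triples for the first Spanish training sentence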

SVM Models

Support Vector Machines (SVMs) are a machine learning method that is well adapted to sequence labeling and can combine very large sets of features.

We will use the libsvm implementation of SVM in this question. libsvm is written in C++ and includes a Python interface. The download comes with a pre-compiled Windows version and a makefile for Linux - including the Python wrapper.

A Practical Guide to Support Vector Classification (Hsu, Chang and Lin 2009) presents a detailed procedure on how to run libsvm on practical datasets.

libsvm takes as input ASCII files encoding the training data and learns a model. Test data, also encoded in ASCII files, is then submitted to a trained model. The format of the training and test data files is:

<label> <index1>:<value1> <index2>:<value2> ...
.
.
Each line contains an instance and is ended by a '\n' character. For classification, <label> is an integer indicating the class label.

NOTE: libsvm supports multi-class classification (not only binary classification). In our case, we must use one class for each BIO tag (B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, O). You should assign integer values for each class in the SVM encoding.

<index>:<value> gives a feature (attribute) value. <index> is an integer starting from 1 and <value> is a real number. In our case, we will only use binary features, so the value should only be 1 for present features. All absent features are assumed to be 0. Indices must be in ASCENDING order.
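A small sketch of producing one line in this format, assuming the integer class label and the set of active (binary) feature indices for a word have already been computed:

def svm_line(label, active_indices):
    # Binary features: each active index gets value 1, written in ascending order.
    pairs = ['%d:1' % i for i in sorted(set(active_indices))]
    return '%d %s' % (label, ' '.join(pairs))

print svm_line(1, [37, 1, 20, 32])  # -> 1 1:1 20:1 32:1 37:1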

To check if your data is in a correct form, use `tools/checkdata.py' (details in `tools/README').

Type `svm-train ner', and the program will read the training data and output the model file `ner.model'. Put test data in the right format in a file called ner.t, then type `svm-predict ner.t ner.model output' to see the prediction accuracy. The `output' file contains the predicted class labels.

Alternatively, you can use the Python wrapper of libsvm to work directly on SVM models and parameters. The libsvm package includes a version pre-compiled under Windows for Python 2.6 (remember - nltk only works with Python 2.5). To install the libsvm Python 2.5 wrapper (to co-exist with nltk), perform the following steps:

  1. Install libsvm in a separate folder: unzip libsvm-2.89.zip into \libsvm
  2. Adjust the paths to Python and to your compiler in Makefile (or Makefile.win for Windows)
  3. Compile libsvm for your platform:
    1. Under Windows: 'nmake -f Makefile.win' then 'nmake -f Makefile.win python'
    2. Under Linux: 'make' then 'make python'
  4. Copy windows\python\svmc.pyd into the python folder
  5. cd python; \python25\python setup.py install
Under Windows, this compilation will only be compatible with Python 2.5 if you use MSVC 2003. If you use Python 2.6, use MSVC 2005 or higher (or use the precompiled binaries that come with the libsvm 2.89 package).

Here is a sample session in Python showing the basic functionality of the Python wrapper:

>>> from svm import *
# First parameter is a list of classes
# Second parameter is a list of vectors in the training set
>>> p = svm_problem([1,-1], [[1,0,1],[-1,0,-1]])  
>>> param = svm_parameter(kernel_type=LINEAR, C=10)
>>> m = svm_model(p, param)
optimization finished, #iter = 1
nu = 0.025000
obj = -0.250000, rho = 0.000000
nSV = 2, nBSV = 0
Total nSV = 2
>>> m.predict([1,1,1])
1.0
>>> m.save('test.model')
>>> n = svm_model('test.model')
>>> n.predict([1,1,1])
1.0
If you have any trouble using the Python wrapper, you can use the regular command line tools on ASCII files as described above. Here is a sample training file a.train:
+1 1:1 10:1
+2 2:1 10:1
+3 3:1 11:1
The corresponding training command and its output:
\libsvm-2.89>windows\svm-train.exe -t 0 -c 10 a.train
*
optimization finished, #iter = 1
nu = 0.100000
obj = -1.000000, rho = 0.000000
nSV = 2, nBSV = 0
*
optimization finished, #iter = 1
nu = 0.050000
obj = -0.500000, rho = 0.000000
nSV = 2, nBSV = 0
*
optimization finished, #iter = 1
nu = 0.050000
obj = -0.500000, rho = 0.000000
nSV = 2, nBSV = 0
Total nSV = 3
This creates a file a.train.model. The corresponding test data is stored in the file a.test:
2:1 10:1
2:1 11:1
To test this data, use svm-predict, which writes its predictions to the a.out file (note that test instances follow the same format as training instances, starting with a label; the labels in the test file are used only to compute the reported accuracy):
\libsvm-2.89>windows\svm-predict.exe a.test a.train.model a.out
Accuracy = 50% (1/2) (classification)

Your Task

Your task when using SVM consists of the following steps:
  1. Choose the right features for encoding the problem.
  2. Encode your training dataset in the ASCII format expected by libsvm.
  3. Select the optimal parameters when running SVM. The most important parameters are the kernel type and the C parameter.
  4. Train and test the model.
  5. Perform error analysis.
Here is a list of features that have been found appropriate for NER in previous work:
  1. The word form (the string as it appears in the sentence)
  2. The POS of the word
  3. ORT - a feature that captures the orthographic (letter) structure of the word. It can have any of the following values: number, contains-digit, contains-hyphen, capitalized, all-capitals, URL, punctuation, regular. (A sketch of such a classification function appears after this list.)
  4. prefix1: first letter of the word
  5. prefix2: first two letters of the word
  6. prefix3: first three letters of the word
  7. suffix1: last letter of the word
  8. suffix2: last two letters of the word
  9. suffix3: last three letters of the word
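The ORT feature can be computed with a simple rule-based function. A hedged sketch (the exact rules and their order of precedence are a design choice, not part of the assignment specification):

import re

def ort(word):
    # Orthographic class of a word, checked in a fixed order of precedence.
    if re.match(r'^\d+([.,]\d+)*$', word):
        return 'number'
    if word.lower().startswith(('http://', 'www.')):
        return 'URL'
    if re.match(r'^\W+$', word):
        return 'punctuation'
    if any(c.isdigit() for c in word):
        return 'contains-digit'
    if '-' in word:
        return 'contains-hyphen'
    if word.isupper():
        return 'all-capitals'
    if word[0].isupper():
        return 'capitalized'
    return 'regular'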
To encode sentences into feature vectors, you must first pre-process the training set and assign an index number to each observed feature value. You must therefore write Python code that maintains a dictionary for each feature type and can compute the vector representing each word, both in the training data and in the test data.

For example, given the following toy training data, the encoding of the features would be:

        Wolff NP  B-PER
            , ,   O
    currently RB  O
            a AT  O
   journalist NN  O
           in IN  O
    Argentina NP  B-LOC
            , ,   O
       played VBD O
         with IN  O
          Del NP  B-PER
       Bosque NP  I-PER
           in IN  O
          the AT  O
        final JJ  O
        years NNS O
           of IN  O
          the AT  O
    seventies NNS O
           in IN  O
         Real NP  B-ORG
       Madrid NP  I-ORG
            . .   O

Classes
1 B-PER
2 I-PER
3 B-LOC
4 I-LOC
5 B-ORG
6 I-ORG
7 O

Feature WORD-FORM:
1 Wolff
2 ,
3 currently
4 a
5 journalist
6 in
7 Argentina
8 played
9 with
10 Del
11 Bosque
12 the
13 final
14 years
15 of
16 seventies
17 Real
18 Madrid
19 .

Feature POS
20 NP
21 ,
22 RB
23 AT
24 NN
25 VBD
26 JJ
27 NNS
28 .

Feature ORT
29 number
30 contains-digit
31 contains-hyphen
32 capitalized
33 all-capitals
34 URL
35 punctuation
36 regular

Feature Prefix1
37 W
38 ,
39 c
40 a
41 j
42 i
43 A
44 p
45 w
46 D
47 B
48 t
49 f
50 y
51 o
52 s
53 .
Given this encoding, we can compute the vector representing the first word "Wolff NP B-PER" as:
# Class: B-PER=1
# Word-form: Wolff=1
# POS: NP=20
# ORT: capitalized=32
# prefix1: W=37
1 1:1 20:1 32:1 37:1
When you encode the test dataset, some word forms will be unknown (not seen in the training dataset). You should therefore reserve a special "unknown" value for each feature type where unseen values are expected.
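A sketch of such an encoder, assuming one global index space across all feature types and a reserved '<UNK>' value per feature type (all names here are illustrative, not a required design):

class FeatureEncoder:
    def __init__(self):
        self.index = {}       # (feature type, value) -> libsvm feature index
        self.next_index = 1

    def add(self, ftype, value):
        # Training phase: register a value the first time it is seen.
        key = (ftype, value)
        if key not in self.index:
            self.index[key] = self.next_index
            self.next_index += 1
        return self.index[key]

    def get(self, ftype, value):
        # Test phase: unseen values fall back to the feature type's '<UNK>' index.
        return self.index.get((ftype, value), self.index.get((ftype, '<UNK>')))

During training, call encoder.add('word', 'Wolff'), encoder.add('pos', 'NP'), encoder.add('ort', 'capitalized'), encoder.add('prefix1', 'W'), and so on for each word, and also register '<UNK>' once per feature type; the resulting indices, sorted in ascending order, are then written out in the "label index:1 index:1 ..." format shown above.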

Your task is to encode the training and test datasets, experiment with good SVM parameters, and report accuracy and an error analysis: Precision and Recall per tag, and a confusion matrix.
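A possible parameter-selection loop, using only the wrapper calls shown above; train_y, train_x, dev_y and dev_x are assumed to hold your encoded labels and feature vectors for esp.train and esp.testa (for the full feature set you may find the ASCII-file route with svm-train/svm-predict more practical):

from svm import svm_problem, svm_parameter, svm_model, LINEAR

def accuracy(model, xs, ys):
    # Fraction of development instances whose predicted class matches the gold class.
    correct = sum(1 for x, y in zip(xs, ys) if model.predict(x) == y)
    return correct / float(len(ys))

prob = svm_problem(train_y, train_x)
best = None
for C in (0.1, 1, 10, 100):
    param = svm_parameter(kernel_type=LINEAR, C=C)
    m = svm_model(prob, param)
    acc = accuracy(m, dev_x, dev_y)
    print 'C=%s dev accuracy=%.3f' % (C, acc)
    if best is None or acc > best[0]:
        best = (acc, C)
print 'Best C on the development set:', best[1]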


Syntactic Tree Analysis

Consider the Penn Treebank corpus delivered with NLTK. You can observe the parsed trees using the treebank corpus reader:
>>> print nltk.corpus.treebank.parsed_sents('wsj_0001.mrg')[0]
(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))
>>> nltk.corpus.treebank.parsed_sents('wsj_0001.mrg')[0].draw()
>>>
When discussing the difference between constituency-based syntactic descriptions and dependency-based descriptions, we stressed that the main differences are:
  1. Constituency-based descriptions have non-terminal nodes labelled with phrasal categories (NP, S, VP etc) while dependency representations have only word labels (internal nodes are words that operate as syntactic heads of phrasal constituents).
  2. Dependency structures have labelled edges - the labels on the edges indicate the syntactic function of the children. For example, the SUBJECT label outgoing from a verb would indicate which dependent constituent is the subject of the verb.
Observe that some of this syntactic function information is explicitly captured in the Penn Treebank, in the form of functional tag annotations. For example, the label NP-SBJ indicates that the NP constituent "Pierre Vinken" plays the role of SUBJECT in its parent S constituent.

Proponents of constituency-based descriptions believe that tree configurations identify syntactic function in a non-ambiguous manner. For example, the configuration (S (NP #1) (VP #2)) identifies node #1 as a subject.

Your task in this question is to test this hypothesis for the SUBJECT function on the NLTK Penn Treebank dataset fragment (about 1600 sentences). You must develop a classifier that determines whether an NP node in a tree is a SUBJECT.

What knowledge sources do you expect to play a role in the decision SBJ/non-SBJ?

Report on accuracy, and perform error analysis.
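One way to build the dataset for this classifier, as a rough sketch: walk every parse tree, collect the NP subtrees, and label each one according to whether its Penn Treebank functional tags include SBJ. In this NLTK version the node label of a tree is available through the .node attribute (newer NLTK versions use .label() instead):

import nltk

def np_instances():
    # Yield (np_subtree, parent_label, is_subject) for every NP node in the corpus.
    for tree in nltk.corpus.treebank.parsed_sents():
        for parent in tree.subtrees():
            for child in parent:
                if isinstance(child, nltk.Tree) and child.node.startswith('NP'):
                    yield child, parent.node, ('SBJ' in child.node)

data = list(np_instances())
print len(data), 'NP nodes,', sum(1 for (np, parent, subj) in data if subj), 'of them subjects'

From each NP node you can then extract whatever knowledge sources you chose above (for example the parent label, the node's position among its siblings, or the neighboring VP) as features for the classifier.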


Last modified 23 June 2009