Assignment 3

Due: Monday 6 June 2011 Midnight

Natural Language Processing - Spring 2011 Michael Elhadad

This assignment covers 2 topics:

  1. Machine Learning methods (HMM) for Chunking
  2. Machine Learning methods (SVM) for Named Entity Recognition

Submit your solution in the form of an HTML file, using the same CSS as this page with code inside <pre> tags. Images should be submitted as PNG files.

Chunking with HMM

Chunking

The task of syntactic chunking is presented in the CoNLL 2000 shared task description. Text chunking consists of dividing a sequence of words into non-recursive chunks corresponding to syntactic phrases such as Noun Phrases, Verb Phrases and more. In this assignment, we will only identify NP and VP chunks.

To read the dataset in Python, use this corpus reader:

from nltk.corpus import conll2000

print(conll2000.chunked_sents('train.txt')[99])  # Print the 100th sentence from the training set

print(conll2000.chunked_sents('train.txt', chunk_types=('NP', 'VP'))[99])  # Only read the NP and VP chunks

HMM Models

Read a high-level introduction to HMMs:
Jan Hajic's lecture on HMM models introduces the basic concepts.

NLTK includes a Python implementation of HMM models. Go over the Python source of the HMM module: you should understand how to use the supervised trainer and run the POS tagger example on the Brown corpus using the HMM module.

To test the nltk HMM module, run this test:

>>> import nltk
>>> nltk.tag.hmm.demo_pos()

HMM POS tagging demo

Training HMM...
Testing...
Test: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN atlanta's/NP$ 
recent/JJ primary/NN election/NN produced/VBD ``/`` no/AT evidence/NN ''/'' that/CS any/DTI irregularities/NNS
took/VBD place/NN ./.

Untagged: the fulton county grand jury said friday an investigation of atlanta's recent primary election 
produced `` no evidence '' that any irregularities took place .

HMM-tagged: the/AT fulton/NP county/NN grand/JJ jury/NN said/VBD friday/NR an/AT investigation/NN of/IN 
atlanta's/NP$ recent/JJ primary/NN election/NN produced/. ``/`` no/AT evidence/NN ''/'' that/CS any/AT 
irregularities/NNS took/TO place/VB ./.

Entropy: 55.219788385

------------------------------------------------------------
An application of HMM to the task of chunking is described in this paper: Shallow Parsing using Specialized HMMs by Antonio Molina and Ferran Pla, Journal of Machine Learning Research, 2002. The paper provides a detailed description of the modeling of chunking using HMM models.

Your Task

Your task is to develop a chunker for NP and VP chunks on the CoNLL 2000 dataset using a supervised HMM model. Train the model on the train.txt part of the dataset and test it on the test.txt part.

Evaluate per-tag precision, recall and F-measure (for the 5 tags B-NP, I-NP, B-VP, I-VP, O) as well as overall accuracy.

Measure per-chunk accuracy: how often a full chunk is correctly identified, for each chunk type (NP and VP).

Perform a per-chunk error analysis: count how many predicted chunks are too short and how many are too long.
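As a minimal sketch of the per-tag evaluation (the function names are my own; gold and predicted tags are assumed to be aligned flat lists of BIO tags):

```python
from collections import defaultdict

def per_tag_scores(gold, pred):
    """Per-tag precision, recall and F-measure over aligned BIO tag lists."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[g] += 1  # g was missed
    scores = {}
    for tag in set(gold) | set(pred):
        prec = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        rec = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[tag] = (prec, rec, f)
    return scores

def accuracy(gold, pred):
    """Fraction of words whose predicted tag matches the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

Per-chunk accuracy then follows by decoding both tag sequences into chunk spans and comparing the two span sets.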


Named Entity Recognition with SVM

Named Entity Recognition

The task of Named Entity Recognition (NER) involves the recognition of names of persons, locations, organizations, and dates in free text. For example, the following sentence is tagged with sub-sequences indicating PER (for persons), LOC (for locations) and ORG (for organizations):
Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid.

[PER Wolff ] , currently a journalist in [LOC Argentina ] , played with [PER Del Bosque ] in the final years 
of the seventies in [ORG Real Madrid ] .
NER involves 2 sub-tasks: identifying the boundaries of such expressions (the open and close brackets) and labeling the expressions (with tags such as PER, LOC or ORG). As for the task of chunking, this sequence labeling task is mapped to a per-word classification task, using a BIO encoding of the data:
        Wolff B-PER
            , O
    currently O
            a O
   journalist O
           in O
    Argentina B-LOC
            , O
       played O
         with O
          Del B-PER
       Bosque I-PER
           in O
          the O
        final O
        years O
           of O
          the O
    seventies O
           in O
         Real B-ORG
       Madrid I-ORG
            . O
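To measure per-chunk performance later on, the BIO encoding can be decoded back into labeled spans. Here is a small helper, a sketch of my own rather than part of any dataset tool:

```python
def bio_chunks(tags):
    """Extract (label, start, end) chunk spans from a BIO tag sequence
    (end is exclusive)."""
    chunks = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # Close the open chunk on O, on B-, or on an I- with a new label.
        if label is not None and (tag == 'O' or tag.startswith('B-')
                                  or tag[2:] != label):
            chunks.append((label, start, i))
            start, label = None, None
        # Open a chunk on B-, or on a stray I- that follows no open chunk.
        if label is None and tag != 'O':
            start, label = i, tag[2:]
    if label is not None:
        chunks.append((label, start, len(tags)))
    return chunks
```

On the tag sequence of the sentence above, this returns the four spans for Wolff, Argentina, Del Bosque and Real Madrid.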

Dataset

The dataset we will use for this question is derived from the CoNLL 2002 shared task - which is about NER in Spanish and Dutch. The dataset is included in the NLTK distribution. Explanations on the dataset are provided in the CoNLL 2002 page.

To access the data in Python, do:

from nltk.corpus import conll2002

conll2002.chunked_sents('esp.train') # In Spanish
conll2002.chunked_sents('esp.testa') # In Spanish
conll2002.chunked_sents('esp.testb') # In Spanish

conll2002.chunked_sents('ned.train') # In Dutch
conll2002.chunked_sents('ned.testa') # In Dutch
conll2002.chunked_sents('ned.testb') # In Dutch
The data consists of three files per language (Spanish and Dutch): one training file and two test files testa and testb. The first test file is to be used in the development phase for finding good parameters for the learning system. The second test file will be used for the final evaluation. You can ignore the MISC tag in this dataset and consider it to be O.

SVM Models

Support Vector Machines (SVMs) are a machine learning method that is well adapted to sequence labeling and can combine very large sets of features.

We will use the libsvm implementation of SVM in this question. libsvm is written in C++ and includes a Python interface. The download comes with a pre-compiled Windows version and a makefile for Linux - including the Python wrapper.

A Practical Guide to Support Vector Classification (Hsu, Chang and Lin 2009) presents a detailed procedure on how to run libsvm on practical datasets.

libsvm takes as input ASCII files encoding the training data and learns a model. Test data, encoded in the same ASCII format, is then submitted to the trained model. The format of the training and testing data files is:

<label> <index1>:<value1> <index2>:<value2> ...
.
.
Each line contains an instance and is ended by a '\n' character. For classification, <label> is an integer indicating the class label.

NOTE: libsvm supports multi-class classification (not only binary classification). In our case, we must use one class for each BIO tag (B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, O). You should assign integer values for each class in the SVM encoding.

<index>:<value> gives a feature (attribute) value. <index> is an integer starting from 1 and <value> is a real number. In our case, we will only use binary features, so the value should only be 1 for present features. All absent features are assumed to be 0. Indices must be in ASCENDING order.
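For illustration, here is a minimal encoder for one instance in this format; the tag-to-integer mapping is my own choice, and the helper sorts indices to satisfy the ascending-order requirement:

```python
LABELS = {'B-PER': 1, 'I-PER': 2, 'B-LOC': 3, 'I-LOC': 4,
          'B-ORG': 5, 'I-ORG': 6, 'O': 7}

def libsvm_line(tag, feature_indices):
    """Format one instance: integer label, then index:1 pairs in
    ascending index order (binary features only)."""
    pairs = ' '.join('%d:1' % i for i in sorted(set(feature_indices)))
    return '%d %s' % (LABELS[tag], pairs)
```

For example, `libsvm_line('B-PER', [37, 1, 20, 32])` yields the line `1 1:1 20:1 32:1 37:1`.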

To check if your data is in a correct form, use `tools/checkdata.py' (details in `tools/README').

Type `svm-train ner', and the program will read the training data and output the model file `ner.model'. Put test data in the right format in a file called ner.t, then type `svm-predict ner.t ner.model output' to see the prediction accuracy. The `output' file contains the predicted class labels.

Alternatively, you can use the Python wrapper of libsvm to work directly on SVM models and parameters. The libsvm package includes a version pre-compiled under Windows for Python 2.6.

Here is a sample session in Python showing the basic functionality of the Python wrapper:

>>> from svm import *
# First parameter is the list of labels (one per training vector)
# Second parameter is the list of feature vectors in the training set
>>> p = svm_problem([1,-1], [[1,0,1],[-1,0,-1]])  
>>> param = svm_parameter(kernel_type=LINEAR, C=10)
>>> m = svm_model(p, param)
optimization finished, #iter = 1
nu = 0.025000
obj = -0.250000, rho = 0.000000
nSV = 2, nBSV = 0
Total nSV = 2
>>> m.predict([1,1,1])
1.0
>>> m.save('test.model')
>>> n = svm_model('test.model')
>>> n.predict([1,1,1])
1.0
If you have any trouble using the Python wrapper, you can use the regular command line tools on ASCII files as described above. Here is a sample training file a.train:
+1 1:1 10:1
+2 2:1 10:1
+3 3:1 11:1
The corresponding training:
\libsvm-2.91>windows\svm-train.exe -t 0 -c 10 a.train
*
optimization finished, #iter = 1
nu = 0.100000
obj = -1.000000, rho = 0.000000
nSV = 2, nBSV = 0
*
optimization finished, #iter = 1
nu = 0.050000
obj = -0.500000, rho = 0.000000
nSV = 2, nBSV = 0
*
optimization finished, #iter = 1
nu = 0.050000
obj = -0.500000, rho = 0.000000
nSV = 2, nBSV = 0
Total nSV = 3
This creates a file a.train.model. The corresponding test data is stored in the file a.test; each test line also starts with a label, which is used only to compute the reported accuracy:
+2 2:1 10:1
+2 2:1 11:1
And to test this data, use svm-predict, which outputs predictions in the a.out file:
\libsvm-2.91>windows\svm-predict.exe a.test a.train.model a.out
Accuracy = 50% (1/2) (classification)

Your Task

Your task when using SVM consists of:
  1. Choosing the right features for encoding the problem.
  2. Encoding your training dataset in the ASCII format expected by libsvm.
  3. Selecting optimal parameters when running SVM; the most important parameters are the kernel type and the C parameter.
  4. Training and testing the model.
  5. Performing error analysis.
Here is a list of features that have been found appropriate for NER in previous work:
  1. The word form (the string as it appears in the sentence)
  2. The POS of the word
  3. ORT - a feature that captures the orthographic (letter) structure of the word. It can have any of the following values: number, contains-digit, contains-hyphen, capitalized, all-capitals, URL, punctuation, regular.
  4. prefix1: first letter of the word
  5. prefix2: first two letters of the word
  6. prefix3: first three letters of the word
  7. suffix1: last letter of the word
  8. suffix2: last two letters of the word
  9. suffix3: last three letters of the word
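As an illustration of the ORT and affix features, here is a possible Python sketch; the function names and the precedence order among the ORT classes are my own choices:

```python
def ort(word):
    """Map a word to one of the orthographic classes listed above
    (one possible precedence order among the classes)."""
    if word.isdigit():
        return 'number'
    if any(c.isdigit() for c in word):
        return 'contains-digit'
    if '-' in word:
        return 'contains-hyphen'
    if word.startswith('http://') or word.startswith('www.'):
        return 'URL'
    if word.isupper():
        return 'all-capitals'
    if word[:1].isupper():
        return 'capitalized'
    if all(not c.isalnum() for c in word):
        return 'punctuation'
    return 'regular'

def affixes(word):
    """prefix1..3 and suffix1..3 features for a word."""
    feats = {}
    for n in (1, 2, 3):
        feats['prefix%d' % n] = word[:n]
        feats['suffix%d' % n] = word[-n:]
    return feats
```

For a word shorter than three letters, the prefix3/suffix3 features simply degenerate to the whole word.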
To encode sentences into feature vectors, you must first pre-process the training set and assign numbers to each observed feature. You must therefore write Python code that will maintain dictionaries of each feature, and be able to compute vectors representing each word in test data.

For example, given the following toy training data, the encoding of the features would be:

        Wolff NP  B-PER
            , ,   O
    currently RB  O
            a AT  O
   journalist NN  O
           in IN  O
    Argentina NP  B-LOC
            , ,   O
       played VBD O
         with IN  O
          Del NP  B-PER
       Bosque NP  I-PER
           in IN  O
          the AT  O
        final JJ  O
        years NNS O
           of IN  O
          the AT  O
    seventies NNS O
           in IN  O
         Real NP  B-ORG
       Madrid NP  I-ORG
            . .   O

Classes
1 B-PER
2 I-PER
3 B-LOC
4 I-LOC
5 B-ORG
6 I-ORG
7 O

Feature WORD-FORM:
1 Wolff
2 ,
3 currently
4 a
5 journalist
6 in
7 Argentina
8 played
9 with
10 Del
11 Bosque
12 the
13 final
14 years
15 of
16 seventies
17 Real
18 Madrid
19 .

Feature POS
20 NP
21 ,
22 RB
23 AT
24 NN
25 VBD
26 JJ
27 NNS
28 .

Feature ORT
29 number
30 contains-digit
31 contains-hyphen
32 capitalized
33 all-capitals
34 URL
35 punctuation
36 regular

Feature Prefix1
37 W
38 ,
39 c
40 a
41 j
42 i
43 A
44 p
45 w
46 D
47 B
48 t
49 f
50 y
51 o
52 s
53 .
Given this encoding, we can compute the vector representing the first word "Wolff NP B-PER" as:
# Class: B-PER=1
# Word-form: Wolff=1
# POS: NP=20
# ORT: Capitalized=32
# prefix1: W=37
1 1:1 20:1 32:1 37:1
When you encode the test dataset, some of the word-forms will be unknown (not seen in the training dataset). You should therefore reserve a special "unknown" value for each feature type where unseen values are expected.
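Here is a sketch of such a feature dictionary (the class and method names are my own). Indices are assigned in first-seen order rather than grouped by feature type as in the example above; both schemes are valid, since libsvm only requires that indices be consistent and sorted within each line:

```python
class FeatureIndex(object):
    """Assigns a global integer index (starting at 1) to each
    (feature-type, value) pair seen in training; unseen test values
    fall back to a reserved <UNK> index per feature type."""

    def __init__(self):
        self.index = {}

    def add(self, ftype, value):
        """Register a training value, returning its index."""
        key = (ftype, value)
        if key not in self.index:
            self.index[key] = len(self.index) + 1
        return self.index[key]

    def freeze(self, ftypes):
        """Reserve one <UNK> slot per feature type for unseen test values."""
        for ftype in ftypes:
            self.add(ftype, '<UNK>')

    def get(self, ftype, value):
        """Look up a value at test time, mapping unseen values to <UNK>."""
        return self.index.get((ftype, value), self.index[(ftype, '<UNK>')])
```

After a first pass over the training set with add() and a call to freeze(), get() can safely encode any test word.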

Your task is to encode the training and test datasets, experiment with good SVM parameters, and present accuracy and an error analysis: precision and recall per tag, and a confusion matrix.
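The confusion matrix can be computed with a minimal sketch over aligned gold and predicted tag lists (helper names are my own):

```python
from collections import Counter

def confusion_matrix(gold, pred):
    """Count (gold tag, predicted tag) pairs over aligned tag sequences."""
    return Counter(zip(gold, pred))

def print_confusion(cm, tags):
    """Print the matrix with gold tags as rows and predicted tags as columns."""
    print('\t' + '\t'.join(tags))
    for g in tags:
        print(g + '\t' + '\t'.join(str(cm[(g, p)]) for p in tags))
```

Off-diagonal cells with large counts point directly at the tag pairs to examine in the error analysis.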




Last modified 24 May 2010