Assignment 2

Due: Tue 16 Dec 2014 Midnight

Natural Language Processing - Fall 2015 Michael Elhadad

This assignment covers the topic of statistical distributions, regression and classification. The objective is:

Learn about basic statistical distributions (Normal, Multinomial), expectations, sampling and estimation.
Learn how to perform regression using a synthetic dataset.
Experiment and evaluate classifiers for the tasks of named entity recognition and document classification.
Understand the difference between one-vs-all and all-vs-all strategies for multi-label classification.

Make sure you have installed scipy, scikit-learn and numpy to work on this assignment. This should be already available if you have installed the Anaconda distribution.

Submit your solution by email in the form of an iPython ipynb file. Images should be attached as PNG or JPG files. The whole code should also be submitted as a separate folder with all necessary code to run the questions separated in clearly documented functions from a standalone Python shell, with nltk, scipy, scikit-learn and numpy pre-installed.

Content

Q1: Polynomial curve fitting

Q1.1 Synthetic data generation
Q1.2 Polynomial Curve Fitting
Q1.3 Polynomial Curve Fitting with Regularization
Q1.4 Probabilistic Regression Model

Q2: Named Entity Recognition with SVM

Q2.1 Named Entity Recognition
Q2.2 The CoNLL 2002 Dataset
Q2.3 Using an SVM Classifier
Q2.4 Your Task

Q3: Document Classification with Naive Bayes

Q3.1 The Reuters Corpus
Q3.2 Naive Bayes Multiclass Classifier
Q3.3 Naive Bayes One-vs-All Classifier
Q3.4 Using TF-IDF Features

Q1: Polynomial Curve Fitting

In this question, we will reproduce the polynomial curve fitting example used in Bishop's book in Chapter 1. Polynomial curve fitting is an example of a learning task called regression (as opposed to classification we studied in class). In regression, the objective is to learn a function mapping an input variable X to a continuous target variable Y, while in classification, Y is a discrete variable. For educational purposes, it turns out that regression is simpler to grasp than classification. This task shows how to develop a probabilistic model of a task often described in geometric terms. The outline of the task is:

Generate a synthetic dataset of N points (x, t) for a known function y(x) with some level of noise.
Solve the curve fitting regression problem using error function optimization.
Observe the problems of over-fitting this method produces.
Introduce regularization to overcome over-fitting.
Finally, use Bayesian estimation to produce an interval estimation of the function y.

Q1.1 Synthetic Dataset Generation

Learn how to use the numpy.random package to sample random numbers from well-known distributions in this reference page. In particular, we will use in this question the Normal distribution: numpy.random.normal.

Generate a dataset of points in the form of 2 vectors x and t of size N where:

ti = y(xi) + Normal(mu, sigma)

where the xi values are equi-distant on the [0,1] segment (that is, x₁ = 0, x₂=1/N-1, x₃=2/N-1..., x_N = 1.0)
mu = 0.0
sigma = 0.03
y(x) = sin(2Πx)

Our objective will be to "learn" the function y from the noisy sparse dataset we generate.

The function generateDataset(N, f, sigma) should return a tuple with the 2 vectors x and t.

Draw the plot (scatterplot) of (x,t) using matplotlib for N=100. Look at the documentation of the numpy.random.normal function in Numpy for an example of usage. Look at the definition of the function numpy.linspace to generate your dataset.

Note: a useful property of Numpy arrays is that you can apply a function to a Numpy array as follows:

import math
import numpy as np
def s(x): return x**2
def f(x): return math.sin(2 * math.pi * x)
vf = np.vectorize(f)        # Create a vectorized version of f

z = np.array([1,2,3,4])

sz = s(z)                   # You can apply simple functions to an array
sz.shape                    # Same dimension as z (4)

fz = vf(z)                  # For more complex ones, you must use the vectorized version of f
fz.shape

Q1.2 Polynomial Curve Fitting

We will attempt to learn the function y given a synthetic dataset (x, t). We assume that y is a polynomial of degree M - that is:

y(x) = w₀ + w₁x + w₂x² + ... + w_Mx^M

Our objective is to estimate the vector w = (w₀...w_M) from the dataset (x, t).

We first attempt to solve this regression task by optimizing the square error function (this method is called least squares:

Define: E(w) = 1/2Σ_i(y(x_i) - t_i)²
             = 1/2Σ_i(Σ_kw_kx_i^k - t_i)²

If t = (t₁, ..., t_N), then define the design matrix to be the matrix Φ such that Φ_nm = x_n^m = Φ_m(x_n). We want to minimize the error function, and, therefore, look for a solution to the linear system of equations:

dE/dw_k = 0 for k = 0..M

When we work out the partial derivations, we find that the solution to the following system gives us the optimal value w_LS given (x, t):

w_LS = (Φ^TΦ)^-1Φ^Tt

(Note: Φ is a matrix of dimension Nx(M+1), w is a vector of dimension (M+1) and t is a vector of dimension N.)

Here is how you write this type of matrix operations in Python using the Numpy library:

import numpy as np
import scipy.linalg

t = np.array([1,2,3,4])                    # This is a vector of dim 4
t.shape                                    # (4,)
phi = np.array([[1,1],[2,4],[3,3],[2,4]])  # This is a 4x2 matrix
phi.shape                                  # (4, 2) 
prod = np.dot(phi.T, phi)                  # prod is a 2x2 matrix
prod.shape                                 # (2, 2)
i = np.linalg.inv(prod)                    # i is a 2x2 matrix
i.shape                                    # (2, 2)
m = np.dot(i, phi.T)                       # m is a 2x4 matrix
m.shape                                    # (2, 4)
w = np.dot(m, t)                           # w is a vector of dim 2
w.shape                                    # (2,)

Implement a method OptimizeLS(x, t, M) which given the dataset (x, t) returns the optimal polynomial of degree M that approximates the dataset according to the least squares objective. Plot the learned polynomial w*_M(x_i) and the real function sin(2Πx) for a dataset of size N=10 and M=1,3,5,10.

Q1.3 Polynomial Curve Fitting with Regularization

We observe that the solution to the least-squares optimization has a tendency to over-fit the dataset. To avoid over-fitting, we will use a method called regularization: the objective function we want to optimize will take into account the least-squares error as above, and in addition the complexity of the learned model w.

We define a new objective function:

Define E_PLS(w) = E(w) + λE_W(w)

Where E_PLS is called the penalized least-squares function of w
and   E_W is the penalty function.

We will use a standard penalty function:
      E_W(w) = 1/2 w^T.w = 1/2 Σ_m=1..Mw_m²

When we work out the partial derivatives of the minimization problem, we find in closed form, that the solution to the penalized least-squares is:

w_PLS = (Φ^TΦ + λI)^-1Φ^Tt

λ is called a hyper-parameter (that is, a parameter which influences the value of the model's parameters w). Its role is to balance the influence of how well the function fits the dataset (as in the least-squares model) and how smooth it is.

Write a function optimizePLS(x, t, M, lambda) which returns the optimal parameters w_PLS given M and lambda.

We want to optimize the value of λ. The way to optimize is to use a development set in addition to our training set.

To construct a development set, we will extend our synthetic dataset construction function to return 3 samples: one for training, one for development and one for testing. Write a function generateDataset3(N, f, sigma) which returns 3 pairs of vectors of size N each, (x_test, t_test), (x_validate, t_validate) and (x_train, t_train). The target values are generated as above with Gaussian noise N(0, sigma).

Look at the documentation of the function numpy.random.shuffle() as a way to generate 3 subsets of size N from the list of points generated by linspace.

Given the synthetic dataset, optimize for the value of λ by varying the value of log(λ) from -40 to -20 on the development set. Draw the plot of the normalized error of the model for the training, development and test for the case of N = 10 and the case of N=100. The normalized error of the model is defined as:

NE_w(x, t) = 1/N [Σ_i=1..N[t_i - Σ_m=1..Mw_mx_i^m]²]^1/2

Write the function optimizePLS(xt, tt, xv, tv, M) which selects the best value λ given a dataset for training (xt, tt) and a development test (xv, tv). Describe your conclusion from this plot.

Q1.4 Probabilistic Regression Framework

We now consider the same problem of regression (learning a function from a dataset) formulated in a probabilistic framework. Consider:

t_n = y(x_n; w) + ε_n

We now model the distribution of ε_n as a probabilistic model:

ε_n ~ N(0, σ²)

since t_n = y(x_n; w) + ε_n:

p(t_n | x_n, w, σ²) = N(y(x_n; w), σ²)

We now assume that the observed data points in the dataset are all drawn in an independent manner (iid), we can then express the likelihood of the whole dataset:

p(t | x,w,σ²) = ∏_n=1..Np(t_n | x_n, w, σ²)
              = ∏_n=1..N(2Πσ²)^-1/2exp[-{t_n - y(x_n, w)}² / 2σ²]

We consider this likelihood as a function of the parameters (w and σ) given a dataset (t, x).

If we consider the log-likelihood (which is easier to optimize because we have to derive a sum instead of a product), we get:

-log p(t | w, σ²) = N/2 log(2Πσ²) + 1/2σ²Σ_n=1..N{t_n - y(x_n;w)}²

We see that optimizing the log-likelihood of the dataset is equivalent to minimizing the error function of the least-squares method. That is to say, the least-squares method is understood as the maximum likelihood estimator (MLE) of the probabilistic model we just developed, which produces the values w_ML = w_LS.

We can also optimize this model with respect to the second parameter σ² which, when we work out the derivation and the solution of the equation, yields:

σ²_ML = 1/N Σ_n=1..N{y(x_n, w_ML) - t_n}²

Given w_ML and σ²_ML, we can now compute the plugin posterior predictive distribution, which gives us the probability distribution of the values of t given an input variable x:

p(t | x, w_ML, σ²_ML) = N(t | y(x,w_ML), σ²_ML)

This is a richer model than the least-squares model studied above, because it not only estimates the most-likely value t given x, but also the precision of this prediction given the dataset. This precision can be used to construct a confidence interval around t.

We further extend the probabilistic model by considering a Bayesian approach to the estimation of this probabilistic model instead of the maximum likelihood estimator (which is known to over-fit the dataset). We choose a prior over the possible values of w which we will, for convenience reasons, select to be of a normal form (this is a conjugate prior as explained in our review of basic probabilities):

p(w | α) = ∏_m=0..M(α / 2Π)^1/2 exp{-α/2 w_m²}
             = N(w | 0, 1/αI)

This prior distribution expresses our degree of belief over the values that w can take. In this distribution, α plays the role of a hyper-parameter (similar to λ in the regularization model above).

The Bayesian approach consists of applying Bayes rule to the estimation task of the posterior distribution given the dataset:


p(w | t, α, σ²) = likelihood . prior / normalizing-factor
                = p(t | w, σ²)p(w | α) / p(t | α, σ²)

Since we wisely chose a conjugate prior for our distribution over w, we can compute the posterior analytically:

p(w | x, t, α, σ²) = N(μ, Σ)

where Φ is the design matrix as above:

μ = (Φ^TΦ + σ²αI)^-1Φ^Tt

Σ = σ²(Φ^TΦ + σ²αI)^-1

Given this approach, instead of learning a single point estimate of w as in the least-squares and penalized least-squares methods above, we have inferred a distribution over all possible values of w given the dataset. In other words, we have updated our belief about w from the prior (which does not include any information about the dataset) using new information derived from the observed dataset.

We can determine w by maximizing the posterior distribution over w given the dataset and the prior belief. This approach is called the maximum posterior (usually written MAP). If we solve the MAP given our selection of the normal conjugate prior, we obtain that the posterior reaches its maximum on the minimum of the following function of w:

1/2σ²Σ_n=1..N{y(x_n, w) - t_n}² + α/2w^Tw

We find thus that w_MAP is in fact the same as the solution of the penalized least-squares method for λ = α σ².

A fully Bayesian approach, however, does not look for point-estimators of parameters like w. Instead, we are interested in the predictive distribution p(t | x, x, t). The Bayesian approach consists of marginalizing the predictive distribution over all possible values of the parameters:

p(t | x, x, t) = ∫ p(t|x, w)p(w | x, t) dw

(For simplicity, we have hidden the dependency on the hyper-parameters α and σ in this formula.) On this simple case, with a simple normal distribution and normal prior over w, we can solve this integral analytically, and we obtain:

p(t | x, x, t) = N(t | m(x), s²(x))

where the mean and variance are:

m(x) = 1/σ² Φ(x)^TS Σ_n=1..NΦ(x_n)t_n

s²(x) = σ² + Φ(x)^TSΦ(x)

S^-1 = αI + 1/σ² Σ_n=1..NΦ(x_n)Φ(x_n)^T

Φ(x) = (Φ₀(x) ... Φ_M(x))^T = (1 x x² ... x^M)^T

Note that the mean and the variance of this predictive distribution depend on x.

Your task: write a function bayesianEstimator(x, t, M, alpha, sigma2) which given the dataset (x, t) of size N, and the parameters M, alpha, and sigma2 (the variance), returns a tuple of 2 functions (m(x) var(x)) which are the mean and variance of the predictive distribution inferred from the dataset, based on the parameters and the normal prior over w. As you can see, in the Bayesian approach, we do not learn an optimal value for the parameter w, but instead, marginalize out this parameter and directly learn the predictive distribution.

Note that in Python, a function can return a function (like in Scheme) using the following syntax:

def adder(x): 
    return lambda(y): x+y

a2 = adder(2)

print(a2(3))       // prints 5
print(adder(4)(3)) // prints 7

Draw the plot of the original function y = sin(2Πx) over the range [0..1], the mean of the predictive distribution m(x) and the confidence interval (m(x) - var(x)^1/2) and (m(x) + var(x)^1/2) (that is, one standard deviation around each predicted point) for the values:

alpha = 0.005
sigma2 = 1/11.1
M = 9

over a synthetic dataset of size N=10 and N=100. The plot should look similar to the Figure below (from Bishop p.32).

Interpret the height of the band around the most likely function in terms of the distribution of the xs in your synthetic dataset. Can you think of ways to make this height very small in one segment of the function and large in another?

Q2: Named Entity Recognition with SVM

Q2.1 Named Entity Recognition

The task of Named Entity Recognition (NER) involves the recognition of names of persons, locations, organizations, dates in free text. For example, the following sentence is tagged with sub-sequences indicating PER (for persons), LOC (for location) and ORG (for organization):

Wolff, currently a journalist in Argentina, played with Del Bosque in the final years of the seventies in Real Madrid.

[PER Wolff ] , currently a journalist in [LOC Argentina ] , played with [PER Del Bosque ] in the final years 
of the seventies in [ORG Real Madrid ] .

NER involves 2 sub-tasks: identifying the boundaries of such expressions (the open and close brackets) and labelling the expressions (with tags such as PER, LOC or ORG). As for the task of chunking, this sequence labelling task is mapped to a classification tag, using a BIO encoding of the data:

        Wolff B-PER
            , O
    currently O
            a O
   journalist O
           in O
    Argentina B-LOC
            , O
       played O
         with O
          Del B-PER
       Bosque I-PER
           in O
          the O
        final O
        years O
           of O
          the O
    seventies O
           in O
         Real B-ORG
       Madrid I-ORG
            . O

Q2.2 The CoNLL 2002 Dataset

The dataset we will use for this question is derived from the CoNLL 2002 shared task - which is about NER in Spanish and Dutch. The dataset is included in the NLTK distribution. Explanations on the dataset are provided in the CoNLL 2002 page.

To access the data in Python, do:

from nltk.corpus import conll2002

etr = conll2002.chunked_sents('esp.train') # In Spanish
eta = conll2002.chunked_sents('esp.testa') # In Spanish
etb = conll2002.chunked_sents('esp.testb') # In Spanish

dtr = conll2002.chunked_sents('ned.train') # In Dutch
dta = conll2002.chunked_sents('ned.testa') # In Dutch
dtb = conll2002.chunked_sents('ned.testb') # In Dutch

The data consists of three files per language (Spanish and Dutch): one training file and two test files testa and testb. The first test file is to be used in the development phase for finding good parameters for the learning system. The second test file will be used for the final evaluation.

Q2.3 Using an SVM Classifier

Your task is to encode the training dataset and test dataset, experiment with good SVM parameters, and present accuracy and error analysis: Precision and Recall per tag and confusion matrix.

Use the NLTK SklearmClassifier wrapper for the sklearn.svm classifier, which is used as shown here:

from nltk.classify import SklearnClassifier
train_data = [({"a": 4, "b": 1, "c": 0}, "ham"),
              ({"a": 5, "b": 2, "c": 1}, "ham"),
              ({"a": 0, "b": 3, "c": 4}, "spam"),
              ({"a": 5, "b": 1, "c": 1}, "ham"),
              ({"a": 1, "b": 4, "c": 3}, "spam")]
test_data = [{"a": 3, "b": 2, "c": 1},
             {"a": 0, "b": 3, "c": 7}]
from sklearn.svm import SVC
classif = SklearnClassifier(SVC(), sparse=False).train(train_data)
classif.classify_many(test_data)

Read about the SVM classifier in the sklearn svm documentation. The key parameters to optimize when training an SVM classifier are:

The C parameter indicates to what extent the classifier is ready to accept samples that are classified on the "wrong" side of the decision boundary.
The kernel parameter determines the shape of the decision boundary. You can try 'linear' and 'rbf' for your tests.
The SVC() class uses an all-vs-all multiclasss strategy, whereas the LinearSVC() class uses an one-vs-all strategy.

classifl = SklearnClassifier(SVC(kernel='linear', C=1.0), sparse=False).train(train_data)
classifr = SklearnClassifier(SVC(kernel='rbf', C=0.5), sparse=False).train(train_data)
from sklearn.svm import LinearSVC
classif_ova = SklearnClassifier(LinearSVC(C=1.0), sparse=False).train(train_data) # LinearSVC() always uses a linear kernel.

You can find code in the Classification notebook that shows how to optimize the C value on a dataset using sklearn methods.

Q2.4 Your Task

Your task consists of:

Choosing good features for encoding the problem.
Encode your training dataset.
Run a supervised SVM classifier over the training dataset.
Train and test the model.
Perform error analysis and fine tune model parameters on the testa part of the datasets.
Perform evaluation over the testb part of the dataset, reporting on accuracy, per label precision, per label recall and per label F-measure, and confusion matrix.

Here is a list of features that have been found appropriate for NER in previous work:

The word form (the string as it appears in the sentence)
The POS of the word (which is provided in the dataset)
ORT - a feature that captures the orthographic (letter) structure of the word. It can have any of the following values: number, contains-digit, contains-hyphen, capitalized, all-capitals, URL, punctuation, regular.
prefix1: first letter of the word
prefix2: first two letters of the word
prefix3: first three letters of the word
suffix1: last letter of the word
suffix2: last two letters of the word
suffix3: last three letters of the word

To encode sentences into feature vectors, you must first pre-process the training set and assign numbers to each observed feature. You must therefore write Python code that will maintain dictionaries of each feature, and be able to compute vectors representing each word in test data.

For example, given the following toy training data, the encoding of the features would be:

        Wolff NP  B-PER
            , ,   O
    currently RB  O
            a AT  O
   journalist NN  O
           in IN  O
    Argentina NP  B-LOC
            , ,   O
       played VBD O
         with IN  O
          Del NP  B-PER
       Bosque NP  I-PER
           in IN  O
          the AT  O
        final JJ  O
        years NNS O
           of IN  O
          the AT  O
    seventies NNS O
           in IN  O
         Real NP  B-ORG
       Madrid NP  I-ORG
            . .   O

Classes
1 B-PER
2 I-PER
3 B-LOC
4 I-LOC
5 B-ORG
6 I-ORG
7 O

Feature WORD-FORM:
1 Wolff
2 ,
3 currently
4 a
5 journalist
6 in
7 Argentina
8 played
9 with
10 Del
11 Bosque
12 the
13 final
14 years
15 of
16 seventies
17 Real
18 Madrid
19 .

Feature POS
20 NP
21 ,
22 RB
23 AT
24 NN
25 VBD
26 JJ
27 NNS
28 .

Feature ORT
29 number
30 contains-digit
31 contains-hyphen
32 capitalized
33 all-capitals
34 URL
35 punctuation
36 regular

Feature Prefix1
37 W
38 ,
39 c
40 a
41 j
42 i
43 A
44 p
45 w
46 D
47 B
48 t
49 f
50 y
51 o
52 s
53 .

Given this encoding, we can compute the vector representing the first word "Wolff NP B-PER" as:

# Class: B-PER=1
# Word-form: Wolff=1
# POS: NP=20
# ORT: Capitalized=32
# prefix1: W=37
1 1:1 20:1 32:1 37:1

When you encode the test dataset, some of the word-forms will be unknown (not seen in the training dataset). You should, therefore, plan for a special value for each feature of type "unknown" when this is expected.

You can either write the code as explained above, or learn to use the Scikit-learn pipeline library - specifically looking at the DictVectorizer class.

Q3: Document Classification with Naive Bayes

In this question, we will investigate document classification, using the Reuter Dataset and investigate the behavior of Naive Bayes classifiers.

Q3.1 The Reuters-21578 Dataset

The Reuters dataset is distributed with NLTK. It contains a collection of news wires that are categorized into 90 different categories. The distribution across categories is extremely unbalanced - with one very large category (companies earning reports) and many very small categories. The dataset contains about 1.7M words and 50K documents.

Here are the methods available to access the dataset in NLTK:

from nltk.corpus import reuters

print(reuters.readme())

cats = reuters.categories()
print("Reuters has %d categories:\n%s" % (len(cats), cats))

total = len(reuters.paras())
total_multi = 0
for c in cats:
    lc = len(reuters.paras(categories=[c]))
    total_multi += lc
    print("%s ---- %d documents out of %d" % (c, lc, total))
print("Articles belong to %.4f categories on average" % ((total_multi * 1.0) / total))
print("There are %.4f articles per category on average" % ((total * 1.0) / len(cats)))

As you see, in a few cases, a document belongs to more than one category (on average a document belongs to 1.2 categories).

reuters.fileids(categories=['yen'])

[u'test/14913',
 u'test/15400',
 u'test/15432',
 u'test/15454',
 ...
 u'training/10364',
 u'training/10679',
 u'training/10681',
 u'training/10684',
 u'training/10689',
 ...]

Q3.2 Naive Bayes for Multi-label Classification

Read the code of the Naive Bayes Classifier in NLTK.

The NLTK code uses the ELEProbDist estimator to convert the observed counts into a smoothed probability distribution. ELEProbDist is a variant of the Lidstone estimator with a parameter of gamma=0.5. Using the fact that a Dirichlet distribution is the conjugate of a multinomial distribution, show that the Lidstone estimator corresponds to the MAP of the empirical distribution with a symmetric Dirichlet prior with parameter gamma.

Show that this version of Naive Bayes implements a version of the "all-vs-all" multi-class strategy. You should find inspiration in this article: One-versus-all alters Naive Bayes (J. Rennie, 2006). Show in your notebook the specific lines of code that implement the all-vs-all strategy.

Q3.3 One-vs-all Naive Bayes for Multi-label Classification

Because the labels are very unbalanced, it will help in our case to construct a one-vs-all version of Naive Bayes. This can be done by pre-processing the labelled data passed to the existing NLTK implementation - or, in a more efficient manner, by writing a new version of the Naive Bayes classifier that implements the one-vs-all strategy.

For background, you can see the demonstration that one-vs-all helps improve the accuracy of unbalanced document classification in this article (Tackling the Poor Assumptions of Naive Bayes Text Classifiers, Rennie et al, ICML 2003).

Your task:

Run the original NLTK version of Naive Bayes classifier on the Reuters dataset. Report accuracy, per category precision and recall. In your implementation, keep only the top-K most frequent words present in the vocabulary of the dataset as features - with K=1000 and K=2000 (there are about 41K distinct words in the dataset). In the rest, keep the value of K that gives better accuracy.
Do you observe a relationship between size of the category and quality of the classifier? Draw a plot to illustrate your point.
Write a one-vs-all version of Naive Bayes. Test it on the Reuters dataset and compare the results with those ofthe all-vs-all version.

Q3.4 TF*IDF Feature Selection for Naive Bayes for Multi-label Classification

We now want to use the tf-idf values for the words instead of their frequency. We will use the Scikit-Learn tf-idf vectorizer to transform the original document representations in the corpus into vectors of tf-idf values.

The following notebook illustrates how this transformation is performed: ReutersDataset.ipynb.

Your task:

Train the one-vs-all Naive Bayes classifier over tf-idf values (on the training set of the Reuters dataset).
Test this classifier on the test part of the Reuters dataset. Report on accuracy, per label precision and recall.

Last modified Nov 30, 2014