$$t_i = y(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(\mu, \sigma^2)$$

where the $x_i$ values are equidistant on the $[0, 1]$ segment (that is, $x_1 = 0$, $x_2 = 1/(N-1)$, $x_3 = 2/(N-1)$, ..., $x_N = 1.0$), $\mu = 0.0$, $\sigma = 0.03$, and $y(x) = \sin(2\pi x)$.

Our objective will be to "learn" the function $y$ from the noisy dataset we generate. The function generateDataset(N, f, sigma) should return a tuple with the 2 vectors x and t. Draw the scatterplot of (x, t) using matplotlib for N=50. Look at the documentation of the numpy.random.normal function for an example of usage, and at the definition of numpy.linspace to generate your dataset.
Note: a useful property of Numpy arrays is that a function can be applied to a whole array element-wise, as follows:
```python
import math
import numpy as np

def s(x):
    return x**2

def f(x):
    return math.sin(2 * math.pi * x)

vf = np.vectorize(f)        # Create a vectorized version of f
z = np.array([1, 2, 3, 4])
sz = s(z)                   # Simple functions can be applied to an array directly
sz.shape                    # Same dimension as z: (4,)
fz = vf(z)                  # Must use the vectorized version of f
fz.shape
```
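As an illustration, here is one way generateDataset and the N=50 scatterplot could look; this is a minimal sketch assuming numpy and matplotlib, using np.linspace and np.random.normal as suggested above:

```python
import numpy as np
import matplotlib.pyplot as plt

def generateDataset(N, f, sigma):
    """Return (x, t): N equidistant points on [0, 1] and noisy targets f(x_i) + N(0, sigma^2)."""
    x = np.linspace(0.0, 1.0, N)
    t = np.vectorize(f)(x) + np.random.normal(0.0, sigma, N)
    return x, t

x, t = generateDataset(50, lambda v: np.sin(2 * np.pi * v), 0.03)
plt.scatter(x, t)
plt.show()
```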
$$y(x) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M$$

Our objective is to estimate the vector $w = (w_0, \dots, w_M)$ from the dataset (x, t). We first attempt to solve this regression task by optimizing the square error function (this method is called least squares):
Define:

$$E(w) = \frac{1}{2}\sum_i (y(x_i) - t_i)^2 = \frac{1}{2}\sum_i \Big(\sum_k w_k x_i^k - t_i\Big)^2$$

If $t = (t_1, \dots, t_N)$, then define the design matrix to be the matrix $\Phi$ such that $\Phi_{nm} = x_n^m = \phi_m(x_n)$. We want to minimize the error function and, therefore, look for a solution to the linear system of equations:
$$\frac{\partial E}{\partial w_k} = 0 \quad \text{for } k = 0, \dots, M$$

When we work out the partial derivatives, we find that the solution of the following system gives the optimal value $w_{LS}$ given (x, t):
$$w_{LS} = (\Phi^T \Phi)^{-1} \Phi^T t$$

(Note: $\Phi$ is a matrix of dimension $N \times (M+1)$, $w$ is a vector of dimension $M+1$, and $t$ is a vector of dimension $N$.)

Here is how you write this type of matrix operation in Python using the Numpy library:
```python
import numpy as np

t = np.array([1, 2, 3, 4])                        # This is a vector of dim 4
t.shape
phi = np.array([[1, 1], [2, 4], [3, 3], [2, 4]])  # This is a 4x2 matrix
phi.shape
prod = np.dot(phi.T, phi)                         # prod is a 2x2 matrix
prod.shape
i = np.linalg.inv(prod)                           # i is a 2x2 matrix
i.shape
m = np.dot(i, phi.T)                              # m is a 2x4 matrix
m.shape
w = np.dot(m, t)                                  # w is a vector of dim 2
w.shape
```

Implement a method OptimizeLS(x, t, M) which, given the dataset (x, t), returns the optimal polynomial of degree M that approximates the dataset according to the least-squares objective. Plot the learned polynomial $w^*_M(x)$ and the real function $\sin(2\pi x)$ for a dataset of size N=10 and M = 1, 3, 5, 10.
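A minimal sketch of OptimizeLS, assuming numpy; the helper designMatrix is a name introduced here for illustration (np.vander with increasing=True puts $x^m$ in column m), and np.linalg.solve replaces the explicit inverse for numerical stability:

```python
import numpy as np

def designMatrix(x, M):
    """N x (M+1) matrix with Phi[n, m] = x_n ** m."""
    return np.vander(x, M + 1, increasing=True)

def OptimizeLS(x, t, M):
    """Solve (Phi^T Phi) w = Phi^T t, i.e. w_LS = (Phi^T Phi)^-1 Phi^T t."""
    phi = designMatrix(x, M)
    return np.linalg.solve(np.dot(phi.T, phi), np.dot(phi.T, t))
```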
Define:

$$E_{PLS}(w) = E(w) + \lambda E_W(w)$$

where $E_{PLS}$ is called the penalized least-squares function of $w$ and $E_W$ is the penalty function. We will use a standard penalty function:

$$E_W(w) = \frac{1}{2} w^T w = \frac{1}{2}\sum_{m=0}^{M} w_m^2$$

When we work out the partial derivatives of the minimization problem, we find, in closed form, that the solution to the penalized least-squares problem is:
$$w_{PLS} = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T t$$

$\lambda$ is called a hyper-parameter (that is, a parameter which influences the value of the model's parameters $w$). Its role is to balance how well the function fits the dataset (as in the least-squares model) against how smooth it is.

Write a function optimizePLS(x, t, M, lambda) which returns the optimal parameters $w_{PLS}$ given M and lambda (note that lambda is a reserved word in Python, so use another parameter name such as lam).

We want to optimize the value of $\lambda$. The way to do this is to use a validation set in addition to our training set. To construct a validation set, we will extend our synthetic dataset construction function to return 3 samples: one for training, one for validation and one for testing. Write a function generateDataset3(N, f, sigma) which returns 3 pairs of vectors of size N each: (xtest, ttest), (xvalidate, tvalidate) and (xtrain, ttrain). The target values are generated as above with Gaussian noise $\mathcal{N}(0, \sigma^2)$. Look at the documentation of the function numpy.random.shuffle() as a way to generate 3 subsets of size N from the list of points generated by linspace.

Given the synthetic dataset, optimize the value of $\lambda$ by varying $\log(\lambda)$ from -20 to 5 on the validation set. Draw the plot of the normalized error of the model on the training, validation and test sets for the cases N=10 and N=100. The normalized error of the model is defined as:
$$NE_w(x, t) = \frac{1}{N}\Big[\sum_{i=1}^{N}\Big(t_i - \sum_{m=0}^{M} w_m x_i^m\Big)^2\Big]^{1/2}$$

Write the function optimizePLS(xt, tt, xv, tv, M) which selects the best value of $\lambda$ given a training set (xt, tt) and a validation set (xv, tv). Describe your conclusions from this plot.
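A possible sketch of these pieces, reusing designMatrix from the least-squares sketch above; lam stands in for the reserved word lambda, and the validation-driven selector is named optimizePLSValidation here only to avoid clashing with optimizePLS(x, t, M, lam):

```python
import numpy as np

def optimizePLS(x, t, M, lam):
    """Penalized least squares: w_PLS = (Phi^T Phi + lam * I)^-1 Phi^T t."""
    phi = designMatrix(x, M)
    A = np.dot(phi.T, phi) + lam * np.eye(M + 1)
    return np.linalg.solve(A, np.dot(phi.T, t))

def generateDataset3(N, f, sigma):
    """Shuffle 3N equidistant points on [0, 1] and split them into 3 pairs of size N."""
    x = np.linspace(0.0, 1.0, 3 * N)
    np.random.shuffle(x)
    t = np.vectorize(f)(x) + np.random.normal(0.0, sigma, 3 * N)
    return ((x[:N], t[:N]),           # (xtest, ttest)
            (x[N:2*N], t[N:2*N]),     # (xvalidate, tvalidate)
            (x[2*N:], t[2*N:]))       # (xtrain, ttrain)

def normalizedError(w, x, t):
    """NE_w(x, t) as defined above."""
    phi = designMatrix(x, len(w) - 1)
    return np.sqrt(np.sum((t - np.dot(phi, w)) ** 2)) / len(x)

def optimizePLSValidation(xt, tt, xv, tv, M):
    """Scan log(lambda) from -20 to 5 and keep the value minimizing validation error."""
    best_lam, best_err = None, float('inf')
    for log_lam in range(-20, 6):
        lam = np.exp(log_lam)
        w = optimizePLS(xt, tt, M, lam)
        err = normalizedError(w, xv, tv)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam
```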
$$t_n = y(x_n; w) + \varepsilon_n$$

We now model the distribution of the noise $\varepsilon_n$ probabilistically:
$$\varepsilon_n \sim \mathcal{N}(0, \sigma^2)$$

and since $t_n = y(x_n; w) + \varepsilon_n$:

$$p(t_n \mid x_n, w, \sigma^2) = \mathcal{N}(t_n \mid y(x_n; w), \sigma^2)$$

We now assume that the observed data points in the dataset are all drawn independently (iid); we can then express the likelihood of the whole dataset:
$$p(\mathbf{t} \mid \mathbf{x}, w, \sigma^2) = \prod_{n=1}^{N} p(t_n \mid x_n, w, \sigma^2) = \prod_{n=1}^{N} (2\pi\sigma^2)^{-1/2} \exp\Big[-\frac{(t_n - y(x_n, w))^2}{2\sigma^2}\Big]$$

We consider this likelihood as a function of the parameters ($w$ and $\sigma$) given a dataset ($\mathbf{t}$, $\mathbf{x}$). If we consider the log-likelihood (which is easier to optimize because we differentiate a sum instead of a product), we get:
$$-\log p(\mathbf{t} \mid w, \sigma^2) = \frac{N}{2}\log(2\pi\sigma^2) + \frac{1}{2\sigma^2}\sum_{n=1}^{N}(t_n - y(x_n; w))^2$$

We see that maximizing the likelihood of the dataset with respect to $w$ is equivalent to minimizing the error function of the least-squares method. That is to say, the least-squares method can be understood as the maximum likelihood estimator of the probabilistic model we just developed, producing $w_{ML} = w_{LS}$. We can also optimize this model with respect to the second parameter $\sigma^2$ which, when we work out the derivative and solve the resulting equation, yields:
$$\sigma^2_{ML} = \frac{1}{N}\sum_{n=1}^{N}(y(x_n, w_{ML}) - t_n)^2$$
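Both parameters can thus be estimated directly from the data. A short sketch, reusing OptimizeLS and designMatrix from the least-squares sketches above (recall $w_{ML} = w_{LS}$):

```python
import numpy as np

def maximumLikelihood(x, t, M):
    """Return (w_ML, sigma2_ML) for the degree-M polynomial model."""
    w_ml = OptimizeLS(x, t, M)            # w_ML coincides with w_LS
    phi = designMatrix(x, M)
    sigma2_ml = np.mean((np.dot(phi, w_ml) - t) ** 2)  # 1/N * sum of squared residuals
    return w_ml, sigma2_ml
```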
Given $w_{ML}$ and $\sigma^2_{ML}$, we can now compute the predictive distribution, which gives us the probability distribution of the values of $t$ given an input variable $x$:

$$p(t \mid x, w_{ML}, \sigma^2_{ML}) = \mathcal{N}(t \mid y(x, w_{ML}), \sigma^2_{ML})$$

This is a richer model than the least-squares model studied above, because it not only estimates the most likely value of $t$ given $x$, but also the precision of this prediction given the dataset. This precision can be used to construct a confidence interval around $t$.

We further extend the probabilistic model by considering a Bayesian approach to the estimation of this model instead of the maximum likelihood estimator (which is known to overfit the dataset). We choose a prior over the possible values of $w$ which we will, for convenience, select to be of a normal form (this is a conjugate prior, as explained in our review of basic probabilities):
$$p(w \mid \alpha) = \prod_{m=0}^{M}\Big(\frac{\alpha}{2\pi}\Big)^{1/2}\exp\Big\{-\frac{\alpha}{2}w_m^2\Big\} = \mathcal{N}(w \mid 0, \alpha^{-1}I)$$

This prior distribution expresses our degree of belief over the values that $w$ can take. In this distribution, $\alpha$ plays the role of a hyper-parameter (similar to $\lambda$ in the regularization model above). The Bayesian approach consists of applying Bayes' rule to the estimation of the posterior distribution given the dataset:
$$p(w \mid \mathbf{t}, \alpha, \sigma^2) = \frac{\text{likelihood} \times \text{prior}}{\text{normalizing factor}} = \frac{p(\mathbf{t} \mid w, \sigma^2)\, p(w \mid \alpha)}{p(\mathbf{t} \mid \alpha, \sigma^2)}$$

Since we wisely chose a conjugate prior for our distribution over $w$, we can compute the posterior analytically:
$$p(w \mid \mathbf{x}, \mathbf{t}, \alpha, \sigma^2) = \mathcal{N}(w \mid \mu, \Sigma)$$

where, $\Phi$ being the design matrix as above:

$$\mu = (\Phi^T\Phi + \sigma^2\alpha I)^{-1}\Phi^T \mathbf{t}$$
$$\Sigma = \sigma^2(\Phi^T\Phi + \sigma^2\alpha I)^{-1}$$

Given this approach, instead of learning a single point estimate of $w$ as in the least-squares and penalized least-squares methods above, we have inferred a distribution over all possible values of $w$ given the dataset. In other words, we have updated our belief about $w$ from the prior (which does not include any information about the dataset) using new information derived from the observed dataset.

We can determine $w$ by maximizing the posterior distribution over $w$ given the dataset and the prior belief. This approach is called maximum a posteriori (usually written MAP). If we solve the MAP problem given our selection of the normal conjugate prior, we find that the posterior reaches its maximum at the minimum of the following function of $w$:
$$\frac{1}{2\sigma^2}\sum_{n=1}^{N}(y(x_n, w) - t_n)^2 + \frac{\alpha}{2}w^T w$$

We thus find that $w_{MAP}$ is in fact the same as the solution of the penalized least-squares method for $\lambda = \alpha\sigma^2$.

A fully Bayesian approach, however, does not look for point estimates of parameters like $w$. Instead, we are interested in the predictive distribution $p(t \mid x, \mathbf{x}, \mathbf{t})$. The Bayesian approach consists of marginalizing the predictive distribution over all possible values of the parameters:
$$p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, w)\, p(w \mid \mathbf{x}, \mathbf{t})\, dw$$

(For simplicity, we have hidden the dependency on the hyper-parameters $\alpha$ and $\sigma$ in this formula.) In this simple case, with a normal distribution and a normal prior over $w$, we can solve this integral analytically, and we obtain:
$$p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}(t \mid m(x), s^2(x))$$

where the mean and variance are:

$$m(x) = \frac{1}{\sigma^2}\,\phi(x)^T S \sum_{n=1}^{N}\phi(x_n)\, t_n$$
$$s^2(x) = \sigma^2 + \phi(x)^T S\, \phi(x)$$
$$S^{-1} = \alpha I + \frac{1}{\sigma^2}\sum_{n=1}^{N}\phi(x_n)\phi(x_n)^T$$
$$\phi(x) = (\phi_0(x), \dots, \phi_M(x))^T = (1, x, x^2, \dots, x^M)^T$$

Note that the mean and the variance of this predictive distribution depend on $x$. Your task: write a function bayesianEstimator(x, t, M, alpha, sigma2) which, given the dataset (x, t) of size N and the parameters M, alpha, and sigma2 (the variance), returns a tuple of 2 functions (m(x), var(x)) which are the mean and variance of the predictive distribution inferred from the dataset, based on the parameters and the normal prior over $w$. As you can see, in the Bayesian approach, we do not learn an optimal value for the parameter $w$; instead, we marginalize out this parameter and directly learn the predictive distribution. Note that in Python, a function can return a function (like in Scheme) using the following syntax:
```python
def adder(x):
    return lambda y: x + y

a2 = adder(2)
print(a2(3))  # prints 5
```

Draw the plot of the original function $y = \sin(2\pi x)$ over the range [0, 1], the mean of the predictive distribution $m(x)$, and the confidence interval $(m(x) - \sqrt{var(x)})$ and $(m(x) + \sqrt{var(x)})$ (that is, one standard deviation around each predicted point) for the values:
alpha = 0.005, sigma2 = 1/11.1, M = 9

over a synthetic dataset of size N=10 and N=100. The plot should look similar to the Figure below (from Bishop p. 32).
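A minimal sketch of bayesianEstimator following the closed-form expressions above (assuming numpy; the returned functions close over the posterior matrix S):

```python
import numpy as np

def bayesianEstimator(x, t, M, alpha, sigma2):
    """Return (m, var): mean and variance functions of the predictive distribution."""
    def phi(v):
        return np.array([v ** m for m in range(M + 1)])   # (1, v, v^2, ..., v^M)

    S_inv = alpha * np.eye(M + 1)                         # S^-1 = alpha*I + 1/sigma2 * sum phi phi^T
    for xn in x:
        p = phi(xn)
        S_inv += np.outer(p, p) / sigma2
    S = np.linalg.inv(S_inv)
    weighted_sum = sum(phi(xn) * tn for xn, tn in zip(x, t))  # sum_n phi(x_n) t_n

    def m(v):
        return np.dot(phi(v), np.dot(S, weighted_sum)) / sigma2

    def var(v):
        return sigma2 + np.dot(phi(v), np.dot(S, phi(v)))

    return m, var
```

Evaluating m and var over np.linspace(0, 1, 200) and plotting $m(x) \pm \sqrt{var(x)}$ then reproduces the Bishop-style figure.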
```python
from nltk.corpus import movie_reviews

negative = movie_reviews.fileids('neg')
positive = movie_reviews.fileids('pos')
movie_reviews.words(fileids=['neg/cv000_29416.txt'])
```
```python
bag_of_words(movie_reviews.words(fileids=['neg/cv000_29416.txt']))
# --> returns a list of feature vectors for the document
```

Write a function evaluate_features(feature_extractor, N) which learns a NaiveBayes classifier on the movie_reviews dataset and evaluates it, reporting accuracy and the most informative features (as illustrated below).
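A hedged sketch of the baseline extractor and the evaluation harness, assuming nltk's NaiveBayesClassifier; the train/test policy is an assumption (N is read here as "hold out 1/N of each class for testing"), since the statement does not fully specify it:

```python
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def bag_of_words(words):
    """The simplest feature dictionary: map each word to True."""
    return dict((word, True) for word in words)

def evaluate_features(feature_extractor, N):
    """Train a NaiveBayes classifier on movie_reviews and report its accuracy."""
    neg_ids = movie_reviews.fileids('neg')
    pos_ids = movie_reviews.fileids('pos')
    neg_feats = [(feature_extractor(movie_reviews.words(fileids=[f])), 'neg') for f in neg_ids]
    pos_feats = [(feature_extractor(movie_reviews.words(fileids=[f])), 'pos') for f in pos_ids]
    neg_cut, pos_cut = len(neg_feats) // N, len(pos_feats) // N
    train = neg_feats[neg_cut:] + pos_feats[pos_cut:]   # keep (N-1)/N for training
    test = neg_feats[:neg_cut] + pos_feats[:pos_cut]    # hold out 1/N for testing
    classifier = NaiveBayesClassifier.train(train)
    print('accuracy:', nltk.classify.util.accuracy(classifier, test))
    classifier.show_most_informative_features()
    return classifier
```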
NLTK provides a standard list of English stop words:

```python
from nltk.corpus import stopwords

stopset = set(stopwords.words('english'))

def stopword_remover(words):
    return dict([(word, True) for word in words if word not in stopset])
```

Write a new feature extractor over the movie reviews dataset which keeps only the top-K most frequent words across the whole dataset and filters out stop words. Since K is a parameter, define a higher-order function make_topK_non_stopword_extractor(K, stopwords); a possible sketch follows the evaluation example below. Evaluate this new feature extractor:
```python
extractor = make_topK_non_stopword_extractor(10000, stopset)
classifier = evaluate_features(extractor, 10)
```

Compare the behavior of this new feature extractor with the baseline bag of words. Try to optimize the value of the parameter K to learn a good classifier. Do you see a predictable behavior of the accuracy and other metrics of the classifier as K varies relative to the total vocabulary size W of the training set? Draw a plot of accuracy vs. K/W. Identify documents which are classified differently by the 2 classifiers, report the new_positives and new_negatives, and explain the differences you observe.
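One possible sketch of the higher-order constructor, assuming nltk's FreqDist for counting word frequencies over the whole corpus:

```python
from nltk import FreqDist
from nltk.corpus import movie_reviews

def make_topK_non_stopword_extractor(K, stopset):
    """Return an extractor keeping only the K most frequent non-stopwords in the corpus."""
    fd = FreqDist(w for w in movie_reviews.words() if w not in stopset)
    topK = set(w for w, _ in fd.most_common(K))
    def extractor(words):
        return dict((word, True) for word in words if word in topK)
    return extractor
```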
```
Most Informative Features
   magnificent = True    pos : neg = 15.0 : 1.0
   outstanding = True    pos : neg = 13.6 : 1.0
     insulting = True    neg : pos = 13.0 : 1.0
    vulnerable = True    pos : neg = 12.3 : 1.0
     ludicrous = True    neg : pos = 11.8 : 1.0
   uninvolving = True    neg : pos = 11.7 : 1.0
        avoids = True    pos : neg = 11.7 : 1.0
    astounding = True    pos : neg = 10.3 : 1.0
   fascination = True    pos : neg = 10.3 : 1.0
       idiotic = True    neg : pos =  9.8 : 1.0
```

We will try to filter the features by keeping only those that are tagged with a "promising" part-of-speech tag. Use the best part-of-speech tagger you trained in Assignment 1 on the Brown corpus, and apply it to the movie_reviews dataset. Make sure you train your tagger on lower-case text only, since the movie reviews dataset contains only lower-case text. Write a feature extractor constructor make_pos_extractor which constructs a feature extractor that filters words by their POS. The extracted features are words that are tagged with one of the requested tags.
```python
extractor = make_pos_extractor(['NN', 'VBG', 'ADJ', 'ADV'])
classifier = evaluate_features(extractor, 10)
```

Identify a good set of POS tags that gives good results. Explain your observations.
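A minimal sketch of make_pos_extractor; it assumes your trained Assignment 1 tagger is in scope under the hypothetical name tagger, exposing the standard NLTK tag(tokens) method:

```python
def make_pos_extractor(tags):
    """Return an extractor keeping only words tagged with one of the requested POS tags."""
    tagset = set(tags)
    def extractor(words):
        tagged = tagger.tag([w.lower() for w in words])  # tagger: your Assignment 1 tagger (assumed)
        return dict((word, True) for word, tag in tagged if tag in tagset)
    return extractor
```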
```python
import itertools
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

score = BigramAssocMeasures.chi_sq  # chi-square measure of association strength

def strong_bigrams(words, score_fn, n):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(score_fn, n)
    return [bigram for bigram in itertools.chain(words, bigrams)]
```

Use this method to extract good bigrams over the training dataset and compare this with the baselines listed above.
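For instance, the combined word-and-bigram list can be wrapped as a feature extractor and fed to evaluate_features; the value n=200 is only an illustration:

```python
def bigram_extractor(words):
    # Each word and each of the 200 strongest bigrams becomes a boolean feature
    return dict((f, True) for f in strong_bigrams(list(words), score, 200))

classifier = evaluate_features(bigram_extractor, 10)
```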