Assignment 1

Due: Thu 10 Dec 2015 Midnight

Natural Language Processing - Fall 2016 Michael Elhadad

This assignment covers 2 parts:

  1. Statistical models for parts-of-speech tagging.
  2. Linear regression in a probabilistic model.
The objectives of the assignment are for Part 1:
  1. Learn how to use Python
  2. Learn how to access an annotated text corpus using NLTK corpus classes
  3. Learn how to annotate and extend a text corpus
  4. Learn how to access text resources on the Internet and clean them before they can be annotated
  5. Learn how to explore the statistical properties of a corpus
  6. Learn to train and evaluate simple tagging models
  7. Experiment with a near-state-of-the-art POS tagger using the averaged perceptron learning method.
For Part 2, on distributions, regression and classification, the objectives are:
  1. Learn about basic statistical distributions (Normal, Multinomial), expectations, sampling and estimation.
  2. Learn how to perform regression using a synthetic dataset.

Make sure you have installed scipy, scikit-learn and numpy to work on this assignment. This should be already available if you have installed the Anaconda distribution.

Submit your solution in the form of an iPython (Jupyter) notebook file (with extension ipynb). Images should be submitted as PNG or JPG files. The whole code should also be submitted as a separate folder with all necessary code to run the questions separated in clearly documented functions.

Look at the following two notebooks as examples to get you started - copy them to your Anaconda folder, then start your iPython notebook server and explore them:

You should execute the following tutorials on Notebooks, Numpy, Scipy and Matplotlib before starting the assignment: Scipy 2015 SKLearn Tutorial.

For a broader introduction to machine learning, it is highly recommended to work through the full set of tutorials available in the form of iPython notebooks in this SKLearn Tutorial, but this is not necessary for the purposes of this assignment.


Part 1: Parts of Speech Tagging

In this question, we will use the following tagset, called the "universal tagset", for parts of speech. It is slightly different from the original Penn Treebank/Brown tagset and is better suited for porting across languages:
brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
The Universal Part-of-Speech Tagset consists of 12 tags: VERB (verbs), NOUN (nouns), PRON (pronouns), ADJ (adjectives), ADV (adverbs), ADP (adpositions: prepositions and postpositions), CONJ (conjunctions), DET (determiners), NUM (cardinal numbers), PRT (particles), X (other: foreign words, abbreviations), and . (punctuation).

1.1 Data Exploration

1.1.1 Gathering and Cleaning Up Data

When we discussed the task of POS tagging in class, we assumed the text comes in a "clean" form: segmented in sentences and in words. We ran experiments on a clean corpus (correctly segmented) and obtained results of about 90% accuracy. In this question, we want to get a feeling of how difficult it is to clean real-world data. Please read the tutorial in Chapter 3 of the NLTK book. This chapter explains how to access "raw text" and clean it up: remove HTML tags, segment in sentences and in words.

Look at the data of the Brown corpus as it is stored in the nltk_data folder (by default, it is in a folder named like C:\nltk_data\corpora\brown under Windows). The format of the corpus is quite simple. We will attempt to add a new "section" to this corpus.

Look at the following Python library for performing Google Search (to run with Python3). This library allows you to send queries to Google and download the results in a simple manner in Python. To install this library, copy the file google3.py under your Python installation in /python/lib/google3. This library uses a popular library in the Python world called BeautifulSoup which can easily parse text in XML or "almost XML" format (such as noisy HTML pages).

It also uses a library called justext that parses HTML pages and extracts "clean text" out of them in a very smart manner (it removes JavaScript code, advertisements, banners, etc.).

The easiest way to install libraries in Python is to use the easy_install tool or the pip tool. pip comes pre-installed if you installed Anaconda.

For example, after you have installed easy_install, to install beautifulSoup, do the following:

Open a command shell
Type:
% easy_install beautifulsoup4
or:
% pip install beautifulsoup4
(Again, if you installed Anaconda, beautifulsoup is already included.)

Choose a query to execute using the google package and gather about 10 hits from this site. See the example code in google3.py.

You will realize that cleaning HTML is not an easy task. The JusText Python package does a very good job of cleaning up HTML by removing boilerplate HTML code around "interesting text". To install justext, use pip:

% pip install justext

Save the resulting hits into clean text files. Then run the best POS Tagger you have available from class (using NLTK taggers) on the resulting text files, using the universal POS tagset for the Brown corpus (12 tags). Save the resulting tagged file into text files in the same format expected by the Brown corpus. You should gather about 50 sentences. Look at the Python code under \Python\Lib\site-packages\nltk\corpus\reader\tagged.py to see explanations on how the nltk Brown corpus reader works.
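For illustration, here is a minimal sketch of this pipeline; the function names, the use of urllib for downloading and the word/TAG output format below are our assumptions, not a prescribed implementation:

import urllib.request
import justext
import nltk

def fetch_clean_text(url):
    # Download the page and keep only the paragraphs justext does not mark as boilerplate.
    html = urllib.request.urlopen(url).read()
    paragraphs = justext.justext(html, justext.get_stoplist("English"))
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)

def save_tagged_in_brown_format(text, out_path, tagger):
    # Segment, tokenize and tag the clean text, then save word/TAG tokens, one sentence per line.
    with open(out_path, "w", encoding="utf-8") as out:
        for sent in nltk.sent_tokenize(text):
            tagged = tagger.tag(nltk.word_tokenize(sent))
            out.write(" ".join("%s/%s" % (w, t) for (w, t) in tagged) + "\n")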

Finally, manually review the tagged text and fix the errors you find. Put the manually tagged file into the nltk_data Brown corpus folder, into one of the existing categories (or, if you are more ambitious, into a new category in addition to 'news', 'editorial', ...). Make sure the NLTK corpus reader can read the new text you have just added to the Brown corpus.

Review the tagging of the new text separately (2 analyses) and compare your tagging results. Report the list of words on which your 2 manual tagging decisions are different (write a function to compare two taggings of the same text saved in 2 different tagged files.) Show the differences between each of your tags and the tags produced by the automatic tagger. Report how long it took you to check the tagging of 50 sentences.
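As an illustration, a minimal sketch of such a comparison function, assuming both files use the Brown word/TAG format with one sentence per line (difflib is an alternative for line-level comparison):

def compare_taggings(path1, path2):
    """Return the list of (word, tag1, tag2) triples for tokens whose tags differ between the two files."""
    diffs = []
    with open(path1) as f1, open(path2) as f2:
        for line1, line2 in zip(f1, f2):
            tokens1 = [tok.rsplit("/", 1) for tok in line1.split()]
            tokens2 = [tok.rsplit("/", 1) for tok in line2.split()]
            for (w1, t1), (w2, t2) in zip(tokens1, tokens2):
                if t1 != t2:
                    diffs.append((w1, t1, t2))
    return diffs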

You must submit in your answer:

  1. The code used to run the queries and clean the resulting text.
  2. The clean text files submitted as input to your tagger. (Folder "clean")
  3. The tagged files returned by your tagger. (Folder "tagged")
  4. The tagged files manually verified and corrected by each human tagger. (Folder "checked1" and "checked2")
  5. The code used to compare the tags of "checked1" and "checked2" and list the differences. (Hint: search the Web to find existing Python library to compare files - for example difflib).
  6. The tagged files where the best tag has been chosen for each disagreement between checked1 and checked2 (Folder "final").
Report qualitatively on the errors you observe during this pipeline:
  1. Errors met while dealing with the Google engine
  2. Errors met while downloading the material from the Google hits
  3. Errors met while cleaning up the HTML pages
  4. Errors met while segmenting the text into sentences and words
  5. Errors met by the automatic tagger: how many errors were reported by checked1 and checked2 each, and altogether.
  6. Disagreements between checked1 and checked2.
  7. Actual accuracy obtained by your automatic tagger compared with the final verified version of the text collection.

1.1.2 Gathering Basic Statistics

When we use a tagger that relies on lexical information (for each word form, what is the distribution of tags that can be assigned to the word), a measure of the complexity of the POS task is related to the level of ambiguity of the word forms. In this question, we want to explore the level of ambiguity present in our dataset.

For the rest of this question, use the full Brown corpus distributed as part of NLTK.

Write a function that plots the number of words having a given number of tags. The X-axis should show the number of tags and the Y-axis the number of words having exactly this number of tags. Use the following example from the NLTK book as an inspiration:

import nltk
import pylab
from nltk.corpus import brown

def performance(cfd, wordlist):
    # Baseline: tag each word in the wordlist with its most frequent tag, back off to 'NN'.
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

def display():
    words_by_freq = list(nltk.FreqDist(brown.words(categories='news')))
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()
In your iPython Notebook, add the following command to make sure the pylab plots are displayed:
%matplotlib inline
Write a Python function that finds words with more than N observed tags. The function should return a ConditionalFreqDist object where the conditions are the words and the frequency distribution indicates the tag frequencies for each word.

Write a test function that verifies that the words indeed have more than N distinct tags in the returned value.

Write a function that given a word, finds one example of usage of the word with each of the different tags in which it can occur.

# corpus can be the tagged_sentences or tagged_words according to what is most convenient
>>> PlotNumberOfTags(corpus)

...show a plot with axis: X - number of tags (1, 2...) and Y - number of words having this number of tags...

>>> cfd = MostAmbiguousWords(corpus, 4)
<ConditionalFreqDist with ... conditions>

>>> TestMostAmbiguousWords(cfd, 4)
All words occur with more than 4 tags.

>>> ShowExamples('book', cfd, corpus)
'book' as NOUN: ....
'book' as VERB: ....
Attach in a file MostAmbiguous.txt the most ambiguous words you find in the corpus with examples for each of them.
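One possible sketch of MostAmbiguousWords and TestMostAmbiguousWords, assuming corpus is a list of (word, tag) pairs such as brown.tagged_words(tagset='universal'):

import nltk

def MostAmbiguousWords(corpus, N):
    """Return a ConditionalFreqDist restricted to words observed with more than N distinct tags."""
    cfd = nltk.ConditionalFreqDist((w.lower(), t) for (w, t) in corpus)
    ambiguous = nltk.ConditionalFreqDist()
    for word in cfd.conditions():
        if len(cfd[word]) > N:
            for tag, count in cfd[word].items():
                ambiguous[word][tag] += count
    return ambiguous

def TestMostAmbiguousWords(cfd, N):
    # Check that every condition (word) in the returned distribution has more than N distinct tags.
    assert all(len(cfd[word]) > N for word in cfd.conditions())
    print("All words occur with more than %d tags." % N)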

We expect this distribution to exhibit a "long tail" form. Do you confirm this hypothesis? (Carefully read the definition of "long tail" to justify your answer.)

1.2 Unigram and Affix Tagger

We described in class the behavior of the Unigram and Affix taggers. Both of these are implemented in nltk. We will develop here a simpler alternative implementation, and design an alternative way to combine the two taggers instead of the simple backoff strategy discussed in class.

For this task, split the Brown corpus in 3 parts: training set is the first 80% of the sentences; development set the next 10%; test set will be the last 10%. We will use the development set to optimize parameters in our training procedure.

1.2.1 Unigram Tagger

Write a class SimpleUnigramTagger which directly inherits from nltk.TaggerI and implements a unigram tagger in the simplest possible manner. The code should be shorter than that of nltk which is based on the general case of ngram tagging. Verify that your tagger produces the same evaluation as the one provided in nltk.
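As a starting point, a minimal sketch of such a tagger; the choice of 'NOUN' as the default tag for unknown words is ours and can be replaced by a backoff strategy:

import nltk

class SimpleUnigramTagger(nltk.TaggerI):
    """Tags each word with its most frequent tag in the training data."""
    def __init__(self, train_sents, default_tag='NOUN'):
        cfd = nltk.ConditionalFreqDist((w, t) for sent in train_sents for (w, t) in sent)
        self.model = dict((w, cfd[w].max()) for w in cfd.conditions())
        self.default_tag = default_tag

    def tag(self, tokens):
        # Unknown words fall back to the default tag.
        return [(w, self.model.get(w, self.default_tag)) for w in tokens]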

1.2.2 Using Entropy to Filter Affix Tagger

The NLTK Affix tagger is a tagger which learns a dependency between suffixes of words and parts of speech tags. One of the issues faced when learning good affixes is: which mapping from suffix to POS is "reliable" enough to be used by the Affix tagger. The nltk.AffixTagger uses a notion of "useful context" to determine that a context predicts a POS tag in a reliable manner. (See the
source code of the nltk.ContextTagger class from which AffixTagger derives.)

In this question, we want to explore the issue for the case of the SuffixTagger by exploiting the notion of entropy. Entropy measures the level of uncertainty in a distribution. During training, the AffixTagger learns a mapping from suffix to a distribution of tags. We want to filter this mapping to only keep predictions where the entropy of the tags distribution is low (below a certain cutoff). We also want to make sure that we keep suffixes for which we have seen sufficiently many observations to obtain a reliable estimation of the tags distribution.

Write a specific train method for the AffixTagger which filters the learned model according to this idea. One of the parameters of this train method must be the cutoff below which we keep a (suffix to distribution) mapping. The cutoff is a free parameter of this method.

We then need to select a good value for this cutoff. One way to do this is to optimize the cutoff parameter: we train the AffixTagger with various values of the cutoff over a range of values and compare the accuracy obtained by each model over the development set. The optimized value of the parameter is the one that gives the best accuracy. Write a method optimize_parameter() to perform this task.
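A possible sketch of this idea; the helper names, the min_count threshold and the construction of an AffixTagger from a pre-filtered model dictionary are our own design choices:

import math
import nltk

def tag_entropy(freqdist):
    """Entropy of a tag distribution, with the convention 0 * log(0) = 0."""
    total = freqdist.N()
    return -sum((c / total) * math.log(c / total, 2) for c in freqdist.values() if c > 0)

def train_filtered_affix(train_sents, affix_length=-3, entropy_cutoff=0.5, min_count=10, backoff=None):
    """Keep only suffixes whose tag distribution has low entropy and enough observations."""
    cfd = nltk.ConditionalFreqDist(
        (w[affix_length:], t)
        for sent in train_sents
        for (w, t) in sent
        if len(w) >= -affix_length)
    model = dict((suffix, cfd[suffix].max())
                 for suffix in cfd.conditions()
                 if cfd[suffix].N() >= min_count and tag_entropy(cfd[suffix]) <= entropy_cutoff)
    return nltk.AffixTagger(model=model, affix_length=affix_length, backoff=backoff)

def optimize_parameter(train_sents, dev_sents, cutoffs):
    """Return the (cutoff, accuracy) pair with the best accuracy on the development set."""
    scores = [(c, train_filtered_affix(train_sents, entropy_cutoff=c).evaluate(dev_sents)) for c in cutoffs]
    return max(scores, key=lambda pair: pair[1])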

Note: for all information-theoretic computations, we use the convention 0 × log(0) = 0. This is important for the computation of H(p) = -Σ p·log(p). Describe your observations:

  1. Does entropy filtering improve accuracy?
  2. How do you determine the range of values to test for the cutoff?
  3. Is the accuracy value evolving in a predictable manner as the cutoff varies?
  4. Describe the list of suffixes that are good tag predictors -- are you surprised by what you observe?

1.3 Fine-Grained Accuracy and Error Analysis

1.3.1 Per Tag Precision and Recall

We are interested in checking the behavior of a tagger per tag. This indicates which tags are most difficult to distinguish from other tags. Write a function which reports precision and recall of a tagger per tag. These measures are defined as follows:
  1. Precision for tag T: when the tagger predicts tag T, how often is it correct in the dataset.
  2. Recall for tag T: out of all words tagged as T in the dataset, how many are tagged correctly.
Precision and Recall per tag can be computed as a function of the true positive, true negative, false positive and false negative counts for the tags:
  1. True positive count (TP): number of words tagged as T both in the test set and by the tagger.
  2. True negative count (TN): words tagged as non-T both in the test set and by the tagger.
  3. False positive count (FP): words tagged as non-T in the test set and as T by the tagger.
  4. False negative (FN): words tagged as T in the test set and as non-T by the tagger.
Since there is a natural trade-off between Precision and Recall, we often measure a score that combines the two parameters, called the F-measure. The formulas are:
Precision(T) = TP / (TP + FP)

Recall(T) = TP / (TP + FN)

F-Measure(T) = 2 × Precision × Recall / (Precision + Recall) = 2TP / (2TP + FP + FN)
All three measures are numbers between 0 and 1.

Add the function MicroEvaluate(corpus_test) to the TaggerI interface that computes for the tagger TP, TN, FP, FN, Precision, Recall and F-measure.
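For illustration, a possible sketch of the per-tag counts behind these measures (the helper names are ours; MicroEvaluate can be built on top of such counts):

from collections import defaultdict

def per_tag_counts(tagger, gold_sents):
    """Return dictionaries of true-positive, false-positive and false-negative counts per tag."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for sent in gold_sents:
        words = [w for (w, t) in sent]
        for (w, gold), (_, pred) in zip(sent, tagger.tag(words)):
            if pred == gold:
                tp[gold] += 1
            else:
                fp[pred] += 1
                fn[gold] += 1
    return tp, fp, fn

def precision_recall_f(tp, fp, fn, tag):
    # Guard against division by zero for tags the tagger never predicts or that never occur.
    p = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
    r = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f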

Which tags are most difficult in the universal tagset?

1.3.2 Confusion Matrix

A valuable method for performing error analysis consists of computing the confusion matrix of a tagger. Consider an error committed by a tagger: a word is predicted as tag T1 where it should be tagged as T2. In other words, tag T2 is confused with T1. Note that confusion is not symmetric.

A confusion matrix tabulates all the mistakes committed by a tagger in the form of a matrix C[ti, tj]. C[ti, tj] counts the number of times the tagger predicted ti instead of tj. (Which NLTK data structure is appropriate for such a value?)

Write a method ConfusionMatrix(corpus_test) that returns such a matrix for a tagger.
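One possible sketch, written here as a standalone function over a tagger and gold sentences, using a ConditionalFreqDist (one reasonable choice of data structure):

import nltk

def confusion_matrix(tagger, gold_sents):
    """Return a ConditionalFreqDist cm where cm[predicted][gold] counts the tagger's confusions."""
    cm = nltk.ConditionalFreqDist()
    for sent in gold_sents:
        words = [w for (w, t) in sent]
        for (w, gold), (_, pred) in zip(sent, tagger.tag(words)):
            if pred != gold:
                cm[pred][gold] += 1
    return cm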

Validate the ConfusionMatrix() method over the DefaultTagger discussed in class.

Report the confusion matrix for the universal tagset of the Brown corpus for the best tagger discussed in class. Discuss the results: which pairs of tags are the most difficult to distinguish?

1.4 Averaged Perceptron

We now experiment with a state of the art POS tagger described by Matthew Honnibal in this article: A good POS tagger in 200 lines of Python. This tagger is trained on the same Brown corpus, but uses as a learning algorithm the averaged perceptron with good features.

Honnibal's code is included in the very convenient TextBlob library. To install TextBlob, run these commands in your shell:

% pip install textblob

% python -m textblob.download_corpora
To learn how to use the tagger, experiment with TextBlob's quickstart guide. Your task is:
  1. Run the PerceptronTagger with the pre-trained model installed with TextBlob (averaged_perceptron_tagger.pickle) on the same test dataset used in the previous question. Verify which exact tagset is used in this model (is it NLTK's Universal Tagset or a different tagset?).
  2. Report on accuracy, and per tag Precision, Recall, F.
  3. Train the Averaged Perceptron on the Brown corpus with the Universal Tagset and test its performance. Compare the errors made by the best NLTK tagger from the previous question but not by the averaged perceptron tagger, and vice versa.
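For example, tagging a sentence with the pre-trained model via TextBlob looks like this (the example sentence is ours); inspecting the output tags lets you verify which tagset the model uses:

from textblob import TextBlob

blob = TextBlob("The quick brown fox jumps over the lazy dog.")
print(blob.tags)    # list of (word, tag) pairs produced by the pre-trained tagger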


Part 2: Polynomial Curve Fitting

In this question, we will reproduce the polynomial curve fitting example used in Chapter 1 of Bishop's book. Polynomial curve fitting is an example of a learning task called regression (as opposed to classification, which we studied in class). In regression, the objective is to learn a function mapping an input variable X to a continuous target variable Y, while in classification, Y is a discrete variable. For educational purposes, regression turns out to be simpler to grasp than classification. This task shows how to develop a probabilistic model of a task often described in geometric terms. The outline of the task is:
  1. Generate a synthetic dataset of N points (x, t) for a known function y(x) with some level of noise.
  2. Solve the curve fitting regression problem using error function optimization.
  3. Observe the problems of over-fitting this method produces.
  4. Introduce regularization to overcome over-fitting as a form of MAP estimation.
  5. Finally, use Bayesian estimation to produce an interval estimation of the function y.

2.1 Synthetic Dataset Generation

Learn how to use the numpy.random package to sample random numbers from well-known distributions in this reference page. In particular, we will use in this question the Normal distribution: numpy.random.normal.

Generate a dataset of points in the form of 2 vectors x and t of size N where:

t_i = y(x_i) + Normal(mu, sigma)

where the x_i values are equidistant on the [0, 1] segment (that is, x_1 = 0, x_2 = 1/(N-1), x_3 = 2/(N-1), ..., x_N = 1.0)
mu = 0.0
sigma = 0.03
y(x) = sin(2πx)
Our objective will be to "learn" the function y from the noisy sparse dataset we generate.

The function generateDataset(N, f, sigma) should return a tuple with the 2 vectors x and t.

Draw the plot (scatterplot) of (x,t) using matplotlib for N=100. Look at the documentation of the numpy.random.normal function in Numpy for an example of usage. Look at the definition of the function numpy.linspace to generate your dataset.

Note: a useful property of Numpy arrays is that you can apply a function to a Numpy array as follows:

import math
import numpy as np
def s(x): return x**2
def f(x): return math.sin(2 * math.pi * x)
vf = np.vectorize(f)        # Create a vectorized version of f

z = np.array([1,2,3,4])

sz = s(z)                   # You can apply simple functions to an array
sz.shape                    # Same dimension as z (4)

fz = vf(z)                  # For more complex ones, you must use the vectorized version of f
fz.shape
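Putting these pieces together, one possible sketch of generateDataset, using numpy.linspace for the x values and numpy.random.normal for the noise:

import numpy as np

def generateDataset(N, f, sigma):
    """Return (x, t): N equidistant points on [0, 1] and noisy targets t_i = f(x_i) + N(0, sigma)."""
    x = np.linspace(0.0, 1.0, N)
    t = np.vectorize(f)(x) + np.random.normal(0.0, sigma, N)
    return x, t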

2.2 Polynomial Curve Fitting

We will attempt to learn the function y given a synthetic dataset (x, t). We assume that y is a polynomial of degree M - that is:
y(x) = w_0 + w_1·x + w_2·x^2 + ... + w_M·x^M
Our objective is to estimate the vector w = (w_0, ..., w_M) from the dataset (x, t).

We first attempt to solve this regression task by optimizing the square error function (this method is called least squares):

Define: E(w) = 1/2 Σ_i (y(x_i) - t_i)^2
             = 1/2 Σ_i (Σ_k w_k·x_i^k - t_i)^2
If t = (t_1, ..., t_N), then define the design matrix to be the matrix Φ such that Φ_{nm} = x_n^m = φ_m(x_n). We want to minimize the error function and therefore look for a solution to the linear system of equations:
dE/dw_k = 0 for k = 0..M
When we work out the partial derivatives, we find that the solution to the following system gives us the optimal value w_LS given (x, t):
w_LS = (Φ^T Φ)^{-1} Φ^T t

(Note: Φ is a matrix of dimension N×(M+1), w is a vector of dimension (M+1), and t is a vector of dimension N.)
Here is how you write this type of matrix operations in Python using the Numpy library:
import numpy as np
import scipy.linalg

t = np.array([1,2,3,4])                    # This is a vector of dim 4
t.shape                                    # (4,)
phi = np.array([[1,1],[2,4],[3,3],[2,4]])  # This is a 4x2 matrix
phi.shape                                  # (4, 2) 
prod = np.dot(phi.T, phi)                  # prod is a 2x2 matrix
prod.shape                                 # (2, 2)
i = np.linalg.inv(prod)                    # i is a 2x2 matrix
i.shape                                    # (2, 2)
m = np.dot(i, phi.T)                       # m is a 2x4 matrix
m.shape                                    # (2, 4)
w = np.dot(m, t)                           # w is a vector of dim 2
w.shape                                    # (2,)
Implement a method OptimizeLS(x, t, M) which, given the dataset (x, t), returns the optimal polynomial of degree M that approximates the dataset according to the least-squares objective. Plot the learned polynomial y(x; w*_M) and the real function sin(2πx) for a dataset of size N=10 and M = 1, 3, 5, 10.
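A possible sketch of OptimizeLS mirroring the matrix operations above; numpy.vander is used here to build the design matrix Φ:

import numpy as np

def OptimizeLS(x, t, M):
    """Return the vector w minimizing the least-squares error for a polynomial of degree M."""
    phi = np.vander(x, M + 1, increasing=True)   # phi[n, m] = x_n ** m
    return np.dot(np.linalg.inv(np.dot(phi.T, phi)), np.dot(phi.T, t))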

2.3 Polynomial Curve Fitting with Regularization

We observe that the solution to the least-squares optimization has a tendency to over-fit the dataset. To avoid over-fitting, we will use a method called regularization: the objective function we want to optimize will take into account the least-squares error as above, and in addition the complexity of the learned model w.

We define a new objective function:

Define E_PLS(w) = E(w) + λ·E_W(w)

where E_PLS is called the penalized least-squares function of w
and   E_W is the penalty function.

We will use a standard penalty function:
      E_W(w) = 1/2 w^T·w = 1/2 Σ_{m=0..M} w_m^2
When we work out the partial derivatives of the minimization problem, we find, in closed form, that the solution to the penalized least-squares problem is:
w_PLS = (Φ^T Φ + λI)^{-1} Φ^T t
λ is called a hyper-parameter (that is, a parameter which influences the value of the model's parameters w). Its role is to balance the influence of how well the function fits the dataset (as in the least-squares model) and how smooth it is.

Write a function optimizePLS(x, t, M, lambda) which returns the optimal parameters wPLS given M and lambda.
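A possible sketch; since lambda is a reserved word in Python, the parameter is named lam below:

import numpy as np

def optimizePLS(x, t, M, lam):
    """Return w_PLS = (Phi^T Phi + lam * I)^-1 Phi^T t for a polynomial of degree M."""
    phi = np.vander(x, M + 1, increasing=True)
    A = np.dot(phi.T, phi) + lam * np.eye(M + 1)
    return np.dot(np.linalg.inv(A), np.dot(phi.T, t))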

We want to optimize the value of λ. The way to optimize is to use a development set in addition to our training set.

To construct a development set, we will extend our synthetic dataset construction function to return 3 samples: one for training, one for development and one for testing. Write a function generateDataset3(N, f, sigma) which returns 3 pairs of vectors of size N each, (xtest, ttest), (xvalidate, tvalidate) and (xtrain, ttrain). The target values are generated as above with Gaussian noise N(0, sigma).

Look at the documentation of the function numpy.random.shuffle() as a way to generate 3 subsets of size N from the list of points generated by linspace.
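One possible sketch of generateDataset3; drawing 3N points and splitting them after shuffling is our design choice, and other splits are equally valid:

import numpy as np

def generateDataset3(N, f, sigma):
    """Return ((xtest, ttest), (xvalidate, tvalidate), (xtrain, ttrain)), each of size N."""
    x = np.linspace(0.0, 1.0, 3 * N)
    np.random.shuffle(x)                       # shuffle in place before splitting
    t = np.vectorize(f)(x) + np.random.normal(0.0, sigma, 3 * N)
    return (x[:N], t[:N]), (x[N:2 * N], t[N:2 * N]), (x[2 * N:], t[2 * N:])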

Given the synthetic dataset, optimize the value of λ by varying log(λ) from -40 to -20 on the development set. Draw the plot of the normalized error of the model on the training, development and test sets for the case of N = 10 and the case of N = 100. The normalized error of the model is defined as:

NE_w(x, t) = 1/N · [Σ_{i=1..N} (t_i - Σ_{m=0..M} w_m·x_i^m)^2]^{1/2}
Write the function optimizePLS(xt, tt, xv, tv, M) which selects the best value of λ given a training dataset (xt, tt) and a development set (xv, tv). Describe your conclusions from this plot.

2.4 Probabilistic Regression Framework

We now consider the same problem of regression (learning a function from a dataset) formulated in a probabilistic framework. Consider:
t_n = y(x_n; w) + ε_n
We now model the distribution of ε_n as a probabilistic model:
ε_n ~ N(0, σ^2)

since t_n = y(x_n; w) + ε_n:

p(t_n | x_n, w, σ^2) = N(y(x_n; w), σ^2)
We now assume that the observed data points in the dataset are all drawn independently (i.i.d.); we can then express the likelihood of the whole dataset:
p(t | x, w, σ^2) = Π_{n=1..N} p(t_n | x_n, w, σ^2)
                 = Π_{n=1..N} (2πσ^2)^{-1/2} exp[-(t_n - y(x_n; w))^2 / (2σ^2)]
We consider this likelihood as a function of the parameters (w and σ) given a dataset (t, x).

If we consider the log-likelihood (which is easier to optimize because we have to derive a sum instead of a product), we get:

-log p(t | w, σ^2) = (N/2)·log(2πσ^2) + 1/(2σ^2)·Σ_{n=1..N} (t_n - y(x_n; w))^2
We see that optimizing the log-likelihood of the dataset is equivalent to minimizing the error function of the least-squares method. That is to say, the least-squares method can be understood as the maximum likelihood estimator (MLE) of the probabilistic model we just developed, which produces the values w_ML = w_LS.

We can also optimize this model with respect to the second parameter σ^2 which, when we work out the derivation and the solution of the equation, yields:

σ^2_ML = 1/N Σ_{n=1..N} (y(x_n; w_ML) - t_n)^2
Given w_ML and σ^2_ML, we can now compute the plug-in posterior predictive distribution, which gives us the probability distribution of the values of t given an input variable x:
p(t | x, w_ML, σ^2_ML) = N(t | y(x; w_ML), σ^2_ML)
This is a richer model than the least-squares model studied above, because it not only estimates the most-likely value t given x, but also the precision of this prediction given the dataset. This precision can be used to construct a confidence interval around t.

We further extend the probabilistic model by considering a Bayesian approach to the estimation of this probabilistic model instead of the maximum likelihood estimator (which is known to over-fit the dataset). We choose a prior over the possible values of w which we will, for convenience reasons, select to be of a normal form (this is a conjugate prior as explained in our review of basic probabilities):

p(w | α) = Π_{m=0..M} (α/2π)^{1/2} exp{-(α/2)·w_m^2}
         = N(w | 0, (1/α)·I)
This prior distribution expresses our degree of belief over the values that w can take. In this distribution, α plays the role of a hyper-parameter (similar to λ in the regularization model above).

The Bayesian approach consists of applying Bayes rule to the estimation task of the posterior distribution given the dataset:


p(w | t, α, σ^2) = likelihood × prior / normalizing factor
                 = p(t | w, σ^2)·p(w | α) / p(t | α, σ^2)

Since we wisely chose a conjugate prior for our distribution over w, we can compute the posterior analytically:
p(w | x, t, α, σ^2) = N(μ, Σ)

where Φ is the design matrix as above:

μ = (Φ^T Φ + σ^2·α·I)^{-1} Φ^T t

Σ = σ^2·(Φ^T Φ + σ^2·α·I)^{-1}
Given this approach, instead of learning a single point estimate of w as in the least-squares and penalized least-squares methods above, we have inferred a distribution over all possible values of w given the dataset. In other words, we have updated our belief about w from the prior (which does not include any information about the dataset) using new information derived from the observed dataset.

We can determine w by maximizing the posterior distribution over w given the dataset and the prior belief. This approach is called maximum a posteriori estimation (usually written MAP). If we solve the MAP problem given our selection of the normal conjugate prior, we find that the posterior reaches its maximum at the minimum of the following function of w:

1/(2σ^2)·Σ_{n=1..N} (y(x_n; w) - t_n)^2 + (α/2)·w^T·w
We thus find that w_MAP is in fact the same as the solution of the penalized least-squares method for λ = α·σ^2. In other words, this probabilistic model explains that the PLS solution is in fact the optimal solution to the problem when our prior belief on the parameters w is a Normal distribution N(0, (1/α)·I) which encourages small values for the parameters w.

A fully Bayesian approach, however, does not look for point estimators of parameters like w. Instead, we are interested in the predictive distribution p(t | x, x, t), where the second x and t denote the whole observed dataset. The Bayesian approach consists of marginalizing the predictive distribution over all possible values of the parameters:

p(t | x, x, t) = ∫ p(t | x, w)·p(w | x, t) dw
(For simplicity, we have hidden the dependency on the hyper-parameters α and σ in this formula.) In this simple case, with a simple normal distribution and a normal prior over w, we can solve this integral analytically, and we obtain:
p(t | x, x, t) = N(t | m(x), s^2(x))

where the mean and variance are:

m(x) = (1/σ^2)·Φ(x)^T S Σ_{n=1..N} Φ(x_n)·t_n

s^2(x) = σ^2 + Φ(x)^T S Φ(x)

S^{-1} = α·I + (1/σ^2)·Σ_{n=1..N} Φ(x_n)·Φ(x_n)^T

Φ(x) = (Φ_0(x), ..., Φ_M(x))^T = (1, x, x^2, ..., x^M)^T
Note that the mean and the variance of this predictive distribution depend on x.

Your task: write a function bayesianEstimator(x, t, M, alpha, sigma2) which, given the dataset (x, t) of size N and the parameters M, alpha and sigma2 (the variance), returns a tuple of 2 functions (m(x), var(x)) which are the mean and variance of the predictive distribution inferred from the dataset, based on the parameters and the normal prior over w. As you can see, in the Bayesian approach, we do not learn an optimal value for the parameter w; instead, we marginalize out this parameter and directly learn the predictive distribution.

Note that in Python, a function can return a function (like in Scheme) using the following syntax:

def adder(x):
    return lambda y: x + y

a2 = adder(2)

print(a2(3))        # prints 5
print(adder(4)(3))  # prints 7
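A possible sketch of bayesianEstimator implementing the formulas above; it assumes sigma2 is the noise variance σ^2 and that the returned functions accept scalar x values:

import numpy as np

def bayesianEstimator(x, t, M, alpha, sigma2):
    """Return (m, var): mean and variance functions of the Bayesian predictive distribution."""
    phi = np.vander(x, M + 1, increasing=True)               # phi[n, m] = x_n ** m
    S_inv = alpha * np.eye(M + 1) + np.dot(phi.T, phi) / sigma2
    S = np.linalg.inv(S_inv)

    def phi_of(xq):
        # Feature vector (1, x, x^2, ..., x^M) for a single query point.
        return np.array([xq ** m for m in range(M + 1)])

    def m(xq):
        # m(x) = (1/sigma2) * Phi(x)^T S (sum_n Phi(x_n) t_n) = (1/sigma2) * Phi(x)^T S Phi^T t
        return np.dot(phi_of(xq), np.dot(S, np.dot(phi.T, t))) / sigma2

    def var(xq):
        # s^2(x) = sigma2 + Phi(x)^T S Phi(x)
        return sigma2 + np.dot(phi_of(xq), np.dot(S, phi_of(xq)))

    return m, var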
Draw the plot of the original function y = sin(2πx) over the range [0, 1], the mean of the predictive distribution m(x), and the confidence interval m(x) - var(x)^{1/2} and m(x) + var(x)^{1/2} (that is, one standard deviation around each predicted point) for the values:
alpha = 0.005
sigma2 = 1/11.1
M = 9
over a synthetic dataset of size N=10 and N=100. The plot should look similar to the Figure below (from Bishop p.32).

Interpret the height of the band around the most likely function in terms of the distribution of the x values in your synthetic dataset. Can you think of ways to make this height very small in one segment of the function and large in another?


Last modified 26 Nov 2015