Assignment 1

Due: Sun 18 Nov 2012 Midnight

Natural Language Processing - Fall 2013 Michael Elhadad

This assignment covers the topic of statistical models for parts-of-speech tagging. The objectives are:

  1. Learn how to use Python and NLTK
  2. Learn how to access an annotated text corpus
  3. Learn how to annotate and extend a text corpus
  4. Learn how to access text resources on the Internet and clean them before they can be annotated
  5. Learn how to explore the statistical properties of a corpus
  6. Learn to train and evaluate simple tagging models

Submit your solution in the form of an HTML file, using the same CSS as this page, with code inside <pre> tags. Images should be submitted as PNG or JPG files. The code should also be submitted as a separate folder containing all the code needed to run the questions, organized in clearly documented functions.

Parts of Speech Tagging

  1. Data Exploration
    1. Gathering and cleaning up data
    2. Gathering basic statistics
  2. Unigram and Affix Tagger
  3. Fine-grained Error Analysis
    1. Known vs. Unknown Accuracy
    2. Per Tag Precision and Recall
    3. Confusion Matrix
    4. Sensitivity to the Size and Structure of the Training Set: Cross-Validation
    5. Stratified Samples

Data Exploration

Gathering and Cleaning Up Data

When we discussed the task of POS tagging in class, we assumed the text comes in a "clean" form: segmented in sentences and in words. We ran experiments on a clean corpus (correctly segmented) and obtained results of about 90% accuracy. In this question, we want to get a feeling of how difficult it is to clean real-world data. Please read the tutorial in Chapter 3 of the NLTK book. This chapter explains how to access "raw text" and clean it up: remove HTML tags, segment in sentences and in words.

Look at the data of the Brown corpus as it is stored in the nltk_data folder (by default, it is in a folder named like C:\nltk_data\corpora\brown under Windows). The format of the corpus is quite simple. We will attempt to add a new "section" to this corpus.

Look at the following Python library for performing Google Search. This library allows you to send queries to Google and download the results in a very simple manner in Python. To install this library, copy the file google.py under your Python installation in /python/lib/google. This library uses a popular library in the Python world called BeautifulSoup which can easily parse text in XML or "almost XML" format (such as noisy HTML pages).

The easiest way to install libraries in Python is to use the easy_install tool. For example, after you have installed easy_install, to install BeautifulSoup, do the following:

Open a command shell
Type:
% easy_install beautifulsoup4

Choose a query to execute using the google package and gather about 10 hits returned for this query. Download the pages that your query found. Use code similar to this script to clean up the HTML of these pages:

import nltk
from urllib import urlopen   # Python 2; in Python 3 use urllib.request.urlopen

url = "http://www.bbc.com/news/technology-17771962"
html = urlopen(url).read()        # download the raw HTML page
raw = nltk.clean_html(html)       # strip the HTML markup (older NLTK releases)
tokens = nltk.word_tokenize(raw)  # segment the clean text into words
You will realize that cleaning HTML is not an easy task. The jusText Python package does a very good job of cleaning up HTML by removing the boilerplate HTML code around the "interesting text". This blog article explains how this task is performed (it is achieved using machine learning techniques).
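A minimal cleaning sketch using jusText, assuming the justext package is installed (for example with easy_install justext); the attribute names below are those of recent jusText releases, so adapt them if your version differs:

import justext

def clean_page(html):
    # classify the paragraphs of the page and keep only the ones that
    # jusText does not consider boilerplate (menus, ads, footers...)
    paragraphs = justext.justext(html, justext.get_stoplist("English"))
    good = [p.text for p in paragraphs if not p.is_boilerplate]
    return "\n".join(good)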

Save the resulting hits into clean text files. Then run the best POS tagger you have available from class on the resulting text files, using the simplified Brown POS tagset (19 tags). Save the resulting tagged text into files in the same format expected by the Brown corpus. You should gather about 50 sentences. Look at the Python code under \Python27\Lib\site-packages\nltk\corpus\reader\tagged.py for an explanation of how the nltk Brown corpus reader works.
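A possible sketch for writing the tagged sentences in the word/tag, one-sentence-per-line layout used by the Brown files (check tagged.py for the exact separator and paragraph conventions; the function name is ours):

def save_brown_format(tagged_sents, filename):
    # one sentence per line, each token written as word/tag
    with open(filename, "w") as f:
        for sent in tagged_sents:
            line = " ".join("%s/%s" % (word, tag) for (word, tag) in sent)
            f.write(line + "\n")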

Finally, manually review the tagged text and fix the errors you find. Put the manually tagged file into the nltk_data Brown corpus folder, into one of the existing categories (or, if you are more ambitious, into a new category in addition to 'news', 'editorial'...). Make sure the nltk corpus reader can read the new text you have just added to the Brown corpus.

Review the tagging of the new text twice, independently (2 analyses), and compare your tagging results. Report the list of words on which your 2 manual tagging decisions differ (write a function to compare two taggings of the same text saved in 2 different tagged files). Show the differences between each of your tags and the tags produced by the automatic tagger. Report how long it took you to check the tagging of 50 sentences.
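A minimal sketch of the comparison function, assuming both versions are stored in directories readable by NLTK's TaggedCorpusReader and contain the same tokens in the same order:

from nltk.corpus.reader import TaggedCorpusReader

def compare_taggings(dir1, dir2, fileid):
    # read the two versions of the same tagged file and list the tokens
    # on which the two annotators disagree
    r1 = TaggedCorpusReader(dir1, fileid)
    r2 = TaggedCorpusReader(dir2, fileid)
    diffs = []
    for (w1, t1), (w2, t2) in zip(r1.tagged_words(), r2.tagged_words()):
        if t1 != t2:
            diffs.append((w1, t1, t2))
    return diffs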

You must submit in your answer:

  1. The code used to run the queries and clean the resulting text.
  2. The clean text files submitted as input to your tagger. (Folder "clean")
  3. The tagged files returned by your tagger. (Folder "tagged")
  4. The tagged files manually verified and corrected by each human tagger. (Folder "checked1" and "checked2")
  5. The code used to compare the tags of "checked1" and "checked2" and list the differences.
  6. The tagged files where the best tag has been chosen for each disagreement between checked1 and checked2 (Folder "final").
Report qualitatively on the errors you observe during this pipeline:
  1. Errors met while dealing with the Google engine
  2. Errors met while downloading the material from the Google hits
  3. Errors met while cleaning up the HTML pages
  4. Errors met while segmenting the text into sentences and words
  5. Errors met by the automatic tagger: how many errors were reported by checked1 and checked2 each, and altogether.
  6. Disagreements between checked1 and checked2.
  7. Actual accuracy obtained by your automatic tagger compared with the final verified version of the text collection.

Gathering Basic Statistics

When we use a tagger that relies on lexical information (for each word form, what is the distribution of tags that can be assigned to the word), a measure of the complexity of the POS task is related to the level of ambiguity of the word forms. In this question, we want to explore the level of ambiguity present in our dataset. For the rest of this question, use the full Brown corpus distributed as part of NLTK.

Write a function that plots the number of words having a given number of tags. The X-axis should show the number of tags and the Y-axis the number of words having exactly this number of tags. Use the following example from the NLTK book as an inspiration:

import nltk
from nltk.corpus import brown

def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

def display():
    import pylab
    words_by_freq = list(nltk.FreqDist(brown.words(categories='news')))
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()
Write a Python function that finds words with more than N observed tags. The function should return a ConditionalFreqDist object where the conditions are the words and the frequency distribution indicates the tag frequencies for each word.

Write a test function that verifies that the words indeed have more than N distinct tags in the returned value.

Write a function that given a word, finds one example of usage of the word with each of the different tags in which it can occur.

# corpus can be the tagged_sentences or tagged_words according to what is most convenient
>>> PlotNumberOfTags(corpus)

...show a plot with axis: X - number of tags (1, 2...) and Y - number of words having this number of tags...

>>> cfd = MostAmbiguousWords(corpus, 4)
<ConditionalFreqDist ...>

>>> TestMostAmbiguousWords(cfd, 4)
All words occur with more than 4 tags.

>>> ShowExamples('book', cfd, corpus)
'book' as NN: ....
'book' as VB: ....
Attach a file MostAmbiguous.txt listing the most ambiguous words you find in the corpus, with examples for each of them.

We expect this distribution to exhibit a "long tail" form. Do you confirm this hypothesis?
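A possible sketch of these helpers, assuming the corpus is passed as brown.tagged_words() where a ConditionalFreqDist is built and as brown.tagged_sents() for ShowExamples; the function names follow the examples above:

import nltk
from nltk.corpus import brown

def PlotNumberOfTags(tagged_words):
    import pylab
    cfd = nltk.ConditionalFreqDist(tagged_words)
    # how many words have exactly k distinct tags, for each observed k
    counts = nltk.FreqDist(len(cfd[w]) for w in cfd.conditions())
    ks = sorted(counts)
    pylab.plot(ks, [counts[k] for k in ks], '-bo')
    pylab.xlabel('Number of tags')
    pylab.ylabel('Number of words')
    pylab.show()

def MostAmbiguousWords(tagged_words, n):
    # keep only the words observed with more than n distinct tags
    cfd = nltk.ConditionalFreqDist(tagged_words)
    ambiguous = set(w for w in cfd.conditions() if len(cfd[w]) > n)
    return nltk.ConditionalFreqDist(
        (w, t) for (w, t) in tagged_words if w in ambiguous)

def TestMostAmbiguousWords(cfd, n):
    assert all(len(cfd[w]) > n for w in cfd.conditions())
    print("All words occur with more than %d tags." % n)

def ShowExamples(word, cfd, tagged_sents):
    # print one example sentence per tag observed for the word
    for tag in cfd[word]:
        for sent in tagged_sents:
            if (word, tag) in sent:
                print("'%s' as %s: %s" %
                      (word, tag, " ".join(w for (w, _) in sent)))
                break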

Unigram and Affix Tagger

We described in class the behavior of the Unigram and Affix Taggers. Both of these are implemented in nltk. We will develop here a simpler alternative implementation.

Unigram Tagger

Write a class SimpleUnigramTagger which directly inherits from nltk.TaggerI and implements a unigram tagger in the simplest possible manner. The code should be shorter than that of nltk which is based on the general case of ngram tagging. Verify that your tagger produces the same evaluation as the one provided in nltk.
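A minimal sketch of such a tagger, assuming only the nltk.TaggerI interface (a tagger only has to implement tag(); evaluate() is inherited):

import nltk

class SimpleUnigramTagger(nltk.TaggerI):
    def __init__(self, train_sents, backoff=None):
        self.backoff = backoff
        cfd = nltk.ConditionalFreqDist(
            (word, tag) for sent in train_sents for (word, tag) in sent)
        # keep the most frequent tag observed for each training word
        self.model = dict((word, cfd[word].max()) for word in cfd.conditions())

    def tag(self, tokens):
        # let the backoff tagger (if any) tag the whole sentence first,
        # then override its choice wherever the word is known
        if self.backoff is not None:
            backed_off = self.backoff.tag(tokens)
        else:
            backed_off = [(token, None) for token in tokens]
        return [(token, self.model.get(token, fallback))
                for (token, fallback) in backed_off]

Training this class and an nltk.UnigramTagger on the same sentences with the same backoff, and comparing their evaluate() scores, covers the verification step requested above.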

Using Entropy to Filter Affix Tagger and Unigram Tagger

We discussed in class the Affix tagger as a tagger which learns a dependency between suffixes of words and parts-of-speech tags. One of the issues we raised was: which mapping from suffix to POS is "reliable" enough to be used by the Affix tagger? The nltk.AffixTagger uses a notion of "useful context" to determine whether a context predicts a POS tag in a reliable manner. (See the source code of the nltk.ContextTagger class from which AffixTagger derives.) In this question, we want to explore this issue for the AffixTagger by exploiting the notion of entropy. Entropy measures the level of uncertainty in a distribution. During training, the AffixTagger learns a mapping from suffix to a distribution of tags. We want to filter this mapping to only keep predictions where the entropy of the tag distribution is low (below a certain cutoff). We also want to make sure that we keep only suffixes for which we have seen sufficiently many observations to obtain a reliable estimate of the tag distribution.

Write a specific train method for the AffixTagger which filters the learned model according to this idea. One of the parameters of this train method must be the cutoff below which we keep a (suffix to distribution) mapping.
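One possible way to realize this filtering, sketched here by building the suffix model externally and handing it to nltk.AffixTagger through its model parameter instead of overriding the internal training method; entropy, cutoff, min_count and train_entropy_affix are our own names, not NLTK API:

import math
import nltk

def entropy(freqdist):
    """Entropy of a tag distribution, with 0 x log(0) taken as 0."""
    total = float(freqdist.N())
    h = 0.0
    for tag in freqdist:
        p = freqdist[tag] / total
        if p > 0:
            h -= p * math.log(p, 2)
    return h

def train_entropy_affix(train_sents, cutoff=0.5, min_count=10,
                        affix_length=-3, backoff=None):
    # suffix -> distribution of tags observed in the training data
    cfd = nltk.ConditionalFreqDist(
        (word[affix_length:], tag)
        for sent in train_sents
        for (word, tag) in sent
        if len(word) > abs(affix_length))
    # keep only suffixes seen often enough and with a low-entropy
    # tag distribution; map each of them to its most frequent tag
    model = dict((suffix, cfd[suffix].max())
                 for suffix in cfd.conditions()
                 if cfd[suffix].N() >= min_count
                 and entropy(cfd[suffix]) <= cutoff)
    return nltk.AffixTagger(model=model, affix_length=affix_length,
                            backoff=backoff)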

We then need to select a good value for this cutoff. One way to do this is to optimize the cutoff parameter: we train the AffixTagger with cutoff values taken over a range, and compare the accuracy obtained by each model over the development set. The optimized value of the parameter is the one that gives the best accuracy. Write a method optimize_parameter() to perform this task.

For this task, split the Brown corpus in 3 parts: the training set is the first 80% of the sentences, the development set the next 10%, and the test set the last 10%.
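A small sketch of the corpus split and the cutoff optimization loop, assuming the train_entropy_affix and entropy helpers sketched above are available; the cutoff grid is purely illustrative:

import nltk
from nltk.corpus import brown

def split_corpus():
    # 80% training / 10% development / 10% test, by sentence order
    sents = brown.tagged_sents()
    n = len(sents)
    return (sents[:int(0.8 * n)],
            sents[int(0.8 * n):int(0.9 * n)],
            sents[int(0.9 * n):])

def optimize_parameter(train, dev, cutoffs=(0.1, 0.25, 0.5, 1.0, 2.0)):
    # train one entropy-filtered affix tagger per candidate cutoff and
    # keep the cutoff giving the best accuracy on the development set
    scores = [(train_entropy_affix(train, cutoff=c,
                                   backoff=nltk.DefaultTagger('NN')).evaluate(dev), c)
              for c in cutoffs]
    return max(scores)  # (best accuracy, best cutoff)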

Note: for all information-theoretic computations, we take 0 x log(0) = 0. This is important for the computation of H(p) = -Σ p x log(p). Describe your observations:

  1. does entropy filtering improve accuracy?
  2. how do you determine the range of values to test for the cutoff?
  3. is the accuracy value evolving in a predictable manner as the cutoff varies?
  4. describe the list of suffixes that are good tag predictors -- are you surprised by what you observe?
Repeat the same process on your SimpleUnigramTagger (which learns a tag distribution per word). We want to filter the tag dictionary by using 2 criteria:
  1. How many word instances have been seen at training time (a word seen only rarely provides low confidence)
  2. The level of entropy of the tag distribution for the given word
Note that the 2 criteria are not independent (2 instances of a word with the same tag may be better than 4 with 3 different tags). Design a method to optimize these 2 criteria given a single backoff tagger for the unigram tagger (one using a default tag).
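In the same spirit, a hedged sketch of the two-criteria filter and a joint grid search for the unigram model, reusing the entropy helper sketched above; the parameter grids and the 'NN' default tag are illustrative only:

import nltk

def train_filtered_unigram(train_sents, min_count, cutoff, backoff):
    cfd = nltk.ConditionalFreqDist(
        (word, tag) for sent in train_sents for (word, tag) in sent)
    # keep only words seen at least min_count times and whose tag
    # distribution has entropy below the cutoff
    model = dict((w, cfd[w].max()) for w in cfd.conditions()
                 if cfd[w].N() >= min_count and entropy(cfd[w]) <= cutoff)
    return nltk.UnigramTagger(model=model, backoff=backoff)

def optimize_unigram(train, dev, counts=(1, 2, 5, 10), cutoffs=(0.25, 0.5, 1.0)):
    backoff = nltk.DefaultTagger('NN')
    # joint search over the two criteria, scored on the development set
    best = max((train_filtered_unigram(train, m, c, backoff).evaluate(dev), m, c)
               for m in counts for c in cutoffs)
    return best  # (accuracy, min_count, cutoff)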

Fine-Grained Accuracy and Error Analysis

Known vs. Unknown Accuracy

In the review of the taggers done in class, we reported the accuracy of each tagger using the TaggerI.evaluate() method. This method computes the proportion of words correctly tagged in a test dataset.

We will now investigate more fine-grained accuracy metrics and error analysis tools.

One of the most challenging tasks for taggers that learn from a training set is to decide how to tag unknown words. Implement a function evaluate2(training_corpus) in the TaggerI interface that reports the accuracy of a trained tagger separately for known words and for unknown words. (Propose a design to identify known words for a trained tagger. Specify in detail what it means for a chain of backoff taggers to "know" a word from their training.)
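As a starting point, a minimal sketch of the known/unknown breakdown, written as a standalone function rather than a TaggerI method; here "known" means the word appeared anywhere in the training sentences given to the backoff chain, which is only one of the possible designs you should discuss:

def evaluate2(tagger, train_sents, test_sents):
    known = set(word for sent in train_sents for (word, tag) in sent)
    known_ok = known_total = unk_ok = unk_total = 0
    for gold_sent in test_sents:
        tagged = tagger.tag([word for (word, tag) in gold_sent])
        for (word, gold_tag), (_, predicted) in zip(gold_sent, tagged):
            if word in known:
                known_total += 1
                known_ok += (predicted == gold_tag)
            else:
                unk_total += 1
                unk_ok += (predicted == gold_tag)
    # accuracy on known words, accuracy on unknown words
    return (known_ok / float(known_total or 1),
            unk_ok / float(unk_total or 1))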

Test the evaluate2() method on each of the taggers discussed in class: unigram, affix, regexp, bigram.

Per Tag Precision and Recall

We are interested in checking the behavior of a tagger per tag. This indicates which tags are most difficult to distinguish from other tags. Write a function which reports precision and recall of a tagger per tag. These measures are defined as follows:
  1. Precision for tag T: when the tagger predicts tag T, how often is it correct in the dataset.
  2. Recall for tag T: out of all words tagged as T in the gold dataset, how often does the tagger tag them as T.
Precision and Recall per tag can be computed as a function of the true positive, true negative, false positive and false negative counts for the tags:
  1. True positive count (TP): number of words tagged as T both in the test set and by the tagger.
  2. True negative count (TN): words tagged as non-T both in the test set and by the tagger.
  3. False positive count (FP): words tagged as non-T in the test set and as T by the tagger.
  4. False negative (FN): words tagged as T in the test set and as non-T by the tagger.
Since there is a natural trade-off between Precision and Recall, we often measure a score that combines the two parameters, called the F-measure. The formulas are:
Precision(T) = TP / (TP + FP)

Recall(T) = TP / (TP + FN)

F-Measure(T) = 2 x Precision x Recall / (Recall + Precision) = 2TP / (2TP + FP + FN)
All three measures are numbers between 0 and 1.

Add the function MicroEvaluate(corpus_test) to the TaggerI interface that computes for the tagger TP, TN, FP, FN, Precision, Recall and F-measure.
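A possible sketch of the per-tag counts and measures, written here as a standalone function over a tagger and a gold test corpus rather than as a TaggerI method:

def MicroEvaluate(tagger, test_sents):
    gold, predicted = [], []
    for sent in test_sents:
        gold.extend(tag for (word, tag) in sent)
        predicted.extend(tag for (word, tag)
                         in tagger.tag([w for (w, t) in sent]))
    results = {}
    for tag in set(gold) | set(predicted):
        tp = sum(1 for g, p in zip(gold, predicted) if g == tag and p == tag)
        fp = sum(1 for g, p in zip(gold, predicted) if g != tag and p == tag)
        fn = sum(1 for g, p in zip(gold, predicted) if g == tag and p != tag)
        tn = len(gold) - tp - fp - fn
        precision = tp / float(tp + fp) if tp + fp else 0.0
        recall = tp / float(tp + fn) if tp + fn else 0.0
        f = 2 * tp / float(2 * tp + fp + fn) if tp + fp + fn else 0.0
        results[tag] = (tp, tn, fp, fn, precision, recall, f)
    return results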

Propose a method to test these functions (think of extreme cases of taggers that would produce results with expected precisions or recalls).

Which tags are most difficult in the simplified tagset? In the full tagset?

Confusion Matrix

A valuable method for error analysis consists of computing the confusion matrix of a tagger. Consider an error committed by a tagger: a word is predicted as tag T1 where it should be tagged as T2. In other words, tag T2 is confused with T1. Note that confusion is not symmetric.

A confusion matrix tabulates all the mistakes committed by a tagger in the form of a matrix C[ti, tj]. C[ti, tj] counts the number of times the tagger predicted ti instead of tj. (Which NLTK data structure is appropriate for such a value?)

Write a method ConfusionMatrix(corpus_test) that returns such a matrix for a tagger.
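A minimal sketch, using a ConditionalFreqDist (one natural answer to the data-structure question above) keyed on the correct tag, with the wrongly predicted tags as samples:

import nltk

def ConfusionMatrix(tagger, test_sents):
    # collect (correct tag, predicted tag) pairs for every tagging error
    pairs = []
    for sent in test_sents:
        tagged = tagger.tag([w for (w, t) in sent])
        pairs.extend((gold_tag, pred_tag)
                     for (w, gold_tag), (_, pred_tag) in zip(sent, tagged)
                     if gold_tag != pred_tag)
    return nltk.ConditionalFreqDist(pairs)

NLTK's metrics module also provides a ConfusionMatrix class built from two flat tag sequences, which you can use to cross-check your output.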

Validate the ConfusionMatrix() method over the DefaultTagger discussed in class.

Report the confusion matrix for the full tagset and simplified tagset of the Brown corpus for the last tagger discussed in class. Discuss the results: which pairs of tags are the most difficult to distinguish?

Given your observation on the most likely confusions, propose a simple (engineering) method to improve the results of your tagger. Implement this improvement and report on error reduction.

Sensitivity to the Size and Structure of the Training Set: Cross-Validation

The taggers we reviewed in class were trained on a data set then evaluated on a test set. We will now investigate how the results of the evaluation vary when we vary the size of the training set and the way we split our overall dataset between training and test sets.

We saw above a plot that shows how the accuracy of a unigram tagger improves as the size of the training set increases. Assume we are given a manually tagged corpus of N words. We want to train on one part and test on another, so we split the corpus in 2 parts. How should we split the dataset so that performance on the test set is a good predictor of actual performance on unseen data?

The first method we will describe is called cross-validation: assume we decide to split our corpus in relative size of 90% training-10% testing. How can we be sure the split on which we test is representative? The cross-validation process consists of splitting the data in 10 subsets of 10% each. We iterate the process of training/testing 10 times, each time withholding one subset of 10% for testing and training on the other 9 subsets. We then report the results of the accuracy as a table with rows: i (iteration number), accuracy(i) -- and summarize with the combined accuracy averaged over the ten experiments. (The same procedure can be performed for any number of splits N.)

Implement a method crossValidate(corpus, n) for trainable taggers. Report the 10-fold cross-validation results for the last tagger discussed in class. Discuss the results:

  1. Are the accuracy results "stable" from fold to fold (for each of the 10 experiments)?
  2. How do you explain the variations?
  3. Would it help to try the extreme situation where we use number of folds = number of sentences in the training set?
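A minimal sketch of such a cross-validation loop, where train_func is any function that builds a tagger from a list of tagged sentences (for example lambda data: nltk.UnigramTagger(data, backoff=nltk.DefaultTagger('NN'))):

def crossValidate(train_func, tagged_sents, n=10):
    tagged_sents = list(tagged_sents)
    fold_size = len(tagged_sents) // n
    accuracies = []
    for i in range(n):
        # withhold fold i for testing, train on the other n-1 folds
        test = tagged_sents[i * fold_size:(i + 1) * fold_size]
        train = (tagged_sents[:i * fold_size] +
                 tagged_sents[(i + 1) * fold_size:])
        tagger = train_func(train)
        accuracies.append(tagger.evaluate(test))
    return accuracies, sum(accuracies) / float(n)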

Stratified Samples

When we perform cross-validation, we split the corpus randomly in N parts. An important issue to consider is whether the corpus contains sentences that are uniformly difficult. Assume there are P classes of sentences in the corpus, each class is more or less difficult to tag. If we sample the test corpus out of the "easy" class, we will unfairly claim high accuracy results. One way to avoid such bias is to construct stratified testing datasets.

The procedure for constructing a stratified dataset consists of identifying P classes, then splitting each class separately. In this question, we will perform stratification over 2 dimensions: sentence length and genre.

The hypothesis we want to test is whether the length of a sentence (number of words) or its genre affects the results of the tagger.

We first define three classes of sentence lengths: short, medium and long. To decide on the exact definition of these classes, plot the distribution of sentences by length in the overall Brown corpus (all categories). The plot should show how many sentences occur in the corpus for each observed sentence length. Observe the plot and decide on cutoff values for the classes "short", "medium" and "long". Discuss how you made your decision.
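A small sketch of the sentence-length plot, in the style of the pylab example above:

import nltk, pylab
from nltk.corpus import brown

def plot_sentence_lengths():
    # number of sentences (y) for each observed sentence length (x)
    lengths = nltk.FreqDist(len(sent) for sent in brown.sents())
    xs = sorted(lengths)
    pylab.plot(xs, [lengths[x] for x in xs], '-bo')
    pylab.xlabel('Sentence length (words)')
    pylab.ylabel('Number of sentences')
    pylab.show()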

Write a method to construct a stratified dataset given the classes: stratifiedSamples(classes, N=10). The method should return 2 values: the training subset and the test subset, each stratified according to the classes. For example, if N=10, the stratified test subset should contain 10% of each of the classes and the stratified training subset should contain 90% of each of the classes. As a consequence, both training and testing sets contain the same relative proportion of each class.
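A minimal sketch of stratifiedSamples, assuming classes is a dict mapping a class name to the list of tagged sentences in that class (short/medium/long, or one entry per Brown category):

import random

def stratifiedSamples(classes, N=10):
    train, test = [], []
    for sents in classes.values():
        # take 1/N of each class for testing and the rest for training,
        # so both sets keep the same relative proportion of each class
        sents = list(sents)
        random.shuffle(sents)
        cut = len(sents) // N
        test.extend(sents[:cut])
        train.extend(sents[cut:])
    return train, test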

Perform a cycle of training-testing on the Brown corpus for the last tagger discussed in class for each of the following cases:

  1. Random split 90%-10%
  2. Stratified split 90%-10% according to sentence length (split short/medium/long)
  3. Stratified split 90%-10% according to the sentence genre. The Brown corpus contains sentences in each of the following categories: adventure, belles_lettres, editorial, fiction, government, hobbies, humor, learned, lore, mystery, news, religion, reviews, romance, science_fiction.

Discuss the results you observe.


Last modified 04 Nov 2012