Assignment 1

Due: Mon 19 Dec 2016 Midnight

Natural Language Processing - Fall 2017 Michael Elhadad

The objectives of this assignment are:

  1. Use Python for Machine Learning applications
  2. Access an annotated text corpus using NLTK corpus classes
  3. Annotate and extend a text corpus
  4. Access text resources on the Internet and clean them before they can be annotated
  5. Explore the statistical properties of a corpus
  6. Train and evaluate simple tagging models
  7. Experiment with a good POS tagger.
  8. Experiment with a neural network library.

Make sure you have installed scipy, scikit-learn and numpy to work on this assignment. This should be already available if you have installed the Anaconda distribution.

Submit your solution in the form of a Jupyter notebook file (with extension ipynb). Images should be submitted as PNG or JPG files. The complete code should also be submitted in a separate folder, with all the code needed to run the questions organized in clearly documented functions.

Look at the following two notebooks as examples to get you started - copy them to your Anaconda folder, then start your iPython notebook server and explore them:

You should execute the following tutorials on Notebooks, Numpy, Scipy and Matplotlib before starting the assignment: Scipy 2015 SKLearn Tutorial (ipynb).

For a broader introduction to machine learning, it is highly recommended to work through the full set of tutorials available in the form of iPython notebooks in this SKLearn Tutorial, but this is not necessary for the purposes of this assignment.




Parts of Speech Tagging

In this question, we will use the following tagset for parts of speech, called the "universal tagset". It is slightly different from the original Penn Treebank/Brown tagsets and is better suited for porting across languages:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(tagset='universal')

1 Data Exploration

1.1 Manual Tagging

Manually tag the following sentences:
One of the most basic and most useful task when processing text is to tokenize each word separately and label each word according to its most likely part of speech.

In the mid 1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech, when working to tag the Lancaster-Oslo-Bergen Corpus of British English.

A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of what categories could co-occur at all. 
Write the tagging in the same way it is encoded in the Brown corpus - using the Universal Tagset abbreviations - for example:
	The/DET Fulton/PROPN County/NOUN Grand/ADJ Jury/NOUN said/VERB Friday/NOUN an/DET investigation/NOUN of/ADP Atlanta/NOUN 's/PART recent/ADJ primary/ADJ election/NOUN produced/VERB ``/PUNCT no/DET evidence/NOUN ''/PUNCT that/SCONJ any/DET irregularities/NOUN took/VERB place/NOUN ./PUNCT
Some of the manual tagging decisions are constrained by tokenization decisions (how the text is split into words). Give examples.

1.2 Gathering and Cleaning Up Data

When we discussed the task of POS tagging in class, we assumed the text comes in a "clean" form: segmented into sentences and words. We ran experiments on a clean corpus (correctly segmented) and obtained results of about 90-95% accuracy. In this question, we want to get a feeling for how difficult it is to clean real-world data: we access "raw text" and clean it up: remove HTML tags, and segment it into sentences and words.

Look at the data of the Brown corpus as it is stored in the nltk_data folder (by default, it is in a folder named like C:\nltk_data\corpora\brown under Windows). The format of the corpus is quite simple. We will attempt to add a new "section" to this corpus.

Look at the following Python library for performing Google Search (to run with Python3). This library allows you to send queries to Google and download the results in a simple manner in Python. To install this library, copy the file google3.py under your Python installation in /python/lib/google3. This library uses a popular library in the Python world called BeautifulSoup which can easily parse text in XML or "almost XML" format (such as noisy HTML pages).

It also uses a library called justext that parses HTML pages and extracts "clean text" out of them in a very smart manner (it removes JavaScript code, advertisements, banners, etc.).

Make sure BeautifulSoup is installed:

Open a command shell
Type:
% pip install beautifulsoup4
(If you installed Anaconda, beautifulsoup is already included.)

Choose a query to execute using the google package and gather about 10 hits for this query. See the example code in google3.py.

You will realize that cleaning HTML is not an easy task. The JusText Python package does a very good job of cleaning up HTML by removing boilerplate HTML code around "interesting text". To install justext, use pip:

% pip install justext
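The following is a minimal sketch of the cleaning step, assuming justext is installed: it fetches one page and keeps only the paragraphs that justext does not classify as boilerplate. The URL and output file name are placeholders for your own hits.

# Minimal sketch: download one page and keep the non-boilerplate paragraphs.
# The URL and output file name are placeholders.
import urllib.request
import justext

url = "https://en.wikipedia.org/wiki/Part-of-speech_tagging"
html = urllib.request.urlopen(url).read()

paragraphs = justext.justext(html, justext.get_stoplist("English"))
clean_text = "\n".join(p.text for p in paragraphs if not p.is_boilerplate)

with open("hit_01_clean.txt", "w", encoding="utf-8") as f:
    f.write(clean_text)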

Save the resulting hits into clean text files. Then run the best POS tagger you have available from class (using NLTK taggers) on the resulting text files, using the universal POS tagset for the Brown corpus (17 tags). Save the resulting tagged text into files in the same format expected by the Brown corpus. You should gather about 20 sentences. Look at the Python code under \Python\Lib\site-packages\nltk\corpus\reader\tagged.py to see how the nltk Brown corpus reader works.

Finally, manually review the tagged text and fix the errors you find. Put the manually tagged file into the nltk_data Brown corpus folder, under one of the existing categories. Make sure the NLTK corpus reader can read the new text you have just added to the Brown corpus.

Review the tagging of the new text twice, independently (2 analyses), and compare your tagging results. Report the list of words on which your 2 manual tagging decisions differ (write a function to compare two taggings of the same text saved in 2 different tagged files; see the sketch below). Show the differences between each of your tags and the tags produced by the automatic tagger. Report how long it took you to check the tagging of 20 sentences.
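One possible shape for the comparison function, assuming both files use the Brown word/TAG format and contain exactly the same tokens (the file names in the example are hypothetical):

def load_tagged(path):
    """Read a Brown-format tagged file into a list of (word, tag) pairs."""
    with open(path, encoding="utf-8") as f:
        return [tuple(tok.rsplit("/", 1)) for tok in f.read().split()]

def compare_taggings(path1, path2):
    """Return (index, word, tag1, tag2) for every token the two annotations disagree on."""
    tagged1, tagged2 = load_tagged(path1), load_tagged(path2)
    assert [w for w, _ in tagged1] == [w for w, _ in tagged2], "the two files tag different texts"
    return [(i, w, t1, t2)
            for i, ((w, t1), (_, t2)) in enumerate(zip(tagged1, tagged2))
            if t1 != t2]

# Example (hypothetical file names):
# for i, word, t1, t2 in compare_taggings("checked1/added.txt", "checked2/added.txt"):
#     print(i, word, t1, t2)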

You must submit in your answer:

  1. The code used to run the queries and clean the resulting text.
  2. The clean text files submitted as input to your tagger. (Folder "clean")
  3. The tagged files returned by your tagger. (Folder "tagged")
  4. The tagged files manually verified and corrected by each human tagger. (Folder "checked1" and "checked2")
  5. The code used to compare the tags of "checked1" and "checked2" and list the differences. (Hint: search the Web for an existing Python library to compare files - for example difflib.)
  6. The tagged files where the best tag has been chosen for each disagreement between checked1 and checked2 (Folder "final").
Report qualitatively on the errors you observe during this pipeline:
  1. Errors encountered while dealing with the Google engine
  2. Errors encountered while downloading the material from the Google hits
  3. Errors encountered while cleaning up the HTML pages
  4. Errors encountered while segmenting the text into sentences and words
  5. Errors made by the automatic tagger: how many errors were found in checked1 and checked2 each, and altogether.
  6. Disagreements between checked1 and checked2.
  7. Actual accuracy obtained by your automatic tagger compared with the final verified version of the text collection.

1.3 Gathering Basic Statistics

When we use a tagger that relies on lexical information (for each word form, what is the distribution of tags that can be assigned to the word), a measure of the complexity of the POS task is related to the level of ambiguity of the word forms. In this question, we want to explore the level of ambiguity present in our dataset.

For the rest of this question, use the full Brown corpus distributed as part of NLTK (not just a single category).

Write a function that plots the number of words having a given number of tags. The X-axis should show the number of tags and the Y-axis the number of words having exactly this number of tags. Use the following example from the NLTK book as an inspiration:

import nltk
from nltk.corpus import brown

def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

def display():
    import pylab
    words_by_freq = list(nltk.FreqDist(brown.words(categories='news')))
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()
In your notebook, add the following command to make sure the pylab plots are displayed:

%matplotlib inline

The expected usage of your function is:

# corpus can be the tagged sentences or the tagged words, according to what is most convenient
>>> PlotNumberOfTags(corpus)

...shows a plot with axes: X = number of tags (1, 2, ...) and Y = number of words having exactly this number of tags...
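A minimal sketch of one way such a function could look, assuming the corpus is passed as a flat list of (word, tag) pairs such as brown.tagged_words(tagset='universal'); the function body and variable names are only suggestions:

import nltk
import pylab
from nltk.corpus import brown

def PlotNumberOfTags(tagged_words):
    # number of distinct tags observed for each word form
    cfd = nltk.ConditionalFreqDist(tagged_words)
    ambiguity = nltk.FreqDist(len(cfd[word]) for word in cfd.conditions())
    xs = sorted(ambiguity)
    pylab.plot(xs, [ambiguity[x] for x in xs], '-bo')
    pylab.title('Number of Words per Ambiguity Level')
    pylab.xlabel('Number of tags')
    pylab.ylabel('Number of words')
    pylab.show()

# PlotNumberOfTags(brown.tagged_words(tagset='universal'))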

2 Stratified Test/Train Sets

The Brown dataset contains documents in multiple categories:
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure',
 'belles_lettres',
 'editorial',
 'fiction',
 'government',
 'hobbies',
 'humor',
 'learned',
 'lore',
 'mystery',
 'news',
 'religion',
 'reviews',
 'romance',
 'science_fiction']
Each category contains a set of sentences from a different genre, with its own vocabulary and linguistic characteristics.

We want to split the dataset into a train and a test dataset, with a 90%/10% split. One way to perform the split is to maintain the same distribution of genres in the train and test datasets as in the whole dataset. To this end, we split each category into train and test in the same proportion. Such a split is called a stratified split.

Write a function to split the brown dataset in a stratified manner:

>>> train, test = stratified_split(brown, .9)
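A minimal sketch of the kind of split expected, assuming we stratify over the category field and return flat lists of tagged sentences (you may prefer to return file ids or keep the per-category structure):

from nltk.corpus import brown

def stratified_split(corpus, train_ratio=0.9):
    """Split each category separately so train/test keep the corpus genre distribution."""
    train, test = [], []
    for category in corpus.categories():
        sents = corpus.tagged_sents(categories=category, tagset='universal')
        cut = int(len(sents) * train_ratio)
        train.extend(sents[:cut])
        test.extend(sents[cut:])
    return train, test

# train, test = stratified_split(brown, .9)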

3 Unigram

3.1 Unigram Tagger

3.1.1. Write a class SimpleUnigramTagger which directly inherits from nltk.TaggerI and implements a unigram tagger in the simplest possible manner. The code should be shorter than NLTK's own unigram tagger, which is written for the general case of n-gram tagging.

3.1.2. Verify that your tagger produces the same evaluation results as the unigram tagger provided in NLTK.

3.1.3. In your evaluation, count how many of the errors are due to:

  1. The selection of a wrong tag for a known word (one that was observed at training time)
  2. The fact that the tested word was unknown (was never seen at training time)
3.1.4. Report the rate of unknown words per category.
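A rough sketch of the kind of tagger 3.1.1 asks for; the unknown-word policy (defaulting to NOUN) is just an assumption, and any documented choice is acceptable:

import nltk
from collections import defaultdict, Counter

class SimpleUnigramTagger(nltk.TaggerI):
    def __init__(self, train_sents, unknown_tag='NOUN'):
        counts = defaultdict(Counter)
        for sent in train_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        # most frequent tag observed for each known word
        self.model = {w: c.most_common(1)[0][0] for w, c in counts.items()}
        self.unknown_tag = unknown_tag

    def tag(self, tokens):
        return [(w, self.model.get(w, self.unknown_tag)) for w in tokens]

# tagger = SimpleUnigramTagger(train)
# tagger.evaluate(test)   # TaggerI provides evaluate() on top of tag()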

3.2 Using Entropy to Filter Affix Tagger

The NLTK AffixTagger is a tagger which learns a dependency between suffixes of words and part-of-speech tags. When learning good affixes, we must decide which mapping from suffix to POS is "reliable" enough to be used by the AffixTagger. The nltk.AffixTagger uses a notion of "useful context" to determine that a context predicts a POS tag in a reliable manner. (See the source code of the nltk.ContextTagger class, from which AffixTagger derives.)

In this question, we want to revisit this decision of the AffixTagger by exploiting the notion of entropy. Entropy measures the level of uncertainty in a distribution. During training, the AffixTagger learns a mapping from suffix to a distribution of tags. We want to filter this mapping to only keep predictions where the entropy of the tag distribution is low (below a certain cutoff). We also want to make sure that we keep suffixes for which we have seen sufficiently many observations to obtain a reliable estimate of the tag distribution.

Filtering only makes sense if we have a backoff tagger to handle the decisions we decide not to commit to. In the following experiment, select a backoff tagger that allows you to measure the benefit of the filtering, and explain your selection of this backoff tagger.

3.2.1. Write a specific train method for the AffixTagger which filters the learned model according to this idea. One of the parameters of this train method must be the cutoff below which we keep a (suffix to tag distribution) mapping. The cutoff is a free parameter of this method.

3.2.2. We then need to select a good value for this cutoff. One way to do this is to optimize the cutoff parameter: we split the training dataset in two parts (90%/10% of the training data), which we call train and development. We train the AffixTagger with various values of the cutoff over a range of values on the train set, and compare the accuracy obtained by each model on the development set. The optimized value of the parameter is the one that gives the best accuracy. Write a method optimize_parameter() to perform this task.

Note: for all information-theoretic computations (like entropy), we take 0·log(0) = 0. This is important for the computation of H(p) = -Σ p(x)·log p(x).

3.2.3. Describe your observations (in the notebook):

  1. Does entropy filtering improve the accuracy of the AffixTagger?
  2. How do you determine the range of values to test for the cutoff?
  3. Is the accuracy value evolving in a predictable manner as the cutoff varies?
  4. Describe the list of suffixes that are good tag predictors -- are you surprised by what you observe?
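A minimal sketch of the filtering idea of 3.2.1, assuming a fixed suffix length of 3, illustrative values for the cutoff and the minimum number of observations, and hypothetical helper names (your answer should implement this as a train method of the tagger and justify the backoff):

import math
import nltk
from collections import defaultdict, Counter

def entropy(counter):
    """Entropy (in bits) of the tag distribution stored in a Counter; 0*log(0) is taken as 0."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values() if c)

def train_filtered_affix(train_sents, backoff, affix_length=-3,
                         entropy_cutoff=0.5, min_count=10):
    suffix_tags = defaultdict(Counter)
    for sent in train_sents:
        for word, tag in sent:
            # keep words long enough for a suffix plus a 2-character stem
            if len(word) >= abs(affix_length) + 2:
                suffix_tags[word[affix_length:]][tag] += 1
    # keep only suffixes with enough observations and a low-entropy tag distribution
    model = {suffix: tags.most_common(1)[0][0]
             for suffix, tags in suffix_tags.items()
             if sum(tags.values()) >= min_count and entropy(tags) <= entropy_cutoff}
    return nltk.AffixTagger(model=model, affix_length=affix_length, backoff=backoff)

# One possible backoff (3.2 asks you to choose and justify your own):
# affix = train_filtered_affix(train, backoff=nltk.DefaultTagger('NOUN'))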

4 Fine-Grained Accuracy and Error Analysis

4.1 Per Tag Precision and Recall

We want to check the behavior of a tagger per tag. This indicates which tags are most difficult to distinguish from other tags. Write a function which reports precision and recall of a tagger per tag. These measures are defined as follows:
  1. Precision for tag T: when the tagger predicts tag T, how often is it correct in the dataset.
  2. Recall for tag T: out of all words tagged as T in the (gold) dataset, how many did the tagger also tag as T.
Precision and Recall per tag can be computed as a function of the true positive, true negative, false positive and false negative counts for the tags:
  1. True positive count (TP): number of words tagged as T both in the test set and by the tagger.
  2. True negative count (TN): words tagged as non-T both in the test set and by the tagger.
  3. False positive count (FP): words tagged as non-T in the test set and as T by the tagger.
  4. False negative (FN): words tagged as T in the test set and as non-T by the tagger.
Since there is a natural trade-off between Precision and Recall, we often report a score that combines the two, called the F-measure. The formulas are:
Precision(T) = TP / (TP + FP)

Recall(T) = TP / (TP + FN)

F-Measure(T) = 2 × Precision × Recall / (Precision + Recall) = 2TP / (2TP + FP + FN)
All three measures are numbers between 0 and 1.
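A minimal sketch of the per-tag counting, assuming the gold and predicted corpora are parallel lists of tagged sentences; in your answer this logic should be folded into the microEvaluate(corpus_test) method requested below (the function and key names here are only suggestions):

from collections import Counter

def per_tag_scores(gold_sents, predicted_sents):
    """Per-tag TP/TN/FP/FN counts and Precision/Recall/F-measure."""
    tp, fp, fn = Counter(), Counter(), Counter()
    total = 0
    for gold, pred in zip(gold_sents, predicted_sents):
        for (_, gtag), (_, ptag) in zip(gold, pred):
            total += 1
            if gtag == ptag:
                tp[gtag] += 1
            else:
                fn[gtag] += 1   # the gold tag was missed
                fp[ptag] += 1   # the predicted tag was wrong
    scores = {}
    for tag in set(tp) | set(fp) | set(fn):
        prec = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        rec = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        tn = total - tp[tag] - fp[tag] - fn[tag]
        scores[tag] = {'TP': tp[tag], 'TN': tn, 'FP': fp[tag], 'FN': fn[tag],
                       'Precision': prec, 'Recall': rec, 'F': f}
    return scores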

4.1.1. Add the function microEvaluate(corpus_test) to the TaggerI interface that computes for the tagger TP, TN, FP, FN, Precision, Recall and F-measure.

4.1.2. Which tags are most difficult in the universal tagset? Explain why with linguistic examples.

4.2 Confusion Matrix

A valuable method for error analysis consists of computing the confusion matrix of a tagger. Consider an error committed by a tagger: a word is predicted as tag T1 when it should be tagged T2. In other words, tag T2 is confused with T1. Note that confusion is not symmetric.

A confusion matrix tabulates all the mistakes committed by a tagger in the form of a matrix C[ti, tj]. C[ti, tj] counts the number of times the tagger predicted ti instead of tj. (Which NLTK data structure is appropriate for such a value?)

4.2.1. Write a method ConfusionMatrix(corpus_test) that returns such a matrix for a tagger. Note: Study how the method sklearn.metrics.confusion_matrix works to help you work on this task:

import numpy as np
from sklearn import metrics

x = ['a', 'b', 'c']
ygold = ['vowel', 'consonant', 'consonant']
yest = ['vowel', 'vowel', 'consonant']

cm = metrics.confusion_matrix(ygold, yest)
print(cm)
The following code displays a confusion matrix in a pleasing way:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import itertools

def plot_confusion_matrix(cm, classes,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

# Plot confusion matrix ('labels' is the list of tag names, in the order used in cm)
plt.figure(figsize=(10,10))
plot_confusion_matrix(cm, classes=labels, title='Confusion matrix')
plt.show()

4.2.2. Validate the ConfusionMatrix() method over the DefaultTagger discussed in class (what shape do you expect for the confusion matrix of a DefaultTagger?).

4.2.3. Report the confusion matrix for the universal tagset of the Brown corpus for the best tagger discussed in class. Discuss the results: which pairs of tags are the most difficult to distinguish? Show examples of the 4 most frequent confusion pairs and explain why they are difficult to distinguish.

5 Averaged Perceptron

We now experiment with a good POS tagger described by Matthew Honnibal in this article: A good POS tagger in 200 lines of Python. This tagger uses the averaged perceptron as its learning algorithm, together with well-chosen features.
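NLTK ships this implementation as PerceptronTagger; the following is only a usage sketch, assuming the pretrained 'averaged_perceptron_tagger' model has been downloaded with nltk.download:

# Usage sketch only: load the pretrained averaged perceptron and tag one sentence.
# Requires: nltk.download('averaged_perceptron_tagger')
# Note: the pretrained model uses the Penn Treebank tagset, not the universal tagset.
from nltk.tag.perceptron import PerceptronTagger

tagger = PerceptronTagger()   # load=True by default: loads the pretrained model
print(tagger.tag(['The', 'jury', 'said', 'the', 'election', 'produced', 'no', 'evidence', '.']))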

Honnibal's code is available in NLTK under the name PerceptronTagger. Your task is:

6 Averaged Perceptron in TensorFlow [OPTIONAL BONUS QUESTION]

Your objective is to implement a version of the Perceptron tagger using the TensorFlow neural network library. Consider these two examples as a starting point:
  1. A multilayer perceptron in TensorFlow for MNIST handwritten digit recognition.
  2. Scikit-learn's DictVectorizer, which describes how to translate features encoded as dictionaries (as implemented in the PerceptronTagger reviewed in 5.4.4) into numpy vectors of numbers (see the sketch below).
The easiest way to start your implementation is to modify the PerceptronTagger without tagdict usage (as in 5.4.3) and replace the model class of the simple perceptron implemented in the NLTK version. Experiment with the selection of the parameters of your model (dimensions of hidden layers, number of hidden layers) and report on the performance obtained.
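As a hint for the feature-encoding step, a minimal DictVectorizer sketch; the feature dictionaries below are made up for illustration (the PerceptronTagger builds richer ones per word):

from sklearn.feature_extraction import DictVectorizer

# Hypothetical per-word feature dictionaries, in the style the PerceptronTagger builds.
features = [
    {'word': 'dog',  'suffix': 'og', 'prev_tag': 'DET'},
    {'word': 'runs', 'suffix': 'ns', 'prev_tag': 'NOUN'},
]
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(features)   # one row per word, one 0/1 column per (feature, value) pair
print(vectorizer.feature_names_)
print(X)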
Last modified 30 Nov 2016