Assignment 1
Due: Mon 19 Dec 2016 Midnight
Natural Language Processing - Fall 2017 Michael Elhadad
The objectives of this assignment are:
- Use Python for Machine Learning applications
- Access an annotated text corpus using NLTK corpus classes
- Annotate and extend a text corpus
- Access text resources on the Internet and clean it before it can be annotated
- Explore the statistical properties of a corpus
- Train and evaluate simple tagging models
- Experiment with a good POS tagger.
- Experiment with a neural network library.
Make sure you have installed scipy, scikit-learn and numpy to work on this assignment.
This should be already available if you have installed the Anaconda distribution.
Submit your solution in the form of an Jupyter notebook file (with extension ipynb).
Images should be submitted as PNG or JPG files. The whole code should also be submitted as a separate folder with all necessary code to run the questions separated in clearly
documented functions.
Look at the following two notebooks as examples to get you started - copy them to your Anaconda folder, then start
your iPython notebook server and explore them:
You should execute the following tutorials on Notebooks, Numpy, Scipy and Matplotlib before starting the assignment:
Scipy 2015 SKLearn Tutorial
(ipynb).
For a larger introduction to machine learning - it is much recommended to execute the full set of tutorials available in the form of iPython notebooks in this
SKLearn Tutorial but this is not necessary for the purposes of this assignment.
Content
Parts of Speech Tagging
In this question, we will use the following tagset - called the "universal tagset" for parts of speech.
It is slightly different from the original PennTreeBank/Brown taget and is more adapted to be ported
across languages:
brown_news_tagged = brown.tagged_words(tagset='universal')
1 Data Exploration
1.1 Manual Tagging
Manually tag the following sentences:
One of the most basic and most useful task when processing text is to tokenize each word separately and label each word according to its most likely part of speech.
In the mid 1980s, researchers in Europe began to use hidden Markov models (HMMs) to disambiguate parts of speech, when working to tag the Lancaster-Oslo-Bergen Corpus of British English.
A first approximation was done with a program by Greene and Rubin, which consisted of a huge handmade list of what categories could co-occur at all.
Write the tagging in the same way it is encoded in the Brown corpus - using the Universal Tagset abbreviations - for example:
The/DET Fulton/PROPN County/NOUN Grand/ADJ Jury/NOUN said/VERB Friday/NOUN an/DET investigation/NOUN of/ADP Atlanta/NOUN 's/PART recent/ADJ primary/ADJ election/NOUN produced/VERB ``/PUNCT no/DET evidence/NOUN ''/PUNCT that/SCONJ any/DET irregularities/NOUN took/VERB place/NOUN ./PUNCT
Some of the manual tagging is prevented by tokenization decisions (how words are split).
Give examples.
1.2 Gathering and Cleaning Up Data
When we discussed the task of POS tagging in class, we assumed the text
comes in a "clean" form: segmented in sentences and in words. We ran
experiments on a clean corpus (correctly segmented) and obtained results
of about 90-95% accuracy. In this question, we want to get a feeling of
how difficult it is to clean real-world data. In this question, we access "raw text"
and clean it up: remove HTML tags, segment in sentences and in words.
Lookup at the data of the Brown corpus as it is stored in the nltk_data
folder (by default, it is in a folder named like
C:\nltk_data\corpora\brown under Windows). The format of the corpus is
quite simple. We will attempt to add a new "section" to this corpus.
Look at the following Python library for
performing Google Search (to run with Python3). This library allows you to send queries to
Google and download the results in a simple manner in Python. To
install this library, copy the file google3.py under your Python
installation in /python/lib/google3. This library uses a popular library
in the Python world called
BeautifulSoup
which can easily parse text in XML or "almost XML" format (such as noisy
HTML pages).
It also uses a library called justext that parses HTML pages and extracts
"clean text" out of it in a very smart manner (it removes jscript code,
advertisements, banners etc).
Make sure beautifulSoup is installed:
Open a command shell
Type:
% pip install beautifulsoup4
(If you installed Anaconda, beautifulsoup is already included.)
Choose a query to execute using the google package and gather about 10
hits from this site. See the example code in google3.py.
You will realize that cleaning HTML is not an easy task.
The JusText Python
package does a very good job of cleaning up HTML by removing boilerplate
HTML code around "interesting text".
To install justext, use pip:
% pip install justext
Save the resulting hits into clean text files. Then run the best POS
Tagger you have available from class (using NLTK taggers) on the resulting text files,
using the universal POS tagset for the Brown corpus (17 tags). Save the resulting
tagged file into text files in the same format expected by the Brown
corpus. You should gather about 20 sentences. Look at the Python
code under \Python\Lib\site-packages\nltk\corpus\reader\tagged.py to
see explanations on how the nltk Brown corpus reader works.
Finally, manually review the tagged text and fix the errors you find.
Put the manually tagged file into the nltk_data Brown corpus folder
into one of the existing category. Make sure the NLTK corpus reader can read the new text you have
just added to the Brown corpus.
Review the tagging of the new text separately (2 analyses) and compare
your tagging results. Report the list of words on which your 2 manual
tagging decisions are different (write a function to compare two
taggings of the same text saved in 2 different tagged files.) Show
the differences between each of your tags and the tags produced by the
automatic tagger. Report how long it took you to check the tagging of
20 sentences.
You must submit in your answer:
- The code used to run the queries and clean the resulting text.
- The clean text files submitted as input to your tagger. (Folder "clean")
- The tagged files returned by your tagger. (Folder "tagged")
- The tagged files manually verified and corrected by each human
tagger. (Folder "checked1" and "checked2")
- The code used to compare the tags of "checked1" and "checked2" and
list the differences. (Hint: search the Web to find existing Python
library to compare files - for example difflib).
- The tagged files where the best tag has been chosen for each
disagreement between checked1 and checked2 (Folder "final").
Report qualitatively on the errors you observe during this pipeline:
- Errors met while dealing with the Google engine
- Errors met while downloading the material from the Google hits
- Errors met while cleaning up the HTML pages
- Errors met while segmenting the text into sentences and words
- Errors met by the automatic tagger: how many errors were reported by
checked1 and checked2 each, and altogether.
- Disagreements between checked1 and checked2.
- Actual accuracy obtained by your automatic tagger compared with the
final verified version of the text collection.
1.2 Gathering Basic Statistics
When we use a tagger that relies on lexical information (for each word
form, what is the distribution of tags that can be assigned to the
word), a measure of the complexity of the POS task is related to the
level of ambiguity of the word forms. In this question, we want to
explore the level of ambiguity present in our dataset.
For the rest of this question, use the full Brown corpus distributed as
part of NLTK (not just a single category).
Write a function that plots the number of words
having a given number of tags. The X-axis should show the number of
tags and the Y-axis the number of words having exactly this number of
tags. Use the following example from the
NLTK book as an
inspiration:
def performance(cfd, wordlist):
lt = dict((word, cfd[word].max()) for word in wordlist)
baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))
def display():
import pylab
words_by_freq = list(nltk.FreqDist(brown.words(categories='news')))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
sizes = 2 ** pylab.arange(15)
perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
pylab.plot(sizes, perfs, '-bo')
pylab.title('Lookup Tagger Performance with Varying Model Size')
pylab.xlabel('Model Size')
pylab.ylabel('Performance')
pylab.show()
In your Notebook, add the following command to make sure
the pylab plots are displayed:
%matplotlib inline
# corpus can be the tagged_sentences or tagged_words according to what is most convenient
>>> PlotNumberOfTags(corpus)
...show a plot with axis: X - number of tags (1, 2...) and Y - number of words having this number of tags...
2 Stratified Test/Train Sets
The Brown dataset contains documents in multiple categories:
% from nltk.corpus import brown
% brown.categories()
['adventure',
'belles_lettres',
'editorial',
'fiction',
'government',
'hobbies',
'humor',
'learned',
'lore',
'mystery',
'news',
'religion',
'reviews',
'romance',
'science_fiction']
Each category contains a set of sentences of different genres, and with different vocabularies and linguistic characteristics.
We want to split the dataset into a train and test dataset - as a 90%/10% split.
One way to perform the split is to attempt to maintain the same distribution of genres in the train and test dataset as it is in the whole dataset.
To this end, we need to split each category into train and test in the same proportion.
Such splits are called stratified split.
Write a function to split the brown dataset in a stratified manner:
% train, test = stratified_split(brown, .9)
3 Unigram
3.1 Unigram Tagger
3.1.1. Write a class SimpleUnigramTagger which directly inherits from
nltk.TaggerI and implements a unigram tagger in the simplest possible
manner. The code should be shorter than that of nltk which is based on
the general case of ngram tagging.
3.1.2. Verify that your tagger produces the same evaluation as the one provided in nltk.
3.1.3. In your evaluation, count how many of the errors are due:
- To the selection of a wrong tag for a known word (one that was observed at training time)
- To the fact that the tested word was unknown (was never seen at training time)
3.1.4. Report the rate of unknown words per category.
3.2 Using Entropy to Filter Affix Tagger
The NLTK Affix tagger is a tagger which learns a dependency between suffixes of words and parts of speech tags.
When learning good affixes we must decide which mapping from suffix to POS is "reliable"
enough to be used by the Affix tagger. The nltk.AffixTagger uses a
notion of "useful context" to determine that a context predicts a POS
tag in a reliable manner. (See
the source
code of the nltk.ContextTagger class from which
AffixTagger
derives.)
In this question, we want to explore this decision of the
SuffixTagger by exploiting the notion of
entropy. Entropy
measures the level of uncertainty in a distribution. During training,
the AffixTagger learns a mapping from suffix to a distribution of tags.
We want to filter this mapping to only keep predictions where the
entropy of the tags distribution is low (below a certain cutoff). We
also want to make sure that we keep suffixes for which we have seen
sufficiently many observations to obtain a reliable estimation of the
tags distribution.
Filtering only makes sense if we have a backoff tagger to handle the decisions
we decide not to commit to. In the following experiment, select a backoff tagger
that allows you to measure the benefit of the filtering - explain your selection
of this backoff tagger to this end.
3.2.1. Write a specific train method for the AffixTagger which filters the
learned model according to this idea. One of the parameters of this
train method must be the cutoff below which we keep a (suffix to
distribution) mapping. The cutoff is a free parameter of this method.
3.2.2. We need then to select a good value for this prefix. One way to do this
is to optimize the cutoff parameter. We do this by splitting the training dataset
in two parts (90%/10% of the training data) - which we call train and development.
We train the AffixTagger with various values of the cutoff over a range of values on the
train set, and compare the accuracy obtained by each model over the development
set. The optimized value of the parameter is the one that gives the
best accuracy. Write a method optimize_parameter() to perform this
task.
Note: for all information theoretic computations (like Entropy), we compute
0xlog(0) = 0. This is important for the computation of H(p) = Σp.log(p).
3.2.3 Describe your observations (in the notebook):
- Does entropy filtering improve the accuracy of the AffixTagger?
- How do you determine the range of values to test for the cutoff?
- Is the accuracy value evolving in a predictable manner as the cutoff varies?
- Describe the list of suffixes that are good tag predictors -- are
you surprised by what you observe?
4 Fine-Grained Accuracy and Error Analysis
4.1 Per Tag Precision and Recall
We want to check the behavior of a tagger per tag. This
indicates which tags are most difficult to distinguish from other
tags. Write a function which reports precision and recall of a tagger
per tag. These measures are defined as follows:
- Precision for tag T: when the tagger predicts tag T, how often is it correct in the dataset.
- Recall for tag T: out of all words tagged as T in the dataset, how many are tagged correctly.
Precision and Recall per tag can be computed as a function of the true
positive, true negative, false positive and false negative counts for
the tags:
- True positive count (TP): number of words tagged as T both in the
test set and by the tagger.
- True negative count (TN): words tagged as non-T both in the test set
and by the tagger.
- False positive count (FP): words tagged as non-T in the test set and
as T by the tagger.
- False negative (FN): words tagged as T in the test set and as non-T
by the tagger.
Since there is a natural trade-off between Precision and Recall, we
often measure a score that combines the two parameters and is called
F-measure. The formula are:
Precision(T) = TP / TP + FP
Recall(T) = TP / TP + FN
F-Measure(T) = 2 x Precision x Recall / (Recall + Precision) = 2TP / (2TP + FP + FN)
All three measures are numbers between 0 and 1.
4.1.1. Add the function microEvaluate(corpus_test) to the TaggerI interface
that computes for the tagger TP, TN, FP, FN, Precision, Recall and
F-measure.
4.1.2. Which tags are most difficult in the universal tagset? Explain why with linguistic examples.
4.2 Confusion Matrix
A precious method to perform error analysis consists of computing the
confusion matrix of a tagger. Consider an error committed by a tagger:
a word is predicted as tag T1 where it should be tagged as T2. In other
words, tag T2 is confused with T1. Note that confusion is not symmetric.
A confusion matrix tabulates all the mistakes committed by a tagger
in the form of a matrix C[ti, tj]. C[ti, tj] counts the number of times
the tagger predicted ti instead of tj. (Which NLTK data structure is
appropriate for such a value?)
4.2.1. Write a method ConfusionMatrix(corpus_test) that returns such a matrix
for a tagger.
Note: Study how the method sklearn.metrics.confusion_matrix works to help you
work on this task:
import numpy as np
from sklearn import metrics
x = ['a', 'b', 'c']
ygold = ['vowel', 'consonant', 'consonant']
yest = ['vowel', 'vowel', 'consonant']
cm = metrics.confusion_matrix(ygold, yest)
print(cm)
The following code displays a confusion matrix in a pleasing way:
%matplotlib inline
import matplotlib.pyplot as plt
import itertools
def plot_confusion_matrix(cm, classes,
title='Confusion matrix',
cmap=plt.cm.Blues):
plt.imshow(cm, interpolation='nearest', cmap=cmap)
plt.title(title)
plt.colorbar()
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes, rotation=45)
plt.yticks(tick_marks, classes)
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, cm[i, j],
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
# Plot confusion matrix
plt.figure(figsize=(10,10))
plot_confusion_matrix(cm, classes=labels, title='Confusion matrix')
plt.show()
4.2.2. Validate the ConfusionMatrix() method over the DefaultTagger discussed
in class (what shape do you expect for the confusion matrix of a DefaultTagger?).
4.2.3. Report the confusion matrix for the universal tagset of the Brown
corpus for the best tagger discussed in class. Discuss the results:
which pairs of tags are the most difficult to distinguish? Show examples of the 4 most frequent confusion pairs
and explain why they are difficult to distinguish.
5 Averaged Perceptron
We now experiment with a good POS tagger described by Matthew Honnibal
in this article: A good POS tagger in 200 lines of Python.
This tagger uses as a learning algorithm the averaged perceptron with good features.
Honnibal's code is available in NLTK under the name PerceptronTagger.
Your task is:
- 5.1. Explain in your own words how the Averaged perceptron algorithm works.
- 5.2. Train the averaged perceptron tagger on the Brown dataset (full stratified training dataset with 90% of the sentences).
- 5.3. Report on accuracy, and per tag Precision, Recall, F and confusion matrix.
- 5.4. Study the source code of the NLTK PerceptronTagger:
- 5.4.1. Explain the usage of the tagdict data structure - how many elements does it include? what fraction of the tested tokens are guessed through tagdict?
- 5.4.2. Are there errors at test time which are caused by the usage of tagdict? Write code to confirm your answer.
- 5.4.3. Write a version of the train / tag methods of the PercepronTagger that does
not use tagdict - train this version and report on performance as in 5.3.
Explain the differences you observe with the original version of PerceptronTagger.
(This type of analysis is called ablation study.)
- 5.4.4. Write a function get_features(perceptronTagger, taggedSentence) which shows the computed features for a single sentence.
pay attention to the way words guessed through tagdict are encoded as features. (Start from the code of the method PerceptronTagger.tag).
Explain the meaning of each of 13 computed features that are computed for each token - indicate for each whether they correspond to morphological,
syntactic or lexical knowledge.
- 5.4.5. Explain the role of the function normalize().
- 5.4.6. Inspect the value of the variable perceptronTagger.model.weigths after it has been trained. How many elements does it include?
Pick some of the elements of this data structure and explain their intuitive meaning as a rule "if .... then ...".
6. Averaged Perceptron in Tensorflow [OPTIONAL BONUS QUESTION]
Your objective is to implement a version of the Percepron tagger using the TensorFlow
neural network library.
Consider as a starting point these 2 examples:
- multilayer perceptron in Tensorflow for MNIST handwritten digit recognition.
- Scikit_learn
DictVectorizer describes how to translate features encoded as dictionaries
(as implemented in the PerceptronTagger reviewed in 5.4.4) into numpy vectors of numbers.
The easiest point to start your implementation is to modify the PerceptronTagger without
tagdict usage (as in 5.4.3) and replace the Model class from the simple perceptron
implemented in the nltk version.
Experiment with the selection of the parameters of your model (dimensions of hidden
layers, number of hidden layers) and report on obtained performance.
Last modified 30 Nov 2016