Assignment 1
Due: Sun 23 Nov 2014 Midnight
Natural Language Processing - Fall 2015 Michael Elhadad
This assignment covers the topic of statistical models of parts of speech tagging.
The objectives are:
- Learn how to use Python and NLTK
- Learn how to access an annotated text corpus
- Learn how to annotate and extend a text corpus
- Learn how to access text resources on the Internet and clean them before they can be annotated
- Learn how to explore the statistical properties of a corpus
- Learn to train and evaluate simple tagging models
- Experiment with word segmentation in Hebrew.
Submit your solution in the form of an iPython
notebook file (with extension ipynb). Images should be submitted as
PNG or JPG files. All code should also be submitted as a separate
folder, with everything needed to run each question organized into
clearly documented functions.
Parts of Speech Tagging
- Data Exploration
- Gathering and cleaning up data
- Gathering basic statistics
- Unigram and Affix Taggers and Voting
- Fine-grained Error Analysis
- Per Tag Precision and Recall
- Confusion Matrix
- Hebrew Word Segmentation
Data Exploration
Gathering and Cleaning Up Data
When we discussed the task of POS tagging in class, we assumed the text
comes in a "clean" form: segmented in sentences and in words. We ran
experiments on a clean corpus (correctly segmented) and obtained results
of about 90% accuracy. In this question, we want to get a feeling of
how difficult it is to clean real-world data. Please read the tutorial
in Chapter 3 of the
NLTK book. This chapter explains how to access "raw text" and clean it
up: removing HTML tags and segmenting the text into sentences and words.
Look at the data of the Brown corpus as it is stored in the nltk_data
folder (by default, it is in a folder named like
C:\nltk_data\corpora\brown under Windows). The format of the corpus is
quite simple. We will attempt to add a new "section" to this corpus.
Look at the following Python library for
performing Google Search. This library allows you to send queries to
Google and download the results in a very simple manner in Python. To
install this library, copy the file google.py under your Python
installation in /python/lib/google. This library uses a popular library
in the Python world called
BeautifulSoup
which can easily parse text in XML or "almost XML" format (such as noisy
HTML pages).
The easiest way to install libraries in Python is to use
the easy_install
tool or the pip tool. pip comes pre-installed if you installed
Anaconda.
For example, after you have installed easy_install, to install
BeautifulSoup, do the following:
Open a command shell
Type:
% easy_install beautifulsoup4
or:
% pip install beautifulsoup4
(Again, if you installed Anaconda, beautifulsoup is already included.)
Choose a query to execute using the google package and gather about 10
hits. Download the pages that your query found. Use
code similar to this script to clean up the HTML of these pages:
import nltk
from urllib import urlopen  # Python 2; in Python 3, use urllib.request.urlopen

url = "http://www.bbc.com/news/technology-26415021"
html = urlopen(url).read()
raw = nltk.clean_html(html)  # removed in NLTK 3; there, use BeautifulSoup(html).get_text()
tokens = nltk.word_tokenize(raw)
You will realize that cleaning HTML is not an easy task.
The JusText Python
package does a very good job of cleaning up HTML by removing boilerplate
HTML code around "interesting text".
To install justext, use pip:
% pip install justext
Save the resulting hits into clean text files. Then run the best POS
Tagger you have available from class on the resulting text files,
using the universal POS tagset for the Brown corpus (12 tags). Save the resulting
tagged file into text files in the same format expected by the Brown
corpus. You should gather about 50 sentences. Look at the Python
code under \Python27\Lib\site-packages\nltk\corpus\reader\tagged.py to
see explanations on how the nltk Brown corpus reader works.
Finally, manually review the tagged text and fix the errors you find.
Put the manually tagged file into the nltk_data Brown corpus folder
under one of the existing categories (or, if you are more ambitious,
under a new category in addition to 'news', 'editorial'...). Make sure the
nltk corpus reader can read the new text you have just added to the
Brown corpus.
Review the tagging of the new text separately (2 analyses) and compare
your tagging results. Report the list of words on which your 2 manual
tagging decisions are different (write a function to compare two
taggings of the same text saved in 2 different tagged files.) Show
the differences between each of your tags and the tags produced by the
automatic tagger. Report how long it took you to check the tagging of
50 sentences.
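Such a comparison function can be sketched as follows (a sketch only; compare_taggings is an illustrative name, and it assumes each tagged file has already been loaded as a list of (word, tag) pairs):

```python
def compare_taggings(tagged1, tagged2):
    """Given two taggings of the same text as lists of (word, tag) pairs,
    return the positions where the two annotators disagree, as
    (index, word, tag1, tag2) tuples."""
    if [w for w, _ in tagged1] != [w for w, _ in tagged2]:
        raise ValueError("The two files do not contain the same words")
    return [(i, w1, t1, t2)
            for i, ((w1, t1), (_, t2)) in enumerate(zip(tagged1, tagged2))
            if t1 != t2]
```

For example, comparing [('the', 'DET'), ('book', 'NOUN')] with [('the', 'DET'), ('book', 'VERB')] reports the single disagreement on 'book'.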
You must submit in your answer:
- The code used to run the queries and clean the resulting text.
- The clean text files submitted as input to your tagger. (Folder "clean")
- The tagged files returned by your tagger. (Folder "tagged")
- The tagged files manually verified and corrected by each human
tagger. (Folder "checked1" and "checked2")
- The code used to compare the tags of "checked1" and "checked2" and
list the differences. (Hint: search the Web to find existing Python
library to compare files.)
- The tagged files where the best tag has been chosen for each
disagreement between checked1 and checked2 (Folder "final").
Report qualitatively on the errors you observe during this pipeline:
- Errors met while dealing with the Google engine
- Errors met while downloading the material from the Google hits
- Errors met while cleaning up the HTML pages
- Errors met while segmenting the text into sentences and words
- Errors met by the automatic tagger: how many errors were reported by
checked1 and checked2 each, and altogether.
- Disagreements between checked1 and checked2.
- Actual accuracy obtained by your automatic tagger compared with the
final verified version of the text collection.
Gathering Basic Statistics
When we use a tagger that relies on lexical information (for each word
form, what is the distribution of tags that can be assigned to the
word), a measure of the complexity of the POS task is related to the
level of ambiguity of the word forms. In this question, we want to
explore the level of ambiguity present in our dataset.
For the rest of this question, use the full Brown corpus distributed as
part of NLTK.
Write a function that plots the number of words
having a given number of tags. The X-axis should show the number of
tags and the Y-axis the number of words having exactly this number of
tags. Use the following example from the
NLTK book as an
inspiration:
def performance(cfd, wordlist):
    lt = dict((word, cfd[word].max()) for word in wordlist)
    baseline_tagger = nltk.UnigramTagger(model=lt, backoff=nltk.DefaultTagger('NN'))
    return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

def display():
    import pylab
    words_by_freq = list(nltk.FreqDist(brown.words(categories='news')))
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
    sizes = 2 ** pylab.arange(15)
    perfs = [performance(cfd, words_by_freq[:size]) for size in sizes]
    pylab.plot(sizes, perfs, '-bo')
    pylab.title('Lookup Tagger Performance with Varying Model Size')
    pylab.xlabel('Model Size')
    pylab.ylabel('Performance')
    pylab.show()
In your iPython Notebook, add the following command to make sure
the pylab plots are displayed:
%matplotlib inline
Write a Python function that finds words with more than N observed tags.
The function should return a ConditionalFreqDist object where the
conditions are the words and the frequency distribution indicates the
tag frequencies for each word.
Write a test function that verifies that the words indeed have more than
N distinct tags in the returned value.
Write a function that given a word, finds one example of usage of the
word with each of the different tags in which it can occur.
# corpus can be the tagged_sentences or tagged_words according to what is most convenient
>>> PlotNumberOfTags(corpus)
...show a plot with axis: X - number of tags (1, 2...) and Y - number of words having this number of tags...
>>> cfd = MostAmbiguousWords(corpus, 4)
<ConditionalFreqDist with ... conditions>
>>> TestMostAmbiguousWords(cfd, 4)
All words occur with more than 4 tags.
>>> ShowExamples('book', cfd, corpus)
'book' as NOUN: ....
'book' as VERB: ....
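As a starting point, the core of MostAmbiguousWords can be sketched with a plain dict of Counters standing in for nltk.ConditionalFreqDist (the function name and return type are illustrative):

```python
from collections import Counter

def most_ambiguous_words(tagged_words, n):
    """Return a dict mapping each word observed with more than n distinct
    tags to the Counter of its tag frequencies (a plain-dict stand-in
    for nltk.ConditionalFreqDist)."""
    cfd = {}
    for word, tag in tagged_words:
        cfd.setdefault(word, Counter())[tag] += 1
    return {w: tags for w, tags in cfd.items() if len(tags) > n}
```

The test function then only has to assert that every entry in the returned value has more than n keys in its tag distribution.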
Attach in a file MostAmbiguous.txt the most ambiguous words you find in
the corpus with examples for each of them.
We expect this distribution to exhibit a "long tail" form. Do you
confirm this hypothesis? (Carefully read the definition of "long tail"
to justify your answer.)
Unigram and Affix Tagger and Voting
We described in class the behavior of the Unigram and Affix Taggers.
Both of these are implemented in nltk. We will develop here an
alternative simpler implementation, and design an alternative way
to combine the two taggers instead of the simple backoff strategy
discussed in class.
For this task, split the Brown corpus in 3 parts: training set is the
first 80% of the sentences; development set the next 10%; test set will
be the last 10%. We will use the development set to optimize parameters
in our training procedure.
Unigram Tagger
Write a class SimpleUnigramTagger which directly inherits from
nltk.TaggerI and implements a unigram tagger in the simplest possible
manner. The code should be shorter than that of NLTK, which is based on
the general case of n-gram tagging.
Verify that your tagger produces the same evaluation as the one provided
in nltk.
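The core logic can be sketched as follows (a standalone sketch without the nltk.TaggerI base class so that it runs on its own; your solution should inherit from TaggerI, whose tag() method is the entry point):

```python
from collections import Counter

class SimpleUnigramTagger:
    """Minimal unigram tagger: assign each word its most frequent tag in
    the training data, or None for unknown words."""
    def __init__(self, tagged_sents):
        counts = {}
        for sent in tagged_sents:
            for word, tag in sent:
                counts.setdefault(word, Counter())[tag] += 1
        # Keep only the most frequent tag per word.
        self.model = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag(self, tokens):
        return [(w, self.model.get(w)) for w in tokens]
```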
Using Entropy to Filter Affix Tagger
We discussed in class the Affix tagger as a tagger which learns a
dependency between suffixes of words and parts of speech tags. One of
the issues we raised was: which mapping from suffix to POS is "reliable"
enough to be used by the Affix tagger. The nltk.AffixTagger uses a
notion of "useful context" to determine that a context predicts a POS
tag in a reliable manner. (See
the source
code of the nltk.ContextTagger class from which AffixTagger
derives.)
In this question, we want to explore the issue for the case of the
SuffixTagger by exploiting the notion of
entropy. Entropy
measures the level of uncertainty in a distribution. During training,
the AffixTagger learns a mapping from suffix to a distribution of tags.
We want to filter this mapping to only keep predictions where the
entropy of the tags distribution is low (below a certain cutoff). We
also want to make sure that we keep suffixes for which we have seen
sufficiently many observations to obtain a reliable estimation of the
tags distribution.
Write a specific train method for the AffixTagger which filters the
learned model according to this idea. One of the parameters of this
train method must be the cutoff below which we keep a (suffix to
distribution) mapping.
We then need to select a good value for this cutoff. One way to do this
is to optimize the cutoff parameter. We do this by training the
AffixTagger with various values of the cutoff over a range of values,
and comparing the accuracy obtained by each model over the development
set. The optimized value of the parameter is the one that gives the
best accuracy. Write a method optimize_parameter() to perform this
task.
Note: for all information-theoretic computations, we define
0 x log(0) = 0. This is important for the computation of H(p) =
-Σ p x log(p).
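Under these definitions, the entropy computation and the filtered training step might be sketched as follows (function names, the base-2 logarithm, and the default parameter values are illustrative choices, not part of the assignment statement):

```python
from collections import Counter
from math import log

def entropy(freqs):
    """Entropy (base 2) of a tag frequency distribution,
    with 0 * log(0) taken as 0."""
    total = sum(freqs.values())
    return -sum((c / total) * log(c / total, 2)
                for c in freqs.values() if c > 0)

def train_filtered_affix(tagged_words, suffix_len=3, cutoff=0.5, min_count=5):
    """Learn suffix -> most frequent tag, keeping only suffixes seen at
    least min_count times whose tag distribution has entropy below cutoff."""
    dist = {}
    for word, tag in tagged_words:
        if len(word) > suffix_len:
            dist.setdefault(word[-suffix_len:], Counter())[tag] += 1
    return {s: c.most_common(1)[0][0] for s, c in dist.items()
            if sum(c.values()) >= min_count and entropy(c) < cutoff}
```

optimize_parameter() can then train this model over a range of cutoff values and keep the one with the best accuracy on the development set.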
Describe your observations:
- does entropy filtering improve accuracy?
- how do you determine the range of values to test for the cutoff?
- is the accuracy value evolving in a predictable manner as the cutoff varies?
- describe the list of suffixes that are good tag predictors -- are
you surprised by what you observe?
A Voting Method to Combine Taggers
When we discussed the backoff strategy to combine multiple taggers into
one better tagger, we noted that backoff taggers cannot "undo" mistakes
to which a previous tagger already committed. Consider the
UnigramTagger used as a first-line tagger, that has an error rate of
about 15%, of which about 5% are None (unknown words) and 10% are errors
made because the UnigramTagger commits to the most-frequent tag instead
of the actual tag. These 10% errors cannot be fixed by a backoff tagger.
In this question, we will attempt to implement a different tagger
combination method that avoids this problem. The intuition is that no
tagger should commit to an answer when it is "unsure" of the answer it
is proposing. We will use Entropy as a measure of confidence.
(Measuring confidence of different taggers in a way that allows
comparison is a complex research issue.)
Consider the 2 taggers - UnigramTagger and AffixTagger.
Each one can propose more than one answer - but each answer
is weighted by a probability -- how likely is it that a tag
is correct given the tagger model and the observed word -- that is,
we can consider each tagger as a model of the conditional distribution
p(t | w). For example, UnigramTagger can predict that the distribution
p(t | "book") is {'N': 0.75, 'V': 0.25}.
To exploit this information, implement the following elements:
- Define a new Python interface EntropyTaggerI that extends TaggerI
and adds the methods possible_tags(self, w) which returns a tag
distribution, and entropy(self, w) which returns the entropy
of the tag distribution p(t|w). The method choose_tag(self, w)
selects the most likely tag from possible_tags() as usual.
- Implement the methods possible_tags and entropy for UnigramTagger
and AffixTagger.
- Define a new class EntropyVotingTagger - similar to SeqBackoffTagger
discussed in class that combines a list of taggers in the following
manner:
- When a VotingTagger needs to choose a tag given a word - it
first queries the entropy of each participant tagger given w.
It then selects the tagger that is most confident in its answer,
that is, it uses choose_tag() for the tagger with the smallest
entropy.
- Note that the VotingTagger can still return None as an answer
in case no tagger can propose an answer.
- We add a free parameter to control the level of guessing the
Voting tagger is ready to make - which is a cutoff on the entropy
over which the tagger will consider that it is better not to use
the guess. For example, if the UnigramTagger "hesitates" between
N at 55% and V at 45% - it may decide it is better to "pass"
rather than to make a decision.
We will need to optimize this cutoff parameter through experimentation.
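The voting rule itself can be sketched as follows (a sketch only; it assumes each tagger implements the entropy(w) and choose_tag(w) methods of the EntropyTaggerI interface, with entropy returning None when the tagger has no information about w):

```python
def entropy_vote(taggers, word, cutoff):
    """Return the tag chosen by the tagger with the lowest entropy for
    this word, provided that entropy is below the cutoff; return None
    when no tagger is confident enough."""
    best, best_h = None, cutoff
    for tagger in taggers:
        h = tagger.entropy(word)
        if h is not None and h < best_h:
            best, best_h = tagger, h
    return best.choose_tag(word) if best else None
```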
Define the tagger that combines UnigramTagger and AffixTagger using the
EntropyVoting method. We now want to analyze how different this tagger
is from the backoff combining method.
- Report the accuracy of the Voting vs. Backoff tagger.
- On the development set report the number of words for which:
- UnigramTagger (with no backoff) has no information (unknown word)
- UnigramTagger has exactly i tag options (for i in [1...12]).
- UnigramTagger has an entropy over 0.69 (which is the entropy of a
50%/50% bet).
For each case report the number of tokens, number of distinct words
and the same number as percentages of all tokens and all distinct words.
- Draw a plot of the error rate of the unigram tagger as a function
of the entropy.
- Make a list of all cases where the backoff combined tagger and the
voting combined tagger disagree. Describe qualitatively what you
observe.
Fine-Grained Accuracy and Error Analysis
Per Tag Precision and Recall
We are interested in checking the behavior of a tagger per tag. This
indicates which tags are most difficult to distinguish from other
tags. Write a function which reports precision and recall of a tagger
per tag. These measures are defined as follows:
- Precision for tag T: when the tagger predicts tag T, how often is the prediction correct according to the dataset.
- Recall for tag T: out of all words tagged as T in the dataset, how many does the tagger tag as T.
Precision and Recall per tag can be computed as a function of the true
positive, true negative, false positive and false negative counts for
the tags:
- True positive count (TP): number of words tagged as T both in the
test set and by the tagger.
- True negative count (TN): words tagged as non-T both in the test set
and by the tagger.
- False positive count (FP): words tagged as non-T in the test set and
as T by the tagger.
- False negative (FN): words tagged as T in the test set and as non-T
by the tagger.
Since there is a natural trade-off between Precision and Recall, we
often measure a score that combines the two parameters and is called
F-measure. The formula are:
Precision(T) = TP / (TP + FP)
Recall(T) = TP / (TP + FN)
F-Measure(T) = 2 x Precision x Recall / (Precision + Recall) = 2TP / (2TP + FP + FN)
All three measures are numbers between 0 and 1.
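These definitions can be sketched directly from parallel lists of gold and predicted tags (a sketch only; in MicroEvaluate the two lists should come from running the tagger on corpus_test):

```python
from collections import Counter

def per_tag_scores(gold, predicted):
    """Given parallel lists of gold and predicted tags, return a dict
    tag -> (precision, recall, f_measure) computed from per-tag
    TP/FP/FN counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p where it should not be
            fn[g] += 1  # missed an occurrence of gold tag g
    scores = {}
    for tag in set(gold) | set(predicted):
        prec = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        rec = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        f = 2 * tp[tag] / (2 * tp[tag] + fp[tag] + fn[tag]) if tp[tag] else 0.0
        scores[tag] = (prec, rec, f)
    return scores
```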
Add the function MicroEvaluate(corpus_test) to the TaggerI interface
that computes for the tagger TP, TN, FP, FN, Precision, Recall and
F-measure.
Which tags are most difficult in the universal tagset? In the full tagset?
Confusion Matrix
A valuable method for performing error analysis consists of computing the
confusion matrix of a tagger. Consider an error committed by a tagger:
a word is predicted as tag T1 where it should be tagged as T2. In other
words, tag T2 is confused with T1. Note that confusion is not symmetric.
A confusion matrix tabulates all the mistakes committed by a tagger
in the form of a matrix C[ti, tj]. C[ti, tj] counts the number of times
the tagger predicted ti instead of tj. (Which NLTK data structure is
appropriate for such a value?)
Write a method ConfusionMatrix(corpus_test) that returns such a matrix
for a tagger.
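A minimal sketch of the idea, with a dict of Counters standing in for the NLTK data structure (here the matrix is keyed as matrix[gold_tag][predicted_tag], counting how often each gold tag was confused with each predicted tag):

```python
from collections import Counter

def confusion_matrix(gold, predicted):
    """Tabulate tagger mistakes: matrix[t2][t1] counts how many times
    gold tag t2 was predicted as t1 (correct predictions are skipped)."""
    matrix = {}
    for g, p in zip(gold, predicted):
        if g != p:
            matrix.setdefault(g, Counter())[p] += 1
    return matrix
```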
Validate the ConfusionMatrix() method over the DefaultTagger discussed
in class.
Report the confusion matrix for the universal tagset of the Brown
corpus for the last tagger discussed in class. Discuss the results:
which pairs of tags are the most difficult to distinguish?
Given your observation on the most likely confusions, propose a simple
(engineering) method to improve the results of your tagger. Implement
this improvement and report on error reduction.
Hebrew Word Segmentation
In Hebrew, 7 formative letters - mem, shin, he, vav, kaf, lamed, bet
- can be attached as prefixes to a word, indicating functions, such as
coordination – we-bait (and a home), preposition – be-bait (at home),
subordination sh-be-bait (that at home), and more. These letters can be
combined into 431 possible combinations, 76 of which are observed in a
corpus of about 10M tokens. Yona defined a set of 13 short formative
words (composed of the 7 formative letters). These short formative
words, when combined together, can compose all of the 76 observed prefix
sequences. The 13 short formative words are:
- be
- ha
- ve
- ke
- le
- me
- she
- k-she
- mi-she
- mi-be
- li-k-she
- bi-k-she
- mi-k-she
One of the most difficult issues when processing Hebrew text is to
properly segment the observed tokens by splitting these prefix
formative words from the base word. This segmentation is ambiguous,
as illustrated by this case:
barak can be read as:
- barak (noun)
- be-rak (preposition + adverb)
In addition, multiple prefix formative words can be concatenated:
vesheksheamarti
should be segmented as:
ve-she-kshe-amarti
Our task is to develop a Hebrew word segmenter.
We will use 3 resources to help us achieve this goal:
- The word segmentation program described by Norvig in
Natural Language Corpus Data.
The code for the word segmentation method is available in
wordSegment.py.
It uses the 2 language models compiled from Google's trillion word
dataset:
- 4.9 MB
count_1w.txt:
The 1/3 million most frequent words, all lowercase, with counts.
- 5.6 MB
count_2w.txt:
The 1/4 million most frequent two-word (lowercase) bigrams, with
counts.
- The BGU Hebrew corpora available at
the BGU nlpproj page.
We will use a corpus of 1M words (53K sentences) automatically
annotated from
Ha'aretz articles (23MB).
This corpus can be accessed using the NLTK corpus methods via the
BGUCorpusReader.py class.
- The BGU Morphological Analyzer developed
by Meni Adler.
This analyzer takes as input free text in Hebrew and returns
morphologically analyzed, segmented words. It can be tested
online here.
A Web Service gives access to this analyzer using this Python
class: hebrewMorph.py.
Testing Norvig's Word Segmenter
Norvig's word segmenter takes as input a sentence in English from which
all spaces and punctuation have been removed and all letters lowercased,
and returns the original sentence:
>>> from wordSegment import segment2
>>> segment2("ikindoflovesegmentingoddsentences")
(-27.41, ['i', 'kind', 'of', 'love', 'segmenting', 'odd', 'sentences'])
To test the accuracy of this segmenter, we will use the Brown corpus.
- Define a method to turn Brown sentences into appropriate input for
the segment2 function.
- Measure the accuracy of segment2 on the news category of the Brown
corpus.
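The input preparation can be sketched as follows (a sketch only; keeping purely alphabetic tokens is one of several reasonable choices for handling punctuation and numbers, and the accuracy measure then compares the segmenter's output against the gold token list):

```python
def to_segmenter_input(sentence):
    """Turn a tokenized Brown sentence into the (input, gold) pair used
    for evaluating segment2: gold tokens are the lowercased alphabetic
    words, and the input is their concatenation with no spaces."""
    gold = [w.lower() for w in sentence if w.isalpha()]
    return ''.join(gold), gold
```

For example, segment2 applied to the first element of the returned pair can be scored against the second (e.g. the fraction of sentences segmented exactly right, or a token-level measure).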
Testing Norvig's Word Segmenter Dependency on the Language Model
segment2 works remarkably well in part because it relies on the huge
language model compiled into the count_1w and count_2w files, which
summarizes observed counts over a corpus of 1 trillion words.
Before we attempt to use a similar method in Hebrew, we want to measure
to what extent the size of this language model impacts the accuracy
of the word segmenter (because we do not have a trillion words handy
in Hebrew).
- Compile a smaller language model similar to count_1w and count_2w
from a 1M words corpus in English and save it into two new files
small_1w and small_2w (remember to put all words in lowercase).
For example:
>>> import nltk.corpus
>>> bw = nltk.corpus.brown.words()
>>> len(bw)
1161192
- Repeat the measure of the accuracy of segment2 when using
small_1w and small_2w instead of count_1w and count_2w.
(Note that we are cheating in this test - explain why.)
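Compiling such a model can be sketched as follows (a sketch only; the tab-separated 'word<TAB>count' and 'word1 word2<TAB>count' line formats are assumed to mirror count_1w/count_2w, and keeping only alphabetic tokens is an illustrative choice):

```python
from collections import Counter

def compile_language_model(words, path_1w='small_1w', path_2w='small_2w'):
    """Count lowercase unigrams and bigrams over a token list and write
    them out, most frequent first, in the count_1w/count_2w format."""
    words = [w.lower() for w in words if w.isalpha()]
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    with open(path_1w, 'w') as f:
        for w, c in unigrams.most_common():
            f.write('%s\t%d\n' % (w, c))
    with open(path_2w, 'w') as f:
        for (w1, w2), c in bigrams.most_common():
            f.write('%s %s\t%d\n' % (w1, w2, c))
    return unigrams, bigrams
```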
Compiling a Hebrew Language Model
We now want to adapt Norvig's English word segmenter (which is really a
"toy" example) to the real task of segmenting Hebrew words. The task
consists of the following: given a Hebrew word in the context of a
sentence, segment the word into a sequence of formative word prefixes,
followed by a base word. The formative words are all one of the 13 words
listed above. We do not care about the suffixes.
Our first obstacle is that we do not have an observable dataset where
formative words are properly segmented. There are different ways to
address this problem. We will go for the easy one: we will use a
corpus automatically processed by Meni Adler's analyzer, and rely on
the output of this automatic tool as our "ground truth" (even though
we know this analyzer is only correct at about 96-97% level).
The haaretz.bgu corpus is such a dataset of 1M words automatically
segmented.
Using the BguCorpusReader, we obtain the following analysis of Hebrew
words annotated in the Haaretz corpus:
Given the first lines of the corpus:
אם :CONJUNCTION:
תבוצע :VERB-F,S,3,FUTURE:
הרפורמה DEF:NOUN-F,S,ABS:
הרחבה DEF:ADJECTIVE-F,S,ABS:
במשק PREPOSITION+DEF:NOUN-M,S,ABS:
הסובייטי DEF:ADJECTIVE-M,S,ABS:
We see in Python the following:
from BguCorpusReader import BguCorpusReader
c = BguCorpusReader()
tagged_words = c.tagged_words()
for i in range(6):
    w, t = tagged_words[i]
    print t.getRaw()
    print t.getBguTag()
    print t.getPosTag()
    print
:CONJUNCTION:
[[], [u'CONJUNCTION'], []]
CONJUNCTION
:VERB-F,S,3,FUTURE:
[[], [u'VERB', [u'F', u'S', u'3', u'FUTURE']], []]
VERB
DEF:NOUN-F,S,ABS:
[[[u'DEF', []]], [u'NOUN', [u'F', u'S', u'ABS']], []]
DEF NOUN
DEF:ADJECTIVE-F,S,ABS:
[[[u'DEF', []]], [u'ADJECTIVE', [u'F', u'S', u'ABS']], []]
DEF ADJECTIVE
PREPOSITION+DEF:NOUN-M,S,ABS:
[[[u'PREPOSITION', []], [u'DEF', []]], [u'NOUN', [u'M', u'S', u'ABS']], []]
PREPOSITION DEF NOUN
DEF:ADJECTIVE-M,S,ABS:
[[[u'DEF', []]], [u'ADJECTIVE', [u'M', u'S', u'ABS']], []]
DEF ADJECTIVE
The analysis is returned as a list of pairs [[part-of-speech
morphological-features] ...] by the method t.getBguTag(). If we are
only interested in the parts of speech (without the features such as "M"
for "masculine" or "S" for "singular"), we use instead the method
t.getPosTag(), which returns a list of POS tags whose length is the
number of words into which the original token has been split.
- Write a function getWords(w, t) which, given a Hebrew word
and its Bgu Tag returns a list of words - prefix1, prefix2.., base.
The function should not attempt to remove the suffixes.
- The "ha" definite article is tricky in this function because it sometimes
will be returned in the list of tags (as the DEF POS tag) even though there
is NO letter in the original word that corresponds to it.
This is because in Hebrew the sequence:
"be"+"ha"+"bait" becomes "babait" (the "ha" is dropped).
In this case, your function should recover the missing "ha" on the basis
of the tag analysis.
- Note that this code must handle Hebrew strings (encoded as UTF-8
strings in Python) in order to compare the prefixes observed in the word
with the 13 possible formative words.
- Given the function getWords(w, t), compile a version of the haaretz
corpus where all prefixes have been properly segmented into separate
words. Call this file haaretz-segmented.txt.
- Compile language models similar to small_1w and small_2w on the
segmented Haaretz corpus.
- Test the accuracy of the segment2 method when used on the
Haaretz corpus where all the 53K sentences are turned into a
single string with no delimiter and no punctuation:
segment2(u'אםתבוצעהרפורמההרחבהבמשקהסובייטישהדיוןבהעמדלהיפתחאתמולתהיהזאתהמהפיכההגדולהביותרבכלכלתברי"םמאזהנא"ףהתוכניתהכלכליתהחדשהשללניןלפניכ07שנה')
Original sentence:
אם תבוצע הרפורמה הרחבה במשק הסובייטי שהדיון בה עמד להיפתח אתמול
תהיה זאת המהפיכה הגדולה ביותר בכלכלת ברי"ם מאז הנא"ף התוכנית
הכלכלית החדשה של לנין לפני כ07 שנה
Word segmented sentence:
אם תבוצע ה רפורמה ה רחבה ב ה משק ה סובייטי ש ה דיון ב ה עמד להיפתח אתמול
תהיה זאת ה מהפיכה ה גדולה ב יותר ב ה כלכלת ברי"ם מ אז ה נא"ף ה תוכנית
ה כלכלית ה חדשה של לנין לפני כ 07 שנה
- Derive a function from segment2 that will only segment real
Hebrew tokens (rather than whole sentences aggregated into one string as shown above).
- Evaluate the accuracy of this word segmenter on the haaretz corpus
on a test dataset.
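As a starting point for getWords, here is a heavily simplified sketch. It assumes every non-DEF prefix tag consumes exactly one letter (which is not true of multi-letter prefixes such as "kshe"; a full solution must match the observed prefixes against the 13 formative words), and it recovers the dropped "ha" from a DEF tag whose letter is absent:

```python
def get_words(word, prefix_tags):
    """Split a Hebrew token into prefix words + base word, given its
    list of prefix POS tags (e.g. ['PREPOSITION', 'DEF']).
    Simplifying assumption: each non-DEF prefix consumes one letter."""
    HE = u'\u05d4'  # the letter he, the 'ha' definite article
    parts, rest = [], word
    for tag in prefix_tags:
        if tag == 'DEF' and not rest.startswith(HE):
            parts.append(HE)  # recover the dropped 'ha' from the tag
        else:
            parts.append(rest[0])
            rest = rest[1:]
    parts.append(rest)
    return parts
```

For example, for the token "bamashek" (במשק) tagged PREPOSITION+DEF:NOUN, the function returns the three words be + ha + mashek, recovering the dropped "ha".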
Last modified 14 Nov 2014