Parts of Speech Tagging with NLTK

Michael Elhadad NLP 2018

Last Modified 27 Nov 17

Download as Jupyter Notebook

Definition of the Task

One of the most basic and most useful task when processing text is to tokenize each word separately and label each word according to its most likely part of speech. This task is called part of speech tagging (POST).

Refer to the Wikipedia presentation for a short definition of the task of parts of speech tagging.

We will follow the step by step description introduced in Chapter 5 of the Natural Language Processing with Python --- Analyzing Text with the Natural Language Toolkit book by Steven Bird, Ewan Klein, and Edward Loper (2009).

This task will give us the opportunity to first address methods of statistical analysis and of machine learning. Machine learning creates computational tools that generalize data viewed over a collection of examples. In the case of POST, an influential dataset that is used in many studies was published in 1967, known as the Brown Corpus. The Brown Corpus defined a tagset (specific collection of part-of-speech labels) that has been reused in many other annotated resources in English.

More recently, a different version of the tagset has been defined which is called the Universal Parts of Speech Tagset.
One of the objectives of the universal tagset is to define a tagset that can be used uniformly across multiple languages. We will use this tagset in this empirical exploration.

Installing NLTK

NLTK (Natural Language ToolKit) is a collection of open source Python modules, linguistic data and documentation for research and development in natural language processing.It provides excellent combination of hands-on access to data, explanation and real-life data.

To install NLTK on your machine, follow these instructions.

If you installed Python using Anaconda, NLTK comes already installed.

Make sure to also install the datasets that come with NLTK as explained here. From the command line, type:

python -m nltk.downloader all

The FreqDist NLTK Object: Counting Things

One of the most useful thing to do on corpora is to count things: count words, count words with specific properties, count sequences, count tags. NLTK defines a useful class for this: nltk.FreqDist. (There is a closely related class in standard Python called Counter.)

FreqDist (Frequency Distribution) is an object that counts occurrences of objects. It is defined in the nltk.probability module.

In [1]:
from nltk.probability import FreqDist
list = ['a', 'b', 'a']
fdist = FreqDist(list)
FreqDist({'a': 2, 'b': 1})

We can now use the FreqDist to explore the distribution of objects in the stream 'list':

In [2]:
fdist['a']  # We saw 'a' 2 times
In [3]:
fdist['b']  # We saw 'b' 1 time
In [4]:
fdist['c']  # We saw 'c' 0 times
In [5]:
fdist.max()      # 'a' is the most frequent sample seen
In [6]:
len(fdist)      # we observed 2 types of objects in the stream
In [7]:
fdist.keys()     # Return the types of the objects that were observed in the stream
dict_keys(['a', 'b'])
In [8]:
fdist.freq('a')  # 2/3 of the samples we saw were 'a' 
In [9]:
fdist.N()        # How many samples did we count?

Working on the Brown Corpus with NLTK

NLTK contains a collection of tagged corpora, arranged as convenient Python objects. We will use the Brown corpus in this experiment. The tagged_sents version of the corpus is a list of sentences. Each sentence is a list of pairs (tuples) (word, tag). Similarly, one can access the corpus as a flat list of tagged words.

In [10]:
from nltk.corpus import brown
brown_news_tagged = brown.tagged_sents(tagset='universal')
brown_news_words = brown.tagged_words(tagset='universal')

Let us now check: which word is the most common among the about 100,000 words in this part of the Brown corpus? Which tag is the most common?

In [11]:
fdistw = FreqDist([w for (w, t) in brown_news_words])
fdistw.N()    # We saw 1,161,192 words in this section of the corpus
In [12]:
len(fdistw)  # How many distinct words are there|
In [13]:
fdistw.max()  # What is the most frequent word?
In [14]:
fdistw['the']   # How often does 'the' occur in the corpus
In [15]:
print('%5.2f%%' % (fdistw.freq('the') * 100))

We observe that the distribution of word tokens to word types is extremely unbalanced - a single word (the) accounts for over 6% of the word tokens. This is a general observation of linguistic data -- known as Zipf's Law -- few types are extremely frequent, and many types are extremely rare.

Distinguishing Word Type and Work Token

When distinguishing word type and word token, we can decide to consider that strings that vary only because of case variants correspond to the same word type - for example, the token Book and book correspond to the same word type book (and so do bOOk and bOok).

In [16]:
# Let us count the words without distinction of upper/lower case and the tags
fdistwl = FreqDist([w.lower() for (w, t) in brown_news_words])
fdistt = FreqDist([t for (w, t) in brown_news_words])

When we ignore case variants, there are only 49,815 word types instead of 62,713 when case differences are kept:

In [18]:
len(fdistwl)  # How many distinct words when we ignore case distinctions ('the' same as 'The')
In [19]:
print('%5.2f%%' % (fdistt.freq('NOUN') * 100))

Perplexity of the POST task

The first question we address about the task of POS tagging is: how complex is it?

One way to quantify the complexity of a task is to measure its perplexity. The intuitive meaning of perplexity is: when the tagger is considering which tag to associate to a word, how "confused" is it? That is, how many options does it have to choose from?

Obviously, this depends on the number of tags in the tagset. The universal tagset includes 17 tags:

Tag Meaning Examples
ADJ adjective new, good, high, special, big, local
ADV adverb really, already, still, early, now
CONJ conjunction and, or, but, if, while, although
DET determiner the, a, some, most, every, no
X other, foreign words dolce, ersatz, esprit, quo, maitre
NOUN noun year, home, costs, time, education
PROPN proper noun Alison, Africa, April, Washington
NUM numeral twenty-four, fourth, 1991, 14:24
PRON pronoun he, their, her, its, my, I, us
ADP adposition, preposition on, of, at, with, by, into, under
AUX auxiliary verb has (done), is (doing), will (do), should (do), must (do), can (do)
INTJ interjection ah, bang, ha, whee, hmpf, oops
VERB verb is, has, get, do, make, see, run
PART particle possessive marker 's, negation 'not'
SCONJ subordinating conjunction: complementizer, adverbial clause introducer I believe 'that' he will come, if, while
SYM symbol $, %, (C), +, *, /, =, :),

In the absence of any knowledge about words and tags, the perplexity of the task with a tagset of size 17 will be 17. We will see that adding knowledge will reduce the perplexity of the task.

Note that the decision on how to tag a word, without more information is ambiguous for multiple reasons:

  • The same string can be understood as a noun or a verb (book).
  • Some POS tags have a systematically ambiguous definition: a present participle can be used in progressive verb usages (I am going:VERB), but it can also be used in an adjectival position modifying a noun: (A striking:ADJ comparison). In other words, it is unclear in the definition itself of the tag whether the tag refers to a syntactic function or to a morphological property of the word.

Measuring success: Accuracy, Training Dataset, Test Dataset

Assume we develop a tagger. How do we know how successful it is? Can we trust the decisions the tagger makes?

The way to address these issues is to define a criterion for success, and to test the tagger on a large test dataset. Assume we have a large dataset of 1M words with their tags assigned manually. We first split the dataset into 2 parts: one part on which we will "learn" facts about the words and their tags (we call this the training dataset), and one part which we use to test the results of our tagger (we call this the test dataset).

It is critical NOT to test our tagger on the training dataset -- because we want to test whether the tagger is able to generalize from data it has seen and make decision on unseen data. (A "stupid" tagger would learn the exact data seen in the training dataset "by heart", and respond exactly as shown when asked on training data -- it would get a perfect score on the training data, but a poor score on any unseen data.)

This is one way to split the dataset into training and testing:

In [20]:
l = len(brown_news_tagged)
In [21]:
# from nltk.corpus import brown
# brown_news_tagged = brown.tagged_sents(categories='news', tagset='universal')
# brown_news_words = brown.tagged_words(categories='news', tagset='universal')

brown_train = brown_news_tagged[5000:]
brown_test = brown_news_tagged[:5000]

from nltk.tag import untag
test_sent = untag(brown_test[0])
print("Tagged: ", brown_test[0])
print("Untagged: ", test_sent)
Tagged:  [('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'ADP'), ("Atlanta's", 'NOUN'), ('recent', 'ADJ'), ('primary', 'NOUN'), ('election', 'NOUN'), ('produced', 'VERB'), ('``', '.'), ('no', 'DET'), ('evidence', 'NOUN'), ("''", '.'), ('that', 'ADP'), ('any', 'DET'), ('irregularities', 'NOUN'), ('took', 'VERB'), ('place', 'NOUN'), ('.', '.')]
Untagged:  ['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', 'Friday', 'an', 'investigation', 'of', "Atlanta's", 'recent', 'primary', 'election', 'produced', '``', 'no', 'evidence', "''", 'that', 'any', 'irregularities', 'took', 'place', '.']

To measure success, in this task, we will measure accuracy. The tagger object in NLTK includes a method called evaluate to measure the accuracy of a tagger on a given test set.

In [22]:
# A default tagger assigns the same tag to all words
from nltk import DefaultTagger
default_tagger = DefaultTagger('XYZ')
default_tagger.tag('This is a test'.split())
[('This', 'XYZ'), ('is', 'XYZ'), ('a', 'XYZ'), ('test', 'XYZ')]

Since 'NOUN' is the most frequent universal tag in the Brown corpus, we can use a tagger that assigns 'NOUN' to all words as a baseline.

In [23]:
default_tagger = DefaultTagger('NOUN')
[('The', 'NOUN'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'NOUN'), ('Jury', 'NOUN'), ('said', 'NOUN'), ('Friday', 'NOUN'), ('an', 'NOUN'), ('investigation', 'NOUN'), ('of', 'NOUN'), ("Atlanta's", 'NOUN'), ('recent', 'NOUN'), ('primary', 'NOUN'), ('election', 'NOUN'), ('produced', 'NOUN'), ('``', 'NOUN'), ('no', 'NOUN'), ('evidence', 'NOUN'), ("''", 'NOUN'), ('that', 'NOUN'), ('any', 'NOUN'), ('irregularities', 'NOUN'), ('took', 'NOUN'), ('place', 'NOUN'), ('.', 'NOUN')]

Using this baseline, we achieve a low accuracy:

In [24]:
print('Accuracy: %4.1f%%' % (100.0 * default_tagger.evaluate(brown_test)))
Accuracy: 30.2%

Still, note that we improved the expected accuracy from picking one out of 17 answers with no knowledge, to picking one out of about 3 answers with very little knowledge (what is the most frequent tag in the dataset).

The computation of the accuracy compares the answers obtained by the tagger with that listed in the "gold standard" dataset. This is defined as follows in the NLTK code (the izip method in Python is defined in the itertools package, given two iterable collections, it returns an iterator over a collection of pairs - iterators are "consumed" by code such as "for ... in ...iterator...):

def accuracy(reference, test): 
       if len(reference) != len(test): 
           raise ValueError("Lists must have the same length.") 
       num_correct = 0 
       for x, y in izip(reference, test): 
           if x == y: 
               num_correct += 1 
       return float(num_correct) / len(reference)

Sources of Knowledge to Improve Tagging Accuracy

Intuitively, the sources of knowledge that can help us decide what is the tag of a word include:

  • A dictionary that lists the possible parts of speech for each word
  • The context of the word in a sentence (neighboring words)
  • The morphological form of the word (suffixes, prefixes)

We will now develop a sequence of taggers that use these knowledge sources, and combine them. The taggers we develop will implement the nltk.tag.TaggerI interface:

class TaggerI(object): 
  def tag(self, tokens): 
    raise NotImplementedError() 

  def batch_tag(self, sentences): 
    return [self.tag(sent) for sent in sentences] 

  def evaluate(self, gold): 
    tagged_sents = self.batch_tag([untag(sent) for sent in gold]) 
    gold_tokens = sum(gold, []) 
    test_tokens = sum(tagged_sents, []) 
    return accuracy(gold_tokens, test_tokens)

Combining Taggers: Backoff Tagger

We will see as we proceed that there is a tradeoff in machine learning between precision and recall: a classifier has high precision if when it makes a decision, it is often right; a classifier has high recall if it can make the right decision on most of the questions.

To understand the difference, consider that a classifier may be given the option to "pass" on difficult questions. A cautious classifier will only commit to an answer if it is sure of the answer. Such a classifier could reach high precision, but may run the risk of low recall. In contrast, a risky classifier will suggest an answer even if it is not sure of the answer. Such a classifier could reach high recall, but runs the risk of low precision.

One way to balance between different knowledge sources is to combine the classifiers they produce. For example, a tagger that looks at the morphological structure of words and one that looks at the neighbors of the words to be classified use completely different knowledge sources. We say they are orthogonal.

One would want to combine the 2 classifiers into one "better" classifier than either one alone. Intuitively, it makes sense that the two sources of knowledge should produce better decisions than either one alone. How can we encode such combination?

One way is to use a chain of classifiers and order the classifiers in a sequence of backoff: the first classifier, the most "picky" starts. If it decides to "pass", then it asks another, less picky tagger to contribute. The chain can be as long as needed. This method of tagger combination is implemented in the module nltk.tag.sequential. The interface sequentialBackoffTagger inherits from the TaggerI interface.

class SequentialBackoffTagger(TaggerI): 
 An abstract base class for taggers that tags words sequentially, 
 left to right.  Tagging of individual words is performed by the 
 method L{choose_tag()}, which should be defined by subclasses.  If 
 a tagger is unable to determine a tag for the specified token, 
 then its backoff tagger is consulted. 

 def __init__(self, backoff=None): 
   if backoff is None: 
     self._taggers = [self] 
     self._taggers = [self] + backoff._taggers 

   def _get_backoff(self): 
     if len(self._taggers) < 2: return None 
       else: return self._taggers[1] 

   backoff = property(_get_backoff, doc='''The backoff tagger for this tagger.''') 

   def tag(self, tokens): 
      # docs inherited from TaggerI 
       tags = [] 
      for i in range(len(tokens)): 
        tags.append(self.tag_one(tokens, i, tags)) 
      return zip(tokens, tags) 

   def tag_one(self, tokens, index, history): 
     Determine an appropriate tag for the specified token, and 
     return that tag.  If this tagger is unable to determine a tag 
     for the specified token, then its backoff tagger is consulted. 
     tag = None 
     for tagger in self._taggers: 
       tag = tagger.choose_tag(tokens, index, history) 
       if tag is not None:  break 
     return tag 

   def choose_tag(self, tokens, index, history): 
     Decide which tag should be used for the specified token, and 
     return that tag.  If this tagger is unable to determine a tag 
     for the specified token, return C{None} -- do I{not} consult 
     the backoff tagger.  This method should be overridden by 
     subclasses of C{SequentialBackoffTagger}. 
    raise AssertionError('SequentialBackoffTagger is an abstract class')

This class encodes 2 properties of taggers:

  • Backoff: They try to tag a word, and if they are not sure, they pass to a backoff tagger.
  • Sequential: They tag words from left to right, in order, and at each step, they can look at the decisions made in the past words within the sentence.

Default Tagger

Here is the simplest implementation of this method:

class DefaultTagger(SequentialBackoffTagger, yaml.YAMLObject): 
  A tagger that assigns the same tag to every token. 
  yaml_tag = '!nltk.DefaultTagger' 

  def __init__(self, tag): 
    Construct a new tagger that assigns C{tag} to all tokens. 
    self._tag = tag 
    SequentialBackoffTagger.__init__(self, None) 

  def choose_tag(self, tokens, index, history): 
    return self._tag  # ignore token and history 

  def __repr__(self): 
    return '<DefaultTagger: tag=%s>' % self._tag

This is the default tagger we used above. When it is given the 'NOUN' tag, it achieves accuracy of about 31%. This tagger is not discriminative at all: it does not look at any of the input data. It always finds a good candidate. It is, therefore, a good choice as the last member of a chain of taggers organized in a backoff chain. When all other methods cannot decide, then choose a reasonable default (which is better than not deciding at all, or deciding completely randomly).

Lookup Tagger: Using Dictionary Knowledge

Assume we have a dictionary that lists the possible tags for each word in English. Could we use this information to perform better tagging?

The intuition is that we would only assign to a word a tag that it can have in the dictionary. For example, if "box" can only be a Verb or a Noun, when we have to tag an instance of the word "box", we only choose between 2 options - and not between 17 options. Thus, dictionary knowledge will reduce the perplexity of the task.

There are 3 issues we must address to turn this into working code:

  • Where do we get the dictionary?
  • How do we choose between the various tags associated to a word in the dictionary? (For example, how do we choose between VERB and NOUN for "box").
  • What do we do for words that do not appear in the dictionary?

The simple solutions we will test are the following - note that for each question, there exist other strategies that we will investigate later:

  • Where do we get the dictionary: we will learn it from a sample dataset.
  • How do we choose between the various tags associated to a word in the dictionary: we will choose the most likely tag as observed in the sample dataset.
  • What do we do for words that do not appear in the dictionary: we will pass unknown words to a backoff tagger.

The nltk.UnigramTagger implements this overall strategy. It must be trained on a dataset, from which it builds a model of "unigrams". The following code shows how it is used:

In [25]:
# Prepare training and test datasets
# from nltk.corpus import brown
from nltk import UnigramTagger

# brown_news_tagged = brown.tagged_sents(categories='news', tagset='universal')
# brown_train = brown_news_tagged[100:]
# brown_test = brown_news_tagged[:100]

# Train the unigram model
unigram_tagger = UnigramTagger(brown_train)

# Test it on a single sentence
[('The', 'DET'),
 ('Fulton', 'NOUN'),
 ('County', 'NOUN'),
 ('Grand', 'ADJ'),
 ('Jury', 'NOUN'),
 ('said', 'VERB'),
 ('Friday', 'NOUN'),
 ('an', 'DET'),
 ('investigation', 'NOUN'),
 ('of', 'ADP'),
 ("Atlanta's", None),
 ('recent', 'ADJ'),
 ('primary', 'ADJ'),
 ('election', 'NOUN'),
 ('produced', 'VERB'),
 ('``', '.'),
 ('no', 'DET'),
 ('evidence', 'NOUN'),
 ("''", '.'),
 ('that', 'ADP'),
 ('any', 'DET'),
 ('irregularities', 'NOUN'),
 ('took', 'VERB'),
 ('place', 'NOUN'),
 ('.', '.')]

Note that the unigram tagger leaves some words tagged as 'None' -- these are unknown words, words that were not observed in the training dataset.

How successful is this tagger?

In [26]:
print('Unigram tagger accuracy: %4.1f%%' % ( 100.0 * unigram_tagger.evaluate(brown_test)))
Unigram tagger accuracy: 90.5%

90.5% is quite an improvement on the 31% of the default tagger. And this is without any backoff and without using morphological clues.

Is 90.5% a good level of accuracy? In fact it is not. It is accuracy per word. It means that on average, in every sentence of about 20 words, we will accumulate 2 errors. 2 errors in each sentence is a very high error rate. It makes it difficult to run another task on the output of such a tagger. Think how difficult the life of a parser would be if 2 words in every sentence are wrongly tagged. The problem is known as the pipeline effect -- when language processing tools are chained in a pipeline, error rates accumulate from module to module.

How much would a good backoff help? Let's try first to add the NN-default tagger as a backoff:

In [27]:
nn_tagger = DefaultTagger('NOUN')
ut2 = UnigramTagger(brown_train, backoff=nn_tagger)
print('Unigram tagger with backoff accuracy: %4.1f%%' % ( 100.0 * ut2.evaluate(brown_test)))
Unigram tagger with backoff accuracy: 94.5%

Adding a simple backoff (with accuracy of 31%) improved accuracy from 90.5% to 94.5%.

One way to report this is in terms of error reduction: the error rate went down from 9.5% (100-90.5) to 5.5%. That's an absolute error reduction of 9.5-5.5 = 4.0%. Error reduction is generally reported as a percentage of the error: 100.0 * (4.0 / 9.5) = 42.1% relative error reduction.

In other words, out of the words not tagged by the original model (with no backoff), 42.1% were corrected by the backoff.

What can we say about this error reduction? It is different from the accuracy of the backoff (42.1% vs 31%).

One lesson to learn from this is that the distribution of unknown words is significantly different from the distribution of all the words in the corpus.

Using Morphological Clues

As mentioned above, another knowledge source to perform tagging is to look at the letter structure of the words. We will look at 2 different methods to use this knowledge. First, we will use nltk.RegexpTagger to recognize specific regular expressions in words.

In [28]:
from nltk import RegexpTagger

regexp_tagger = RegexpTagger(
     [(r'^-?[0-9]+(.[0-9]+)?$', 'NUM'),   # cardinal numbers
      (r'(The|the|A|a|An|an)$', 'DET'),   # articles
      (r'.*able$', 'ADJ'),                # adjectives
      (r'.*ness$', 'NOUN'),               # nouns formed from adjectives
      (r'.*ly$', 'ADV'),                  # adverbs
      (r'.*s$', 'NOUN'),                  # plural nouns
      (r'.*ing$', 'VERB'),                # gerunds
      (r'.*ed$', 'VERB'),                 # past tense verbs
      (r'.*', 'NOUN')                     # nouns (default)

print('Regexp accuracy %4.1f%%' % (100.0 * regexp_tagger.evaluate(brown_test)))
Regexp accuracy 45.3%

The regular expressions are tested in order. If one matches, it decides the tag. Else it tries the next tag. The implementation of the RegexpTagger is quite simple: the choose_tag method of the SequenceBackoffTagger ignores the history (left context) of the word to be tagged. It tries the regular expressions in order.

Note: The notation r'.*able' in Python refers to a raw string - that is, a string in which backslash escape characters are interpreted in a different manner than those used in regular strings. For example '\n' is translated to a newline character by the Python reader, but r'\n' is read as a string of 2 characters containing a backslash. Since regular expressions often contain backslash characters to escape special regular expression characters (such as ., ^, *, +, $ etc), it is most convenient to encode regular expression strings with the "r-prefix" notation.

class RegexpTagger(SequentialBackoffTagger, yaml.YAMLObject):
    A tagger that assigns tags to words based on regular expressions over word strings.
    yaml_tag = '!nltk.RegexpTagger'
    def __init__(self, regexps, backoff=None):
        self._regexps = regexps
        SequentialBackoffTagger.__init__(self, backoff)

    def choose_tag(self, tokens, index, history):
        for regexp, tag in self._regexps:
            if re.match(regexp, tokens[index]): # ignore history
                return tag
        return None

    def __repr__(self):
        return '<Regexp Tagger: size=%d>' % len(self._regexps)

The question we face when we see such a "rule-based" tagger are:

  • How do we find the most successful regular expressions?
  • In which order should we try the regular expressions?

A typical answer to such questions is:

  • let's learn these parameters from a training corpus.

The nltk.AffixTagger is a trainable tagger that attempts to learn word patterns. It only looks at the last letters in the words in the training corpus, and counts how often a word suffix can predict the word tag. In other words, we only learn rules of the form ('.*xyz' , POS). This is how the affix tagger is used:

In [29]:
from nltk import AffixTagger

affix_tagger = AffixTagger(brown_train, backoff=nn_tagger)
print('Affix tagger accuracy: %4.1f%%' % (100.0 * affix_tagger.evaluate(brown_test)))
Affix tagger accuracy: 39.2%

Should we be disappointed that the "data-based approach" performs worse than the hand-written rules (39.2% vs. 45.3%)?

Not necessarily: note that our hand-written rules include cases that the AffixTagger cannot learn - we match cardinal numbers and suffixes with more than 3 letters.

Let us see whether the combination of the 2 taggers helps:

In [30]:
at2 = AffixTagger(brown_train, backoff=regexp_tagger)
print("Affix tagger with regexp backoff accuracy: %4.1f%%" % (100.0 * at2.evaluate(brown_test)))
Affix tagger with regexp backoff accuracy: 49.3%

This is not bad - the machine learning in AffixTagger helped us reduce the error from 54.7% to 50.7% (about 10% error reduction).

How much does this tagger help the unigram tagger if we use it as a backoff instead of the NOUN-default tagger?

In [31]:
ut3 = UnigramTagger(brown_train, backoff=at2)
print('Unigram with affix backoff accuracy: %4.1f%%' % (100.0 * ut3.evaluate(brown_test)))
Unigram with affix backoff accuracy: 94.6%

The error reduction is from 90.5% to 94.6% -- slightly better that the 94.5% obtained with the NOUN backoff.

At this point, we have combined 2 major sources of information: dictionary and morphology and obtained about 94.6% accuracy.

The last type of knowledge we will use is syntax: look at the context of word to be tagged to make a decision. This means that for the first time we will have a tagger which is capable to generate different tags for the same word type -- that is, a tagger that finally handles the ambiguity of the POS tagging task.

Looking at the Context

The last source of knowledge we want to exploit to better tag is the context of the word to be tagged. By context, we mean the words that appear around the word to be tagged. The intuition is that if we have to decide between "book" as a verb or a noun, the word preceding "books" can give us strong cues: for example, if it is an article ("the" or "a") then we would be sure it is a noun; if it is "to", then we would be sure it is a verb.

How can we turn this intuition into working code? The basic approach we will take is to learn "predictive contexts" from our training data set. A context is "predictive" if when we see the context, we can predict the tag of a word with high confidence. The easiest way to detect predictive contexts is to construct a list of contexts - and for each context, keep track of the distribution of tags that follow it.

There are several questions we must address to verify that this strategy will work:

  • What is a context?
  • How reliable is the observation of a context in the training set to learn predictions on the test set?
  • How reliable is the inference from a context to the tag of the next word?

We start with easy answers to these questions - naturally, more sophisticated answers will lead better performance:

  • A context is the list of the N tags that appear before the word to be tagged together with the word to be tagged itself.
  • A context is reliable if it is seen more often than C times in the training set.
  • The inference from context to tag is always reliable - we just take the argmax of the distribution (context --> tag).

When we tag unseen text, we tag words from left to right, sentence by sentence. When we reach word number i within a sentence, we remember the history of the decisions we've made up to i, and use these as the context of the next tagging decision.

This strategy is implemented in the following 2 nltk classes: ContextTagger and NgramTagger. ContextTagger is an abstract class that extends SeqentialBackoffTagger and that captures:

  • The notion of an abstract context (through the abstract method context)
  • training to learn a context table (through the method _train)
  • Tag inference by looking up the context table (method choose_tag)

As usual, if choose_tag cannot make a decision (because the observed context was never seen at training time), the decision is delegated to a backoff tagger.

At the core of the ContextTagger, lies the context_to_tag dictionary. The train method constructs this dictionary by constructing a CondFreqDist object that counts how often a context is mapped to each observed tag.

class ContextTagger(SequentialBackoffTagger):
    An abstract base class for sequential backoff taggers that choose
    a tag for a token based on the value of its "context".  Different
    subclasses are used to define different contexts.

    A C{ContextTagger} chooses the tag for a token by calculating the
    token's context, and looking up the corresponding tag in a table.
    This table can be constructed manually; or it can be automatically
    constructed based on a training corpus, using the L{_train()}
    factory method.

    @ivar _context_to_tag: Dictionary mapping contexts to tags.
    def __init__(self, context_to_tag, backoff=None):
        SequentialBackoffTagger.__init__(self, backoff)
        if context_to_tag:
            self._context_to_tag = context_to_tag
            self._context_to_tag = {}

    def context(self, tokens, index, history):
        @return: the context that should be used to look up the tag
            for the specified token; or C{None} if the specified token
            should not be handled by this tagger.
        @rtype: (hashable)
        raise AssertionError('Abstract base class')

    def choose_tag(self, tokens, index, history):
        context = self.context(tokens, index, history)
        return self._context_to_tag.get(context)

    def _train(self, tagged_corpus, cutoff=0, verbose=False):
        Initialize this C{ContextTagger}'s L{_context_to_tag} table
        based on the given training data.  In particular, for each
        context C{I{c}} in the training data, set
        C{_context_to_tag[I{c}]} to the most frequent tag for that
        context.  However, exclude any contexts that are already
        tagged perfectly by the backoff tagger(s).

        @param tagged_corpus: A tagged corpus.  Each item should be
            a C{list} of C{(word, tag)} tuples.
        @param cutoff: If the most likely tag for a context occurs
            fewer than C{cutoff} times, then exclude it from the
            context-to-tag table for the new tagger.

        token_count = hit_count = 0

        # A context is considered 'useful' if it's not already tagged
        # perfectly by the backoff tagger.
        useful_contexts = set()

        # Count how many times each tag occurs in each context.
        fd = ConditionalFreqDist()
        for sentence in tagged_corpus:
            tokens, tags = zip(*sentence)
            for index, (token, tag) in enumerate(sentence):
                # Record the event.
                token_count += 1
                context = self.context(tokens, index, tags[:index])
                if context is None: continue
                # If the backoff got it wrong, this context is useful:
                if (self.backoff is None or
                    tag != self.backoff.tag_one(tokens, index, tags[:index])):

        # Build the context_to_tag table -- for each context, figure
        # out what the most likely tag is.  Only include contexts that
        # we've seen at least `cutoff` times.
        for context in useful_contexts:
            best_tag = fd[context].max()
            hits = fd[context][best_tag]
            if hits > cutoff:
                self._context_to_tag[context] = best_tag
                hit_count += hits

Note: The following Python calls need to be explained: zip(*sentence) and enumerate(sentence):

In [32]:
# zip(*sentence) splits a list of tuples with n elements each into n flat lists 
zip(*[('a', 1, 'z'), ('b', 2, 'y')])
<zip at 0x26a2f524208>

Zip returns a generator - it is a lazy function. To consume the generator, we must consume it with for example a for iteration:

In [33]:
for t in zip(*[('a', 1, 'z'), ('b', 2, 'y')]): print(t)
('a', 'b')
(1, 2)
('z', 'y')

Enumerate(generator) returns a new iterator over pairs that are numbered in increasing numbers:

In [38]:
# enumerate(list) returns an iterator over pairs of the form (index, item-in-list)
[i for i in enumerate(['a', 'b'])]
[(0, 'a'), (1, 'b')]

The class NgramTagger extends ContextTagger by specifying that the context to be used contains the n previous tags in the sentence and the word to be tagged.

class NgramTagger(ContextTagger, yaml.YAMLObject):
    A tagger that chooses a token's tag based on its word string and
    on the preceeding I{n} word's tags.  In particular, a tuple
    C{(tags[i-n:i-1], words[i])} is looked up in a table, and the
    corresponding tag is returned.  N-gram taggers are typically
    trained on a tagged corpus.
    def __init__(self, n, train=None, model=None,
                 backoff=None, cutoff=0, verbose=False):
        self._n = n
        ContextTagger.__init__(self, model, backoff)
        if train:
            self._train(train, cutoff, verbose)

    def context(self, tokens, index, history):
        tag_context = tuple(history[max(0,index-self._n+1):index])
        return (tag_context, tokens[index])

How much does such a simple model of context help?

Let us try it with our best backoff tagger so far (ut3 is the unigram tagger backed off by the affix tagger backed off by the regexp tagger backed off by the NN tagger):

In [39]:
# Where we stand before looking at the context
In [40]:
from nltk import NgramTagger

ct2 = NgramTagger(2, brown_train, backoff=ut3)
In [41]:
ct3 = NgramTagger(3, brown_train, backoff=ut3)

We find on our dataset that looking at a context of 2 tags in the past improves the accuracy from 94.6% to 95.4% -- this is 18% error reduction.

If we try to use a larger context of 3 tags, we get less improvement (from 94.6% to 95.2%).

The main problem we face is that of data sparseness: there are not enough samples of each context for the context to help. We will return to this issue in the next lectures.


This chapter introduced tools to tag parts of speech in free text.

Parts of speech are classes of words that can be characterized by criteria of different type:

  • Syntactic: 2 words of the same class can substitute each other in a sentence and leave the sentence syntactically acceptable
  • Morphological: words of the same class are inflected in similar manner
  • Semantic: words of the same class denote entities of similar semantic types (object, action, property, relation)

Tagsets of various granularity can be considered. We mentioned the standard Brown corpus tagset (about 60 tags for the complete tagset) and the reduced universal tagset (17 tags).

The key point of the approach we investigated is that it is data-driven: we attempt to solve the task by:

  • Obtain sample data annotated manually: we used the Brown corpus
  • Define a success metric: we used the definition of accuracy
  • Measure the adequacy of a method by measuring its success
  • Measure the complexity of the problem by using the notion of perplexity

The computational methods we developed are characterized by:

  • We first define possible knowledge sources that can help us solve the task. Specifically, we investigated

    • dictionary,
    • morphological
    • context as possible sources.
  • We developed computational models that represent each of these knowledge sources in simple data structures (hashtables, frequency distributions, conditional frequency distributions).

  • We tested simple machine learning methods: data is acquired by inspecting a training dataset, then evaluated by testing on a test dataset.
  • We investigated one method to combine several systems into a combined system: backoff models.

This methodology will be further developed in the next chapters, as we will address more complex tasks (parsing, summarization) and use more sophisticated machine learning methods.

The task of Parts of Speech tagging is very well studied in English. The most efficient systems obtain accuracy rates of over 98% even on fine granularity tagsets - which is equivalent to the rate of success human beings obtain and the best agreement among human taggers generally obtained. The best systems use better machine learning algorithms (HMM, SVM) and treat unknown words (words not seen in training data) with more sophistication than what we have observed here.