Tagging

This lecture discusses computational methods to perform part-of-speech tagging of free text.
  1. Definition of the Task
  2. Installing NLTK
  3. Instant Python
  4. The FreqDist NLTK Object: Counting Things
  5. Accessing a Corpus
  6. Perplexity of the POST task
  7. Measuring success: Accuracy, Training Dataset, Test Dataset
  8. Sources of Knowledge to Improve Tagging Accuracy
  9. Combining Taggers: Backoff Tagger
  10. Default Tagger
  11. Lookup Tagger: Using Dictionary Knowledge
  12. Using Morphological Clues
  13. Looking at the Context
  14. Summary

Definition of the Task

One of the most basic and most useful tasks when processing text is to tokenize the text into separate words and to label each word with its most likely part of speech. This task is called part-of-speech tagging (POST).

Refer to the Wikipedia presentation for a short definition of the task of parts of speech tagging.

We will follow the step-by-step description introduced in Chapter 5 of Natural Language Processing with Python --- Analyzing Text with the Natural Language Toolkit by Steven Bird, Ewan Klein, and Edward Loper (2009).

For practical experimentation, refer to the NLTK HowTo page on tagging.

This task will give us the opportunity to first address methods of statistical analysis and machine learning. Machine learning builds computational tools that generalize from data observed over a collection of examples. In the case of POST, an influential dataset used in many studies was published in 1967, known as the Brown Corpus. The Brown Corpus defined a tagset (a specific collection of part-of-speech labels) that has been reused in many other annotated resources in English.


Installing NLTK

NLTK (Natural Language ToolKit) is a collection of open source Python modules, linguistic data and documentation for research and development in natural language processing. It provides an excellent combination of hands-on access to real-life data, tools and explanations.

To install NLTK on your machine, follow these instructions.

Make sure to also install the datasets that come with NLTK as explained here.


Instant Python

We'll pick up Python as we go. If you need a quick introduction to the language, look up the Python Tutorial. (Do not yet use Python 3.x - remain with the latest 2.7 version.)

If you ever think you need to develop something, first check whether it exists in the Python standard library.

The Google intro to Python is a very good way to start using Python rapidly.

Note that Python is an interpreted language, so it is easy to just experiment - you don't need a complex development environment to use it. But if you want to invest some time in setting up your Python environment, read this article.


The FreqDist NLTK Object: Counting Things

One of the most useful things to do with corpora is to count things: count words, count words with specific properties, count sequences, count tags. NLTK defines a really useful class for this: nltk.FreqDist. (There is a closely related class in the standard Python library called Counter.)

FreqDist (Frequency Distribution) is an object that counts occurrences of objects. It is defined in the nltk.probability module.

>>> from nltk.probability import FreqDist
>>> samples = ['a', 'b', 'a']
>>> fdist = FreqDist(samples)
>>> fdist
<FreqDist with 3 outcomes>
>>> fdist['a']  # We saw 'a' 2 times
2
>>> fdist['b']  # We saw 'b' 1 time
1
>>> fdist['c']  # We saw 'c' 0 times
0
One can get frequencies, the maximum, and samples sorted by decreasing frequency out of a FreqDist:
>>> fdist.max()      # 'a' is the most frequent sample seen
'a'
>>> fdist.keys()     # Return the samples in decreasing frequency
['a', 'b']
>>> fdist.freq('a')  # 2/3 of the samples we saw were 'a' 
0.66666666666666663
>>> fdist.N()        # How many samples did we count?
3
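For comparison, the standard-library collections.Counter class mentioned above supports very similar counting operations. A minimal sketch (Counter is available from Python 2.7 on):
>>> from collections import Counter
>>> counts = Counter(['a', 'b', 'a'])
>>> counts['a']              # same count as fdist['a']
2
>>> counts.most_common(1)    # most frequent sample, with its count
[('a', 2)]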

Accessing a Corpus

NLTK contains a collection of tagged corpora, arranged as convenient Python objects. We will use the Brown corpus in this experiment.
>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_sents(categories='news')
>>> brown_news_words = brown.tagged_words(categories='news')
The tagged_sents version of the corpus is a list of sentences. Each sentence is a list of pairs (tuples) (word, tag).

Similarly, one can access the corpus as a flat list of tagged words.
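For instance, the first tagged sentence of the news section begins with pairs like these (the tags shown here are from the full Brown tagset; with the simplified tagset described below they would look different):

>>> brown_news_tagged[0][:5]
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL')]
>>> brown_news_words[:2]
[('The', 'AT'), ('Fulton', 'NP-TL')]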

>>> fdistw = FreqDist([w for (w, t) in brown_news_words])
>>> fdistw.N()    # We saw 100,554 words in this section of the corpus
100554
Let us now check: which word is the most common among the about 100,000 words in this part of the Brown corpus? Which tag is the most common?
>>> fdistw.max()
'the'
# Let us count the words without distinction of upper/lower case and the tags
>>> fdistwl = FreqDist([w.lower() for (w, t) in brown_news_words])
>>> fdistt = FreqDist([t for (w, t) in brown_news_words])

>>> len(fdistw)  # How many distinct words are there
14394
>>> len(fdistwl)  # How many distinct words when we ignore case distinctions ('the' same as 'The')
13112

>>> fdistwl['the']   # How often does 'the' occur in the corpus
6386

>>> print '%5.2f%%' % (fdistwl.freq('the') * 100)
6.35%    # 6.35% of the words are 'the'

>>> print '%5.2f%%' % (fdistt.freq('NN') * 100)
13.09%   # 13.09% of the words are tagged as 'NN'

Perplexity of the POST task

The first question we address about the task of POS tagging is: how complex is it?

One way to quantify the complexity of a task is to measure its perplexity. The intuitive meaning of perplexity is: when the tagger is considering which tag to associate to a word, how "confused" is it? That is, how many options does it have to choose from?

Obviously, this depends on the number of tags in the tagset. Many recent studies of POST use a simplified version of the Brown tagset, which contains only 19 tags. This tagset is shown in the following table (from Chapter 5 of the NLTK book):
Tag Meaning Examples
ADJ adjective new, good, high, special, big, local
ADV adverb really, already, still, early, now
CNJ conjunction and, or, but, if, while, although
DET determiner the, a, some, most, every, no
EX existential there, there's
FW foreign word dolce, ersatz, esprit, quo, maitre
MOD modal verb will, can, would, may, must, should
N noun year, home, costs, time, education
NP proper noun Alison, Africa, April, Washington
NUM number twenty-four, fourth, 1991, 14:24
PRO pronoun he, their, her, its, my, I, us
P preposition on, of, at, with, by, into, under
TO the word to to
UH interjection ah, bang, ha, whee, hmpf, oops
V verb is, has, get, do, make, see, run
VD past tense said, took, told, made, asked
VG present participle making, going, playing, working
VN past participle given, taken, begun, sung
WH wh determiner who, which, when, what, where, how

The corpus methods in NLTK allow us to work with the simplified tagset using the following options:

>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_sents(categories='news', simplify_tags=True)
>>> brown_news_words = brown.tagged_words(categories='news', simplify_tags=True)

In the absence of any knowledge about words and tags, the perplexity of the task with a tagset of size 19 will be 19. We will see that adding knowledge will reduce the perplexity of the task.
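To make the notion concrete: perplexity is 2 raised to the power of the entropy of a distribution. The following small sketch (not part of NLTK; the function name is ours) can be applied to a tag frequency distribution such as fdistt computed above. For a uniform distribution over 19 tags it returns 19 (up to floating-point rounding); for the skewed distribution actually observed in the corpus it returns a smaller value:

>>> import math
>>> def perplexity(fdist):
...     # 2 ** H, where H = -sum p(s) * log2(p(s)) is the entropy of the distribution
...     return 2 ** -sum(fdist.freq(s) * math.log(fdist.freq(s), 2) for s in fdist)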

Measuring success: Accuracy, Training Dataset, Test Dataset

Assume we develop a tagger. How do we know how successful it is? Can we trust the decisions the tagger makes? The way to address these issues is to define a criterion for success, and to test the tagger on a large test dataset. Assume we have a large dataset of 1M words with their tags assigned manually. We first split the dataset into 2 parts: one part from which we will "learn" facts about the words and their tags (we call this the training dataset), and one part which we use to test the results of our tagger (we call this the test dataset). It is critical NOT to test our tagger on the training dataset -- because we want to test whether the tagger is able to generalize from data it has seen and make decisions on unseen data. (A "stupid" tagger could learn the exact data seen in the training dataset "by heart", and respond exactly as shown when asked about training data -- it would get a perfect score on the training data, but a poor score on any unseen data.) This is one way to split the dataset into training and testing:
>>> brown_train = brown_news_tagged[100:]
>>> brown_test = brown_news_tagged[:100]
>>> test_sent = nltk.tag.untag(brown_test[0])
To measure success in this task, we will measure accuracy. The tagger objects in NLTK include a method called evaluate that measures the accuracy of a tagger on a given test set.
# A default tagger assigns the same tag to all words
>>> default_tagger = nltk.DefaultTagger('XYZ')
>>> default_tagger.tag('This is a test'.split())
[('This', 'XYZ'), ('is', 'XYZ'), ('a', 'XYZ'), ('test', 'XYZ')]
Since 'NN' is the most frequent tag in the Brown corpus, we can use a tagger that assigns 'NN' to all words as a baseline.
>>> default_tagger = nltk.DefaultTagger('NN')
>>> default_tagger.tag(test_sent) 
[('The', 'NN'), ('Fulton', 'NN'), ('County', 'NN'), ('Grand', 'NN'), ('Jury', 'NN'),
('said', 'NN'), ('Friday', 'NN'), ('an', 'NN'), ('investigation', 'NN'), ('of', 'NN'),
("Atlanta's", 'NN'), ('recent', 'NN'), ('primary', 'NN'), ('election', 'NN'),
('produced', 'NN'), ('``', 'NN'), ('no', 'NN'), ('evidence', 'NN'), ("''", 'NN'),
('that', 'NN'), ('any', 'NN'), ('irregularities', 'NN'), ('took', 'NN'),
('place', 'NN'), ('.', 'NN')]
Using this baseline, we achieve a low accuracy:
>>> print 'Accuracy: %4.1f%%' % (
...     100.0 * default_tagger.evaluate(brown_test))
Accuracy: 14.6%
Still, note that with very little knowledge (knowing only the most frequent tag in the dataset) we improved the expected accuracy from picking one out of 19 answers at random to the equivalent of picking one out of about 7 answers (14.6% is roughly 1/7).

The computation of the accuracy compares the answers produced by the tagger with those listed in the "gold standard" dataset. This is defined as follows in the NLTK code (the izip function used below comes from Python's itertools package; given two iterable collections, it returns an iterator over pairs of corresponding elements; iterators are "consumed" by code such as "for x, y in iterator"):

from itertools import izip

def accuracy(reference, test):
    if len(reference) != len(test):
        raise ValueError("Lists must have the same length.")
    num_correct = 0
    for x, y in izip(reference, test):
        if x == y:
            num_correct += 1
    return float(num_correct) / len(reference)
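To make the computation concrete, here is a tiny worked example (assuming the accuracy function above has been defined in the session):

>>> ref  = [('The', 'AT'), ('box', 'NN'), ('is', 'BEZ'), ('.', '.')]
>>> test = [('The', 'AT'), ('box', 'VB'), ('is', 'BEZ'), ('.', '.')]
>>> accuracy(ref, test)   # 3 of the 4 (word, tag) pairs match
0.75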

Sources of Knowledge to Improve Tagging Accuracy

Intuitively, the sources of knowledge that can help us decide what the tag of a word should be include: the word itself (a dictionary can list the possible tags of each word), the internal structure of the word (morphological clues such as suffixes), and the context in which the word appears (the neighboring words and their tags). We will now develop a sequence of taggers that use these knowledge sources, and combine them. The taggers we develop will implement the nltk.tag.TaggerI interface:
class TaggerI(object): 
  def tag(self, tokens): 
    raise NotImplementedError() 
    
  def batch_tag(self, sentences): 
    return [self.tag(sent) for sent in sentences] 

  def evaluate(self, gold): 
    tagged_sents = self.batch_tag([untag(sent) for sent in gold]) 
    gold_tokens = sum(gold, []) 
    test_tokens = sum(tagged_sents, []) 
    return accuracy(gold_tokens, test_tokens) 

Combining Taggers: Backoff Tagger

We will see as we proceed that there is a tradeoff in machine learning between precision and recall: a classifier has high precision if when it makes a decision, it is often right; a classifier has high recall if it can make the right decision on most of the questions.

To understand the difference, consider that a classifier may be given the option to "pass" on difficult questions. A cautious classifier will only commit to an answer if it is sure of the answer. Such a classifier could reach high precision, but may run the risk of low recall. In contrast, a risky classifier will suggest an answer even if it is not sure of the answer. Such a classifier could reach high recall, but runs the risk of low precision.

One way to balance between different knowledge sources is to combine the classifiers they produce. For example, a tagger that looks at the morphological structure of words and one that looks at the neighbors of the words to be classified use completely different knowledge sources. We say they are orthogonal.

One would want to combine the 2 classifiers into a single classifier that is "better" than either one alone. Intuitively, it makes sense that the two sources of knowledge should produce better decisions than either one alone. How can we encode such a combination?

One way is to use a chain of classifiers ordered in a backoff sequence: the first classifier, the most "picky" one, starts. If it decides to "pass", it asks another, less picky tagger to contribute. The chain can be as long as needed. This method of tagger combination is implemented in the module nltk.tag.sequential. The class SequentialBackoffTagger inherits from the TaggerI interface.

class SequentialBackoffTagger(TaggerI):
    """
    An abstract base class for taggers that tags words sequentially,
    left to right.  Tagging of individual words is performed by the
    method L{choose_tag()}, which should be defined by subclasses.  If
    a tagger is unable to determine a tag for the specified token,
    then its backoff tagger is consulted.
    """
    def __init__(self, backoff=None):
        if backoff is None:
            self._taggers = [self]
        else:
            self._taggers = [self] + backoff._taggers

    def _get_backoff(self):
        if len(self._taggers) < 2: return None
        else: return self._taggers[1]

    backoff = property(_get_backoff, doc='''The backoff tagger for this tagger.''')

    def tag(self, tokens):
        # docs inherited from TaggerI
        tags = []
        for i in range(len(tokens)):
            tags.append(self.tag_one(tokens, i, tags))
        return zip(tokens, tags)

    def tag_one(self, tokens, index, history):
        """
        Determine an appropriate tag for the specified token, and
        return that tag.  If this tagger is unable to determine a tag
        for the specified token, then its backoff tagger is consulted.
        """
        tag = None
        for tagger in self._taggers:
            tag = tagger.choose_tag(tokens, index, history)
            if tag is not None:  break
        return tag

    def choose_tag(self, tokens, index, history):
        """
        Decide which tag should be used for the specified token, and
        return that tag.  If this tagger is unable to determine a tag
        for the specified token, return C{None} -- do I{not} consult
        the backoff tagger.  This method should be overridden by
        subclasses of C{SequentialBackoffTagger}.
        """
        raise AssertionError('SequentialBackoffTagger is an abstract class')

This class encodes 2 properties of taggers:

  1. Backoff: They try to tag a word, and if they are not sure, they pass to a backoff tagger.
  2. Sequential: They tag words from left to right, in order, and at each step, they can look at the decisions already made for the previous words within the sentence.

Default Tagger

Here is the simplest implementation of this method:
class DefaultTagger(SequentialBackoffTagger, yaml.YAMLObject): 
  """ 
  A tagger that assigns the same tag to every token. 
  """ 
  yaml_tag = '!nltk.DefaultTagger' 

  def __init__(self, tag): 
    """ 
    Construct a new tagger that assigns C{tag} to all tokens. 
    """ 
    self._tag = tag 
    SequentialBackoffTagger.__init__(self, None) 

  def choose_tag(self, tokens, index, history): 
    return self._tag  # ignore token and history 

  def __repr__(self): 
    return '<DefaultTagger: tag=%s>' % self._tag 

This is the default tagger we used above. When it is given the 'NN' tag, it achieves an accuracy of about 15%. This tagger is not discriminative at all: it does not look at any of the input data, but it always produces an answer. It is, therefore, a good choice as the last member of a chain of taggers organized in a backoff chain: when all other methods cannot decide, choose a reasonable default (which is better than not deciding at all, or deciding completely randomly).

Lookup Tagger: Using Dictionary Knowledge

Assume we have a dictionary that lists the possible tags for each word in English. Could we use this information to perform better tagging?

The intuition is that we would only assign to a word a tag that it can have in the dictionary. For example, if "box" can only be a Verb or a Noun, when we have to tag an instance of the word "box", we only choose between 2 options - and not between 19 options. Thus, dictionary knowledge will reduce the perplexity of the task.
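To get a feel for how much a dictionary constrains the choice, we can count how many distinct tags each word actually receives in the corpus. A sketch (the variable names are ours); comparing len(ambiguous) with len(cfd.conditions()) shows what fraction of the vocabulary is ambiguous in this sense:

>>> cfd = nltk.ConditionalFreqDist((w.lower(), t) for (w, t) in brown_news_words)
>>> ambiguous = [w for w in cfd.conditions() if len(cfd[w]) > 1]   # word types observed with more than one tag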

There are 3 issues we must address to turn this into working code:

  1. Where do we get the dictionary?
  2. How do we choose between the various tags associated to a word in the dictionary? (For example, how do we choose between V and N for "box").
  3. What do we do for words that do not appear in the dictionary?

The simple solutions we will test are the following - note that for each question, there exist other strategies that we will investigate later:

  1. Where do we get the dictionary: we will learn it from a sample dataset.
  2. How do we choose between the various tags associated to a word in the dictionary: we will choose the most likely tag as observed in the sample dataset.
  3. What do we do for words that do not appear in the dictionary: we will pass unknown words to a backoff tagger.

The nltk.UnigramTagger implements this overall strategy. It must be trained on a dataset, from which it builds a model of "unigrams". The following code shows how it is used:

# Prepare training and test datasets
>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_sents(categories='news')
>>> brown_train = brown_news_tagged[100:]
>>> brown_test = brown_news_tagged[:100]

# Train the unigram model
>>> unigram_tagger = nltk.UnigramTagger(brown_train)
>>> unigram_tagger.tag(nltk.tag.untag(brown_test[0]))
[('The', 'AT'), ('Fulton', None), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), 
('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN'), ("Atlanta's", 'NP$'), ('recent', 'JJ'), 
('primary', 'NN'), ('election', 'NN'), ('produced', 'VBD'), ('``', '``'), ('no', 'AT'), ('evidence', 'NN'), 
("''", "''"), ('that', 'CS'), ('any', 'DTI'), ('irregularities', None), ('took', 'VBD'), 
('place', 'NN'), ('.', '.')]
Note that the unigram tagger leaves some words tagged as None -- these are unknown words, words that were not observed in the training dataset.

How successful is this tagger?

>>> print 'Unigram tagger accuracy: %4.1f%%' % ( 100.0 * unigram_tagger.evaluate(brown_test))
Unigram tagger accuracy: 85.6%
This is quite an improvement on the 15% of the default tagger. And this is without any backoff and without using morphological clues.

Is 85.6% a good level of accuracy? In fact it is not. It is accuracy per word. It means in particular that on average, in every sentence of about 20 words, we will accumulate 3 errors. 3 errors in each sentence is a very high error rate. It makes it difficult to run another task on the output of such a tagger. Think how difficult the life of a parser would be if 3 words in every sentence are wrongly tagged. The problem is known as the "pipeline effect" -- when language processing tools are chained in a pipeline, error rates accumulate from module to module.
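As a rough sanity check of this claim (assuming, unrealistically, that errors on different words are independent):

>>> print 'Expected errors in a 20-word sentence: %.1f' % (20 * (1 - 0.856))
Expected errors in a 20-word sentence: 2.9
>>> print 'Chance of an error-free 20-word sentence: %.3f' % (0.856 ** 20)
Chance of an error-free 20-word sentence: 0.045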

How much would a good backoff help? Let's try first to add the NN-default tagger as a backoff:

>>> nn_tagger = nltk.DefaultTagger('NN')
>>> ut2 = nltk.UnigramTagger(brown_train, backoff=nn_tagger)
>>> print 'Unigram tagger with backoff accuracy: %4.1f%%' % ( 100.0 * ut2.evaluate(brown_test))
Unigram tagger with backoff accuracy: 86.8%
Adding a simple backoff (with accuracy of 15%) improved accuracy from 85.6% to 86.8%. One way to report this is in terms of error reduction: the error rate went down from 14.4% (100-85.6) to 13.2%. That's an absolute error reduction of 14.4-13.2 = 1.2%. Error reduction is generally reported as a percentage of the error: 100.0 * (1.2 / 14.4) = 8.3% error reduction. In other words, out of the words not tagged by the original model (with no backoff), 8.3% were corrected by the backoff.

What can we say about this error reduction? It is disappointing. We would have expected, all things being equal, to be able to tag the Unknown words with a success rate close to 15% (the success rate of the backoff tagger). But we get only half of that accuracy on the unknown words. What does that mean?

One lesson to learn from this is that the distribution of unknown words is significantly different from the distribution of all the words in the corpus. For example, we can infer from the figures seen so far that most NN words were already caught by the unigram tagger. As a consequence, the rate of NNs among unknown words is much lower than in the general word population.
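We can check this directly by looking at the gold tags of test words that never occur in the training data. A sketch (the variable names are ours; we do not report the numbers here):

>>> known = set(w for sent in brown_train for (w, t) in sent)
>>> unknown_tags = FreqDist(t for sent in brown_test for (w, t) in sent if w not in known)
>>> nn_rate = unknown_tags.freq('NN')   # compare with the overall 'NN' rate of about 13% computed earlier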

Using Morphological Clues

As mentioned above, another knowledge source to perform tagging is to look at the letter structure of the words. We will look at 2 different methods to use this knowledge. First, we will use nltk.RegexpTagger to recognize specific regular expressions in words.
>>> regexp_tagger = nltk.RegexpTagger(
...     [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers
...      (r'(The|the|A|a|An|an)$', 'AT'),   # articles
...      (r'.*able$', 'JJ'),                # adjectives
...      (r'.*ness$', 'NN'),                # nouns formed from adjectives
...      (r'.*ly$', 'RB'),                  # adverbs
...      (r'.*s$', 'NNS'),                  # plural nouns
...      (r'.*ing$', 'VBG'),                # gerunds
...      (r'.*ed$', 'VBD'),                 # past tense verbs
...      (r'.*', 'NN')                      # nouns (default)
... ])

>>> print 'Regexp accuracy %4.1f%%' % (100.0 * regexp_tagger.evaluate(brown_test))
Regexp accuracy 33.1%
The regular expressions are tested in order. If one matches, it decides the tag; otherwise the next expression is tried. If no expression matches, the tagger returns None. The implementation of the RegexpTagger is quite simple: its choose_tag method (from the SequentialBackoffTagger interface) ignores the history (left context) of the word to be tagged and tries the regular expressions in order.

Note: The notation r'.*able' in Python refers to a "raw string" -- that is, a string in which backslash escape sequences are interpreted differently than in regular strings. For example, '\n' is translated to a newline character by the Python reader, but r'\n' is read as a string of 2 characters containing a backslash and an 'n'. Since regular expressions often contain backslash characters to escape special regular expression characters (such as ., ^, *, +, $, etc.), it is most convenient to write regular expression strings with the "r-prefix" notation.
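A quick check in the interpreter makes the difference visible:

>>> len('\n'), len(r'\n')
(1, 2)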

class RegexpTagger(SequentialBackoffTagger, yaml.YAMLObject):
    """
    A tagger that assigns tags to words based on regular expressions over word strings.
    """
    yaml_tag = '!nltk.RegexpTagger'
    def __init__(self, regexps, backoff=None):
        self._regexps = regexps
        SequentialBackoffTagger.__init__(self, backoff)

    def choose_tag(self, tokens, index, history):
        for regexp, tag in self._regexps:
            if re.match(regexp, tokens[index]): # ignore history
                return tag
        return None

    def __repr__(self):
        return '<Regexp Tagger: size=%d>' % len(self._regexps)
The questions we face when we see such a "rule-based" tagger are:
  1. How do we find the most successful regular expressions?
  2. In which order should we try the regular expressions?
A typical answer to such questions is: let's learn these parameters from a training corpus. The nltk.AffixTagger is a trainable tagger that attempts to learn word patterns. It only looks at the last letters in the words in the training corpus, and counts how often a word suffix can predict the word tag. In other words, we only learn rules of the form ('.*xyz' , POS). This is how the affix tagger is used:
>>> affix_tagger = nltk.AffixTagger(brown_train, backoff=nn_tagger)
>>> print 'Affix tagger accuracy: %4.1f%%' % (100.0 * affix_tagger.evaluate(brown_test))
Affix tagger accuracy: 30.2%
Should we be disappointed that the "data-based approach" performs worse than the hand-written rules? Not necessarily: note that our hand-written rules include cases that the AffixTagger cannot learn -- we match cardinal numbers and suffixes with more than 3 letters. Let us see whether the combination of the 2 taggers helps:
>>> at2 = nltk.AffixTagger(brown_train, backoff=regexp_tagger)
>>> print 'Affix tagger with regexp backoff accuracy: %4.1f%%' % (100.0 * at2.evaluate(brown_test))
Affix tagger with regexp backoff accuracy: 40.1%
This is not bad -- the machine learning helped us reduce the error from 66.9% (regular expressions alone) to 59.9% (about 10% error reduction).

How much does this tagger help the unigram tagger if we use it as a backoff instead of the NN-default tagger?

>>> ut3 = nltk.UnigramTagger(brown_train, backoff=at2)
>>> print 'Unigram with affix backoff accuracy: %4.1f%%' % (100.0 * ut3.evaluate(brown_test))
Unigram with affix backoff accuracy: 89.0%
Accuracy improves from 85.6% to 89.0% -- the error rate drops from 14.4% to 11.0%, which is about 24% error reduction.

At this point, we have combined 2 major sources of information, dictionary and morphology, and obtained about 89% accuracy. The last type of knowledge we will use is syntactic: looking at the context of the word to be tagged to make a decision.

Looking at the Context

The last source of knowledge we want to exploit for better tagging is the context of the word to be tagged. By context, we mean the words that appear around the word to be tagged. The intuition is that if we have to decide between "book" as a verb or a noun, the word preceding "book" can give us strong cues: for example, if it is an article ("the" or "a") then we can be quite confident it is a noun; if it is "to", then we can be quite confident it is a verb.

How can we turn this intuition into working code? The basic approach we will take is to learn "predictive contexts" from our training data set. A context is "predictive" if when we see the context, we can predict the tag of a word with high confidence. The easiest way to detect predictive contexts is to construct a list of contexts - and for each context, keep track of the distribution of tags that follow it.

There are several questions we must address to verify that this strategy will work:

  1. What is a context?
  2. How reliable is the observation of a context in the training set to learn predictions on the test set?
  3. How reliable is the inference from a context to the tag of the next word?
We start with easy answers to these questions -- naturally, more sophisticated answers will lead to better performance:
  1. A context is the list of the N tags that appear before the word to be tagged together with the word to be tagged itself.
  2. A context is reliable if it is seen more often than C times in the training set.
  3. The inference from context to tag is always reliable - we just take the argmax of the distribution (context --> tag).
When we tag unseen text, we tag words from left to right, sentence by sentence. When we reach word number i within a sentence, we remember the history of the decisions we've made up to i, and use these as the context of the next tagging decision.

This strategy is implemented in the following 2 NLTK classes: ContextTagger and NgramTagger. ContextTagger is an abstract class that extends SequentialBackoffTagger and that captures:

  1. The notion of an abstract context (through the abstract method context)
  2. Training to learn a context table (through the method _train)
  3. Tag inference by looking up the context table (method choose_tag)
As usual, if choose_tag cannot make a decision (because the observed context was never seen at training time), the decision is delegated to a backoff tagger.

At the core of the ContextTagger lies the _context_to_tag dictionary. The _train method builds this dictionary by constructing a ConditionalFreqDist object that counts how often each context is mapped to each observed tag.

class ContextTagger(SequentialBackoffTagger):
    """
    An abstract base class for sequential backoff taggers that choose
    a tag for a token based on the value of its "context".  Different
    subclasses are used to define different contexts.

    A C{ContextTagger} chooses the tag for a token by calculating the
    token's context, and looking up the corresponding tag in a table.
    This table can be constructed manually; or it can be automatically
    constructed based on a training corpus, using the L{_train()}
    factory method.

    @ivar _context_to_tag: Dictionary mapping contexts to tags.
    """
    def __init__(self, context_to_tag, backoff=None):
        SequentialBackoffTagger.__init__(self, backoff)
        if context_to_tag:
            self._context_to_tag = context_to_tag
        else:
            self._context_to_tag = {}

    def context(self, tokens, index, history):
        """
        @return: the context that should be used to look up the tag
            for the specified token; or C{None} if the specified token
            should not be handled by this tagger.
        @rtype: (hashable)
        """
        raise AssertionError('Abstract base class')

    def choose_tag(self, tokens, index, history):
        context = self.context(tokens, index, history)
        return self._context_to_tag.get(context)

    def _train(self, tagged_corpus, cutoff=0, verbose=False):
        """
        Initialize this C{ContextTagger}'s L{_context_to_tag} table
        based on the given training data.  In particular, for each
        context C{I{c}} in the training data, set
        C{_context_to_tag[I{c}]} to the most frequent tag for that
        context.  However, exclude any contexts that are already
        tagged perfectly by the backoff tagger(s).

        @param tagged_corpus: A tagged corpus.  Each item should be
            a C{list} of C{(word, tag)} tuples.
        @param cutoff: If the most likely tag for a context occurs
            fewer than C{cutoff} times, then exclude it from the
            context-to-tag table for the new tagger.
        """

        token_count = hit_count = 0

        # A context is considered 'useful' if it's not already tagged
        # perfectly by the backoff tagger.
        useful_contexts = set()
        
        # Count how many times each tag occurs in each context.
        fd = ConditionalFreqDist()
        for sentence in tagged_corpus:
            tokens, tags = zip(*sentence)
            for index, (token, tag) in enumerate(sentence):
                # Record the event.
                token_count += 1
                context = self.context(tokens, index, tags[:index])
                if context is None: continue
                fd[context].inc(tag)
                # If the backoff got it wrong, this context is useful:
                if (self.backoff is None or
                    tag != self.backoff.tag_one(tokens, index, tags[:index])):
                    useful_contexts.add(context)

        # Build the context_to_tag table -- for each context, figure
        # out what the most likely tag is.  Only include contexts that
        # we've seen at least `cutoff` times.
        for context in useful_contexts:
            best_tag = fd[context].max()
            hits = fd[context][best_tag]
            if hits > cutoff:
                self._context_to_tag[context] = best_tag
                hit_count += hits

Note: The following Python calls need to be explained: zip(*sentence) and enumerate(sentence):
# zip(*sentence) splits a list of tuples with n elements each into n flat lists 
>>> zip(*[('a', 1, 'z'), ('b', 2, 'y')])
[('a', 'b'), (1, 2), ('z', 'y')]

# enumerate(list) returns an iterator over pairs of the form (index, item-in-list)
>>> [i for i in enumerate(['a', 'b'])]
[(0, 'a'), (1, 'b')]
The class NgramTagger extends ContextTagger by specifying that the context contains the n-1 previous tags in the sentence together with the word to be tagged.
class NgramTagger(ContextTagger, yaml.YAMLObject):
    """
    A tagger that chooses a token's tag based on its word string and
    on the preceeding I{n} word's tags.  In particular, a tuple
    C{(tags[i-n:i-1], words[i])} is looked up in a table, and the
    corresponding tag is returned.  N-gram taggers are typically
    trained on a tagged corpus.
    """
    def __init__(self, n, train=None, model=None,
                 backoff=None, cutoff=0, verbose=False):
        self._n = n
        ContextTagger.__init__(self, model, backoff)
        if train:
            self._train(train, cutoff, verbose)
            
    def context(self, tokens, index, history):
        tag_context = tuple(history[max(0,index-self._n+1):index])
        return (tag_context, tokens[index])
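To see what such a context looks like, we can call the context method directly on a small example (a sketch; no training data is needed just to compute a context):

>>> bigram = nltk.NgramTagger(2)
>>> bigram.context(['the', 'box'], 1, ['AT'])
(('AT',), 'box')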

How much does such a simple model of context help? Let us try it with our best backoff tagger so far (ut3 is the unigram tagger backed off by the affix tagger, which is itself backed off by the regexp tagger, whose catch-all last rule assigns NN):
# Where we stand before looking at the context
>>> ut3.evaluate(brown_test)
0.89021164021164023

>>> ct2 = nltk.NgramTagger(2, brown_train, backoff=ut3)
>>> ct2.evaluate(brown_test)
0.90343915343915349

>>> ct3 = nltk.NgramTagger(3, brown_train, backoff=ut3)
>>> ct3.evaluate(brown_test)
0.89241622574955903
We find on our dataset that looking at a context of 2 tags in the past improves the accuracy from 89% to 90.3% -- this is 12% error reduction. If we try to use a larger context of 3 tags, we get almost no improvement (from 89.0% to 89.2%). The main problem we face is that of data sparseness: there are not enough samples of each context for the context to help. We will return to this issue in Chapter 3.
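One simple way to fight data sparseness with the classes above is the cutoff parameter of NgramTagger, which discards contexts whose best tag is observed too rarely at training time (the threshold C mentioned earlier). A sketch (we do not report numbers here):

>>> ct2b = nltk.NgramTagger(2, brown_train, cutoff=2, backoff=ut3)
>>> acc = ct2b.evaluate(brown_test)   # compare with ct2 above to see the effect of the cutoff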

Summary

This chapter introduced tools to tag parts of speech in free text.

Parts of speech are classes of words that can be characterized by criteria of different types:

  1. Syntactic: 2 words of the same class can substitute for each other in a sentence and leave the sentence syntactically acceptable
  2. Morphological: words of the same class are inflected in similar manner
  3. Semantic: words of the same class denote entities of similar semantic types (object, action, property, relation)
Tagsets of various granularity can be considered. We looked at the standard Brown corpus tagset - in 2 forms (about 60 tags for the complete tagset and a simplified tagset of 19 tags).

The key point of the approach we investigated is that it is data-driven: we attempt to solve the task by:

  1. Obtain sample data annotated manually: we used the Brown corpus
  2. Define a success metric: we used the definition of accuracy
  3. Measure the adequacy of a method by measuring its success
  4. Measure the complexity of the problem by using the notion of perplexity
The computational methods we developed are characterized by:
  1. We first defined possible knowledge sources that can help us solve the task. Specifically, we investigated dictionary, morphological and context knowledge as possible sources.
  2. We developed computational models that represent each of these knowledge sources in simple data structures (hashtables, frequency distributions, conditional frequency distributions).
  3. We tested simple machine learning methods: data is acquired by inspecting a training dataset, then evaluated by testing on a test dataset.
  4. We investigated one method to combine several systems into a combined system: backoff models.
This methodology will be further developed in the next chapters, as we will address more complex tasks (parsing, summarization) and use more sophisticated machine learning methods.

The task of Parts of Speech tagging is very well studied in English. The most efficient systems obtain accuracy rates of over 98% -- which is comparable to human performance and to the level of agreement generally reached among human taggers. The best systems use better machine learning algorithms (HMM, SVM) and treat unknown words (words not seen in training data) with more sophistication. For example, TnT - A Statistical Part of Speech Tagger by Thorsten Brants (ACL 2000) reports accuracy rates of 97% over the Penn Treebank data.


Last modified Apr 1st, 2012