Natural Language Processing - Class 5 - Parts of Speech Taggers

Tagging

How is it done?

Methods: using manualy tagged corpora (Brown Corpus, the Penn Treebank):
improved learning techniques: Markov models, decision trees, connectionist machines, transformations, nearest neighbor algorithms and max entropy.
Taggers yield quite similar results: same information in different forms?


The  can        will       rust
--------------------------------
det  modal-verb modal-verb noun
     noun       noun       verb
     verb       verb

We will look at some of the methods:
Historically - tagging began with manually engineered rules for tagging, sometimes with the aid of corpus.

Some different points of view:

Three main steps in tagging

  1. lexicalizing
  2. tagging unknown words
  3. assigning the k-best tags to a word.

Requirements from a taggers


Unigram Tagging

The simplest: every word assigned its most likely a POS regardless of the context in which it appears.
Words that dont appear in lexicon can be tagged from simple manually-derived heuristics to guess the proper tag of the word. (or be interpreted as Proper nouns).

Bad example:

The bird flies/V over the flower.
The flies/V like a flower.
Accuracy: 90-94%

Improve: use some local context:

How to calculate context?


Transformational tagging

Starting point: a simple method does well:

Initial-state annotators for POS

Make it more accurate:
A sequence of rules are applied that change the tags of a word, based on the context.

Rule formats are kept very simple to make them easy to learn:

Change the tag of a word X to Y if the tag of the previous word is Z 
The can will rust example:
"change modal-verb to noun after det" will assign.
The rules are applied determenistically, in the order they appear. The window of corrections is of 3 words.
A tagger can be found here.

Drawback: for 40 possible tags: 40**3 rules = 6.4*10**4
Must be calculated on big corpora and see which are the most useful rules, and then chooses the rules that make good changes after the revisions were first made..
Properties: efficiency, tagging w/o predefined corpus.

In more details:

Transformation-based error-driven learning

This method is common in NLP: tagging, prepositional phrase attachment disambiguation and syntactic parsing [Brill], machine translation.

Tagging:

Developed by Eric Brill, then at Penn, now at Johns-Hopkins.
Basic idea: Possible annotations: Compare this initial corpus to the truth, i.e., a manually annotated corpus.
Form an ordered list of transformations, each of which brings the corpus closer to the truth.

At each iteration of learining, the transformation that assigned highest-score is found, and is added to the ordred transformation list: needs a space of transformations, scoring functions for comparisons with the truth.

The ordered list of transformations, plus the initial-state annotator, is a learned model of POS tagging.

POS transformations

How do we acquire a sequence of transformations?

Examples for Contextual (final-state) transformation templates

Change the tag of word w from A to B when:

Unknown words

Unknown-word (start-state) transformation templates

Assign tag A to word w if/
or
Change the tag of word w from A to B if:

Stochastic Taggers

Probablistic/Statistical Terms:

Measure how ofter a particular event occurs.

Example

The word flies occurs in a corpus 1000 times. 600 as a verb, 400 as a noun:
  - P(V) = 0.6 (i.e. P(flies is a verb)
  - P(N) = 0.4 (i.e. P(flies is a noun)
flies occurs in the beginning of a sentence 100 times: 95 of these, as a noun:
  - P(N & S) = 95/1000 = .095 
(i.e. P(flies noun at the beginning of a sentence))

The target in statistical taggers:
for a sequence of w1....wn words, find the sequence of POS categories
C1....Cn that maximizes p(C1...Cn|W1....Wn)
Bayes rule: P(C1...Cn)*P(W1....Wn|C1...Cn) / P(W1....Wn) Difficult to calculate. Approximate each factor by applying some independence assumptions: Unigram: P(C1...Cn)=~P(C1)*...*p(Cn) Bigram: P(Ci|Ci-1) Estimator: P(C1,....Cn)=~P(C1|C2)*P(C2|C3)*...*P(Cn|Cn-1)

Cannonical Statistical Taggers:

N-Gram Tagging

most widely used.
Main algorithm:
Given a word sequence W - try to find the tag sequence T that maximizes P(T|W).
*This can be done using the Viterbi algorithm to find the T that maximizes:
P(T)*P(W|T)
Unseen words are handles with suffix information.

More detailed:
Let t vary over all possible tags. The most common tag for the ith word of a sentence, wi, is the one that maximizes the probablility:
p(t|wi)

This algorithm solves the tagging probelm for a single word by finding:

arg maxtp(t|wi)
(find the t that maximizes the following quantity: the probability of a tag given the word.

This can be extended by looking for the sequence of tags that maximize the product of the individual word probabilities:
arg max BIG_PAI tp(ti|wi)

Tags these days have 96-97% accurancy .
But this type of tagging fail in 'the can..' example - making can a modal.

To allow a bit context:

Collect statistics on the probability of tag ti following tag ti-1.

arg max BIG_PAI tp(ti|ti-1)p(wi|ti)

This is a product of the probablilities of each word but with two probablilities: the probablitiy of a tag ti given its possible tags, and the probablitiy of a tag given its previuos tag (the context): p(ti|ti-1)

The above equation corresponds to a well understood mathematical construct: Hidden Markov Models (HMM): a finite automaton in which a state tarnsitions have probablitities and whose output is also probablisitic.
: HMM automatum

This allows seeing the tagging problem as simply the following way:
Given a string of words, find the sequence of states the machine went through in order to output the sentence at hand.
These are 2 possible ways : linear equation (Viterbi algorithm) or the automata.

What happens when allowed unkown words in input? for such words p(t|wi) is zero. Methods:

Maximum Entropy Tagging

This framework is a probabilistic framework, a model is found that is consistent with the observed data and is maximally agnostic with respect to all parameters for which no data exists.
Such a tagger can be found here.

References


Back to the course home page

Last modified April 25, 1999 Yael Dahan Netzer


For any question, contact me: yaeln@cs.bgu.ac.il
Back to course homepage

Last modified , 2000