Natural Language Processing - Class 5 - Parts of Speech Taggers
|
Tagging
How is it done?
Methods: using manualy tagged corpora (Brown Corpus,
the Penn Treebank):
improved learning techniques: Markov models, decision trees, connectionist
machines, transformations, nearest neighbor algorithms and max entropy.
Taggers yield quite similar results: same information in different forms?
The can will rust
--------------------------------
det modal-verb modal-verb noun
noun noun verb
verb verb
We will look at some of the methods:
Historically -
tagging began with manually engineered rules for tagging, sometimes with the aid of
corpus.
Some different points of view:
- Stochastic (refers to frequency, statistics..):
- Markov-model based stochastic taggers: High rates of accuracy.
Pros: saves hand-rule-writing, captures information that could be missed by
human.
Cons: linguistic information is captured indirectly.
vs. non-stochastic taggers.
- Rule based taggers, machine-learining techniques
Pros: partialy automated, linguistic info is encoded direcly, good results.
- Supervised refers to the degree of automation of the tarining and tagging process.
- based on pre-tagged corpus.
vs. Unsupervised
- induce word grouping using sophisticated computational methods.
- Corpus-based
- Standard approach for POS taggers today is to automatically acquire a POS
model from a hand-tagged corpus.
Small amounts of local context seem to be sufficient to determine a word's
Part of Speach.
vs. Lexical-based.
- An example: a lexicon-based - for, spanish tagger SPOSTS
uses lexicon and a set of rules.
- Hybrid systems.
Three main steps in tagging
- lexicalizing
- tagging unknown words
- assigning the k-best tags to a word.
Requirements from a taggers
- Robustness - capable to analyze ungrammatical constructions, isolated
phrases, non-linguistic data, unknown words.
- Effiecient - linear in the number of words tagged.
- Accurate
- Tunable - give "hints" on specifiec attributes of different
corpora.
- Reusable - to new corpora and new languages (with minor changes)
Unigram Tagging
The simplest: every word assigned its most likely a POS regardless of
the context in which it appears.
Words that dont appear in lexicon can be tagged from simple
manually-derived heuristics to guess the proper tag of the word.
(or be interpreted as Proper nouns).
Bad example:
The bird flies/V over the flower.
The flies/V like a flower.
Accuracy: 90-94%
Improve: use some local context:
How to calculate context?
Transformational tagging
Starting point: a simple method does well:
Initial-state annotators for POS
- Apply good heuristics
- Use simple (or complex) probabilistic models
E.g., use most frequent POS for each word found in a separate lexicon
corpus
- Apply random tags?
Make it more accurate:
A sequence of rules are applied that change
the tags of a word, based on the context.
Rule formats are kept very simple to make them easy to learn:
Change the tag of a word X to Y if the tag of the previous word is Z
The can will rust example:
"change modal-verb to noun after det" will assign.
The rules are applied determenistically, in the order they appear.
The window of corrections is of 3 words.
A tagger can be found here.
Drawback: for 40 possible tags: 40**3 rules = 6.4*10**4
Must be calculated on big corpora and see which are the most useful rules,
and then chooses the rules that make good changes after the revisions were
first made..
Properties: efficiency, tagging w/o predefined corpus.
In more details:
Transformation-based error-driven learning
This method is common in NLP: tagging, prepositional phrase attachment
disambiguation and syntactic parsing [Brill], machine translation.
Tagging:
Developed by Eric Brill, then at Penn, now at Johns-Hopkins.
Basic idea:
Possible annotations:
- random structure
- up to sophisticated manually created anotator.
Compare this initial corpus to the truth, i.e., a manually annotated
corpus.
Form an ordered list of transformations, each of which brings the corpus
closer to the truth.
At each iteration of learining, the transformation that assigned
highest-score is found, and is added to the ordred transformation list:
needs a space of transformations, scoring functions for comparisons with the
truth.
The ordered list of transformations, plus the initial-state annotator, is a
learned model of POS tagging.
POS transformations
- Each transformation has two components: a rewrite rule and a triggering
environment
- Rewrite rule: Change the tag from verb to noun
Triggering environment: if the word to the left is a determiner
Together, these would correct the word drive in the following mistagged.
Example: I/pronoun enjoy/verb the/determiner drive/verb
How do we acquire a sequence of transformations?
- At each step, examine every applicable transformation
- Pick the one that brings the current state of the training corpus closest to the
manually-annotated truth.
- Apply that transformation everywhere applicable
- Pick another transformation
- Continue until no transformation improves the training corpus
Examples for Contextual (final-state) transformation templates
Change the tag of word w from A to B when:
-
The previous (next) word is
tagged C
-
Either of the two previous (next)
words is tagged C
-
The word exactly two words to
the left (right) is tagged C
-
w is word X
-
w is word X, and the previous
(next) word is tagged C
Unknown words
- Initial-state annotator uses most frequent POS from a lexicon corpus.
What about words that aren't in the lexicon?
- Initial-state annotator for unknown words:
proper noun (NNP) or common noun (NN), depending on capitalization
- Learn a seperate transformation sequence for unknown words.
This sequence is applied only to unknown words, before contextual
transformations.
Unknown-word (start-state) transformation templates
Assign tag A to word w if/
or
Change the tag of word w from A to B if:
-
w contains character K
-
w has prefix P (suffix S)
-
w has prefix P (suffix S), and removing P (S)
produces a word that is in the lexicon
-
Adding prefix P (suffix S) to word w results in a
word that is in the lexicon
-
w appears to the left (right) of word X anywhere in
the corpus being tagged
Stochastic Taggers
Probablistic/Statistical Terms:
Measure how ofter a particular event occurs.
Example
The word flies occurs in a corpus 1000 times.
600 as a verb, 400 as a noun:
- P(V) = 0.6 (i.e. P(flies is a verb)
- P(N) = 0.4 (i.e. P(flies is a noun)
flies occurs in the beginning of a sentence 100 times:
95 of these, as a noun:
- P(N & S) = 95/1000 = .095
(i.e. P(flies noun at the beginning of a sentence))
The target in statistical taggers:
for a sequence of w1....wn words, find the sequence of POS categories
C1....Cn that maximizes p(C1...Cn|W1....Wn)
Bayes rule:
P(C1...Cn)*P(W1....Wn|C1...Cn) / P(W1....Wn)
Difficult to calculate.
Approximate each factor by applying some independence assumptions:
Unigram: P(C1...Cn)=~P(C1)*...*p(Cn)
Bigram: P(Ci|Ci-1)
Estimator: P(C1,....Cn)=~P(C1|C2)*P(C2|C3)*...*P(Cn|Cn-1)
Cannonical Statistical Taggers:
N-Gram Tagging
most widely used.
Main algorithm:
Given a word sequence W - try to find the tag sequence T that maximizes
P(T|W).
*This can be done using the Viterbi algorithm to find the T
that maximizes:
P(T)*P(W|T)
Unseen words are handles with suffix information.
More detailed:
Let t vary over all possible tags. The most common tag for the
ith word of a sentence, wi, is the one
that maximizes the probablility:
p(t|wi)
This algorithm solves the tagging probelm for a single word by finding:
arg maxtp(t|wi)
(find the t that maximizes the following quantity: the probability
of a tag given the word.
This can be extended by looking for the sequence of tags that maximize the
product of the individual word probabilities:
arg max BIG_PAI tp(ti|wi)
Tags these days have 96-97% accurancy .
But this type of tagging fail in 'the can..' example - making
can a modal.
To allow a bit context:
Collect statistics on the probability of tag ti
following tag ti-1.
arg max BIG_PAI tp(ti|ti-1)p(wi|ti)
This is a product of the probablilities of each word but with two
probablilities: the probablitiy of a tag ti
given its possible tags, and the probablitiy of a tag given its previuos tag
(the context): p(ti|ti-1)
The above equation corresponds to a well understood mathematical construct:
Hidden Markov Models (HMM): a finite automaton in which a state
tarnsitions have probablitities and whose output is also probablisitic.
: HMM automatum
This allows seeing the tagging problem as simply the following way:
Given a string of words, find the sequence of states the machine
went through in order to output the sentence at hand.
These are 2 possible ways : linear equation (Viterbi algorithm) or
the automata.
What happens when allowed unkown words in input?
for such words p(t|wi) is zero.
Methods:
- assigning very low probablities - to allow tagger process sentence
- smoothing - find clues from morphology (as the last two letters...)
Maximum Entropy Tagging
This framework is a probabilistic framework, a model is found that is
consistent with the observed data and is maximally agnostic with
respect to all parameters for which no data exists.
Such a tagger can be found here.
References
- Good Slides: www.cs.bu.edu
- Charniak, E. Statistical Techniques for Natural Language Parsing
- Brill, E. Some Advances in Transformation-Based Part of Speech
Tagging
- Georgetown
university CL class -overview
- Eric Brill and Jun Wu Classifier combination for improved Lexical Disambiguation -
- in proceedings of Coling-ACL annual
meeting, 1998
Back to the course home page
Last modified April 25, 1999
Yael Dahan Netzer