Statistical Constituent-based Parsing
This lecture discusses computational methods to perform practical
statistical syntactic parsing of sentences in free text into a
constituent-based syntactic representation.
References
- For a wonderful in-depth review of this
material, follow all seven lectures in Chris Manning's Seven Lectures on
Statistical Parsing (2007).
- A compact presentation by Joakim Nivre.
- A more extended tutorial on PCFGs by Michael Collins,
followed by notes on lexicalized PCFGs.
- A tutorial by Yoav Goldberg: Parsing Natural Language with PCFGs: an Overview.
Where We Stand
In the previous lecture, we reviewed how to
encode constituent trees (the tree data structure), grammars (CFGs),
and a fast polynomial algorithm to parse a sequence of words into a
tree given a CFG: the CYK algorithm. Since CYK requires the grammar
to be in Chomsky Normal Form (CNF), we also discussed an algorithm to
transform a CFG into an equivalent CNF grammar.
We now turn to a corpus-based machine-learning method to derive a
parser from a treebank.
Treebank
To develop a constituent-based parser, we need a grammar. We could ask
a language expert to prepare a CFG. It turns out that it is
extremely difficult to write large-coverage grammars by hand. Large teams
have tried and failed (or only partially succeeded).
Instead, we can try to learn a grammar from examples. As usual, we can
try supervised or unsupervised machine learning. We will start by
investigating supervised learning methods for parsing -- that is, we
require a set of pairs (sentence, parse-tree). Such a set of
annotated examples is called a treebank.
Treebanks are created by linguists to describe the syntactic structure
of naturally occurring sentences. The Penn Treebank is the
most famous treebank used in the field. It contains about 50,000
sentences (covering about 3M words). The process of creating a
treebank is described in this paper:
Building a Large Annotated Corpus of English: The Penn
Treebank, by Mitchell Marcus, Beatrice Santorini and Mary Ann
Marcinkiewicz, Computational Linguistics, 1993.
When creating a treebank, the key objective is to obtain a consistent
method to annotate sentences. Consistency is achieved by testing for
agreement among annotators: if 3 different annotators create the same
syntactic tree for a sentence, then we can assume this tree reflects a
well-understood process. The treebank construction process consists
of asking annotators to propose a tree, comparing the trees, and,
whenever disagreements are detected, defining guidelines to resolve
them. The Penn Treebank Guidelines have
been collected into a 300-page document ("Bracketing Guidelines for
the Penn Treebank", 1995). The guidelines used in the Penn Treebank
are summarized in this note.
The Penn Treebank annotations use a Lisp-like bracketed notation to encode trees:
Tree for the sentence:
"Had Casey thrown the ball harder, it would have reached home plate in time."
(S (SBAR-ADV (SINV had
(NP-SBJ Casey)
(VP thrown
(NP the ball)
(ADVP-MNR harder))))
,
(NP-SBJ it)
(VP would
(VP have
(VP reached
(NP home plate)
(PP-TMP in
(NP time))))))
The tagset used to describe the syntactic categories consists of:
- Part of speech: for the nodes immediately above the words (pre-terminals).
- Syntactic categories: S, NP, VP, PP, ADVP
- Annotated syntactic categories: for example, NP-SBJ is an NP in subject position.
Note that the Penn Treebank contains rich annotations of the form PP-TMP which indicate
the constituent is of category PP (prepositional phrase) and in addition that it fills
the function of a temporal adjunct. Similarly, NP-SBJ indicates an NP that serves as a subject.
In addition, the Penn Treebank contains empty nodes marked as (-NONE- *).
These nodes correspond to placeholders for deleted elements in the syntactic structure,
as illustrated in the following example:
Tree for the sentence:
A form of asbestos once used to make Kent cigarette filters has caused a high percentage of cancer deaths
among a group of workers exposed to it more than 30 years ago, researchers reported.
( (S
(S-TPC-1
(NP-SBJ
(NP
(NP (DT A) (NN form) )
(PP (IN of)
(NP (NN asbestos) )))
(RRC
(ADVP-TMP (RB once) )
(VP (VBN used)
(NP (-NONE- *) )
(S-CLR
(NP-SBJ (-NONE- *) )
(VP (TO to)
(VP (VB make)
(NP (NNP Kent) (NN cigarette) (NNS filters) )))))))
(VP (VBZ has)
(VP (VBN caused)
(NP
(NP (DT a) (JJ high) (NN percentage) )
(PP (IN of)
(NP (NN cancer) (NNS deaths) ))
(PP-LOC (IN among)
(NP
(NP (DT a) (NN group) )
(PP (IN of)
(NP
(NP (NNS workers) )
(RRC
(VP (VBN exposed)
(NP (-NONE- *) )
(PP-CLR (TO to)
(NP (PRP it) ))
(ADVP-TMP
(NP
(QP (RBR more) (IN than) (CD 30) )
(NNS years) )
(IN ago) )))))))))))
(, ,)
(NP-SBJ (NNS researchers) )
(VP (VBD reported)
(SBAR (-NONE- 0)
(S (-NONE- *T*-1) )))
(. .) ))
In many cases, function annotations and empty nodes are removed from the Penn Treebank before
statistical methods are applied.
Deriving a grammar from a treebank
Once given a treebank, we can "learn" a grammar that fits the observed treebank.
To perform this operation, called "grammar induction", we must first decide which
grammar model we want to derive.
The Chomsky Hierarchy
By grammar model, we refer to the family of grammars we are interested
in learning. In 1956, Chomsky defined a hierarchy of formal grammars
of varying expressive power and parsing complexity, known as the Chomsky
Hierarchy.
When we decide to induce a grammar from a treebank, we must decide which level of this hierarchy we aim for.
All grammars in the hierarchy are defined as a tuple: (T, N, P, S) where:
- T: a set of terminal symbols (in our case, the words of the language).
- N: a set of non-terminal symbols (in our case, part of speech and syntactic categories)
- P: a set of productions (also known as rewriting rules)
- S: a designated starting symbol in N.
The differences in the levels of the hierarchy correspond to the allowed forms of the rules.
The hierarchy includes the following levels:
- Unrestricted grammars: rules can be of any form (u → v)
- Context-sensitive grammars: rules of the form (uAv → uwv)
- Context-free grammars: rules of the form (A → u)
- Regular grammars: rules of the form (A → a) and (A → aB)
(where u, v, w refer to any sequence of non-terminal and terminal symbols, A and B refer to non-terminal symbols,
and a refers to terminal symbols).
This hierarchy defines more and more expressive grammars (least expressive is "regular", most expressive is
"unrestricted"). But more expressive grammars require more complex parsing methods:
- Unrestricted grammars: parsing is undecidable
- Context-sensitive grammars: parsing is exponential
- Context-free grammars: parsing is polynomial
- Regular grammars: parsing is linear
For our purposes, we decide to use Context-Free Grammars. This is an empirical decision.
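To make the (T, N, P, S) tuple concrete, here is a minimal toy CFG written with NLTK (the grammar fragment is hypothetical, for illustration only):

import nltk

# N = {S, NP, VP, Det, Noun, Verb}, T = {'the', 'dog', 'cat', 'chased'},
# S is the designated start symbol, and each line is a production in P.
grammar = nltk.CFG.fromstring("""
    S -> NP VP
    NP -> Det Noun
    VP -> Verb NP
    Det -> 'the'
    Noun -> 'dog' | 'cat'
    Verb -> 'chased'
""")

print(grammar.start())               # S
print(len(grammar.productions()))    # 7 (the '|' lines expand into separate rules)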
CFGs and Natural Languages
Is it reasonable to use CFGs to model natural language syntax?
Characteristics of natural language syntax such as agreement and
pronoun co-reference constraints cannot be modeled using context free
grammars. On the other hand, the skeleton recursive structure of
natural language sentences can be modeled using CFGs.
Theoretically, any finite language can be modeled by a trivial CFG
(one non-terminal per sentence). But this trivial grammar is really
not interesting (it does not generalize at all). Conversely, assume
there is agreement in a language between subject and verb based on
number (if the subject is singular, the verb must be singular, and the
same for plural). Then a basic CFG rule that only describes the structure of
the constituents can be expanded into a more complex CFG that also
captures agreement:
S → NP VP
can be replaced by:
S → S-singular
S → S-plural
S-singular → NP-singular VP-singular
S-plural → NP-plural VP-plural
In other words, claims that a natural language phenomenon is not context-free can often be addressed by
creating a more complex CFG that captures the observed context dependencies through a larger number of categories
and rules.
We are interested in inducing from examples the "most interesting" and
"least complex" CFG that can account for the data. To model this
notion of what is an interesting and what is a complex CFG, we will
intuitively design a metric, which we will call the "Description
Length" that measures how complex a CFG is: we will assume that the
number of non-terminals and the number of rules contribute to this
metric. Our goal is, therefore, to induce a CFG with Minimum
Description Length (MDL) that explains the observed treebank.
From trees to productions
The first method we use to derive a CFG from a treebank is to derive a
production (a rule) for each interior node observed in the treebank.
For example, since in a tree we observe a node NP that covers children
Adj and Noun, we derive a rule (NP → Adj Noun).
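NLTK exposes this one-rule-per-interior-node reading directly: Tree.productions() returns exactly these rules. Here is a minimal sketch that counts them over the Penn Treebank sample shipped with NLTK (assuming the treebank corpus has been downloaded):

from collections import Counter
from nltk.corpus import treebank    # requires nltk.download('treebank')

rule_counts = Counter()
for tree in treebank.parsed_sents():
    # One production per interior node of the tree.
    rule_counts.update(tree.productions())

print(len(rule_counts))             # number of distinct rules induced
print(rule_counts.most_common(3))   # the most frequent productions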
Since there are many words in the treebank (about 3M), and a
sentence of about 20 words yields roughly one to two times as many
interior nodes as words, we will get a very large number of productions
generated from the treebank. We will observe that many of these rules
are repeated many times, while many others occur very rarely in the
corpus (the usual long-tail distribution that is characteristic of
many corpus statistics). In fact, for a corpus of about 1M words, we
find about 5,000 distinct rules and about 100 non-terminal
symbols. (These figures give us an idea of the description length of
this first model.)
Dealing with Words
When analyzing corpus data, we observe that the rate of "unknown
words" remains high even with very large corpora: that is, if we
sample 1M words from a corpus, list all the observed words, then
sample another 1M words, we will find that more than 5% of the words
in the second part of the corpus were never seen before in the first
half. Even if we remember all the words observed in the 2M-word
corpus and sample another 1M words, we will find that about 5% of the
words in the new sample are "unknown". This rate of unknown words remains high until we
reach "Web-scale" corpora (several hundreds of millions of words).
Compared to these dimensions, syntactic treebanks remain necessarily
small (because of the high cost required to produce reliable syntactic
annotations). To avoid the problem of learning how to cope with the
many unknown words and dealing with the large number of known words
(hundreds of thousands of words in the vocabulary) using the
statistical methods we will review below, we will rely on the results
of a pre-processing stage: a part-of-speech tagger as reviewed in this previous lecture.
Our assumptions in the following are therefore:
- Natural language text is segmented into "clean" sentences.
- Sentences are split into clean words.
- Sentences are tagged with parts of speech tags in a reliable manner.
Each of these steps (sentence and word segmentation, tagging) is much easier in English than in languages
with rich morphology (such as Hebrew) or without word segmentation (Chinese, Japanese). Even in English, these steps
introduce some level of noise: for example, in English, the best tagging reaches 98% accuracy, but this depends
on the domain of the text and can go lower.
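Here is a minimal sketch of this assumed pre-processing pipeline using NLTK (the example sentence is illustrative; the punkt and tagger models must be downloaded first):

import nltk   # requires nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

text = "A form of asbestos has caused a high percentage of cancer deaths."
for sentence in nltk.sent_tokenize(text):         # 1. sentence segmentation
    tokens = nltk.word_tokenize(sentence)         # 2. word tokenization
    print(nltk.pos_tag(tokens))                   # 3. part-of-speech tagging
# [('A', 'DT'), ('form', 'NN'), ('of', 'IN'), ('asbestos', 'NN'), ...]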
To achieve sentence segmentation, nltk includes a
statistical
sentence segmentation module, based on the Punkt sentence
segmenter (Kiss and Strunk: Unsupervised multilingual sentence boundary detection,
in Computational Linguistics, volume 32, pages 485–525, 2006).
Here is an example of how to segment text:
>>> import nltk, pprint
>>> sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
>>> sents = sent_tokenizer.tokenize(text)
>>> pprint.pprint(sents[171:181])
['"Nonsense!',
'" said Gregory, who was very rational when anyone else\nattempted paradox.',
'"Why do all the clerks and navvies in the\nrailway trains look so sad and tired,...',
'I will\ntell you.',
'It is because they know that the train is going right.',
'It\nis because they know that whatever place they have taken a ticket\nfor that ...',
'It is because after they have\npassed Sloane Square they know that the next stat...',
'Oh, their wild rapture!',
'oh,\ntheir eyes like stars and their souls again in Eden, if the next\nstation w...',
'"\n\n"It is you who are unpoetical," replied the poet Syme.']
This example illustrates the complexity of sentence segmentation:
reported speech, newlines, and acronyms are all confusing.
Another method to learn how to segment sentences relies on classification. You can read a short
intro on this in NLTK's Chapter
6.2.
Dealing with Ambiguity
Once we induce a simple CFG from the treebank, we can apply the CYK
algorithm and parse new, unseen sentences. How well does that work?
A first problem is that CYK only works on grammars that are in
Chomsky Normal Form (CNF). We could translate the induced grammar
into CNF (there is a procedure to perform this operation), but it is
easier to instead convert the trees in the treebank to CNF and then
induce the grammar from the transformed trees. NLTK includes a module
to perform such tree transformations: the treetransforms
module.
The grammar we learn on the transformed trees is "equivalent" to the
one we would learn on the original trees, because the transformations
we apply are "reversible" (look at methods chomsky_normal_form and
un_chomsky_normal_form). The procedure consists of:
Training:
Original Treebank → CNF Treebank → CNF Grammar
Testing:
Parse Sentence with CNF Grammar → CNF Tree → un_CNF(tree) → Tree comparable to the original treebank trees
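A minimal sketch of this round trip with NLTK's tree transformations (collapse_unary is shown because chomsky_normal_form alone does not remove unary chains; both operations work in place):

from nltk.corpus import treebank   # requires nltk.download('treebank')

tree = treebank.parsed_sents()[0].copy(deep=True)
tree.collapse_unary(collapsePOS=False)   # collapse unary chains (A -> B -> ...)
tree.chomsky_normal_form()               # binarize in place, introducing new labels
cnf_rules = tree.productions()           # the rules we would induce the grammar from
tree.un_chomsky_normal_form()            # reverse the binarization after parsing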
We will find that the resulting grammar has 2 issues:
- It vastly overgenerates: it can generate many sentences which are really
not "grammatical" to the ears of English speakers.
- It is highly ambiguous: it will generate many different trees for the same sentence.
The first problem is not so much an issue for us: we will use the
parser to parse real sentences, and we will not use our model to
distinguish "real" sentences from "bad" ones. (But note that
over-generation is an issue if we consider the CFG we
have learned as a language model -- that is, if we use the generative
model encoded by the CFG to estimate the likelihood that a sentence
belongs to the language, p(sentence | cfg). In this case,
over-generation means a large part of the probability mass is
wasted on non-sentences.)
The second problem is a real issue: we do not want hundreds of parse
trees for a single sentence; we want the "best possible" tree -- the
one that best explains the observed sentence. We could return a few
trees (say the top K) when there is genuine
ambiguity, and expect another module later to disambiguate based on
semantic or discourse-level knowledge. But we certainly do not want
all possible parses without any ranking. How can we rank the trees
the grammar produces?
Probabilistic CFG
In order to decide which tree is the most likely for a given sentence,
we will extend the CFG formalism with a probabilistic model: the PCFG. In
a PCFG, each rule is associated with a probability, so that when we look
at all the rules with the same left-hand-side non-terminal, the sum of
the probabilities is 1:
X → Y1 Z1 | p1
X → Y2 Z2 | p2
...
X → Yn Zn | pn
Σpi = 1.0
The generative story of a PCFG is the following:
- Start from the start symbol S
- Pick a rule from the possible rewritings of (S → rhsi) according to the distribution pi
- For each non-terminal in the selected rhs, expand it using the same mechanism (pick a rule according to the distribution).
- Repeat until all non-terminals have been expanded to terminal leaves.
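Here is a minimal sketch of this generative story in Python over a toy NLTK PCFG (the grammar and the generate helper are illustrative, not part of NLTK's API):

import random
import nltk

# A toy PCFG (hypothetical rules): for each left-hand side,
# the rule probabilities sum to 1.0.
toy_pcfg = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det Noun [0.7] | 'it' [0.3]
    VP -> Verb NP [1.0]
    Det -> 'the' [1.0]
    Noun -> 'dog' [0.5] | 'ball' [0.5]
    Verb -> 'chased' [1.0]
""")

def generate(grammar, symbol=None):
    """Sample a sentence top-down: pick one rule per non-terminal expansion."""
    symbol = symbol if symbol is not None else grammar.start()
    rules = grammar.productions(lhs=symbol)
    rule = random.choices(rules, weights=[r.prob() for r in rules])[0]
    words = []
    for sym in rule.rhs():
        if isinstance(sym, nltk.grammar.Nonterminal):
            words.extend(generate(grammar, sym))   # recursive expansion
        else:
            words.append(sym)                      # terminal: emit the word
    return words

print(' '.join(generate(toy_pcfg)))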
In a CFG, the context-free assumption means that a derivation (X → Y Z) is independent of its context.
In a PCFG, this independence assumption means that each derivation of a non-terminal is probabilistically
independent of its context -- which means that the probability of a whole tree is simply the product of
probabilities of the derivation of each node.
When we face ambiguity, we compute the probability of a sentence p(S) as the sum of the probabilities of
each possible parse of the sentence by the grammar.
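As a sketch of the tree-probability computation (assuming an NLTK PCFG such as the toy grammar above, and that every production observed in the tree appears in the grammar):

from math import prod   # Python 3.8+

def tree_prob(tree, grammar):
    """p(tree) = product over nodes of the probability of the rule used there."""
    rule_prob = {(p.lhs(), p.rhs()): p.prob() for p in grammar.productions()}
    return prod(rule_prob[(p.lhs(), p.rhs())] for p in tree.productions())

p(S) is then the sum of tree_prob over all the parses the grammar assigns to the sentence.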
Parsing PCFG with CYK
To parse with PCFGs, we use a simple variation of the CYK algorithm -- using a Viterbi-like algorithm.
In the PCFG formulation of CYK, the memoization matrix p[X, i, l] is a matrix of probabilities instead
of boolean values -- that is, we define:
p[X, i, l] = p(wi...wi+l-1 is covered by category X)
We adjust the initialization step and the recursive step of the algorithm as follows:
Init:
For i = 1...n:
For X in NT(G):
p[X, i, 1] = p(X | wi) # These probabilities of pre-terminals
# can be estimated using a POS tagger.
Recursive step:
For l = 2..n: # Length of span
For i = 1..n-l+1: # Start of span
For k = 1..l-1: # Partition of span
For (X → Y Z, q) in Rules(G): # q = p(Y Z | X)
new_p = p[Y, i, k] * p[Z, i+k, l-k] * q
If new_p > p[X, i, l]:
p[X, i, l] = new_p
back[X, i, l] = (Y, Z, k)
Similarly to the case of the deterministic CFG, if we want to use the algorithm to recover the parse tree,
we must keep back-pointers when we add an entry for a larger constituent X, pointing to its children nodes Y and Z.
We add to the matrix p the back-pointers (Y, Z, k). The interpretation of the matrices p and back together
is that (X → Y Z), with the split at position k, is the most likely derivation of category X covering the span of l words starting at position i.
The code of
ViterbiParser
in NLTK shows an implementation of this algorithm. (Note that this implementation does not assume the grammar is in
CNF, and hence it is a bit more complex than the one described above.)
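Here is a usage sketch (the sample size and the input sentence are illustrative; parsing fails if some input word is not covered by the induced grammar):

import nltk
from nltk.corpus import treebank   # requires nltk.download('treebank')

# Induce a PCFG from CNF-transformed treebank trees.
productions = []
for tree in treebank.parsed_sents()[:200]:    # a small sample, for illustration
    tree = tree.copy(deep=True)
    tree.chomsky_normal_form()
    productions += tree.productions()

S = nltk.grammar.Nonterminal('S')
grammar = nltk.induce_pcfg(S, productions)    # MLE estimates of rule probabilities

parser = nltk.ViterbiParser(grammar)
# Hypothetical input sentence; all its words must appear in the induced grammar.
for tree in parser.parse(['the', 'company', 'said', 'it', 'expected', 'a', 'loss', '.']):
    print(tree.prob())
    tree.un_chomsky_normal_form()             # undo the CNF transformation
    print(tree)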
The probability returned by this algorithm is that of the most likely tree covering a sentence S.
If we want to estimate p(S), we must compute the sum of the probabilities of all possible
parses. A variant of the Viterbi parsing algorithm, known as the Inside algorithm, computes this sum efficiently: we simply replace
the "max" operator (shown above as "if new_p > ...") with an addition.
Evaluating a CFG Parser
In order to evaluate the success of a parser, we can use a stringent metric: a parser is successful if it
returns exactly the same tree as the one observed in the treebank. By this metric, the best existing parsers today
reach an accuracy of about 40%.
A less stringent and more informative metric considers the overlap of
constituents recovered by the parser with those present in the
treebank. We define here a constituent as a triplet (X, i, l) -- that
is, a subtree rooted in non-terminal X that covers the span of l words
starting at position i (wi...wi+l-1).
Given this definition for constituents, we consider the set of
constituents returned by the parser (there are as many constituents as
interior nodes in the tree, which is about 2n for a sentence of length
n with a CNF grammar). And we compute precision, recall and F-measure
for the list of constituents when we compare it with that of the observed
tree in the treebank.
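A minimal sketch of this computation over NLTK-style trees (pre-terminal nodes are excluded from the constituent sets, as is standard; real evaluations use the evalb tool):

def constituents(tree, start=0):
    """Collect (label, start, length) triples for the interior nodes of a tree."""
    if isinstance(tree, str):                  # a leaf: one word, no constituent
        return set(), 1
    spans, length = set(), 0
    for child in tree:
        child_spans, child_len = constituents(child, start + length)
        spans |= child_spans
        length += child_len
    if not (len(tree) == 1 and isinstance(tree[0], str)):   # skip pre-terminals
        spans.add((tree.label(), start, length))
    return spans, length

def evaluate(gold_tree, parsed_tree):
    """Labeled-bracket precision, recall, and F-measure."""
    gold, pred = constituents(gold_tree)[0], constituents(parsed_tree)[0]
    correct = len(gold & pred)
    precision = correct / len(pred)
    recall = correct / len(gold)
    return precision, recall, 2 * precision * recall / (precision + recall)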
Independence Assumptions
When we train a PCFG on a treebank with simple MLE estimation of its parameters, we obtain a very poor parser
(with an F-measure of about 50-60%). One reason the parser behaves poorly is that the probabilistic model we learn
makes very strong independence assumptions (those of a PCFG) which do not hold in the treebank.
To assess the extent of this issue, we translate the notion of "context-freeness" of a grammar to a measure of
"context-freeness" in a treebank -- where we do not know which grammar produced the observed treebank.
One way to measure the context-freeness of a treebank is to measure the extent to which the distributions of the
productions below a given non-terminal are identical in different contexts. If the treebank is sampled from
a truly context-free grammar, we expect these distributions to be similar. If we observe widely divergent
distributions in different contexts, we can measure the extent to which the treebank is not context-free.
Consider for example the distribution of productions below the non-terminal NP in the Penn Treebank,
in two distinct contexts: when the NP node occurs below an S node and when it occurs below a VP node.
[Histogram: distribution of NP expansions under S vs. under VP, from Seven Lectures on Statistical Parsing, Chris Manning, Lecture 3.]
We observe that these two distributions are very different. One way to quantify the difference between distributions
is the notion of relative entropy,
also known as Kullback-Leibler (KL) divergence.
On the basis of this intuition, we can quantify the extent to which a treebank deviates from what a context-free grammar would generate.
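A minimal sketch of this measurement on the NLTK treebank sample (the function-tag stripping, the choice of contexts, and the smoothing constant are all assumptions of this sketch):

from collections import Counter
from math import log
from nltk.corpus import treebank   # requires nltk.download('treebank')

def base(label):
    """Strip function tags and indices: 'NP-SBJ' -> 'NP'."""
    return label.split('-')[0].split('=')[0]

def np_expansions(parent):
    """Count right-hand sides of NP nodes whose parent has the given base label."""
    counts = Counter()
    for tree in treebank.parsed_sents():
        for node in tree.subtrees(lambda t: base(t.label()) == parent):
            for child in node:
                if not isinstance(child, str) and base(child.label()) == 'NP':
                    rhs = tuple(c if isinstance(c, str) else base(c.label())
                                for c in child)
                    counts[rhs] += 1
    return counts

def kl(p, q, alpha=1e-3):
    """Add-alpha smoothed KL divergence D(p || q) over the joint support."""
    support = set(p) | set(q)
    pt = sum(p.values()) + alpha * len(support)
    qt = sum(q.values()) + alpha * len(support)
    return sum((p[e] + alpha) / pt * log(((p[e] + alpha) / pt) / ((q[e] + alpha) / qt))
               for e in support)

print(kl(np_expansions('S'), np_expansions('VP')))   # NP expansions under S vs. VP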
In order to improve the quality of the parser we learn, we will transform the non-terminals
to force the treebank to behave in a more "context-free" manner. This method is similar in spirit
to the category-splitting step we illustrated above for subject-verb agreement.
Historically, several techniques have been applied to transform this intuition into effective learning methods
of constituent parsers:
- lexicalized models: annotate the non-terminals with the lexical head of the constituent they cover.
This has been proposed by Charniak (1997) and Collins (1997).
See this lecture by Collins (2011) for an
excellent summary of lexicalized parsing approaches.
- un-lexicalized models: annotate the non-terminals with non-lexicalized tags that capture useful contexts.
This was performed manually, by carefully selecting useful contexts in
Accurate unlexicalized
parsing, Klein and Manning, ACL 2003.
Later, Probabilistic CFG with latent annotations,
by Matsuzaki, Miyao and Tsujii. ACL 2005 proposed to learn the annotation method automatically.
Their method was refined in Learning
Accurate, Compact, and Interpretable Tree Annotation, by Petrov, Barrett, Thibaux and Klein, ACL 2006.
The most current formulation of the strategy of tag-splitting is a
family of models named "latent PCFG". The definition of a latent PCFG
model is (as per Spectral
Learning of Latent-Variable PCFGs):
An L-PCFG is a 5-tuple (N, I, P, m, n) where:
- N is the set of non-terminal symbols in the grammar.
- I ⊂ N is a finite set of in-terminals.
- P ⊂ N is a finite set of pre-terminals.
- N = I ∪ P, and I ∩ P = ∅.
- [1..m] is the set of possible hidden states.
- [1..n] is the set of possible words.
- ∀ a ∈ I, b ∈ N, c ∈ N, h1, h2, h3 ∈ [1..m],
we have a context-free rule a(h1) → b(h2) c(h3).
- ∀ a ∈ P, h ∈ [1..m], x ∈ [1..n],
we have a context-free rule a(h) → x.
Hence each in-terminal a ∈ I is always the left-hand side of a binary rule a → b c,
and each pre-terminal a ∈ P is always the left-hand side of a rule a → x.
The latent variables capture the fact that we expect the CFG skeleton
to be refined. The objective of learning a latent-PCFG model is to
learn how to refine the non-terminals of the PCFG skeleton observed in
the treebank into appropriate sub-tags. Learning the latent
annotations is a difficult unsupervised learning problem. Different
methods have been tried to attack it:
- Expectation Maximization (EM) (see Probabilistic CFG
with latent annotations, and Learning Accurate, Compact, and Interpretable Tree Annotation)
- Spectral methods (see Spectral
Learning of Latent-Variable PCFGs and Tensor
Decomposition for Fast Parsing with Latent-Variable PCFGs)
- Factored models (see Training
Factored PCFGs with Expectation Propagation).
Last modified Dec 24, 2015