Statistical Constituent-based Parsing

This lecture discusses computational methods for practical statistical syntactic parsing of free-text sentences into a constituent-based syntactic representation.

References

  1. If you want a wonderful in-depth review of this material, follow all seven lectures of Chris Manning's Seven Lectures on Statistical Parsing (2007).
  2. A compact presentation by Joakim Nivre.
  3. A more extended tutorial on PCFG by Michael Collins followed by notes on lexicalized PCFGs.
  4. A tutorial by Yoav Goldberg: Parsing Natural Language With PCFGs an Overview.

Where We Stand

In the previous lecture, we reviewed how to encode constituent trees (the tree data structure), grammars (CFGs), and a fast polynomial algorithm to parse a sequence of words into a tree given a CFG: the CYK algorithm. Since CYK requires the grammar to be in Chomsky Normal Form (CNF), we also discussed an algorithm to transform a CFG into an equivalent CNF grammar. We now turn to a corpus-based machine-learning method to derive a parser from a treebank.

Treebank

To develop a constituent-based parser, we need a grammar. We could ask a language expert to prepare a CFG by hand, but it turns out to be extremely difficult to write large-coverage grammars: large teams have tried and failed (or only partially succeeded). Instead, we can try to learn a grammar from examples. As usual, we can try supervised or unsupervised machine learning. We will start by investigating supervised learning methods for parsing -- that is, we require a set of pairs (sentence, parse-tree). Such a set of annotated examples is called a treebank.

Treebanks are created by linguists to describe the syntactic structure of naturally occurring sentences. The Penn Treebank is the most famous treebank used in the field. It contains about 50,000 sentences (covering about 3M words). The process of creating it is described in Building a Large Annotated Corpus of English: The Penn Treebank, by Mitchell Marcus, Beatrice Santorini and Mary Ann Marcinkiewicz, Computational Linguistics, 1993. When creating a treebank, the key objective is to obtain a consistent method of annotating sentences. Consistency is achieved by testing for agreement among annotators: if 3 different annotators create the same syntactic tree for a sentence, we can assume this tree reflects a well-understood process. The treebank construction process consists of asking annotators to propose a tree, comparing the trees, and, whenever disagreements are detected, defining guidelines to resolve them. The Penn Treebank guidelines have been collected into a 300-page document ("Bracketing Guidelines for the Penn Treebank", 1995); they are summarized in this note.

The Penn Treebank annotations encode trees in a Lisp-like bracketed notation:
Tree for the sentence: 
"Had Casey thrown the ball harder, it would have reached home plate in time."

(S (SBAR-ADV (SINV had
		   (NP-SBJ Casey)
		   (VP thrown
		       (NP the ball)
		       (ADVP-MNR harder))))
   ,
   (NP-SBJ it)
   (VP would
       (VP have
	   (VP reached
	       (NP home plate)
	       (PP-TMP in
		       (NP time))))))
The tagset used to describe the syntactic categories consists of:
  1. Part-of-speech tags: for the nodes immediately above the words (pre-terminals).
  2. Syntactic categories: S, NP, VP, PP, ADVP, etc.
  3. Annotated syntactic categories: for example, NP-SBJ is an NP in subject position.
Note that the Penn Treebank contains rich annotations of the form PP-TMP which indicate the constituent is of category PP (prepositional phrase) and in addition that it fills the function of a temporal adjunct. Similarly, NP-SBJ indicates an NP that serves as a subject.

In addition, the Penn Treebank contains empty nodes marked as (-NONE- *). These nodes correspond to placeholders for deleted elements in the syntactic structure, as illustrated in the following example:

Tree for the sentence:
A form of asbestos once used to make Kent cigarette filters has caused a high percentage of cancer deaths
among a group of workers exposed to it more than 30 years ago, researchers reported.

( (S 
    (S-TPC-1 
      (NP-SBJ 
        (NP 
          (NP (DT A) (NN form) )
          (PP (IN of) 
            (NP (NN asbestos) )))
        (RRC 
          (ADVP-TMP (RB once) )
          (VP (VBN used) 
            (NP (-NONE- *) )
            (S-CLR 
              (NP-SBJ (-NONE- *) )
              (VP (TO to) 
                (VP (VB make) 
                  (NP (NNP Kent) (NN cigarette) (NNS filters) )))))))
      (VP (VBZ has) 
        (VP (VBN caused) 
          (NP 
            (NP (DT a) (JJ high) (NN percentage) )
            (PP (IN of) 
              (NP (NN cancer) (NNS deaths) ))
            (PP-LOC (IN among) 
              (NP 
                (NP (DT a) (NN group) )
                (PP (IN of) 
                  (NP 
                    (NP (NNS workers) )
                    (RRC 
                      (VP (VBN exposed) 
                        (NP (-NONE- *) )
                        (PP-CLR (TO to) 
                          (NP (PRP it) ))
                        (ADVP-TMP 
                          (NP 
                            (QP (RBR more) (IN than) (CD 30) )
                            (NNS years) )
                          (IN ago) )))))))))))
    (, ,) 
    (NP-SBJ (NNS researchers) )
    (VP (VBD reported) 
      (SBAR (-NONE- 0) 
        (S (-NONE- *T*-1) )))
    (. .) ))
In many cases, function annotations and empty nodes are removed from the Penn Treebank before statistical methods are applied.

Deriving a grammar from a treebank

Once given a treebank, we can "learn" a grammar that fits the observed treebank. To perform this operation, called "grammar induction", we must first decide which grammar model we want to derive.

The Chomsky Hierarchy

By grammar model, we refer to the family of grammars we are interested in learning. In 1956, Chomsky defined a hierarchy of formal grammars of varying expressive power and parsing complexity, known as the Chomsky Hierarchy. When we decide to induce a grammar from a treebank, we must decide which level of this hierarchy we aim for. All grammars in the hierarchy are defined as a tuple (T, N, P, S) where:
  1. T: a set of terminal symbols (in our case, the words of the language).
  2. N: a set of non-terminal symbols (in our case, part of speech and syntactic categories)
  3. P: a set of productions (also known as rewriting rules)
  4. S: a designated starting symbol in N.
The differences in the levels of the hierarchy correspond to the allowed forms of the rules. The hierarchy includes the following levels:
  1. Unrestricted grammars: rules can be of any form (u → v)
  2. Context sensitive grammars: rules of the form (uAv → uwv)
  3. Context free grammars: rules of the form (A → u)
  4. Regular grammars: rules of the form (A → a) and (A → aB)
(where u, v refer to any sequence of non-terminal and terminal symbols, A and B refer to non-terminal symbols and a refers to terminal symbols). This hierarchy defines more and more expressive grammars (least expressive is "regular", most expressive is "unrestricted"). But more expressive grammars require more complex parsing methods:
  1. Unrestricted grammars: parsing is undecidable
  2. Context sensitive grammars: parsing is exponential
  3. Context free grammars: parsing is polynomial
  4. Regular grammars: parsing is linear
For our purposes, we decide to use Context Free Grammars. This is an empirical decision.

CFGs and Natural Languages

Is it reasonable to use CFGs to model natural language syntax? Some characteristics of natural language syntax, such as agreement and pronoun co-reference constraints, cannot be modeled directly and concisely by context-free rules. On the other hand, the skeleton recursive structure of natural language sentences can be modeled using CFGs. Theoretically, any finite language can be modeled by a trivial CFG (one rule per sentence), but this trivial grammar is really not interesting (it does not generalize at all). For example, assume there is number agreement in a language between subject and verb (if the subject is singular, the verb must be singular, and likewise for plural). Then a basic CFG rule that only describes the structure of the constituents can be expanded into a more complex CFG that also captures agreement:
S → NP VP
can be replaced by:
S → S-singular 
S → S-plural
S-singular → NP-singular VP-singular
S-plural   → NP-plural VP-plural
In other words, counter-claims that a natural language phenomenon is not context-free can be addressed by creating a more complex CFG that captures the observed context dependencies through a larger number of categories and rules. We are interested in inducing from examples the "most interesting" and "least complex" CFG that can account for the data. To model what makes a CFG interesting or complex, we will intuitively design a metric, which we call the "Description Length", that measures how complex a CFG is: we assume that both the number of non-terminals and the number of rules contribute to this metric. Our goal is, therefore, to induce a CFG with Minimum Description Length (MDL) that explains the observed treebank.
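To make this concrete, here is a minimal sketch of the agreement-expanded grammar written as an NLTK CFG. The non-terminal names and the handful of lexical rules ('dog', 'dogs', 'barks', 'bark') are invented purely to make the example runnable:
import nltk

# Agreement captured by splitting categories into singular/plural variants.
agreement_cfg = nltk.CFG.fromstring("""
    S -> S_singular | S_plural
    S_singular -> NP_singular VP_singular
    S_plural -> NP_plural VP_plural
    NP_singular -> 'the' 'dog'
    NP_plural -> 'the' 'dogs'
    VP_singular -> 'barks'
    VP_plural -> 'bark'
""")

parser = nltk.ChartParser(agreement_cfg)
print(list(parser.parse('the dog barks'.split())))  # one parse
print(list(parser.parse('the dog bark'.split())))   # no parse: agreement violated
The price of this encoding is the blow-up in the number of categories and rules, which is exactly what the Description Length metric just discussed is meant to penalize.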

From trees to productions

The first method we use to derive a CFG from a treebank is to read off one production (one rule) from each interior node observed in the treebank: for example, if a tree contains an NP node whose children are Adj and Noun, we derive the rule (NP → Adj Noun). Since there are many words in the treebank (about 3M), and a sentence of about 20 words contains roughly one to two times as many interior nodes, we will get a very large number of productions generated from the treebank. We will observe that many of these rules are repeated many times, while others occur very rarely in the corpus (the usual long-tail distribution characteristic of many corpus statistics). In fact, for a corpus of about 1M words, we find about 5,000 distinct rules and about 100 non-terminal symbols. (These figures give us an idea of the description length of this first model.)
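NLTK's Tree objects expose exactly this rule-extraction step through the productions() method. The sketch below uses the 10% Penn Treebank sample that ships with NLTK (assuming it has been installed with nltk.download('treebank')):
import nltk
from nltk.corpus import treebank   # 10% sample of the Penn Treebank

# One production per interior node of every tree in the sample.
productions = []
for tree in treebank.parsed_sents():
    productions.extend(tree.productions())

print(len(productions))                          # total rule occurrences
print(len(set(productions)))                     # number of distinct rules
print(len(set(p.lhs() for p in productions)))    # number of non-terminal symbols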

Dealing with Words

When analyzing corpus data, we observe that the rate of "unknown words" remains high even with very large corpora: if we sample 1M words from a corpus, list all the observed words, and then sample another 1M words, we find that more than 5% of the words in the second sample were never seen in the first. Even if we remember all the words observed in the 2M-word corpus and sample yet another 1M words, about 5% of the words in the new sample are still "unknown". This rate of unknown words remains high until we reach "Web-scale" corpora (several hundreds of millions of words). Compared to these dimensions, syntactic treebanks remain necessarily small (because of the high cost of producing reliable syntactic annotations). To avoid having to cope with the many unknown words, and with the very large number of known words (hundreds of thousands of words in the vocabulary), within the statistical methods we review below, we rely on the results of a pre-processing stage: a part-of-speech tagger, as reviewed in the previous lecture. Our assumptions in the following are therefore:
  1. Natural language text is segmented into "clean" sentences.
  2. Sentences are split into clean words.
  3. Sentences are tagged with parts of speech tags in a reliable manner.
Each of these steps (sentence segmentation, word segmentation, tagging) is much easier in English than in languages with rich morphology (such as Hebrew) or languages written without word delimiters (Chinese, Japanese). Even in English, these steps introduce some level of noise: for example, the best English taggers reach about 98% accuracy, but this depends on the domain of the text and can go lower. For sentence segmentation, NLTK includes a statistical module based on the Punkt sentence segmenter (Kiss and Strunk, Unsupervised Multilingual Sentence Boundary Detection, Computational Linguistics, volume 32, pages 485-525, 2006). Here is an example of how to segment text:
 	
>>> import nltk, pprint
>>> sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
>>> text = nltk.corpus.gutenberg.raw('chesterton-thursday.txt')
>>> sents = sent_tokenizer.tokenize(text)
>>> pprint.pprint(sents[171:181])
['"Nonsense!',
 '" said Gregory, who was very rational when anyone else\nattempted paradox.',
 '"Why do all the clerks and navvies in the\nrailway trains look so sad and tired,...',
 'I will\ntell you.',
 'It is because they know that the train is going right.',
 'It\nis because they know that whatever place they have taken a ticket\nfor that ...',
 'It is because after they have\npassed Sloane Square they know that the next stat...',
 'Oh, their wild rapture!',
 'oh,\ntheir eyes like stars and their souls again in Eden, if the next\nstation w...',
 '"\n\n"It is you who are unpoetical," replied the poet Syme.']
This example illustrates the complexity of sentence segmentation: reported speech, newlines, abbreviations and acronyms are all confusing. Another method to learn how to segment sentences relies on classification; you can read a short introduction in NLTK's Chapter 6.2.
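The sketch below, adapted from the example in NLTK's Chapter 6.2, treats every period, question mark, or exclamation mark as a candidate boundary and trains a Naive Bayes classifier to decide whether it actually ends a sentence (the exact feature set is illustrative):
import nltk

def punct_features(tokens, i):
    # Features describing the context of a candidate boundary token.
    return {'next_word_capitalized': tokens[i + 1][0].isupper(),
            'prev_word': tokens[i - 1].lower(),
            'punct': tokens[i],
            'prev_word_is_one_char': len(tokens[i - 1]) == 1}

# Build labeled examples from a corpus whose sentence boundaries are known.
tokens, boundaries, offset = [], set(), 0
for sent in nltk.corpus.treebank_raw.sents():
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset - 1)

featuresets = [(punct_features(tokens, i), i in boundaries)
               for i in range(1, len(tokens) - 1) if tokens[i] in '.?!']
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))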

Dealing with Ambiguity

Once we induce a simple CFG from the treebank, we can apply the CYK algorithm and parse new, unseen sentences. How well does that work? A first obstacle is that CYK only works on grammars in Chomsky Normal Form (CNF). We could translate the induced grammar into CNF (there is a procedure to perform this operation), but it is easier to instead convert the trees in the treebank to CNF and then induce the grammar from the transformed trees. NLTK includes a module to perform such tree transformations: the treetransforms module. The grammar we learn on the transformed trees is "equivalent" to the one we would learn on the original trees, because the transformations we apply are "reversible" (see the methods chomsky_normal_form and un_chomsky_normal_form). The procedure consists of:
Training: 
Original Treebank → CNF Treebank → CNF Grammar

Testing: 
Parse Sentence with CNF Grammar → CNF Tree → un_CNF(tree) → Tree in the original (non-CNF) treebank format
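The reversibility of the transformation can be checked directly on a treebank tree; the sketch below uses the Penn Treebank sample shipped with NLTK and the Tree methods mentioned above:
import nltk
from nltk.corpus import treebank

tree = treebank.parsed_sents()[0]
cnf_tree = tree.copy(deep=True)
cnf_tree.chomsky_normal_form()      # binarize, introducing artificial X|<...> nodes
cnf_tree.un_chomsky_normal_form()   # undo the transformation
print(cnf_tree == tree)             # expected: True (the round trip is lossless)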
We will find that the resulting grammar has 2 issues: it over-generates (it accepts many word sequences that are not grammatical sentences), and it is massively ambiguous (it licenses many different parse trees for a single sentence). The first problem is not so much an issue for us: we will use the parser to parse real sentences, and we will not use our model to distinguish "real" sentences from "bad" ones. (But note that over-generation is an issue if we consider the CFG we have learned as a language model -- that is, if we use the generative model encoded by the CFG to estimate the likelihood that a sentence belongs to the language, p(sentence | cfg). In this case, over-generation means a large part of the probabilistic mass is wasted on non-sentences.)

The second problem is a real issue: we do not want hundreds of parse trees for a single sentence; we want the "best possible" tree -- the one that best explains the observed sentence. We could settle for a few trees (say the top K) when the parser detects real ambiguity, and expect another module later to disambiguate based on semantic or discourse-level knowledge. But we certainly do not want all possible parses without any ranking. How can we rank the trees the grammar produces?

Probabilistic CFG

In order to decide which tree is the most likely for a given sentence, we extend the CFG formalism with a probabilistic model: the PCFG (probabilistic context-free grammar). In a PCFG, each rule is associated with a probability, such that for all the rules sharing the same left-hand-side non-terminal, the probabilities sum to 1:
X → Y1 Z1 | p1 
X → Y2 Z2 | p2 
...
X → Yn Zn | pn 

Σpi = 1.0
The generative story of a PCFG is the following:
  1. Start from the start symbol S
  2. Pick a rule from the possible rewritings of (S → rhsi) according to the distribution pi
  3. For each non-terminal in the selected rhs, expand it using the same mechanism (pick a rule according to the distribution).
  4. Repeat until all non-terminals have been expanded to terminal leaves.
In a CFG, the context-free assumption means that a derivation (X → Y Z) is independent of its context. In a PCFG, this independence assumption means that each derivation of a non-terminal is probabilistically independent of its context -- which means that the probability of a whole tree is simply the product of probabilities of the derivation of each node.
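Both the rule format and the generative story can be illustrated with a small sketch in NLTK. The toy grammar and its probabilities below are made up for illustration, and the sample function simply follows the generative story step by step:
import random
import nltk

toy_pcfg = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [0.8] | NP PP [0.2]
    VP -> V NP [0.7] | VP PP [0.3]
    PP -> P NP [1.0]
    Det -> 'the' [1.0]
    N -> 'dog' [0.5] | 'park' [0.5]
    V -> 'saw' [1.0]
    P -> 'in' [1.0]
""")

def sample(grammar, symbol=None):
    # Expand a non-terminal by picking one of its rules according to the
    # rule probabilities, then recursively expand the right-hand side.
    if symbol is None:
        symbol = grammar.start()
    rules = grammar.productions(lhs=symbol)
    rule = random.choices(rules, weights=[r.prob() for r in rules])[0]
    words = []
    for sym in rule.rhs():
        if isinstance(sym, nltk.Nonterminal):
            words.extend(sample(grammar, sym))
        else:
            words.append(sym)
    return words

print(' '.join(sample(toy_pcfg)))   # e.g. "the dog saw the park"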

When we face ambiguity, we compute the probability of a sentence p(S) as the sum of the probabilities of each possible parse of the sentence by the grammar.

Parsing PCFG with CYK

To parse with PCFGs, we use a simple variation of the CYK algorithm -- a Viterbi-like algorithm. In the PCFG formulation of CYK, the memoization matrix p[X, i, l] contains probabilities instead of boolean values -- that is, we define:
p[X, i, l] = probability of the most likely derivation of the span wi...wi+l-1 from category X
We adjust the initialization step and the recursive step of the algorithm as follows:
Init:
For i = 1...n:
  For X in NT(G):
    p[X, i, 1] = p(X | wi)  # These probabilities of pre-terminals 
                            # can be estimated using a POS tagger.

Recursive step:
For l = 2..n:              # Length of span
  For i = 1..n-l+1:        # Start of span
    For k = 1..l-1:        # Partition of span
      For (X → Y Z, q) in Rules(G): # q = p(Y Z | X)
        new_p = p[Y, i, k] * p[Z, i+k, l-k] * q
        If new_p > p[X, i, l]:
          p[X, i, l] = new_p
          back[X, i, l] = (Y, Z, k)
As in the case of the deterministic CFG, if we want to use the algorithm to recover the parse tree, we must keep back-pointers from each larger constituent X to its children nodes Y and Z: alongside the matrix p we store back[X, i, l] = (Y, Z, k). Together, p and back record that expanding X with (X → Y Z), splitting after k words, is the most likely way for X to cover the span of l words starting at position i.
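The pseudocode above translates almost line by line into Python. The sketch below is a minimal implementation for a CNF PCFG whose lexical and binary rules are given as plain dictionaries (the data format is ours, not NLTK's), with 0-based indices:
from collections import defaultdict

def viterbi_cky(words, lexical, binary, start='S'):
    # lexical: {(X, word): probability of rule X -> word}
    # binary:  {(X, Y, Z): probability of rule X -> Y Z}
    # Returns (probability, tree) of the most likely parse rooted in `start`,
    # or (0.0, None) if no parse exists; trees are nested tuples.
    n = len(words)
    p = defaultdict(float)   # p[X, i, l]: best probability for X over words[i:i+l]
    back = {}                # back[X, i, l] = (Y, Z, k): best split

    # Initialization: spans of length 1 are covered by lexical rules.
    for i in range(n):
        for (X, w), q in lexical.items():
            if w == words[i]:
                p[X, i, 1] = q

    # Recursive step: build larger spans from two smaller ones.
    for l in range(2, n + 1):             # length of span
        for i in range(n - l + 1):        # start of span
            for k in range(1, l):         # split point
                for (X, Y, Z), q in binary.items():
                    new_p = p[Y, i, k] * p[Z, i + k, l - k] * q
                    if new_p > p[X, i, l]:
                        p[X, i, l] = new_p
                        back[X, i, l] = (Y, Z, k)

    def build(X, i, l):
        if l == 1:
            return (X, words[i])
        Y, Z, k = back[X, i, l]
        return (X, build(Y, i, k), build(Z, i + k, l - k))

    best = p[start, 0, n]
    return (best, build(start, 0, n)) if best > 0 else (0.0, None)

lexical = {('Det', 'the'): 1.0, ('N', 'dog'): 0.5, ('N', 'cat'): 0.5, ('V', 'saw'): 1.0}
binary = {('S', 'NP', 'VP'): 1.0, ('NP', 'Det', 'N'): 1.0, ('VP', 'V', 'NP'): 1.0}
print(viterbi_cky('the dog saw the cat'.split(), lexical, binary))
In practice, the two dictionaries play the role of the grammar, and their probabilities would come from the counts of the treebank productions.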

The code of ViterbiParser in NLTK shows an implementation of this algorithm. (Note that this implementation does not assume the grammar is in CNF, and hence it is a bit more complex than the one described above.)
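Here is a minimal usage sketch of ViterbiParser; the toy grammar and its probabilities are invented, and the point is that the rule probabilities are what disambiguates the PP attachment:
import nltk

grammar = nltk.PCFG.fromstring("""
    S -> NP VP [1.0]
    NP -> Det N [0.6] | NP PP [0.4]
    VP -> V NP [0.7] | VP PP [0.3]
    PP -> P NP [1.0]
    Det -> 'the' [1.0]
    N -> 'man' [0.4] | 'dog' [0.3] | 'park' [0.3]
    V -> 'saw' [1.0]
    P -> 'in' [1.0]
""")

parser = nltk.ViterbiParser(grammar)
for tree in parser.parse('the man saw the dog in the park'.split()):
    print(tree.prob())    # probability of the most likely tree
    print(tree)           # the PP attachment chosen by the rule probabilities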

The probability returned by this algorithm is that of the most likely tree covering a sentence S. If we want to estimate p(S) itself, we must compute the sum of the probabilities of all possible parses. A variant of the Viterbi parsing algorithm (known as the inside algorithm) computes this sum efficiently: we simply replace the "max" operator (shown above as "if new_p > ...") by an addition.

Evaluating a CFG Parser

In order to evaluate the success of a parser, we can use a stringent metric -- a parser is successful on a sentence only if it returns exactly the same tree as the one observed in the treebank. By this exact-match metric, the best existing parsers today reach an accuracy of about 40%.

A less stringent and more informative metric considers the overlap of constituents recovered by the parser with those present in the treebank. We define here a constituent as a triplet (X, i, l) -- that is, a subtree rooted in non-terminal X that covers the span of l words starting at position i (wi...wi+l-1).

Given this definition of constituents, we consider the set of constituents returned by the parser (there are as many constituents as interior nodes in the tree, which is about 2n for a sentence of length n with a CNF grammar). We then compute precision, recall, and F-measure by comparing this set with the set of constituents of the observed tree in the treebank.
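Here is a sketch of this constituent-overlap evaluation; note that this simple version also counts pre-terminal nodes as constituents, whereas standard PARSEVAL implementations typically exclude them and handle a few more edge cases:
from nltk import Tree

def constituents(tree, start=0):
    # The set of (label, start, length) triples for all interior nodes.
    spans = {(tree.label(), start, len(tree.leaves()))}
    i = start
    for child in tree:
        if isinstance(child, Tree):
            spans |= constituents(child, i)
            i += len(child.leaves())
        else:
            i += 1
    return spans

def evaluate(predicted, gold):
    p_spans, g_spans = constituents(predicted), constituents(gold)
    correct = len(p_spans & g_spans)
    if correct == 0:
        return 0.0, 0.0, 0.0
    precision = correct / len(p_spans)
    recall = correct / len(g_spans)
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = Tree.fromstring('(S (NP (Det the) (N dog)) (VP (V barks)))')
pred = Tree.fromstring('(S (NP (Det the)) (VP (N dog) (V barks)))')
print(evaluate(pred, gold))   # below (1.0, 1.0, 1.0) because some spans differ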

Independence Assumptions

When we train a PCFG on a treebank with simple MLE estimation of the rule probabilities, we obtain a very poor parser (with an F-measure of about 50-60%). One reason the parser behaves poorly is that the probabilistic model we learn makes very strong independence assumptions (those of a PCFG) which do not hold in the treebank.
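For concreteness, the MLE training step can be sketched with NLTK's induce_pcfg, which assigns each rule its relative frequency among the rules with the same left-hand side (again using the 10% Penn Treebank sample; taking S as the start symbol is an assumption):
import nltk
from nltk.corpus import treebank

productions = []
for tree in treebank.parsed_sents():
    t = tree.copy(deep=True)
    t.collapse_unary(collapsePOS=False)   # remove unary chains above the POS level
    t.chomsky_normal_form()               # binarize
    productions.extend(t.productions())

grammar = nltk.induce_pcfg(nltk.Nonterminal('S'), productions)
for production in grammar.productions(lhs=nltk.Nonterminal('S'))[:5]:
    print(production)    # rules S -> ... with their MLE probabilities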

To assess the extent of this issue, we translate the notion of "context-freeness" of a grammar into a measure of "context-freeness" of a treebank -- where we do not know which grammar produced the observed trees. One way to measure the context-freeness of a treebank is to measure the extent to which the distributions of the productions below a given non-terminal are identical in different contexts. If the treebank is sampled from a truly context-free grammar, we expect these distributions to be similar. If we observe widely divergent distributions in different contexts, we can measure the extent to which the treebank is not context-free.

Consider for example the distribution of productions below the non-terminal NP in the Penn Treebank. The following histogram shows the distribution of these productions in 2 distinct contexts -- when the NP node occurs below an S node and when it occurs below a VP node:
(Histogram from Seven Lectures on Statistical Parsing, Chris Manning, Lecture 3.) We observe that these two distributions are very different. One way to quantify the difference between two distributions is relative entropy, also known as the Kullback-Leibler (KL) divergence.
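The sketch below estimates this divergence on the Penn Treebank sample in NLTK: it collects the expansions of NP nodes under S and under VP (stripping function tags such as -SBJ so that NP-SBJ counts as NP) and computes a smoothed KL divergence between the two distributions. The smoothing constant and the tag-stripping heuristic are ours:
import math
from collections import Counter
import nltk
from nltk.corpus import treebank
from nltk.tree import ParentedTree

def base(label):
    # Strip Penn Treebank function tags and indices, e.g. 'NP-SBJ-1' -> 'NP'.
    return label.split('-')[0].split('=')[0]

def np_expansions(parent_label):
    # Count right-hand sides of NP nodes whose parent has the given base label.
    counts = Counter()
    for tree in treebank.parsed_sents():
        ptree = ParentedTree.convert(tree)
        for np in ptree.subtrees(lambda t: base(t.label()) == 'NP'):
            parent = np.parent()
            if parent is not None and base(parent.label()) == parent_label:
                rhs = tuple(base(c.label()) if isinstance(c, nltk.Tree) else c for c in np)
                counts[rhs] += 1
    return counts

def kl_divergence(p_counts, q_counts, eps=1e-3):
    # KL(P || Q) with add-epsilon smoothing over the union of observed expansions.
    events = set(p_counts) | set(q_counts)
    p_total = sum(p_counts.values()) + eps * len(events)
    q_total = sum(q_counts.values()) + eps * len(events)
    kl = 0.0
    for e in events:
        p = (p_counts[e] + eps) / p_total
        q = (q_counts[e] + eps) / q_total
        kl += p * math.log(p / q)
    return kl

print(kl_divergence(np_expansions('S'), np_expansions('VP')))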

On the basis of this intuition, we can assess the extent to which a treebank could have been generated by a CFG. In order to improve the quality of the parser we learn, we will transform the non-terminals to force the treebank to behave in a more "context-free" manner. This is similar to the transformations we applied above: we rewrite the trees of the treebank (here, their node labels) before inducing the grammar, much as we split S into S-singular and S-plural to capture agreement.

Historically, several techniques have been applied to transform this intuition into effective learning methods of constituent parsers:

  1. lexicalized models: annotate the non-terminals with the lexical head of the constituent they cover. This has been proposed by Charniak (1997) and Collins (1997). See this lecture by Collins (2011) for an excellent summary of lexicalized parsing approaches.
  2. un-lexicalized models: annotate the non-terminals with non-lexicalized tags that capture useful contexts. This was performed manually, by carefully selecting useful contexts in Accurate unlexicalized parsing, Klein and Manning, ACL 2003.

    Later, Probabilistic CFG with latent annotations, by Matsuzaki, Miyao and Tsujii. ACL 2005 proposed to learn the annotation method automatically.

    Their method was refined in Learning Accurate, Compact, and Interpretable Tree Annotation, by Petrov, Barrett, Thibaux and Klein, ACL 2006.

The most current formulation of the strategy of tag-splitting is a family of models named "latent PCFG". The definition of a latent PCFG model is (as per Spectral Learning of Latent-Variable PCFGs):
An L-PCFG is a 5-tuple (N, I, P, m, n) where:
  1. N is the set of non-terminal symbols in the grammar.
  2. I ⊂ N is a finite set of in-terminals.
  3. P ⊂ N is a finite set of pre-terminals.
  4. N = I ∪ P, and I ∩ P = ∅.
  5. [1..m] is the set of possible hidden states.
  6. [1..n] is the set of possible words.
  7. ∀ a ∈ I, b ∈ N, c ∈ N, h1, h2, h3 ∈ [1..m], we have a context-free rule a(h1) → b(h2) c(h3).
  8. ∀ a ∈ P, h ∈ [1..m], x ∈ [1..n], we have a context-free rule a(h) → x. Hence each in-terminal a ∈ I is always the lefthand-side of a binary rule a → b c; and each preterminal a ∈ P is always the left-hand-side of a rule a → x.
The latent variables capture the fact that we expect the CFG skeleton to be refined. The objective of learning a latent-PCFG model is to learn how to refine the non-terminals of the PCFG skeleton observed in the treebank into appropriate sub-tags. Learning the latent annotations is a difficult unsupervised learning problem. Different methods have been tried to attack it:
  1. Expectation Maximization (EM) (see Probabilistic CFG with latent annotations, and Learning Accurate, Compact, and Interpretable Tree Annotation)
  2. Spectral methods (see Spectral Learning of Latent-Variable PCFGs and Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs)
  3. Factored models (see Training Factored PCFGs with Expectation Propagation).


Last modified Dec 24, 2015