LL(1) and LR(0) parsing

Table of Contents

1 LL(1) parsing

1.1 First/Follow recap

  • First(A) = {w | A → * wB}
  • Follow(A) = {w | S → * BAwC}
  • Nullable(α) = {True | α →* ε}

1.2 Conflicts

  • For any two productions \(A\) → α and \(A\) → β, if \(First(\alpha)\) intersects with \(First(\beta)\), this is a first-first conflict. Example: A → ab | ac
  • For a non-Terminal \(A\), if \(First(A)\) intersects with \(Follow(A)\) and \(Nullable(A)\) = True, this is a first-follow conflict. Example:
    B → Ac
    A → c | ε

1.3 LL(1) Parsing using a pushdown automata

There are 3 components in this kind of parser:

  1. The input stream (automata input of course)
  2. prediction stack (the pushdown automata stack)
  3. Transition table (the transition function & states of the pushdown automata)

The transition table is a function from non-Terminal \(X\) and a token to derivation rules. It's very convenient to represent this function as a table.

There are 2 actions performed by an LL(1) parser:

  1. \(Prediction(A,t)\) - If the top of the stack is the non-terminal \(A\), perform a prediction:
    Pop \(A\), look up cell \([A, t]\) in the transition table (\(t\) is the current token in the input stream). If the cell is non empty, push the derivation onto the stack. Otherwise, there is a syntax error in the input string.
  2. \(Match(a,t)\) - If the top of the stack is a terminal \(a\), perform a match:
    If \((t == a)\) (\(t\) is the current token in the input stream), consume \(t\) (advance 1 token in the input stream). If \(t \neq a\), there is a syntax error in the input string.

Remember LL Parsers consume the input from left-to-right.

When the prediction stack is empty the parsing terminates. If the input is completely consumed then accept it, otherwise (there are input tokens left) reject it.

1.4 Running example:

1.4.1 Grammar

S -> A, A -> aAb | c

1.4.2 Input

aacbb$

1.4.3 Transition table

  a b c
A A -> aAb   A -> C

1.4.4 Parse

  Input suffix Stack content Action
1 aacbb$ A$ predict(A, a)
2 aacbb$ aAb$ match(a, a)
3 acbb$ Ab$ predict(A, a)
4 acbb$ aAbb$ match(a, a)
5 cbb$ Abb$ predict(A, c)
6 cbb$ cbb$ match(c, c)
7 bb$ bb$ match(b, b)
8 b$ b$ match(b, b)
9 $ $ match($, $)

1.5 Running example with illegal input (same table and grammar)

1.5.1 Input

abcbb$

1.5.2 Parse

  input suffix Stack content Action
1 abcbb$ A$ predict(A, a)
2 abcbb$ aAb$ match(a, a)
3 bcbb$ Ab$ predict(A, b)

As cell \((A, b)\) is empty, reject the input and throw a syntax error.

1.6 Constructing the transition table

  1. For each rule \(A\) → α
    1. For each \(t\) in \(first(\alpha )\) ∪ {\(Follow(A)\) | if α is nullable}
      1. if \(Table[A, t]\) is not empty, there is a conflict in the grammar. Return an error.
      2. \(Table[A, t]\) = \(A\) → α

2 LR(0) parsing

2.1 Terminology

  • Reduction - The opposite of derivation. For example given a production rule \(A\) → α, if we have a sentential form (a sequance of terminals and non-terminals) like \(a\alpha b\) then we can reduce it to \(aAb\).
  • Handle - A substring of terminals and non-terminals in the parsing process that is reduced.

When parsing a grammar \(G\) with an initial non-terminal \(S\), with an \(LR(k)\) parser we will actually work on the grammar \(G'\) where we create a new initial rule non-terminal \(S' \rightarrow S$\). This will make sure we have an end-of-input marker.

2.2 Reasoning

Until now we started with the initial nonterminal \(S\) and tried to derive the input from it. This seems a bit un-intuitive (first thing we do when parsing an input is to completely ignore that input). LR parsers work Bottom-Up - they read the input (the bottom of the parse tree) and try to figure out what was written there (the structure of the tree). Eventually the tree is reduced to the initial non-terminal, and if at that point we reduced all the input, then the input was valid.

2.3 Properties of LR(k)

  • \(LL(k) \subset LR(k)\) (note that \(LL(k) \neq LR(k)\)). This means that any LL(k) language is also an LR(k) language.
  • LR(k) have no problem with left recursion.
  • LR(0) parsers are commonly used for parsing computer languages (i.e. compiling).
  • Shift-reduce parsers are a very popular type of LR(0) parsers.

2.4 How LR(0) parsing works

We gather the input tokens (left-to-right) until we can reduce the suffix of the gathered tokens (remember rightmost derivation means we derive the suffix).

  • Gathering a token is called "shifting".
  • Reduction is the replacement of a sequence of tokens with the non-terminal that derives it.

2.5 Improper example

2.5.1 Grammar

E → E + (E) | id

Keep in mind that we are actually working on the grammar
S → E$
E → E + (E) | id

2.5.2 Input

a + (b + (c))

2.5.3 Parse

  input suffix Stack Action
1 a + (b + (c))$   Shift
2 + (b + (c))$ a Reduce
3 + (b + (c))$ E Shift
4 (b + (c))$ E + Shift
5 b + (c))$ E + ( Shift
6 + (c))$ E + (b Reduce
7 + (c))$ E + (E Shift
8 (c))$ E + (E + Shift
9 c))$ E + (E + ( Shift
10 ))$ E + (E + (c Reduce
11 ))$ E + (E + (E Shift
12 )$ E + (E + (E) Reduce
13 )$ E + (E Shift
14 $ E + (E) Reduce
15 $ E Shift
16   E$ Reduce
17   S Accept

If we were working on the original grammar we would have accepted at step 16, but the next 2 steps make the parser simpler to implement.

2.6 How Does a shift-reduce parser work

LR(0) parsers are still pushdown stack automata. The stack keeps track of current and previous handles, the states represent derivation rules that we are "trying to use" for further reducing (a way to keep track what substrings we gathers so far) and a transition table to tell it what to do given the current state and the symbol at top of stack.

Today we will construct the parsing table for an LR(0) parser.

  • States represent the derivation rule that we are building in order to reduce the LHS to its non-terminal. Also they tell us which tokens from the RHS did we already consume (in Left-To-Right order). For example, given this grammar:
    <Assign> → Id = <Exper>
    <Exper> → <Exper> + number | <Exper> - number | number

    If we already identified <Exper> we remember that we are still able to build the RHS of the derivations: <Exper> → <Exper> + number and <Exper> → <Exper> - number. And we also keep track of where we are in each of them and mark it (using • in this example):
    <Exper> → <Exper>• + number and <Exper> → <Exper>• - number

    Derivation rules with this marker are called \(LR(0)\) items. A state is usually a set of \(LR(0)\) items.

  • The stack is used to store partially identified RHS strings. In the example above, in steps 4 though 14 we used the stack to keep track at the partial RHS of the rule E → E + (E) until we shifted (and reduced) enough to reduce it. For this purpose the stack holds terminals, non-terminals and states that we can "return" to. Initially the stack contains just the init state (state 0).
  • The parsing table maps the terminals and non-terminals with the parser states to a parser action. There are 5 actions:
    1. accept - input syntax is valid, accept and finish
    2. error - input syntax is invalid, reject and finish. Empty cells are traditionally error cells
    3. sn - shift input and move to state \(n\). Also push shifted input and \(n\) into the stack (in that order)
    4. rk - reduce top of stack by rule \(k\). Pop RHS of rule \(k\) and move to state at top of stack push LHS of \(k\)
    5. gn - go to state \(n\) (push state \(n\)).

2.7 Proper parse example

2.7.1 Grammar

  1. S → E$
  2. E → E+(E)
  3. E → id

2.7.2 The LR(0) items (simply place a dot at point in every production)

  1. S → • E$
  2. S → E• $
  3. S → E$•
  4. E → • E+(E)
  5. E → E• +(E)
  6. E → E+• (E)
  7. E → E+(• E)
  8. E → E+(E• )
  9. E → E+(E)•
  10. E → • id
  11. E → id•

2.7.3 Creating states from Items

States are composed of closures constructed from items. Initially the only closure is {S → • E$}. Next, we construct the closure like so:
Closure(I) = Closure(I) ∪ {A → • α | B → β• Aγ ∈ I}

Basically, for a non-terminal \(A\) in \(I\) with a • before it, add all items of the form "A → • …".

Given our example (Initial = {S → • E$}) we create the following closure:
Closure({S → • E$ } = {S → • E$ , E → • E+(E), E → • id}

to create more closures we define a "goto" function that creates new closures. Given a closure \(I\) and a symbol \(a\) (terminal or non-terminal):
goto(I, a) = {B → α a• β | B→α • aβ ∈ I}

Basically, For every item in \(I\) that has a • before \(a\) we create a new closure by pushing the • one symbol forward. For instance, given our example closure and the symbol \(E\) we get:
goto({S → • E$, E → • E+(E), E → • id}, E) = {S → E• $, E → E• +(E)}

Now, for each of these items we create a closure and for each of those closures we create all possible goto sets. We keep going until there are no more new states (items that are not part of a closure).

Lets finish building the states:

automata.png

2.7.4 Creating the transition table

The table is index by state and symbol. We created the states already and the symbols are given by the grammar, now we need to create the action within the cells. The goto functions defines the transitions between the closures. Transition from state q1 to state q2 given symbol a \(\iff\) goto(closure(q1), a) = closure(q2).

  • If the • is at the end of the item, this is a reduction action.
  • If the symbol is a non-terminal, the action for the transition is a go-to.
  • If the symbol is a terminal, the action is a shift

Now we create the transition table:

      Actions       go-to actions
States a + ( ) $ S E
0 s1           g2
1 rIII rIII rIII rIII rIII    
2   s4     s3    
3 acc acc acc acc acc    
4     s5        
5 s1           g6
6   s4   s7      
7 rII rII rII rII rII    

2.7.5 Running the parser

Input: a + (a + (a))

  input suffix Stack Action
1 a + (a + (a))$ 0 s1
2 + (a + (a))$ 0a1 rIII
3 + (a + (a))$ 0E g2
4 + (a + (a))$ 0E2 s4
5 (a + (a))$ 0E2+4 s5
6 a + (a))$ 0E2+4(5 s1
7 + (a))$ 0E2+4(5a1 rIII
8 + (a))$ 0E2+4(5E g2
9 + (a))$ 0E2+4(5E2 s4
10 (a))$ 0E2+4(5E2+4 s5
11 a))$ 0E2+4(5E2+4(5 s1
12 ))$ 0E2+4(5E2+4(5a1 rIII
13 ))$ 0E2+4(5E2+4(5E g6
14 ))$ 0E2+4(5E2+4(5E6 s7
15 )$ 0E2+4(5E2+4(5E6)7 rII
16 )$ 0E2+4(5E g6
17 )$ 0E2+4(5E6 s7
18 $ 0E2+4(5E6)7 rII
19 $ 0E g2
20 $ 0E2 s3
21   0E2$3 accept

It is a matter of preference whether to add an extra state q8 to the automata that represents the final reduction of E$• to S, in which case an extra couple of parsing steps would be added:

21   0e2$3 rI
22   0S g8
23   0S8 accept

2.8 Conflicts

There are two kinds of conflicts we encounter

  1. Shift-reduce conflict - a state contains items that corespond to both reduce and shift actions
  2. Reduce-reduce conflict - a state has 2 different items coresponding to different reduce actions

2.9 Indications of a conflict

Any grammar with an ε derivation cannot be LR(0). This is because there is no input to reduce, so at any point that derivation rule can be used to reduce (add the rule's LHS non-terminal to the stack)

Some of these can be resolved, but we won't cover it here.

Author: Lior Zur-Lotan and Avi Hayoun

Created: 2014-12-18 Thu 16:22

Emacs 24.3.1 (Org mode 8.2.10)

Validate