Tirgul 9 Bottom-up parsing , Yacc


1.Bottom-Up Parsing

2.YACC


Bottom-Up Parsing

Many bottom-up techniques are availbale: LR(0), SLR(1), LR(1) and LALR(1).

Immediate advantage: no problem with left-recursion

As oppose to top-down parsers we construct the parse tree from the leaves, specificly, the leftmost node that was not constructed yet, but all of its children have been constructed, the sequence of children (or right hand side of a rule) is a handle, constructing a node N and connecting children is called reducing.

In order to search for a handle, we keep an item-set, each item is a hypothesis (or option), an LR item N→A()B means that AB is a possible handle, and will be reduced if indeed it is applicable, when the dot reach the end of the rule AB() then we have identified a handle. An item with the dot at the end is called a reduce item.

To parse, we first compute the initial set and than from each option the next set. The initial set is computed from the tree top – the start symbol (or state #1), we compute the set of rules (adding the dot at the proper position) that can be reached using an epsilon-move (epsilon closure algorithm).

The parse table has a row for each of the states (always a finite number) in which parser may be, and a column for each terminal and nonterminal in the grammar (in every grammar $ is present), it is a combination of the GOTO table (which determine the next state for a shift) and the ACTION table which determine the production rule to reduce.

At the start of the parsing process, the parser is in the state 1 with the first symbol in the input stream as the input table. Each parsing step will be driven by the table entry identified by the current state and the input symbol. A table entry may be of one of two types:

  1. A shift entry of the form Sm will cause the parser to perform a shift action and to change state to the state m.

  1. A reduce entry of the form Rn will case the parser to perform a reduce action,using production n.

Given grammar, the steps for forming parse table are

States Assignment

The basic rules are:

  1. If state i appears before non terminal X, it must be added at the beginning of each production of X

  2. If we have the following assignment isymbolj (symbol may be both terminal and non terminal, the meaning of the statement is: if parser is in state i and reads from input symbol, it shifts to state j), then if in any other production there is an assignment isymbol, we'll add state j after the symbol. (Pay attention that this assignment is "corresponding", e.g. if there is i,ksymbolj, only if parser was in state i will it shift to state to state j, and there will be another state to shift to for k)

Example Of State Assignment

Given grammar:

1.S->XY
2.X->x
3.X->xX
4.Y->y
5.Y->yY

The parser is at the state 1 at the beginning:

1.S->1XY
2.X->x
3.X->xX
4.Y->y
5.Y->yY

The rule 1 is applied for X:

1.S->1XY
2.X->1x
3.X->1xX
4.Y->y
5.Y->yY

We are adding new state in production 1:

1.S->1X2Y
2.X->1x
3.X->1xX
4.Y->y
5.Y->yY

The rule 2 is applied for Y now:

1.S->1X2Y
2.X->1x
3.X->1xX
4.Y->2y
5.Y->2yY

We are adding new state in production 1:

1.S->1X2Y3
2.X->1x
3.X->1xX
4.Y->2y
5.Y->2yY

We are adding a new state in production 2, the rule 2 can be applied in production 3:

1.S->1X2Y3
2.X->1x4
3.X->1x4X
4.Y->2y
5.Y->2yY

In production 3 we see  state  4 before X, rule 1 can be applied:

1.S->1X2Y3
2.X->1,4x4
3.X->1,4x4X
4.Y->2y
5.Y->2yY

We are adding new state in production 3:

1.S->1X2Y3
2.X->1,4x4
3.X->1,4x4X5
4.Y->2y
5.Y->2yY

We are adding new state in production 4:

1.S->1X2Y3
2.X->1,4x4
3.X->1,4x4X5
4.Y->2y6
5.Y->2y6Y

The rule 2 is applied in production 5:

1.S->1X2Y3
2.X->1,4x4
3.X->1,4x4X5
4.Y->2,6y6
5.Y->2,6y6Y

We are adding new state in production 5:

1.S->1X2Y3
2.X->1,4x4
3.X->1,4x4X5
4.Y->2,6y6
5.Y->2,6y6Y7

CFSM Construction

Building the state machine we encounter conflicts:

To resolve, we compute FOLLOW(N) for each rule item.

CFSM automata .

 

Parse table

 

X

Y

x

y

$

1

S2

 

S4

 

 

2

 

S3

 

S6

 

3

R1

R1

R1

R1

R1

4

S5

 

S4

R2

 

5

R3

R3

R3

R3

R3

6

 

S7

 

S6

R4

7

R5

R5

R5

R5

R5



There are shift-reduse conflicts in states 4 and 6, they can be resolved only by checking lookahead. In order to do so,we are checking follow of the non terminal on the left side of the production that might be reduced.In state 4 we can make either reduction of production 2 or see X. If next token (lookahead) is y (follow(X)=first(Y)={y}), we'll make R2, if we'll see X we'll make S5. In state 6 we can make either reduction of production 4 or see Y; If next token (lookahead) is $ (follow(Y)={$}) we'll make R4, if we'll see Y we'll make S7. Therefor, this language is not LR(0), but LR(1) - lookahead of 1 is enough.



LR(1) costs:

LR1 parsing table is larger by a one or two orders than SLR(1) and LR(0), i.e: megabytes are needed. Solution: LALR(1).


YACC

YACC (Yet Another Compiler-Compiler) produces LP parser from any LALR(1) (lookahead LR(1)) grammar. It's fully compatible with Lex. Lex is used to recognize regular expressions. YACC uses Lex tokens as his input and recognizes grammars. Yacc uses bottom-up parsing to recognize grammars.

YACC input file is always of the form:

declarations
%%
rules
%%
user-defined functions

*declarations and user-defined functions sections may be empty.

Arithmetic Expressions Grammar Example

%token NUMBER
%left '+' '--
%left '*' '/'
%%
expr : expr '+' expr |
 expr '-' expr |
 expr '*' expr |
 expr '/' expr |
 '(' expr ')' |
 NUMBER
 ;

Terminals are surrounded by single quotes and the : is used in place of ->. The | is used to separate alternative right sides of productions. The first line of the input above defines + and - to have the same level of precedence, since they appear on the same line, and the second line does the same for * and /. In addition, since * and / are on lower line that + and -, they are defined to have a higher level of precedence that + and -. The occurrence of left before each operator indicates that they associate from the left, e.g. 3+4+5 should be evaluated as ((3+4)+5). The grammar rules, on their own, are ambiguous, for example there is more that one derivation of 3+4+5, but the associativity and precedence rules resolve the ambiguities completely.

Another example:

%token NUMBER
%left '+' '--
%left '*' '/'
%left UMINUS
%%
s : expr {printf("%d\n",$1);}
expr : expr '+' expr {$$=$1+$3;}
 |expr '-' expr {$$=$1-$3;}
 |expr '*' expr {$$=$1*$3;}
 |expr '/' expr {if ($3==0) yyerror ("divide by 0\n"); else $$=$1/$3;}
 |'-' expr %prec UMINUS {$$=-$2;} 
|'(' expr ')' {$$=$2;}
 |NUMBER
 ;
%%
#include "lex.yy.c"
void yyerror(char* s){printf("%s\n",s);}
int main(){return yyparse();}

As for Lex,the actions are written in C and usually occur at the end of syntax rules, thus corresponding to reductions. For instance:

expr: expr '+' expr {$$=$1+$3;}

The C code is "bold"; $n is a numerical value (an attribute) associated with n-th symbol on the right side of production; $$ is a numerical value to be associated with the symbol on the left side.
Finding the values at the terminal nodes of the syntax tree is a matter for the lexical analyzer. To link in with the lexical analyzer produced by Lex, lex.yy.c is "included". The user should supply main and yyerror (if he wants to use it). Terminals in the grammar that don't correspond to single characters are "declared" as %token. This provides the link with the lexical analyzer.

The input to Lex for the expression evaluator would be

%%
[0-9]+ {yylval = atoi(yytext); return NUMBER;
[\t] ; /*ignore whitespaces*/
 . return yytext[0];

Compiling

Assuming Lex file is nums.l and YACC file - nums.y the folowing is required to generate the parser:

flex nums.l 
bison nums.y
cc -o nums y.tab.c -ll -ly

The parser is prodused into nums, -ll and -ly ensure that the Lex and YACC libraries are included. Preferably use makefile as in big example.

Debugging YACC

If you give YACC -v flag, the file y.output will be created each time you run the program. In y.output you'll parse table, conflicts taht occured while running program on your input etc.


Big example files:
header.h
lexical.l
grammar.y
main.c
input example
internal Yacc parse table (y.output)
makefile

A Preview on Lex and Yacc