Bottom-Up Parsing

From Wiki**3

Compiladores
Introdução ao Desenvolvimento de Compiladores
Aspectos Teóricos de Análise Lexical
A Ferramenta Flex
Introdução à Sintaxe
Análise Sintáctica Descendente
Gramáticas Atributivas
A Ferramenta YACC
Análise Sintáctica Ascendente
Análise Semântica
Geração de Código
Tópicos de Optimização

Bottom-up parsers start from the input tokens and try to build increasingly complex structures until the initial symbol is reached.

The bottom-up parsers covered in this section are the ones belonging to the LR family (left-right rightmost derivation). These parsers are also known as shift/reduce parsers, after the actions they perform (shift = input consumption; reduce = node building). The parsers covered are LR(0) (a simple parser without any kind of lookahead), SLR (a parser that builds only when the input matches the elements in the follow set of a rule being reduced), and LALR(1) (a parser that restricts the follow symbols that can be matched in each state).

The sections below describe, first, the LR(0) and SLR(1) algorithms, and then, the LALR(1) algorithm. All three parsers share the same "shift" and "goto" entries, but differ in the restrictions put on "reduce" actions.

If the parse table contains conflicts, the grammar is said not to belong to that class, e.g., a grammar is said to be SLR(1) if its SLR(1) parse table does not contain any conflicts.

Building SLR Parse Tables

For building SLR tables we will define the notion of LR(0) item.

An LR(0) item is, informally, a production with a dot (•). The dot represents the portion of the rule that was seen by the parser so far. When the dot crosses over a symbol (advancing its position within the rule), the parser has performed a shift action (if the symbol is a terminal) or a reduce (if the symbol is a non-terminal).

For instance:

 E → • A B C
 E → A • B C

Internally, the parser may use a pair of integers for representing an item (one for the rule, another for the dot's position).

For each production in a grammar, there is a set of LR(0) items, each corresponding to a possible position of the dot.

The main idea is to build an automaton that recognizes viable prefixes from the input. This automaton may be thought of as an NFA in which every single LR(0) item and corresponding transition is considered. This NFA could then be converted into a DFA and minimized. This process is, however, too long and, since a direct approach exists, unnecessary.

The following steps describe the direct construction of the DFA.

The first step is to build the augmented grammar. This is a grammar like the original, but with an additional rule S' (S is the start symbol):

 S'→ S $

This rule indicates that when S (the start symbol) is reduced, then the input should be accepted, i.e., we would go from item S'→•S$ to item S'→S•$.

Like for lexical analyzers, we will define two primitive functions: ε-closure and goto (similar to "move"), that will allow us to describe an algorithm for building the DFA.

ε-closure

Consider I a set of LR(0) items. The following conditions apply to rules in ε-closure(I):

  1. Initially, all elements of I are in ε-closure(I)
  2. If A→α•Bβ is in ε-closure(I) and B→γ is a production, the add B→•γ to ε-closure(I)

(repeat until the ε-closure(I) is unchanged)

Example:

Consider the following grammar (rule 0 corresponds to the augmented grammar):

0) E' → E $
1) E  → E + T
2) E  → T
3) T  → T * F
4) T  → F
5) F  → ( E )
6) F  → id

Then:

 I = {[E'→•E$]}
 ε-closure(I) = {[E'→•E$], [E→•E+T], [E→•T], [T→•T*F], [T→•F], [F→•(E)], [F→•id]}

Goto

The goto function is formally defined as goto(I,X) (I is a set of LR(0) items, and X is a grammar symbol).

 goto(I,X) = ε-closure({[A→αX•β]: ∀[A→α•Xβ]∈I})

Informally, if γ was a viable prefix for the elements of I, then γX is a viable prefix for the elements of goto(I,X).

Example:

 I = {[E'→E•$], [E→E•+T]}
 goto(I,+) = ε-closure({[E→E+•T]}) = {[E→E+•T], [T→•T*F], [T→•F], [F→•(E)], [F→•id]}

Building the DFA

The DFA (a set of sets) starts by containing a single element, consisting of the set obtained by computing the ε-closure of the singular set {[S'→•S$]} (S is the start symbol):

 DFA = { ε-closure({[S'→•S$]}) }

Then, repeat until it is not possible to add any further elements: for each element (set) I in the DFA and for each grammar symbol X, add goto(I,X) (if not empty) to the DFA.

Example:

Considering the grammar above, the initial configuration is:

 DFA = {I0} = {ε-closure({[E'→•E$]})}

With:

 I0 = E'→ •E$
      E → •E+T
      E → •T
      T → •T*F
      T → •F
      F → •(E)
      F → •id

Then, applying the goto function as described above (essentially, we will look only to the symbols immediately in front of the dot), we get the following sets in the DFA (its states):

I0 = E' → •E$    I1 = E' → E•$    I5 = F → (•E)   I6 = F → id•    I4 = T → T*•F   I11 = E → E+T•
     E  → •E+T        E  → E•+T        E → •E+T                        F → •(E)         T → T•*F
     E  → •T                           E → •T     I2 = E → E+•T        F → •id
     T  → •T*F   I3 = E  → T•          T → •T*F        T → •T*F                   I8  = T → T*F•
     T  → •F          T  → T•*F        T → •F          T → •F     I9 = F → (E•)
     F  → •(E)                         F → •(E)        F → •(E)        E → E•+T   I10 = F → (E)•
     F  → •id    I7 = T  → F•          F → •id         F → •id


Graphically:

Dfaslr.png

The items above the red line correspond to kernels, i.e., the items that, in each state, characterize the state, since they are not derivable from any other items in the same state.

State 1 is the accepting state.

The Parse Table

The SLR(1) parse table is built, as mentioned above, based on the transitions of the DFA and on the FOLLOWs associated with reduced non-terminals (in each state where reductions occur -- the dot is at the end of the LR(0) item, meaning that all of its components have been either seen or reduced and they are available on the parser's stack).

The parse table has two zones, that associate states and symbols: action (shifts, reduces, accept), defined over states and terminal symbols (as well as with the end of phrase); and goto (defined over states and non-terminal symbols; associated with transitions after reducing non-terminals).

Assuming that the DFA has been built and, thus, the I0 ... In states are known, the following steps produce the parse table:

  1. State i is corresponds to Ii; the corresponding actions are:
    • If [A→α•aβ] (a is a terminal) is in Ii and goto(Ii,a) = Ij, then define action[i,a] = shift j
    • If [A→α•] is in Ii, then define action[i,a] = reduce A→α for all α ∈ FOLLOW(A) (A≠S', S' being the start symbol)
    • If [S'→S•$] (S' being the start symbol) is in Ii, then define action[i,$] = ACCEPT
  2. For state i and A a non-terminal symbol, define goto(i,A) = j if goto(Ii, A) = Ij

All cells left empty correspond to syntax errors.

If step 1 above produces cells with multiple values, i.e., a shift and a reduce or more than one reduce, then the parse table has action conflicts and the grammar is said not to be SLR(1). A grammar is said to be SLR(1) if an SLR(1) parser can be built, that is, if there are no conflicts in the parse table.

Considering our example grammar, the following parse table would result from applying the steps above:

Bupslrtable.png

Building LALR(1) Parse Tables

The following video presents the complete process of building a parser.

The grammar is the one from Exercise 9: LALR(1) (also below).

Compacting Parse Tables

The approaches to parse table compaction described here focus on two aspects: eliminating states that only perform a single reduction and eliminating multiple repeating entries in parse tables.

The compaction operations assume that the parse table does not have conflicts. If conflicts exist, they must be solved first (e.g., following YACC: shift-reduce conflicts select the shift operation; reduce-reduce conflicts select the reduce corresponding to the rule that appear first in the grammar).

Eliminating single-reduction states

The first aspect is one in which, either following a shift or a goto operation, the parser ends up in a state where the only possible operation is a reduce. This means that the parser, in the previous step pushed something on the stack (a terminal and a state, in the case of a shift; a state, in the case of a goto) and, in the current step, is going to pop it. That is, the previous operation is, in a certain sense, undone by the reduction, wasting time. Thus, instead of doing these two operations in sequence, we describe two new ones that perform a lookahead peek and perform a composite operation instead.

Thus, in the case of a shift from a state to another one followed by a reduction, the parser will instead use the stack and the input to perform a composite "shift" (removing the terminal from the input, as in a regular shift, but without changing to a new state) plus reduce (using the entries at the top of the stack plus the one from the input) operation.

In the second case, i.e., a goto followed by a reduce, instead of going to a new state, the composite operation will simply perform a second reduction: the one present in the previous destination state.

Eliminating multiple reduction entries

The other compaction technique is used to eliminate multiple (frequent) reduction entries in a given state: instead of having a dense parse table in which many cells have descriptions, the most frequent reduction is moved to a special column, called "default", that describes the parser's action in that state. Entries in that state different from the default reduction are not changed.

This form of compaction, in practice, reproduces the behaviour of LR(0) parsers: the parser will perform reductions even when the default was not applicable. This means that, instead of a syntax error, the parser will perform a reduction. This is not actually a problem, because the reduction is actually incorrect (not before a member of the FOLLOW set) and a syntax error is guaranteed in the next step (in this case, this compaction operation trades savings in space for time of analysis).

Example

The following example (unrelated to the one above) shows the two tables (full and compacted).

Compact-slr-table.png

Exercises: SLR(1)

Exercises: LALR(1)