Difference between revisions of "Theoretical Aspects of Lexical Analysis"

Revision as of 22:52, 15 March 2008

Regular Expressions

Regular expressions are defined considering a finite alphabet Σ = { a, b, ..., c } and the empty string ε:

The languages (sets of strings) for each of these entities are:

{ε}, for ε
{a}, for an entry a in Σ

The following primitive constructors are defined:

concatenation
alternative
Kleene-star (*)

Extensions (derived from the above):

Transitive closure (+) - a+ ("one or more 'a'")
Optionality (?) - a? ("zero or one 'a'")
Character classes - [a-z] ("all chars in the 'a-z' range" - only one character is matched)
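
The claim that the extensions derive from the primitives can be spot-checked with a short sketch. It uses Python's standard `re` module purely as a convenient matcher; the helper `same_language` and the sample list are illustrative names, not part of the original text, and comparing on a handful of samples is a sanity check rather than a proof of equivalence.

```python
import re

# Each extension paired with an equivalent written only with the primitives
# (concatenation, alternative, Kleene star).
EQUIVALENTS = [
    (r"a+", r"aa*"),         # one or more: 'a' followed by zero or more 'a'
    (r"a?", r"(a|)"),        # zero or one: 'a' or the empty string
    (r"[a-c]", r"(a|b|c)"),  # character class: alternative over the range
]

# Illustrative sample strings (an assumption, chosen by hand).
SAMPLES = ["", "a", "aa", "aaa", "b", "c", "ab", "d"]


def same_language(p, q, samples):
    """True if p and q accept exactly the same strings among the samples."""
    return all(
        bool(re.fullmatch(p, s)) == bool(re.fullmatch(q, s)) for s in samples
    )


for extended, primitive in EQUIVALENTS:
    print(extended, primitive, same_language(extended, primitive, SAMPLES))
```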

Recognizing/Matching Regular Expressions: Thompson's Algorithm

Since we are going to use sets of regular expressions for recognizing input strings, we need a way of implementing that functionality. The recognition process can be efficiently carried out by finite state automata that either accept or reject a given string.

Ken Thompson, the creator of the B language (one of the predecessors of C) and one of the creators of the UNIX operating system, devised the algorithm that carries his name and describes how to build an acceptor for a given regular expression.

Created for Thompson's implementation of the grep UNIX command, the algorithm creates an NFA from a regular expression specification that can then be converted into a DFA. It is this DFA that after minimization yields an automaton that is an acceptor for the original expression.

The following sections cover the algorithm's construction primitives and how to recognize a simple expression. Lexical analysis, as performed by flex, requires several expressions to be watched for, each one corresponding to a token. Such automata feature multiple final states, one or more for each recognized expression.

Building the NFA: Thompson's Algorithm

Thompson's algorithm is based on a few primitives, as shown in the following table:

Example	Diagram	Meaning
		Empty expression (in the following diagrams, empty expressions will be represented by unlabeled edges).
		One occurrence of an expression.
		Zero or more occurrences of an expression: this case may be generalized for more complex expressions. In this case, the complex expression will simply take the place of the arc in the diagram.
		Concatenation of two or more expressions: the first expression's final state coincides with the second's initial state. This case, like the previous one, may be generalized to describe more complex concatenations.
		Alternative expressions: the two initial states and the two final states of the expressions are connected to two new states. Both expressions may be replaced by more general cases.

Complex expressions are built from these primitives. The following diagram corresponds to the expression a(a|b)*|c (note how the Kleene-star operator affects an alternative group):
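
As a rough illustration of how the primitives compose, the following sketch builds an NFA for a(a|b)*|c from small fragments. All names (`Fragment`, `symbol`, `concat`, `alt`, `star`, `accepts`) are hypothetical, and the state numbers it generates will not match those of the diagram. One deliberate simplification: `concat` links the two fragments with an empty edge instead of merging the first fragment's final state with the second's initial state, which accepts the same language.

```python
import itertools

_ids = itertools.count(1)  # fresh state numbers (not the diagram's numbering)
EPS = None                 # label used for unlabeled (empty) edges


class Fragment:
    """An NFA fragment with a single initial state and a single final state."""

    def __init__(self, start, final, edges):
        self.start, self.final = start, final
        self.edges = edges  # list of (source, label, destination)


def symbol(ch):
    """One occurrence of a symbol."""
    s, f = next(_ids), next(_ids)
    return Fragment(s, f, [(s, ch, f)])


def concat(x, y):
    """Concatenation; an empty edge stands in for merging the two states."""
    return Fragment(x.start, y.final, x.edges + y.edges + [(x.final, EPS, y.start)])


def alt(x, y):
    """Alternative: two new states joined to both fragments by empty edges."""
    s, f = next(_ids), next(_ids)
    new = [(s, EPS, x.start), (s, EPS, y.start), (x.final, EPS, f), (y.final, EPS, f)]
    return Fragment(s, f, x.edges + y.edges + new)


def star(x):
    """Zero or more occurrences: skip edge plus loop-back edge."""
    s, f = next(_ids), next(_ids)
    new = [(s, EPS, x.start), (s, EPS, f), (x.final, EPS, x.start), (x.final, EPS, f)]
    return Fragment(s, f, x.edges + new)


def accepts(frag, text):
    """Simulate the NFA by tracking the set of currently active states."""
    def closure(states):
        seen, stack = set(states), list(states)
        while stack:
            q = stack.pop()
            for src, label, dst in frag.edges:
                if src == q and label is EPS and dst not in seen:
                    seen.add(dst)
                    stack.append(dst)
        return seen

    current = closure({frag.start})
    for ch in text:
        step = {dst for src, label, dst in frag.edges if src in current and label == ch}
        current = closure(step)
    return frag.final in current


# a(a|b)*|c
nfa = alt(concat(symbol("a"), star(alt(symbol("a"), symbol("b")))), symbol("c"))
print([w for w in ["a", "aab", "c", "", "b", "ac"] if accepts(nfa, w)])
```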

Building DFAs from NFAs

NFAs are not well suited for computers to work with, since each state may have multiple acceptable conditions for transitioning to another state (searching and backtracking would be needed to use an NFA directly). Thus, it is necessary to transform the automaton so that each state has a single transition for each possible condition. This process is called determinization. The algorithm for transforming an NFA into a DFA is a simple one and relies on two primitive functions, move and ε-closure.

The move function is defined over pairs made of a set of NFA states and an input symbol, and yields a set of NFA states: for each state in the set and the given input symbol, it computes the set of reachable states. As an example, consider the NFA in the previous diagram:

move({2}, a) = {3}
move({5}, a) = {6}
move({11}, a) = {}

The ε-closure function is defined for sets of states: it computes the set of states reachable from the initial set by using only ε transitions. The result includes each state of the initial set itself, as well as the states reachable through further ε transitions from those states. Thus, considering the previous NFA, we could write:

ε-closure({1}) = {1, 2, 11}
ε-closure(move({2}, a)) = ε-closure({3}) = {3, 4, 5, 7, 10, 13}
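
The two functions can be sketched in a few lines. Since the NFA diagram is not reproduced here, the transition tables below are a reconstruction from the examples above and from the determinization table later in this section; treat them, and the names `EPSILON`, `LABELED`, `move` and `epsilon_closure`, as assumptions rather than as the original automaton.

```python
# NFA for a(a|b)*|c, reconstructed from this section's examples (assumption).
# EPSILON maps a state to the states reachable through one empty transition.
EPSILON = {
    1: {2, 11}, 3: {4, 10}, 4: {5, 7}, 6: {9},
    8: {9}, 9: {4, 10}, 10: {13}, 12: {13},
}
# LABELED maps (state, symbol) to the states reachable by consuming the symbol.
LABELED = {(2, "a"): {3}, (5, "a"): {6}, (7, "b"): {8}, (11, "c"): {12}}


def move(states, symbol):
    """States reachable from any state in `states` by consuming `symbol`."""
    result = set()
    for q in states:
        result |= LABELED.get((q, symbol), set())
    return result


def epsilon_closure(states):
    """`states` plus everything reachable through empty transitions only."""
    closure = set(states)
    pending = list(states)
    while pending:
        q = pending.pop()
        for nxt in EPSILON.get(q, ()):
            if nxt not in closure:
                closure.add(nxt)
                pending.append(nxt)
    return closure


print(sorted(epsilon_closure({1})))
print(sorted(epsilon_closure(move({2}, "a"))))
```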

With the two functions above we can describe a determinization algorithm. The input for the determinization algorithm is a set of NFA states and their corresponding transitions, a distinguished start state, and a set of final states. The output is a set of DFA states (as well as the configuration of NFA states corresponding to each DFA state), a distinguished start state, and a set of final states.

The algorithm considers an agenda containing pairs of DFA states and input symbols. Each pair corresponds to a possible transition in the DFA (possible in the sense that it may not exist). Each new state, obtained from considering successful transitions from agenda pairs, must be considered as well with each input symbol. The algorithm ends when no more pairs exist in the agenda and no more can be added.

DFA states whose configurations contain final NFA states are also final.

Step 1: Compute the ε-closure of the NFA's start state. The resulting set will be the DFA's start state, I₀. Add all pairs (I₀, α), for every α ∈ Σ (with Σ the input alphabet), to the agenda.

Step 2: For each unprocessed pair (I_n, α) in the agenda, remove it from the agenda and compute ε-closure(move(I_n, α)): if the resulting configuration, I_n+1, is not a known one (i.e., it is different from all I_k, for every k < n+1), add the corresponding pairs to the agenda.

Step 3: Repeat step 2 until the agenda is empty.
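
The three steps can be sketched as follows. This is a minimal illustration, not a reference implementation: the NFA tables are a reconstruction from this section's examples (the diagram is not reproduced), NFA state 13 is assumed to be the only final state, DFA states are numbered from 1 in order of discovery, and the agenda is processed first-in first-out with the alphabet in the order a, b, c. The names `determinize`, `names`, `agenda` and the constants are hypothetical.

```python
from collections import deque

# Reconstructed NFA for a(a|b)*|c (assumption, see lead-in).
EPSILON = {
    1: {2, 11}, 3: {4, 10}, 4: {5, 7}, 6: {9},
    8: {9}, 9: {4, 10}, 10: {13}, 12: {13},
}
LABELED = {(2, "a"): {3}, (5, "a"): {6}, (7, "b"): {8}, (11, "c"): {12}}
ALPHABET = "abc"
NFA_START, NFA_FINAL = 1, {13}  # final state is an assumption


def move(states, symbol):
    result = set()
    for q in states:
        result |= LABELED.get((q, symbol), set())
    return result


def epsilon_closure(states):
    closure, pending = set(states), list(states)
    while pending:
        for nxt in EPSILON.get(pending.pop(), ()):
            if nxt not in closure:
                closure.add(nxt)
                pending.append(nxt)
    return closure


def determinize():
    # Step 1: the DFA start state is the closure of the NFA start state.
    start = frozenset(epsilon_closure({NFA_START}))
    names = {start: 1}  # NFA configuration -> DFA state number
    agenda = deque((start, a) for a in ALPHABET)
    transitions = {}
    # Steps 2 and 3: process pairs until the agenda is empty.
    while agenda:
        config, symbol = agenda.popleft()
        target = frozenset(epsilon_closure(move(config, symbol)))
        if not target:
            continue  # no transition exists for this pair
        if target not in names:
            names[target] = len(names) + 1
            agenda.extend((target, a) for a in ALPHABET)
        transitions[(names[config], symbol)] = names[target]
    # A DFA state is final if its configuration contains a final NFA state.
    finals = {n for config, n in names.items() if config & NFA_FINAL}
    return names, transitions, finals


names, transitions, finals = determinize()
for config, n in sorted(names.items(), key=lambda item: item[1]):
    print(n, sorted(config))
```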

The algorithm's steps can be tabulated (see below): Σ = {a, b, c} is the input alphabet; α ∈ Σ is an input symbol; and I_n+1 = ε-closure(move(I_n, α)). Numbers in bold face correspond to final states.

I_n	α ∈ Σ	move(I_n, α)	I_n+1 − move(I_n, α)	I_n+1 = ε-closure(move(I_n, α))
–	–	1	2, 11	1
1	a	3	4, 5, 7, 10, 13	2
1	b	–	–	–
1	c	12	13	3
2	a	6	4, 5, 7, 9, 10, 13	4
2	b	8	4, 5, 7, 9, 10, 13	5
2	c	–	–	–
3	a	–	–	–
3	b	–	–	–
3	c	–	–	–
4	a	6	4, 5, 7, 9, 10, 13	4
4	b	8	4, 5, 7, 9, 10, 13	5
4	c	–	–	–
5	a	6	4, 5, 7, 9, 10, 13	4
5	b	8	4, 5, 7, 9, 10, 13	5
5	c	–	–	–

The graph representation of the DFA computed in accordance with the determinization algorithm is presented below (the right part of the figure presents a simplified view). The numbers correspond to DFA states whose NFA state configurations are presented above.
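
To make the resulting automaton concrete, here is a small sketch that runs the DFA from the table on a few strings and cross-checks the verdicts against Python's `re` module. The transitions are copied from the table; the set of final states (2, 3, 4 and 5) is an assumption based on NFA state 13 being final, and `dfa_accepts` is a hypothetical helper name.

```python
import re

# Transitions copied from the determinization table: (state, symbol) -> state.
TRANSITIONS = {
    (1, "a"): 2, (1, "c"): 3,
    (2, "a"): 4, (2, "b"): 5,
    (4, "a"): 4, (4, "b"): 5,
    (5, "a"): 4, (5, "b"): 5,
}
START = 1
FINAL = {2, 3, 4, 5}  # assumption: every configuration containing NFA state 13


def dfa_accepts(text):
    """Follow exactly one transition per symbol; reject if none exists."""
    state = START
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in FINAL


for word in ["a", "abab", "c", "", "b", "ca"]:
    expected = re.fullmatch(r"a(a|b)*|c", word) is not None
    print(repr(word), dfa_accepts(word), expected)
```

Unlike the NFA, no set of active states and no backtracking is needed: each input symbol costs a single table lookup.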

DFA Minimization

Input Processing

Recognizing Multiple Expressions

Example 1: Ambiguous Expressions

Example 2: Backtracking
