Theoretical Aspects of Lexical Analysis

Lexical analysis, the first step in the compilation process, splits the input into segments and classifies them: each segment (a lexeme) is assigned a label (its token). For instance, the lexeme 42 might be labelled with a token such as INT.

In this case, regular expressions are used to recognize portions of the input text.

Regular Expressions

Regular expressions are defined over an alphabet { a, b, ..., c } and the empty string eps. The atomic expressions are:

  • eps ("the empty string")
  • a ("the symbol 'a'", for each symbol in the alphabet)

Primitive constructors:

  • concatenation - rs ("r followed by s")
  • alternative - r|s ("either r or s")
  • Kleene star (*) - a* ("zero or more 'a'")

Extensions:

  • Transitive closure (+) - a+ ("one or more 'a'")
  • Optionality (?) - a? ("zero or one 'a'")
  • Character classes - [a-z] ("any character in the 'a-z' range"; matches exactly one character)
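
These extensions add convenience but no expressive power: each one can be rewritten using only the primitive constructors. The sketch below illustrates this in Python; the class and helper names (Eps, Sym, Concat, Alt, Star, plus, opt, char_class) are assumptions of the sketch, not notation used elsewhere on this page.

  from dataclasses import dataclass

  @dataclass
  class Eps:                 # eps: the empty string
      pass

  @dataclass
  class Sym:                 # a: a single alphabet symbol
      ch: str

  @dataclass
  class Concat:              # rs: r followed by s
      left: object
      right: object

  @dataclass
  class Alt:                 # r|s: either r or s
      left: object
      right: object

  @dataclass
  class Star:                # r*: zero or more repetitions of r
      inner: object

  def plus(r):               # r+  ==  r r*   ("one or more")
      return Concat(r, Star(r))

  def opt(r):                # r?  ==  r | eps ("zero or one")
      return Alt(r, Eps())

  def char_class(lo, hi):    # [a-z]  ==  a | b | ... | z (one character)
      expr = Sym(lo)
      for code in range(ord(lo) + 1, ord(hi) + 1):
          expr = Alt(expr, Sym(chr(code)))
      return expr

With this encoding, a?b+ becomes Concat(opt(Sym("a")), plus(Sym("b"))).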

Recognizing Regular Expressions

Building the NFA: Thompson's Algorithm
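
Thompson's algorithm builds the NFA by structural induction on the expression: each subexpression yields a fragment with exactly one start and one accepting state, and fragments are glued together with eps-transitions. A minimal sketch, assuming regexes encoded as nested tuples and transitions stored as a (state, symbol) -> set-of-states map, with None standing for eps:

  from itertools import count

  def thompson(rx, trans, fresh):
      """Return (start, accept) states of an NFA fragment for rx."""
      def edge(q, a, r):
          trans.setdefault((q, a), set()).add(r)
      s, f = next(fresh), next(fresh)
      kind = rx[0]
      if kind == "eps":                      # s --eps--> f
          edge(s, None, f)
      elif kind == "sym":                    # s --a--> f
          edge(s, rx[1], f)
      elif kind == "cat":                    # fragment of r, then of s
          s1, f1 = thompson(rx[1], trans, fresh)
          s2, f2 = thompson(rx[2], trans, fresh)
          edge(s, None, s1); edge(f1, None, s2); edge(f2, None, f)
      elif kind == "alt":                    # branch into both fragments
          for sub in rx[1:]:
              si, fi = thompson(sub, trans, fresh)
              edge(s, None, si); edge(fi, None, f)
      elif kind == "star":                   # loop back, or skip entirely
          si, fi = thompson(rx[1], trans, fresh)
          edge(s, None, si); edge(s, None, f)
          edge(fi, None, si); edge(fi, None, f)
      return s, f

  trans = {}
  start, accept = thompson(("cat", ("sym", "a"), ("star", ("sym", "b"))),
                           trans, count())   # NFA for ab*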

Building DFAs from NFAs
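
The subset construction turns the NFA into a DFA whose states are sets of NFA states: the DFA state reached after reading some prefix is the eps-closure of every NFA state that prefix can lead to. A sketch, with a hand-built toy NFA for a*b as an illustrative assumption (same transition encoding as above):

  NFA = {(0, None): {1, 2}, (1, "a"): {0}, (2, "b"): {3}}   # a*b
  START, FINAL = 0, {3}

  def eclose(states):
      """Follow eps (None) edges until no new state is reachable."""
      seen, stack = set(states), list(states)
      while stack:
          for r in NFA.get((stack.pop(), None), ()):
              if r not in seen:
                  seen.add(r)
                  stack.append(r)
      return frozenset(seen)

  def subset_construction(alphabet="ab"):
      start = eclose({START})
      seen, work, delta = {start}, [start], {}
      while work:
          S = work.pop()
          for a in alphabet:
              T = eclose({r for q in S for r in NFA.get((q, a), ())})
              if T:
                  delta[(S, a)] = T
                  if T not in seen:        # newly discovered DFA state
                      seen.add(T)
                      work.append(T)
      accepting = {S for S in seen if S & FINAL}
      return start, delta, accepting

For a*b this produces just two DFA states: {0, 1, 2} (the start) and the accepting {3}.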

DFA Minimization
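
The DFA obtained this way need not be minimal: states that no input string can tell apart should be merged. A standard method is partition refinement: start from the accepting/non-accepting split and repeatedly split any block whose members disagree, on some symbol, about the block they move to; when nothing splits, each block is one state of the minimal DFA. A sketch over a hand-built four-state DFA (an assumption; its states 2 and 3 turn out to be equivalent):

  DELTA = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 1, (1, "b"): 0,
           (2, "a"): 1, (2, "b"): 3, (3, "a"): 1, (3, "b"): 3}
  STATES, FINAL, ALPHABET = {0, 1, 2, 3}, {2, 3}, "ab"

  def minimize():
      partition = [FINAL, STATES - FINAL]
      while True:
          def block(q):                    # index of the block holding q
              return next(i for i, b in enumerate(partition) if q in b)
          refined = []
          for piece in partition:
              groups = {}                  # split by observable behaviour
              for q in piece:
                  key = tuple(block(DELTA[(q, a)]) for a in ALPHABET)
                  groups.setdefault(key, set()).add(q)
              refined.extend(groups.values())
          if len(refined) == len(partition):
              return refined               # fixed point: nothing split
          partition = refined

  print(minimize())   # [{2, 3}, {0}, {1}] up to ordering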

Input Processing
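
The scanner does not stop at the first accepting state: it keeps consuming input as long as the automaton has a move, remembering the position of the last accepting state seen, and when the automaton gets stuck it cuts the lexeme at that position and restarts from there (longest match, or maximal munch). A sketch, assuming a hypothetical single-token DFA for INT = [0-9]+:

  DELTA = {(s, d): 1 for s in (0, 1) for d in "0123456789"}
  TOKEN = {1: "INT"}                     # accepting state -> token

  def tokenize(text):
      pos, tokens = 0, []
      while pos < len(text):
          state, last = 0, None          # last = (end position, token)
          for i in range(pos, len(text)):
              state = DELTA.get((state, text[i]))
              if state is None:
                  break                  # automaton is stuck
              if state in TOKEN:
                  last = (i + 1, TOKEN[state])
          if last is None:
              raise SyntaxError("no token at position %d" % pos)
          end, tok = last
          tokens.append((tok, text[pos:end]))
          pos = end                      # resume right after the lexeme
      return tokens

  print(tokenize("42007"))               # [('INT', '42007')]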

Recognizing Multiple Expressions
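
Several expressions are recognized simultaneously by merging their NFAs: a new common start state gets eps-transitions to each original start state, and each accepting state keeps a tag naming its expression's token, so that after the subset construction every DFA state knows which tokens it can accept. A sketch with two toy expressions, a (token A) and ab (token AB), as assumptions (same NFA encoding as above):

  def combine(nfas):
      """nfas: list of (trans, start, accept, token), states numbered
      from 0. Returns (trans, new start, {accept state: token})."""
      big, tags, off = {(0, None): set()}, {}, 1
      for trans, start, accept, token in nfas:
          hi = max(start, accept)
          for (q, a), rs in trans.items():
              big.setdefault((q + off, a), set()).update(r + off for r in rs)
              hi = max(hi, q, *rs)
          big[(0, None)].add(start + off)    # eps from the new start
          tags[accept + off] = token         # remember which token
          off += hi + 1                      # keep state numbers disjoint
      return big, 0, tags

  A  = ({(0, "a"): {1}}, 0, 1, "A")                    # expression a
  AB = ({(0, "a"): {1}, (1, "b"): {2}}, 0, 2, "AB")    # expression ab
  print(combine([A, AB]))

When two expressions accept the same lexeme, the conflict must be resolved; the usual policy picks the expression listed first, as Example 1 shows.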

Example 1: Ambiguous Expressions
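
A single lexeme may be accepted by more than one expression. The classic case is a keyword versus identifiers: the input if matches both the expression if (token IF) and [a-z]+ (token ID). Since both matches have the same length, the usual policy is that the expression listed first wins. A sketch with a hand-built DFA for these two rules (states and tags are assumptions of the sketch):

  LETTERS = "abcdefghijklmnopqrstuvwxyz"
  DELTA = {}
  for c in LETTERS:
      DELTA[(0, c)] = 1 if c == "i" else 3   # 1: read "i" so far
      DELTA[(1, c)] = 2 if c == "f" else 3   # 2: read "if"
      DELTA[(2, c)] = 3                      # 3: some other identifier
      DELTA[(3, c)] = 3
  TAGS = {1: {"ID"}, 2: {"IF", "ID"}, 3: {"ID"}}   # state 2 is ambiguous
  PRIORITY = ["IF", "ID"]                # listing order breaks the tie

  def token_for(state):
      """Token for an accepting state: first listed rule wins."""
      return next(t for t in PRIORITY if t in TAGS[state])

  print(token_for(2))   # IF: the keyword beats the identifier rule
  print(token_for(3))   # ID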

Example 2: Backtracking
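
Longest match can also force the scanner to give input back: while trying to extend the current lexeme it may run past the last accepting position, and when it gets stuck it must backtrack to that position. Take the expressions Y = a (token Y) and X = a*b (token X): on input aaa the scanner consumes all three a's hoping to complete an X, fails at the end of input, and backtracks to the one-character match for Y; this repeats three times. A sketch, reusing the maximal-munch loop from Input Processing over a hand-built combined DFA (an assumption):

  # Combined DFA for Y = a and X = a*b: state 1 accepts Y, 3 accepts X.
  DELTA = {(0, "a"): 1, (0, "b"): 3, (1, "a"): 2,
           (1, "b"): 3, (2, "a"): 2, (2, "b"): 3}
  TOKEN = {1: "Y", 3: "X"}

  def tokenize(text):
      pos, tokens = 0, []
      while pos < len(text):
          state, last = 0, None          # last accepting position seen
          for i in range(pos, len(text)):
              state = DELTA.get((state, text[i]))
              if state is None:
                  break
              if state in TOKEN:
                  last = (i + 1, TOKEN[state])
          if last is None:
              raise SyntaxError("no token at position %d" % pos)
          end, tok = last
          tokens.append((tok, text[pos:end]))
          pos = end                      # backtrack here when stuck
      return tokens

  print(tokenize("aab"))   # [('X', 'aab')]    - one longest match
  print(tokenize("aaa"))   # [('Y', 'a')] * 3  - backtracks each time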