Theoretical Aspects of Lexical Analysis

Lexical analysis, the first step in the compilation process, splits the input into segments and classifies them: each segment of the input (a lexeme) is assigned a label (its token).

Here, we use regular expressions to recognize portions of the input text.

Regular Expressions

Regular expressions are defined over a finite alphabet Σ = { a, b, ..., c } and the empty string ε:

The languages (sets of strings) for each of these entities are:

  • {ε}, for ε
  • {a}, for an entry a in Σ

The following primitive constructors are defined (the languages they denote are sketched below):

  • concatenation (rs)
  • alternative (r|s)
  • Kleene-star (r*)
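
The language L(r) denoted by a regular expression r follows inductively from these constructors; a sketch of the usual definitions, in LaTeX notation:

  \begin{align*}
    L(\varepsilon) &= \{\varepsilon\}, \qquad L(a) = \{a\} \text{ for each } a \in \Sigma \\
    L(r\,s)        &= \{\, xy \mid x \in L(r),\ y \in L(s) \,\}  && \text{(concatenation)} \\
    L(r \mid s)    &= L(r) \cup L(s)                             && \text{(alternative)} \\
    L(r^{*})       &= \textstyle\bigcup_{n \ge 0} L(r)^{n}       && \text{(Kleene-star, with } L(r)^{0} = \{\varepsilon\}\text{)}
  \end{align*}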

Extensions (derived from the above, as illustrated below):

  • Transitive closure (+) - a+ ("one or more 'a'")
  • Optionality (?) - a? ("zero or one 'a'")
  • Character classes - [a-z] ("any character in the 'a-z' range"; a class matches exactly one character)
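
These extensions add no expressive power: each can be rewritten using only the primitive constructors. A minimal sketch checking the equivalences with Python's re module (the sample strings below are chosen only for illustration):

  import re

  # Derived operators rewritten with the primitive constructors only.
  # The empty branch in "(a|)" plays the role of the empty string ε.
  equivalences = [
      ("a+",    "aa*"),       # transitive closure: one or more 'a'
      ("a?",    "(a|)"),      # optionality: zero or one 'a'
      ("[a-c]", "(a|b|c)"),   # character class: exactly one character from the range
  ]

  samples = ["", "a", "aa", "b", "c", "ab"]
  for extended, primitive in equivalences:
      for s in samples:
          assert bool(re.fullmatch(extended, s)) == bool(re.fullmatch(primitive, s))
  print("each derived form matches exactly the same samples as its primitive form")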

Recognizing/Matching Regular Expressions

Building the NFA: Thompson's Algorithm
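
A minimal sketch of the construction, assuming the regular expression has already been parsed into a small abstract syntax tree (the tuple encoding and the NFA class below are purely illustrative). Each operator is translated into an NFA fragment with one start and one accepting state, and fragments are glued together with ε-transitions:

  EPS = None  # label used for ε-transitions

  class NFA:
      def __init__(self):
          self.trans = {}      # state -> list of (symbol or EPS, next state)
          self.count = 0

      def state(self):
          s = self.count
          self.count += 1
          self.trans[s] = []
          return s

      def add(self, src, label, dst):
          self.trans[src].append((label, dst))

  def build(ast, nfa):
      """Thompson's construction: return (start, accept) of the fragment for `ast`.
      `ast` is ('sym', a), ('cat', r, s), ('alt', r, s) or ('star', r)."""
      kind = ast[0]
      if kind == 'sym':                         # a single symbol:  start --a--> accept
          i, f = nfa.state(), nfa.state()
          nfa.add(i, ast[1], f)
          return i, f
      if kind == 'cat':                         # rs: ε-link the accept of r to the start of s
          i1, f1 = build(ast[1], nfa)
          i2, f2 = build(ast[2], nfa)
          nfa.add(f1, EPS, i2)
          return i1, f2
      if kind == 'alt':                         # r|s: fresh start/accept, ε into both branches
          i, f = nfa.state(), nfa.state()
          i1, f1 = build(ast[1], nfa)
          i2, f2 = build(ast[2], nfa)
          nfa.add(i, EPS, i1); nfa.add(i, EPS, i2)
          nfa.add(f1, EPS, f); nfa.add(f2, EPS, f)
          return i, f
      if kind == 'star':                        # r*: ε-loop around r plus an ε bypass
          i, f = nfa.state(), nfa.state()
          i1, f1 = build(ast[1], nfa)
          nfa.add(i, EPS, i1); nfa.add(i, EPS, f)
          nfa.add(f1, EPS, i1); nfa.add(f1, EPS, f)
          return i, f
      raise ValueError(kind)

  # (a|b)*c, written as a tuple tree
  nfa = NFA()
  start, accept = build(('cat', ('star', ('alt', ('sym', 'a'), ('sym', 'b'))), ('sym', 'c')), nfa)
  print(len(nfa.trans), "states; start =", start, "accept =", accept)

The resulting automaton is nondeterministic and usually far from minimal; the next two steps turn it into an equivalent, and then minimal, DFA.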

Building DFAs from NFAs
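
A sketch of the usual subset (powerset) construction, assuming the NFA is given as a transition table with ε-transitions like the one above; each DFA state is the ε-closure of a set of NFA states (the three-state NFA at the bottom, for strings ending in "ab", is hand-written for illustration):

  from collections import deque

  EPS = None  # ε-transition label

  def eps_closure(states, trans):
      """All NFA states reachable from `states` using only ε-transitions."""
      stack, closure = list(states), set(states)
      while stack:
          s = stack.pop()
          for label, t in trans.get(s, []):
              if label is EPS and t not in closure:
                  closure.add(t)
                  stack.append(t)
      return frozenset(closure)

  def subset_construction(trans, start, accepts, alphabet):
      """Build a DFA whose states are ε-closures of sets of NFA states."""
      d_start = eps_closure({start}, trans)
      d_trans, d_accepts = {}, set()
      queue, seen = deque([d_start]), {d_start}
      while queue:
          S = queue.popleft()
          if S & accepts:
              d_accepts.add(S)
          for a in alphabet:
              moved = {t for s in S for label, t in trans.get(s, []) if label == a}
              T = eps_closure(moved, trans)
              d_trans[S, a] = T
              if T not in seen:
                  seen.add(T)
                  queue.append(T)
      return d_start, d_trans, d_accepts

  # NFA accepting strings over {a, b} that end in "ab"
  nfa_trans = {0: [('a', 0), ('b', 0), ('a', 1)], 1: [('b', 2)], 2: []}
  d_start, d_trans, d_accepts = subset_construction(nfa_trans, 0, accepts={2}, alphabet={'a', 'b'})
  print(len({S for S, _ in d_trans}), "DFA states,", len(d_accepts), "accepting")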

DFA Minimization
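
A sketch of minimization by partition refinement (Moore's algorithm): start from the accepting/non-accepting split and keep splitting any group whose states disagree, for some input symbol, on which group they move to. Hopcroft's algorithm is a faster refinement of the same idea. The four-state DFA below is made up for illustration; it accepts strings ending in 'b' and contains two pairs of equivalent states:

  def minimize(states, alphabet, trans, accepts):
      """Partition-refinement DFA minimization.
      `trans` maps (state, symbol) -> state and must be total (a complete DFA)."""
      partition = [set(accepts), set(states) - set(accepts)]   # initial split
      partition = [group for group in partition if group]
      changed = True
      while changed:
          changed = False
          refined = []
          for group in partition:
              # Group states by the signature of which block each symbol leads to.
              buckets = {}
              for s in group:
                  sig = tuple(
                      next(i for i, g in enumerate(partition) if trans[s, a] in g)
                      for a in sorted(alphabet)
                  )
                  buckets.setdefault(sig, set()).add(s)
              refined.extend(buckets.values())
              if len(buckets) > 1:
                  changed = True
          partition = refined
      return partition

  # DFA over {a, b} accepting strings that end in 'b'; states 2 and 3 duplicate 0 and 1.
  states = {0, 1, 2, 3}
  trans = {(0, 'a'): 2, (0, 'b'): 1, (1, 'a'): 2, (1, 'b'): 1,
           (2, 'a'): 0, (2, 'b'): 3, (3, 'a'): 0, (3, 'b'): 3}
  print(minimize(states, {'a', 'b'}, trans, accepts={1, 3}))   # -> [{1, 3}, {0, 2}]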

Input Processing
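
One common discipline for driving the DFA over the input is maximal munch (longest match): keep consuming characters, remember the last position at which an accepting state was seen, and when the automaton gets stuck backtrack to that position and emit the lexeme. A small sketch, with a hand-written DFA for the expression a(bc)* chosen only to show the backtracking:

  def longest_match(dfa_trans, start, accepts, text, pos):
      """Return the end of the longest accepted prefix of text[pos:], or None."""
      state, last_accept, i = start, None, pos
      while i < len(text) and (state, text[i]) in dfa_trans:
          state = dfa_trans[state, text[i]]
          i += 1
          if state in accepts:
              last_accept = i          # furthest accepting position seen so far
      return last_accept               # characters past it are given back to the input

  # DFA for a(bc)*   (0 --a--> 1, 1 --b--> 2, 2 --c--> 1; state 1 accepts)
  trans, accepts = {(0, 'a'): 1, (1, 'b'): 2, (2, 'c'): 1}, {1}

  text, pos = "abcb", 0
  while pos < len(text):
      end = longest_match(trans, 0, accepts, text, pos)
      if end is None:
          print("no lexeme starts at position", pos, repr(text[pos]))
          break
      print("lexeme:", text[pos:end])
      pos = end

On this input the automaton reads "abcb", but the last accepting state was seen after "abc", so the final 'b' is given back to the input and the lexeme is "abc".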

Recognizing Multiple Expressions
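
To recognize several token expressions simultaneously, one usual approach is to build one NFA per expression, add a fresh start state with ε-transitions into each of them, and remember which expression each accepting state belongs to; after determinization a DFA state may contain accepting states of several expressions, and a priority rule (for instance, declaration order) decides which token wins. A minimal sketch of the merge step, using the same transition-table form as above (the two tiny token automata are hypothetical):

  EPS = None  # ε-transition label

  def merge_nfas(nfas):
      """Combine several NFAs (one per token expression) into a single NFA.
      Each entry of `nfas` is (token, trans, start, accept); states 0..n-1 are
      renamed by offsetting, a fresh start state 0 gets an ε-transition into
      every sub-automaton, and accept_token records which token each renamed
      accepting state reports."""
      merged, accept_token, offset = {0: []}, {}, 1
      for token, trans, start, accept in nfas:
          for s, edges in trans.items():
              merged[s + offset] = [(label, t + offset) for label, t in edges]
          merged[0].append((EPS, start + offset))
          accept_token[accept + offset] = token
          offset += len(trans)
      return merged, 0, accept_token

  # Two token expressions, as hand-written NFAs:
  #   IF : the keyword "if"           0 --i--> 1 --f--> 2
  #   ID : a single letter a or b     0 --a--> 1, 0 --b--> 1
  if_nfa = ("IF", {0: [('i', 1)], 1: [('f', 2)], 2: []}, 0, 2)
  id_nfa = ("ID", {0: [('a', 1), ('b', 1)], 1: []}, 0, 1)
  trans, start, accept_token = merge_nfas([if_nfa, id_nfa])
  print(accept_token)   # which renamed accepting state reports which token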

Example 1: Ambiguous Expressions

Example 2: Backtracking