The Flex Lexical Analyzer

From Wiki**3

Compiladores
Introdução ao Desenvolvimento de Compiladores
Aspectos Teóricos de Análise Lexical
A Ferramenta Flex
Introdução à Sintaxe
Análise Sintáctica Descendente
Gramáticas Atributivas
A Ferramenta YACC
Análise Sintáctica Ascendente
Análise Semântica
Geração de Código
Tópicos de Optimização

Flex is a tool that allows the recognition of language items (elements) specified as regular expressions. These regular expressions are as specified by the corresponding formal definition (alphabet and primitive operators). In addition, Flex also supports the derived operators (see theory page) as well as a number of meta- and pseudo-characters, such as end-of-file markers or special and non-printable characters. Flex is also capable of recognizing characters by their properties (such as being a digit, and so on). All these aspects make Flex very capable in what concerns alphabet specification.

To perform string recognition, from regular expression specifications, Flex uses a finite state machine (a finite automaton). Even though it is not its primary task, Flex is also capable of using a stack in conjunction with its state machine, making it theoretically capable of processing context-free languages (theoretically, since it is not as convenient as using a dedicated tool, such as BYACC, Bison, or others).

Basic Concepts

Flex is a code generator that reads a specification file and generates the lexical analyzer (a scanner) as a C or C++ module (depending on the options). The description is the same in both cases (only a few details in the actions are different from one case to the other). In current versions, the code generated for the C analyzer is compatible with a C++ compiler, making it possible to program actions in this language, even without using classes. This may be a good approach, since C++ data structures are usually more programmer-friendly than C, leading to fewer bugs.

A Flex specification file is composed by three parts:

  • Definitions
  • Rules (these are used to build the automaton), actions (for manipulating the strings that match the regular expressions)
  • Code

Structure of a Flex Specification

Basic Guidelines for a Successful Specification

... Do not try to recognize "negative" patterns

... Recognize patterns that are either of interest or that may cause problems if not recognized (example: sources of ambiguity)

... if there are self-contained parts of the language being recognized, consider recognizing them in a sub-automaton, controlled by start conditions (this will make the rules simpler and the overall specification will also be simpler -- the use of sub-automata tends to make the rules less dependent On one another, in the sense that they don't interfere with each other)

... sometimes, "bad rules" have to be written to account for badly specified language elements (e.g. octal numbers in which the digits do not belong to that base) -- these rules should be left for last, because if written too early they may complicate the overall task of rule writing (ambiguity and so on). It may be useful to activate debugging if behaviour is unexpected.

How to Debug a Flex Specification

There are various flags and variables to activate the debug functionality in Flex, i.e., there is no fundamental need to insert code such as printfs or similar, even through, of course, that style of debugging is also possible. These other ways are often much superior to printfs, since they can easily inform the programmer about matched rules, triggered actions, and so on.

Generation of debug code is requested by activating the corresponding option. In the Flex specification file, insert:

%option debug

This flag suffices to produce output information when developing in C. When using C++ scanners (%option c++), even though the above flag still generates debug code, it's still necessary to tell the scanner to actually output debug information. This can be done by calling the set_debug method with a non-zero argument (this can be conveniently done at the start of the rules section in a Flex specification file -- note how the action is indented from the left and without any rule):

%option debug 
... 

%% 
        { set_debug(1); } 

... other rules...  

%% 

The YYDEBUG environment variable will activate the debug messages both in Flex and YACC (allowing simultaneous debugging of token recognition by the scanner and corresponding use by the parser).

In the shell's command line (syntax may vary, depending on the actual environment definition circumstances):

 export YYDEBUG=1

Examples

Exercises

  • Exercise 1 - Printing the strings present in a C/C++ program.
  • Exercise 2 - Printing the number of times "." and "->" are used as operators.
  • Exercise 3 - Printing the comments present in a C/C++ program.
  • Exercise 4 - Removing the actions from a YACC specification.
  • Exercise 5 - Removing the actions from a Flex specification.
  • Exercise 6 - Playlist processing.
  • Exercise 7 - Text processing.
  • Exercise 8 - Message list and simple accounting.
  • Exercise 9 - Comments/code ratio in C++ and Java (Test 1, 2016/04/08).
  • Exercise 10 - Message complexity.