Semantic Analysis/The Tiny language: semantic analysis example and C generation

From Wiki**3

< Semantic Analysis

Considere a seguinte gramárica (ε representa a produção nula), onde os operadores tWRITE (não associativo), = (associativo à direita) e + (associativo à esquerda) têm precedências crescentes.

O primeiro objectivo é escrever, utilizando as classes disponibilizadas na CDK (subclasses de cdk::basic_node), uma especificação YACC para a gramática dada, contendo as acções semânticas que permitem construir a árvore sintáctica.

O segundo objectivo é escrever os métodos de o tradutor para C (padrão de desenho Visitor) a árvore sintáctica anteriormente obtida. Esta tradução implica que os símbolos encontrados sejam recordados, para verificar futuras utilizações. Para tal, a classe tiny::symbol (que representa cada símbolo; esquematizado aqui e que se baseia no símbolo do compilador Simple) e a classe cdk::symbol_table serão utilizadas.

prog  → decls exprs '.' 
decls → ε | decls decl ';' 
decl  → tINT tID | tSTR tID init 
init  → ε | '=' tSTRING 
exprs → expr | exprs ',' expr 
expr  → tINTEGER | tID | tID '=' expr | expr '+' expr | tWRITE expr

Gramática e Criação de Nós da Árvore Sintáctica Abstracta

In the answer to this exercise, we will consider the node tree defined as shown below (in a YACC-like syntax). Note that careful attention must be given to the choices between cdk::nil_node nodes and nullptr (null pointers) (these can be tested for performing special/exceptional actions, while Nil nodes are more useful when tests are undesirable and uniform behavior on the part of the code generator is desired).

We follow the same nomenclature used in the Simple compiler: LINE is a macro corresponding to the source line and all nodes are either CDK nodes or derived from them (as in Simple).

Completing (and, possibly, correcting) the following code, so that it runs, is left as an exercise.

File tiny.y
%{
#include <algorithm>
#include <memory>
#include <cdk/types/basic_type.h>
#include <cdk/types/primitive_type.h>
#include "ast/all.h"
%}
%parse-param {std::shared_ptr<cdk::compiler> compiler}
%union {
  //--- don't change *any* of these: if you do, you'll break the compiler.
  YYSTYPE() : type(cdk::primitive_type::create(0, cdk::TYPE_VOID)) {}
  ~YYSTYPE() {}
  YYSTYPE(const YYSTYPE &other) { *this = other; }
  YYSTYPE& operator=(const YYSTYPE &other) { type = other.type; return *this; }
  std::shared_ptr<cdk::basic_type> type;        /* expression type */
  //-- don't change *any* of these --- END!

  int i;
  std::string *s;
  cdk::basic_node *n;
  cdk::expression_node *e;
}  
%token tINT tSTR tWRITE
%token<i> tINTEGER
%token<s> tID tSTRING
%type<e> expr init
%type<n> prog decls exprs decl
%%
prog: decls exprs '.' { $$ = new program_node(LINE, $1, $2); }

decls: /* empty */    { $$ = new cdk::nil_node(LINE); }
     | decls decl ';' { $$ = new cdk::sequence_node(LINE, $2, $1); }
     ;

decl: tINT tID      {
        $$ = new declaration_node(LINE, cdk::primitive_type::create(4, cdk::TYPE_INT), *$2, nullptr);
        delete $2;
    }
    | tSTR tID init {
        $$ = new declaration_node(LINE, cdk::primitive_type::create(4, cdk::TYPE_STRING), *$2, $3);
        delete $2;
    }
    ;

init: /* empty */  { $$ = nullptr; /* must match the last argument in declaration_node */ }
    | '=' tSTRING  { $$ = new cdk::string_node(LINE, $2); }
    ;

exprs: expr             { $$ = new cdk::sequence_node(LINE, $1); }
     | exprs ',' expr   { $$ = new cdk::sequence_node(LINE, $3, $1); }
     ;

expr: tINTEGER           { $$ = new cdk::integer_node(LINE, $1); }
    | tID                { $$ = new cdk::rvalue_node(LINE, new cdk::variable_node(LINE, *$1)); delete $1; }
    | tID '=' expr       { $$ = new cdk::assignment_node(LINE, new cdk::variable_node(LINE, *$1), $3); delete $1; }
    | expr '+' expr      { $$ = new cdk::add_node(LINE, $1, $3); }
    | tWRITE expr        { $$ = new write_node(LINE, $2); }
    ;

In this code we assume that each declaration node contains two attributes: an integer, representing the declaration's type; a node, representing the identifier being declared; and an optional expression (nullptr if that is the case), representing the initial value of the variable being declared. Likewise, assignment nodes have two children: an identifier, corresponding to the left-value; and the expression to be assigned. Write nodes have a single child (the expression to be printed) (note that, in this simple grammar, write nodes are also integer expressions). The program node (the main node) has two children: a node containing declarations and another containing expressions.

The other nodes are as described in the CDK.

Symbol representation

Symbols describe named program entities and store their properties. They provide support for the semantic processor: declarations create new symbols. Expressions and left-values refer to those symbols.

A simple representation in this case could be done in the following way. Note that this definition is just an example and contains only minimal information. It should be extended to account for the needs of the language being implemented.

File symbol.h
#ifndef __TINY_TARGETS_SYMBOL_H__
#define __TINY_TARGETS_SYMBOL_H__

#include <string>
#include <memory>
#include <cdk/types/basic_type.h>

namespace tiny {

  class symbol {
    std::string _name; // identifier
    std::shared_ptr<cdk::basic_type> _type; // type (type id + type size)
  public:
    // constructors, destructor, getters, etc.

  public:
    // critical for type checking (interface similar to that of class cdk::typed_node)
    std::shared_ptr<cdk::basic_type> type() const { return _type; }
    void set_type(std::shared_ptr<cdk::basic_type> t) { _type = t; }
    bool is_typed(cdk::typename_type name) const { return _type->name() == name; }
  };

  // this function simplifies symbol creation in the type_checker visitor (see below)
  inline auto make_symbol(const std::string &name, std::shared_ptr<cdk::basic_type> type, /* rest of ctor args */) {
    return std::make_shared<symbol>(name, type, /* rest of ctor args */);
  }

} // tiny

#endif

Type Checking

The type checking visitor class could process the whole AST in one go, tagging all types and possible errors. Then, in a following step, the code generator would use this information without further use of a type checker. While this favors a step-by-step processing of the AST, it mandates the full implementation of the type checking visitor (i.e., all methods would have to be non-empty). The approach shown in this example, however, can be use partially implemented visitors, since it only needs to provide implementations for the nodes that have types to be checked. In this approach, the code generator creates a type checker whenever it needs to perform a check.

The ASSERT_UNSPEC macro

The type checker avoids visiting the same nodes by testing (initial assertion) whether it has done so before.

  #define ASSERT_UNSPEC  { \
    if (node->type() != nullptr && !node->is_typed(cdk::TYPE_UNSPEC)) \
      return; \
  }

Note that the use of this macro is restricted to visits to nodes that have a type, i.e., expressions.

The CHECK_TYPES and ASSERT_SAFE_EXPRESSIONS macros

The code generator may use the following macros at the start of a visit method, to ensure that the expressions present in a node are safe to use, i.e., that they are semantically consistent.

#define CHECK_TYPES(compiler, symtab, node) { \
  try { \
    tiny::type_checker checker(compiler, symtab, this); \
    (node)->accept(&checker, 0); \
  } \
  catch (const std::string &problem) { \
    std::cerr << (node)->lineno() << ": " << problem << std::endl; \
    return; \
  } \
}

#define ASSERT_SAFE_EXPRESSIONS CHECK_TYPES(_compiler, _symtab, node)

The type checking visitor

File type_checker.cpp
void tiny::type_checker::do_declaration_node(declaration_node *const node, int lvl) {
  std::string &id = node->identifier();
  auto symbol =  tiny::make_symbol(node->type(), id, 0);
  if (!_symtab.insert(id, symbol))
    throw id + " redeclared";

  if (node->initializer()) {
    node->initializer()->accept(this, lvl+2);
    if (node->type() != node->initializer()->type())
      throw std::string("wrong type for initializer");
  }

  _parent->set_new_symbol(symbol);
}

void tiny::type_checker::do_integer_node(cdk::integer_node *const node, int lvl) {
  ASSERT_UNSPEC;
  node->type(cdk::primitive_type::create(4, cdk::TYPE_INT));
}

void tiny::type_checker::do_string_node(cdk::string_node *const node, int lvl) {
  ASSERT_UNSPEC;
  node->type(cdk::primitive_type::create(4, cdk::TYPE_STRING));
}

void tiny::type_checker::do_variable_node(cdk::variable_node *const node, int lvl) {
  ASSERT_UNSPEC;
  const std::string &id = node->name();
  auto symbol = _symtab.find(id);
  if (!symbol) throw id + " undeclared";
  node->type(symbol->type());
}

void tiny::type_checker::do_assignment_node(cdk::assignment_node *const node, int lvl) {
  ASSERT_UNSPEC;
  node->lvalue()->accept(this, lvl+4);
  node->rvalue()->accept(this, lvl+4);
  if (node->lvalue()->type() != node->rvalue()->type())
    throw std::string("wrong types in assignment");
  node->type(node->lvalue()->type());
}

void tiny::type_checker::do_add_node(cdk::add_node *const node, int lvl) {
  ASSERT_UNSPEC;

  node->left()->accept(this, lvl+2);
  if (!node->left()->is_typed(cdk::TYPE_INT))
    throw std::string("integer expression expected in add operator (left)");

  node->right()->accept(this, lvl+2);
  if (!node->right()->is_typed(cdk::TYPE_INT))
    throw std::string("integer expression expected in add operator (right)");

  node->type(cdk::primitive_type::create(4, cdk::TYPE_INT));
}

void tiny::type_checker::do_write_node(write_node *const node, int lvl) {
  ASSERT_UNSPEC;
  node->argument()->accept(this, lvl+2);
  if (!(node->argument()->is_typed(cdk::TYPE_INT) || node->argument()->is_typed(cdk::TYPE_STRING)))
    throw std::string("wrong type in write expression");
  node->type(cdk::primitive_type::create(4, cdk::TYPE_INT));
}

Code Generation

The following code presents just the bare minimum for defining the code generation visitors (the omitted parts are as in Simple).

The C code generator

File c_writer.cpp
void tiny::c_writer::do_program_node(program_node *const node, int lvl) {
  os() << "int main() {\n";
  node->declarations()->accept(this, lvl+2);
  node->expressions()->accept(this, lvl+2);
  os() << "  return 0;\n}\n";
}
 
void tiny::c_writer::do_sequence_node(cdk::sequence_node *const node, int lvl) {
  for (size_t i = 0; i < node->size(); i++) {
    os() << std::string(lvl+2, ' ');
    node->node(i)->accept(this, lvl);
    os() << ";\n";
  }
}
 
void tiny::c_writer::do_declaration_node(declaration_node *const node, int lvl) {
  ASSERT_SAFE_EXPRESSIONS;
  std::string &id = node->identifier();
  if (new_symbol() != nullptr) {
    os() << std::string(lvl+2, ' ');
    if (node->is_typed(cdk::TYPE_INT)) os() << "int " << id;
    else os() << "char *" << id;
    if (node->initializer()) os() << " = \"" << *node->initializer() << "\"";
    reset_new_symbol();
  }
}
 
void tiny::c_writer::do_integer_node(cdk::integer_node *const node, int lvl) {
  ASSERT_SAFE_EXPRESSIONS;
  os() << node->value();
}
 
void tiny::c_writer::do_string_node(cdk::string_node *const node, int lvl) {
  ASSERT_SAFE_EXPRESSIONS;
  os() << node->value();
}
 
void tiny::c_writer::do_variable_node(cdk::variable_node *const node, int lvl) {
  ASSERT_SAFE_EXPRESSIONS;
  os() << node->name();
}
 
void tiny::c_writer::do_assignment_node(cdk::assignment_node *const node, int lvl) {
  ASSERT_SAFE_EXPRESSIONS;
  node->lvalue()->accept(this, lvl);
  os() << " = ";
  node->rvalue()->accept(this, lvl);
}
 
void tiny::c_writer::do_add_node(cdk::add_node *const node, int lvl) {
  ASSERT_SAFE_EXPRESSIONS;
  node->left()->accept(this, lvl);
  os() << " + ";
  node->right()->accept(this, lvl);
}
 
void tiny::c_writer::do_write_node(write_node *const node, int lvl) {
  ASSERT_SAFE_EXPRESSIONS;
  if (node->argument()->is_typed(cdk::TYPE_INT)) {
    os() << "printf(\"%d\\n\", ";
    node->argument()->accept(this, lvl+2);
    os() << ")";
  }
  else {
    os() << "printf(\"%s\\n\", \"";
    node->argument()->accept(this, lvl+2);
    os() << "\")";
  }
}