Optimization Topics


Basic Blocks

A basic block is a sequence of instructions in which control flow enters only at the first instruction and leaves only at the last: there are no jumps into the middle of the block, and a jump may appear only as the last instruction.

Transformations within a basic block must preserve the block's semantics; for optimization purposes, the block can be treated as an atomic entity.
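For illustration, consider the following hypothetical C fragment (names and labels invented for this example). A new basic block starts at the function entry, at every jump target, and immediately after every jump; the comments mark the resulting blocks:

 /* Hypothetical fragment; comments mark where each basic block starts. */
 int sum_upto(int n) {
     int s = 0;                /* B1: entry block                        */
     int i = 0;                /* B1                                     */
 loop:                         /* B2: jump target, starts a new block    */
     if (i >= n) goto done;    /* B2 ends with a conditional jump        */
     s = s + i;                /* B3: fall-through successor of B2       */
     i = i + 1;                /* B3                                     */
     goto loop;                /* B3 ends with an unconditional jump     */
 done:
     return s;                 /* B4: exit block                         */
 }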

Optimization Levels

Compiler optimization is an essential part of compiler design: it enhances program performance by reducing execution time, memory consumption, and code size. Optimization techniques can be applied at different stages of the compilation process and at different granularities: user, generic, peephole, local, inter-block, and global optimization. Each level has a different focus, ranging from the programmer's own algorithms to the overall code structure, with the common goal of making the compiled code as efficient as possible.

User: algorithm- and application-level optimizations

User-level, that is, algorithm- and application-level, optimizations are specific to a particular algorithm or application. They are usually done by the developer who writes the program and knows the algorithm's specificities and the application's behavior. These optimizations aim to rewrite or refactor the code to improve its performance, often based on knowledge of the algorithm or application. For instance, a developer could optimize a sorting algorithm by choosing the most efficient one based on the data's expected characteristics.
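For instance (a hedged sketch, with invented function names), a developer who knows that a lookup table is kept sorted can replace a linear scan with a binary search; no compiler pass would make this algorithmic change automatically:

 #include <stdlib.h>

 static int cmp_int(const void *a, const void *b) {
     int x = *(const int *)a, y = *(const int *)b;
     return (x > y) - (x < y);
 }

 /* before: linear search, O(n) */
 int contains_linear(const int *v, size_t n, int key) {
     for (size_t i = 0; i < n; i++)
         if (v[i] == key) return 1;
     return 0;
 }

 /* after: binary search, O(log n), valid because v is known to be sorted */
 int contains_sorted(const int *v, size_t n, int key) {
     return bsearch(&key, v, n, sizeof *v, cmp_int) != NULL;
 }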

Generic: considers intermediate code (processor-independent)

Generic optimizations are carried out on the intermediate code and are processor-independent. They are typically implemented in the middle-end of the compiler, after the front-end has translated the source into an intermediate representation but before the back-end generates machine code. Generic optimizations include transformations such as constant folding, dead code elimination, and loop optimizations. Because they are not tied to a specific processor or machine architecture, they can be applied to any program being compiled.
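For instance, a simple case of dead code elimination can be sketched at the source level (the function below is purely illustrative); because the analysis runs on the intermediate representation, it applies regardless of the target processor:

 int generic_example(int x) {
     int t = x * x;   /* dead code: t is never read afterwards        */
     if (0)           /* condition is statically false                */
         x = x - 1;   /* unreachable: the branch can be removed       */
     return x + 1;    /* only this computation is actually needed     */
 }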

Peephole: window over a series of consecutive instructions

Peephole optimization looks at a small window of consecutive instructions in the generated assembly or machine code and tries to improve them. The compiler replaces such a sequence with a more efficient one, reducing redundancy and improving execution speed. Typical cases include eliminating unnecessary load and store operations, simplifying arithmetic operations, or replacing slow instructions with faster equivalents.

Local: optimization within a basic block

Local optimization focuses on improving performance within a single basic block (a sequence of code with one entry point and one exit point, with no possibility of branching in between). These blocks are identified after the program has been translated into an intermediate representation. The optimizer then improves each block in isolation, for example by eliminating redundant computations and dead code, or by allocating registers locally.
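A minimal sketch of local optimization (illustrative function; the body is straight-line code forming a single basic block): constant propagation and constant folding reduce the whole block to a single constant.

 int local_example(void) {
     int a = 4;
     int b = a + 2;   /* constant propagation: a is 4, so b = 4 + 2       */
     int c = b * 3;   /* constant folding: b = 6, so c = 6 * 3 = 18       */
     return c;        /* the whole block can be reduced to "return 18;"   */
 }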

Inter-block: information flow between blocks (mainly cycles)

Inter-block optimization takes into consideration the information flow between blocks, mainly cycles. In this phase, the compiler looks for optimization opportunities that involve more than one basic block, such as loop unrolling or loop invariant code motion. These optimizations can help to minimize redundant computations and enhance overall code efficiency.

Global: multiple jumps (jumps to jumps, jumps to the next instruction)

Global optimizations focus on the entire function or program, optimizing across multiple jumps, including jumps to jumps and jumps to the next instruction. These optimizations are more complex than local optimizations, as they require a broader view of the program's control flow. Global optimizations may include function inlining, loop unrolling across multiple basic blocks, or interprocedural optimizations that span multiple functions or procedures. These kinds of optimizations can significantly improve execution speed but can also increase the size of the compiled code.
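As a hedged sketch of one such transformation mentioned above, function inlining (names invented for illustration) removes call overhead and exposes the callee's body to further optimization across what used to be a call boundary:

 static int square(int x) { return x * x; }

 /* before inlining: two explicit calls */
 int sum_of_squares(int a, int b) {
     return square(a) + square(b);
 }

 /* after inlining: the calls are replaced by the callee's body,
    which later optimizations can now see and simplify */
 int sum_of_squares_inlined(int a, int b) {
     return (a * a) + (b * b);
 }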

Machine-independent optimizations

Constant folding

Evaluation, at compile time, of expressions that involve only literals and other values known to be constant:

 int *x = (int *)malloc(1024 * 4);
 int  y = 3;
 int  z = 20 * y + 10;

Becomes:

 int *x = (int *)malloc(4096);
 int  y = 3;
 int  z = 70;

Elimination of Common Subexpressions

Using temporary variables to store common results:

  d = c * (a + b)
  e = (a + b) / 2

Becomes:

  t = a + b
  d = c * t
  e = t / 2

Cycles / Loops

Loop unrolling

  for (i = 1; i < n; i++)
    a[i] = b[i] + 10;

Becomes:

  for (i = 1; i + 3 < n; i += 4) {
    a[i]   = b[i]   + 10;
    a[i+1] = b[i+1] + 10;
    a[i+2] = b[i+2] + 10;
    a[i+3] = b[i+3] + 10;
  }
 
  // remaining iterations of the loop
  for (; i < n; i++)
    a[i] = b[i] + 10;

Moving invariant code

  for (i = 1; i < n; i++) {
    a[i] = b[i] + c * d;
    e = g[k];
  }

Becomes:

  t = c * d;
  for (i = 1; i < n; i++) {
    a[i] = b[i] + t;
  }
  e = g[k];
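Note that this transformation is safe only if c, d, g, and k are not modified inside the loop; a compiler must also account for the case where the loop body never executes (n <= 1), since the original code would then not evaluate c * d or assign e at all.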

Induction variables

  for (i = 1; i < n; i++)
    k = i * 4 + m;

Becomes:

  k = m;
  for (i = 1; i < n; i++)
    k = k + 4;

Other

Machine-dependent Optimizations

Machine-dependent optimizations are compiler optimizations that are specific to a certain type of hardware architecture. These optimizations take advantage of the unique features of a particular machine to enhance code performance.

The following sections give examples for the Intel x86 family, but the same ideas apply to other architectures.

Algebraic Simplification

Algebraic simplification refers to replacing certain operations with equivalent ones that use fewer resources or take less time while still producing the same result. Consider an instruction such as mov eax, 0, which moves the value 0 into the eax register. It can be replaced with xor eax, eax: the XOR operation performs a bitwise exclusive OR, so XORing a register with itself always yields 0. The xor eax, eax instruction is typically faster than mov eax, 0 and also has a shorter encoding.

Another example is eliminating redundant instructions. If two consecutive instructions mov eax, ebx and mov eax, ecx appear, the first one is redundant because eax is immediately overwritten with the value of ecx; the optimized code keeps only mov eax, ecx.

Strength Reduction

Strength reduction is another optimization technique that replaces expensive operations with cheaper ones, again without changing the result. For instance, on x86 processors, multiplication and division cost more CPU cycles than addition or shift operations. Strength reduction may therefore replace an operation like x * 2 with x << 1, which shifts the value of x one bit to the left, effectively multiplying it by 2. Another example is replacing x / 2 with x >> 1, which shifts the value of x one bit to the right, halving it; this particular replacement is only directly valid when x is unsigned or known to be non-negative, because integer division and arithmetic right shift round negative values differently.

These operations can be combined: an operation like x * 10 can be replaced with (x << 3) + (x << 1), which is equivalent but cheaper. Similarly, x * 5 can be computed as (x << 2) + x. Note the parentheses: in C, << has lower precedence than +.
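A compact sketch of these rewrites (function names invented; the operands are unsigned so the shifts match the multiplications and the division exactly):

 unsigned times10(unsigned x) { return (x << 3) + (x << 1); }   /* x * 10 */
 unsigned times5(unsigned x)  { return (x << 2) + x; }          /* x * 5  */
 unsigned half(unsigned x)    { return x >> 1; }                /* x / 2  */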

Instruction Reordering

Instruction reordering is an optimization technique that changes the order of instructions to minimize CPU idle time and to make better use of CPU registers. It is particularly relevant in architectures such as the x86, which use pipelining: instructions are processed in stages and their execution can overlap. Dependencies between instructions, however, can lead to pipeline stalls, in which the CPU waits for previous instructions to finish. For example, if instruction B depends on the result of instruction A, the CPU cannot start executing B until A has finished.

Register allocation also plays a crucial role in this context. The CPU has a limited number of registers, and the compiler aims to use them effectively. If a variable is frequently accessed, it can be kept in a register for faster access (register promotion); conversely, if there are more live values than available registers, some of them have to be moved to memory (register spilling).

By reordering instructions and managing register allocation carefully, the compiler can minimize pipeline stalls and avoid unnecessary spilling, improving execution speed. For instance, given the three statements A = B + C, D = E + F and G = A + D, the compiler can schedule the first two so that they execute in overlapping pipeline slots, reducing the time needed to produce the operands of the third, and it can keep A, D, and G in registers to speed up their access.
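The three statements from the paragraph above, written as a hypothetical C function: the first two are independent of each other, so the scheduler may swap or interleave them, while the last one must wait for both results.

 int reorder_example(int B, int C, int E, int F) {
     int A = B + C;   /* independent of the next statement             */
     int D = E + F;   /* independent of the previous statement         */
     int G = A + D;   /* depends on A and D: must be scheduled last    */
     return G;
 }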

Other

Exercises