Prof Wagner: CPS 218.

Compiler Construction – Overview

 

Introduction

    1. What is a Compiler?
    2. Types of Compilers

Linguistic Notations:

    1. Precise specification of the syntax (exact form) of the source program
    2. Imprecise "semantics": a description of the meaning of the source program,
       i.e., the actions each source statement specifies for the executing agent

Compiler Components

  1. Lexical analyzer
  2. Symbol table scheme
       Error messages
  3. Parser
  4. Name resolution and usage analysis
  5. Optimizer
  6. Code Generator

Interface data structures

LEX – PARSE:    tokens, symbol table, constant table
PARSE – OPT:    parse tree, abstract syntax tree (AST)
OPT – CGEN:     AST or intermediate representation (IR)
CGEN – OBJ:     assembly language, or code in memory
OBJ – results:  OBJ reads the program's input data and produces the program's output
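
A rough C sketch of how these interfaces might be declared (every type and function name here is an illustrative assumption, not a required design):

#include <stdio.h>

/* All names below are assumptions, chosen only to mirror the interfaces above. */
typedef struct TokenList TokenList;   /* LEX  – PARSE: tokens, symbol table, constant table */
typedef struct Tree      Tree;        /* PARSE – OPT:  parse tree / abstract syntax tree    */
typedef struct IR        IR;          /* OPT  – CGEN:  AST or intermediate representation   */

TokenList *lex_scan(FILE *src);              /* group characters into tokens           */
Tree      *parse(TokenList *toks);           /* rebuild the derivation, build the tree */
IR        *optimize(Tree *tree);             /* optional improving transformations     */
void       codegen(IR *ir, FILE *asm_out);   /* emit assembly, or code in memory       */

/* One way a driver might string the phases together. */
int compile(FILE *src, FILE *asm_out)
{
    TokenList *toks = lex_scan(src);
    Tree      *tree = parse(toks);
    IR        *ir   = optimize(tree);
    codegen(ir, asm_out);
    return 0;
}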

Variations

Each component can be designed in several ways – appropriate for different compilers. I’ll try to cover at least 2 variations in class, and ask you to use one (usually the simplest).

Each interface can also be designed in several ways. I’ll suggest the one I’ve found simplest.

Emphasis on speed: The compiler is to be coded in C, using methods that minimize the total number of instructions the compiler itself executes; the goal of execution optimization is likewise to minimize the total number of instructions the generated program executes, given a source program which is semantically correct.

I’ll specify a small source language. The language will be a complete programming language, although not one you’d like to use. I’ll suggest ways for this language and its compiler to be expanded.

Simple optimizations will be discussed in class, and you’ll be expected to implement some of them. If time permits, we will cover more elaborate optimizations.
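
One simple optimization, for instance, is constant folding: an operator node whose operands are both constants is replaced by a single constant node. A minimal C sketch, assuming a hypothetical Node layout (not a required design):

#include <stdlib.h>

enum Kind { CONST, ADD, SUB };

struct Node {
    enum Kind    kind;
    int          value;          /* meaningful only when kind == CONST */
    struct Node *left, *right;   /* meaningful only for ADD and SUB    */
};

/* Fold constant subtrees in place; return the (possibly simplified) node. */
struct Node *fold(struct Node *n)
{
    if (n == NULL || n->kind == CONST)
        return n;

    n->left  = fold(n->left);
    n->right = fold(n->right);

    if (n->left->kind == CONST && n->right->kind == CONST) {
        int v = (n->kind == ADD) ? n->left->value + n->right->value
                                 : n->left->value - n->right->value;
        free(n->left);
        free(n->right);
        n->kind  = CONST;        /* reuse this node as a constant */
        n->value = v;
        n->left  = n->right = NULL;
    }
    return n;
}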

Notation describing source language "drives" compilation process:

A Backus-Naur Form (BNF) grammar specifies how any string in the language L is generated

This imposes a hierarchical structure on each string in L

      1. The tree structure contains a single node to represent each "component" of the string (a program)
      2. Each component node points at the pieces of that component, so your compiler can use the structure to look at, say, all the statements in a subroutine, or the operands of a specific operator in an expression

The "meaning" of any given component can be (partly) described in terms of the "meaning" (translation to machine code) of each of its components

Organization of compiler

    Based partly on efficiency:

    1. LEX is designed to group meaningful strings of characters together and replace each such string with a pair of integers. Such a pair of integers is called a token (see the sketch after this list). This allows the rest of the compiler to operate on far fewer fixed-size items – the rest of the compiler never needs to examine the characters of which tokens are made.

    2. PARSE reconstructs the derivation of the source program from the grammar of the language. It also reports errors and attempts to repair them. It builds the "parse tree" and may call "semantic routines" in the process.
    3. Semantic routines know the meaning of each language construct. They apply this knowledge to the parse tree to create an intermediate representation of the program, which expresses the meaning of that program in terms of near-machine operations. The semantic routines also enter information in the symbol table, accumulating "usage" information about each name appearing in the program. Inconsistent uses are reported as errors; the rest of the information assists in selecting "machine meanings", like "floating point addition" or "string concatenation".
    4. Optimization. A misnomer – this phase attempts various transformations on the program which improve its running time, but it never improves it to the theoretical limit. It consists of several optional phases.
    5. Code generation: Adds details to the IR and generates actual machine code (often in an assembly language). The executing code must produce the same results that a careful human interpreter of the source program would get on the same input data. It should also execute in as few machine cycles as possible.
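
A token as a pair of integers might look like this in C (the names are assumptions; the table in the example below shows the kind of data each half carries):

struct Token {
    int tokclass;   /* WHILE, ID, OP, LPAREN, ... (#defined small integers)        */
    int tokval;     /* extra detail: LT, MINUS or EQ for an OP, a symbol-table
                       reference (from Look) for an ID, 0 when there is no more    */
};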

Example:

while(Jasmine < Pearl) {

Zeta = Zeta - Pearl

}

Tokens returned:

    Returned value    TokId
    --------------    ---------------
    WHILE             0
    LPAREN            0
    ID                Look("Jasmine")
    OP                LT
    ID                Look("Pearl")
    RPAREN            0
    LBRACE            0
    ID                Look("Zeta")
    OP                EQ
    ID                Look("Zeta")
    OP                MINUS
    ID                Look("Pearl")
    RBRACE            0

 

The call Look(x) returns the entry in the hash table "Ht" where string x is stored. This entry holds string x, along with fields in which other useful information about string x can be placed. The symbols

WHILE, ID, LPAREN, RPAREN, LBRACE, RBRACE, and OP

are each #defined as different small integers. These are possible returned values. The symbols

LT, MINUS, and EQ

specify "OP" more fully. LEX will eliminate all white-space (but keep track of the line number, so error messages can be keyed to the line on which they occurred) and will eliminate all comments.

 

Parsing

This phase has the responsibility of reconstructing the derivation of the source program, using the "grammar" of the language. It reads the sequence of tokens produced by LEX, and organizes them into a "parse tree" which gives a hierarchical picture of this derivation. For the statement

while(Jasmine < Pearl) {

Zeta = Zeta - Pearl

}

PARSE might produce a tree which can be symbolized by:
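
One plausible rendering of that tree (the exact labels and layout are an assumption):

                WHILE
               /      \
             <          =
            / \        / \
     Jasmine   Pearl Zeta   -
                           / \
                       Zeta   Pearl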

 

Here, a name like "Zeta" represents the value of Look("Zeta"), a pointer to a hash table entry.

Code generation might produce the following SPIM code from this:

 

        BR      L0
L1:
        LW      $t0,Zeta
        LW      $t1,Pearl
        SUB     $t0,$t0,$t1
        STW     $t0,Zeta
L0:
        LW      $t0,Jasmine
        LW      $t1,Pearl
        SLT     $t0,$t0,$t1
        BNZ     $t0,L1
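
A hedged C sketch of a code-generator routine that could emit this pattern for a while node (the helper names gen_cond and gen_stmts, the printf-style emission, and the label counter are assumptions):

#include <stdio.h>

struct TreeNode;                                        /* as sketched earlier             */
void gen_cond(struct TreeNode *cond, const char *reg);  /* leaves 1 or 0 in reg (e.g. SLT) */
void gen_stmts(struct TreeNode *body);                  /* emits code for the loop body    */

static int labelno = 0;
static int newlabel(void) { return labelno++; }

/* "Branch to the test, put the test at the bottom" – the pattern shown above. */
void gen_while(struct TreeNode *cond, struct TreeNode *body)
{
    int Ltest = newlabel();                  /* L0 in the example */
    int Lbody = newlabel();                  /* L1 in the example */

    printf("\tBR\tL%d\n", Ltest);            /* jump to the test first                    */
    printf("L%d:\n", Lbody);
    gen_stmts(body);                         /* e.g. Zeta = Zeta - Pearl                  */
    printf("L%d:\n", Ltest);
    gen_cond(cond, "$t0");                   /* e.g. LW, LW, SLT $t0,$t0,$t1              */
    printf("\tBNZ\t$t0,L%d\n", Lbody);       /* repeat the body while the condition holds */
}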