Prof Wagner: CPS 218.
Compiler Construction – Overview
Introduction
Linguistic Notations:
Describes the execution agent actions each source statement specifies
Compiler Components
Symbol table scheme
Error messages
Name resolution and usage analysis
Interface data structures
|
LEX – PARSE: |
Tokens, symbol table, constant table |
|
PARSE – OPT: |
Parse tree, Abstract Syntax Tree |
|
OPT – CGEN: |
AST or Intermediate Representation |
|
CGEN – OBJ |
Assembly language, or code in memory |
|
OBJ – results |
OBJ reads program input data, produces program output |
Variations
Each component can be designed in several ways – appropriate for different compilers. I’ll try to cover at least 2 variations in class, and ask you to use one (usually the simplest).
Each interface can also be designed in several ways. I’ll suggest the one I’ve found simplest.
Emphasis on speed: The compiler is to be coded in C, using methods which minimize the total number of instructions executed; the goal of execution optimization is also the minimization of total instructions executed, given a program which is semantically correct.
I’ll specify a small source language. The language will be a complete programming language, although not one you’d like to use. I’ll suggest ways for this language and its compiler to be expanded.
Simple optimizations will be discussed in class, and you’ll be expected to implement some of them. If time permits, we will cover more elaborate optimizations.
Notation describing source language "drives" compilation process:
Backus Naur Form [BNF] grammar explains how any string in L is generated
This imposes a hierarchical structure on each string in L
The "meaning" of any given component can be (partly) described in terms of the "meaning" (translation to machine code) of each of its components
Organization of compiler
LEX is designed to group meaningful strings of characters together, replace each such string with a pair of integers. Such a pair of integers is called a token. This allows the rest of the compiler to operate on far fewer fixed-size items – the rest of the compiler never need examine the characters of which tokens are made.
Example:
while(Jasmine < Pearl) {
Zeta = Zeta – Pearl
}
Tokens returned:
|
Returned value |
TokId |
|
WHILE |
0 |
|
LPAREN |
0 |
|
ID |
Look("Jasmine") |
|
OP |
LT |
|
ID |
Look("Pearl") |
|
RPAREN |
0 |
|
LBRACE |
0 |
|
ID |
Look("Zeta") |
|
OP |
EQ |
|
ID |
Look("Zeta") |
|
OP |
MINUS |
|
ID |
Look("Pearl") |
|
RBRACE |
0 |
The call Look(x) returns the entry in the hash table "Ht", where string x is stored. This entry holds string x, along with fields in which other useful information about string x can be placed. The symbols
WHILE, ID, LPAREN, RPAREN, LBRACE, RBRACE,
and OPare each #defined as different small integers. These are possible returned values. The symbols
LT, MINUS,
and EQspecify "OP" more fully. LEX will eliminate all white-space (but keep track of the line number, so error messages can be keyed to the line on which they occurred) and will eliminate all comments.
Parsing
This phase has the responsibility of reconstructing the derivation of the source program, using the "grammar" of the language. It reads the sequence of tokens produced by LEX, and organizes them into a "parse tree" which gives a hierarchical picture of this derivation. For the statement
while(Jasmine < Pearl) {
Zeta = Zeta – Pearl
}
PARSE might produce a tree which can be symbolized by:

Here, a name like "Zeta" represents the value of Look("Zeta"), a pointer to a hash table entry.
Code generation might produce the following SPIM code from this:
|
|
BR |
L0 |
|
L1: |
|
|
|
|
LW |
$t0,Zeta |
|
|
LW |
$t1,Pearl |
|
|
SUB |
$t0,$t0,$t1 |
|
|
STW |
$t0,Zeta |
|
L0: |
|
|
|
|
LW |
$t0,Jasmine |
|
|
LW |
$t1,Pearl |
|
|
SLT |
$t0,$t0,$t1 |
|
|
BNZ |
$t0,L1 |