Characters and Lexical Analysis
A compiler accepts a sequence of characters in some alphabet, and parses or recognizes the sequence as defining a valid program in the compiler's source language. In general, parsing involves recognizing which sub-sequences of the input form recognizable units in the language, like assignment statements, or expressions. The first step in the recognition process is usually the replacement of long strings of characters with objects, called tokens which are fixed-sized codes for the corresponding input string. The rest of the compiler then processes the sequence of tokens, and need not examine the individual characters making up each token's string.
Lexical analysis is the name given to the part of the compiler that divides the input sequence of characters into meaningful token strings, and translates each token string to its encoded for, the corresponding token. Tokens are fairly simple in structure, allowing the recognition process to be done by a simple algorithm. Examples of tokens might be:
During lexical analysis, the source program is treated as a sequence of tokens, separated by optional whitespace. Lexical analysis ignores the sequence of tokens, leaving the decision as to whether tokens occur in a meaningful sequence up to the rest of the parsing portion of the compiler.
Lexical analysis replaces each token string with its encoded form:
Here, the first column represents the value of a single byte (the "token type") in the output sequence, where NUMBER, NAME, STRING, and each value of "op" are chosen to be different. The designation <description> describes the value of a single integer which follows the token type byte, while parenthesized material is just commentary, not represented in the output of Lex. Note that comments are simply discarded -- they do not give rise to any token in the output.
Algorithm:
The algorithm for lexical analysis is based on the concept of a Finite State Machine, a mathematical model of a simplified computer. An FSM maintains an internal state, and performs actions as it moves from one internal state to another. These moves, called transitions, take place when the machine in state X reads a character C from the input. The combination (X,C) selects a particular transition, to a target state T. During the transition, the machine may perform an action A. The entire machine is definined by giving the function d(X,C) = (T,A), and specifying which state the thing STARTs in. As a special cae, we will also allow a transition that reads no character, but performs an action, and moves to a new state. The character l will be used to represent the absence of a character. If it appears in a table defining the d() function, its transition will be performed only if the next input character does NOT match any character explicitly shown in d() entries for the same state. The d() table can be stored as a pair of arrays indexed by State and Character. Actions can be encoded, by giving each possible action a unique number, and performing
switch (Action[State][Character] { case 1: <first action>; break; case 2: ... }
The program can now execute
State = Next[State][Character];
and repeat these steps, until there is no more input. Action code can:
where val[] is a pre-initialized array which maps digits to their numeric values.
When the end of a token string is recognized, the associated action routine usually produces the required output. This may involve recording the final form of the the token string in an internal data dtructure, and producing the required token-type byte, and acccompanying "value" information. In most languages, the recognition of "end-of-tioken" is triggered only by the observation of an input character that can't be incorporated into the current token. The l transition can be used, to perform the end of token action, and then proceed to process the input character that triggered that transition.
A lexical analyzer must be prepared to process any possible character, in any state. In the ASCII code, there are some 256 possible characters, and it becomes tedious to fill in the table entries for all 256 for each of perhaps 10 states. A simplification can ease this burden: As each character is read, a reference to an array, indexed by character, can retrieve the character's "character class code", Class=Cl[Character]. The other tables can then be referenced, using Class instead of Character. There are only about 8 different character classes, for the tokens described above, so this reduces the size of the State and Action tables dramatically.
Example: Lexical Analyzer for numbers, whitespace and comments State Diagram
|
White [ \t\n] |
Dig [0-9] |
Op[-+=()] |
/ |
* |
|
|
Start |
Start:: X |
Number::A |
Start:: D |
/::X |
Start:: D |
|
/ |
/::X |
Start::l D |
Start::l D |
Start::l D |
Comment::X |
|
Comment |
Comment::X |
Comment::X |
Comment::X |
Comment::X |
*::X |
|
Number |
Start::l C |
Number::B |
Start::l C |
Start::l C |
Start::l C |
|
* |
Comment::X |
Comment::X |
Comment::X |
Start::X |
*::X |
In this example, I have used "c" for the value of the character read, and "n" for the value of the number being constructed, because longer names will not fit in my Finite State Machine diagram easily. This machine is encoded by a table given above. A table entry S::Y indicates that the next state should be state S, and that action Y should be taken during the transition from this state to S. If Y begins with l , the current input character will be re-read after state S is reached. There are 5 character classes shown, and a 6th (for illegal characters) also exists. The machine starts in state "Start". The actions that can be taken are:
A: n = val[c]; /* set initial value of the number being read */
B: n = n*10 +val[c]; /* Add next digit to number */
C: Emit NUM; Emit n; /* Put the token-type byte (value NUM) and n
into the output */
D: Emit OP+c; /* Put a token-type byte with value describing
this operator */
X: /* do nothing */
In the actual code, the state-names, and the action names should be replaced by small integers; the actions will become cases in a "switch(Action[State][c]" statement.
The "character" type:
This C example treats characters read in as small integers, with each distinct character represented by a distinct integer. In C, this is automatic. 'A' is just an abbreviation for the ASCII code of the character "A". In other languages, a distinction is made between characters (which may be single-character strings) and integers. This distinction makes logical sense, but also makes operations like lexical analysis on strings of characters inefficient. Many languages do provide for some kind of functions to convert between "characters" and "ASCII codes", where the ASCII codes are small integers. Others require that this be done using some kind of mapping, or possibly a series of if-tests, that the programmer must write.