Top Down Recursive Descent Parsing
Predictive Parsing:
Create a parse tree, by starting with the tree root, predicting which ALTERNATIVE RHS is used to expand the LEFTMOST FRONTIER NON-TERMINAL.
Verify prediction, by ensuring that predicted alternative can GENERATE the NEXT input token.
Deterministic
In a deterministic parse, the predictions are always correct. That is, the parser knows that the ONLY possible choice of RHS which can generate the next token is the one it has picked.
Because the sets of tokens FIRST(RHSi) are all disjoint sets for all rules X::= RHSi
This property does not hold for every grammar. A grammar for which it holds is termed LL(1), standing for "can build a leftmost derivation by scanning from left to right, with one symbol of lookahead". (A "leftmost" derivation always expands the leftmost non-terminal in the LHD of each derivation step.)
If it does not hold, you may be able to change the grammar so it does, WITHOUT changing the language that is generated.
Top Down Recursive Descent Parser
Build a set of subroutines, one for each NON-TERMINAL in the language. The subroutine named "X" has the job of parsing the longest sequence a of tokens it can find on the input, where X =>* a . X builds a parse tree for this derivation of a , and returns the node number of its root. I usually do this, by defining a routine "node(rule#, child1, child2, child3, …)", which computes the next node number, enters the values of its arguments into fields of that node, (or, maybe, prints the node number: argument values in tabular form). "node()" then returns the number of the node it constructed.
Whenever the prediction is "find an X next", where X is a non-terminal, the parser can simply CALL X(). If a TERMINAL is next, the parser should check that the PREDICTED terminal has been found, and report an error, if not (and perform some corrective action that will not generate too many error messages, and will allow parsing to continue). If the predicted terminal WAS found, Lex() should be called, to "scan the matched token away". In all cases, information returned by X(), or Lex(), may be saved in variables declared LOCAL to each parsing routine. These variables (which are usually node numbers of sub-parse-tree roots) can then be used as arguments to node().
Example
Here, E and V are non-terminals (E is a kind of "expression", V a "variable"). "I", "(", ")", and "-" are terminals.
V à I1 | I(E) 2
E à V3 | V-E4 | -E5 | (E) 6
Here, I have added subscripts to each RHS, so each can be uniquely identified. These subscripts are the "rule numbers" that will act as labels for the nodes in the parse tree.
This grammar does not have the LL(1) property, for when a V is predicted, the two RHSs for V both begin with I. Similar problems hold for E. However, the operation known as "factoring" can help.
Factoring
Factoring replaces RHSs which begin with the same prefix with a single RHS, which begins with that prefix, and is then followed by a choice of suffices separated by "|". The set of suffix choices is placed inside "meta-parentheses" ( and ).
V à I ( l 1 | ( E )2 )
E à V ( l 3| -E 4) | -E5 | (E)6
The new rule for V is equivalent to the rules:
V à IX
X à l | (E)
That is, we have in effect introduced a new non-terminal symbol X to "defer" the decision as to which RHS for V in the original grammar should be used in generating the current input sentence. "Deferring" the decision allows the parser to read over the common part of the RHSs (the I), and wait until the token FOLLOWING the I is available. The parser then makes its prediction based on this deferred input.
There are some caveats in checking for the LL(1) property, which you must be aware of. Essentially, where l is an alternative RHS, when l actually IS the rule to use, the input will be ready to read whatever FOLLOWS the l . That might be anything that could follow the rule’s LHS in any legal sentential form. These FOLLOW items must not match any token in any alternative FIRST(RHS), if the LL(1) property is to hold.
In the current grammar, FOLLOW(V) = {-, ) }, FOLLOW(E) = { ) }.
FIRST(V) = {I}. We must check that FOLLOW(V) and "(" have no symbols in common. Similarly, FOLLOW(E) must not contain "-". Luckily, all FIRST sets are now disjoint, and appropriate FOLLOW sets differ from their neighboring FIRST sets, so this transformed grammar IS LL(1).
Note that although we will use the new grammar as the basis for constructing a parser, the parse tree that is constructed MUST have the same structure as that implied by the ORIGINAL grammar. Sometimes that is hard to achieve, but the proper structure can usually be built by fiddling with the local variables of the parsing routines.
Below, I’ll write the parsing routine for V by directly following the "recipe" defined by the rule for the non-terminal V in the new grammar:
Note that ALL parse routines are built to preserve a parsing invariant: On call, the first token of the phrase to be parsed has ALREADY been Lex()’d. Each parsing routine exits with this property preserved, even if that means that the parsing routine must read one token beyond the last token it actually examines. The first token to be examined by each parsing routine is found in TOK on entry.
int V() {
int j,k;
if (TOK.TYP != ID ) error();
j = TOK.ID; Lex();
if (TOK.TYP == LPAREN ) {
Lex(); k=E();
If (TOK.TYP!= RPAREN ) error();
Lex(); return(node(2,j,k));
}
else return(node(1,j,0));
}
int E() {
int j,k;
switch (TOK.TYP) {
case ID:
j=V();
if (TOK.TYP == MINUS ) {
Lex(); k=E();
return ( node(4,j,k));
}
return (node(3,j,0));
case MINUS:
Lex(); return(node(5,E(),0));
case LPAREN:
Lex(); k=E();
if (TOK.TYP!=RPAREN) error();
Lex(); return(node(6,k,0));
default:
error();
}
}
In the above routines, error() is assumed to exit(1). Actually, each call on this generic "error" routine should be replaced by some corrective action specific to the context of the error. For instance, where a RPAREN is expected, the parser could emit an error message which says "I’m inserting a ‘)’ here", and act as if it has done so. Where a "V" is expected, but can’t be found, some parsers go so far as to insert a dummy variable name, which will allow the parse to continue. Care must be taken to "scan away" an input token which is completely illegal, to avoid causing an infinity of error comments.
Left Recursion Elimination
Many grammars include rules which involve "left recursion", for example:
E ® T1 | E - T2
The FIRST sets of each of these RHS's must be the same, since the same non-terminal T begins each RHS. However, some study shows that this grammar for E indicates that an E is simply a T, followed by zero or more "-T" sequences. In other words, E = T(-T)*, written as a regular expression. We can write a recognizer for these sequences which is deterministic, by using a "while" loop:
int E() {
int k,j;
j = node(1, T(), 0);
while (TOK.TYP==MINUS) {
Lex();
k = T();
j = node(2, j, k);
}
return (j);
}
The resulting parse tree maintains the implied parenthesis structure of the parse tree for the original grammar, so the string:
T-T-T-T
will be parsed to have structure
((((T)-T)-T)-T)