History of Parsing Methods
Anatomy of the Compilation Process
The different tasks to be done by a compiler can be broken down as
follows:
Preprocessing - handling include files, macros, and manifest
constants
Scanning - breaking the input file into tokens and managing the
symbol table
Parsing - verify that the input tokens form valid syntax
Semantics - derive meaning from the structure of the input tokens
Optimizations - find ways to accomplish the same task, quicker
Code Generation - generate assembly language code for a specific
computer
Assembly - assemble the assembly language code to an object module
ld - load one or more object modules with libraries to form an
executable.
The primary emphasis of this course is to study the parsing techniques
and the theory that makes it work.
BNF notation
John Backus, the principle designer of FORTRAN, and Peter Naur, a
journalist for a computer magazine, both attend a conference on Algol in
1960 in Paris, France, I believe. On their return trip they discussed
the definition of Algol-60, as they had perceived it. They had both
attending the same meetings and discussions and had come away from the
conference with widely differing interpretations of what the langauge
would look like and how it would work. Together they worked on refining
a notation for describing the grammars of languages. This notation is
called BNF, or Backus Naur Form. Some authors have dropped Peter Naur's
name from their definitions and called it Backus Normal Form instead.
An example of a BNF grammar that describes the rules for the +, -, *,
and / operators might look something like:
EXP => EXP + TERM
EXP => EXP - TERM
EXP => TERM
TERM => TERM * FACTOR
TERM => TERM / FACTOR
TERM => FACTOR
FACTOR => ( EXP )
FACTOR => identifier
The names EXP, TERM, and FACTOR are just arbitrary names which represent
groups of symbols, for example, the FACTOR is defined as either an
identifier (variable name) or an expression in parenthesis. Symbols
that are composed of other symbols are called Non-Terminal symbols.
Symbols that are composed of key words, operators, or names or any other
symbol that may be found in a program are called Terminal symbols.
The Terminal symbols for this grammar are +, -, *, /, (, ), and
identifier. The Non-Terminal symbols are EXP, TERM, and FACTOR.
Early Parsing Methods
Grammer Embedded in the Code
The earliest compilers were written with the definition of the langauge
buried deeply within the code. With these compilers it was very difficult
to verify that the compiler accepted all of the langauge syntax and only
the language syntax. This became especially difficult when the
definition of the language was changed for later versions. All
compilers before the early 1960's were of this type because there wasn't
any uniform method of describing the language grammars.
Recursive Descent
With the advent of the BNF notation for describing the languages,
compiler writers designed the structure of their subroutines and
functions to correspond to the structure of the BNF definition of the
language. To use our example grammar above, there would be seperate
functions to handle EXP's, TERM's, and FACTOR's. The EXP function would
call itself and the TERM function, etc. This way, when it came time to
update the language to meet changing standards, it would be easier to
find where the changes should be made. It also made it much easier to
verify that the language accepted all legal syntax and only the legal
syntax.
The Recursive Descent does not guarantee that the program matches the
grammar. It only aids in making it easier for the compiler writer to
try to verify the accuracy of the parser. The search for better parsing
methods continued with some that analyzed the grammars and attempted to
automate the parsing methods. The first such method was called Top Down
Parsing or LL Parsing.
Top Down Parsing
The top down parsing method, also called a predictive parse or LL parse,
requires that we reorganize the grammar so that the first symbol of each
rule defining a given Non-Terminal will indicate which rule to choose for
the Non-Terminal. This transformation can be done to any grammar, but is
sometimes awkward. There are also some cases that cannot be parsed
correctly with Top Down Parsing methods.
Bottom UP Parsing
The bottom up parse, also called an LR parse is the most powerful
parseing method. It also has the most complicated set of algorithms for
building the parse tables. There are a set of algorithms for building
LR parse tables. The same algorithm is used for all of the LR parse
tables.
LR(0)
The first of the LR parse generation algorithms is called LR(0)
and it generates tables that are somewhat large. The LR(0) parse
algorithm do not parse grammars with certain types of ambiguities.
LR(1)
The second algorithm, which handles all of the grammars correctly is LR(1).
The LR(1) algorithm generates correct parse tables for grammars with all
of the ambiguities that are found in most useful langauges. The biggest
strike against LR(1) parse tables is the fact that the tables generated
are much larger then the LR(0).
SLR
The third algoritm attempts to handle some of the amiguities that LR(0)
fails at and keeps the size of the parse tables the same as those
generated by LR(0). It is called Simple LR.
LALR
The last algorithm, Look Ahead LR, generates parse tables that are the
same size of LR(0), but handles all of the ambiguities that are handled
by LR(1).