A Simple Compiler - Part 2: Syntax analysis.

Like Lexical analysis, Syntax analysis (or so-called parsing) is a highly analyzed and well understood part of compiling. Basically, parsing takes the symbols returned by the scanner and makes sure that they form sentences (or productions) that are legal, according to the languages grammar.

There are many different parsing algorithms, some more flexible than others. Some are more suited to automated parser generators than others. Well known parser generators include YACC, BISON (the GNU version of YACC), Antlr, and Coco/R. We will cover one algorithm - called recursive descent - which is well suited to a hand-crafted compiler. But before we get there, we'll need to talk about grammars.

A grammar is a formal way of specifying the syntax of a language. Developing a well defined non-ambiguous grammar is one of the most important steps you can take to make writing your compiler successful. The particular grammar format we'll use is called EBNF, for Extended Backus Naur Form.

An EBNF grammar consists of:

Terminals - which consist of symbols in the language such as keywords and operators.

Non-terminals - these are syntactic classes that can be substituted, and eventually lead to terminals.

Meta symbols - such as '=', '|', '()', '{}', '[]' and '.'.

Productions - this is simply a rule in the grammar.

The start symbol - this is where the grammar starts.

By convention, terminals are specified in quotes, and non-terminals are not.

A production in a grammar takes the following form:

non-terminal = terminal non-terminal ....

For example:

 sentence = subject predicate "." .
 subject = "Peter" | "Paul" .
 predicate = "writes compilers" | "crashes compilers" .

This grammar has 3 productions. Sentence is the start symbol, subject and predicate are non-terminals, and "Peter", "Paul", "writes compilers", "crashes compilers" are terminals. Additionally, '|' is used for alteration. Note how all non-terminals have to eventually be defined by terminals.

Valid sentences in this language are:

 Peter writes compilers.
 Peter crashes compilers.
 Paul writes compilers.
 Paul crashes compilers.

To specify a function call in a C-like language, we could write:

 function_call = ident '(' [parm_list] ')' .
 parm_list = ident {',' ident} .

The grammar above has two productions: function_call and parm_list. ident, '(', ',', and ')' are terminals, and parm_list is a non-terminal. [] and {} are meta-symbols, meaning zero-or-one and zero-or-more of what they enclose, respectively. Other meta-symbols are '|' for 'or', and '(' and ')' for grouping. As you can see from the above simple grammar, non-terminals in the production to the right of the '=' have to also be listed to the left of the '='.

Note that by convention, ident and number are treated as terminals, and are usually specified by something similar to:

 ident = letter { letter | digit }
 number = digit { digit }

Another common terminal is String, which can be defined as:

 string = "'" {no_single_quote} "'" .
 no_single_quote = AnyChar - "'" .
 AnyChar = Chr(32) .. Chr(255) .

function_call can be read as requiring an identifier, an opening '(', an optional parameter list, and a closing )'. parm_list can be read as an identifier, optionally followed by zero or more ',' followed by another identifier.

According to the above grammar, the following are valid function calls:

 factor()
 term(a)
 expr(tx, prec, tptr)

The following are invalid function calls:

 1fun()  --an identifier must start with an alphabetic character
 fun     --a function call must have '(' and ')'
 myfun(a b c)    --parameters must be separated by ','

Another simple example. We'd like to write a teeny compiler for a subset of Pascal. Our subset will only support one statement - writeln. Here is a grammar for teeny Pascal:

  Program = 'program' ident ';' 'begin' Stmtseq 'end' '.' .
  Stmtseq = [Stmt {';' Stmt }] .
  Stmt    = 'writeln' '(' String ')' .

This grammar has 3 productions. Program is the start symbol. Pascal, Stmtseq and Stmt are non-terminals. Note how each of the non-terminal symbols is substituted for until it leads to a terminal.

  A teeny Pascal example program:

  program one;
  begin
    writeln('hello, world!');
    writeln('goodbye.');
  end.

  A erroneous teeny Pascal program:

  program;    -- ident missing after program
  begin
    write     -- unknown identifier
    writeln() -- writeln requires a string
  end;        -- terminating 'end' requires a '.'

Can you see how the grammar above can be used to see if a program is syntactically correct?

So how would you write a parser for the teeny Pascal grammar?

It turns out that the parsing closely follows the grammar. We assume the existence of a routine getsym() that fetches the next symbol from the source, and stores it in the global variable sym. We also assume the existence of a routine expect(), which verifies that the passed symbol is equal to the current symbol, and if so, retrieves the next symbol by calling getsym(). Further, we assume that getsym() has already been called at least once.

To parse the Pascal production:

 Program = 'program' ident ';' 'begin' Stmtseq 'end' '.' .

 program()
   expect(programsym)
   expect(idsym)
   expect(semi)
   expect(beginsym)
   stmtseq()
   expect(endsym)
   expect(period)
 end

Apparently, for terminals, we simply make sure they are there, and in the proper order.

For non-terminals we call a function that handles the work.

To parse Stmtseq:

  Stmtseq = Stmt {';' Stmt } .

 stmtseq()
   stmt();
   while sym == semi
     getsym()    -- consume ';'
     stmt()
   end
 end

Remember, {} specifies symbols that appear zero or more times.

The above could also be written as:

 stmtseq()
   loop
     stmt()
     if sym <> semi break
     getsym()
   end
 end

Finally, to parse Stmt:

  Stmt = ['writeln' '(' String ')'] .

 stmt()
   if sym == writelnsym
     getsym()
     expect(lparen)
     expect(stringsym)
     expect(rparen)
   endif
 end

Of course I am grossly oversimplifying here, but I hope you get the general idea. I also hope you see how important it is to have a grammar, and how easy it can be to develop a parser from that grammar.

A simple example implementation of the above grammar follows. The example can run the above teeny Pascal program.

Table of Contents

Prev Part 1: Lexical analysis

Next Part 3: More about grammars

Code Repository A Simple Compiler

SemWare is a registered trademark of Sammy and Bobbi Mitchell.