A Simple Compiler - Part 3: More about grammars.

Some notes about the teeny compiler listing from last time:

Code is generated for a virtual machine. This virtual machine is based on the ones found in the texts by Wirth and Terry.

There is no error recovery - when an error is found, the compiler simply halts. We'll improve this in the future.

In the scanner, idsym is assigned to the (empty) last entry in the keywords table, hence idsym is the Symbol type if the symbol is not a keyword. Also, the keywords table is searched in a linear fashion. In a production compiler, a hash table should be used.

More about grammars:

We've looked at some very simple grammars. Do you remember the difference between terminals (basic language elements) and non-terminals (syntactic elements that are substituted for). Do you remember the EBNF meta symbols? () for grouping, | for alternation, {} for items occurring zero or more times, and [] for optional items.

Ready for something a little more complex? Remember our writeln statement? Lets allow for multiple strings, separated by commas, so that we can now write:

 writeln("string 1", "string 2", "string 3")

This isn't really useful now, but it will be once our language supports expressions. Here is the grammar:

 Writeln = 'writeln' '(' String { ',' String } ')' .

This says that writeln expects a left parenthesis, followed by a string, optionally followed by zero or more Strings, separated by commas. How would we parse this?

 if accept(writeln)
   expect(lparen)
   expect(litstrsym)
   while accept(comma)
     expect(litstrsym)
   end
   expect(rparen)
 end

See how the parsing code follows the grammar? As an aside, we could rewrite the above as:

 if accept(writeln)
   expect(lparen)
   repeat
     expect(litstrsym)
   until not accept(comma)
   expect(rparen)
 end

But let's slightly change the grammar, so that the parameters to Writeln are optional. That way, we can write just a newline by itself:

 Writeln = 'writeln' '(' [String { ',' String }] ')' .

Now the code becomes:

 if accept(writeln)
   expect(lparen)
   if not accept(rparen)
     repeat
       expect(litstrsym)
     until not accept(comma)
     expect(rparen)
   endif
 end

What about arithmetic expressions, for instance just the basics, those including plus, minus, multiply and divide.

 Expr    = Term {('+'|'-') Term} .
 Term    = Factor {('*'|'/') Factor} .
 Factor  = number
         | ('+'|'-') Factor
         | '(' Expr ')'
         .

Do you see how the above grammar forces the standard arithmetic precedence?

How would we parse the above? Just like we've been doing all along. Here goes:

To parse Expr:

  Expr = Term {('+'|'-') Term} .

 expr()
   term()
   while sym in [plus, minus]
     getsym()
     term()
   end
 end

  Term = Factor {('*'|'/') Factor} .

 term()
   factor()
   while sym in [multiply, divide]
     getsym()
     factor()
   end
 end

  Factor  = number
          | ('+'|'-') Factor
          | '(' Expr ')'
          .

 factor()
   if sym == number
     getsym()
   elseif sym in [plus, minus]
     getsym()
     factor()
   elseif sym == lparen
     getsym()
     expr()
     expect(rparen)
   else
     error()
   end
 end

I'm not trying to overemphasize this point, but do you see how simple it is to implement the parser, as long as you have a grammar? See how important a grammar is?

But all isn't quite as easy as it seems. In order to parse your grammar via recursive descent, you must make sure that your grammar is not ambiguous. Here is an example of an ambiguous grammar:

 Statement = 'if' Expr 'then' StmtSeq 'end'
           | 'while' Expr 'do' StmtSeq 'end'
           | ident ':=' Expr
           | ident '(' [ Expr {',' Expr}] ')'
           .

Can you see the ambiguity? Both assignment (ident ':=' Expr) and procedure call (ident '(' [ Expr {',' Expr}] ')') start with the same terminal. This won't work in a recursive descent parser. For that, we would need to use a more sophisticated algorithm.

So how can we solve this problem? Well, the only way is to either redesign your language or somehow change the grammar in an appropriate fashion. We'll solve the problem by doing the latter when we implement assignment and procedures.

Here is the new grammar for our teeny (version .03) language:

 Teeny     = 'program' ident ';' CmpndStmt '.' .
 CmpndStmt = 'begin' [Stmt {';' Stmt }] 'end' .
 Stmt      = Writeln .
 Writeln   = 'writeln' '(' [(string|Expr)
               {',' (string|Expr) }] ')' .
 Expr      = Term { ('+' | '-') Term } .
 Term      = Factor { ('*' | '/') Factor } .
 Factor    = number
           | ('+' | '-') Factor
           | '(' Expr ')'
           .

And here is a sample teeny v0.03 program:

 program three;
 begin
   writeln('hello, world!');
   writeln('2 + 3 * 4 = ', 2 + 3 * 4);
 end.

It took about 100 lines of code to implement the new features. The major changes are listed here:

To the Virtual Machine interpreter:
- Added new instructions
  - wri - write an integer
  - lit - load an integer constant on the stack
  - neg - negate the integer at top
  - mul, dvd, add, sub: replace top-1 with the appropriate operation applied to top and top - 1

Added a stack for variable evaluation.

To the VirtualMachine:
- Added methods to encapsulate code generation and make the parser less knowledgeable about what kind of machine code is generated for.

To the Scanner:
- Added various operators ('-', '/', ',')

To the Parser:
- Updated to handle new grammar

Next time we'll implement integer constants.

Table of Contents

Prev Part 2: Syntax analysis

Next Part 4: Semantic Analysis - the Symbol Table

Code Repository A Simple Compiler

SemWare is a registered trademark of Sammy and Bobbi Mitchell.