A Simple Compiler - Part 1: Lexical analysis.

Lexical analysis, also called scanning, this part of a compiler breaks the source code into meaningful symbols that the parser can work with. Typically, the scanner returns an enumerated type (or constant, depending on the language) representing the symbol just scanned. Scanning is the easiest and most well-defined aspect of compiling.

Additional responsibilities of the scanner include removing comments, identifying keywords, and converting numbers to internal form.

For an example, given the following assignment statement from a Basic-like language:

 average = total / num_items

A scanner might return, in the order given:

 Symbol:         Source text:
 -------         -----------

 identifier      average
 assign          =
 identifier      total
 divide          /
 identifier      num_items

For a function call, from a C-family language:

 puts("Hello, World");

A scanner might return, in the order given:

 Symbol:         Source text:
 -------         -----------

 identifier      puts
 left_paren      (
 string_literal  "Hello, World"
 right_paren     )
 semicolon       ;

When designing a new language, it is a good idea to think about the symbols that make up the language, and define them in such a way that they can be specified in a non-ambiguous fashion.

A simple example:

Suppose we have a simple language that allows you to display the output of constant integer expressions, featuring the addition and multiplication operators. An example statement in the language:

 write(2001 + 42 * 457)

From this, we might say that the language is composed of Identifiers, integer constants, '(', ')', '+', '*'.

We could specify this more formally using regular expressions:

 identifiers:        [a-zA-Z][a-zA-Z0-9]*
 integer constants:  [0-9]+
 left parenthesis    (
 right parenthesis   )
 plus                +
 times               *

Note that we are assuming the language is free-form and not line based.

Tools exist that will take a specification not too far removed from this and automatically create a scanner. Lex and Flex are both popular scanner generators. Also, many parser generators include built-in scanner generators. Examples are Coco/R and Antlr.

It turns out that scanners, especially for non-ambiguously defined languages, are fairly easy to write.

Many compiler texts recommend constructing a scanner via a finite state machine. But for our purposes, a simple ad-hoc scanner is sufficient.

The main routine of a scanner, which returns an enumerated constant of the next symbol read is:

 getsym()
     -- skip white space
     while the char is a space or eol
         keep reading characters
     switch on the char
         when '+' read next char; return plus
         when '*' read next char; return times
         ...
         when 'a'..'z', 'A'..'Z':
             while the char is alphanumeric
                 accumulate characters in id
             return ident or keyword symbol, as appropriate
         when '0'..'9'
             while the char is a digit
                 accumulate the integer value in val
             return number
         otherwise
             read next char;
             return null
 end getsym

Table of Contents

Next Part 2: Syntax Analysis

Code Repository A Simple Compiler


TSE
  • Copyright © 2024 SemWare Corporation. All rights reserved worldwide.
  • SemWare is a registered trademark of Sammy and Bobbi Mitchell.