Additional responsibilities of the scanner include removing comments, identifying keywords, and converting numbers to internal form.
For an example, given the following assignment statement from a Basic-like language:
average = total / num_items
A scanner might return, in the order given:
Symbol: Source text: ------- -----------
identifier average assign = identifier total divide / identifier num_items
For a function call, from a C-family language:
puts("Hello, World");
A scanner might return, in the order given:
Symbol: Source text: ------- -----------
identifier puts left_paren ( string_literal "Hello, World" right_paren ) semicolon ;
When designing a new language, it is a good idea to think about the symbols that make up the language, and define them in such a way that they can be specified in a non-ambiguous fashion.
A simple example:
Suppose we have a simple language that allows you to display the output of constant integer expressions, featuring the addition and multiplication operators. An example statement in the language:
write(2001 + 42 * 457)
From this, we might say that the language is composed of Identifiers, integer constants, '(', ')', '+', '*'.
We could specify this more formally using regular expressions:
identifiers: [a-zA-Z][a-zA-Z0-9]* integer constants: [0-9]+ left parenthesis ( right parenthesis ) plus + times *
Note that we are assuming the language is free-form and not line based.
Tools exist that will take a specification not too far removed from this and automatically create a scanner. Lex and Flex are both popular scanner generators. Also, many parser generators include built-in scanner generators. Examples are Coco/R and Antlr.
It turns out that scanners, especially for non-ambiguously defined languages, are fairly easy to write.
Many compiler texts recommend constructing a scanner via a finite state machine. But for our purposes, a simple ad-hoc scanner is sufficient.
The main routine of a scanner, which returns an enumerated constant of the next symbol read is:
getsym()
-- skip white space
while the char is a space or eol
keep reading characters
switch on the char
when '+' read next char; return plus
when '*' read next char; return times
...
when 'a'..'z', 'A'..'Z':
while the char is alphanumeric
accumulate characters in id
return ident or keyword symbol, as appropriate
when '0'..'9'
while the char is a digit
accumulate the integer value in val
return number
otherwise
read next char;
return null
end getsym
Code Repository A Simple Compiler
![]() |
|