Lexicon
Lexical analysis is the first step in character string
processing: it segments character strings into a sequence of words,
also known as lexical units or lexemes.
Module Genlex
This module provides a simple primitive for analyzing a string
of characters using several predefined categories of lexical
units. These categories are distinguished by the following type:
# type token =
    Kwd of string
  | Ident of string
  | Int of int
  | Float of float
  | String of string
  | Char of char ;;
Hence, we will be able to recognize within a character string an
integer (constructor Int) and to recover its value
(constructor argument of type int). Recognizable strings and
characters respect the usual conventions: a string is delimited by two
(") characters and character literals by two (')
characters. A float is represented by using either floating-point
notation (for example 0.01) or exponent-mantissa notation
(for example 1E-2).
Constructor Ident designates the category of
identifiers: the names of variables or functions in programming
languages, for example. These are strings of letters and digits,
possibly including the underscore (_) or apostrophe (')
characters; such a string must not start with a digit. We also consider
as identifiers (for this module at least) strings of
operator symbols, such as +, *, > or =. Finally,
constructor Kwd defines the category of keywords, containing
distinguished identifiers or special characters (specified by the
programmer when invoking the lexer).
The only category of the token type that the programmer can
parameterize is that of keywords. The following primitive creates a lexical
analyzer (lexer) that takes as keywords the list passed as its first argument.
# Genlex.make_lexer ;;
- : string list -> char Stream.t -> Genlex.token Stream.t = <fun>
The result of applying make_lexer to a list of keywords is a
function taking as input a stream of characters and returning a stream
of lexical units (of type token).
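For instance, a lexer built with an empty keyword list already sorts its
input into these categories. A small sketch (Stream.npeek returns the
first elements of a stream; the toplevel prints the constructors with
their module prefix):

# Stream.npeek 4 (Genlex.make_lexer [] (Stream.of_string "pi 3 0.01 \"ok\"")) ;;
- : Genlex.token list =
[Genlex.Ident "pi"; Genlex.Int 3; Genlex.Float 0.01; Genlex.String "ok"]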
Thus we can easily obtain a lexer for our BASIC interpreter. We declare the
set of keywords:
# let keywords =
    [ "REM"; "GOTO"; "LET"; "PRINT"; "INPUT"; "IF"; "THEN";
      "-"; "!"; "+"; "-"; "*"; "/"; "%";
      "="; "<"; ">"; "<="; ">="; "<>"; "&"; "|" ] ;;
With this definition in place, we define the lexer:
# let line_lexer l = Genlex.make_lexer keywords (Stream.of_string l) ;;
val line_lexer : string -> Genlex.token Stream.t = <fun>
# line_lexer "LET x = x + y * 3" ;;
- : Genlex.token Stream.t = <abstr>
Function line_lexer takes as input a string of characters and
returns the corresponding stream of lexemes.
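Since streams are lazy, the result above is printed as <abstr>. We can
force and inspect the first lexemes with Stream.npeek; a sketch of what
this yields:

# Stream.npeek 8 (line_lexer "LET x = x + y * 3") ;;
- : Genlex.token list =
[Genlex.Kwd "LET"; Genlex.Ident "x"; Genlex.Kwd "="; Genlex.Ident "x";
 Genlex.Kwd "+"; Genlex.Ident "y"; Genlex.Kwd "*"; Genlex.Int 3]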
Use of Streams
We can carry out the lexical analysis ``by hand'' by directly
manipulating streams.
The following example is a lexer for arithmetic
expressions. Function lexer takes a character stream and
returns a stream of lexical units of type
lexeme Stream.t. Spaces, tabs and newline
characters are removed. To keep things simple, we do not consider variables
or negative integers.
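Type lexeme is the book's type of lexical units for the BASIC
interpreter, defined elsewhere. A minimal declaration sufficient for the
examples below might be the following (the exact constructor set is an
assumption; only these four are used in this chapter):

# type lexeme =
    Lint of int          (* integer constants *)
  | Lident of string     (* identifiers *)
  | Lsymbol of string    (* operators and keywords *)
  | Lstring of string    (* string literals *)
;;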
# let rec spaces s =
    match s with parser
      [< '' ' ; rest >] -> spaces rest
    | [< ''\t' ; rest >] -> spaces rest
    | [< ''\n' ; rest >] -> spaces rest
    | [<>] -> () ;;
val spaces : char Stream.t -> unit = <fun>
# let rec lexer s =
    spaces s;
    match s with parser
      [< ''(' >] -> [< 'Lsymbol "(" ; lexer s >]
    | [< '')' >] -> [< 'Lsymbol ")" ; lexer s >]
    | [< ''+' >] -> [< 'Lsymbol "+" ; lexer s >]
    | [< ''-' >] -> [< 'Lsymbol "-" ; lexer s >]
    | [< ''*' >] -> [< 'Lsymbol "*" ; lexer s >]
    | [< ''/' >] -> [< 'Lsymbol "/" ; lexer s >]
    | [< ''0'..'9' as c ;
         i,v = lexint (Char.code c - Char.code '0') >]
         -> [< 'Lint i ; lexer v >]
  and lexint r s =
    match s with parser
      [< ''0'..'9' as c >]
         -> let u = (Char.code c) - (Char.code '0') in lexint (10*r + u) s
    | [<>] -> r,s ;;
val lexer : char Stream.t -> lexeme Stream.t = <fun>
val lexint : int -> char Stream.t -> int * char Stream.t = <fun>
Function lexint carries out the lexical analysis for the
portion of a stream describing an integer constant. It is called by
function lexer when lexer finds a digit on the input stream.
Function lexint then consumes all consecutive digits
to obtain the corresponding integer value.
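We can check this lexer on a small expression; a sketch (again using
Stream.npeek to force the first elements of the lazy result):

# Stream.npeek 5 (lexer (Stream.of_string "12 + 3 * 4")) ;;
- : lexeme list = [Lint 12; Lsymbol "+"; Lint 3; Lsymbol "*"; Lint 4]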
Regular Expressions
Let's abstract a bit and consider the problem of lexical units from a
more theoretical point of view.
From this point of view, a lexical unit is a word. A word is
formed by concatenating items from an alphabet. For our purposes,
the alphabet we are considering is a subset of the ASCII
characters. Theoretically, a word may contain no characters at all (the
empty word) or just a single
character. The theoretical study of the assembly of lexical units
(lexemes) from the members of an alphabet has produced a simple formalism
known as regular expressions.
Definition
A regular expression defines a set of words. For example, a regular expression
could specify the set of words that are valid identifiers.
Regular expressions are specified by a few
set-theoretic operations. Let M and N be two sets of words. Then
we can specify:
- the union of M and N, denoted M | N;
- the complement of M, denoted ^M: the set of all words not in M;
- the concatenation of M and N: the set of all words formed by placing
  a word from M before a word from N; we denote this set simply MN;
- the set of words formed by a finite, non-empty sequence of words of M,
  denoted M+;
- for syntactic convenience, M?, denoting the set of words of M with
  the addition of the empty word.
Individual characters denote the singleton set of words containing
just that character. Expression a | b | c thus describes the
set containing three words: a, b and c. We will use the more compact
syntax [abc] to define such a set. As our alphabet is ordered
(by ASCII code), we can also define intervals: for example, the
set of digits can be written [0-9]. We can use parentheses to
group expressions.
If we want to use one of the operator characters as a character in a
regular expression, it should be preceded by the escape character
\. For example, (\*)* denotes the set of sequences of
stars.
Example
Let's consider the alphabet comprising the digits
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9), the plus (+), minus (-) and dot (.)
signs, and the letter E. We can define the set
num of words denoting numbers. Let integers be the set defined
by [0-9]+. We define the set unum of unsigned numbers as:
integers?(.integers)?(E(\+|-)?integers)?
For example, 12, 0.01 and 1E-2 all belong to unum. The set num of
signed numbers is then defined as:
unum | -unum    or, equivalently,    -?unum
Recognition
While regular expressions are a useful formalism in their own right,
we usually wish to implement a program that determines whether a
string of characters (or one of its substrings) belongs to the set
of words described by a regular expression. For that we need to
translate the formal definition of the set into a recognition and
expression processing program. In the case of regular expressions such
a translation can be automated. Such translation techniques are
exploited by the Str library
(described in the next section) and by the ocamllex tool that
we introduce in the following two sections.
The Str Library
This module provides an abstract data type regexp, which
represents regular expressions, as well as a function regexp
which takes a string describing a regular expression, more or less
following the syntax described above, and returns its abstract
representation.
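As an illustration, here is a tentative Str transcription of the unum
set from the previous example. A sketch: it slightly strengthens the
definition by requiring the integer part, and grouping in Str is written
\( \), which doubles the backslashes inside an OCaml string literal:

# let unum = Str.regexp "[0-9]+\\(\\.[0-9]+\\)?\\(E[+-]?[0-9]+\\)?" ;;
val unum : Str.regexp = <abstr>
# Str.string_match unum "1E-2" 0 ;;
- : bool = true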
The module also provides a number of functions exploiting
regular expressions to manipulate strings. The syntax of regular
expressions for the Str library is given in figure 11.1.
.        any character except \n
*        zero or more occurrences of the preceding expression
+        one or more occurrences of the preceding expression
?        zero or one occurrence of the preceding expression
[..]     set of characters (example [abc]);
         intervals, denoted by - (example [0-9]);
         set complements, denoted by ^ (example [^A-Z])
^        start of line (not to be mistaken for the use of ^ as a set
         complement)
$        end of line
\|       alternative
\(..\)   grouping of a complex expression (we can later refer to such an
         expression by an integer index -- see below)
\i       an integer constant i, referring to the string matched by the
         i-th complex expression
\        escape character (used when matching a reserved character in
         regular expressions)
Figure 11.1: Regular expressions
Example
We want to write a function translating dates in
Anglo-Saxon format (month.day.year) into French dates (day/month/year)
within a data file. We suppose
that the file is organized into lines of data fields and that the
components of an Anglo-Saxon date are separated by dots. Let's define
a function which takes as argument a string (i.e. a line
from the file), isolates the date, decomposes and translates it, then
replaces the original with the translation.
# let french_date_of d =
    match d with
      [mm; dd; yy] -> dd ^ "/" ^ mm ^ "/" ^ yy
    | _ -> failwith "Bad date format" ;;
val french_date_of : string list -> string = <fun>
# let english_date_format = Str.regexp "[0-9]+\\.[0-9]+\\.[0-9]+" ;;
val english_date_format : Str.regexp = <abstr>
# let trans_date l =
    try
      let _ = Str.search_forward english_date_format l 0 in
      let d1 = Str.matched_string l in
      let d2 = french_date_of (Str.split (Str.regexp "\\.") d1) in
      Str.global_replace english_date_format d2 l
    with Not_found -> l ;;
val trans_date : string -> string = <fun>
# trans_date "..............06.13.99............" ;;
- : string = "..............13/06/99............"
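If no date is found, search_forward raises the exception Not_found and
the line is returned unchanged:

# trans_date "no date here" ;;
- : string = "no date here"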
The ocamllex Tool
The ocamllex tool is a lexical analyzer generator built for Objective CAML
after the model of the lex tool for the C language. It generates
a source Objective CAML file from a file describing the lexical elements to
be recognized in the form of regular expressions. The programmer can
augment each lexical element description with a processing action known
as a semantic action. The generated code manipulates an abstract
type lexbuf defined in module Lexing. The
programmer can use this module to control processing of lexical
buffers.
Usually the lexical description files are given the extension
.mll. Later, to obtain an Objective CAML source from a lex_file.mll,
you type the command
ocamllex lex_file.mll
A file lex_file.ml is generated containing the code for the
corresponding analyzer. This file can then be compiled with other
modules of an Objective CAML application. For each set of lexical analysis
rules there is a corresponding function taking as input a lexical
buffer (of type Lexing.lexbuf) and returning the value
defined by the semantic actions. Consequently, all actions in the same
rule must produce values of the same type.
The general format for an ocamllex file is

{ header }
let ident = regexp
...
rule ruleset1 = parse
    regexp { action }
  | ...
  | regexp { action }
and ruleset2 = parse
  ...
and ...
{ trailer-and-end }
Both the ``header'' and ``trailer-and-end'' sections are optional. They
contain Objective CAML code defining types, functions, etc. needed for
processing. The code in the last section can use the lexical analysis
functions that will be generated by the middle section. The
declaration list preceding the rule definitions allows the user to name
some regular expressions, so they can later be invoked by name
in the definition of the rules.
Example
Let's revisit our BASIC
example. We want to refine the type of lexical units
returned. We once again define function lexer
(as we did on page ??) with the same output type
(lexeme), but taking as input a buffer of type
Lexing.lexbuf.
{
let string_chars s = String.sub s 1 ((String.length s)-2) ;;
}

let op_ar = [ '-' '+' '*' '%' '/' ]
let op_bool = [ '!' '&' '|' ]
let rel = [ '=' '<' '>' ]

rule lexer = parse
    [' ']    { lexer lexbuf }
  | op_ar    { Lsymbol (Lexing.lexeme lexbuf) }
  | op_bool  { Lsymbol (Lexing.lexeme lexbuf) }
  | "<="     { Lsymbol (Lexing.lexeme lexbuf) }
  | ">="     { Lsymbol (Lexing.lexeme lexbuf) }
  | "<>"     { Lsymbol (Lexing.lexeme lexbuf) }
  | rel      { Lsymbol (Lexing.lexeme lexbuf) }
  | "REM"    { Lsymbol (Lexing.lexeme lexbuf) }
  | "LET"    { Lsymbol (Lexing.lexeme lexbuf) }
  | "PRINT"  { Lsymbol (Lexing.lexeme lexbuf) }
  | "INPUT"  { Lsymbol (Lexing.lexeme lexbuf) }
  | "IF"     { Lsymbol (Lexing.lexeme lexbuf) }
  | "THEN"   { Lsymbol (Lexing.lexeme lexbuf) }
  | '-'? ['0'-'9']+   { Lint (int_of_string (Lexing.lexeme lexbuf)) }
  | ['A'-'z']+        { Lident (Lexing.lexeme lexbuf) }
  | '"' [^ '"']* '"'  { Lstring (string_chars (Lexing.lexeme lexbuf)) }
The translation of this file by ocamllex produces a function
lexer of type Lexing.lexbuf -> lexeme. We will
see later how to use such a function in conjunction with syntactic
analysis (see page ??).
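In the meantime, here is a minimal sketch of driving the generated
analyzer by hand. It assumes the description above was saved as
basic_lexer.mll, processed with ocamllex, and compiled so that a module
Basic_lexer (a hypothetical name) exposes lexer, with the lexeme type in
scope; since no rule matches the end of input, the generated function
raises Failure there, which we use to stop:

(* Hypothetical driver for the generated lexer. *)
let () =
  let buf = Lexing.from_string "LET x = x + y * 3" in
  try
    while true do
      match Basic_lexer.lexer buf with
      | Lsymbol s -> Printf.printf "symbol %s\n" s
      | Lident s  -> Printf.printf "ident  %s\n" s
      | Lint n    -> Printf.printf "int    %d\n" n
      | Lstring s -> Printf.printf "string %s\n" s
    done
  with Failure _ -> ()   (* raised when no rule matches: end of input *)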