Lexicon
Lexical analysis is the first step in character string
processing: it segments character strings into a sequence of words,
also known as lexical units or lexemes.
Module Genlex
This module provides a simple primitive for analyzing a string
of characters using several predefined categories of lexical
units. These categories are distinguished by the following type:
# type token =
    Kwd of string
  | Ident of string
  | Int of int
  | Float of float
  | String of string
  | Char of char ;;
Hence, we will be able to recognize within a character string an
integer (constructor Int) and to recover its value
(constructor argument of type int). Recognizable strings and
characters respect the usual conventions: a string is delimited by two
(") characters and character literals by two (')
characters. A float is represented by using either floating-point
notation (for example 0.01) or exponent-mantissa notation
(for example 1E-2).
Constructor Ident designates the category of
identifiers: the names of variables or functions in programming
languages, for example. These are strings of letters and digits,
possibly including the underscore (_) or apostrophe (')
characters; such a string must not start with a digit. We also consider
as identifiers (for this module at least) strings of
operator symbols, such as +, *, > or =. Finally,
constructor Kwd defines the category of keywords, containing
distinguished identifiers or special characters (specified by the
programmer when invoking the lexer).
The only category of the token type that the programmer can
parameterize is that of keywords. The following primitive creates a lexical
analyzer (lexer) that takes as keywords the list passed as its first argument.
# Genlex.make_lexer ;;
- : string list -> char Stream.t -> Genlex.token Stream.t = <fun>
The result of applying make_lexer to a list of keywords is a
function taking as input a stream of characters and returning a stream
of lexical units (of type token).
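For instance, a lexer built with an empty keyword list already sorts its
input into these categories. A small sketch (Stream.npeek returns the
first elements of a stream; the toplevel prints the constructors with
their module prefix):

# Stream.npeek 4 (Genlex.make_lexer [] (Stream.of_string "pi 3 0.01 \"ok\"")) ;;
- : Genlex.token list =
[Genlex.Ident "pi"; Genlex.Int 3; Genlex.Float 0.01; Genlex.String "ok"]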
Thus we can easily obtain a lexer for our BASIC interpreter. We declare the
set of keywords:
# let keywords =
    [ "REM"; "GOTO"; "LET"; "PRINT"; "INPUT"; "IF"; "THEN";
      "-"; "!"; "+"; "-"; "*"; "/"; "%";
      "="; "<"; ">"; "<="; ">="; "<>"; "&"; "|" ] ;;
With this definition in place, we define the lexer:
# let line_lexer l = Genlex.make_lexer keywords (Stream.of_string l) ;;
val line_lexer : string -> Genlex.token Stream.t = <fun>
# line_lexer "LET x = x + y * 3" ;;
- : Genlex.token Stream.t = <abstr>
Function line_lexer takes as input a string of characters and
returns the corresponding stream of lexemes.
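Since streams are lazy, the result above is printed as <abstr>. We can
force and inspect the first lexemes with Stream.npeek; a sketch of what
this yields:

# Stream.npeek 8 (line_lexer "LET x = x + y * 3") ;;
- : Genlex.token list =
[Genlex.Kwd "LET"; Genlex.Ident "x"; Genlex.Kwd "="; Genlex.Ident "x";
 Genlex.Kwd "+"; Genlex.Ident "y"; Genlex.Kwd "*"; Genlex.Int 3]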
Use of Streams
We can carry out the lexical analysis ``by hand'' by directly
manipulating streams.
The following example is a lexer for arithmetic
expressions. Function lexer takes a character stream and
returns a stream of lexical units of type
lexeme Stream.t. Spaces, tabs and newline
characters are removed. To keep things simple, we do not consider variables
or negative integers.
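Type lexeme is the book's type of lexical units for the BASIC
interpreter, defined elsewhere. A minimal declaration sufficient for the
examples below might be the following (the exact constructor set is an
assumption; only these four are used in this chapter):

# type lexeme =
    Lint of int          (* integer constants *)
  | Lident of string     (* identifiers *)
  | Lsymbol of string    (* operators and keywords *)
  | Lstring of string    (* string literals *)
;;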
# let rec spaces s =
    match s with parser
      [< '' ' ; rest >] -> spaces rest
    | [< ''\t' ; rest >] -> spaces rest
    | [< ''\n' ; rest >] -> spaces rest
    | [<>] -> () ;;
val spaces : char Stream.t -> unit = <fun>
# let rec lexer s =
    spaces s;
    match s with parser
      [< ''(' >] -> [< 'Lsymbol "(" ; lexer s >]
    | [< '')' >] -> [< 'Lsymbol ")" ; lexer s >]
    | [< ''+' >] -> [< 'Lsymbol "+" ; lexer s >]
    | [< ''-' >] -> [< 'Lsymbol "-" ; lexer s >]
    | [< ''*' >] -> [< 'Lsymbol "*" ; lexer s >]
    | [< ''/' >] -> [< 'Lsymbol "/" ; lexer s >]
    | [< ''0'..'9' as c ;
         i,v = lexint (Char.code c - Char.code '0') >]
         -> [< 'Lint i ; lexer v >]
  and lexint r s =
    match s with parser
      [< ''0'..'9' as c >]
         -> let u = (Char.code c) - (Char.code '0') in lexint (10*r + u) s
    | [<>] -> r,s ;;
val lexer : char Stream.t -> lexeme Stream.t = <fun>
val lexint : int -> char Stream.t -> int * char Stream.t = <fun>
Function lexint carries out the lexical analysis for the
portion of a stream describing an integer constant. It is called by
function lexer when lexer finds a digit on the input stream.
Function lexint then consumes all consecutive digits
to obtain the corresponding integer value.
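We can check this lexer on a small expression; a sketch (again using
Stream.npeek to force the first elements of the lazy result):

# Stream.npeek 5 (lexer (Stream.of_string "12 + 3 * 4")) ;;
- : lexeme list = [Lint 12; Lsymbol "+"; Lint 3; Lsymbol "*"; Lint 4]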
Regular Expressions
Let's abstract a bit and consider the problem of lexical units from a
more theoretical point of view.
From this point of view, a lexical unit is a word. A word is
formed by concatenating items from an alphabet. For our purposes,
the alphabet we are considering is a subset of the ASCII
characters. Theoretically, a word may contain no characters at all (the
empty word) or just a single
character. The theoretical study of the assembly of lexical units
(lexemes) from the members of an alphabet has produced a simple formalism
known as regular expressions.
Definition
A regular expression defines a set of words. For example, a regular expression
could specify the set of words that are valid identifiers.
Regular expressions are specified by a few
set-theoretic operations. Let M and N be two sets of words. Then
we can specify:
- the union of M and N, denoted M | N;
- the complement of M, denoted ^M: the set of all words not in M;
- the concatenation of M and N: the set of all words formed by placing
  a word from M before a word from N; we denote this set simply MN;
- the set of words formed by a finite, non-empty sequence of words of M,
  denoted M+;
- for syntactic convenience, M?, denoting the set of words of M with
  the addition of the empty word.
Individual characters denote the singleton set of words containing
just that character. Expression a | b | c thus describes the
set containing three words: a, b and c. We will use the more compact
syntax [abc] to define such a set. As our alphabet is ordered
(by ASCII code), we can also define intervals: for example, the
set of digits can be written [0-9]. We can use parentheses to
group expressions.
If we want to use one of the operator characters as a character in a
regular expression, it should be preceded by the escape character
\. For example, (\*)* denotes the set of sequences of
stars.
Example
Let's consider the alphabet comprising the digits
(0, 1, 2, 3, 4, 5, 6, 7, 8, 9), the plus (+), minus (-) and dot (.)
signs, and the letter E. We can define the set
num of words denoting numbers. Let integers be the set defined
by [0-9]+. We define the set unum of unsigned numbers as:
integers?(.integers)?(E(\+|-)?integers)?
For example, 12, 0.01 and 1E-2 all belong to unum. The set num of
signed numbers is then defined as:
unum | -unum    or, equivalently,    -?unum
Recognition
While regular expressions are a useful formalism in their own right,
we usually wish to implement a program that determines whether a
string of characters (or one of its substrings) belongs to the set
of words described by a regular expression. For that we need to
translate the formal definition of the set into a recognition and
expression processing program. In the case of regular expressions such
a translation can be automated. Such translation techniques are
exploited by the Str library
(described in the next section) and by the ocamllex tool that
we introduce in the following two sections.
The Str Library
This module provides an abstract data type regexp, which
represents regular expressions, as well as a function regexp
which takes a string describing a regular expression, more or less
following the syntax described above, and returns its abstract
representation.
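As an illustration, here is a tentative Str transcription of the unum
set from the previous example. A sketch: it slightly strengthens the
definition by requiring the integer part, and grouping in Str is written
\( \), which doubles the backslashes inside an OCaml string literal:

# let unum = Str.regexp "[0-9]+\\(\\.[0-9]+\\)?\\(E[+-]?[0-9]+\\)?" ;;
val unum : Str.regexp = <abstr>
# Str.string_match unum "1E-2" 0 ;;
- : bool = true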
The module also provides a number of functions exploiting
regular expressions to manipulate strings. The syntax of regular
expressions for the Str library is given in figure 11.1.
.        any character except \n
*        zero or more occurrences of the preceding expression
+        one or more occurrences of the preceding expression
?        zero or one occurrence of the preceding expression
[..]     set of characters (example [abc]);
         intervals, denoted by - (example [0-9]);
         set complements, denoted by ^ (example [^A-Z])
^        start of line (not to be mistaken for the use of ^ as a set
         complement)
$        end of line
\|       alternative
\(..\)   grouping of a complex expression (we can later refer to such an
         expression by an integer index -- see below)
\i       an integer constant i, referring to the string matched by the
         i-th complex expression
\        escape character (used when matching a reserved character in
         regular expressions)
Figure 11.1: Regular expressions
Example
We want to write a function translating dates in
Anglo-Saxon format (month.day.year) into French dates (day/month/year)
within a data file. We suppose
that the file is organized into lines of data fields and that the
components of an Anglo-Saxon date are separated by dots. Let's define
a function which takes as argument a string (i.e. a line
from the file), isolates the date, decomposes and translates it, then
replaces the original with the translation.
# let french_date_of d =
    match d with
      [mm; dd; yy] -> dd ^ "/" ^ mm ^ "/" ^ yy
    | _ -> failwith "Bad date format" ;;
val french_date_of : string list -> string = <fun>
# let english_date_format = Str.regexp "[0-9]+\\.[0-9]+\\.[0-9]+" ;;
val english_date_format : Str.regexp = <abstr>
# let trans_date l =
    try
      let _ = Str.search_forward english_date_format l 0 in
      let d1 = Str.matched_string l in
      let d2 = french_date_of (Str.split (Str.regexp "\\.") d1) in
      Str.global_replace english_date_format d2 l
    with Not_found -> l ;;
val trans_date : string -> string = <fun>
# trans_date "..............06.13.99............" ;;
- : string = "..............13/06/99............"
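If no date is found, search_forward raises the exception Not_found and
the line is returned unchanged:

# trans_date "no date here" ;;
- : string = "no date here"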
The ocamllex Tool
The ocamllex tool is a lexical analyzer generator built for Objective CAML
after the model of the lex tool for the C language. It generates
a source Objective CAML file from a file describing the lexical elements to
be recognized in the form of regular expressions. The programmer can
augment each lexical element description with a processing action known
as a semantic action. The generated code manipulates an abstract
type lexbuf defined in module Lexing. The
programmer can use this module to control processing of lexical
buffers.
Usually the lexical description files are given the extension
.mll. Later, to obtain an Objective CAML source from a lex_file.mll,
you type the command
ocamllex lex_file.mll
A file lex_file.ml is generated containing the code for the
corresponding analyzer. This file can then be compiled with other
modules of an Objective CAML application. For each set of lexical analysis
rules there is a corresponding function taking as input a lexical
buffer (of type Lexing.lexbuf) and returning the value
defined by the semantic actions. Consequently, all actions in the same
rule must produce values of the same type.
The general format for an ocamllex file is

{ header }
let ident = regexp
...
rule ruleset1 = parse
    regexp { action }
  | ...
  | regexp { action }
and ruleset2 = parse
  ...
and ...
{ trailer-and-end }
Both the ``header'' and ``trailer-and-end'' sections are optional. They
contain Objective CAML code defining types, functions, etc. needed for
processing. The code in the last section can use the lexical analysis
functions that will be generated by the middle section. The
declaration list preceding the rule definitions allows the user to name
some regular expressions, so they can later be invoked by name
in the definition of the rules.
Example
Let's revisit our BASIC
example. We want to refine the type of lexical units
returned. We once again define function lexer
(as we did on page ??) with the same output type
(lexeme), but taking as input a buffer of type
Lexing.lexbuf.
{
let string_chars s = String.sub s 1 ((String.length s)-2) ;;
}

let op_ar = [ '-' '+' '*' '%' '/' ]
let op_bool = [ '!' '&' '|' ]
let rel = [ '=' '<' '>' ]

rule lexer = parse
    [' ']    { lexer lexbuf }
  | op_ar    { Lsymbol (Lexing.lexeme lexbuf) }
  | op_bool  { Lsymbol (Lexing.lexeme lexbuf) }
  | "<="     { Lsymbol (Lexing.lexeme lexbuf) }
  | ">="     { Lsymbol (Lexing.lexeme lexbuf) }
  | "<>"     { Lsymbol (Lexing.lexeme lexbuf) }
  | rel      { Lsymbol (Lexing.lexeme lexbuf) }
  | "REM"    { Lsymbol (Lexing.lexeme lexbuf) }
  | "LET"    { Lsymbol (Lexing.lexeme lexbuf) }
  | "PRINT"  { Lsymbol (Lexing.lexeme lexbuf) }
  | "INPUT"  { Lsymbol (Lexing.lexeme lexbuf) }
  | "IF"     { Lsymbol (Lexing.lexeme lexbuf) }
  | "THEN"   { Lsymbol (Lexing.lexeme lexbuf) }
  | '-'? ['0'-'9']+   { Lint (int_of_string (Lexing.lexeme lexbuf)) }
  | ['A'-'z']+        { Lident (Lexing.lexeme lexbuf) }
  | '"' [^ '"']* '"'  { Lstring (string_chars (Lexing.lexeme lexbuf)) }
The translation of this file by ocamllex produces a function
lexer of type Lexing.lexbuf -> lexeme. We will
see later how to use such a function in conjunction with syntactic
analysis (see page ??).
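In the meantime, here is a minimal sketch of driving the generated
analyzer by hand. It assumes the description above was saved as
basic_lexer.mll, processed with ocamllex, and compiled so that a module
Basic_lexer (a hypothetical name) exposes lexer, with the lexeme type in
scope; since no rule matches the end of input, the generated function
raises Failure there, which we use to stop:

(* Hypothetical driver for the generated lexer. *)
let () =
  let buf = Lexing.from_string "LET x = x + y * 3" in
  try
    while true do
      match Basic_lexer.lexer buf with
      | Lsymbol s -> Printf.printf "symbol %s\n" s
      | Lident s  -> Printf.printf "ident  %s\n" s
      | Lint n    -> Printf.printf "int    %d\n" n
      | Lstring s -> Printf.printf "string %s\n" s
    done
  with Failure _ -> ()   (* raised when no rule matches: end of input *)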