INITIAL :
    "abc*"      : return "pattern1"
    "d|efg"     : return "pattern2"
    "^abx"      : return "pattern3"
    "\n"        : return              # ignore newlines
This specification states that if an a, then a b, and then an arbitrary number of c characters are encountered, the lexer should return the matched string with a token value of "pattern1". Similarly, if a d or an e followed by the characters fg is seen, then a token value of "pattern2" should be returned. If the characters abx are encountered at the start of a line, then "pattern3" is returned. Finally, if the newline character is seen, no token is returned. Any other characters cause an error token to be returned.
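To make these rules concrete, the following sketch runs two sample lines through this specification using the lexer interface described below. It is not part of the PyGgy distribution: it assumes the specification above is stored in test1.pyl (as in the examples/test1.py program shown next), the file name sample.txt is only a placeholder, and the expected tokens in the comments follow from the rules above rather than from captured output.

import pyggy

# Write two sample lines to a scratch file for the lexer to read.
f = open("sample.txt", "w")
f.write("abcc\nefg\n")
f.close()

# Build a lexer from the specification above and point it at the sample file.
l, tab = pyggy.getlexer("test1.pyl")
l.setinput("sample.txt")

while 1 :
    x = l.token()
    if x is None :
        break
    print x, l.value

# Expected tokens, following the rules described above:
#   pattern1 abcc    (an a, a b and two c characters)
#   pattern2 efg     (an e followed by the characters fg)
# The two newlines produce no tokens at all.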
Once a specification file has been made, it is a simple matter to use it. The examples/test1.py program provides an example:
import pyggy

# Build a lexer (and a handle on its table module) from the specification file.
l, tab = pyggy.getlexer("test1.pyl")

# The special name "-" means the input to be tokenized comes from stdin.
l.setinput("-")

while 1 :
    x = l.token()
    if x is None :      # input exhausted
        break
    print x, l.value
The getlexer function takes care of parsing the lexer specification file, generating tables for a lexer, loading the tables and constructing a lexer. It returns the constructed lexer and a handle on the generated lexer table module. The example makes use of the lexer by first specifying an input file for the lexer to read from and then calling the token method to retrieve each successive token. The setinput method is used to specify which file to read input from; in this case the special name - is used, which denotes that input should come from stdin. The token method returns the next token value from the input stream each time it is called, and returns None once the input source has been exhausted. An auxiliary value stored in the value member contains the value of the token (usually the string of characters that make up the token).
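As a small variation (a sketch, not code from the PyGgy distribution), the same loop can gather every token together with its value instead of printing them as they are read. The only extra assumption is that l.value should be recorded before the next call to token, since it describes the most recently returned token.

import pyggy

l, tab = pyggy.getlexer("test1.pyl")
l.setinput("-")

toks = []
while 1 :
    x = l.token()
    if x is None :
        break
    # Record l.value now; it refers to the token just returned and is
    # presumably replaced on the next call to token().
    toks.append((x, l.value))

print toks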
The test1.py example can be run as follows:
$ echo "abccccdfghabx" | python test1.py
pattern1 abcccc
pattern2 dfg
#ERR# h
pattern1 ab
#ERR# x
Notice that the lexer returns error tokens for the unrecognized characters h and x. Note also that the trailing abx does not produce "pattern3": it does not occur at the start of a line, so ab is matched by "abc*" (with zero c characters) and the leftover x becomes an error.
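If the error tokens need to be handled rather than just printed, the loop can watch for them explicitly. The sketch below assumes, based only on the sample output above, that an error is reported as the token string #ERR#; that exact value is an assumption, not something stated in this section.

import pyggy

l, tab = pyggy.getlexer("test1.pyl")
l.setinput("-")

errors = 0
while 1 :
    x = l.token()
    if x is None :
        break
    if x == "#ERR#" :   # assumed error-token value, taken from the sample output above
        errors = errors + 1
        print "skipping unrecognized input:", l.value
        continue
    print x, l.value

print "finished with", errors, "error(s)"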
See the PyGgy Home Page.