2.2 Using PyLly

PyLly reads a specification file for a lexer and generates tables that a lexer can use to tokenize a stream of data. The first step in using PyLly is to construct a specification file, which describes how to pull tokens out of an input stream. An example of a simple specification file is given in test1.pyl:

INITIAL :
    "abc*" :    return "pattern1"
    "d|efg" :   return "pattern2"
    "^abx" :    return "pattern3"
    "\n" :      return              # ignore newlines

This specification states that if an a, then a b, and then an arbitrary number of c characters are encountered, the lexer should return the token value "pattern1". Similarly, if a d or an e followed by the characters fg is seen, the token value "pattern2" should be returned. If the characters abx are encountered at the start of a line, "pattern3" is returned. Finally, if a newline character is seen, no token is returned. Any other character causes an error token to be returned.
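The matching behavior described above can be sketched in plain Python with the standard re module. This is only an illustration of the rules, not PyLly's generated tables: the anchored "^abx" rule is omitted for brevity, and "d|efg" is written here as "(d|e)fg" to match the reading above.

```python
import re

# Illustrative rule table mirroring the spec above (not PyLly itself).
# Each entry pairs a token value with a compiled pattern; a name of
# None means "match but return no token", like the newline rule.
RULES = [
    ("pattern1", re.compile(r"abc*")),
    ("pattern2", re.compile(r"(d|e)fg")),
    (None,       re.compile(r"\n")),
]

def tokenize(text):
    pos = 0
    while pos < len(text):
        for name, rx in RULES:
            m = rx.match(text, pos)
            if m:
                if name is not None:
                    yield name, m.group(0)
                pos = m.end()
                break
        else:
            # No rule matched here: emit an error token for one character.
            yield "#ERR#", text[pos]
            pos += 1
```

For the input abccccdfgh this yields ("pattern1", "abcccc"), ("pattern2", "dfg"), and an error token for the stray h.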

Once a specification file has been made, it is a simple matter to use it. The examples/test1.py program provides an example:

import pyggy

l, tab = pyggy.getlexer("test1.pyl")  # parse the spec, build and load tables
l.setinput("-")                       # "-" means read from stdin
while 1 :
    x = l.token()                     # next token value, or None at EOF
    if x is None :
        break
    print x, l.value                  # token value and matched text

The getlexer function parses the lexer specification file, generates the lexer tables, loads them, and constructs a lexer. It returns the constructed lexer and a handle on the generated lexer table module. The example then drives the lexer: the setinput method specifies which file to read input from (here the special name - denotes stdin), and each call to the token method returns the next token value from the input stream, or None when the input has been exhausted. An auxiliary value stored in the value member contains the value of the token, which is usually the string of characters that make up the token.
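The token()/value protocol just described can be mimicked with a small self-contained class. The sketch below is a toy stand-in for illustration only, assuming the unanchored rules from test1.pyl above; it is not the PyGgy API, and a real PyLly lexer is table-driven rather than built on re.

```python
import io
import re

class SketchLexer:
    """Toy lexer mirroring the token()/value protocol described above."""

    RULES = [
        ("pattern1", re.compile(r"abc*")),
        ("pattern2", re.compile(r"(d|e)fg")),
        (None,       re.compile(r"\n")),   # matched but ignored
    ]

    def __init__(self):
        self.text = ""
        self.pos = 0
        self.value = None

    def setinput(self, fileobj):
        # For simplicity the whole stream is read up front;
        # PyLly itself lexes incrementally.
        self.text = fileobj.read()
        self.pos = 0

    def token(self):
        while self.pos < len(self.text):
            for name, rx in self.RULES:
                m = rx.match(self.text, self.pos)
                if m:
                    self.value = m.group(0)
                    self.pos = m.end()
                    if name is None:
                        break              # ignored rule: keep scanning
                    return name
            else:
                self.value = self.text[self.pos]
                self.pos += 1
                return "#ERR#"
        return None                        # input exhausted

lex = SketchLexer()
lex.setinput(io.StringIO("ab\ndfg"))
tokens = []
while True:
    x = lex.token()
    if x is None:
        break
    tokens.append((x, lex.value))
# tokens == [("pattern1", "ab"), ("pattern2", "dfg")]
```

The driver loop at the bottom is the same shape as the examples/test1.py loop: call token() until it returns None, reading the matched text from the value member after each call.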

The example can be run as follows:

$ echo "abccccdfghabx" | python test1.py
pattern1 abcccc
pattern2 dfg
#ERR# h
pattern1 ab
#ERR# x

Notice that the lexer returns error tokens for input that matches none of the patterns.
