NAME

IO::Tokenized - Extension of Perl for tokenized input

SYNOPSIS

  #Functional interface

  use IO::Tokenized qw/:parse/;
  
  open FOO,"<","some/input/file" or die "Can't open 'some/input/file': $!";
  setparser(\*FOO,[num => qr/\d+/],
                  [ident => qr/[a-z_][a-z0-9_]*/],
                  [op => qr![+*/-]!,\&opname]);
  
  while (my ($tok,$val) = gettoken(\*FOO)) {
    # ... do something smart...
  }

  close(FOO);

ABSTRACT

Defines an extension to Perl filehandles that allows splitting the input stream according to regular expressions.

DESCRIPTION

IO::Tokenized defines a set of functions that allow tokenized input from Perl filehandles. In this alpha version, tokens are specified by passing a list of token specifications to the initialize_parsing function. Each token specification is (a reference to) an array containing the token name (a string), a regular expression defining the token and, optionally, an action function that calculates the value to be returned when a token matching the regexp is found.

Once the tokens have been specified, each invocation of the gettoken function returns a pair consisting of a token name and a token value, or undef at end of file.

IO::Tokenized can also be used as a base class to add tokenized input methods to the object modules in the IO::* namespace. As an example, see the IO::Tokenized::File module, which is included in this distribution.

RATIONALE

Lexical analysis, which is a fundamental step in all parsing, mainly consists of decomposing an input stream into small chunks called tokens. The tokens are in turn defined by regular expressions.

As Perl is good at handling regular expressions, one would expect that writing a lexical analyser in Perl should be easy. In truth it is not, and tools like lex or flex have even been ported to Perl. There are also a whole lot of ad-hoc lexers for different parsing modules and programs.

Now, approaches to lexical analysis like those underlying Parse::Lex and Parse::Flex are general but fairly complex to use, while ad-hoc solutions are obviously, well... ad-hoc.

What I had always wanted was a way to tell a filehandle: "Well, this is what the chunks I'm interested in look like. Please find them in your input stream". It seems a simple enough thing, but I could not find a module doing it.

Obviously, impatience pushed me to implement such a module, but until recently I had no real need for it, so laziness spoke against it. Recently I started to write a compiler for a scripting language and began using the Parse::RecDescent module. There, in the documentation, Damian Conway says:

  • There's no support for parsing directly from an input stream. If and when the Perl Gods give us regular expressions on streams, this should be trivial (ahem!) to implement.

Why, regular expressions on streams were exactly what I had in mind, so hubris kicked in and I wrote this module and its companion IO::Tokenized::File.

FUNCTIONS

The following functions are defined by the IO::Tokenized module.

  • initialize_parser($handle[,@tokens])

    Initializes the filehandle $handle for tokenized input. The optional @tokens parameter, if present, is passed to setparser.

  • setparser($handle,@tokens)

    Defines or redefines the tokens used by gettoken. If @tokens contains a token whose name is the empty string, then the regexp defining it is passed to token_separator.

  • gettoken($handle)

    Returns the next token in the input stream identified by $handle. Each token is returned as a pair (token_name => $value), where $value is either the initial portion of the input stream matching the token's regular expression (if no action was defined for token token_name) or the result of evaluating the action function, if such a function was defined for token token_name.

    On end of file, gettoken($handle) returns undef. If the end of file is hit, or the internal buffer overflows, without a token being found, the function croaks.

  • gettokens($handle)

    Returns the list of tokens contained in the input stream $handle up to the end of file.

  • buffer_space($handle [,Number])

    Retrieves or sets the size of the internal buffer used by IO::Tokenized. By default the buffer size is 30720 characters (30 KB). When used for setting, by providing a new value, it returns the old value.

  • token_separator($handle[,regex])

    Retrieves or sets the regular expression used as filler between tokens. The default value is /\s/.

  • flushbuffer($handle)

    Flushes the internal buffer, returning the characters contained in it.

  • skip($handle)

    Repeatedly removes from the start of the file any pattern matching the token separator set by token_separator, until either the end of file is reached or the start of the file no longer matches the regex.

  • resynch($handle)

    Tries to remove as few characters as necessary from the beginning of the file to get a substring matching a token at the front of the input stream.

  • getline($handle) and getlines($handle)

    These functions work like the functions of the same name in IO::Handle; they are redefined to take into account the presence of an internal buffer.
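
The behaviour described for skip and resynch can be illustrated on a plain string buffer. The following is a minimal sketch, not the module's implementation: the names skip_separators and resynch_buffer are hypothetical, the buffer is an ordinary scalar rather than a filehandle, and nothing is read from a file.

```perl
use strict;
use warnings;

# Hypothetical helper: strip leading separator matches from the buffer,
# in place. Zero-length matches are ignored to avoid looping forever.
sub skip_separators {
    my ($buffer_ref, $sep) = @_;
    while ($$buffer_ref =~ /^$sep/ && $+[0] > 0) {
        substr($$buffer_ref, 0, $+[0], '');
    }
    return;
}

# Hypothetical helper: drop the shortest prefix after which some token
# matches at the front of the buffer. Returns the dropped text, or undef
# if no token matches anywhere in the buffer.
sub resynch_buffer {
    my ($buffer_ref, @defs) = @_;
    for my $offset (0 .. length($$buffer_ref) - 1) {
        my $rest = substr($$buffer_ref, $offset);
        for my $def (@defs) {
            my $re = $def->[1];
            next unless $rest =~ /^$re/;
            return substr($$buffer_ref, 0, $offset, '');
        }
    }
    return;
}

my $buffer = '  ?? foo 12';
skip_separators(\$buffer, qr/\s/);
my $dropped = resynch_buffer(\$buffer,
    [num => qr/\d+/], [ident => qr/[a-z_][a-z0-9_]*/]);
print "dropped '$dropped', buffer is now '$buffer'\n";
```

In this sketch the two leading spaces are skipped first, then the three characters in front of "foo" are dropped because no token matches before that point.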

EXPORTS

IO::Tokenized does not export any function by default, but all the functions mentioned above are exportable. Besides the classical :all, there are two more export tags: :parse, which exports initialize_parsing, gettoken and gettokens, and :buffer, which exports bufferspace, flushbuffer and resynch.

OBJECT ORIENTED

All the functions described above can be called in an object-oriented way. For constructing IO::Tokenized objects, a new method is provided, which is basically a wrapper around initialize_parsing.

SEE ALSO

IO::Tokenized::File.

TOKENS SPECIFICATION

Tokens are specified, either to the new constructor or to the setparser mutator, by a list of token definitions. Each token definition is (a reference to) an array with two or three elements. The first element is the token name, the second is the regexp defining the token itself, and the third, if present, is the action function.

ACTION FUNCTIONS

As stated above, the user can associate a function with each token, called the action of the token. The action serves two purposes: it calculates the value of the token and completes the verification of the match. The action function specified in [token => $re,\&foo] will be called with the result of @item = $buffer =~ /($re)/s. The default action is simply to pop @_, thus giving the text that matched $re.
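
The two roles of an action can be illustrated without the module itself. The sketch below assumes that the action receives the captured groups as its argument list, with the whole match first, and it anchors the match at the start of the buffer. The names to_number, small_only and apply_action are hypothetical and are not part of IO::Tokenized.

```perl
use strict;
use warnings;

# Hypothetical action: compute the token value from the matched text.
sub to_number { return 0 + $_[0] }

# Hypothetical action: veto the match by returning undef, here for
# numbers with more than three digits.
sub small_only { return length($_[0]) <= 3 ? 0 + $_[0] : undef }

# Mimic the call described above: match the regexp against the start of
# the buffer and hand the captured groups to the action. Returns nothing
# when the regexp does not match at all.
sub apply_action {
    my ($buffer, $re, $action) = @_;
    my @item = $buffer =~ /^($re)/s or return;
    return $action->(@item);
}

my $value = apply_action('42 apples', qr/\d+/, \&to_number);
print "value: $value\n";

my $vetoed = apply_action('12345', qr/\d+/, \&small_only);
print defined $vetoed ? "accepted\n" : "rejected\n";
```

Here the first call yields the number 42, while the second is rejected because small_only returns undef even though the regexp itself matched.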

MATCHING STRATEGY

The gettoken function uses the following method to find the token to be returned.

1. Remove from the beginning of the internal buffer any strings matching the skip regular expression, as set by the token_separator function. In doing so, more lines may be read from the file into the buffer.
2. Consider the token definitions in the order in which they were passed to setparser. If token token is defined by regexp $re, check that the buffer matches /^($re)/. If it does not, move on to the following token if there is one, or to step 4 below if there is none.
3. If there is a user-defined action for the token, apply it. If it returns undef, move on to the following token if there is one, or to step 4 below if there is none. If the return value is defined, return a pair formed by the token name and the value itself. If there is no user-defined action, return a pair consisting of the token name and the matched string. Before returning, the buffer is updated by removing the matched string.
4. If no match could be found, try reading one more line into the buffer and go back to step 2. If, on entering step 4, the internal buffer holds more characters than the limit set by buffer_space, gettoken croaks.
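
The four steps above can be sketched as a small, self-contained routine. This is an illustration under stated assumptions, not the module's code: make_lexer, fill and next_token are hypothetical names, the state is kept in a plain hash, errors are raised with die rather than croak, and end of file is signalled by an empty list.

```perl
use strict;
use warnings;

# Hypothetical lexer state; the real module keeps its own per-handle data.
sub make_lexer {
    my ($fh, @tokens) = @_;
    return { fh => $fh, buffer => '', tokens => \@tokens,
             separator => qr/\s/, buffer_space => 30720 };
}

# Append one more line to the buffer; returns false at end of file.
sub fill {
    my ($lx) = @_;
    my $line = readline($lx->{fh});
    return 0 unless defined $line;
    $lx->{buffer} .= $line;
    return 1;
}

sub next_token {
    my ($lx) = @_;
    my $sep = $lx->{separator};

    # Step 1: strip leading separators, reading lines when the buffer is empty.
    while (1) {
        next if $lx->{buffer} =~ s/^$sep//;
        last if length($lx->{buffer});
        fill($lx) or return;    # clean end of file: no more tokens
    }

    while (1) {
        # Steps 2 and 3: the first definition that matches and is not
        # vetoed by its action wins.
        for my $def (@{ $lx->{tokens} }) {
            my ($name, $re, $action) = @$def;
            my @item = $lx->{buffer} =~ /^($re)/s or next;
            my $value = $action ? $action->(@item) : $item[0];
            next unless defined $value;
            substr($lx->{buffer}, 0, length($item[0]), '');
            return ($name, $value);
        }
        # Step 4: no match; fail on overflow or end of file, else read on.
        die "buffer overflow\n"
            if length($lx->{buffer}) > $lx->{buffer_space};
        fill($lx) or die "no token found before end of file\n";
    }
}

my $text = "x1 + 40\n* 2\n";
open my $fh, '<', \$text or die "Can't open in-memory file: $!";
my $lx = make_lexer($fh,
    [num => qr/\d+/], [ident => qr/[a-z_][a-z0-9_]*/], [op => qr![+*/-]!]);
while (my ($name, $value) = next_token($lx)) {
    print "$name: $value\n";
}
```

The example reads from an in-memory filehandle so that it runs without any input file. The token list is the one from the SYNOPSIS, without the action on op.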

CAVEATS

  • The selected token is the first matching token, not the longest one. I'm wondering what would be best: 1) let this alone, 2) change it to 'the longest match', 3) add an option, 4) write another module, 5) some other thing.

  • No token can span more than one line unless it has a well-defined end marker. This does not appear to be a real problem. The only tokens spanning more than one line I have ever seen are multiline strings and block comments, both of which have end markers: a closing quote and an end of comment respectively.
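
The difference between the first-match rule and a longest-match rule, mentioned in the first caveat, shows up as soon as one token is a prefix of another. The sketch below compares the two rules on a plain string. The names first_match and longest_match and the token names integer and float are hypothetical, and the longest-match variant is only one possible alternative, not something the module provides.

```perl
use strict;
use warnings;

# Return [name, text] for the first definition matching at the start of
# $input, or nothing if none matches.
sub first_match {
    my ($input, @defs) = @_;
    for my $def (@defs) {
        my ($name, $re) = @$def;
        return [$name, $1] if $input =~ /^($re)/;
    }
    return;
}

# Return [name, text] for the definition with the longest match; ties go
# to the earlier definition.
sub longest_match {
    my ($input, @defs) = @_;
    my $best;
    for my $def (@defs) {
        my ($name, $re) = @$def;
        next unless $input =~ /^($re)/;
        $best = [$name, $1] if !$best || length($1) > length($best->[1]);
    }
    return $best;
}

my @defs = ([integer => qr/\d+/], [float => qr/\d+\.\d+/]);
printf "first:   %s %s\n", @{ first_match('3.14', @defs) };
printf "longest: %s %s\n", @{ longest_match('3.14', @defs) };
```

With this ordering the first-match rule reports integer 3, while the longest-match rule reports float 3.14. Under the first-match rule, listing the more specific definition (float) before the more general one gives the same result as the longest-match rule in this example.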

BUGS

Please remember that this is an alpha version of the module, and will stay so until the version number gets to 1.00. This means that there are surely plenty of bugs which have not been discovered yet, all the more so because testing is far from complete.

Bug reports are welcome and should be addressed directly to the author at the address below.

TODO

There is still a lot of work to do on this module, both at the programming level and at the conceptual level. Feature requests as well as insights are welcome.

AUTHOR

Leo "TheHobbit" Cacciari, <hobbit@cpan.org>

COPYRIGHT AND LICENSE

Copyright 2003 by Leo Cacciari

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
