IO::Tokenized - Extension of Perl for tokenized input
  # Functional interface
  use IO::Tokenized qw/:parse/;

  open FOO, "<", "some/input/file"
      or die "Can't open 'some/input/file': $!";

  setparser(\*FOO,
      [num   => qr/\d+/],
      [ident => qr/[a-z_][a-z0-9_]*/],
      [op    => qr![+*/-]!, \&opname],
  );

  while (my ($tok, $val) = gettoken(\*FOO)) {
      # ... do something smart ...
  }

  close(FOO);
Defines an extension to Perl filehandles allowing splitting of the input stream according to regular expressions.
IO::Tokenized defines a bunch of functions allowing tokenized input from Perl filehandles. In this alpha version tokens are specified by passing to the initialize_parsing function a list of token specifications. Each token specification is (a reference to) an array containing: the token name (a string), a regular expression defining the token, and, optionally, an action function which computes the value to be returned when text matching the regexp is found.
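As an illustrative sketch (the file name, token names and regexps below are made up for the example), a token list for a small expression language could be passed to initialize_parsing like this:

```perl
use IO::Tokenized qw/:parse/;

open FOO, '<', 'expr.txt' or die "Can't open 'expr.txt': $!";

# Each specification: [ name => regexp, optional action ]
initialize_parsing(\*FOO,
    [ num   => qr/\d+(?:\.\d+)?/, sub { 0 + pop } ],  # action: numeric value
    [ ident => qr/[a-z_]\w*/i ],                      # default action: matched text
    [ op    => qr![-+*/]! ],
);
```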
Once the tokens have been specified, each invocation of the gettoken function returns a pair consisting of a token name and a token value, or undef at end of file.
IO::Tokenized can also be used as a base class to add tokenized input methods to the object modules in the IO::* namespace. As an example, see the IO::Tokenized::File module, which is included in this distribution.
Lexical analysis, which is a fundamental step in all parsing, mainly consists of decomposing an input stream into small chunks called tokens. The tokens are in turn defined by regular expressions.
As Perl is good at handling regular expressions, one would expect writing a lexical analyser in Perl to be easy. In truth it is not, and tools like lex or flex have even been ported to Perl. There are also a whole lot of ad hoc lexers for different parsing modules and programs.
Now, approaches to lexical analysis like those underlying Parse::Lex and Parse::Flex are general but fairly complex to use, while ad hoc solutions are obviously, well... ad hoc.
What I'd always sought was a way to tell a filehandle: "well, this is what the chunks I'm interested in look like. Please find them in your input stream". It seems a simple enough thing, but I could not find a module doing it.
Obviously, impatience pushed me to implement such a module, but until a little while ago I had no real need for it, so laziness spoke against it. Recently I started to write a compiler for a scripting language, and I started using the Parse::RecDescent module. There, in the documentation, Damian Conway says:
There's no support for parsing directly from an input stream. If and when the Perl Gods give us regular expressions on streams, this should be trivial (ahem!) to implement.
Why, regular expressions on streams were exactly what I had in mind, so hubris kicked in and I wrote this module and its companion IO::Tokenized::File.
The following functions are defined by the IO::Tokenized module.
initialize_parsing($handle[,@tokens])
Initializes the filehandle $handle for tokenized input. The optional @tokens parameter, if present, is passed to setparser.
setparser($handle,@tokens)
Defines or redefines the tokens used by gettoken. If @tokens contains a token whose name is the empty string, then the regexp defining it is passed to token_separator.
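For example (a sketch, assuming the empty-named entry works as described above), the token set can be redefined on the fly, with an empty-named token supplying the separator:

```perl
use IO::Tokenized qw/:all/;

# The empty-named entry's regexp is handed to token_separator.
setparser(\*FOO,
    [ ''    => qr/(?:\s+|#[^\n]*)+/ ],   # skip whitespace and # comments
    [ word  => qr/\w+/ ],
    [ punct => qr/[[:punct:]]/ ],
);
```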
gettoken($handle)
Returns the next token in the input stream identified by $handle. Each token is returned as a pair (token_name => $value), where $value is either the initial portion of the input stream matching the token regular expression (if no action was defined for token token_name) or the result of evaluating the action function, if one was defined for token token_name.
On end of file, gettoken($handle) returns undef. If the end of file is hit, or the internal buffer overflows, without a token being found, the function croaks.
gettokens($handle)
Returns the list of all tokens contained in the input stream $handle up to the end of file.
buffer_space($handle [,Number])
Retrieves or sets the size of the internal buffer used by IO::Tokenized. By default the buffer size is 30720 characters (30 KB). When used for setting, by providing a new value, it returns the old value.
token_separator($handle[,regex])
Retrieves or sets the regular expression used as filler between tokens. The default value is /\s/.
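For instance (a sketch; that the old regexp is returned when setting a new one is an assumption, by analogy with buffer_space), shell-style comments can be treated as inter-token filler alongside whitespace:

```perl
# Skip both whitespace and #-comments between tokens.
my $old_sep = token_separator(\*FOO, qr/\s+|#[^\n]*/);
```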
flushbuffer($handle)
Flushes the internal buffer, returning the characters contained in it.
skip($handle)
Repeatedly removes from the start of the input stream text matching the token separator set by token_separator, until either the end of file is reached or the start of the stream no longer matches the regex.
resynch($handle)
Removes as few characters as possible from the beginning of the input stream, until a substring matching a token sits at the front of the stream.
getline($handle) and getlines($handle)
These functions work like the functions of the same name in IO::Handle; they are redefined to take into account the presence of the internal buffer.
IO::Tokenized does not export any function by default, but all the above-mentioned functions are exportable. Besides the classical :all, there are two more export tags: :parse, which exports initialize_parsing, gettoken and gettokens, and :buffer, which exports buffer_space, flushbuffer and resynch.
All the functions described above can be called in an object-oriented way. For constructing IO::Tokenized objects a new method is provided, which is basically a wrapper around initialize_parsing.
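A sketch of the object-oriented style, using the companion IO::Tokenized::File class (the exact constructor arguments shown here are an assumption; see that module's documentation):

```perl
use IO::Tokenized::File;

# Assumed constructor: filename followed by token specifications.
my $fh = IO::Tokenized::File->new('some/input/file',
    [ num   => qr/\d+/ ],
    [ ident => qr/[a-z_][a-z0-9_]*/ ],
) or die "Can't open 'some/input/file': $!";

while (my ($tok, $val) = $fh->gettoken) {
    print "$tok: $val\n";
}
$fh->close;
```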
Tokens are specified, either to the new constructor or to the setparser mutator, by a list of token definitions. Each token definition is (a reference to) an array with two or three elements. The first element is the token name, the second is the regexp defining the token, while the third, if present, is the action function.
As stated above, the user can associate a function with each token, called the action of the token. The action serves two purposes: it computes the value of the token and completes the verification of the match. The action function specified in [token => $re, \&foo] will be called with the result of @item = $buffer =~ /($re)/s. The default action is simply to pop @_, thus giving the text that matched $re.
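As a sketch, an action can both normalize and validate the matched text; here a num token yields a numeric value, and a str token yields the unescaped string contents (its inner capture group, which pop @_ retrieves since it is the last capture of /($re)/s):

```perl
my @tokens = (
    # pop @_ retrieves the (last) capture of /($re)/s.
    [ num => qr/\d+/, sub { 0 + pop } ],
    # The inner capture group is the string body without the quotes.
    [ str => qr/"((?:[^"\\]|\\.)*)"/,
      sub { my $s = pop; $s =~ s/\\(.)/$1/g; $s } ],
);
```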
The gettoken function finds the token to be returned by trying to match each token's regexp, anchored at the start of the internal buffer as /^($re)/; if no token matches, more input is read into the buffer, up to the limit set by buffer_space.
The selected token is the first matching token, not the longest one. I'm wondering what would be best: 1) leave this alone, 2) change it to 'the longest match', 3) add an option, 4) write another module, 5) something else.
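Until that is settled, ordering the token list carefully works around the first-match rule; a sketch:

```perl
# The first matching token wins, so list keywords before identifiers,
# and anchor them with \b so 'iffy' still lexes as an identifier.
my @tokens = (
    [ kw    => qr/(?:if|else|while)\b/ ],
    [ ident => qr/[a-z_]\w*/i ],
);
```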
No token can span more than one line unless it has a well defined end marker. This does not appear to be a real problem. The only tokens spanning more than one line I have ever seen are multiline strings and block comments, both of which have end markers: a closing quote and an end-of-comment marker respectively.
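Such tokens can be written with a non-greedy body running up to the end marker, for example:

```perl
my @tokens = (
    # /s lets . cross newlines; the end marker bounds the token.
    [ comment => qr{/\*.*?\*/}s ],            # C-style block comment
    [ string  => qr/"(?:[^"\\]|\\.)*"/s ],    # double-quoted string
);
```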
Please remember that this is an alpha version of the module, and will stay so until the version number gets to 1.00. This means that there surely are plenty of bugs which haven't been discovered yet, all the more so because testing is far from complete.
Bugs reports are welcome and are to be addressed directly to the author at the address below.
There is still lot of work to do on this module, both at the programming level and at the conceptual level. Feature requests as well as insights are welcome.
Leo "TheHobbit" Cacciari, <hobbit@cpan.org>
Copyright 2003 by Leo Cacciari
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.