RTF::Tokenizer - Tokenize RTF
Tokenizes RTF
    use RTF::Tokenizer;

    # Create a tokenizer object
    my $tokenizer = RTF::Tokenizer->new();

    my $tokenizer = RTF::Tokenizer->new( string => '{\rtf1}' );
    my $tokenizer = RTF::Tokenizer->new( file => \*STDIN );
    my $tokenizer = RTF::Tokenizer->new( file => 'lala.rtf' );

    # Populate it from a file
    $tokenizer->read_file('filename.txt');

    # Or a file handle
    $tokenizer->read_file( \*STDIN );

    # Or a string
    $tokenizer->read_string( '{\*\some rtf}' );

    # Get the first token
    my ( $token_type, $argument, $parameter ) = $tokenizer->get_token();
This documentation assumes some basic knowledge of RTF. If you lack that, go read The_RTF_Cookbook:
http://search.cpan.org/search?dist=RTF-Writer
new() returns a Tokenizer object. It is normally called with no arguments, but you can save yourself a call to read_file() or read_string() by passing new() a hash (well, a list really) containing either a 'file' or a 'string' key, whose value is passed to the respective routine. The example in the synopsis makes this much clearer than this description does :-)
read_string() appends the string to the tokenizer object's buffer (earlier versions would overwrite the buffer; this version does not).
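A minimal sketch of the appending behaviour, assuming the module is installed: the RTF fragments and variable names are made up for illustration. Because the second call appends rather than overwrites, the two halves are tokenized as one document.

```perl
use strict;
use warnings;
use RTF::Tokenizer;

my $tokenizer = RTF::Tokenizer->new();

# Two calls, one buffer: the second string is appended to the first.
$tokenizer->read_string('{\rtf1 Hel');
$tokenizer->read_string('lo}');

# Collect every token up to (but not including) eof.
my @tokens;
while (1) {
    my @token = $tokenizer->get_token();
    last if $token[0] eq 'eof';
    push @tokens, \@token;
}

# Print the type and argument of each token.
print join( ' ', $_->[0], $_->[1] ), "\n" for @tokens;
```

Both strings are loaded before the first get_token() call, so the split in the middle of "Hello" does not produce two separate text tokens.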
read_file() appends a chunk of data from the file (given as a filename or a filehandle) to the buffer and remembers the filehandle, so if you ask for a token and the buffer is empty, it will try to read the next line from the file (earlier versions would overwrite the buffer; this version does not).
This chunk is 500 characters, plus whatever is left until the next occurrence of the input record separator (IRS; a newline character in this case). If, for whatever reason, you want to change that number to something else, $self->{_INITIAL_READ} can be modified.
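As a rough illustration of reading from a file, the sketch below writes a tiny document to a temporary file and passes its name to read_file(). The temporary file, its contents and the variable names are assumptions made for the example; File::Temp is a core module and is used only so the example is self-contained.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use RTF::Tokenizer;

# Write a tiny RTF document to a temporary file (illustrative content only).
my ( $fh, $filename ) = tempfile( UNLINK => 1 );
print {$fh} '{\rtf1 Hello}', "\n";
close $fh or die "Cannot close $filename: $!";

# Hand the filename to the tokenizer; it opens the file and fills its buffer.
my $tokenizer = RTF::Tokenizer->new();
$tokenizer->read_file($filename);

# Record the type of every token, including the final eof.
my @types;
while (1) {
    my ( $type, $argument, $parameter ) = $tokenizer->get_token();
    push @types, $type;
    last if $type eq 'eof';
}

print join( ' ', @types ), "\n";
```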
get_token() returns the next token as a three-item list: 'type', 'argument', 'parameter'. The type is one of: text, control, group, or eof.
text
'type' is set to 'text'. 'argument' is set to the text itself. 'parameter' is left blank. NOTE: \{, \}, and \\ are all returned as control words, rather than rendered as text for you, as are \_, \- and friends.
control
'type' is 'control'. 'argument' is the control word or control symbol. 'parameter' is the control word's parameter if it has one. This will be numeric, EXCEPT when 'argument' is a literal ', in which case it will be a two-character hex string.
group
'type' is 'group'. If it's the beginning of an RTF group, then 'argument' is 1; if it's the end, 'argument' is 0. 'parameter' is not set.
eof
End of file reached. 'type' is 'eof'. 'argument' is 1. 'parameter' is 0.
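The token types above can be handled with a simple loop. The following is a sketch rather than a recommended design: the sample RTF, the $text and $depth variables, and the decision to treat a \' escape as a single code point are all assumptions for illustration (a real reader would honour the document's code page).

```perl
use strict;
use warnings;
use RTF::Tokenizer;

# q(...) keeps the backslashes and the apostrophe in the sample literal.
my $tokenizer = RTF::Tokenizer->new( string => q({\rtf1 caf\'e9 au lait}) );

my $text  = '';    # plain text gathered so far
my $depth = 0;     # current group nesting level

while (1) {
    my ( $type, $argument, $parameter ) = $tokenizer->get_token();
    last if $type eq 'eof';

    if ( $type eq 'text' ) {
        $text .= $argument;
    }
    elsif ( $type eq 'group' ) {
        # 'argument' is 1 for a group start and 0 for a group end.
        $depth += $argument ? 1 : -1;
    }
    elsif ( $type eq 'control' && $argument eq q(') ) {
        # ASSUMPTION: the two-character hex parameter is used directly as a
        # code point; this ignores the document's declared code page.
        $text .= chr hex $parameter;
    }
    # Every other control word (here just \rtf1) is ignored in this sketch.
}

printf "collected %d characters, final depth %d\n", length $text, $depth;
```

Checking $type before looking at $parameter matters here, because 'parameter' is blank or unset for text and group tokens.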
Don't call initial_read() unless you actually have a good reason. When the Tokenizer reads from a file, it first attempts to work out what the correct input record separator should be, by reading some characters from the file handle. The number of characters read starts off as 512, which is twice the number of characters that version 1.7 of the RTF specification says you should go before including a line feed if you're writing RTF.
Called with no argument, this returns the current value of the number of characters we're going to read. Called with a numeric argument, it sets the number of characters we'll read.
You really don't need to use this method.
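For completeness, a short sketch of the getter and setter forms described above. The value 1024 is an arbitrary number picked for the example, not a recommendation.

```perl
use strict;
use warnings;
use RTF::Tokenizer;

my $tokenizer = RTF::Tokenizer->new();

# No argument: report the current size of the initial read.
my $before = $tokenizer->initial_read();

# Numeric argument: change the size, ideally before calling read_file().
$tokenizer->initial_read(1024);
my $after = $tokenizer->initial_read();

print "initial read: $before -> $after\n";
```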
To avoid intrusively deep parsing, if an alternative ASCII representation is available for a Unicode entity, and that ASCII representation contains {, or \, by themselves, things will go funky. But I'm not convinced either of those is allowed by the spec.
Pete Sergeant -- rtfr@clueball.com
Copyright 2003 Pete Sergeant.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.