
NAME

RTF::Parser - An event-driven RTF Parser

DESCRIPTION

An event-driven RTF Parser

PUBLIC SERVICE ANNOUNCEMENT

This is the second developer release I've made of RTF::Parser. I took over RTF::Parser with the aim of documenting, refactoring, and unit-testing it - this is a work still in progress.

There are four components of the RTF::Parser package that need reworking.

RTF/Parser.pm

This file. This file now provides a lightweight wrapper around RTF::Tokenizer. It is almost fully documented and completely refactored. The only thing left is the inclusion of tests.

RTF/Control.pm

This is the next file in my sights. A lot of the source is documented, and some POD documentation is provided, as are some tests. However, tests, refactoring, and documentation are still a long, long way from being finished.

RTF/HTML/Converter.pm RTF/TEXT/Converter.pm

Work has yet to begin on these two modules.

GENTLE INTRODUCTION

RTF::Parser has gone for over 5 years without any documentation, and its internal workings have confused the hell out of a lot of people, myself included.

RTF::Parser is intended to be sub-classed, and, in fact, in the last release, could only be sub-classed by RTF::Control, included in the RTF::Parser distribution. RTF::Control would then be subclassed by a module such as RTF::HTML::Converter, which would be invoked by a script such as rtf2html...

As such, RTF::Parser and RTF::Control had a fairly close relationship - RTF::Parser was actually calling routines in RTF::Control. This release emulates that behaviour if RTF::Control is loaded, which it checks by looking for 'RTF/Control.pm' in %INC. Otherwise, you can use the interface that actually existed anyway to export your event table - but we're getting ahead of ourselves here. If for some insane reason you need to pretend you're using RTF::Control when you're not, or you need to pretend you're not using it when you are, you can use the rtf_control_emulation method described below to do this.

RTF::Parser isn't a lot of use by itself - in fact, it's a lot like the RTF::Tokenizer module it wraps, with an extra bit of syntactic sugar - the real magic goes on in RTF::Control, and RTF::Control expects RTF::Parser to look a certain way. So if you're planning on actually using RTF::Parser for anything useful, read this document to give you an overview of what RTF::Parser does, and then really dive into RTF::Control's docs (which don't yet exist :-).

Subclassing RTF::Parser

When you subclass RTF::Parser, you'll want to do two things. Firstly, you'll want to override the methods described below as the API. These define what we do when we have tokens that aren't control words (except 'symbols' - see below).

Then you'll want to create a hash that maps control words to code references that you want executed. They'll get passed a copy of the RTF::Parser object, the name of the control word (say, 'b'), any arguments passed with the control word, and then 'start'. RTF::Control, when it gets to the end of a group, appears to go through all the controls it has seen to issue the same thing, except with 'end' instead of 'start'. That'll be covered more in the RTF::Control docs though, because it isn't particularly relevant here.

An example...

The following code removes bold tags from RTF documents, and then spits back out RTF.

  {
  
    # Create our subclass
      
      package UnboldRTF;

    # We'll be doing lots of printing without newlines, so don't buffer output

      $|++;

    # Subclassing magic...
    
      use RTF::Parser;
      @UnboldRTF::ISA = ( 'RTF::Parser' );
                        
    # Redefine the API nicely
        
      sub parse_start { print STDERR "Starting...\n"; }
      sub group_start { print '{' }
      sub group_end   { print '}' }
      sub text        { print "\n" . $_[1] }
      sub char        { print "\\\'$_[1]" }
      sub symbol      { print "\\$_[1]" }
      sub parse_end   { print STDERR "All done...\n"; }

  }

  my %do_on_control = (

        # What to do when we see any control we don't have
        #   a specific action for... In this case, we print it.

    '__DEFAULT__' => sub {

      my ( $self, $type, $arg ) = @_;
      $arg = "\n" unless defined $arg;
      print "\\$type$arg";

     },
     
   # When we come across a bold tag, we just ignore it.
     
     'b' => sub {},

  );

  # Grab STDIN...

    my $data = join '', (<>);

  # Create an instance of the class we created above

    my $parser = UnboldRTF->new();

  # Prime the object with our control handlers...
 
    $parser->control_definition( \%do_on_control );
  
  # Don't skip undefined destinations...
  
    $parser->dont_skip_destinations(1);

  # Start the parsing!

    $parser->parse_string( $data );

METHODS

new

Creates a new RTF::Parser object. Doesn't accept any arguments.

parse_stream( \*FH )

This function used to accept a second parameter - a function specifying how the filehandle should be read. This is deprecated, because I could find no examples of people using it, nor could I see why people might want to use it.

Pass this function a reference to a filehandle (or, now, a filename! yay) to begin reading and processing.
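As a rough sketch of both calling styles, the following writes a tiny RTF document to a temporary file and parses it twice, once by filename and once through a filehandle reference. The subclass name TextLogger and the @seen array are invented for this example, and the empty '__DEFAULT__' handler is only there so that every control word is ignored.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use RTF::Parser;

# Collects every run of plain text the parser reports
my @seen;

{
    # Hypothetical subclass, named for this example only
    package TextLogger;
    our @ISA = ('RTF::Parser');

    sub text { push @seen, $_[1] }
}

# Write a tiny RTF document to a temporary file
my ( $out, $filename ) = tempfile( UNLINK => 1 );
print {$out} "{\\rtf1 Hello from a file}\n";
close $out or die "Cannot close $filename: $!";

# Parse by filename
my $by_name = TextLogger->new();
$by_name->control_definition( { '__DEFAULT__' => sub { } } );
$by_name->parse_stream($filename);

# Parse by reference to a filehandle
open my $in, '<', $filename or die "Cannot open $filename: $!";
my $by_handle = TextLogger->new();
$by_handle->control_definition( { '__DEFAULT__' => sub { } } );
$by_handle->parse_stream($in);
close $in;

print "$_\n" for @seen;
```

A fresh parser object is used for each run here purely to keep the two cases separate.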

parse_string( $string )

Pass this function a string to begin reading and processing.
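A minimal sketch of parsing from a string follows. TextCollector is a made-up subclass name, and the 'collected' key on the object is an assumption of this example rather than anything RTF::Parser defines. It relies on the text callback receiving the text as its first argument, as described under API.

```perl
use strict;
use warnings;
use RTF::Parser;

{
    # Hypothetical subclass, named for this example only
    package TextCollector;
    our @ISA = ('RTF::Parser');

    # Append each run of plain text to a buffer kept on the object
    sub text {
        my ( $self, $text ) = @_;
        $self->{collected} .= $text;
    }
}

my $parser = TextCollector->new();
$parser->{collected} = '';

# Ignore every control word
$parser->control_definition( { '__DEFAULT__' => sub { } } );

$parser->parse_string('{\rtf1 Hello world}');

print $parser->{collected}, "\n";
```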

control_definition

The code that's executed when we trigger a control event is kept in a hash. We're holding this somewhere in our object. Earlier versions would make the assumption we're being subclassed by RTF::Control, which isn't something I want to assume. If you are using RTF::Control, you don't need to worry about this, because we're grabbing %RTF::Control::do_on_control, and using that.

Otherwise, you pass this method a reference to a hash where the keys are control words, and the values are coderefs that you want executed. This sets all the callbacks... The arguments passed to your coderefs are: $self, control word itself (like, say, 'par'), any parameter the control word had, and then either 'start' or 'end' to say if we've come across it, or it's about to go out of scope.

If you don't pass it a reference, you get back the reference of the current control hash we're holding.
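To make the handler arguments concrete, here is a small sketch that installs one explicit handler and a '__DEFAULT__' fallback, then asks for the hash back. The @events array and the messages it collects are inventions of this example. The parameter is checked with defined because a control word may arrive without one.

```perl
use strict;
use warnings;
use RTF::Parser;

my @events;

my %handlers = (

    # Explicit handler for \par
    'par' => sub {
        my ( $self, $word, $param, $state ) = @_;
        push @events, "paragraph ($state)";
    },

    # Fallback for every control word without its own entry
    '__DEFAULT__' => sub {
        my ( $self, $word, $param ) = @_;
        $param = '' unless defined $param;
        push @events, "other: $word$param";
    },
);

my $parser = RTF::Parser->new();

# Set the callbacks...
$parser->control_definition( \%handlers );

# ...and read back whatever hash the parser is currently holding
my $current = $parser->control_definition();

$parser->parse_string('{\rtf1 one\par two}');

print "$_\n" for @events;
```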

rtf_control_emulation

If you pass it a boolean argument, it'll set whether or not it thinks RTF::Control has been loaded. If you don't pass it an argument, it'll return what it thinks...
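A short sketch of querying and then overriding the setting. It assumes the script itself has not loaded RTF::Control.

```perl
use strict;
use warnings;
use RTF::Parser;

my $parser = RTF::Parser->new();

# Ask what the parser currently believes
print "Emulating RTF::Control? ",
    ( $parser->rtf_control_emulation() ? "yes" : "no" ), "\n";

# Pretend RTF::Control is not in use, whatever %INC says
$parser->rtf_control_emulation(0);

print "Emulating RTF::Control now? ",
    ( $parser->rtf_control_emulation() ? "yes" : "no" ), "\n";
```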

dont_skip_destinations

The RTF spec says that we skip any destinations that we don't have an explicit handler for. You could well not want this. Accepts a boolean argument, true to process destinations, 0 to skip the ones we don't understand.
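The difference can be sketched by parsing the same input with the flag off and on. The destination name \mydest is made up so that no handler exists for it, and DestinationDemo, collect_text and the 'seen_text' key are likewise hypothetical. Going by the description above, the first call is expected to leave out the text inside the unknown destination, while the second should include it.

```perl
use strict;
use warnings;
use RTF::Parser;

{
    # Hypothetical subclass, named for this example only
    package DestinationDemo;
    our @ISA = ('RTF::Parser');

    sub text {
        my ( $self, $text ) = @_;
        push @{ $self->{seen_text} }, $text;
    }
}

# \*\mydest is an invented destination with no explicit handler
my $rtf = '{\rtf1 {\*\mydest hidden}visible}';

# Parse $rtf with the flag set as requested, returning the text seen
sub collect_text {
    my ($process_unknown) = @_;

    my $parser = DestinationDemo->new();
    $parser->{seen_text} = [];
    $parser->control_definition( { '__DEFAULT__' => sub { } } );
    $parser->dont_skip_destinations($process_unknown);
    $parser->parse_string($rtf);

    return join '|', @{ $parser->{seen_text} };
}

print "skipping:   ", collect_text(0), "\n";
print "processing: ", collect_text(1), "\n";
```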

API

These are some methods that you're going to want to override if you subclass this module. In general though, people seem to want to subclass RTF::Control, which subclasses this module.

parse_start

Called before we start parsing...

parse_end

Called when we're finished parsing.

group_start

Called when we encounter an opening {

group_end

Called when we encounter a closing }

text

Called when we encounter plain text. It is given the text as its first argument.

char

Called when we encounter a hex-escaped character. The hex characters are passed as the first argument.

symbol

Called when we come across a control character. This is interesting, because, I'd have treated these as control words, so, I'm using Philippe's list as control words that'll trigger this for you. These are -_~:|{}*'\. This needs to be tested.

bitmap

Called when we come across a command that's talking about a linked bitmap file. You're given the file name.

binary

Called when we have binary data. You get passed it.
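As an illustration of overriding several of these at once, the following sketch tracks how deeply groups are nested. DepthTracker and the 'depth' and 'max_depth' keys are hypothetical names chosen for the example.

```perl
use strict;
use warnings;
use RTF::Parser;

{
    # Hypothetical subclass, named for this example only
    package DepthTracker;
    our @ISA = ('RTF::Parser');

    # Reset the counters before parsing begins
    sub parse_start {
        my $self = shift;
        $self->{depth}     = 0;
        $self->{max_depth} = 0;
    }

    # Going one level deeper; remember the deepest level seen
    sub group_start {
        my $self = shift;
        $self->{depth}++;
        $self->{max_depth} = $self->{depth}
            if $self->{depth} > $self->{max_depth};
    }

    # Coming back up one level
    sub group_end { $_[0]{depth}-- }

    sub parse_end {
        my $self = shift;
        print "Deepest nesting: $self->{max_depth}\n";
    }
}

my $parser = DepthTracker->new();

# Ignore every control word
$parser->control_definition( { '__DEFAULT__' => sub { } } );

$parser->parse_string('{\rtf1 outer {\b inner} outer}');
```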

AUTHOR

Peter Sergeant rtf.parser@clueball.com, originally by Philippe Verdret

COPYRIGHT

Copyright 2004 Pete Sergeant.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

CREDITS

This work was carried out under a grant generously provided by The Perl Foundation - give them money!