NAME

XML::Filter::Dispatcher - Path based event dispatching with DOM support

SYNOPSIS

    use XML::Filter::Dispatcher qw( :all );

    my $f = XML::Filter::Dispatcher->new(
        Rules => [
           'foo'            => \&handle_foo_start_tag,
           '@bar'           => \&handle_bar_attr_at_start_tag,

           ## Send any <foo> elts and their contents to $handler
           'foo'            => $handler,

           ## Print the text of all <description> elements
           'string( description )' => sub { print xresult },
        ],
        Vars => [
           "id" => [ string => "12a" ],
        ],
    );

DESCRIPTION

WARNING: Beta code alert.

A SAX2 filter that dispatches SAX events based on "EventPath" patterns as the SAX events arrive. The SAX events are not buffered or converted to an in-memory document representation like a DOM tree. This provides for low lag operation because the actions associated with each pattern are executed as soon as possible, usually in an element's start_element() event method.

This differs from traditional XML pattern matching tools like XPath and XSLT (which is XPath-based) that require the entire document to be built in memory (as a "DOM tree") before queries can be executed. In SAX terms, this means that they have to build a DOM tree from SAX events and delay pattern matching until the end_document() event method is called.

Rules

A rule is composed of a pattern and an action. Each XML::Filter::Dispatcher instance has a list of rules; all rules with patterns that match a particular SAX event fire their actions when that SAX event is received.

Patterns

Note: this section describes EventPath and discusses differences between EventPath and XPath. If you are not familiar with XPath you may want to skim those bits; they're provided for the benefit of people coming from an XPath background but hopefully don't hinder others. A working knowledge of SAX is necessary for the advanced bits.

EventPath patterns may match the document, elements, attributes, text nodes, comments, processing instructions, and (not yet implemented) namespace nodes. Patterns like this are referred to as "location paths" and resemble Unix file paths or URIs in appearance and functionality.

Location paths describe a location (or set of locations) in the document much the same way a filespec describes a location in a filesystem. The path /a/b/c could refer to a directory named c on a filesystem or a set of <c> elements in an XML document. In either case, the path indicates that c must be a child of b, b must be a child of a, and a is a root level entity. More examples later.

EventPath patterns may also extract strings, numbers and boolean values from a document. These are called "expression patterns" and are only said to match when the values they extract are "true" according to XPath semantics (XPath truth-ness differs from Perl truth-ness, see EventPath Truth below). Expression patterns look like string( /a/b/c ) or number( part-number ), and if the result is true, the action will be executed and the result can be retrieved using the xresult method.

TODO: rename xresult to be ep_result or something.

We cover patterns in more detail below, starting with some examples.

If you'd like to get some experience with pattern matching in an interactive XPath web site, there's a really good XPath/XSLT based tutorial and lab at http://www.zvon.org/xxl/XPathTutorial/General/examples.html.

Actions

Two kinds of actions are supported: Perl subroutine calls and dispatching events to other SAX processors. When a pattern matches, the associated action is run.

Examples

This is perhaps best introduced by some examples. Here's a routine that runs a rather knuckleheaded document through a dispatcher:

    use XML::SAX::Machines qw( Pipeline );

    sub run { Pipeline( shift )->parse_string( <<XML_END ) }
      <stooges>
        <stooge name="Moe" hairstyle="bowl cut">
          <attitude>Bully</attitude>
        </stooge>
        <stooge name="Shemp" hairstyle="mop">
          <attitude>Klutz</attitude>
          <stooge name="Larry" hairstyle="bushy">
            <attitude>Middleman</attitude>
          </stooge>
        </stooge>
        <stooge name="Curly" hairstyle="bald">
          <attitude>Fool</attitude>
          <stooge name="Shemp" repeat="yes">
            <stooge name="Joe" hairstyle="bald">
              <stooge name="Curly Joe" hairstyle="bald" />
            </stooge>
          </stooge>
        </stooge>
      </stooges>
    XML_END

Counting Stooges

Let's count the number of stooge characters in that document. To do that, we'd like a rule that fires on almost all <stooge> elements:

    my $count;

    run(
        XML::Filter::Dispatcher->new(
            Rules => [
                'stooge' => sub { ++$count },
            ],
        )
    );

    print "$count\n";  ## 7

Hmmm, that's one too many: it's picking up on Shemp twice since the document shows that Shemp had two periods of stoogedom. The second node has a convenient repeat="yes" attribute we can use to ignore the duplicate.

We can ignore the duplicate element by adding a "predicate" expression to the pattern to accept only those elements with no repeat attribute. Changing that rule to

                'stooge[not(@repeat)]' => ...

or even the more pedantic

                'stooge[not(@repeat) or not(@repeat = "yes")]' => ...

yields the expected answer (6).

Hairstyles and Attitudes

Now let's try to figure out the hairstyles the stooges wore. To extract just the names of hairstyles, we could do something like:

    my %styles;

    run(
        XML::Filter::Dispatcher->new(
            Rules => [
                'string( @hairstyle )' => sub { $styles{xresult()} = 1 },
            ],
        )
    );

    print join( ", ", sort keys %styles ), "\n";

which prints "bald, bowl cut, bushy, mop". That rule extracts the text of each hairstyle attribute and xresult() returns it.

The text contents of elements like <attitude> can also be sussed out by using a rule like:

                'string( attitude )' => sub { $styles{xresult()} = 1 },

which prints "Bully, Fool, Klutz, Middleman".

Finally, we might want to correlate hairstyles and attitudes by using a rule like:

    my %styles;

    run(
        XML::Filter::Dispatcher->new(
            Rules => [
                'concat(@hairstyle,"=>",attitude)' => sub {
                    $styles{$1} = $2 if xresult() =~ /(.+)=>(.+)/;
                },
            ],
        )
    );

    print map "$_ => $styles{$_}\n", sort keys %styles;

which prints:

    bald => Fool
    bowl cut => Bully
    bushy => Middleman
    mop => Klutz

Examples that need to be written

  • Examples of dispatching to other SAX handlers

  • Examples for accumulating data

  • Advanced pattern matching examples

EventPath Dialect

"EventPath" patterns are that large subset of XPath patterns that can be run in a SAX environment without a DOM. There are a few crucial differences between the environments that EventPath and XPath each operate in.

XPath operates on a tree of "nodes" where each entity in an XML document has only one corresponding node. The tree metaphor used in XPath has a literal representation in memory. For instance, an element <foo> is represented by a single node which contains other nodes.

EventPath operates on a series of events. Documents and elements, which are each represented by a single node in a DOM tree, are each represented by two event method calls, start_...() and end_...(). This means that EventPath patterns may match in a start_...() method or an end_...() method, or even both if you try hard enough. Not all patterns have this dual nature; comment matches occur only in comment() event methods, for instance.

EventPath patterns match as early in the document as possible. The only times an EventPath pattern will match in an end_...() method are when the pattern refers to an element's contents or it uses the is-end-event() function (described below) to do so intentionally.

The tree metaphor is used to arrange and describe the relationships between events. In the DOM trees an XPath engine operates on, a document or an element is represented by a single entity, called a node. In the event streams that EventPath operates on, documents and elements are each represented by a pair of events, start_...() and end_...().

Why EventPath and not XPath?

EventPath is not a standard of any kind, but XPath can't cope with situations where there is no DOM, and there are some features that EventPath needs (start_element() vs. end_element() processing, for example) that are not compatible with XPath.

Some of the features of XPath require that the source document be fully translated into a DOM tree of nodes before the features can be evaluated. (Nodes are things like elements, attributes, text, comments, processing instructions, namespace mappings etc).

These features are not supported and are not likely to be. You might want to use XML::Filter::XSLT for "full" XPath support (though it is in an XSLT framework) or wait for XML::Twig to grow SAX support.

Rather than build a DOM, XML::Filter::Dispatcher only keeps a bare minimum of nodes: the current node and its parent, grandparent, and so on, up to the document ("root") node (basically the /ancestor-or-self:: axis). This is called the "context stack", although you may not need to know that term unless you delve into the guts.
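The idea of a context stack can be illustrated without the module at all. The following is a minimal sketch in plain Perl, not the module's actual implementation: the event list and the ancestor_paths() helper are hypothetical names invented for this illustration. It shows that the path to the current element can be reconstructed from nothing more than the stack of currently open elements.

```perl
use strict;
use warnings;

# Hypothetical, simplified event stream: [ event_name, element_name ] pairs.
my @events = (
    [ start_element => 'a' ],
    [ start_element => 'b' ],
    [ start_element => 'c' ],
    [ end_element   => 'c' ],
    [ end_element   => 'b' ],
    [ start_element => 'd' ],
    [ end_element   => 'd' ],
    [ end_element   => 'a' ],
);

# Returns one "/a/b/c"-style path per start_element event, built only
# from the stack of currently open elements (no tree is ever built).
sub ancestor_paths {
    my @stack;    # the "context stack": current node and its ancestors
    my @paths;
    for my $event (@_) {
        my ( $type, $name ) = @$event;
        if ( $type eq 'start_element' ) {
            push @stack, $name;
            push @paths, '/' . join( '/', @stack );
        }
        elsif ( $type eq 'end_element' ) {
            pop @stack;    # a closed element is forgotten immediately
        }
    }
    return @paths;
}

print "$_\n" for ancestor_paths(@events);
```

Note that once <b> has closed, nothing about it is available when <d> opens, which is why axes that look backwards or sideways in the document cannot be answered from the stack alone.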

EventPath Truth

EventPath borrows a lot from XPath including its notion of truth. This is different from Perl's notion of truth; presumably to make document processing easier. Here's a table that may help; the important differences are towards the end:

    Expression      EventPath  XPath    Perl
    ==========      =========  =====    ====
    false()         FALSE      FALSE    n/a (not applicable)
    true()          TRUE       TRUE     n/a
    0               FALSE      FALSE    FALSE
    -0              FALSE**    FALSE    n/a
    NaN             FALSE**    FALSE    n/a (not fully, anyway)
    1               TRUE       TRUE     TRUE
    ""              FALSE      FALSE    FALSE
    "1"             TRUE       TRUE     TRUE

    "0"             TRUE       TRUE     FALSE

 * To be regarded as a bug in this implementation
 ** Only partially implemented/supported in this implementation

Note: it looks like XPath 2.0 is defining a more workable concept for document processing that uses something resembling Perl's empty lists, (), to indicate empty values, so "" and () will be distinct and "0" can be interpreted as false like in Perl. XPath2 is not provided by this module yet and won't be for a long time (patches welcome ;).
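To make the table concrete, here is a small sketch in plain Perl of the XPath 1.0 boolean conversion for typed scalars, compared with Perl's own truth test. The xpath_true() helper is a hypothetical name used only for this illustration and is not part of XML::Filter::Dispatcher; NaN and -0 are left out because, as noted, Perl's handling of them varies.

```perl
use strict;
use warnings;

# XPath 1.0 boolean() conversion for the three scalar types.
# (Hypothetical helper for illustration; NaN and -0 are not handled.)
sub xpath_true {
    my ( $type, $value ) = @_;
    return $value ? 1 : 0         if $type eq 'boolean';
    return length($value) ? 1 : 0 if $type eq 'string';    # any non-empty string
    return $value != 0 ? 1 : 0    if $type eq 'number';    # any non-zero number
    die "unsupported type '$type'";
}

my @cases = (
    [ number => 0 ],
    [ number => 1 ],
    [ string => "" ],
    [ string => "1" ],
    [ string => "0" ],
);

for my $case (@cases) {
    my ( $type, $value ) = @$case;
    printf "%-7s %-4s XPath=%-5s Perl=%s\n", $type, qq{"$value"},
        xpath_true( $type, $value ) ? "TRUE" : "FALSE",
        $value                      ? "TRUE" : "FALSE";
}
```

The string "0" is the only case here where the two columns disagree: it is a non-empty string, so XPath treats it as true, while Perl treats it as false.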

EventPath Examples

All of this means that only a portion of XPath is available. Luckily, that portion is also quite useful. Here are examples of working XPath expressions, followed by known unimplemented features.

TODO: There is also an extension function available to differentiate between start_... and end_... events. By default

Examples

 Expression          Event Type      Description (event type)
 ==========          ==========      ========================
 /                   start_document  Selects the document node
 /[is-end-event()]   end_document       "     "     "      "

 /a                  start_element   Root elt, if it's "<a ...>"
 /a[is-end-event()]  end_element       "   "   "  "       "

 a                   start_element   All "a" elements
 a[is-end-event()]   end_element      "   "     "

 b//c                start_element   All "c" descendants of "b" elt.s

 @id                 start_element   All "id" attributes

 string( foo )       end_element     fires at each </foo> or <foo/>;
                                     xresult() returns the
                                     text contained in "<foo>...</foo>"

 string( @name )     start_element   All "name" attributes;
                                     xresult() returns the
                                     text of the attribute.

Methods

new

    my $f = XML::Filter::Dispatcher->new(
        Rules => [   ## Order is significant
            "/foo/bar" => sub {
                ## Code to execute
            },
        ],
    );

xresult

    "string( foo )" => sub { print xresult, "\n" }, # if imported
    "string( foo )" => sub { print shift->xresult, "\n" },

Returns the result of the last EventPath evaluated; this is the result that fired the current rule. The example prints all text node children of <foo> elements, for instance.

xset_var

    "foo" => sub { xset_var( bar => string => "bingo" ) }, # if imported
    "foo" => sub { shift->xset_var( bar => boolean => 1 ) },

Sets an XPath variable visible in the current context and all child contexts. It will not be visible in parent contexts or sibling contexts.

Legal types are boolean, number, and string. Node sets and nodes are unsupported at this time, and "other" types are not useful unless you work in your own functions that handle them.

Variables are visible as $bar variable references in XPath expressions and using xget_var in Perl code. Setting a variable to a new value temporarily overrides any existing value, somewhat like using Perl's local.

xget_var

    "bar" => sub { print xget_var( "bar" ) }, # if imported
    "bar" => sub { print shift->xget_var( "bar" ) },

Retrieves a single variable from the current context. This may have been set by a parent or by a previous rule firing on this node, but not by children or preceding siblings.

Returns undef if the variable is not set (or if it was set to undef).

xget_var_type

    "bar" => sub { print xget_var_type( "bar" ) }, # if imported
    "bar" => sub { print shift->xget_var_type( "bar" ) },

Retrieves the type of a variable from the current context. This may have been set by a parent or by a previous rule firing on this node, but not by children or preceding siblings.

Returns undef if the variable is not set.

Notes for XPath Aficionados

This section assumes familiarity with XPath in order to explain some of the particulars and side effects of the incremental XPath engine.

  • Much of XPath's power comes from the concept of a "node set". A node set is a set of nodes returned by many XPath expressions. EventPath fires a rule once for each node the rule applies to. If there is a location path in the expression, the rule will fire once for each document element (perhaps twice if both start and end SAX events are trapped, see is-start-event() and is-end-event() below).

    Expressions like 0, false(), 1, and 'a' have no location path and apply to all nodes (including namespace nodes and processing instructions).

  • Because of the implied set membership operation on node set expressions, foo, ./foo, .//foo and //foo are all equivalent rules; they all fire for every element node named "foo" in the document. This is because the context is always that of the current node for the SAX event (except for attributes, which SAX doesn't have an event for, but we act like it did; i.e. each attr gets its own context to operate in). This is a lot like the match= expression in XSLT <xsl:template> constructs.

  • The XPath parser catches some simple mistakes Perlers might make in typing XPath expressions, such as using && or == instead of and or =.

  • SAX does not define events for attributes; these are passed in to the start_element (but not end_element) methods as part of the element node. XML::Filter::Dispatcher does allow selecting attribute nodes and passes in just the selected attribute node, see the examples above.

  • Axes in path steps (/foo::...)

    Only some axes can be reasonably supported within a SAX framework without building a DOM and/or queueing SAX events for in-document-order delivery.

  • text node aggregation

    SAX does not guarantee that characters events will be aggregated as much as possible, as text() nodes do in XPath. Generally, however, this is not a problem; instead of writing

        "quotation/text()" => sub {
            ## BUG: may be called several times within each quotation elt.
            my $self = shift;
            print "He said '", $self->current_node->{Data}, "'\n";
        },

    write

        "string( quotation )" => sub {
            my $self = shift;
            print "He said '", $self->expression_result, "'\n";
        },

    The former is unsafe; consider the XML:

        <quotation>I am <!-- bs -->GREAT!<!-- bs --></quotation>

    Rules like .../text() will fire twice, which is not what is needed here.

    Rules like string( ... ) will fire once, at the end_element event, with all descendant text of quotation as the expression result.

    You can also place an XML::Filter::BufferText instance upstream of XML::Filter::Dispatcher if you really want to use the former syntax (but the GREAT! example will still generate more than one event due to the comment).

  • Axes

    o self (yes)

    o descendant (yes)

    o descendant-or-self (yes)

    o child (yes)

    o attribute (yes)

    o namespace (todo)

    o ancestor (todo, will be limited)

    o ancestor-or-self (todo, will be limited)

    o parent (todo, will be limited)

      parent/ancestor paths will not allow you to descend the tree; that would require DOM building and SAX event queueing.

    o preceding (no: reverse axis, would require DOM building)

    o preceding-sibling (no: reverse axis, would require DOM building)

    o following (no: forward axis, would require DOM building and rule activation queueing)

    o following-sibling (no: forward axis, would require DOM building and rule activation queueing)

  • Implemented XPath Features

    Anything not on this list or listed as unimplemented is a TODO. Ring me up if you need it.

    • String Functions

      o concat( string, string, string* )

      o contains( string, string )

      o normalize-space( string )

      o starts-with( string, string )

      o string( object )

        Object may be a number, boolean, string, or the result of a location path:

            string( 10 );
            string( /a/b/c );
            string( @id );

        Unlike normal DOM oriented XPath, calling string on a location path causes the string to be calculated once each time the location path matches. So a rule like:

            "string(@part-number)" => sub {
                my $self = shift;
                print "Part number: ", $self->expression_result, "\n";
            }

        will print as many times as there are part-number attributes in the document. This is true anywhere an XPath node set is used as an argument to a function or logical operator.

      o string-length( string )

        string-length() with no argument is not supported; it would have to stringify the context node, which can't be done without keeping all of the context node's children in memory. Could enable it for leaf nodes, I suppose, like attrs and #PCDATA containing elts. Drop me a line if you need this (it's not totally trivial or I'd have done it).

      o substring( string, number, number? )

      o substring-after( string, string )

      o substring-before( string, string )

      o translate( string, string, string )

    • Boolean Functions, Operators

      o boolean( object )

        See notes about node sets for the string() function above.

      o false()

      o not( object )

        See notes about node sets for the string() function above.

      o true()

    • Number Functions, Operators

      o ceil( number )

      o floor( number )

      o number( object )

        Converts strings, numbers, booleans, or the result of a location path (number( /a/b/c )). See the string( object ) description above for more information on location paths.

        Unlike real XPath, this dies if the object cannot be cleanly converted into a number. This is due to Perl's varying level of support for NaN, and may change in the future.

    • All relational operators

      No support for nodesets, though.

    • All logical operators

      Supports limited nodesets, see the string() function description for details.

    • Additional Functions

      o is-end-event()

        This is an extension function that returns true when an end_element or end_document event is being processed.

      o is-start-event()

        This is an extension function that returns true when handling any SAX event other than end_element or end_document.

  • Missing Features

    Some features are entirely or just currently missing due to the lack of nodesets or the time needed to work around their lack. This is an incomplete list; it's growing as I find new things not to implement.

    o count()

      No nodesets => no count() of nodes in a node set.

    o last()

      With SAX, you can't tell when you are at the end of what would be a node set in XPath.

    o position()

      I will implement pieces of this as I can. None are implemented as yet.

  • Todo features

    o id()

    o lang()

    o local-name()

      May not be able to handle local-name( arg ), just argless local-name().

    o name()

      May not be able to handle name( arg ), just argless name().

    o namespace-uri()

      May not be able to handle namespace-uri( arg ), just argless namespace-uri().

    o sum( node-set )

  • Extensions

    o is-start-event(), is-end-event()

      XPath has no concept of time; it's meant to operate on a tree of nodes. SAX has start_element and end_element events and start_document and end_document events.

      By default, XML::Filter::Dispatcher acts on start events and not end events (note that all rules are evaluated on both, but the actions are not run on end_ events by default).

      By including a call to the is-start-event() or is-end-event() functions in a predicate the rule may be forced to fire only on end events or on both start and end events (using a [is-start-event() or is-end-event()] idiom).
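The text node aggregation note earlier in this list explains why string( quotation ) fires once, at the end_element event, while quotation/text() may fire several times. The sketch below illustrates the difference in plain Perl. It does not use XML::Filter::Dispatcher; the simplified event list and the aggregate_text() helper are hypothetical names made up for this example.

```perl
use strict;
use warnings;

# Hypothetical, simplified event stream for:
#   <quotation>I am <!-- bs -->GREAT!<!-- bs --></quotation>
my @events = (
    [ start_element => 'quotation' ],
    [ characters    => 'I am ' ],
    [ comment       => ' bs ' ],
    [ characters    => 'GREAT!' ],
    [ comment       => ' bs ' ],
    [ end_element   => 'quotation' ],
);

# Buffers character data per open element and returns one string per
# element, produced only when that element's end_element event arrives.
sub aggregate_text {
    my @buffers;    # one text buffer per currently open element
    my @results;
    for my $event (@_) {
        my ( $type, $data ) = @$event;
        if ( $type eq 'start_element' ) {
            push @buffers, '';
        }
        elsif ( $type eq 'characters' ) {
            $_ .= $data for @buffers;    # descendant text counts for every ancestor
        }
        elsif ( $type eq 'end_element' ) {
            push @results, pop @buffers;
        }
        # comments and other events contribute no text
    }
    return @results;
}

print "He said '$_'\n" for aggregate_text(@events);
```

A handler attached to each characters event would see 'I am ' and 'GREAT!' separately; waiting for end_element is what makes a single, complete string possible.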

TODO

  • Namespace support.

  • Text node aggregation so text() handlers fire once per text node instead of once per characters() event.

  • Nice messages on legitimate but unsupported axes.

  • /../ (parent node)

  • add_rule(), remove_rule(), set_rules() methods.

LIMITATIONS

  • NaN is not handled properly due to mediocre support in Perl, especially on platforms where it apparently isn't easily supported.

  • -0 (negative zero) is not provided or handled properly.

  • +/- Infinity is not handled properly due to mediocre support in Perl, especially on platforms where it apparently isn't easily supported.

This is more of a frustration than a limitation, but this class requires that you pass in a type when setting variables (in the Vars ctor parameter or when calling xset_var). This is so that the engine can tell what type a variable is, since string(), number() and boolean() all treat the Perlian 0 differently depending on its type. In Perl the digit 0 means false, 0 or '0', depending on context, but it's a consistent semantic. When passing a 0 from Perl-land to XPath-land, we need to give it a type so that string() can, for instance, decide whether to convert it to '0' or 'false'.
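As a rough illustration of why the type is needed, here is a plain Perl sketch of the XPath 1.0 string() conversion applied to the same Perl value under different declared types. The xpath_string() helper is a hypothetical name for this example only, it is not the module's code, and it only considers simple integer values.

```perl
use strict;
use warnings;

# XPath 1.0 string() conversion for typed scalars.
# (Hypothetical helper for illustration; only simple integers are considered.)
sub xpath_string {
    my ( $type, $value ) = @_;
    return $value ? 'true' : 'false' if $type eq 'boolean';
    return "$value"                  if $type eq 'string';
    return 0 + $value                if $type eq 'number';
    die "unsupported type '$type'";
}

# The same Perl value, 0, converts differently depending on its declared type.
for my $type (qw( boolean number string )) {
    printf "%-8s 0 => '%s'\n", $type, xpath_string( $type, 0 );
}
```

Without the declared type there would be no way to choose between 'false' and '0' for a bare Perl 0.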

THANKS

...to Kip Hampton, Robin Berjon and Matt Sergeant for sanity checks and to James Clark (of Expat fame) for posting a Yacc XPath grammar where I could snarf it years later and add lots of Perl code to it.

AUTHOR

    Barrie Slaymaker <barries@slaysys.com>

COPYRIGHT

    Copyright 2002, Barrie Slaymaker, All Rights Reserved.

You may use this module under the terms of the Artistic or GNU Public licenses, at your choice. Also, a portion of XML::Filter::Dispatcher::Parser is covered by:

        The Parse::Yapp module and its related modules and shell scripts are
        copyright (c) 1998-1999 Francois Desarmenien, France. All rights
        reserved.

        You may use and distribute them under the terms of either the GNU
        General Public License or the Artistic License, as specified in the
        Perl README file.

Note: Parse::Yapp is only needed if you want to modify lib/XML/Filter/Dispatcher/Grammar.pm
