The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

TPath - general purpose path languages for trees

VERSION

version 0.008

SYNOPSIS

  # we define our trees
  package MyTree;
  
  use overload '""' => sub {
      my $self     = shift;
      my $tag      = $self->{tag};
      my @children = @{ $self->{children} };
      return "<$tag/>" unless @children;
      local $" = '';
      "<$tag>@children</$tag>";
  };
  
  sub new {
      my ( $class, %opts ) = @_;
      die 'tag required' unless $opts{tag};
      bless { tag => $opts{tag}, children => $opts{children} // [] }, $class;
  }
  
  sub add {
      my ( $self, @children ) = @_;
      push @{ $self->{children} }, $_ for @children;
  }
  
  # teach TPath::Forester how to get the information it needs
  package MyForester;
  use Moose;
  use MooseX::MethodAttributes;    # needed for @tag attribute below
  with 'TPath::Forester';
  
  # implement required methods
  
  sub children {
      my ( $self, $n ) = @_;
      @{ $n->{children} };
  }
  
  sub tag : Attr {                 # also an attribute!
      my ( $self, $n ) = @_;
      $n->{tag};
  }
  
  package main;
  
  # make the tree
  #      a
  #     /|\
  #    / | \
  #   b  c  \
  #  /\  |   d
  #  e f |  /|\
  #      h / | \
  #     /| i j  \
  #    l | | |\  \
  #      m n o p  \
  #     /|    /|\  \
  #    s t   u v w  k
  #                / \
  #               q   r
  #                  / \
  #                 x   y
  #                     |
  #                     z
  my %nodes = map { $_ => MyTree->new( tag => $_ ) } 'a' .. 'z';
  $nodes{a}->add($_) for @nodes{qw(b c d)};
  $nodes{b}->add($_) for @nodes{qw(e f)};
  $nodes{c}->add( $nodes{h} );
  $nodes{d}->add($_) for @nodes{qw(i j k)};
  $nodes{h}->add($_) for @nodes{qw(l m)};
  $nodes{i}->add( $nodes{n} );
  $nodes{j}->add($_) for @nodes{qw(o p)};
  $nodes{k}->add($_) for @nodes{qw(q r)};
  $nodes{m}->add($_) for @nodes{qw(s t)};
  $nodes{p}->add($_) for @nodes{qw(u v w)};
  $nodes{r}->add($_) for @nodes{qw(x y)};
  $nodes{y}->add( $nodes{z} );
  my $root = $nodes{a};
  
  # make our forester
  my $rhood = MyForester->new;
  
  # index our tree (not necessary, but efficient)
  my $index = $rhood->index($root);
  
  # try out some paths
  my @nodes = $rhood->path('//r')->select( $root, $index );
  print scalar @nodes, "\n";    # 1
  print $nodes[0], "\n";        # <r><x/><y><z/></y></r>
  print $_
    for $rhood->path('leaf::*[@tag > "o"]')->select( $root, $index )
    ;                           # <s/><t/><u/><v/><w/><q/><x/><z/>
  print "\n";
  print $_->{tag}
    for $rhood->path('//*[@tsize = 3]')->select( $root, $index );    # bm
  print "\n";
  @nodes = $rhood->path('/>~[bh-z]~')->select( $root, $index );
  print $_->{tag} for @nodes;                                        # bhijk
  print "\n";
  
  # we can map nodes back to their parents
  @nodes = $rhood->path('//*[parent::~[adr]~]')->select( $root, $index );
  print $_->{tag} for @nodes;                                        # bcijxykd
  print "\n";

DESCRIPTION

TPath provides an XPath-like language for arbitrary trees. You implement a minimum of two methods -- children and tag -- and then you can explore your trees via concise, declarative paths.

In TPath, "attributes" are node attributes of any sort and are implemented as methods that return these attributes, or undef if the attribute is undefined for the node.

The object in which the two required methods are implemented is a "forester" (TPath::Forester), something that understands your trees. In general to use TPath you instantiate a forester and then call the forester's methods.

Forester objects make use of an index (TPath::Index), which caches information not present in, or not cheaply from, the nodes themselves. If no index is explicitly provided it is created, but one can gain some efficiency by reusing an index when select paths from a tree. One can use a forester's index method to produce a TPath::Index.

The paths themselves are compiled into reusable TPath::Expression objects that can be applied to multiple trees. One use's a forester's path method to produce a TPath::Expression.

ALGORITHM

TPath works by representing an expression as a pipeline of selectors and filters. Each pair of a selector and some set of filters is called a "step". At each step one has a set of context nodes. One applies the selectors to each context node, returning a candidate node set, and then one passes these candidates through the filtering predicates. The remainder becomes the context node set for the next step. If this is the last step, the surviving candidates are the nodes selected by the expression. A node will only occur once among those returned and the order of their return will be the order of their discovery. Search is depth-first post-ordered -- children returned before parents.

SYNTAX

Sub-Paths

A tpath expression has one or more sub-paths.

    //a/b|preceding::d/*

    //a/b|preceding::d/*

Sub-paths are separated by the pipe symbol | and optional space.

The nodes selected by a path is the union of the nodes selected by each sub-path in the order of their discovery. The search is left-to-right and depth first. If a node and its descendants are both selected, the descendants will be listed first.

Steps

    //a/b[0]/>c[@d]

    //a/b[0]/>c[@d]

    //a/b[0]/>c[@d]

Each step consists of a separator (optional on the first step), a tag selector, and optionally some number of predicates.

Separators

    a/b/c/>d

    /a/b//c/>d

    //a/b//c/>d

    />a/b//c/>d

null separator

    a/b/c/>d

The null separator is simply the absence of a separator and can only occur before the first step. It means "relative to the context node". Thus is it essentially the same as the file path formalism, where /a means the file a in the root directory and a means the file a in the current directory.

/

    /a/b//c/>d

The single slash separator means "search among the context node's children", or if it precedes the first step it means that the context node is the root node.

// select among descendants

    //a/b//c/>d

The double slash separator means "search among the descendants of the context node" or, if the context node is the root, "search among the root node and its descendants".

/> select closest

    />a/b//c/>d

The /> separator means "search among the descendants of the context node (or the context node and its descendants if the context node is root), but omit from consideration any node dominated by a node matching the selector". Written out like this this may be confusing, but it is a surprisingly useful separator. Consider the following tree

         a
        / \
       b   a
       |   | \
       a   b  a
       |      |
       b      b

The expression />b when applied to the root node will select all the b nodes except the leftmost leaf b, which is screened from the root by its grandparent b node. That is, going down any path from the context node />b will match the first node it finds matching the selector -- the matching node closest to the context node.

Selectors

Selectors select a candidate set for later filtering by predicates.

literal

    a

A literal selector selects the nodes whose tag matches, in a tree-appropriate sense of "match", a literal expression.

Any string may be used to represent a literal selector, but certain characters may have to be escaped with a backslash. The expectation is that the literal with begin with a word character, _, or $ and any subsequent character is either one of these characters, a number character or a hyphen or colon followed by one of these or number character. The escape character, as usual, is a backslash. Any unexpected character must be escaped. So

    a\\b

represents the literal a\b.

~a~ regex

    ~a~

A regex selector selects the nodes whose tag matches a regular expression delimited by tildes. Within the regular expression a tilde must be escaped, of course. A tilde within a regular expression is represented as a pair of tildes. The backslash, on the other hand, behaves as it normally does within a regular expression.

@a attribute

Any attribute may be used as a selector so long as it is preceded by something other than the null separator -- in other words, @ cannot be the first character in a path. This is because attributes may take arguments and among other things these arguments can be both expressions and other attributes. If @foo were a legitimate path expression it would be ambiguous how to compile @bar(@foo). Is the argument an attribute or a path with an attribute selector. You can produce the effect of an attribute selector with the null separator, however, in two ways

    child::@foo

    ./@foo

the second of these will be normalized in parsing to precisely what one would expect with a @foo path.

complement selectors

The ^ character before a literal, regex, or attribute selector will convert it into an attribute selector.

//^foo

//^~foo~

//^@foo

Complement selectors select nodes not selected by the unmodified selector; //^foo will select any node without the foo tag, and so forth.

* wildcard

The wildcard selector selects all the nodes an the relevant axis. The default axis is child, so //b/* will select all the children of b nodes.

Axes

To illustrate the nodes on various axes I will using the following tree, showing which nodes are selected from the tree relative the the c node. Selected nodes will be in capital letters.

         root
          |
          a
         /|\
        / | \
       /  |  \
      /   |   \
     /    |    \
    b     c     d
   /|\   /|\   /|\
  e f g h i j l m n
    |     |     |
    o     p     q
ancestor
  //c/ancestor::*

         ROOT
          |
          A
         /|\
        / | \
       /  |  \
      /   |   \
     /    |    \
    b     c     d
   /|\   /|\   /|\
  e f g h i j l m n
    |     |     |
    o     p     q
ancestor-or-self
         ROOT
          |
          A
         /|\
        / | \
       /  |  \
      /   |   \
     /    |    \
    b     C     d
   /|\   /|\   /|\
  e f g h i j l m n
    |     |     |
    o     p     q
child
  //c/child::*

         root
          |
          a
         /|\
        / | \
       /  |  \
      /   |   \
     /    |    \
    b     c     d
   /|\   /|\   /|\
  e f g H I J l m n
    |     |     |
    o     p     q
descendant
  //c/descendant::*

         root
          |
          a
         /|\
        / | \
       /  |  \
      /   |   \
     /    |    \
    b     c     d
   /|\   /|\   /|\
  e f g H I J l m n
    |     |     |
    o     P     q
descendant-or-self
  //c/descendant-or-self::*

         root
          |
          a
         /|\
        / | \
       /  |  \
      /   |   \
     /    |    \
    b     C     d
   /|\   /|\   /|\
  e f g H I J l m n
    |     |     |
    o     P     q
following
  //c/following::*

         root
          |
          a
         /|\
        / | \
       /  |  \
      /   |   \
     /    |    \
    b     c     D
   /|\   /|\   /|\
  e f g h i j L M N
    |     |     |
    o     p     Q
following-sibling
  //c/following-sibling::*

         root
          |
          a
         /|\
        / | \
       /  |  \
      /   |   \
     /    |    \
    b     c     D
   /|\   /|\   /|\
  e f g h i j l m n
    |     |     |
    o     p     q
leaf
  //c/leaf::*

         root
          |
          a
         /|\
        / | \
       /  |  \
      /   |   \
     /    |    \
    b     c     d
   /|\   /|\   /|\
  e f g H i J l m n
    |     |     |
    o     P     q
parent
  //c/parent::*

         root
          |
          A
         /|\
        / | \
       /  |  \
      /   |   \
     /    |    \
    b     c     d
   /|\   /|\   /|\
  e f g h i j l m n
    |     |     |
    o     p     q
preceding
  //c/preceding::*

         root
          |
          a
         /|\
        / | \
       /  |  \
      /   |   \
     /    |    \
    B     c     d
   /|\   /|\   /|\
  E F G h i j l m n
    |     |     |
    O     p     q
preceding-sibling
  //c/preceding-sibling::*

         root
          |
          a
         /|\
        / | \
       /  |  \
      /   |   \
     /    |    \
    B     c     d
   /|\   /|\   /|\
  e f g h i j l m n
    |     |     |
    o     p     q
self
  //c/self::*

         root
          |
          a
         /|\
        / | \
       /  |  \
      /   |   \
     /    |    \
    b     C     d
   /|\   /|\   /|\
  e f g h i j l m n
    |     |     |
    o     p     q
sibling
  //c/sibling::*

         root
          |
          a
         /|\
        / | \
       /  |  \
      /   |   \
     /    |    \
    B     c     D
   /|\   /|\   /|\
  e f g h i j l m n
    |     |     |
    o     p     q
sibling-or-self
  //c/sibling-or-self::*

         root
          |
          a
         /|\
        / | \
       /  |  \
      /   |   \
     /    |    \
    B     C     D
   /|\   /|\   /|\
  e f g h i j l m n
    |     |     |
    o     p     q

Predicates

    //a/b[0]/>c[@d][@e < 'string'][@f or @g]

    //a/b[0]/>c[@d][@e < 'string'][@f or @g]

    //a/b[0]/>c[@d][@e < 'string'][@f or @g]

    //a/b[0]/>c[@d][@e < 'string'][@f or @g]

Predicates are the sub-expressions in square brackets after selectors. They represents tests that filter the candidate nodes selected by the selectors.

There may be space inside the square brackets.

Index Predicates

  //foo/bar[0]

An index predicate simply selects the indexed item out of a list of candidates. The first index is 0, unlike in XML, so the expression above selects the first bar under every foo.

The index rules are the same as those for Perl arrays: 0 is the first item; negative indices count from the end, so -1 retrieves the last item.

Attributes

  //foo[@bar]
  //foo[@bar(1, 'string', path, @attribute, @attribute = 'test')]

Attributes identify callbacks that evaluate the context node to see whether the respective attribute is defined for it. If the callback returns a defined value, the predicate is true and the candidate is accepted; otherwise, it is rejected.

As the second example above demonstrates, attributes may take arguments and these arguments may be numbers, strings, paths, other attributes, or attribute tests (see below). Paths are evaluated relative to the candidate node being tested, as are attributes and attribute tests. A path arguments represents the nodes selected by this path relative to the candidate node.

Attribute parameters are enclosed within parentheses. Within these parentheses, they are delimited by commas. Space is optional around parameters.

For the standard attribute set available to all expressions, see TPath::Attributes::Standard. For the extended set that can be composed in, see TPath::Attributes::Extended. There are various ways one can add bespoke attributes but the easiest is to add them to an individual forester via the add_attribute method:

  my $forester = MyForester->new;
  $forester->add_attribute( 'foo' => sub {
     my ( $self, $node, $index, $collection, @params) = @_;
     ...
  });

TODO: explain other means to define new attributes.

Attribute Tests

Attribute tests compare attributes to literals, numbers, or other attributes

TODO: complete this section.

Boolean Predicates

Boolean predicates combine various terms -- attributes, attribute tests, or tpath expressions -- via boolean operators:

! or not

True iff the attribute is undefined, the attribute test returns false, the expression returns no nodes, or the boolean expression is false.

& or and

True iff all conjoined operands are true.

|| or or

True iff any of the conjoined operands is true.

Note that boolean or is two pipe characters. This is to disambiguate the path expression a|b from the boolean expression a||b.

^ or xor

True if one and only one of the conjoined operands is true. The expression

  @a ^ @b

Behaves like ordinary exclusive or. But if more than two operands are conjoined this way, the entire expression is a uniqueness test.

( ... )

Groups the contained boolean operations. True iff they evaluate to true.

The normal precedence rules of logical operators applies to these:

  () < ! < & < ^ < ||

Space is required around operators only where necessary to prevent their being interpreted as part of a path or attribute.

Special Selectors

There are three special selectors that cannot occur with predicates.

. : Select Self

This is an abbreviation for self::*.

.. : Select Parent

This is an abbreviation for parent::*.

id(foo) : Select By Index

This selector selects the node, if any, with the given id. This same node can also be selected by //*[@id = 'foo'] but this is much less efficient.

Hiding Nodes

In some cases there may be nodes -- spaces, comments, hidden directories and files -- that you want your expressions to treat as invisible. To do this you add invisibility tests to the forester object that generates expressions.

  my $forester = MyForester->new;
  $forester->add_test( sub {
     my ($forester, $node, $index) = @_;
     ... # return true if the node should be invisible
  });

One can put this in the forester's BUILD method to make them invisible to all instances of the class.

HISTORY

I wrote TPath initially in Java (http://dfhoughton.org/treepath/) because I wanted a more convenient way to select nodes from parse trees. I've re-written it in Perl because I figured it might be handy and why not?

ACKNOWLEDGEMENTS

Thanks to Damian Conway for Regexp::Grammars, which makes it pleasant to write complicated parsers, and the Moose Cabal, who make it pleasant to write elaborate object oriented Perl. Without the use of roles I don't think I would have tried this.

AUTHOR

David F. Houghton <dfhoughton@gmail.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by David F. Houghton.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.

2 POD Errors

The following errors were encountered while parsing the POD:

Around line 333:

Unknown directive: =over2

Around line 341:

=back without =over