NAME

HTML::Object::XPath - HTML Object XPath Class

SYNOPSIS

    use HTML::Object;
    use HTML::Object::XQuery;
    use HTML::Object::XPath;
    my $this = HTML::Object::XPath->new || die( HTML::Object::XPath->error, "\n" );

    my $p = HTML::Object->new;
    my $doc = $p->parse_file( $path_to_html_file ) || die( $p->error );
    # Returns a list of HTML::Object::Element objects matching the selector,
    # which is converted into an XPath expression
    my @nodes = $doc->find( 'p' );

    # or directly:
    use HTML::Object::XPath;
    my $xp = HTML::Object::XPath->new;
    my @nodes = $xp->findnodes( $xpath, $element_object );

VERSION

    v0.2.0

DESCRIPTION

This module implements the XPath engine used by HTML::Object::XQuery to provide a jQuery-like interface to query the parsed DOM object.

METHODS

clear_namespaces

Clears all previously set namespace mappings.

exists

Provided with a path and a context, this returns true if the given path exists.

findnodes

Provided with a path and an optional context, this returns the list of nodes matched by the path, evaluated relative to the context when one is given.

In scalar context it returns an HTML::Object::XPath::NodeSet object.
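As a rough illustration of the two calling contexts, the sketch below writes a small HTML file, parses it with parse_file as in the SYNOPSIS, and calls findnodes once in list context and once in scalar context. The sample markup, the temporary file and the variable names are assumptions made for this example; size is the NodeSet method mentioned under find.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );
use HTML::Object;
use HTML::Object::XQuery;
use HTML::Object::XPath;

# Hypothetical sample document, written to a temporary file for parse_file()
my( $fh, $file ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$fh} "<html><body><p>One</p><p>Two</p></body></html>\n";
close( $fh );

my $p   = HTML::Object->new;
my $doc = $p->parse_file( $file ) || die( $p->error );
my $xp  = HTML::Object::XPath->new || die( HTML::Object::XPath->error, "\n" );

# List context: a plain list of the matching nodes
my @nodes = $xp->findnodes( '//p', $doc );

# Scalar context: a single HTML::Object::XPath::NodeSet object
my $set = $xp->findnodes( '//p', $doc );

printf( "list context: %d node(s)\n", scalar( @nodes ) );
printf( "scalar context: %s with %d node(s)\n", ref( $set ), $set->size );
```

Both calls evaluate the same path; only the calling context changes what is handed back.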

findnodes_as_string

Provided with a path and a context, this returns the nodes found as a single string. The result is not guaranteed to be valid HTML (for example, it could be plain text if the query returns attribute values).

findnodes_as_strings

Provided with a path and a context, this returns the nodes found as a list of strings, one per node.

findvalue

Provided with a path and a context, this returns the result as a string (the concatenation of the values of the result nodes).

findvalues

Provided with a path and a context, this returns the values of the result nodes as a list of strings.

matches($node, $path, $context)

Provided with a node object, a path and a context, this returns true if the node matches the path.

find

Provided with a path and a context, this returns either an HTML::Object::XPath::NodeSet object containing the nodes found (empty if no node matched the path), or one of HTML::Object::XPath::Literal (a string), HTML::Object::XPath::Number, or HTML::Object::XPath::Boolean. It should always return something, and you can use ->isa() to find out what it returned. To check how many nodes were found, use $nodeset->size.

See HTML::Object::XPath::NodeSet.
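The ->isa() check described above could look roughly like the following sketch. It assumes the same parse_file setup as the SYNOPSIS; the sample markup, the temporary file and the describe_result helper are hypothetical names introduced for this example, and count() is the standard XPath 1.0 function, which is assumed to be available here.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );
use HTML::Object;
use HTML::Object::XQuery;
use HTML::Object::XPath;

# Hypothetical sample document, written to a temporary file for parse_file()
my( $fh, $file ) = tempfile( SUFFIX => '.html', UNLINK => 1 );
print {$fh} "<html><body><p>One</p><p>Two</p></body></html>\n";
close( $fh );

my $p   = HTML::Object->new;
my $doc = $p->parse_file( $file ) || die( $p->error );
my $xp  = HTML::Object::XPath->new || die( HTML::Object::XPath->error, "\n" );

# Hypothetical helper: report which kind of object find() returned
sub describe_result
{
    my $res = shift( @_ );
    return( 'node set of size ' . $res->size ) if( $res->isa( 'HTML::Object::XPath::NodeSet' ) );
    foreach my $type ( qw( Literal Number Boolean ) )
    {
        return( lc( $type ) ) if( $res->isa( "HTML::Object::XPath::$type" ) );
    }
    return( 'unknown: ' . ref( $res ) );
}

# A location path yields a node set; count() is assumed to yield a number object
foreach my $path ( '//p', 'count(//p)' )
{
    my $res = $xp->find( $path, $doc );
    printf( "%-12s => %s\n", $path, describe_result( $res ) );
}
```

Dispatching on the class keeps the caller independent of how each result type stringifies.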

get_namespace ($prefix, $node)

Provided with a prefix and a node object, this returns the URI associated with the prefix for that node (mostly for internal use).

get_var

Provided with a variable name, this returns the value of the XPath variable (mostly for internal use).

getNodeText

Provided with a path, this returns the text string of the node at that path, or undef if the node does not exist.

namespaces

Sets or gets a hash reference of namespace attributes.

new_expr

Creates a new HTML::Object::XPath::Expr object, passing it whatever arguments were provided, and returns the newly instantiated object, or undef upon error.

new_function

Creates a new HTML::Object::XPath::Function object, passing it whatever arguments were provided, and returns the newly instantiated object, or undef upon error.

new_literal

Creates a new HTML::Object::XPath::Literal object, passing it whatever arguments were provided, and returns the newly instantiated object, or undef upon error.

new_location_path

Creates a new HTML::Object::XPath::LocationPath object, passing it whatever arguments were provided, and returns the newly instantiated object, or undef upon error.

new_nodeset

Creates a new HTML::Object::XPath::NodeSet object, passing it whatever arguments were provided, and returns the newly instantiated object, or undef upon error.

new_number

Creates a new HTML::Object::XPath::Number object, passing it whatever arguments were provided, and returns the newly instantiated object, or undef upon error.

new_root

Creates a new HTML::Object::XPath::Root object, passing it whatever arguments were provided, and returns the newly instantiated object, or undef upon error.

new_step

Creates a new HTML::Object::XPath::Step object, passing it whatever arguments were provided, and returns the newly instantiated object, or undef upon error.

new_variable

Creates a new HTML::Object::XPath::Variable object, passing it whatever arguments were provided, and returns the newly instantiated object, or undef upon error.

set_namespace

Provided with a prefix and a URI, this maps the namespace prefix to that URI.

Normally in HTML::Object::XPath the prefixes in XPath node tests take their context from the current node. This means that foo:bar will always match an element <foo:bar> regardless of the namespace that the prefix foo is mapped to (which might even change within the document, resulting in unexpected results). In order to make prefixes in XPath node tests actually map to a real URI, you need to enable that via a call to the set_namespace method of your HTML::Object::XPath object.

parse

Provided with an XPath expression, this returns a new HTML::Object::XPath::Expr object that can then be used repeatedly.

You can create an XPath expression from a CSS selector expression using HTML::Selector::XPath.

set_strict_namespaces

Takes a boolean value.

By default, for historical as well as convenience reasons, HTML::Object::XPath has a slightly non-standard way of dealing with the default namespace.

If you search for //tag, it returns the elements named tag. As far as I understand it, if the document has a default namespace, this search should not return anything: you would first have to call set_namespace, and then search using the namespace prefix.

Passing a true value to set_strict_namespaces activates this stricter, standard behaviour; passing a false value restores the default behaviour.

set_var

Provided with a variable name and its value, this sets an XPath variable (which can be used in queries as $var).

NODE STRUCTURE

All nodes have the same first 2 entries in the array: node_parent and node_pos. The type of the node is determined using the ref() function.

The node_parent always contains an entry for the parent of the current node - except for the root node which has undef in there. And node_pos is the position of this node in the array that it is in (think: $node == $node->[node_parent]->[node_children]->[$node->[node_pos]] )

Nodes are structured as follows:

Root Node

The root node is just an element node with no parent.

    [
      undef, # node_parent - check for undef to identify root node
      undef, # node_pos
      undef, # node_prefix
      [ ... ], # node_children (see below)
    ]

Element Node

    [
      $parent, # node_parent
      <position in current array>, # node_pos
      'xxx', # node_prefix - namespace prefix on this element
      [ ... ], # node_children
      'yyy', # node_name - element tag name
      [ ... ], # node_attribs - attributes on this element
      [ ... ], # node_namespaces - namespaces currently in scope
    ]

Attribute Node

    [
      $parent, # node_parent - the element node
      <position in current array>, # node_pos
      'xxx', # node_prefix - namespace prefix on this attribute
      'href', # node_key - attribute name
      'ftp://ftp.com/', # node_value - value in the node
    ]

Text Nodes

    [
      $parent,
      <pos>,
      'This is some text' # node_text - the text in the node
    ]

Comment Nodes

    [
      $parent,
      <pos>,
      'This is a comment' # node_comment
    ]
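The layout described in this section can be sketched in plain Perl as follows. The index constants are hypothetical names chosen to mirror the entry order listed above; they are defined locally for illustration and are not imported from the module. Real nodes would also be blessed into their classes so that ref() can identify their type, which this sketch leaves out.

```perl
use strict;
use warnings;

# Hypothetical index constants following the entry order documented above
use constant {
    node_parent   => 0,
    node_pos      => 1,
    node_prefix   => 2,
    node_children => 3,
    node_name     => 4,
    node_text     => 2,   # text nodes keep their text in the third entry
};

# Root node: no parent, no position, no prefix, an empty child list
my $root = [ undef, undef, undef, [] ];

# Element node <p>, stored as the first child of the root
my $para = [ $root, 0, undef, [], 'p', [], [] ];
push( @{ $root->[node_children] }, $para );

# Text node, stored as the first child of <p>
my $text = [ $para, 0, 'This is some text' ];
push( @{ $para->[node_children] }, $text );

# The invariant from above: a node is found at node_pos in its parent's children
my $same = ( $text == $text->[node_parent]->[node_children]->[ $text->[node_pos] ] );
print( $same ? "invariant holds\n" : "invariant broken\n" );

# The root node is recognised by its undefined parent entry
my $is_root = !defined( $root->[node_parent] );
print( "root detected\n" ) if( $is_root );
```

Comparing the two references with == checks that they point to the same array, which is what the invariant requires.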

AUTHOR

Jacques Deguest <jack@deguest.jp>

SEE ALSO

HTML::Object::XPath::Boolean, HTML::Object::XPath::Expr, HTML::Object::XPath::Function, HTML::Object::XPath::Literal, HTML::Object::XPath::LocationPath, HTML::Object::XPath::NodeSet, HTML::Object::XPath::Number, HTML::Object::XPath::Root, HTML::Object::XPath::Step, HTML::Object::XPath::Variable

Mozilla documentation

COPYRIGHT & LICENSE

Copyright(c) 2021 DEGUEST Pte. Ltd.

You can use, copy, modify and redistribute this package and associated files under the same terms as Perl itself.