NAME

HTML::HTML5::Microdata::Parser - Parse HTML5 Microdata with Perl

VERSION

0.01

SYNOPSIS

  use HTML::HTML5::Microdata::Parser;
  
  my $parser = HTML::HTML5::Microdata::Parser->new($html, $baseURI);
  $parser->consume;
  my $graph  = $parser->graph;

DESCRIPTION

This package aims to provide an API roughly compatible with that of RDF::RDFa::Parser.

Microdata is an experimental metadata format, not in wide use. Use this module at your own risk.

$p = HTML::HTML5::Microdata::Parser->new($html, $baseuri, \%options, $storage)

This method creates a new HTML::HTML5::Microdata::Parser object and returns it.

The $html variable may contain an HTML or XHTML string, or an XML::LibXML::Document. If it is a string, the document is parsed using HTML::HTML5::Parser and HTML::HTML5::Sanity, which may throw an exception. HTML::HTML5::Microdata::Parser does not catch that exception.

The base URI is used to resolve relative URIs found in the document.

Options [default in brackets]:

  * alt_stylesheet  - Magic rel="alternate stylesheet". [1]
  * auto_config     - See section "Auto Config". [0]
  * mhe_lang        - Process <meta http-equiv=Content-Language>.
                      [1]
  * prefix_empty    - URI prefix for itemprops of untyped items.
                      [undef]
  * tdb_service     - Use thing-described-by.org when possible. [0]
  * xhtml_base      - Process <base href> element. [1]
  * xhtml_lang      - Process @lang. [1]
  * xhtml_time      - Process <time> element nicely. [0]
  * xml_lang        - Process @xml:lang. [1]

$storage is an RDF::Trine::Storage object. If undef, then a new temporary store is created.

$p->xhtml

Returns the HTML source of the document being parsed.

$p->uri

Returns the base URI of the document being parsed. This will usually be the same as the base URI provided to the constructor, but may differ if the document contains a <base> HTML element.

Optionally, it may be passed an absolute or relative URI, in which case it returns that URI in absolute form, resolved against the document's base URI.

These may seem like two unrelated functions, but they are consistent: a relative URI consisting of a zero-length string resolves to the base URI itself.
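
The resolution behaviour described above can be illustrated with the separate URI module from CPAN. This is only a sketch: it assumes URI is installed, the base URI and the helper name resolve are made up for the example, and it is not necessarily how HTML::HTML5::Microdata::Parser resolves URIs internally.

```perl
use strict;
use warnings;
use URI;

# Assumed base URI, standing in for the document's base.
my $base = 'http://example.com/docs/page.html';

# Hypothetical helper: resolve a reference against the base URI.
sub resolve {
    my ($ref) = @_;
    return URI->new_abs($ref, $base)->as_string;
}

print resolve('other.html'), "\n";   # a relative reference
print resolve('/top'), "\n";         # an absolute-path reference
print resolve(''), "\n";             # zero-length reference: the base itself
```

The last call shows why a single method can serve both purposes: resolving an empty reference yields the base URI.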

$p->dom

Returns the parsed XML::LibXML::Document.

$p->set_callbacks(\&func1, \&func2)

Set callbacks for handling RDF triples extracted from the document. The first function is called when a triple is generated taking the form of (resource, resource, resource). The second function is called when a triple is generated taking the form of (resource, resource, literal).

The parameters passed to the first callback function are:

  • A reference to the HTML::HTML5::Microdata::Parser object

  • A reference to the XML::LibXML element being parsed

  • Subject URI or bnode

  • Predicate URI

  • Object URI or bnode

The parameters passed to the second callback function are:

  • A reference to the HTML::HTML5::Microdata::Parser object

  • A reference to the XML::LibXML element being parsed

  • Subject URI or bnode

  • Predicate URI

  • Object literal

  • Datatype URI (possibly undef or '')

  • Language (possibly undef or '')

In place of either or both functions, you can use the string 'print', which sets the callback to a built-in function that prints the triples to STDOUT as Turtle. Either or both can be set to undef, in which case no callback is called when a triple is found.

Beware that for literal callbacks, sometimes both a datatype *and* a language will be passed. (This goes beyond the normal RDF data model.)

If used, set_callbacks must be called before consume.
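
As a rough sketch of what a pair of callbacks might look like, the following defines two handlers that take their arguments in the order listed above and collect the triples into an array. The handler names, the @triples array and the sample values are assumptions made for illustration. The handlers are called directly with stand-in arguments so that the example runs without a parser object.

```perl
use strict;
use warnings;

# Collected triples, one hash reference per triple.
my @triples;

# First callback: (resource, resource, resource) triples.
sub on_resource_triple {
    my ($parser, $element, $subject, $predicate, $object) = @_;
    push @triples, {
        subject   => $subject,
        predicate => $predicate,
        object    => $object,
        kind      => 'resource',
    };
    return;
}

# Second callback: (resource, resource, literal) triples.
sub on_literal_triple {
    my ($parser, $element, $subject, $predicate, $literal, $datatype, $lang) = @_;
    push @triples, {
        subject   => $subject,
        predicate => $predicate,
        object    => $literal,
        kind      => 'literal',
        # Datatype and language may each be undef or ''; normalise both to undef.
        datatype  => (defined $datatype && length $datatype) ? $datatype : undef,
        language  => (defined $lang && length $lang) ? $lang : undef,
    };
    return;
}

# With a real parser object $p, the handlers would be registered first:
#   $p->set_callbacks(\&on_resource_triple, \&on_literal_triple);
#   $p->consume;

# Stand-in calls with made-up values, to show the argument order only.
on_resource_triple(undef, undef, '_:item1',
    'http://example.com/vocab#knows', 'http://example.com/bob');
on_literal_triple(undef, undef, '_:item1',
    'http://example.com/vocab#name', 'Alice', '', 'en');

printf "%s %s %s (%s)\n", @{$_}{qw(subject predicate object kind)} for @triples;
```

Because a literal callback may receive both a datatype and a language, the handler stores each separately and leaves it to the caller to decide which one to keep.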

$p->consume

The document is parsed for Microdata. Nothing of interest is returned by this function, but the triples extracted from the document are passed to the callbacks as each one is found.

$p->graph()

This method will return an RDF::Trine::Model object with all statements of the full graph.

Call consume before calling graph; otherwise you will just get an empty graph.

AUTO CONFIG

HTML::HTML5::Microdata::Parser has a lot of different options that can be switched on and off. Sometimes it might be useful to allow the page being parsed to control some of the options. If you switch on the 'auto_config' option, pages can do this.

A page can set options using a specially crafted <meta> tag:

  <meta name="http://search.cpan.org/dist/HTML-HTML5-Microdata-Parser/#auto_config"
     content="alt_stylesheet=0&amp;prefix_empty=http://example.net/" />

Note that the content attribute is an application/x-www-form-urlencoded string (which must then be HTML-escaped of course). Semicolons may be used instead of ampersands, as these tend to look nicer:

  <meta name="http://search.cpan.org/dist/HTML-HTML5-Microdata-Parser/#auto_config"
     content="alt_stylesheet=0;prefix_empty=http://example.net/" />

Any option allowed in the constructor may be given using auto config, except 'auto_config' itself.
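
To make the format of the content attribute concrete, here is a minimal sketch of how such a string could be decoded into an options hash. The function name parse_auto_config is hypothetical and this is not the module's own implementation. It assumes the HTML parser has already unescaped the attribute value (so &amp;amp; has become &amp;).

```perl
use strict;
use warnings;

# Hypothetical helper: decode an auto config content string into a hash ref.
sub parse_auto_config {
    my ($content) = @_;
    my %options;

    # Pairs are separated by ampersands or semicolons.
    for my $pair (split /[&;]/, $content) {
        next unless length $pair;
        my ($key, $value) = split /=/, $pair, 2;
        $value = '' unless defined $value;

        # Undo application/x-www-form-urlencoded escaping on both parts.
        for ($key, $value) {
            tr/+/ /;                               # '+' encodes a space
            s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;   # percent-decoding
        }

        # 'auto_config' itself may not be set from within the page.
        next if $key eq 'auto_config';
        $options{$key} = $value;
    }
    return \%options;
}

my $opts = parse_auto_config(
    'alt_stylesheet=0;prefix_empty=http://example.net/;auto_config=1'
);
print "$_ = $opts->{$_}\n" for sort keys %$opts;
```

The sketch silently skips the auto_config key, mirroring the rule that this one option cannot be set by the page.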

SEE ALSO

XML::LibXML, RDF::Trine, RDF::RDFa::Parser, HTML::HTML5::Parser, HTML::HTML5::Sanity.

http://www.perlrdf.org/.

AUTHOR

Toby Inkster <tobyink@cpan.org>.

COPYRIGHT AND LICENSE

Copyright (C) 2009-2010 by Toby Inkster

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.1 or, at your option, any later version of Perl 5 you may have available.