NAME

XML::RSS::Tools - A tool-kit providing a wrapper around an HTTP client, an RSS parser, and an XSLT engine.

VERSION

This documentation refers to XML::RSS::Tools version 0.33

SYNOPSIS

  use feature 'say';
  use XML::RSS::Tools;
  my $rss_feed = XML::RSS::Tools->new;
  $rss_feed->rss_uri( 'http://foo/bar.rdf' );
  $rss_feed->xsl_file( '/my/rss_transformation.xsl' );
  $rss_feed->transform;
  say $rss_feed->as_string;

DESCRIPTION

RSS/RDF feeds are a commonly available way of distributing or syndicating the latest news about a given web site. Weblog (blog) sites in particular are prolific generators of RSS feeds. This module provides a VERY high level way of manipulating them. You can easily use LWP, XML::RSS and XML::LibXSLT to do this yourself, but this module is a wrapper around these modules, allowing for the simple creation of an RSS client.

When working with XML, if the file is invalid for some reason this module will croak, bringing your application down. If you wish your program to fail gracefully, enclose calls to methods that deal with XML manipulation in an eval statement.

Otherwise method calls will return true on success and false on failure. For example, after loading a URI via HTTP, you may wish to check the error status before proceeding with your code:

  unless ( $rss_feed->rss_uri( 'http://this.goes.nowhere/' ) ) {
    say "Unable to obtain file via HTTP: ", $rss_feed->as_string( 'error' );
    # Do what else
    # you have to.
  } else {
    # carry on...
  }
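
The eval advice and the return-value check above can be combined in one place. The following is a minimal sketch, not part of the module: load_feed_safely is a hypothetical helper, and it assumes only the rss_string and as_string methods described in this document.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of XML::RSS::Tools. $feed is assumed to be
# an object offering the rss_string and as_string methods documented here.
sub load_feed_safely {
    my ( $feed, $xml ) = @_;

    # Trap the fatal error raised when the XML cannot be parsed.
    my $loaded;
    my $lived = eval { $loaded = $feed->rss_string($xml); 1 };
    return ( 0, "Fatal XML error: $@" ) unless $lived;

    # A false return value signals an ordinary, non-fatal failure.
    unless ($loaded) {
        my $reason = $feed->as_string('error');
        $reason = 'unknown error' unless defined $reason && length $reason;
        return ( 0, "Load failed: $reason" );
    }

    return ( 1, q{} );
}
```

A caller might then write my ( $ok, $why ) = load_feed_safely( $rss_feed, $xml ); and report $why whenever $ok is false.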

Check the HTML documentation for extra examples and background.

CONSTRUCTOR

new

  my $rss_object = XML::RSS::Tools->new;

Or with optional parameters.

  my $rss_object = XML::RSS::Tools->new(
    version     => 0.91,
    http_client => "lwp",
    auto_wash   => 1,
    debug       => 1);

The module will die if it's created with invalid parameters.

SUBROUTINES/METHODS

Source RSS feed

  $rss_object->rss_file( '/my/file.rss' );
  $rss_object->rss_uri( 'http://my.server.com/index.rss' );
  $rss_object->rss_uri( 'file:/my/file.rss' );
  $rss_object->rss_string( $xml_file );
  $rss_object->rss_fh( $file_handle );

All return true on success, false on failure. If an XML file was provided but was invalid XML, the parser will fail fatally at this time. The input RSS feed will automatically be normalised to the preferred RSS version at this time. Choose your version before you load it!

As of URI version 1.32, the way that URIs are mapped has changed slightly, which may result in erroneous file locations. The variable $URI::file::DEFAULT_AUTHORITY should be set to undef in versions later than 1.32 to revert their behaviour to that of the older version; see the URI changes file for more details.
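
The effect of that variable can be seen with URI::file on its own. This is a small sketch rather than anything the module does for you: file_uri_for is a hypothetical helper, and the 'unix' argument is passed only to keep the result independent of the host operating system.

```perl
use strict;
use warnings;
use feature 'say';
use URI::file;

# Hypothetical helper: build a file URI for a path while DEFAULT_AUTHORITY
# is temporarily set to the given value. local restores it afterwards.
sub file_uri_for {
    my ( $path, $authority ) = @_;
    local $URI::file::DEFAULT_AUTHORITY = $authority;
    return URI::file->new( $path, 'unix' )->as_string;
}

say file_uri_for( '/my/file.rss', q{} );      # empty authority, the newer default
say file_uri_for( '/my/file.rss', undef );    # no authority, the older behaviour
```

With URI 1.32 or later installed, the first call should give the file:///my/file.rss form and the second the file:/my/file.rss form used in the examples above.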

Source XSL Template

  $rss_object->xsl_file( '/my/file.xsl' );
  $rss_object->xsl_uri( 'http://my.server.com/index.xsl' );
  $rss_object->xsl_uri( 'file:/my/file.xsl' );
  $rss_object->xsl_string( $xml_file );
  $rss_object->xsl_fh( $file_handle );

All return true on success, false on failure. The XSLT file is NOT parsed or verified at this time.

Other Methods

transform

  $rss_object->transform( );

Performs the XSL transformation on the source RSS file with the loaded XSLT file.

as_string

  $rss_object->as_string;

Returns the RSS file after it's been through the XSLT process. Optionally you can pass this method one additional parameter to obtain the source RSS, the XSL Template or any error message:

  $rss_object->as_string( 'xsl' );
  $rss_object->as_string( 'rss' );
  $rss_object->as_string( 'error' );

If there is nothing to stringify you will get nothing.
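
When debugging, it can help to gather all three optional views at once. The sketch below assumes only the as_string behaviour described above; describe_state is a hypothetical helper, and the '(nothing)' placeholder is an illustrative choice.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of XML::RSS::Tools. Collects the three
# optional views from as_string, substituting a placeholder where the
# method returns nothing (undef or an empty string).
sub describe_state {
    my ($feed) = @_;
    my %state;
    for my $part (qw( rss xsl error )) {
        my $text = $feed->as_string($part);
        $state{$part} = ( defined $text && length $text ) ? $text : '(nothing)';
    }
    return \%state;
}
```

After a failed transform, for instance, describe_state( $rss_object )->{error} would hold the last error message, if the module recorded one.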

debug

  $rss_object->debug( 1 );

A simple switch that controls the debug status of the module. By default debug is off. Returns the current status. With debug on you will get more warnings sent to stderr.

set_auto_wash and get_auto_wash

  $rss_object->set_auto_wash( 1 );
  $rss_object->get_auto_wash;

If auto_wash is true, then all RSS files are cleaned before RSS normalisation to replace known entities by their numeric value and fix known invalid XML constructs. By default auto_wash is set to true.

set_version

  $rss_object->set_version(0.92);

All incoming RSS feeds are automatically converted to one default RSS version. If RSS version is set to 0 then normalisation is not performed. The default RSS version is 0.91.

get_version

  $rss_object->get_version;

Returns the default RSS version.

set_http_client and get_http_client

  $rss_object->set_http_client('lwp');
  $rss_object->get_http_client;

These methods set the HTTP client to use and get back the one selected. Acceptable values are:

  • auto

    Will attempt to use the HTTP client modules in order of performance.

  • curl

    Balint Szilakszi's libcurl based WWW::Curl::Easy.

  • ghttp

    Matt Sergeant's libghttp based HTTP::GHTTP.

  • lite

    Roy Hooper's pure Perl HTTP::Lite client. Slower than ghttp, but still faster than lwp.

  • lwp

    LWP is the Rolls-Royce solution: it can do everything, but it's rather big, so it's slow to load, and it's not exactly fast. It is, however, far more common and by far the most complete.

If set to auto, the module will try WWW::Curl::Easy, then HTTP::GHTTP, then HTTP::Lite, and finally LWP to retrieve files on the Internet. LWP is the slowest option, but it is far more common than all the others, so this method allows you to specify which client to use if you wish.

set_http_proxy and get_http_proxy

If you are connected to the Internet via an HTTP proxy, then you can pass your HTTP Proxy details to the HTTP clients.

  $rss_object->set_http_proxy(proxy_server => "http://proxy.server.com:3128/");

You may also pass BASIC authentication details through if you need.

  $rss_object->set_http_proxy(
    proxy_server => "http://proxy.server.com:3128/",
    proxy_user   => "username",
    proxy_pass   => "password");

If you need to recover the proxy settings there is also the get_http_proxy command which returns the proxy and BASIC authentication details as a single URI.

    say $rss_object->get_http_proxy;
    # username:password@http://proxy.server.com:3128/

set_xml_catalog

Set the XML catalog. See below.

get_xml_catalog

Return the XML catalog in use.

XML Catalog

To speed up large scale XML processing it is advised to create an XML Catalog (sic) so that the XML parser does not have to make slow and expensive requests to files on the Internet. The catalogue contains details of the DTD and external entities so that they can be retrieved from the local file system more quickly and at lower load than from the Internet. If XML processing is being carried out on a system not connected to the Internet, the libxml2 parser will still attempt to connect to the Internet, which will add a delay of about 60 seconds per XML file. If a catalogue is created then the process will be much quicker, as the libxml2 parser will use the local information stored in the catalogue.

    $rss_object->set_xml_catalog( $xml_catalog_file );

This will pass the specified file to the XML parsers to use as a local XML Catalog. If your version of XML::LibXML does not support XML Catalogs, the module will die if you attempt to use this method (see below).

    $rss_object->get_xml_catalog;

This will return the file name of the XML Catalog in use.

Depending upon how your core libxml2 library is compiled, you should also be able to use pre-configured XML Catalog files stored in your /etc/xml/catalog.

XML Catalog support was introduced in version 2.4.3 of libxml2, and significantly revised in version 2.4.7. Support for XML Catalog was introduced into version 1.53 of the XML::LibXML module. Therefore for XML Catalog support your libxml2 library should be version 2.4.3 or better and your XML::LibXML should be version 1.53 or better. However, there appear to be bugs in some of the later versions of XML::LibXML; at this time I do not know which versions work correctly and which do not. Please bear this in mind if you wish to use XML Catalogs.
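
Because set_xml_catalog can die on installations without catalog support, a program that treats the catalog as optional may prefer to guard the call. The following is a sketch under that assumption; use_catalog_if_possible is a hypothetical helper and the readability check is an extra precaution, not something the module requires.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of XML::RSS::Tools. Tries to register a
# catalog file and returns 1 on success; otherwise it warns and returns 0
# so the caller can carry on without a catalog.
sub use_catalog_if_possible {
    my ( $feed, $catalog_file ) = @_;

    unless ( -r $catalog_file ) {
        warn "XML Catalog $catalog_file is not readable, skipping\n";
        return 0;
    }

    # set_xml_catalog may die when XML::LibXML lacks catalog support.
    my $set;
    my $lived = eval { $set = $feed->set_xml_catalog($catalog_file); 1 };
    unless ($lived) {
        warn "XML Catalog not usable here: $@";
        return 0;
    }

    return $set ? 1 : 0;
}
```

A typical call would be use_catalog_if_possible( $rss_object, '/etc/xml/catalog' ); made before any RSS or XSLT files are loaded.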

PREREQUISITES

To function you must have URI installed. If you plan to normalise your RSS data before transforming you must also have XML::RSS installed. To transform any RSS files to HTML you will also need to use XML::LibXSLT and XML::LibXML.

One of HTTP::GHTTP, HTTP::Lite or LWP will bring this module to full functionality. GHTTP is much faster than LWP, but it is not as widely available as LWP. By default GHTTP will be used if it is available, then Lite, finally LWP. If you have two or more installed you may manually select which one you wish to use.

Any OS able to run the core requirements.

EXPORT

None.

HISTORY

0.33 More minor changes, tested on more modern perls. More modern build.

0.32 Minor build and kwalitee tweaks. No actual module code changes since version 0.30

...

0.01 Initial Build. Shown to the public on PerlMonks May 2002, for feedback.

See CHANGES file.

BUGS AND LIMITATIONS

  • External Entities

    If an RSS or XSLT file is passed into LibXML and it contains references to external files, such as a DTD or external entities, LibXML will automatically attempt to obtain the files before performing the transformation. If the files referred to are on the public Internet and you do not have a connection when this happens, you may find that the process waits around for several minutes until LibXML gives up. If you plan to use this module in an asynchronous manner, you should set up an XML Catalog for LibXML using the xmlcatalog command. See: http://www.xmlsoft.org/catalog.html for more details. You can pass your catalog into the module and a local copy will then be used rather than the one on the Internet.

  • Defective XML

    Many commercial RSS feeds are derived from the Content Management System in use at the site. Often the RSS feed is not well formed and is thus invalid. This will prevent the RSS parser and/or XSLT engine from functioning and you will get no output. The auto_wash option attempts to fix these errors, but it is neither perfect nor ideal. Some people report good success with complaining to the site. Mark Pilgrim estimates that about 10% of RSS feeds have defective XML.

  • XML::RSS Limitations

    XML::RSS up to and including version 0.96 has a number of defects. The module is currently being maintained by Shlomi Fish. See http://perl-rss.sourceforge.net/ and http://svn.perl.org/modules/XML-RSS/

    Since version 1.xx most problems have been fixed, please upgrade if you can.

  • Build Problems

    There are alas quite a lot of differences between differing versions of libxml2/libxslt and XML::LibXML/LibXSLT which makes writing definitive tests hard. Some failures are false positives, some successes may be false negatives. Feedback welcomed.

To Do

  • Support Atom feeds.

  • Import Proxy settings from environment.

  • Turn on proxy support for WWW::Curl

AUTHOR

Adam Trickett, <atrickett@cpan.org>

This module contains the direct and indirect input of a number of friendly Perl Hackers on Perlmonks/use.perl: Ovid; Matts; Merlyn; hfb; link; Martin and more...

SEE ALSO

perl, XML::RSS, XML::LibXSLT, XML::LibXML, XML::RSS::LibXML, URI, LWP, XML::Feed.

This module is not an aggregator tool; for that I suggest you investigate Plagger.

LICENSE AND COPYRIGHT

This version as XML::RSS::Tools, Copyright Adam John Trickett 2002-2014

OSI Certified Open Source Software. Free Software Foundation Free Software.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

DEDICATION

This module is dedicated to my beloved mother who believed in me, even when I didn't.