NAME

Article - Processing RSS Files with XSLT

SYNOPSIS

Rich Site Summary (RSS) files are traditionally converted to HTML for display in a browser by brute force: the XML is parsed into a data structure in Perl (or any other language), manipulated, and converted back to markup. A more efficient approach is to use XSLT, a language designed to transform XML. Perl then deals only with the logistics of obtaining the source file and displaying the results.

INTRODUCTION

A number of years ago I lived in Southern California. At the time the best place to get news was via National Public Radio, Public Broadcasting Service or the BBC World Service. Unfortunately, none of them had particularly good reception where I lived, and so I was forced to turn to the web for news of home. In those boom-time days, it seemed every web site was a portal, whether it made business sense or not. This helped me build a custom news page with British and European news, along with world business and technology news.

Returning to Europe I abandoned my academic training and rushed headlong into enterprise web content management working for one of the pioneering XML companies. We were so pioneering that we never made a profit and as the boom turned to bust the company imploded. My thoughts again turned to syndicated news and portals--How was it done? Was it easy? Could we use it to build a web site with multiple-content channels that was easy for a small company to build and maintain?

RSS Basics

Rich Site Summary (RSS) files allow people to syndicate a web site. RSS is an initialism for one of several possible phrases: Rich Site Summary, Really Simple Syndication, or Resource Description Framework Site Summary. As RSS evolved, its meaning shifted to match its evolving abilities.

An RSS file is written in the eXtensible Mark-up Language (XML) http://www.w3.org/XML/ and summarises the content of a section of a web site or of the complete site. XML is a descendant of the Standard Generalised Mark-up Language (SGML) http://www.iso.org/iso/en/CatalogueDetailPage.CatalogueDetail?CSNUMBER=16387 developed in the 1970s. XML was designed to be as simple as the Hyper-Text Markup Language (HTML) http://www.w3.org/MarkUp/ , which had proved popular as the basis of web sites, whereas SGML is complex to use and less popular. Typically, a site's content management system constructs RSS dynamically as stories and articles show up on the web site. Most webmasters place their RSS files on their web sites, and so the files are easy to download. Mirroring tools are ideal for the task, as they only download files if they have changed.
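A minimal sketch of the mirroring idea, assuming the LWP::Simple module used later in this article. The helper name refresh_feed and its return strings are hypothetical. mirror() sends an If-Modified-Since header when the local file already exists, so an unchanged feed is not downloaded again.

```perl
use strict;
use warnings;
use LWP::Simple qw(mirror is_success RC_NOT_MODIFIED);

# Hypothetical helper: refresh a local copy of a feed only if the
# remote file has changed since the last download.
sub refresh_feed {
    my ($url, $file) = @_;

    # mirror() returns the HTTP status code of the request
    my $status = mirror($url, $file);

    return 'unchanged' if $status == RC_NOT_MODIFIED;   # 304: local copy is current
    return 'updated'   if is_success($status);          # 2xx: new copy stored in $file
    return "failed ($status)";
}
```

A caller would pass the feed URL and a local file name, and could skip the transformation step whenever the result is 'unchanged'.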

Extensible Stylesheet Language Transformations

Extensible Stylesheet Language Transformations (XSLT) http://www.w3.org/TR/xslt is a World Wide Web Consortium (W3C) http://www.w3.org/ standard for converting XML documents to another format. An XSLT engine applies a stylesheet, itself written in XML and consisting of a number of rules, to convert the source XML to another format. A full introduction is beyond the scope of this paper, and I list many references at the end.

At the simplest level XSLT takes one XML document as a document-tree, and converts it to another format. The XSLT file is a list of transformation templates that apply to specific parts of the input tree. Each individual template may operate in isolation from the other rules, and together they produce a result tree.

As an example I shall transform a simple XML document into an HTML fragment using XSLT. Code listing 1 shows a simple XML document with a <statement> tag and a <footer> tag. I want to apply a stylesheet to it to produce the output in code listing 3.

In code listing 2, the first rule of my stylesheet tells the engine to start at the root of the tree, /. It outputs a <div> tag, then tells the engine to look in the tree for a path that matches root/statement from the current context. If the engine finds a match it calls the second template; if not, flow proceeds to the next line. The second template outputs a <p> tag, followed by the value of the current input tree node, Hello World!, then a </p> tag. Flow returns to the calling rule, which outputs a </div> tag.

1: An Example XML document

        <?xml version="1.0"?>
        <root>
          <statement>Hello World!</statement>
          <footer>Foo</footer>
        </root>

2: An Example Stylesheet Fragment

        <xsl:template match="/">
          <div>
          <xsl:apply-templates select="root/statement"/>
          </div>
        </xsl:template>
        <xsl:template match="statement">
          <p>
          <xsl:value-of select="."/>
          </p>
        </xsl:template>

3: Result of Transformation

        <div><p>Hello World!</p></div>

As with Perl, XSLT has "more than one way to do it", which can intimidate new users. Like Perl, XSLT is a very flexible language, so it is easy to write this stylesheet in a totally different manner and get exactly the same result. I often find other people's XSLT stylesheets very confusing, just as I did with other people's Perl scripts, but with time they do start to make sense.
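To illustrate the point, here is a sketch of a second stylesheet for the document in code listing 1. It uses a single template and pulls the value out with one XPath expression, and it should give the same result as code listing 3. The Perl wrapper assumes XML::LibXML and XML::LibXSLT are installed; the stylesheet wrapper element and the xsl:output settings are additions needed to make the fragment complete.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXSLT;

# The source document from code listing 1
my $source = XML::LibXML->load_xml(string => <<'XML');
<?xml version="1.0"?>
<root>
  <statement>Hello World!</statement>
  <footer>Foo</footer>
</root>
XML

# An alternative stylesheet: one template instead of two
my $style = XML::LibXML->load_xml(string => <<'XSL');
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
   xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="xml" omit-xml-declaration="yes"/>
<xsl:template match="/">
  <div><p><xsl:value-of select="root/statement"/></p></div>
</xsl:template>
</xsl:stylesheet>
XSL

my $stylesheet = XML::LibXSLT->new->parse_stylesheet($style);
my $html = $stylesheet->output_as_chars($stylesheet->transform($source));
print $html;
```

The two-template version scales better once a document has many kinds of element, while the single-template version is easier to read for a document this small.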

Using Perl

Code listing 4 uses the LWP::Simple module to retrieve XML files, and the libxslt-based XML::LibXSLT to transform them. The first command line argument to the script specifies the file to fetch, and the second specifies the XSLT stylesheet to use. Once LWP::Simple fetches the XML file, XML::LibXML and XML::LibXSLT convert it to HTML via XSLT. Perl provides the framework for the download and conversion, and the XSLT stylesheet provides the rules of the conversion.

4: Perl Example

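        #!/usr/bin/perl
        use strict;
        use warnings;
        use LWP::Simple;
        use XML::LibXML;
        use XML::LibXSLT;

        # Usage: script.pl <RSS URL> <XSLT stylesheet file>
        my $site = shift;
        my $xsl  = shift;

        # Fetch the RSS file; get() returns undef on failure
        my $rss = get($site);
        die "Could not fetch $site\n" unless defined $rss;

        my $xslt   = XML::LibXSLT->new;
        my $parser = XML::LibXML->new;

        # Parse the source and the stylesheet, then apply the transformation
        my $source_xml  = $parser->parse_string($rss);
        my $style_xsl   = $parser->parse_file($xsl);
        my $stylesheet  = $xslt->parse_stylesheet($style_xsl);
        my $transformed = $stylesheet->transform($source_xml);
        print $stylesheet->output_string($transformed);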

Problems with RSS

There are two RSS families, and they differ enough that I cannot easily use the same XSLT stylesheet on all of them. In theory, I should be able to convert one RSS file into another; in practice it is not that simple.

Netscape Communications developed the original RSS format, version 0.9, and UserLand later simplified it to create version 0.91. Independently, the RSS-DEV Working Group developed version 1.0, a new and incompatible format based on the W3C Resource Description Framework (RDF) http://www.w3.org/RDF/ core. UserLand http://backend.userland.com/rss , unhappy with the RDF-based RSS, continued to extend and expand RSS up to its current version, 2.0.

The RDF-based RSS format uses XML Namespaces http://www.w3.org/TR/REC-xml-names/ , which have their advantages, but make the document much more verbose and more difficult to transform with XSLT. A number of XSLT processor specific extensions simplify the transformation process; however, extensions are not universally supported, and therefore are not portable.
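The extra verbosity can be seen in a small sketch using XPath from Perl, assuming XML::LibXML is installed. The two tiny feeds are made up for illustration, and the prefixes rdf and rss are arbitrary local choices; only the namespace URIs matter. An XSLT stylesheet faces the same requirement, since it must declare a prefix for each namespace before it can match the elements.

```perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

# A made-up RSS 0.91 feed: no namespaces
my $rss091 = XML::LibXML->load_xml(string =>
    '<rss version="0.91"><channel><title>Plain</title></channel></rss>');

# A made-up RSS 1.0 feed: every element lives in a namespace
my $rss10 = XML::LibXML->load_xml(string => <<'XML');
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/">
  <channel rdf:about="http://example.org/"><title>Namespaced</title></channel>
</rdf:RDF>
XML

# RSS 0.91: a bare path is enough
my $plain = $rss091->findvalue('/rss/channel/title');

# RSS 1.0: each step needs a prefix bound to the right namespace URI
my $xpc = XML::LibXML::XPathContext->new($rss10);
$xpc->registerNs(rdf => 'http://www.w3.org/1999/02/22-rdf-syntax-ns#');
$xpc->registerNs(rss => 'http://purl.org/rss/1.0/');
my $namespaced = $xpc->findvalue('/rdf:RDF/rss:channel/rss:title');

# The same path without prefixes finds nothing in the RSS 1.0 feed
my $missed = $rss10->findvalue('/RDF/channel/title');

print "$plain / $namespaced / [$missed]\n";   # prints: Plain / Namespaced / []
```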

Many RSS files are automatically generated from badly written HTML by content management systems, and so the RSS file is often malformed. Some site editors correct their RSS feeds, but all too often there is nothing to be done but to accept that the incoming feed will be wrong.

The W3C XML specification requires a parser to abort processing if it encounters a document that is not well-formed. The major Perl XML parsers comply and will die in those cases. If a document is not well-formed, the parser cannot convert it to a tree, so transformation cannot start. This is a deliberate feature of XML, intended to prevent the ambiguity of the on-the-fly second-guessing characteristic of HTML parsers: most web browsers will read and display almost any form of HTML, no matter how badly formed it is.
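A short sketch of that behaviour, assuming XML::LibXML is installed. The helper name parse_or_report is hypothetical; it wraps the parse in an eval so that a feed that is not well-formed yields an error message instead of ending the program.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: return (document, undef) on success,
# or (undef, error message) if the parser dies.
sub parse_or_report {
    my ($xml) = @_;
    my $doc = eval { XML::LibXML->load_xml(string => $xml) };
    return ($doc, undef) if $doc;
    return (undef, "$@");
}

# A correctly escaped ampersand parses; a bare one makes the parser die
my ($ok)         = parse_or_report('<rss><channel><title>Fish &amp; chips</title></channel></rss>');
my (undef, $err) = parse_or_report('<rss><channel><title>Fish & chips</title></channel></rss>');

print "first feed is well-formed\n" if $ok;
print "second feed rejected: $err"  if $err;
```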

The XML::RSS module

To make it easier to transform any RSS file with a single stylesheet, I first convert all RSS files to my preferred version of RSS.

The XML::RSS http://perl-rss.sourceforge.net/ module interconverts Perl structures and RSS files. I can use it to convert a file in one RSS version into another version; however, the module has a number of problems and limitations, including one fatal flaw: as of version 0.97, it does not output properly escaped XML. Any '&amp;' in the source is incorrectly output as '&', a special character in XML that signals the start of an entity reference. The module should encode any literal '&' as '&amp;'.
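The escaping rule can be sketched with a single substitution in plain Perl. This is an illustration only, not the code used by XML::RSS or XML::RSS::Tools, and the helper name escape_bare_ampersands is hypothetical. It is a heuristic: any ampersand that does not already begin something shaped like an entity or character reference is escaped.

```perl
use strict;
use warnings;

# Hypothetical helper: escape every '&' that does not already start
# a reference such as &amp; &#38; or &#x26;
sub escape_bare_ampersands {
    my ($text) = @_;
    $text =~ s/&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)/&amp;/g;
    return $text;
}

print escape_bare_ampersands('Fish & chips, R&amp;D, AT&T'), "\n";
# prints: Fish &amp; chips, R&amp;D, AT&amp;T
```

Text that merely looks like a reference, such as '&T;', is left alone, so a real repair tool would still need to check the result with an XML parser.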

As a result of an email I sent to brian d foy regarding his recent article in this journal using XML::RSS, he took it upon himself to fix the module, and another project on SourceForge was born. In January 2003 the group released a much revised 1.x version, fixing most of the problems in the module.

XML::RSS::Tools

I wrote a module to fully automate RSS file conversion to HTML, while addressing the problems discussed above: poor source XML, multiple RSS versions and the XML::RSS escaping defect.

The XML::RSS::Tools module incorporates HTTP tools and an XSLT processor based on "The XML C library for Gnome", giving a complete tool-kit. Code listing 5 uses the module to download the file, transform it, and output the result in one step. It takes the same command line arguments as the earlier example: an RSS file location and an XSLT stylesheet.

I create an XML::RSS::Tools object by initialising the module to its default configuration. Inside an eval, I use the object to load the source file and the XSLT file, to transform the source, and to output the result as a string. I use an eval block in case an invalid RSS file causes the XML parser to die.

5: Using XML::RSS::Tools

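        #!/usr/bin/perl
        use strict;
        use warnings;
        use XML::RSS::Tools;

        # Usage: script.pl <RSS file location> <XSLT stylesheet>
        my $rss = XML::RSS::Tools->new;

        # Load, transform and print in one chain; the eval traps a parser die
        eval {
          print $rss->rss_file(shift)->xsl_file(shift)->transform->as_string;
        };
        print $rss->as_string('error') if ($@);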

The XSLT stylesheet in code listing 6 converts a single RSS feed (code listing 7) into an XHTML fragment. It starts with the standard XML and XSLT header details. I tell the processor to omit the XML declaration so that the fragment is easier to incorporate directly into an XHTML document. I select XML output and turn on indents to give a neater document.

The first template rule selects the XML root of the document, outputs a literal <div> tag, then applies the rss/channel rule, and outputs a </div> tag.

The rss/channel rule is where I process the details of the channel. I start by creating a number of variables, and populating them with the details to create an image link and the heading. I use an <xsl:if> to check if there is an image to link to, and if so populate an <img> tag with it. I create <h3> and <a> tags which link to the originating site. I output an <hr/> tag to separate the title, and then create an un-ordered list to put the individual story titles in. Inside the <ul></ul> tags I place an <xsl:apply-templates> command which inserts the content of each item. One of the many nice things about the XSLT language is that I do not need to know how many items there are in a given channel: the simple rule will find them all.

The <item> rule creates a pair of variables for the link, outputs a <li> tag, constructs an <a> tag, and closes with a literal </li> since this is XML.

Code listing 8 shows the XHTML fragment generated by this stylesheet.

6: Stylesheet to transform RSS to XHTML

        <?xml version="1.0" encoding="UTF-8"?>
        <xsl:stylesheet version="1.0"
           xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
           exclude-result-prefixes="xsl">
        <xsl:output method="xml" omit-xml-declaration="yes" indent="yes"/>
        <xsl:template match="/">
          <div>
          <xsl:apply-templates select="rss/channel"/>
          </div>
        </xsl:template>
        <xsl:template match="rss/channel">
          <xsl:variable name="link" select="link"/>
          <xsl:variable name="description" select="description"/>
          <xsl:variable name="image" select="image/url"/>
          <xsl:if test="$image">
            <img src="{$image}" style="float: right; margin: 2px;" />
          </xsl:if>
          <h3>
            <a href="{$link}" title="{$description}"><xsl:value-of select="title" /></a>
          </h3>
          <hr/>
          <ul><xsl:apply-templates select="item"/></ul>
        </xsl:template>
        <xsl:template match="item">
          <xsl:variable name="item_link" select="link"/>
          <xsl:variable name="item_title" select="description"/>
          <li>
            <a href="{$item_link}" title="{$item_title}"><xsl:value-of select="title"/></a>
          </li>
        </xsl:template>
        </xsl:stylesheet>

7: Sample Source RSS File

        <?xml version="1.0" encoding="UTF-8"?>
        <!DOCTYPE rss PUBLIC "-//Netscape Communications//DTD RSS 0.91//EN"
                    "http://my.netscape.com/publish/formats/rss-0.91.dtd">
        <rss version="0.91">
        <channel>
          <title>search.cpan.org</title>
          <link>http://search.cpan.org</link>
          <description>The CPAN search site</description>
          <language>en</language>
          <image>
            <title>searchDOTcpan</title>
            <url>http://search.cpan.org/s/img/cpanrdf.gif</url>
            <link>http://search.cpan.org</link>
            <description>All Modules, All the time</description>
          </image>
          <item>
            <title>Apache-Dynagzip-0.09</title>
            <link>http://search.cpan.org/author/SLAVA/Apache-Dynagzip-0.09</link>
          </item>
          <item>
            <title>MIME-Base64-2.16</title>
            <link>http://search.cpan.org/author/GAAS/MIME-Base64-2.16</link>
          </item>
          <item>
            <title>Test-MockObject-0.10</title>
            <link>http://search.cpan.org/author/CHROMATIC/Test-MockObject-0.10</link>
          </item>
        </channel>
        </rss>

8: XHTML result of RSS transformation

        <div>
          <img src="http://search.cpan.org/s/img/cpanrdf.gif" style="float: right; margin: 2px;" />
          <h3>
          <a href="http://search.cpan.org" title="The CPAN search site">search.cpan.org</a>
          </h3>
          <hr />
          <ul>
            <li>
              <a href="http://search.cpan.org/author/SLAVA/Apache-Dynagzip-0.09" title="">Apache-Dynagzip-0.09</a>
            </li>
            <li>
              <a href="http://search.cpan.org/author/GAAS/MIME-Base64-2.16" title="">MIME-Base64-2.16</a>
            </li>
            <li>
              <a href="http://search.cpan.org/author/CHROMATIC/Test-MockObject-0.10" title="">Test-MockObject-0.10</a>
            </li>
          </ul>
        </div>

Conclusion

Perl is a powerful language for collecting, downloading and manipulating data. The XML::RSS::Tools module works around some problems historically found in XML::RSS and incorporates the XSLT processing. Thus the application code is separate from the markup details.

See Also

  • XML In A Nutshell, 2nd edition by Harold & Means, O'Reilly and Associates. http://www.oreilly.com/catalog/xmlnut2/

  • XSLT Quickly by Bob DuCharme, Manning Publications. http://www.manning.com/ducharme/

  • XSLT by Doug Tidwell, O'Reilly and Associates. http://www.oreilly.com/catalog/xslt/

  • Beginning XSLT by Jeni Tennison, Wrox Press Ltd. (see footnote 2)

  • XSLT Programmer's Reference 2nd Edition by Michael Kay, Wrox Press Ltd. (see footnote 2) http://www.wrox.com/books/0764543814.shtml

  • XSLT Cookbook by Sal Mangano, O'Reilly and Associates. http://www.oreilly.com/catalog/xsltckbk/

  • Content Syndication with RSS by Ben Hammersley, O'Reilly and Associates. http://www.oreilly.com/catalog/consynrss

  • What is RSS? http://www.xml.com/lpt/a/2002/12/18/dive-into-xml.html and Parsing RSS At All Costs http://www.xml.com/lpt/a/2003/01/22/dive-into-xml.html by Mark Pilgrim http://www.diveintomark.org/ , XML.com.

  • Never Mind the Namespaces: An XSLT RSS Client http://www.xml.com/lpt/a/2003/01/02/tr.html by Bob DuCharme, XML.com.

  • XML::RSS, XML::LibXML and XML::LibXSLT are all available on CPAN.

  • http://www.xmlsoft.org/ for the underlying C libraries of LibXML and LibXSLT.

Credits

  • Dr V. E. Kerguelen

  • brian d foy http://www.panix.com/~comdog/

  • Bob DuCharme http://www.snee.com/bob/

AUTHOR

Dr A. J. Trickett (atrickett AT cpan DOT org)

FOOTNOTES

  1. This article http://www.theperlreview.com/Articles/v0i7/xslt.pdf first appeared in http://www.theperlreview.com/ in January 2003. This version will be continually updated as appropriate. Revision 4.2, January 2004.

  2. Wrox Press Ltd have gone out of business since the article was written, so I am not sure of the status of any Wrox books referenced in this article.