NAME

XML::RSS::FromHTML - simple framework for making RSS out of HTML

SYNOPSIS

  ### create your own sub-class, with these four methods
  package MyModule;
  use base 'XML::RSS::FromHTML';
  
  sub init {
      my $self = shift;
      # set your configurations here
      $self->name('MyRSS');
      $self->url('http://foo.com/headlines.html');
  }
  
  sub defineRSS {
      my $self = shift;
      my $xmlrss  = shift;
      # define your RSS using XML::RSS->channel method
      $xmlrss->channel(
          title => 'foo.com headlines feed',
          description => 'generated from http://foo.com headlines'
      );
  }
  
  sub makeItemList {
      my $self = shift;
      my $html = shift;
      # parse HTML and make an item list
      my @list;
      while ($html =~ m|<li><a href="(.+?)">(.+?)</a></li>|g){
          push(@list,{
              link  => $1,
              title => $2
          });
      }
      return \@list;
  }
  
  sub addNewItem {
      my $self = shift;
      my ($xmlrss,$eachItem) = @_;
      # make your item using XML::RSS->add_item method
      $xmlrss->add_item(
          title => $eachItem->{title},
          link  => $eachItem->{link},
          description => 'this is '. $eachItem->{title},
      );
  }
  
  #### and from your main routine...
  package main;
  use MyModule;
  my $rss = MyModule->new;
  $rss->update;
  # an updated RSS file './MyRSS.xml' will be created.
  # run this script every day, and your RSS will always 
  # be up-to-date.

DESCRIPTION

This module is a simple framework for periodically creating RSS feeds out of HTML pages. Plenty of web sites still don't supply RSS feeds, even though it would be nice if they did. This module helps you build feeds for those sites yourself and keep their contents up to date. The core features are as follows:

  • retrieving HTML text from a URL

  • restricting access to the URL at short intervals

  • caching update records (to keep access to the URL to a minimum)

  • a framework that requires minimal coding from developers

Its main focus is on not being an annoyance to the target web site (and, of course, on developer-friendliness). We don't want to be seen as spammers; it would be nice if we could instead show site owners the value of RSS feeds.

USAGE

BASIC

This module is not intended to work by itself. You need to create a subclass of it and define the following four methods, customized for your target URL/web site.

FOUR METHODS

init()

  sub init {
      my $self = shift;
      # set your configurations here
      $self->name('Test');
      $self->url('http://foo.com/headlines.html');
      $self->cacheDir('./cache');
      $self->feedDir('./feed');
      return 1;
  }

Called within the constructor, this method should initialize the property values of your choice. See the PROPERTIES section for a description of the available properties.

defineRSS()

Define your RSS feed's description and other channel information here, using the XML::RSS->channel method.

  sub defineRSS {
      my $self = shift;
      my $xmlrss = shift;
      # define your RSS using XML::RSS->channel method
      $xmlrss->channel(
          title => 'foo.com headlines feed',
          description => 'generated from http://foo.com headlines'
      );
      # you can also define images with XML::RSS->image method
      $xmlrss->image(
          title  => 'foo.com headlines feed',
          url    => 'http://mysite/image/logo.gif',
          link   => 'http://foo.com/headlines.html'
      );
      return 1;
  }

makeItemList()

Given the whole HTML string (supplied as an argument), use whatever means you like (e.g. a regexp) to create a data structure of items. Later on, you'll use this information to create feed items.

  sub makeItemList {
      my $self = shift;
      my $html = shift;
      # parse HTML and make an item list
      my @list;
      while ($html =~ m| .. some mumbling regexp here .. |g){
          push(@list,{
              link     => $1,
              title    => $2,
              category => $3,
              id       => $4,
              ...
          });
      }
      return \@list;
  }
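
The parsing loop can be tried on its own before it goes into a subclass. The following is a minimal sketch that applies the pattern from the SYNOPSIS to a small hand-written HTML string. The sample markup and the function name extract_items are assumptions made for illustration; a real page will need its own pattern.

```perl
use strict;
use warnings;

# Hypothetical sample input; a real page will need its own pattern.
my $html = <<'HTML';
<ul>
<li><a href="/news/1.html">First headline</a></li>
<li><a href="/news/2.html">Second headline</a></li>
</ul>
HTML

# Same loop shape as makeItemList(), written as a plain function
# (extract_items is an illustrative name, not part of the module).
sub extract_items {
    my ($html) = @_;
    my @list;
    while ($html =~ m{<li><a href="(.+?)">(.+?)</a></li>}g) {
        push @list, { link => $1, title => $2 };
    }
    return \@list;
}

my $items = extract_items($html);
printf "%s -> %s\n", $_->{title}, $_->{link} for @$items;
```

If the loop yields an empty list for a page you know has entries, the pattern no longer matches the markup.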

addNewItem()

From the list created by the method above (makeItemList), the framework checks for updates and calls this method once for each new item. Thus, the argument $eachItem is the current element of the @list created by $self->makeItemList. Use the XML::RSS->add_item method to add a new item to the RSS feed. You can also fetch additional information about the item, for example from its description page, and add that to the feed too.

  sub addNewItem {
      my $self = shift;
      my ($xmlrss,$eachItem) = @_;
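      # fetch additional information if you want to
      # (require does not import get(), so call it fully qualified)
      require LWP::Simple;
      my $html = LWP::Simple::get("http://foo.com/archives/$eachItem->{id}.html");
      # get() returns undef on failure, so fall back to an empty string
      $html = '' unless defined $html;
      my ($desc) = ($html =~ m|<p class="desc">(.+?)</p>|);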
      # make your rss item using XML::RSS->add_item method
      $xmlrss->add_item(
          title => $eachItem->{title},
          link  => $eachItem->{link},
          category => $eachItem->{category},
          description => $desc,
      );
      return 1;
  }

HOW TO USE

Basically, all you need to do is load your subclass module, create a new instance, and call the update method. The update method returns a boolean value:

  • 1 : The RSS feed was rewritten because there were updates.

  • 0 : No update was made, for some reason.

The $self->updateStatus method then returns a status message describing what happened.

  use MyModule;
  my $rss = MyModule->new;
  my $hasNewItem = $rss->update;
  if($hasNewItem){
    print "RSS updated with some new items";
    return 1;
  }else{
    # i.e. "still under check interval time period"
    print $rss->updateStatus; 
    return undef;
  }

PROPERTIES

These are all the properties available for configuration within $self->init method.

  • name

    Identification string, used for feed file name and cache file name. Default value is 'myrss'.

  • url

    The URL of the target web page.

  • cacheDir

    Directory path to where the cache files are stored. Default is '.' (current dir).

  • feedDir

    Directory path to where the RSS feed file will be saved. Default is '.' (current dir).

  • minInterval

    Minimum interval period in seconds. If $self->update is called more than once within this interval, the call is silently ignored, which avoids unnecessary access to the target URL. Default is 300 (5 minutes).

  • maxItemCount

    The maximum number of items the RSS feed contains. If exceeded, older items will be deleted from the feed. Default is 30.

  • unicodeDowngrade

    [deprecated since v0.04] The prerequisite module XML::Parser v2.34 no longer creates utf-8 flagged strings, so this feature is no longer needed for Japanese and other multi-byte character languages.

    Parsing RSS files with XML::RSS (actually XML::Parser) results in utf-8 flagged strings. Setting this to a true value removes all these utf-8 flags, which is sometimes helpful for non-ASCII text when you are not using the 'encoding' pragma.

  • passthru

    A hashref of optional values to pass to the XML::RSS->new() method. Default is {} (empty). For example, setting this:

      $self->passthru({ version => '2.0' });

    will work as

      XML::RSS->new( version => '2.0' );

    in every place XML::RSS->new is called internally.

  • outFileName

    If supplied, this is used as the name of the output file (the feed XML file) instead of $self->name. (Intended for custom usage only.)

  • debug

    If set to a true value, each time $self->update method is called, some useful debugging information (files) will be created in the $self->cacheDir directory.
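
To make the meaning of minInterval and maxItemCount concrete, here is a minimal sketch of the two rules as plain functions. It is not the module's actual implementation: the names within_interval and cap_items are hypothetical, and the sketch assumes the item list is ordered newest first.

```perl
use strict;
use warnings;

# Hypothetical helper: true when an update should be skipped because
# fewer than $min_interval seconds have passed since the last one.
sub within_interval {
    my ($last_update, $now, $min_interval) = @_;
    return 0 unless defined $last_update;   # never updated: do not skip
    return ($now - $last_update) < $min_interval ? 1 : 0;
}

# Hypothetical helper: keep at most $max items, assuming the list is
# ordered newest first so that the oldest entries are the ones dropped.
sub cap_items {
    my ($items, $max) = @_;
    my @kept = @$items;
    splice(@kept, $max) if @kept > $max;
    return \@kept;
}

my $now = time;
print within_interval($now - 60,  $now, 300) ? "skip\n" : "update\n";
print within_interval($now - 600, $now, 300) ? "skip\n" : "update\n";
my $capped = cap_items([ 1 .. 35 ], 30);
print scalar(@$capped), " items kept\n";
```

With the default values (300 seconds and 30 items), a call made one minute after the previous update would be skipped, and a list of 35 items would be cut to 30.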

OTHER USEFUL PROPERTIES

updateStatus

As described above (section HOW TO USE), this property contains a helpful message about the update sequence. The current messages are:

  • 'update not executed yet'

    default message before $self->update is called.

  • 'still under check interval time period'

    $self->minInterval seconds haven't passed yet since the last update.

  • 'makeItemList returned with 0 item - html parse failure'

    the parsing logic is not working right; the HTML structure has probably changed.

  • 'updated with $n new items'

    successfully updated with $n new items.

  • 'there was no new item'

    the HTML hasn't changed a bit.
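
A calling script may want to react differently to each of these messages. The following is a minimal sketch of one way to do that, assuming the message texts are exactly as listed above. The function name classify_status and the labels it returns are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a status message into a short label.
# It assumes the message texts are exactly those documented above.
sub classify_status {
    my ($status) = @_;
    return 'updated'      if $status =~ /^updated with \d+ new items$/;
    return 'no_change'    if $status eq 'there was no new item';
    return 'throttled'    if $status eq 'still under check interval time period';
    return 'parse_failed' if $status =~ /^makeItemList returned with 0 item/;
    return 'not_run'      if $status eq 'update not executed yet';
    return 'unknown';
}

print classify_status('updated with 3 new items'), "\n";
print classify_status('there was no new item'), "\n";
```

In a real script you would pass $rss->updateStatus to such a helper and, for example, send an alert only on the parse failure case.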

newItems

An array reference to all the items that were counted as new. Sometimes useful after calling the $self->update method.

  $rss->update;
  print "there were " . scalar(@{ $rss->newItems }) . " items new.\n";
  foreach (@{ $rss->newItems }){
      print "title: $_->{title}\n";
  }

OTHER USEFUL METHODS

as_string()

Will return RSS feed as XML string.

as_object()

Will return XML::RSS object of the current RSS feed.

getDateTime()

Will return the current date and time in an RFC 1123 style GMT ASCII format, like this:

  Sun, 06 Nov 1994 08:49:37 GMT

Useful for date/time related elements within an RSS feed (e.g. pubDate). If a date-time string is passed as an argument, the method tries its best to parse it and returns the result in the same GMT ASCII format.

  print $self->getDateTime('19940203T141529Z');
  # will print 'Thu, 03 Feb 1994 14:15:29 GMT'

It uses HTTP::Date internally, so see the documentation of HTTP::Date's parse_date() function for the available (parseable) formats.
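
To check whether a given date string can be parsed, you can call HTTP::Date directly. The sketch below uses its str2time() and time2str() functions; that getDateTime() combines them in exactly this way is an assumption, not something stated above.

```perl
use strict;
use warnings;
use HTTP::Date qw(str2time time2str);

# Epoch seconds to the RFC 1123 style GMT string.
print time2str(784111777), "\n";          # Sun, 06 Nov 1994 08:49:37 GMT

# Parse an ISO 8601 compact timestamp, then format it the same way.
# str2time() returns undef when it cannot parse its argument.
my $epoch = str2time('19940203T141529Z');
print time2str($epoch), "\n";             # Thu, 03 Feb 1994 14:15:29 GMT
```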

TIPS

RETRIEVING HTML FROM SESSION REQUIRED WEB SITE

Some web sites require a valid session ID in your browser cookie or query string before they serve their contents. The session ID is usually issued the first time you visit their top page or, of course, when you go through the login process.

If you need to retrieve HTML from pages that require such session IDs, override the $self->getHTML method with your own implementation. For example, for a web site that issues a session ID when you access its top.cgi page, the getHTML method could look like this:

  sub getHTML {
      my $self = shift;
      my $url = shift;
      require LWP::UserAgent; # load the class before using it
      my $ua = LWP::UserAgent->new;
      $ua->cookie_jar({ file => $self->cacheDir.'/'.$self->name.'.cookie' });
      $ua->get('http://foo.com/top.cgi'); # set session-id in cookie
      my $res = $ua->get($url); # send with session-id cookie
      return $res->content;
  }

BUGS

Nothing that I'm aware of, yet.

AUTHOR

  Toshimasa Ishibashi
  CPAN ID: BASHI
  bashi@cpan.org
  http://iandeth.dyndns.org/mt/ian/

COPYRIGHT

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.

SEE ALSO

perl(1). XML::RSS