The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

RSS2Leafnode -- post RSS or Atom feeds and web pages to newsgroups

SYNOPSIS

 rss2leafnode [--options]

DESCRIPTION

RSS2Leafnode downloads RSS or Atom feeds and posts items to an NNTP news server. It's designed to make simple text entries available in local newsgroups (not propagated anywhere, though that's not enforced).

Desired feeds are given in a configuration file .rss2leafnode.conf in your home directory. For example to post a feed to group "r2l.surfing"

    fetch_rss ('r2l.surfing',
               'http://www.coastalwatch.com/rss/cwreports_330.xml');

This is actually Perl code, so you can put comment lines with # or write some conditionals, etc. The target newsgroup must exist (see "Leafnode" below). With that done, run rss2leafnode as

    rss2leafnode

You can automate with cron or similar. If you do it under user news it could be just after a normal news fetch. The --config option below lets you run different config files at different times, etc. A sample config file is included in the RSS2Leafnode sources.

Messages are added to the news spool using NNTP "POST" commands. When a feed is re-downloaded any items previously added are not repeated. Multiple feeds can be put into a single newsgroup. Feeds are inserted as they're downloaded, so the first articles appear while the rest are still in progress.

The target newsgroup can also be a news: or nntp: URL for a server on a different host, or for a different port number if running a personal server on a high port.

    fetch_rss('news://somehost.mydomain.org:8119/r2l.weather',
              'http://feeds.feedburner.com/PTCC');

Web Pages

Plain web pages can be downloaded too. Each time the page changes a new article is injected. This is good for news or status pages which don't have an RSS feed. For example

    fetch_html ('r2l.finance,
                'http://www.baresearch.com/free/index.php?category=1');

If you've got URI::Title then it's used for the message "Subject". It has special cases for good parts of some unhelpful sites. Otherwise it's the HTML <title>.

Re-Downloading

HTTP ETag and Last-Modified headers are used, if provided by the server, to avoid re-downloading unchanged content (feeds or web pages). Values seen from the last run are saved in .rss2leafnode.status in your home directory.

If you've got XML::RSS::Timing then it's used for RSS ttl, updateFrequency, etc from a feed. This means the feed is not re-downloaded until its specified update times. But only a few feeds have useful timing info, most merely give a ttl advising for instance 5 minutes between rechecks.

With --verbose the next calculated update time is printed, in case you're wondering why nothing is happening. The easiest way to force a re-download is to delete the .rss2leafnode.status file. Old status file entries are automatically dropped if you don't fetch a particular feed for a while, so that file should otherwise need no maintenance.

Leafnode

rss2leafnode was originally created with the leafnode program in mind, but can be used with any server accepting posts. It's your responsibility to be careful where a target newsgroup propagates. Don't make automated postings to the world!

For leafnode see its README on "LOCAL NEWSGROUPS" for creating local-only groups. Basically you add a line to /etc/news/leafnode/local.groups like

    r2l.stuff   y       My various feeds

The group name is arbitrary and the description is optional, but note there must be a tab character between the name and the "y" and between the "y" and any description. "y" means posting is allowed.

COMMAND LINE OPTIONS

The command line options are

--config=/some/filename

Read the specified configuration file instead of ~/.rss2leafnode.conf.

--help

Print some brief help information.

--verbose

Print some diagnostics about what's being done. With --verbose=2 print various technical details.

--version

Print the program version number and exit.

CONFIG OPTIONS

The following variables can be set in the configuration file

If true then download the "link" in each item and include it in the news message. For example,

    $rss_get_links = 1;
    fetch_rss ('r2l.finance',
      'http://au.biz.yahoo.com/financenews/htt/financenews.xml');

Not all feeds have interesting things at their link, but for those which do this can make the full article ready to read immediately, instead of having to click through from the message.

Only the immediate link target URL is retrieved, no images within the page are downloaded (which is often a good thing), and you'll probably have trouble if the link uses frames (a set of HTML pages instead of just one).

$render (default 0)

If true then render HTML to text for the news messages. Normally item text, any $rss_get_links downloaded parts, and fetch_html pages are all presented as MIME text/html. But if your newsreader doesn't handle HTML very well then $render is a good way to see just the text. Setting 1 uses HTML::FormatText

    $render = 1;
    fetch_rss ('r2l.weather',
      'http://xml.weather.yahoo.com/forecastrss?p=ASXX0001&u=f');

Setting "WithLinks" uses the HTML::FormatText::WithLinks variant (you must have that module), which shows links as footnotes.

    $render = 'WithLinks';
    fetch_rss ('r2l.finance',
               'http://movies.hsx.com/hsx_news.xml');

Settings elinks, lynx or w3m dump through the respective external program (you must have HTML::FormatExternal and the program).

    $render = 'lynx';
    $rss_get_links = 1;
    fetch_rss ('r2l.sport',
               'http://fr.news.yahoo.com/rss/rugby.xml');
$render_width (default 60)

The number of columns to use when rendering HTML to plain text or when wrapping Atom plain text. You can set this to whatever you find easiest to read, or any special width needed by a particular feed.

Obscure Options

$rss_charset_override (default undef)

If set then force RSS content to be interpreted in this charset, irrespective of what the document says. Use this if the document is wrong or has no charset specified and isn't the XML default utf-8. Usually you'll only want this for a particular offending feed. For example,

    # AIR is latin-1, but doesn't have a <?xml> saying that
    $rss_charset_override = 'iso-8859-1';
    fetch_rss ('r2l.finance', 'http://www.aireview.com.au/rss.php');
    $rss_charset_override = undef;

An attempt is made to cope with bad non-ascii by re-coding to the document's claimed charset. If that works then the text will have substituted characters (U+FFFD or question marks "?") and a warning is given,

    Warning, recoded utf-8 to parse http://example.org/feed.xml
      expect substitutions for bad non-ascii
      (line 214, column 75, byte 13196)

Nose around the raw feed at the given location to find what's actually wrong. See "ENCODINGS" in XML::Parser for the charsets supported by the parser (.enc files under /usr/lib/perl5/XML/Parser/Encodings/ plus some builtins).

$html_charset_from_content (default 0)

If true then the charset used for fetch_html content is taken from the HTML itself, rather than the server's HTTP headers. Normally the server should be believed, but if a particular server is misconfigured then you can try this.

    $html_charset_from_content = 1;
    fetch_rss ('r2l.stuff',
               'http://www.somebadserver.com/newspage.html');

Variable Extent

Variables take effect from the point they're set, through to the end of the file, or until a new setting. The Perl local feature and a braces block can be used to confine a setting to a particular few feeds. Eg.

    { local $rss_get_links = 1;
      fetch_rss ('r2l.finance',
                 'http://www.debian.org/News/weekly/dwn.en.rdf');
    }

DETAILS

RSS and Atom text is encoded as utf-8 in the generated messages, so you'll need a newsreader which supports that if reading non-ascii. Unrendered downloaded web pages are left in the charset the server gave, to ensure it matches any <meta http-equiv> in the document. HTML rendered to text is recoded to utf-8, the same as the RSS text. In all cases of course the charset is specified in the MIME message headers or attachment parts.

Google Groups mailing list feeds such as http://groups.google.com/group/cake-php/feed/rss_v2_0_msgs.xml get a "List-Post" header pointing to the list like "foolist@googlegroups.com". This may let you followup to the list, depending on your newsreader. (A followup to the newsgroup goes nowhere.)

Yahoo Finance items repeated in different feeds are noticed using a special match of the link in each item, so just one copy of each is posted. (Yahoo's items don't offer RSS guid identifiers which normally protect against duplication.)

Links are shown for basic RSS and Atom <link>, but only HTML targets, not other feeds. <comments>, <wfw:comment> and <wiki:diff> are shown too. Currently comments and replies links are not downloaded by $rss_get_links, but a diff link is. Comment or reply counts are shown from <thr:total>, <thr:count> or <slash:comments>.

<thr:in-reply-to> is used for an In-Reply-To header if present, which might help threading in a news reader, but of course only if the parent item was downloaded too.

Some pre-releases of leafnode 2 have trouble with posts to local newsgroups while a fetchnews run is in progress. The local articles don't show up until after a subsequent further fetchnews.

FILES

~/.rss2leafnode.conf

Configuration file.

~/.rss2leafnode.status

Status file, recording "last modified" dates for downloads. This can be deleted if something bad seems to have happened to it; the next rss2leafnode run will recreate it.

SEE ALSO

leafnode(8), HTML::FormatText, HTML::FormatText::WithLinks, HTML::FormatExternal, lynx(1), URI::Title, XML::Parser

HOME PAGE

http://user42.tuxfamily.org/rss2leafnode/index.html

LICENSE

Copyright 2007, 2008, 2009, 2010 Kevin Ryde

RSS2Leafnode is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 3, or (at your option) any later version.

RSS2Leafnode is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with RSS2Leafnode. If not, see http://www.gnu.org/licenses/.