NAME

News Clipper - downloads and integrates dynamic information into web pages

SYNOPSIS

 Using the input and output files specified in either the system-wide
 NewsClipper.cfg file, or the personal NewsClipper.cfg file in
 ~/.NewsClipper

 $ NewsClipper.pl [-anrv] [-c configfile]

 Override the input and output files

 $ NewsClipper.pl [-anrv] [-c configfile] \
   -i inputfile -o outputfile

 Provide a sequence of News Clipper commands on the command line

 $ NewsClipper.pl [-anrv] [-c configfile] \
   -e "handlername, handlername, handlername"

DESCRIPTION

News Clipper grabs dynamic information from the internet and integrates it into your webpage. Features include modular extensibility, timeouts to handle dead servers without hanging the script, user-defined update times, and automatic installation of modules.

News Clipper takes an input HTML file, which includes special tags of the form:

  <!--newsclipper
    <input name=X>
    <filter name=Y>
    <output name=Z>
  -->

where X names a data source, such as "yahootopstories" or "slashdot". When News Clipper encounters such a tag, it attempts to load and execute the handler for X to acquire the data. The data is then sent to the filter named by Y, and finally to the output handler named by Z. If a handler cannot be found, the script asks for permission to download it from the central repository.
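To make the tag structure concrete, here is a minimal Python sketch that pulls the input, filter, and output commands out of a page. It is not News Clipper's own parser (which is written in Perl); the regular expressions, the function name extract_commands, and the filter name "examplefilter" are assumptions made for illustration, and only unquoted name=value attributes are handled.

```python
import re

# One News Clipper comment; "newsclipper" is the default keyword.
COMMENT_RE = re.compile(r"<!--\s*newsclipper\b(.*?)-->", re.DOTALL | re.IGNORECASE)
# One command inside the comment, e.g. <input name=slashdot>.
COMMAND_RE = re.compile(r"<(input|filter|output)\s+([^>]*)>", re.IGNORECASE)
# Unquoted key=value attributes only (a deliberate simplification).
ATTR_RE = re.compile(r"(\w+)=([^\s>]+)")


def extract_commands(html):
    """Return one list of (kind, attributes) pairs per News Clipper comment."""
    blocks = []
    for comment in COMMENT_RE.finditer(html):
        commands = []
        for kind, attr_text in COMMAND_RE.findall(comment.group(1)):
            commands.append((kind.lower(), dict(ATTR_RE.findall(attr_text))))
        blocks.append(commands)
    return blocks


# "examplefilter" is a hypothetical filter name used only for this demo.
page = """<html><body>
<!--newsclipper
  <input name=slashdot>
  <filter name=examplefilter>
  <output name=string>
-->
</body></html>"""

for kind, attrs in extract_commands(page)[0]:
    print(kind, attrs["name"])
```

The commands come back in document order, which mirrors the acquire, filter, output pipeline described above.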

HANDLERS

News Clipper has a modular architecture, in which handlers implement the acquisition and output of data gathered from the internet. To use new data sources, first locate an interesting one at http://www.newsclipper.com/handlers.html, then place the News Clipper tag in your input file. Then run News Clipper once manually, and it will prompt you for permission to download and install the handler.

You can control, at a high level, the format of the output data by using the built-in filters and handlers described on the handlers web page. For more control over the style of output data, you can write your own handlers in Perl.

To help handler developers, a utility called MakeHandler.pl is included with the News Clipper distribution. It is a generator that asks several questions, and then creates a basic handler. Handler development is supported by two APIs, AcquisitionFunctions and HTMLTools. For a complete description of these APIs, as well as suggestions on how to write handlers, visit http://www.newsclipper.com/handlers.html.

News Clipper can automatically download updated handlers whose functionality has not changed relative to the currently installed version. Such an update is safe to install: it is guaranteed not to break your existing News Clipper commands. These "bugfix updates" are controlled by the auto_download_bugfix_updates value in the NewsClipper.cfg file.

You can also tell News Clipper to download "functional updates", which are handlers whose interface has changed relative to the version you have. These updates are the most recent versions of the handler, but they contain changes that may break existing News Clipper commands.

OPTIONS AND ARGUMENTS

-i inputfile

Override the input file specified in the configuration file. The special filename "STDIN" gets input from standard input (useful for piping commands to News Clipper).

-o outputfile

Override the output file specified in the configuration file. The special filename "STDOUT" sends output to standard output instead of a file.

-e commands

Run the specified handlers using the default filters and output handlers, and send the result to STDOUT. This option overrides -i and -o. Commands can be written in the normal News Clipper tag syntax or as a comma-separated list. For example, the following are equivalent:

 $ echo '<!-- newsclipper <input name=date style=day><output name=string> -->' | \
   NewsClipper.pl -i STDIN -o STDOUT

 $ NewsClipper.pl -e 'date style=day,string'

 $ NewsClipper.pl -e '<input name=date style=day><output name=string>'

Note that commas cannot be escaped: commas that appear in quotes, for example, will be interpreted as delimiters between commands.
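The consequence of that rule can be illustrated with a short Python sketch that splits on every comma and ignores quoting. This models the documented behaviour only; it is not News Clipper's actual parsing code, and the function name split_commands and the "format" attribute are hypothetical.

```python
def split_commands(spec):
    """Split a -e style list on every comma, paying no attention to quotes."""
    return [part.strip() for part in spec.split(",")]


# The list form from the example above yields two commands.
print(split_commands("date style=day,string"))
# ['date style=day', 'string']

# A comma inside quotes still splits ("format" is a made-up attribute).
print(split_commands("date format='%A, %B'"))
# ["date format='%A", "%B'"]
```

When an attribute value needs a comma, the tag syntax passed through -i STDIN is the safer route.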

-c configfile

Use the specified file as the configuration file, instead of NewsClipper.cfg.

-n

Check for new bugfix and functional updates to any handlers encountered.

-a

Automatically download any bugfix or functional updates to handlers that News Clipper processes. To always download bugfix updates but not functional updates, set the auto_download_bugfix_updates option in the configuration file instead. Use this flag only when running News Clipper interactively, since functional updates can break web pages that rely on the older functionality.
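The interaction between -a and auto_download_bugfix_updates can be summarized as a small decision function. This Python sketch reflects one reading of the descriptions above and is not taken from the News Clipper source; the function name and the "bugfix"/"functional" labels are assumptions.

```python
def should_auto_download(update_kind, a_flag, auto_download_bugfix_updates):
    """Decide whether an available handler update is fetched without asking.

    update_kind is "bugfix" or "functional" (labels chosen for this sketch).
    a_flag is True when -a was given on the command line.
    auto_download_bugfix_updates is the configuration value, "yes" or not.
    """
    if a_flag:
        # -a covers both kinds of update.
        return True
    # Without -a, only bugfix updates qualify, and only if enabled.
    return update_kind == "bugfix" and auto_download_bugfix_updates == "yes"


print(should_auto_download("functional", a_flag=False, auto_download_bugfix_updates="yes"))
print(should_auto_download("bugfix", a_flag=False, auto_download_bugfix_updates="yes"))
```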

-P

Pause after News Clipper has completed execution. (This is useful when running News Clipper in a window that automatically closes upon program exit.)

-r

Reload the content from the proxy server even on a cache hit. This prevents News Clipper from using stale data when constructing the output file.

-d

Enable debug mode, which prints extra information about the execution of News Clipper. Output is sent to the screen instead of the output file.

-v

Verbose output. Output a copy of the information sent to the output file to standard output. Does not work on Windows or DOS.

-H

Use the specified path as the user's home directory, instead of auto-detecting the path. This is useful for specifying the location of the .NewsClipper directory.

-C

Clear the News Clipper cache, handler-specific state, or News Clipper state. The cache contains information acquired by acquisition handlers. Handler-specific state is any information that handlers store between runs. News Clipper state is any information that News Clipper stores between runs, such as the last time a handler was checked for an update.

Clearing the cache significantly slows down News Clipper and increases network traffic on remote servers, so use it with care. Similarly, clearing News Clipper state forces News Clipper to check for updates to handlers.

CONFIGURATION

The file NewsClipper.cfg contains the configuration. News Clipper will first look for this file in the system-wide location specified by the NEWSCLIPPER environment variable. News Clipper will then load the user's NewsClipper.cfg from $home/.NewsClipper. Any options that appear in the personal configuration file override those in the system-wide configuration file, except for the module_path option. In this file you can specify the following:

$ENV{TZ}

The timezone for Windows. (This is automatically detected on Unix-like platforms.)

email

The user's email address. This is used for registration for the commercial version.

registration_key

The registration key. This is used for registration for the commercial version.

input_files, output_files

Multiple input and output files. The first input file is transformed into the first output file, the second input file to the second output file, etc.
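The positional pairing can be sketched in Python as follows. The file names are placeholders, and rejecting lists of unequal length is an assumption of this sketch; the text above does not say how News Clipper treats that case.

```python
def pair_files(input_files, output_files):
    """Pair the n-th input file with the n-th output file."""
    if len(input_files) != len(output_files):
        # Assumed behaviour for the sketch, not documented above.
        raise ValueError("input_files and output_files differ in length")
    return list(zip(input_files, output_files))


# Placeholder file names.
for source, target in pair_files(["news.in.html", "sports.in.html"],
                                 ["news.html", "sports.html"]):
    print(source, "->", target)
```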

handler_locations

The locations of handlers. For example, ['dir1','dir2'] would look for handlers in dir1/NewsClipper/Handler/ and dir2/NewsClipper/Handler/. Note that while installing handlers, the first directory is used. This can be used to provide a location for a single repository of handlers, which can be shared by all users.
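As an illustration of how the search path expands, this Python sketch derives the handler directories from a handler_locations list and picks the installation directory. The function names are hypothetical, and posixpath is used so the result looks the same on every platform.

```python
import posixpath


def handler_dirs(handler_locations):
    """Directories searched for handlers, in order."""
    return [posixpath.join(location, "NewsClipper", "Handler")
            for location in handler_locations]


def install_dir(handler_locations):
    """New handlers are installed under the first location."""
    return handler_dirs(handler_locations)[0]


print(handler_dirs(["dir1", "dir2"]))
# ['dir1/NewsClipper/Handler', 'dir2/NewsClipper/Handler']
print(install_dir(["dir1", "dir2"]))
# dir1/NewsClipper/Handler
```

Listing a shared directory first therefore makes it the place where downloaded handlers end up.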

module_path

The location of News Clipper's modules, in case they aren't in the standard Perl module path. (Set during installation.) For pre-compiled versions of News Clipper, this setting also includes extra directories, separated by whitespace, which are paths in which to search for any additional Perl modules.

cache_location

The location of the cache in the file system.

max_cache_size

The maximum size of the cache in megabytes. It should be at least 5.

script_timeout

The timeout value for the script. This puts a limit on the total time the script can execute, which prevents it from hanging. This does not work on Windows or DOS.

socket_timeout

The timeout value for socket connections. This allows the script to recover from unresponsive servers.

socket_tries

The number of times to try a connection before giving up.
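How socket_timeout and socket_tries work together can be sketched as a bounded retry loop. This Python version is only an illustration under assumptions: the fetch callable, the function name, and the choice to re-raise the last error are not taken from News Clipper.

```python
def fetch_with_retries(fetch, socket_tries, socket_timeout):
    """Call fetch(timeout=...) up to socket_tries times, then give up."""
    if socket_tries < 1:
        raise ValueError("socket_tries must be at least 1")
    last_error = None
    for _attempt in range(socket_tries):
        try:
            return fetch(timeout=socket_timeout)
        except OSError as error:  # TimeoutError is a subclass of OSError
            last_error = error
    raise last_error


# A stand-in for a network call: fails twice, then answers.
attempts = []


def flaky_fetch(timeout):
    attempts.append(timeout)
    if len(attempts) < 3:
        raise TimeoutError("server did not respond")
    return "<html>data</html>"


print(fetch_with_retries(flaky_fetch, socket_tries=3, socket_timeout=10))
print(len(attempts), "attempts")
```

Note that the worst-case wait for one connection is roughly socket_tries times socket_timeout, which is worth keeping below script_timeout.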

proxy

Your proxy host. For example, "http://proxy.host.com:8080/"

proxy_username
proxy_password

Your proxy username and password.

auto_download_bugfix_updates

Set to "yes" to automatically download bugfix updates to handlers.

tag_text

The keyword to indicate News Clipper commands. The default is "newsclipper", which results in <!-- newsclipper ... --> as the default command comment.

make_output_files_executable

Set to "yes" to make output files executable.

debug_log_file
run_log_file

The files (with paths) to which the debug log and the run log, respectively, are appended.

max_number_of_log_files
max_log_file_size

The maximum number of log files to maintain, and the maximum size of any log file.
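The precedence rule given at the start of this section (personal settings override system-wide ones, except for module_path) can be sketched as a dictionary merge. This Python version is illustrative only: NewsClipper.cfg is not a Python file, the function name and example values are hypothetical, and keeping the system-wide module_path is one reading of the exception.

```python
def merge_config(system_wide, personal):
    """Combine two option dicts: personal wins, except for module_path."""
    merged = dict(system_wide)
    for key, value in personal.items():
        if key == "module_path" and "module_path" in system_wide:
            continue  # assumed: the system-wide module_path is kept
        merged[key] = value
    return merged


# Example values are made up for the demonstration.
system_wide = {"socket_timeout": 30, "module_path": "/usr/lib/newsclipper"}
personal = {"socket_timeout": 10, "module_path": "/home/me/lib",
            "tag_text": "newsclipper"}

print(merge_config(system_wide, personal))
```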

NewsClipper.cfg also contains handler-specific configuration options. These options are generally documented in the handler's syntax documentation.

The NewsClipper.cfg that comes with the distribution contains default configuration information for the cacheimages handler:

imgcachedir

The location in the filesystem of the image cache. This location should be visible from the web.

imgcacheurl

The URL that corresponds to the image cache directory specified by imgcachedir.

maximgcacheage

The maximum age of images in the image cache. Old images will be removed from the cache.

RUNNING

You can run NewsClipper.pl from the command line. The -e, -i, and -o flags allow you to test your input files. When you are happy with the way things are working, you should run News Clipper as a cron job. To do this, create a .crontab file with something similar to the following:

    0 7,10,13,16,19,22 * * * /path/NewsClipper.pl

See "man cron" for more information.
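To read the schedule in the crontab entry above: the first field is the minute and the second is a comma-separated list of hours, so the job runs on the hour six times a day. The following Python sketch expands such an entry into clock times. It handles only plain numbers and comma lists in the first two fields (no ranges, steps, or "*"), and the function name is hypothetical.

```python
def run_times(crontab_line):
    """Expand the minute and hour fields of a crontab entry into HH:MM times."""
    minute, hour, _day, _month, _weekday, _command = crontab_line.split(None, 5)
    return [f"{int(h):02d}:{int(m):02d}"
            for h in hour.split(",")
            for m in minute.split(",")]


print(run_times("0 7,10,13,16,19,22 * * * /path/NewsClipper.pl"))
# ['07:00', '10:00', '13:00', '16:00', '19:00', '22:00']
```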

PREREQUISITES

This script requires the Time::CTime, Time::ParseDate, LWP::UserAgent (part of libwww-perl), URI, HTML-Tree, and HTML::Parser modules, in addition to others that are included in the standard Perl distribution. See the News Clipper distribution's README file for more information.

Handlers that you download may require additional modules.

NOTES

News Clipper has two web sites: the open source homepage at http://newsclipper.sourceforge.net, and the commercial homepage at http://www.newsclipper.com/. The open source homepage has instructions for getting the source via CVS, and has documentation aimed at developers. The commercial web site contains a FAQ, information for buying the commercial version, and more.

AUTHOR

David Coppit, <david@coppit.org>, http://coppit.org/ Spinnaker Software, Inc.