NAME

SWISH::Prog - build Swish-e programs

SYNOPSIS

  # create a Prog module by subclassing SWISH::Prog and SWISH::Prog::Doc
  package MyProg;
  use base qw( SWISH::Prog );
  
  sub ok
  {
    my $prog = shift;
    my $doc = shift;
    
    # index everything
    1;
  }
  
  1;
  
  package MyProg::Doc;
  use base qw( SWISH::Prog::Doc );
  
  # pass content untouched
  
  1;
  
  # elsewhere:
  use MyProg;
  use Carp;
  
  my $prog = MyProg->new(
                name    => 'myindex',
                opts    => '-W1 -v0',
                config  => 'some/swish/config/file',
                );
                
  $prog->find('some/dir');
          

DESCRIPTION

SWISH::Prog is a framework for indexing document collections with Swish-e. This module is a collection of utility methods for writing your own applications.

The API is a work in progress and subject to change.

METHODS

All of the following methods may be overridden when subclassing this module.

new( %opts )

Instantiate a new SWISH::Prog object. %opts may include:

ua

User agent for fetching remote files. By default this is an LWP::UserAgent object with default settings, but you can pass in your own LWP::UserAgent with your own settings, or any other user agent that supports the same methods.

name

Full path name of the index. Ignored if fh is passed.

config

Either a full path name to a Swish-e config file, or a SWISH::Prog::Config object. See Swish-e and SWISH::Prog::Config documentation.

fh

A filehandle reference. If set, fh will override any options related to the swish-e command and instead write all output to the filehandle.

Example:

 fh => *STDOUT{IO}
 

will write all output to stdout and will not open a pipe to the swish-e command.

The default filehandle is named SWISH and is tied via a piped open() to the swish-e -S prog -i stdin command.

If set to 0 or undef, the filehandle will default to STDOUT as in the example above. Use this feature to cache all output in a file for later indexing or debugging.

See fh().

debug

Print debugging messages to stderr.

strict

Perform sanity checks on content types for documents retrieved using User Agent. You might want this if you want to verify the content type against the actual content of the file.

You may pass any other key/value pairs you want and deal with them by overriding init().

You probably don't want to override new(). See init() and init_indexer() instead.

init

Called within new() after the object is blessed and internal initialization is done.

This method is designed to be overridden in your subclass. Only the object is passed. Return value is ignored.

The basic initialization order is:

_init()

Private internal method. Blesses object and sets up sane defaults based on args to new().

init()

Public method. Initialize your object beyond the default _init(). The base init() does nothing.

init_indexer()

Public method. Sets indexer() and fh(). Override this method in your subclass to customize config() or anything else that should be done after init() and before the swish-e index is opened.

init_indexer

Creates and caches a SWISH::Prog::Index object in indexer(), and sets fh(). You can override this method if you want to change when the index is opened for writing, or want to pass specific options to the SWISH::Prog::Index new() method.
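
The initialization order described above can be sketched with a small stand-in class. InitSketch::Base and InitSketch::Mine are hypothetical names, and the max_size option is invented for illustration; the sketch only mimics the documented call sequence and is not the actual SWISH::Prog implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the base class. It only mimics the documented
# call order (_init, init, init_indexer); it is not the real SWISH::Prog.
{
    package InitSketch::Base;

    sub new {
        my ( $class, %opts ) = @_;
        my $self = bless { %opts, calls => [] }, $class;
        $self->_init;
        $self->init;
        $self->init_indexer;
        return $self;
    }

    # Each step just records that it ran.
    sub _init        { push @{ $_[0]{calls} }, '_init' }
    sub init         { push @{ $_[0]{calls} }, 'init' }
    sub init_indexer { push @{ $_[0]{calls} }, 'init_indexer' }
}

# A subclass deals with its own key/value pairs in init().
{
    package InitSketch::Mine;
    use parent -norequire, 'InitSketch::Base';

    sub init {
        my $self = shift;
        $self->SUPER::init();

        # max_size is a hypothetical extra option passed to new()
        $self->{max_size} = 1024 unless defined $self->{max_size};
        return;
    }
}

my $obj = InitSketch::Mine->new( name => 'myindex' );
print join( ' -> ', @{ $obj->{calls} } ), "\n";
```

Because new() drives the whole sequence, a subclass normally only needs to override init() or init_indexer().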

DESTROY

The default DESTROY method simply calls close() on the fh() value. If you override this method, you should call

 $self->SUPER::DESTROY();

as well. See SWISH::Prog::DBI for an example.
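
A minimal sketch of that pattern, using hypothetical stand-in classes (DestroySketch::Base and DestroySketch::Mine) rather than SWISH::Prog itself. The log array exists only to show the order in which the two DESTROY methods run.

```perl
use strict;
use warnings;

# Hypothetical stand-in base class: its DESTROY closes the filehandle,
# which is what the default DESTROY is described as doing.
{
    package DestroySketch::Base;

    our @log;    # records cleanup steps, for illustration only

    sub new { my ( $class, %opts ) = @_; return bless {%opts}, $class }
    sub fh  { $_[0]{fh} }

    sub DESTROY {
        my $self = shift;
        my $fh = $self->fh or return;
        push @log, 'base closed fh' if close($fh);
        return;
    }
}

{
    package DestroySketch::Mine;
    use parent -norequire, 'DestroySketch::Base';

    sub DESTROY {
        my $self = shift;
        push @DestroySketch::Base::log, 'subclass cleanup';    # your own cleanup goes here
        $self->SUPER::DESTROY();                               # then let the base class close fh()
        return;
    }
}

open( my $fh, '>', \my $buffer ) or die "can't open in-memory filehandle: $!";
{
    my $obj = DestroySketch::Mine->new( fh => $fh );
}    # $obj goes out of scope here, so DESTROY runs

print join( ', ', @DestroySketch::Base::log ), "\n";
```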

fh( [ filehandle ] )

Get/set filehandle reference.

CAUTION: Only do this if you know what you're doing. The default filehandle is a pipe to the swish-e indexer and you could botch things royally if you changed that filehandle.

Examples of possible use include printing documents to different filehandles based on some criteria of your design. You would not need to override index(), but just change which filehandle index() will print to. Think of it like Perl's built-in select() function. You might open multiple swish-e indexers, for example, one per index, and thus create multiple indexes simultaneously from a single source. (This author would love to see good examples of doing that!)

ua( user agent )

By default this is an LWP::UserAgent object. Get/set it to taste.

config

Get/set SWISH::Prog::Config object.

strict

Get/set strict flag. See new().

debug

Get/set debug flag. See new().

indexer

SWISH::Prog::Index object. Set in init_indexer().

remote( URL )

Returns true (1) if URL matches a pattern that looks like a URI scheme (http://, ftp://, etc.). Otherwise, returns false (0).

NOTE: This will match file:// but LWP::UserAgent should fetch file:// URLs just like any other URL.
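
As a rough illustration of this kind of check, the helper below tests for a leading URI scheme. The name looks_remote and the pattern are assumptions based on the generic URI syntax; the pattern that remote() actually uses may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of SWISH::Prog. A scheme is taken to be a
# letter followed by letters, digits, '+', '-' or '.', and then '://'.
sub looks_remote {
    my $url = shift;
    return $url =~ m{^[A-Za-z][A-Za-z0-9+.\-]*://} ? 1 : 0;
}

for my $url ( 'http://swish-e.org/', 'file:///tmp/doc.html', 'some/dir/doc.html' ) {
    printf "%-22s %d\n", $url, looks_remote($url);
}
```

Note that the file:// URL is reported as remote, matching the behavior described in the NOTE above.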

fetch( URL [, stat_ref, file_ext] )

Retrieve URL either via HTTP or from filesystem.

Returns a Doc object. See SWISH::Prog::Doc documentation for how to subclass SWISH::Prog::Doc.

find( @paths )

Use SWISH::Prog::Find to traverse @paths.

spider NOT YET IMPLEMENTED

Returns a SWISH::Prog::Spider object.

ok( doc_object )

Returns true (1) if doc_object is acceptable for indexing.

The default is simply to call content_ok(). This method is a prime candidate for overriding in your subclass.

content_ok( doc_object )

Perform tests on doc_object content().

Return false (0) if any of the tests fails. A test can be anything: a regexp check, a size check, whatever.

The default test is simply that length() > 0.
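
An override might combine several such tests. The sketch below uses hypothetical stand-in classes (ContentSketch::Doc and ContentSketch::Prog) so that it runs on its own; in real code content_ok() would be a method in your SWISH::Prog subclass. The size limit and the whitespace test are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Doc object; only content() is assumed.
{
    package ContentSketch::Doc;
    sub new     { my ( $class, %args ) = @_; return bless {%args}, $class }
    sub content { $_[0]{content} }
}

# Hypothetical stand-in for a SWISH::Prog subclass.
{
    package ContentSketch::Prog;
    sub new { return bless {}, shift }

    sub content_ok {
        my ( $self, $doc ) = @_;
        my $content = $doc->content;

        return 0 unless defined $content && length $content;    # the documented default test
        return 0 if length($content) > 100_000;                 # hypothetical size limit
        return 0 if $content =~ /\A\s*\z/;                      # hypothetical: whitespace only
        return 1;
    }
}

my $prog = ContentSketch::Prog->new;
for my $text ( 'hello world', '', "  \n" ) {
    my $doc = ContentSketch::Doc->new( content => $text );
    print $prog->content_ok($doc) ? "index\n" : "skip\n";
}
```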

url_ok( URL )

Check URL before fetch()ing it.

Returns 0 if URL should be skipped.

Returns file extension of URL if URL should be processed.
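
The return convention (0 to skip, otherwise the file extension) can be sketched as follows. The function name url_ok_sketch and the allow-list of extensions are assumptions made for illustration; the rules applied by the real url_ok() are not shown here.

```perl
use strict;
use warnings;

# Hypothetical allow-list of extensions worth indexing.
my %wanted = map { $_ => 1 } qw( html htm txt pdf );

# Hypothetical helper, not part of SWISH::Prog.
sub url_ok_sketch {
    my $url = shift;

    ( my $path = $url ) =~ s/[?#].*//s;    # ignore any query string or fragment
    my ($ext) = $path =~ m{\.([A-Za-z0-9]+)\z};
    return 0 unless defined $ext;          # no extension: skip

    $ext = lc $ext;
    return $wanted{$ext} ? $ext : 0;
}

for my $url ( 'docs/manual.HTML', 'http://swish-e.org/index.html?x=1', 'images/logo.png' ) {
    my $ext = url_ok_sketch($url);
    print $ext ? "process $url as $ext\n" : "skip $url\n";
}
```

Returning the extension lets the caller reuse it without parsing the URL a second time.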

dir_ok( directory )

Called by find() for all directories. You can control the recursion into directory via the config() params

 TODO
 

index( doc_object )

Pass doc_object to the indexer.

Runs filter() and ok(), in that order, before handing to the indexer.

filter( doc_object )

Filter doc_object before indexing. filter() is called by index() just before ok().

Think of filter() as a last-chance global filter opportunity similar to the *_filter() methods available in SWISH::Prog::Doc. The individual *_filter() methods are called at the time the doc_object is first created. The filter() method is called later, just before indexing starts.
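
The order of operations inside index() can be sketched with stand-in classes. IndexSketch::Doc and IndexSketch::Prog are hypothetical, and printing the bare content to the filehandle is a simplification: how the real index() serializes a document for the indexer is not covered here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a Doc object with a get/set content() method.
{
    package IndexSketch::Doc;
    sub new { my ( $class, %args ) = @_; return bless {%args}, $class }

    sub content {
        my $self = shift;
        $self->{content} = shift if @_;
        return $self->{content};
    }
}

# Hypothetical stand-in for a SWISH::Prog subclass.
{
    package IndexSketch::Prog;
    sub new { my ( $class, %opts ) = @_; return bless {%opts}, $class }
    sub fh  { $_[0]{fh} }

    # Last-chance global filter: here, collapse runs of whitespace.
    sub filter {
        my ( $self, $doc ) = @_;
        ( my $text = $doc->content ) =~ s/\s+/ /g;
        $doc->content($text);
        return;
    }

    sub ok {
        my ( $self, $doc ) = @_;
        return length( $doc->content ) > 1 ? 1 : 0;
    }

    # filter() runs first, so ok() sees the filtered content.
    sub index {
        my ( $self, $doc ) = @_;
        $self->filter($doc);
        return 0 unless $self->ok($doc);
        print { $self->fh } $doc->content, "\n";
        return 1;
    }
}

open( my $out, '>', \my $buffer ) or die "can't open in-memory filehandle: $!";
my $prog = IndexSketch::Prog->new( fh => $out );

$prog->index( IndexSketch::Doc->new( content => "some   text\n\nhere" ) );
$prog->index( IndexSketch::Doc->new( content => "  \n\t " ) );    # a single space after filtering, so skipped
close($out);

print $buffer;
```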

elapsed

Returns the elapsed time in seconds since object was created.

SEE ALSO

http://swish-e.org/

SWISH::Prog::Doc, SWISH::Prog::Headers, SWISH::Prog::Index, SWISH::Prog::Config

AUTHOR

Peter Karman, <perl@peknet.com>

Many of the API ideas here are gleaned from Bill Moseley's DirTree.pl and spider.pl scripts in Swish-e 2.x.

COPYRIGHT AND LICENSE

Copyright 2006 by Peter Karman

Thanks to Atomic Learning for sponsoring some of the development of this module.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.