NAME

FileArchiveIndexer::IndexingRun - abstraction of an indexing run

SYNOPSIS

   use FileArchiveIndexer::IndexingRun;

   my $i = new FileArchiveIndexer::IndexingRun({
      DBNAME => $dbname,
      DBPASSWORD => $dbpassword,
      DBHOST => $dbhost,
      SCP_USER => $scpuser,
      SCP_HOST => $scphost,
      use_ocr => 1,   
   });
   
   $i->run();
   
   exit;

DESCRIPTION

This module is an abstraction of an indexing run. It uses FileArchiveIndexer as its base class, so all of that module's methods are available.

new()

   my $i = new FileArchiveIndexer::IndexingRun({
      DBNAME => $dbname,
      DBPASSWORD => $dbpassword,
      DBHOST => $dbhost,
      SCP_USER => $scpuser,
      SCP_HOST => $scphost,
      use_ocr => 1,
      run_max => 100,
      abs_log => '/var/log/faindex.log',
      running_as_remote_indexer => 0,   
   });

Arguments

abs_conf

If you have a YAML config file:

   my $i = new FileArchiveIndexer::IndexingRun({ abs_conf => '/etc/faindex.conf' });

run_max

When you call run(), at most run_max files are indexed. See also run_max().

use_ocr

This module was designed to index PDF files with OCR. OCR is disabled by default; see use_ocr().

running_as_remote_indexer, SCP_USER, SCP_HOST

By default we expect the indexer to run locally on the server that hosts the database and the files. If this is not the case, running_as_remote_indexer should be set to 1. The files are then retrieved via scp, so SCP_USER and SCP_HOST need to be set. See "RUNNING AS REMOTE INDEXER".

abs_log

Certain errors can be logged. For example, if we are running as remote indexer, and the file is not properly retrieved, then we can log that. Also if we are using ocr and the file does not check ok for pdf standards, we can log that too. To enable logging you must set the parameter 'abs_log'.

Also a summary is logged at the end of each run.

run()

This initiates the actual indexing run. It keeps running until run_count() matches run_max() or no more files are pending. Returns true after the run.

You can make your own indexer if you like. You do not have to use run().

run_count()

Returns the number of files indexed so far. This does not include skipped files; files may be skipped because they could not be locked or because of errors. The count covers successfully indexed files only.
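The stopping rule and the counting rule can be illustrated with a small stand-alone simulation. This is only a sketch: simulate_run() and try_index() are hypothetical names, the pending list is plain in-memory data, and nothing here calls the real module or its database.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one indexing attempt: returns true on success,
# false when the file is skipped (lock not obtained, or an error).
sub try_index {
    my ($file) = @_;
    return $file->{ok};
}

# Sketch of the stopping rule: stop when the number of successfully
# indexed files reaches $run_max, or when nothing is left pending.
sub simulate_run {
    my ($pending, $run_max) = @_;
    my $run_count = 0;
    my @queue     = @$pending;    # work on a copy of the pending list

    while ($run_count < $run_max and @queue) {
        my $file = shift @queue;
        next unless try_index($file);    # skipped files are not counted
        $run_count++;
    }
    return $run_count;
}

my @pending = (
    { path => '/archive/a.pdf', ok => 1 },
    { path => '/archive/b.pdf', ok => 0 },    # skipped
    { path => '/archive/c.pdf', ok => 1 },
    { path => '/archive/d.pdf', ok => 1 },
);

print simulate_run(\@pending, 2),   "\n";    # ends when the maximum is reached
print simulate_run(\@pending, 100), "\n";    # ends when the queue is empty
```

Note that the skipped file consumes a loop iteration but does not move the count toward the maximum.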

run_max()

Maximum number of files to index in this run. The optional argument sets the maximum. You can also set it via the 'run_max' argument to the constructor. Default is 100.

   # set to 45
   
   $self->run_max(45);
   
   $self->run;

no_pending_files_left()

Returns boolean, true if no files are pending. It only returns true if get_pending_next() has already been called and get_indexpending returned no more files.

use_ocr()

Argument is 1 or 0. Returns boolean.

If you want to use OCR for paper documents stored as scans, you need the PDF::OCR package and all its dependencies installed. Set to 0 by default. See PDF::OCR::Thorough::Cached.

Can also be passed as the 'use_ocr' argument to the constructor.

RUNNING AS REMOTE INDEXER

One of the crucial goals of FileArchiveIndexer is to be able to index a vast amount of documents, possibly in a very time consuming manner. For example, using PDF::OCR::Thorough, we can turn PDF scans of documents into text for the indexer.

The process is so CPU intensive that it can take one computer many weeks to index a large archive. Thus, the option to run multiple indexing machines against one archive and one database is valuable.

When running as remote indexer, Digest::MD5::File is required.

HOW IT WORKS

The local indexer, being remote to the file archive, asks the database for a list of pending files. For each file, we download it and ask what its md5sum is supposed to be. After downloading, we compute the md5sum of the local copy and check it against what the file archive server thinks it should be. If they are the same, we index the file.
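The checksum comparison can be sketched as follows. This is an illustration under stated assumptions, not the module's own code: it uses the core Digest::MD5 module rather than Digest::MD5::File, and md5_of_file() and download_is_intact() are hypothetical helper names.

```perl
use strict;
use warnings;
use Digest::MD5;
use File::Temp qw(tempfile);

# Return the hex MD5 digest of a file, or undef if it cannot be opened.
sub md5_of_file {
    my ($abs_path) = @_;
    open my $fh, '<', $abs_path or return;
    binmode $fh;
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $hex;
}

# True only if the downloaded copy has the digest the server reported.
sub download_is_intact {
    my ($abs_local, $expected_md5) = @_;
    my $actual = md5_of_file($abs_local);
    return 0 unless defined $actual and defined $expected_md5;
    return lc($actual) eq lc($expected_md5) ? 1 : 0;
}

# Demonstration with a temporary file standing in for a downloaded document.
my ($fh, $abs_local) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} "scanned document bytes";
close $fh;

my $expected = Digest::MD5::md5_hex("scanned document bytes");
print download_is_intact($abs_local, $expected) ? "index it\n" : "skip it\n";
```

A copy that fails the check should be skipped rather than indexed, since its text could not be trusted to match the archived file.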

REQUIREMENTS

You must configure the file archive machine's mysql server to accept connections from the remote machines. You will need to add a user and host to the mysql server to be able to make changes to it remotely.

On the server hosting your file archive and database :

mysql -p

GRANT ALL PRIVILEGES ON *.* TO '$DBUSER'@'$USERHOSTIP' IDENTIFIED BY '$password' WITH GRANT OPTION;

For example, to grant access to every host on the local network:

GRANT ALL PRIVILEGES ON *.* TO '$DBUSER'@'192.168.0.%' IDENTIFIED BY '$password' WITH GRANT OPTION;

ADDITIONAL ARGUMENTS TO CONSTRUCTOR

In addition to all the normal constructor arguments, you must also provide running_as_remote_indexer (set to 1), SCP_USER and SCP_HOST, as described under the arguments to new().

get_file()

Argument is the abs_path of the remote file. Returns the abs_path of the local copy. Tests that the local md5sum is the same as the remote one; if the check fails, warns and returns undef.

This is ONLY used for a remote indexer, that is, when the indexer is running on a different machine than the server holding the files and database.

Uses scp.
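A retrieval step of this kind might look like the sketch below. It is an assumption-laden illustration, not the module's implementation: scp_command() and fetch_remote_file() are hypothetical names, the user, host and paths are made up, and the md5sum check described above is left out.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec;

# Build the scp argument list for copying one remote file into $local_dir.
# Returned as a list so it can be passed to system() without a shell.
sub scp_command {
    my ($scp_user, $scp_host, $abs_remote, $local_dir) = @_;
    my $abs_local = File::Spec->catfile($local_dir, basename($abs_remote));
    return ('scp', '-q', "$scp_user\@$scp_host:$abs_remote", $abs_local);
}

# Fetch the file; return the local path, or undef (with a warning) on failure.
sub fetch_remote_file {
    my @cmd = scp_command(@_);
    if (system(@cmd) != 0) {
        warn "scp failed for $cmd[2]\n";
        return;
    }
    return $cmd[-1];
}

# Show the command that would be run; nothing is copied here.
my @cmd = scp_command('indexer', 'archive.example.com',
                      '/archive/docs/scan01.pdf', '/tmp');
print join(' ', @cmd), "\n";
```

Returning undef on failure lets a caller write `my $abs_local = fetch_remote_file(...) or next;` and move on to the next pending file.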

_running_as_remote_indexer()

Argument is 0 or 1. Returns boolean. If set, get_file() is used to retrieve files. Additional arguments to the constructor must also be provided; see "RUNNING AS REMOTE INDEXER". It is suggested not to call this method directly, but to set the value via the constructor instead.

EXAMPLE 1

Running as remote indexer with the default run() method

   use FileArchiveIndexer::IndexingRun;
   
   my $f = new FileArchiveIndexer::IndexingRun({ 
      DBNAME => $dbname,
      DBPASSWORD => $dbpassword,
      DBHOST => $dbhost,
      SCP_USER => $scpuser,
      SCP_HOST => $scphost,
      running_as_remote_indexer => 1,         
      use_ocr => 1,
      run_max => 200,
      abs_log => '/var/log/faindex.log'
   });
   
   $f->run;
   
   exit;

EXAMPLE 2

Running as remote indexer with your own indexer

   use FileArchiveIndexer::IndexingRun;
   
   my $f = new FileArchiveIndexer::IndexingRun({ 
      DBNAME => $dbname,
      DBPASSWORD => $dbpassword,
      DBHOST => $dbhost,
      SCP_USER => $scpuser,
      SCP_HOST => $scphost,
      running_as_remote_indexer => 1,         
   });
   
   my $pending = $f->get_indexpending(20); # do 20
   
   for (@$pending) {
      my ($filesid, $abs_remote) = @$_;
   
      my $abs_local = $f->get_file($abs_remote) or next;
   
      my $md5sumid = $f->indexing_lock($abs_remote);
   
      my $text = your_method_for_getting_text_out_of($abs_local) or next;
      
      $f->insert_record($md5sumid,$text);
   
      $f->indexing_lock_release($filesid);
   
   }

DEBUG FLAG

   $FileArchiveIndexer::IndexingRun::DEBUG = 1;

This is an lvalue sub.

SEE ALSO

FileArchiveIndexer, PDF::OCR::Thorough::Cached

AUTHOR

Leo Charre leocharre at cpan dot org