FileArchiveIndexer::IndexingRun - abstraction of an indexing run
    use FileArchiveIndexer::IndexingRun;

    my $i = new FileArchiveIndexer::IndexingRun({
       DBNAME => $dbname,
       DBPASSWORD => $dbpassword,
       DBHOST => $dbhost,
       SCP_USER => $scpuser,
       SCP_HOST => $scphost,
       use_ocr => 1,
    });

    $i->run();
    exit;
This module is an abstraction of an indexing run. It uses FileArchiveIndexer as its base class, so all of that module's methods are available.
    my $i = new FileArchiveIndexer::IndexingRun({
       DBNAME => $dbname,
       DBPASSWORD => $dbpassword,
       DBHOST => $dbhost,
       SCP_USER => $scpuser,
       SCP_HOST => $scphost,
       use_ocr => 1,
       run_max => 100,
       abs_log => '/var/log/faindex.log',
       running_as_remote_indexer => 0,
    });
If you have a YAML config file, pass its path as abs_conf instead:

    my $i = new FileArchiveIndexer::IndexingRun({ abs_conf => '/etc/faindex.conf' });
When you call run(), at most run_max files are indexed; see also run_max().
This module was designed to index PDF files with OCR. OCR is disabled by default; see use_ocr().
By default we expect that the indexer is running locally on the server which hosts the database and the files. If this is not the case, running_as_remote_indexer should be set to 1. The files are then retrieved via scp, so SCP_USER and SCP_HOST need to be set; see "RUNNING AS REMOTE INDEXER".
Certain errors can be logged. For example, if we are running as remote indexer, and the file is not properly retrieved, then we can log that. Also if we are using ocr and the file does not check ok for pdf standards, we can log that too. To enable logging you must set the parameter 'abs_log'.
Also a summary is logged at the end of each run.
run() initiates the actual indexing run. It keeps running until run_count() matches run_max() or no more files are pending. Returns true after the run.
You can make your own indexer if you like. You do not have to use run().
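The stopping rule described for run() can be illustrated without the real module. The following is only a sketch under stated assumptions: sketch_run, the in-memory queue, and the $index_one callback are hypothetical stand-ins, not part of FileArchiveIndexer. It shows a loop that ends when the success count reaches the maximum or the pending queue is empty, and that does not count skipped files.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an indexing run; not the module's real code.
# $pending   : array ref of file paths still waiting to be indexed
# $run_max   : maximum number of files to index successfully in this run
# $index_one : code ref that returns true when a file was indexed
sub sketch_run {
    my ($pending, $run_max, $index_one) = @_;
    my $run_count = 0;

    # stop when the count reaches the maximum or nothing is pending
    while ($run_count < $run_max and @$pending) {
        my $abs_path = shift @$pending;

        # skipped files (lock failures, errors) do not raise the count
        $index_one->($abs_path) or next;
        $run_count++;
    }

    return $run_count;
}
```

A custom indexer built on get_indexpending() would follow the same shape, with the callback replaced by its own locking and text extraction.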
run_count() returns the number of files indexed so far. This does not include skipped files. Files may be skipped because we cannot lock them or because of errors; the count covers only successfully indexed files.
run_max() is the maximum number of files to index in this run. To set it, pass the maximum as the argument, or pass 'run_max' to the constructor. The default is 100.

    # set to 45
    $self->run_max(45);
    $self->run;
Returns a boolean: true if no files are pending. It only returns true if get_pending_next() has already been called and get_indexpending() returned no more files.
use_ocr(): argument is 1 or 0; returns a boolean.
If you want to use OCR for paper documents stored as scans, you need the PDF::OCR package and all its dependencies installed. Set to 0 by default. See PDF::OCR::Thorough::Cached.
Can also be passed as the argument 'use_ocr' to the constructor.
One of the crucial goals of FileArchiveIndexer is to be able to index a vast amount of documents, possibly in a very time-consuming manner. For example, using PDF::OCR::Thorough, we can turn PDF scans of documents into text for the indexer.
The process is so CPU intensive that it can take one computer many weeks to index a large archive. The ability to run multiple indexing machines against one archive and one database is therefore very useful.
When running as remote indexer, Digest::MD5::File is required.
The local indexer, being remote to the file archive, asks the database for a list of pending files. For each file, we download the file and ask what its md5sum is supposed to be. After downloading, we compute the md5sum of the local copy and check it against what the file archive server thinks it should be. If they are the same, we index the file.
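The check described above can be sketched as follows. This is a minimal illustration, not the module's implementation: it uses the core Digest::MD5 module rather than Digest::MD5::File, and the helper names file_md5_hex and download_is_intact are hypothetical.

```perl
use strict;
use warnings;
use Digest::MD5;

# Return the hex md5sum of a file on disk, or undef if it cannot be opened.
sub file_md5_hex {
    my ($abs_path) = @_;
    open my $fh, '<', $abs_path or return;
    binmode $fh;
    my $hex = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    return $hex;
}

# True only when the downloaded copy matches the sum reported by the server.
# $expected_md5 is assumed to be the hex string stored for the remote file.
sub download_is_intact {
    my ($abs_local, $expected_md5) = @_;
    my $got = file_md5_hex($abs_local);
    return 0 unless defined $got and defined $expected_md5;
    return lc($got) eq lc($expected_md5) ? 1 : 0;
}
```

A file that fails this check would be skipped rather than indexed, which matches the behaviour described for get_file().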
You must configure the file archive machine's mysql server to accept connections from the remote machines. You will need to add a user and host to the mysql server to be able to make changes to it remotely.
On the server hosting your file archive and database :
mysql -p
GRANT ALL PRIVILEGES ON *.* TO '$DBUSER'@'$USERHOSTIP' IDENTIFIED BY '$password' WITH GRANT OPTION;
for example to grant on the network
GRANT ALL PRIVILEGES ON *.* TO '$DBUSER'@'192.168.0.%' IDENTIFIED BY '$password' WITH GRANT OPTION;
In addition to all the normal arguments to the constructor, you must also provide SCP_USER and SCP_HOST, and set running_as_remote_indexer to 1.
get_file(): argument is the abs_path of the remote file; returns the abs_path of the local file. It tests that the local md5sum is the same as the remote one; if that fails, it warns and returns undef.
This is ONLY used for a remote indexer, that is, if the indexer is running on a different machine than the server holding the files and database.
Uses scp.
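As a rough illustration of an scp based download, the sketch below builds the command as an argument list and runs it with list-form system(). It is an assumption about how such a fetch could look, not the code behind get_file(); scp_argv and fetch_via_scp are hypothetical names, and the scp client is assumed to be installed with key-based login already set up.

```perl
use strict;
use warnings;

# Build the argument list for an scp download (hypothetical helper).
sub scp_argv {
    my ($scp_user, $scp_host, $abs_remote, $abs_local) = @_;
    # -q suppresses the progress meter
    return ('scp', '-q', "${scp_user}\@${scp_host}:${abs_remote}", $abs_local);
}

# Fetch one file; returns the local path on success, undef otherwise.
sub fetch_via_scp {
    my @argv = scp_argv(@_);
    # list-form system() avoids passing the arguments through a local shell
    system(@argv) == 0 or return;
    return $argv[-1];
}
```

After a successful fetch, the local copy would still need its md5sum compared against the value held by the file archive server before indexing.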
running_as_remote_indexer(): argument is 0 or 1; returns a boolean. If set, get_file() is used to retrieve the files. Additional arguments to the constructor must also be provided; see "RUNNING AS REMOTE INDEXER". It is suggested not to use this method, but to set the value via the constructor instead.
Running as remote indexer with the default run() method
    use FileArchiveIndexer::IndexingRun;

    my $f = new FileArchiveIndexer::IndexingRun({
       DBNAME => $dbname,
       DBPASSWORD => $dbpassword,
       DBHOST => $dbhost,
       SCP_USER => $scpuser,
       SCP_HOST => $scphost,
       running_as_remote_indexer => 1,
       use_ocr => 1,
       run_max => 200,
       abs_log => '/var/log/faindex.log'
    });

    $f->run;
    exit;
Running as remote indexer with your own indexer
    use FileArchiveIndexer::IndexingRun;

    my $f = new FileArchiveIndexer::IndexingRun({
       DBNAME => $dbname,
       DBPASSWORD => $dbpassword,
       DBHOST => $dbhost,
       SCP_USER => $scpuser,
       SCP_HOST => $scphost,
       running_as_remote_indexer => 1,
    });

    my $pending = $f->get_indexpending(20); # do 20

    for (@$pending) {
       my ($filesid, $abs_remote) = @$_;

       my $abs_local = $f->get_file($abs_remote) or next;
       my $md5sumid = $f->indexing_lock($abs_remote);
       my $text = your_method_for_getting_text_out_of($abs_local) or next;
       $f->insert_record($md5sumid,$text);
       $f->indexing_lock_release($filesid);
    }
$FileArchiveIndexer::IndexingRun::DEBUG = 1;
This is an lvalue sub.
FileArchiveIndexer, PDF::OCR::Thorough::Cached
Leo Charre leocharre at cpan dot org