
NAME

SD_SQL

DESCRIPTION

Reimplementation of sd.pl, SD.pm and SDQ.pm using MySQL. Contains both recyc and guard.

The basic idea is to have a table (urldb) that contains most URLs ever inserted into the system, together with a lock (the guard function) and a boolean harvest-flag. Also in this table is the host part together with its lock. URLs are selected from this table based on urllock, netloclock and harvest, and inserted into a queue (table que). URLs from this queue are then given out to harvesters. The queue is implemented as follows.

The admin table can be used to generate sequence numbers like this:

    mysql> update admin set queid=LAST_INSERT_ID(queid+1);

The sequence number is then used to extract the next URL from the queue:

    mysql> select host,url from que where queid=LAST_INSERT_ID();

When the queue is empty it is filled from table urldb. Several different algorithms can be used to fill it (round-robin, most URLs, longest time since harvest, ...). Since the harvest-flag and guard-lock are not updated until the actual harvest is done, it is OK to delete the queue and regenerate it at any time.
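The queue mechanism described above can be sketched as follows. This is only an illustration under stated assumptions, not the module's code: it uses Python with SQLite so that it runs without a MySQL server, and since SQLite has no LAST_INSERT_ID(expr), the counter in admin is incremented and read back inside one transaction instead. The table and column names urldb, que, admin, queid, host, url, urllock, netloclock and harvest come from the description; the helper names make_db, fill_queue and next_url, and the simple "order by url" fill strategy, are hypothetical.

```python
import sqlite3
import time


def make_db():
    """Create an in-memory database with the three tables from the description."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE urldb (url TEXT PRIMARY KEY, host TEXT NOT NULL,
                            urllock INTEGER NOT NULL DEFAULT 0,
                            netloclock INTEGER NOT NULL DEFAULT 0,
                            harvest INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE que (queid INTEGER PRIMARY KEY, host TEXT, url TEXT);
        CREATE TABLE admin (queid INTEGER NOT NULL);
        INSERT INTO admin (queid) VALUES (0);
    """)
    return db


def fill_queue(db, now=None):
    """Drop the queue and regenerate it from urldb; return the number of queued URLs."""
    now = int(time.time()) if now is None else now
    with db:  # one transaction: either the whole queue is rebuilt or nothing changes
        db.execute("DELETE FROM que")
        db.execute("UPDATE admin SET queid = 0")
        rows = db.execute(
            "SELECT host, url FROM urldb "
            "WHERE urllock < ? AND netloclock < ? AND harvest = 1 "
            "ORDER BY url",  # assumed fill strategy, chosen only for determinism
            (now, now),
        ).fetchall()
        db.executemany(
            "INSERT INTO que (queid, host, url) VALUES (?, ?, ?)",
            [(i, host, url) for i, (host, url) in enumerate(rows, start=1)],
        )
    return len(rows)


def next_url(db, now=None):
    """Hand out the next (host, url) pair, refilling the queue when it runs dry."""
    with db:
        # Stand-in for: update admin set queid=LAST_INSERT_ID(queid+1)
        db.execute("UPDATE admin SET queid = queid + 1")
        row = db.execute(
            "SELECT host, url FROM que WHERE queid = (SELECT queid FROM admin)"
        ).fetchone()
    if row is not None:
        return row
    if fill_queue(db, now) == 0:
        return None  # nothing is eligible for harvesting right now
    return next_url(db, now)
```

Because neither the harvest-flag nor the locks are touched here, refilling yields the same URLs again until a harvester reports back and updates urldb, which is why the queue can be thrown away and rebuilt safely.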

Questions, ideas, TODOs, etc.

Split table urldb into 2 tables, one for URLs and one for hosts? This would be less efficient when filling que, but more efficient when updating netloclock.

Data structure for TABLE hosts:

    create table hosts(
      host varchar(50) not null default '',
      netloclock int not null,
      retries int not null default 0,
      ant int not null default 0,
      primary key (host),
      key (ant),
      key (netloclock)
    );
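The trade-off behind this question can be illustrated with a small sketch. It is an assumption-laden example in Python with SQLite rather than the module's Perl and MySQL: with a single urldb table the host lock is repeated on every URL row of that host, while a separate hosts table stores it once. The function names are hypothetical; the table and column names follow the text above.

```python
import sqlite3


def lock_host_single_table(db, host, until):
    """Single-table layout: netloclock is stored on every URL row of the host."""
    cur = db.execute(
        "UPDATE urldb SET netloclock = ? WHERE host = ?", (until, host)
    )
    return cur.rowcount  # number of rows that had to be rewritten


def lock_host_split_tables(db, host, until):
    """Split layout: netloclock lives in one row of the hosts table."""
    cur = db.execute(
        "UPDATE hosts SET netloclock = ? WHERE host = ?", (until, host)
    )
    return cur.rowcount
```

The price of the split layout is that filling que then needs a join between the two tables.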

Handle too many retries?

    algorithm takes a URL from the host that was accessed longest ago
    ($host,$url,$id)=SELECT hosts.host,url,id FROM hosts,urls WHERE 
         hosts.hostlock < UNIX_TIMESTAMP() AND 
         hosts.host=urls.host AND 
         urls.urllock < UNIX_TIMESTAMP() AND 
         urls.harvest=1 ORDER BY hostlock LIMIT 1;

    algorithm takes a URL from the host with most URLs
    ($host,$url,$id)=SELECT hosts.host,url,id FROM hosts,urls WHERE 
         hosts.hostlock < UNIX_TIMESTAMP() AND 
         hosts.host=urls.host AND 
         urls.urllock < UNIX_TIMESTAMP() AND 
         urls.harvest=1 ORDER BY hosts.ant DESC LIMIT 1;

    algorithm takes a URL from any available host
    ($host,$url,$id)=SELECT hosts.host,url,id FROM hosts,urls WHERE 
         hosts.hostlock < UNIX_TIMESTAMP() AND 
         hosts.host=urls.host AND 
         urls.urllock < UNIX_TIMESTAMP() AND 
         urls.harvest=1 LIMIT 1;
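The three selection algorithms differ only in their ORDER BY clause, which the following sketch makes explicit. It is a hedged illustration in Python with SQLite, not the module's implementation: the timestamp is passed in as a parameter instead of calling UNIX_TIMESTAMP(), the column ant is taken to be the per-host URL count (as the ordering in the second query suggests), and the names ORDER_BY, demo_db and pick_url as well as the sample rows are hypothetical.

```python
import sqlite3

# One ORDER BY clause per selection algorithm; the rest of the query is shared.
ORDER_BY = {
    "oldest_host": "ORDER BY hosts.hostlock",  # host accessed longest ago
    "most_urls": "ORDER BY hosts.ant DESC",    # host with most URLs
    "any": "",                                 # any available host
}


def demo_db():
    """Build a small in-memory database with made-up hosts and URLs."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE hosts (host TEXT PRIMARY KEY,
                            hostlock INTEGER NOT NULL,
                            ant INTEGER NOT NULL DEFAULT 0);
        CREATE TABLE urls (id INTEGER PRIMARY KEY,
                           host TEXT NOT NULL,
                           url TEXT NOT NULL,
                           urllock INTEGER NOT NULL DEFAULT 0,
                           harvest INTEGER NOT NULL DEFAULT 0);
        INSERT INTO hosts VALUES ('a.example', 10, 1),
                                 ('b.example', 50, 2),
                                 ('c.example', 5000, 9);
        INSERT INTO urls (host, url, urllock, harvest) VALUES
            ('a.example', 'http://a.example/', 0, 1),
            ('b.example', 'http://b.example/1', 0, 1),
            ('b.example', 'http://b.example/2', 0, 1),
            ('c.example', 'http://c.example/', 0, 1);
    """)
    return db


def pick_url(db, strategy, now):
    """Return (host, url, id) for one harvestable URL, or None if none is free."""
    sql = (
        "SELECT hosts.host, urls.url, urls.id "
        "FROM hosts JOIN urls ON hosts.host = urls.host "
        "WHERE hosts.hostlock < ? AND urls.urllock < ? AND urls.harvest = 1 "
        + ORDER_BY[strategy]
        + " LIMIT 1"
    )
    return db.execute(sql, (now, now)).fetchone()
```

With the sample data and now=100, host c.example is still locked, so "oldest_host" picks a.example and "most_urls" picks one of the b.example URLs even though c.example has the larger count.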

AUTHOR

Anders Ardö <anders.ardo@it.lth.se>

COPYRIGHT AND LICENSE

Copyright (C) 2005,2006 Anders Ardö

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.

See the file LICENCE included in the distribution at http://combine.it.lth.se/
