NAME
Combine - Focused Web crawler framework
SYNOPSIS
combine --jobname <name> --logname <id>
OPTIONS AND ARGUMENTS
jobname is used to find the appropriate configuration (mandatory)
logname is used as an identifier in the log (in the MySQL table log)
DESCRIPTION
Crawls and parses Web pages, optionally applies a topic check, and stores the results in a MySQL database. Normally started with the
combineCtrl command. Briefly, it gets a URL from the MySQL database, which acts as a common coordinator for a Combine job. The Web page is fetched, provided it passes the robot exclusion protocol. The HTML is cleaned using
Tidy and parsed into metadata, headings, text, links and link anchors. The record is then stored in the MySQL database in a structured form. If a topic check is configured, only pages that pass it are stored, which keeps the crawler focused.
A simple workflow for a trivial crawl job might look like:
Initialize database and configuration:
    combineINIT --jobname aatest
Enter some seed URLs from a file with a list of URLs:
    combineCtrl load --jobname aatest < seedURLs.txt
Start 2 crawl processes:
    combineCtrl start --jobname aatest --harvesters 2
For some time, occasionally schedule new links for crawling:
    combineCtrl recyclelinks --jobname aatest
or look at the size of the ready queue:
    combineCtrl stat --jobname aatest
When satisfied, kill the crawlers:
    combineCtrl kill --jobname aatest
Export data records in a highly structured XML format:
    combineExport --jobname aatest
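The workflow above could be wrapped in a small shell script. The following is only a sketch: the combine* command lines are the ones listed above, while the variable names (JOB, SEEDS, HARVESTERS, ROUNDS, PAUSE, DRY_RUN), the run() helper, the number of rounds and the pause between them are assumptions made for illustration and are not part of Combine.

```shell
#!/bin/sh
# Sketch of the trivial crawl workflow as one script.
# Only the combine* command lines come from the workflow above; the
# variables, the run() helper, the round count and the pause are
# illustrative assumptions.

JOB="${JOB:-aatest}"            # job name passed to --jobname
SEEDS="${SEEDS:-seedURLs.txt}"  # file with a list of seed URLs
HARVESTERS="${HARVESTERS:-2}"   # number of crawl processes to start
ROUNDS="${ROUNDS:-3}"           # how many times to recycle links (assumed)
PAUSE="${PAUSE:-600}"           # seconds to wait between rounds (assumed)
DRY_RUN="${DRY_RUN:-1}"         # 1 = only print the commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

crawl_job() {
    run combineINIT --jobname "$JOB"

    # Loading seeds needs stdin redirection, so it is handled separately.
    if [ "$DRY_RUN" = "1" ]; then
        echo "combineCtrl load --jobname $JOB < $SEEDS"
    else
        combineCtrl load --jobname "$JOB" < "$SEEDS"
    fi

    run combineCtrl start --jobname "$JOB" --harvesters "$HARVESTERS"

    # Periodically schedule new links and report the ready-queue size.
    round=1
    while [ "$round" -le "$ROUNDS" ]; do
        run sleep "$PAUSE"
        run combineCtrl recyclelinks --jobname "$JOB"
        run combineCtrl stat --jobname "$JOB"
        round=$((round + 1))
    done

    run combineCtrl kill --jobname "$JOB"
    run combineExport --jobname "$JOB"
}

crawl_job
```

With the defaults the script only prints the commands it would run. Setting DRY_RUN=0 in the environment executes them, in which case the Combine tools must be installed and the seed file must exist.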
For more complex jobs you have to edit the job configuration file.
SEE ALSO
See the Combine configuration documentation in /usr/share/doc/combine/.
AUTHOR
Anders Ardö, <email@example.com>
COPYRIGHT AND LICENSE
Copyright (C) 2005 Anders Ardö
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
See the file LICENCE included in the distribution at http://combine.it.lth.se/