NAME
combine - main crawling machine in the Combine system
SYNOPSIS
combine --jobname <name> --logname <id>
OPTIONS AND ARGUMENTS
jobname is used to find the appropriate configuration (mandatory)
logname is used as an identifier in the log (MySQL table log)
DESCRIPTION
Crawls, parses, optionally runs a topic check, and stores the results in a MySQL database. Normally started with the combineCtrl command.
Briefly, it gets a URL from the MySQL database, which acts as a common coordinator for a Combine job. The Web page is fetched, provided the robot exclusion protocol allows it. The HTML is cleaned using Tidy and parsed into metadata, headings, text, links and link anchors. The page is then stored in the MySQL database in a structured form. If the topic check is enabled, only pages that pass it are stored, which keeps the crawler focused.
A simple workflow for a trivial crawl job might look like:
Initialize database and configuration:
    combineINIT --jobname aatest
Enter some seed URLs from a file with a list of URLs:
    combineCtrl load --jobname aatest < seedURLs.txt
Start 2 crawl processes:
    combineCtrl start --jobname aatest --harvesters 2
For a while, occasionally schedule new links for crawling:
    combineCtrl recyclelinks --jobname aatest
or look at the size of the ready queue:
    combineCtrl stat --jobname aatest
When satisfied, kill the crawlers:
    combineCtrl kill --jobname aatest
Export data records in a highly structured XML format:
    combineExport --jobname aatest
For more complex jobs you have to edit the job configuration file.
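The workflow steps above can be sketched as a single shell function. This is only an illustration: the combine* commands and their options are taken from this page, while the names run, crawl_job, DRY_RUN, and the rounds and pause parameters are hypothetical helpers introduced here. The sketch also assumes that combineCtrl start returns once the crawlers are launched, so that the remaining steps can run in sequence.

```shell
#!/bin/sh
# Sketch of the trivial crawl workflow as one function.
# Only the combine* commands and options come from the manual page;
# run, crawl_job, DRY_RUN, rounds and pause are hypothetical.

# run: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# crawl_job JOB SEEDFILE HARVESTERS ROUNDS PAUSE
#   ROUNDS: how many times to recycle links before stopping (assumption)
#   PAUSE:  seconds to wait before each round (assumption)
crawl_job() {
    job=$1; seeds=$2; harvesters=$3; rounds=$4; pause=$5

    # Initialize database and configuration.
    run combineINIT --jobname "$job"

    # Load seed URLs; the list is read from standard input.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "combineCtrl load --jobname $job < $seeds"
    else
        combineCtrl load --jobname "$job" < "$seeds"
    fi

    # Start the crawl processes.
    run combineCtrl start --jobname "$job" --harvesters "$harvesters"

    # Occasionally schedule new links and report the ready queue size.
    i=0
    while [ "$i" -lt "$rounds" ]; do
        run sleep "$pause"
        run combineCtrl recyclelinks --jobname "$job"
        run combineCtrl stat --jobname "$job"
        i=$((i + 1))
    done

    # Stop the crawlers and export the records.
    run combineCtrl kill --jobname "$job"
    run combineExport --jobname "$job"
}
```

With DRY_RUN=1 set, calling crawl_job aatest seedURLs.txt 2 3 600 prints the commands without running them, which is a cheap way to review the sequence before starting a real job. A fixed number of rounds stands in for "when satisfied"; in practice you would more likely watch the output of combineCtrl stat and decide by hand.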
SEE ALSO
combineINIT, combineCtrl
Combine configuration documentation in /usr/share/doc/combine/.
AUTHOR
Anders Ardö, <anders.ardo@it.lth.se>
COPYRIGHT AND LICENSE
Copyright (C) 2005 Anders Ardö
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
See the file LICENCE included in the distribution at http://combine.it.lth.se/