Combine - Focused Web crawler framework


combine --jobname <name> --logname <id>


jobname is used to find the appropriate configuration (mandatory)

logname is used as an identifier in the log (in the MySQL table log)


Does crawling, parsing, and an optional topic-check, and stores the result in a MySQL database. Normally started with the combineCtrl command. Briefly, it gets a URL from the MySQL database, which acts as a common coordinator for a Combine job. The Web page is fetched, provided the robot exclusion protocol permits it. The HTML is cleaned using Tidy and parsed into metadata, headings, text, links and link anchors. Then it is stored in the MySQL database in a structured form (optionally only if it passes a topic-check, which keeps the crawler focused).
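
The parse and topic-check steps can be illustrated with a very rough shell sketch. This is not how Combine implements them: extract_links and passes_topic_check are hypothetical helpers, the link extraction only handles double-quoted href attributes, and a single keyword match stands in for the real topic-check.

```shell
# Hypothetical helpers for illustration only; they are not part of Combine.

extract_links() {
    # Print each double-quoted href value read from stdin, one per line.
    grep -o 'href="[^"]*"' | sed -e 's/^href="//' -e 's/"$//'
}

passes_topic_check() {
    # Succeed if the text on stdin mentions the term in $1 (case-insensitive).
    # A single keyword is a crude stand-in for a real topic definition.
    grep -qi -- "$1"
}
```

A page would be kept only if passes_topic_check succeeds, and the output of extract_links would be the candidate URLs for later crawling.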

A simple workflow for a trivial crawl job might look like:

    Initialize database and configuration
  combineINIT --jobname aatest
    Enter some seed URLs from a file with a list of URLs
  combineCtrl  load --jobname aatest < seedURLs.txt
    Start 2 crawl processes
  combineCtrl  start --jobname aatest --harvesters 2

    Every now and then, while the crawl runs, schedule new links for crawling
  combineCtrl recyclelinks --jobname aatest
    or look at the size of the ready queue
  combineCtrl stat --jobname aatest

    When satisfied, kill the crawlers
  combineCtrl kill --jobname aatest
    Export data records in a highly structured XML format
  combineExport --jobname aatest
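
The manual steps above could be wrapped in a small script. The following is a sketch under assumptions, not a tool shipped with Combine: the function names run and crawl_workflow, the DRY_RUN switch, and the rounds and pause arguments are all hypothetical. Only the Combine commands and options shown in the workflow above are used. By default it merely prints the commands it would run.

```shell
#!/bin/sh
# Sketch only. DRY_RUN, rounds and pause are knobs of this script,
# not Combine options.

run() {
    # In dry-run mode (the default) print the command, otherwise execute it.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

crawl_workflow() {
    job=$1            # job name, e.g. aatest
    seeds=$2          # file with a list of seed URLs
    rounds=${3:-3}    # how many times to recycle links
    pause=${4:-600}   # seconds to wait before each round

    # Initialize database and configuration
    run combineINIT --jobname "$job"

    # Load the seed URLs (needs stdin redirection, so handled separately)
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "combineCtrl load --jobname $job < $seeds"
    else
        combineCtrl load --jobname "$job" < "$seeds"
    fi

    # Start 2 crawl processes
    run combineCtrl start --jobname "$job" --harvesters 2

    # Periodically schedule new links and report the queue size
    i=0
    while [ "$i" -lt "$rounds" ]; do
        [ "${DRY_RUN:-1}" = "1" ] || sleep "$pause"
        run combineCtrl recyclelinks --jobname "$job"
        run combineCtrl stat --jobname "$job"
        i=$((i + 1))
    done

    # Stop the crawlers and export the records
    run combineCtrl kill --jobname "$job"
    run combineExport --jobname "$job"
}
```

Calling crawl_workflow aatest seedURLs.txt prints the planned commands; setting DRY_RUN=0 in the environment would execute them instead.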

For more complex jobs you have to edit the job configuration file.


combineINIT, combineCtrl

Combine configuration documentation in /usr/share/doc/combine/.


