NAME
combineCtrl - controls a Combine crawling job
SYNOPSIS
combineCtrl <action> --jobname <name>
where action can be one of start, kill, load, recyclelinks, reharvest, stat, howmany, records, hosts, initMemoryTables, open, stop, pause, continue
OPTIONS AND ARGUMENTS
jobname is mandatory and is used to find the appropriate configuration for the job.
Actions for starting/killing crawlers

start
    Takes an optional switch --harvesters n, where n is the number of crawler processes to start.

kill
    Kills all active crawlers (and their associated combineRun monitors) for jobname.
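A crawl's process lifecycle is a start followed later by a kill. A minimal sketch using only the actions above (the job name focustest is a placeholder):

```shell
# Start three crawler processes for the job "focustest"
combineCtrl start --jobname focustest --harvesters 3

# ... let the crawl run ...

# Kill all crawlers (and their combineRun monitors) for the job
combineCtrl kill --jobname focustest
```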
Actions for loading or recycling URLs for crawling

load
    Reads a list of URLs from STDIN (one per line) and schedules them for crawling.

recyclelinks
    Schedules all links newly found in crawled pages (since the last invocation of recyclelinks) for crawling.

reharvest
    Schedules all pages in the database for crawling again, in order to check whether they have changed.
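Because load reads URLs from STDIN, it combines naturally with shell pipelines. A sketch (the file seeds.txt and the job name focustest are placeholders):

```shell
# Seed the queue from a file of URLs, one per line
cat seeds.txt | combineCtrl load --jobname focustest

# Later, schedule links discovered during crawling
combineCtrl recyclelinks --jobname focustest
```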
Actions for controlling scheduling of URLs

open
    Opens the database for URL scheduling (e.g. after a stop).

stop
    Stops URL scheduling.

pause
    Pauses URL scheduling.

continue
    Continues URL scheduling after a pause.
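The scheduling actions make it possible to halt a crawl temporarily without killing the crawler processes. A sketch (the job name focustest is a placeholder):

```shell
# Temporarily suspend URL scheduling ...
combineCtrl pause --jobname focustest
# ... and resume it later
combineCtrl continue --jobname focustest

# Or stop scheduling entirely, then reopen the database afterwards
combineCtrl stop --jobname focustest
combineCtrl open --jobname focustest
```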
Misc actions

stat
    Prints rudimentary status of the ready queue (i.e. URLs eligible for crawling now).

howmany
    Prints rudimentary status of all URLs to be crawled.

records
    Prints the number of records in the SQL database.

hosts
    Prints rudimentary status of all hosts that have URLs to be crawled.

initMemoryTables
    Initializes the administrative MySQL tables that are kept in memory.
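The status actions are read-only, so they can be run while a crawl is active to monitor progress. A sketch (the job name focustest is a placeholder):

```shell
# Check the ready queue, the full set of pending URLs,
# and the database record count for a running job
combineCtrl stat    --jobname focustest
combineCtrl howmany --jobname focustest
combineCtrl records --jobname focustest
```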
DESCRIPTION
Implements various control functionality to administer a crawling job, such as starting and stopping crawlers, injecting URLs into the crawl queue, scheduling newly found links for crawling, controlling scheduling, etc.
This is the preferred way of controlling a crawl job.
EXAMPLES
echo 'http://www.yourdomain.com/' | combineCtrl load --jobname aatest
    Seed the crawling job aatest with a URL.

combineCtrl start --jobname aatest --harvesters 3
    Start 3 crawling processes for the job aatest.

combineCtrl recyclelinks --jobname aatest
    Schedule all newly found links for crawling.

combineCtrl stat --jobname aatest
    See how many URLs are eligible for crawling right now.
SEE ALSO
combine
Combine configuration documentation in /usr/share/doc/combine/.
AUTHOR
Anders Ardö, <anders.ardo@it.lth.se>
COPYRIGHT AND LICENSE
Copyright (C) 2005 Anders Ardö
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.4 or, at your option, any later version of Perl 5 you may have available.
See the file LICENCE included in the distribution at http://combine.it.lth.se/