scrape - command-line frontend to HTML::ListScraper
    scrape --core=all sample.html
    scrape --core=list [ --min-count=10 ] [ --detail=all ] [ --shapeless ] [ --ignore=b,i,em,strong,wbr ] [ --export=seq.txt ] sample.html
    scrape --core=item --import=seq.txt sample.html
    scrape --whole sample.html
    scrape --core=all --detail=all --acquire=Perl.html 'http://search.yahoo.com/search?p=Perl'
This script processes an HTML page with HTML::ListScraper and shows the result as YAML (down to the tag sequences, which are YAML scalars formatted by HTML::ListScraper::Interactive). It is meant for interactive exploration of HTML::ListScraper results and for fine-tuning its settings for a specific scraping application.
Every invocation requires exactly one source, either a file or a URL. URLs are recognized by their http scheme: source names that don't start with http:// are normally interpreted as file names. All other command-line switches are optional and mutually independent. Note that with no switches, the script doesn't output anything. The switches are as follows:
--core
Show found repeats. The value is a string, one of:
- item (or just "i")
Show only the first sequence instance.
- list (or just "l")
Show all instances of the first sequence.
- all (or just "a")
Show all instances of all found sequences.
By default, no matches are shown. When they are shown, each YAML document corresponds to an HTML::ListScraper::Sequence: it has the sequence length as the YAML field len, the repeat count as count, and a YAML sequence whose items correspond to HTML::ListScraper::Instance objects. Each item starts with a field keyed by the value of HTML::ListScraper::Instance::match, whose value is the start position, followed by score (for approximate matches only) and inst with the actual tag sequence. The tag sequence is formatted by HTML::ListScraper::Interactive::format_tags, with formatting options depending on the value of the --detail command-line switch.
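The three --core values described above can be modelled with a short Python sketch. It is an illustration only, not code from scrape or HTML::ListScraper: the function names and the list-of-lists data layout are hypothetical, and it assumes that "the first sequence instance" means the first instance of the first found sequence and that only the full names and the one-letter forms are accepted.

```python
# Illustrative model of the --core selection rule; not taken from scrape itself.
CORE_MODES = {"i": "item", "l": "list", "a": "all"}


def resolve_core(value):
    """Map a --core value, or its one-letter form, to the full mode name."""
    if value in CORE_MODES.values():
        return value
    if value in CORE_MODES:
        return CORE_MODES[value]
    raise ValueError(f"unknown --core value: {value!r}")


def select_instances(sequences, core=None):
    """Return the instances that would be shown.

    `sequences` is a hypothetical stand-in for the found sequences:
    a list in which each element is the list of instances of one sequence.
    """
    if core is None or not sequences:
        return []  # without --core, no matches are shown
    mode = resolve_core(core)
    if mode == "item":
        return [sequences[0][:1]]  # first instance of the first sequence
    if mode == "list":
        return [list(sequences[0])]  # all instances of the first sequence
    return [list(seq) for seq in sequences]  # everything
```

For example, with two found sequences of three and two instances, `select_instances(seqs, "i")` keeps one instance, while `select_instances(seqs, "all")` keeps all five.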
--shapeless
Boolean switch; sets HTML::ListScraper::shapeless to true.
--min-count
The value is an integer greater than 1, used to set the minimum repeat count that HTML::ListScraper requires of a sequence.
--detail
Specifies the formatting of found tag sequences. The value is a string selecting one of the following levels:
- Don't show the matches at all. This is useful to see just how many sequences were found, how many instances they have and where.
- Show just the tags, without text and links. This is the default value.
- Show tags and text.
- Show tags with links.
- Show all content fields of HTML::ListScraper::Tag: tags, text and links.
--whole
Boolean switch. When used, scrape outputs the whole sequence maintained by HTML::ListScraper as the first YAML document, which contains a single YAML scalar. Note that the sequence is formatted without attributes, without text and with tag positions, irrespective of the value of --detail.
--ignore
A comma-separated list of tags the HTML parser should ignore. The list items must not contain slashes or angle brackets. For every name in the list, both the opening and the closing tag are ignored. The default is b, i, em, strong; when specifying the value explicitly, you probably want to include these tags in it.
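The rules for the --ignore value can be sketched in Python as follows. This is a hedged illustration rather than the script's actual parsing code: the function name is hypothetical, and both the whitespace stripping and the decision to raise an error on a malformed item are assumptions, since the text above only says what the items should not contain.

```python
# Illustrative model of the --ignore value; not taken from scrape itself.
DEFAULT_IGNORE = ["b", "i", "em", "strong"]  # the default stated in the text


def parse_ignore(value=None):
    """Turn a comma-separated --ignore value into a list of bare tag names."""
    if value is None:
        return list(DEFAULT_IGNORE)  # switch not given: use the default
    names = [name.strip() for name in value.split(",") if name.strip()]
    for name in names:
        # Items are bare names: no slashes and no angle brackets.
        if any(ch in name for ch in "/<>"):
            raise ValueError(f"not a bare tag name: {name!r}")
    return names
```

Note that an explicit value replaces the default rather than extending it, which is why `--ignore=b,i,em,strong,wbr` in the synopsis repeats the four default tags before adding wbr.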
--export
Tells scrape to dump the first found sequence into the file specified by the option's value. If the file already exists, it is overwritten. When no sequence is found, nothing is dumped. Note that the sequence is formatted with just tags, irrespective of the value of --detail.
--import
Tells scrape to call HTML::ListScraper::find_known_sequence instead of
HTML::ListScraper::find_sequences, with arguments read from the file specified by the option's value. Lines of that file are converted to tag names by
--acquire
Tells scrape to save the downloaded HTML into the file specified by the option's value. If the file already exists, it is overwritten. Using this switch causes scrape to interpret the source as a URL, irrespective of its scheme, and pass it to LWP.
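Taken together, the source-interpretation rules (the http:// prefix test and the --acquire override) amount to a small decision, sketched here in Python. The function and parameter names are hypothetical and the sketch only restates the rules as documented; it does not reproduce the script's implementation.

```python
# Illustrative model of how the source argument is interpreted; not scrape's code.
def classify_source(source, acquire=None):
    """Return "url" or "file" for the mandatory source argument.

    `acquire` stands for the value of --acquire, or None when it is not given.
    """
    if acquire is not None:
        return "url"  # --acquire forces URL handling, irrespective of scheme
    if source.startswith("http://"):
        return "url"  # recognized by the http scheme
    return "file"  # everything else is normally a file name
```

Under these rules, `classify_source("sample.html")` gives "file", while the same name combined with `--acquire=Perl.html` would be fetched as a URL.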
COPYRIGHT & LICENSE
Copyright 2007 Vaclav Barta, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.