HTML::ExtractText::Extra - extra useful HTML::ExtractText
At its simplest, use CSS selectors:
    # Same usage as HTML::ExtractText, but now we have extra
    # optional options (default values are shown):
    use HTML::ExtractText::Extra;
    my $ext = HTML::ExtractText::Extra->new(
        whitespace => 1, # strip leading/trailing whitespace
        nbsp       => 1, # replace non-breaking spaces with regular ones
    );

    $ext->extract(
        {
            page_title => 'title', # same extraction as HTML::ExtractText
            links      => ['a', qr{http://|www\.} ], # strip what matches
            bold       => ['b', sub { "<$_[0]>"; } ], # wrap what's found in <>
        },
        $html,
    ) or die "Error: $ext";

    print "Page title is $ext->{page_title}\nLinks are: $ext->{links}";
The module offers extra options and post-processing that the vanilla HTML::ExtractText does not provide.
HTML::ExtractText
This module offers all the standard methods and behaviour HTML::ExtractText provides. See its documentation for details.
->new
    my $ext = HTML::ExtractText::Extra->new(
        whitespace => 1, # strip leading/trailing whitespace
        nbsp       => 1, # replace non-breaking spaces with regular ones
    );
whitespace
my $ext = HTML::ExtractText::Extra->new( whitespace => 1, );
Optional. Defaults to: 1. When set to a true value, leading and trailing whitespace will be trimmed from the results.
nbsp
my $ext = HTML::ExtractText::Extra->new( nbsp => 1, );
Optional. Defaults to: 1. When set to a true value, non-breaking spaces in the results will be converted into regular spaces. Note that this does not affect how the normal whitespace folding operates, so foo &nbsp; bar will end up having 3 spaces between foo and bar.
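The interaction between folding and nbsp conversion can be sketched in plain Perl. This is only an approximation under stated assumptions, not the module's actual implementation: the input string is hypothetical, folding is assumed to collapse runs of ordinary whitespace only, and the order of the steps is a guess.

```perl
use strict;
use warnings;

# Hypothetical decoded element text: "foo", spaces, U+00A0, newline, space, "bar".
my $raw = "  foo  \x{A0}\n bar  ";

# Assumed folding step: collapse runs of ordinary whitespace into one space.
# The non-breaking space (U+00A0) is not part of the class, so it survives.
(my $result = $raw) =~ s/[ \t\r\n\f]+/ /g;

# Approximation of nbsp => 1: each non-breaking space becomes a regular space.
$result =~ s/\x{A0}/ /g;

# Approximation of whitespace => 1: trim leading and trailing whitespace only.
$result =~ s/^\s+|\s+$//g;

print "[$result]\n";   # prints "[foo   bar]" (three spaces in the middle)
```

Because the conversion runs after folding in this sketch, the space on either side of the former non-breaking space is kept, which is where the three spaces come from.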
->extract
    $ext->extract(
        {
            page_title => 'title', # same extraction as HTML::ExtractText
            links      => ['a', qr{http://|www\.} ], # strip what matches
            bold       => ['b', sub { "<$_[0]>"; } ], # wrap what's found in <>
        },
        $html,
    ) or die "Error: $ext";
This module extends the values accepted in the hashref given as the first argument to the ->extract method. Instead of a plain string containing the selector, a value can be an arrayref, where the first element is the selector you want to match and the rest of the elements are as follows:
$ext->extract({ links => ['a', qr{http://|www\.} ] }, $html )
When the second element of the arrayref is a regex reference, any text that matches the regex will be stripped from the text that is being extracted.
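As a rough illustration of what stripping means here, the following plain-Perl sketch applies the same pattern to a few hypothetical strings standing in for extracted link texts. It assumes every match is removed (a global substitution); the module's internals may differ.

```perl
use strict;
use warnings;

# Hypothetical texts, standing in for what the 'a' selector matched.
my @found = ('http://example.com/docs', 'www.example.org', 'plain link');

# The same pattern that is passed as the second arrayref element above.
my $strip = qr{http://|www\.};

# Assumption: every match of the pattern is deleted from each text.
my @stripped = map { (my $t = $_) =~ s/$strip//g; $t } @found;

print join(', ', @stripped), "\n";   # prints "example.com/docs, example.org, plain link"
```

Text that does not match the pattern passes through unchanged.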
$ext->extract({ links => ['a', sub { "<$_[0]>"; } ] }, $html )
When the second element of the arrayref is a code reference, it will be called for each piece of text found during extraction, with that text as the first element of its @_. Whatever the sub returns will be used as the result of the extraction.
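The calling convention can be tried out without the module. In this minimal sketch the list of found texts is hypothetical; only the callback itself is taken from the example.

```perl
use strict;
use warnings;

# Hypothetical texts, standing in for what the 'b' selector matched.
my @found = ('first', 'second');

# The callback from the example: it receives the text in $_[0]
# and returns the replacement text.
my $wrap = sub { "<$_[0]>" };

# Call the sub once per found text and keep whatever it returns.
my @processed = map { $wrap->($_) } @found;

print "@processed\n";   # prints "<first> <second>"
```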
$ext->whitespace(0);
Accessor method for the whitespace argument to ->new.
$ext->nbsp(0);
Accessor method for the nbsp argument to ->new.
HTML::ExtractText - a basic version of this extractor
Mojo::DOM, Text::Balanced, HTML::Extract
Fork this module on GitHub: https://github.com/zoffixznet/HTML-ExtractText-Extra
To report bugs or request features, please use https://github.com/zoffixznet/HTML-ExtractText-Extra/issues
If you can't access GitHub, you can email your request to bug-html-extracttext-extra at rt.cpan.org
You can use and distribute this module under the same terms as Perl itself. See the LICENSE file included in this distribution for complete details.