HTML::Feature - Extract Feature Sentences From HTML Documents


    use HTML::Feature;

    my $f = HTML::Feature->new(enc_type => 'utf8');
    my $result = $f->parse('');

    print "Title:"        , $result->title(), "\n";
    print "Description:"  , $result->desc(),  "\n";
    print "Featured Text:", $result->text(),  "\n";

    # you can also get an HTML::Element object
    $f = HTML::Feature->new();
    $result = $f->parse('',{element_flag => 1});
    print "HTML Element:",  $result->element->as_HTML, "\n";

    # a simpler way:

    use HTML::Feature qw(feature);
    print scalar feature('');

    # very simple!


This module extracts blocks of feature sentences from an HTML document.

Unlike other modules that perform similar tasks, this module by default extracts blocks without morphological analysis; it relies on simple statistical processing instead.

Because of this, HTML::Feature has an advantage over other similar modules in that it can be applied to documents in any language.
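
To illustrate why a purely statistical approach does not depend on the document's language, here is a toy sketch. It is NOT the algorithm HTML::Feature uses; the function name score_block and the scoring formula (visible text length weighted by the share of text that is not link text) are assumptions made up for this example.

```perl
use strict;
use warnings;

# Toy scorer (hypothetical, not the module's algorithm):
# rewards long visible text and penalizes link-heavy blocks.
sub score_block {
    my ($html) = @_;

    # Sum the length of the visible text inside <a>...</a> elements
    my $link_len = 0;
    while ($html =~ m{<a\b[^>]*>(.*?)</a>}gis) {
        (my $anchor = $1) =~ s/<[^>]+>//g;
        $link_len += length $anchor;
    }

    # Strip all tags to get the visible text of the whole block
    (my $text = $html) =~ s/<[^>]+>//g;
    my $text_len = length $text;
    return 0 unless $text_len;

    # Text length weighted by the fraction that is not link text
    return $text_len * (1 - $link_len / $text_len);
}

my @blocks = (
    '<div><a href="/">Home</a> <a href="/about">About</a></div>',
    '<div>This paragraph carries the actual content of the page.</div>',
);

# Pick the block with the highest score
my ($best) = sort { score_block($b) <=> score_block($a) } @blocks;
print "$best\n";
```

Nothing in the scorer tokenizes words or consults a dictionary, so the same code applies whether the text is English, Japanese, or anything else.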



    my $f = HTML::Feature->new(%param);
    # or
    my $f = HTML::Feature->new(
        engine       => $class,              # backend engine module (default: 'TagStructure')
        max_bytes    => 5000,                # maximum number of bytes per node to analyze (default: '')
        min_bytes    => 10,                  # minimum number of bytes per node to analyze (default: '')
        enc_type     => 'euc-jp',            # encoding of return values (default: 'utf-8')
        user_agent   => 'my-agent-name',     # LWP::UserAgent->agent (default: 'libwww-perl/#.##')
        http_proxy   => 'http://proxy:3128', # HTTP proxy server (default: '')
        timeout      => 10,                  # timeout in seconds (default: 180)
        element_flag => 1,                   # return an HTML::Element object as well (default: '')
    );

Instantiates a new HTML::Feature object. Takes the following parameters:


Specifies the class name of the engine that you want to use.

HTML::Feature is designed to accept different engines that change its behavior. To customize how HTML::Feature works, specify your own engine class in this parameter.

The rest of the arguments are directly passed to the HTML::Feature::Engine object constructor.


    my $result = $f->parse($url);
    # or
    my $result = $f->parse($html_ref);
    # or
    my $result = $f->parse($http_response);

Parses the given argument. The argument can be a URL, a string of HTML (which must be passed as a scalar reference), or an HTTP::Response object. HTML::Feature detects the type of the argument and delegates to the appropriate method (see below).
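
The detection step can be pictured with a small sketch in plain Perl. This is an illustration only, not the module's source: the helper name classify_input and the labels 'url', 'html', and 'response' are hypothetical.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed);

# Hypothetical helper: reports which kind of input was given.
# The returned labels are illustrative only.
sub classify_input {
    my ($arg) = @_;

    # An object that is (or inherits from) HTTP::Response
    return 'response' if blessed($arg) && $arg->isa('HTTP::Response');

    # A reference to a scalar holding the HTML source
    return 'html' if ref($arg) eq 'SCALAR';

    # Any other plain, non-empty string is treated as a URL
    return 'url' if defined $arg && !ref($arg) && length $arg;

    die "Unsupported argument\n";
}

my $html = '<html><body><p>Hello</p></body></html>';
print classify_input('http://example.com/'), "\n";   # url
print classify_input(\$html), "\n";                  # html
```

Requiring a scalar reference for HTML is what makes the distinction unambiguous: a plain string is always taken to be a URL, never markup.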


Parses a URL. This method uses LWP::UserAgent to fetch the given URL.


Parses a string containing HTML.


Parses an HTTP::Response object.


    $data = $f->extract(url => $url);
    # or
    $data = $f->extract(string => $html);

HTML::Feature::extract() is deprecated and exists for backwards compatibility only. Use HTML::Feature::parse() instead.

extract() extracts blocks of feature sentences from the given document, and returns a data structure like this:

    $data = {
        title => $title,
        description => $desc,
        block => [
            {
                contents => $contents,
                score => $score
            },
        ]
    };
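
As a sketch of how such a structure might be consumed, the snippet below picks the highest-scoring block. It assumes that each entry in block is a hash reference with contents and score keys; the sample values are made up and no call to extract() is involved.

```perl
use strict;
use warnings;

# Sample data shaped like the structure described above (values are invented)
my $data = {
    title       => 'Sample page',
    description => 'A sample description',
    block       => [
        { contents => 'Navigation links',       score => 1.5 },
        { contents => 'Main article paragraph', score => 9.2 },
        { contents => 'Footer notice',          score => 0.7 },
    ],
};

# Sort the blocks by descending score and keep the first one
my ($best) = sort { $b->{score} <=> $a->{score} } @{ $data->{block} };

print "Title: $data->{title}\n";
print "Best block: $best->{contents} (score $best->{score})\n";
```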


feature() is a simple wrapper that performs new() and parse() in one step. If you do not need complex operations, calling this function is enough. In scalar context, it returns only the feature text. In list context, it returns additional metadata as a hash.

This function is exported on demand.

    use HTML::Feature qw(feature);
    print scalar feature($url);  # print featured text

    my %data = feature($url);    # list context: returns a hash
    print $data{title};
    print $data{desc};
    print $data{text};
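
The context-sensitive return value follows a common Perl idiom built on wantarray. The sketch below uses a hypothetical stand-in named feature_like with hard-coded values; it shows the idiom only and is not the module's implementation.

```perl
use strict;
use warnings;

# Hypothetical stand-in for feature(); hard-coded values replace a real parse
sub feature_like {
    my %data = (
        title => 'Example title',
        desc  => 'Example description',
        text  => 'Example feature text',
    );

    # wantarray is true in list context and false in scalar context
    return wantarray ? %data : $data{text};
}

my $text = feature_like();    # scalar context: text only
my %all  = feature_like();    # list context: key/value pairs

print "$text\n";
print join(', ', sort keys %all), "\n";
```

This is why the example above wraps the call in scalar when printing: print supplies list context, which would otherwise yield the whole hash.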


Takeshi Miki <>

Special thanks to Daisuke Maki


Copyright (C) 2007 Takeshi Miki

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.