NAME

Dancer::SearchApp::HTMLSnippet - HTML snippet extractor

SYNOPSIS

    my @document_snippets = Dancer::SearchApp::HTMLSnippet->extract_highlights(
        html => $html,
        hl_tag => '<em>',
        hl_end => '</em>',
        snippet_length => 150,
        max_snippets => 8,
    );

METHODS

Dancer::SearchApp::HTMLSnippet->extract_highlights

    my @document_snippets = Dancer::SearchApp::HTMLSnippet->extract_highlights(
        html => $html,
        hl_tag => '<em>',
        hl_end => '</em>',
        snippet_length => 150,
        max_snippets => 8,
    );

This extracts the highlight snippets and metadata from the HTML as prepared by Tika and highlighted by Elasticsearch. It returns a list of hash references. Each one contains a (well-formed) HTML snippet with the highlights, plus a page entry giving the original page number if the snippet originated from within, or crosses, a <p class="page\d+"> section.

  {
      html => 'this is a <b>result</b> you searched for',
      page => 42,
  }
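
As a minimal sketch of how the returned list might be consumed, the following groups snippets by their page entry. The sample data only mimics the structure documented above; in real use the list would come from extract_highlights. The 'unknown' bucket, and the assumption that page is missing or undefined for snippets outside any page section, are illustrative and not confirmed by the module.

```perl
use strict;
use warnings;

# Sample data shaped like the documented return value of extract_highlights.
# In real use, this list would come from the method call itself.
my @document_snippets = (
    { html => 'this is a <b>result</b> you searched for', page => 42 },
    { html => 'another <b>result</b> on the same page',   page => 42 },
    # Assumption: snippets outside a page section carry no page entry.
    { html => 'a <b>result</b> outside any page section' },
);

# Group the snippet HTML by page number; snippets without a page
# go into a hypothetical 'unknown' bucket.
my %by_page;
for my $snippet (@document_snippets) {
    my $page = defined $snippet->{page} ? $snippet->{page} : 'unknown';
    push @{ $by_page{$page} }, $snippet->{html};
}

# Print the snippets, one page at a time.
for my $page (sort keys %by_page) {
    print "Page $page:\n";
    print "  $_\n" for @{ $by_page{$page} };
}
```

Grouping by page keeps related hits together when rendering a result list, and the defined check avoids warnings for snippets that carry no page number.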

Dancer::SearchApp::HTMLSnippet->extract_highlights

  my @hits = Dancer::SearchApp::HTMLSnippet->extract_highlights(
      html => $html,
      max_length => 300,
  );
  
  for my $entry (@hits) {
      print "Match: $entry->{start} ($entry->{length} bytes)\n";
  }

Dancer::SearchApp::HTMLSnippet->cleanup_tika

  my $content = Dancer::SearchApp::HTMLSnippet->cleanup_tika( $html );

Cleans up HTML output from Apache Tika.

BUG TRACKER

Please report bugs in this module via the RT CPAN bug queue at https://rt.cpan.org/Public/Dist/Display.html?Name=Dancer-SearchApp or via mail to dancer-searchapp-Bugs@rt.cpan.org.

AUTHOR

Max Maischein corion@cpan.org

COPYRIGHT (c)

Copyright 2014-2016 by Max Maischein corion@cpan.org.

LICENSE

This module is released under the same terms as Perl itself.