KSx::Highlight::Summarizer - KinoSearch Highlighter subclass that provides more comprehensive summaries
0.06 (beta)
    use KSx::Highlight::Summarizer;
    my $summarizer = new KSx::Highlight::Summarizer
        searchable => $searcher,
        query      => 'foo bar',
        field      => 'content',

        # optional:
        pre_tag        => '<b>',
        post_tag       => '</b>',
        encoder        => sub {
            my $str = shift;
            $str =~ s/([&'"<])/'&#'.ord($1).';'/eg;
            $str
        },
        page_handler   => sub { "<h3>Page $_[1]:</h3>" },
        ellipsis       => "\x{2026}", # default: ' ... '
        excerpt_length => 150,        # default: 200
        summary_length => 400,
    ;

    my $excerpt = $summarizer->create_excerpt( $hit );
This module extends KinoSearch::Highlight::Highlighter (which provides an excerpt for a search result, with search words highlighted) to provide various customisations, especially summaries, i.e., multiple excerpts joined together with ellipses.
The superclass finds the best location within the text of a search result, takes a single piece of text surrounding it, and then formats it, highlighting words as appropriate. This module also takes the second-best location and creates an excerpt for that (removing overlap), and so on until the summary_length is reached or exceeded.
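The following is a minimal sketch of that idea in plain Perl, not the module's actual code. It assumes the candidate locations are already available as a hypothetical list of [offset, score] pairs, centres a fixed-size window on each one from best to worst, stops once the untrimmed excerpt lengths reach summary_length, merges overlapping windows, and joins the results in document order. The sub name sketch_summary and its argument layout are invented for illustration; the real module also trims excerpts and highlights terms, which this sketch omits.

```perl
use strict;
use warnings;

# Illustrative only: sketch_summary and the [offset, score] hit format are
# assumptions made for this example, not part of KSx::Highlight::Summarizer.
sub sketch_summary {
    my (%args) = @_;
    my $text           = $args{text};
    my $hits           = $args{hits};
    my $excerpt_length = $args{excerpt_length} || 200;
    # With no summary_length, a single excerpt is enough to reach the limit.
    my $summary_length = $args{summary_length} || $excerpt_length;
    my $ellipsis       = defined $args{ellipsis} ? $args{ellipsis} : ' ... ';

    my @ranges;           # chosen [start, end) character ranges
    my $collected = 0;    # total excerpt length, counted before any clipping

    # Visit candidate locations from best score to worst.
    for my $hit ( sort { $b->[1] <=> $a->[1] } @$hits ) {
        last if $collected >= $summary_length;

        # Centre a window of excerpt_length characters on the hit.
        my $start = $hit->[0] - int( $excerpt_length / 2 );
        $start = 0 if $start < 0;
        my $end = $start + $excerpt_length;
        $end = length $text if $end > length $text;

        push @ranges, [ $start, $end ];
        $collected += $excerpt_length;
    }

    # Merge overlapping ranges so that no text appears twice.
    my @merged;
    for my $r ( sort { $a->[0] <=> $b->[0] } @ranges ) {
        if ( @merged && $r->[0] <= $merged[-1][1] ) {
            $merged[-1][1] = $r->[1] if $r->[1] > $merged[-1][1];
        }
        else {
            push @merged, [@$r];
        }
    }

    # Join the surviving pieces with the ellipsis mark.
    return join $ellipsis,
        map { substr $text, $_->[0], $_->[1] - $_->[0] } @merged;
}
```

Called with two distant hits and a summary_length of twice the excerpt_length, the sketch returns two pieces joined by the ellipsis; with two hits a few characters apart, it returns one slightly longer piece.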
The constructor, new, takes hash-style arguments, as shown in the synopsis above. The arguments are as follows:
searchable - A reference to an object that isa KinoSearch::Search::Searchable (e.g., a KinoSearch::Searcher)
query - A query string or object
field - The name of the field for which to make a summary
pre_tag, post_tag - These two are strings of text to be inserted around highlighted words, such as HTML tags. The defaults are '<strong>' and '</strong>'.
encoder - A code ref that is expected to encode the text fed to it, e.g., with HTML entities
page_handler - A code ref. If this is provided, it will be called for every page break (form feed; ASCII character 12) in the summary, and its return value substituted for that page break. The arguments will be (0) the hit (a KinoSearch::Doc::HitDoc object) and (1) the page number.
ellipsis - The ellipsis mark to use. The default is three ASCII dots surrounded by spaces: ' ... '
excerpt_length - The length of each excerpt (default is 200), not including ellipses. An excerpt may end up shorter than this, because the start is trimmed to the nearest sentence boundary or page break, and the end is trimmed to the nearest word boundary.
summary_length - The approximate length of the summary, not including ellipses. Excerpts are collected together until the lengths of the excerpts (before trimming) equal or exceed the number passed to this argument. If this is omitted, only one excerpt will be made.
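To show how the encoder and page_handler callbacks interact, here is a small stand-alone sketch in plain Perl. It is not the module's code: sketch_render and the first_page argument are hypothetical names, and two behaviours are assumptions made for this example, namely that the page number passed to the handler is that of the page following the break, and that a form feed is left in place when no handler is given.

```perl
use strict;
use warnings;

# Illustrative only: sketch_render is a hypothetical helper, not part of
# the module's API. It encodes the text between page breaks and replaces
# each form feed with whatever the page handler returns.
sub sketch_render {
    my (%args) = @_;
    my $encoder      = $args{encoder} || sub { $_[0] };
    my $page_handler = $args{page_handler};
    my $page         = $args{first_page} || 1;    # assumed page of the first character

    # Split on form feeds, keeping each one as its own list element.
    my @pieces = split /(\f)/, $args{text}, -1;

    my $out = '';
    for my $piece (@pieces) {
        if ( $piece eq "\f" ) {
            $page++;    # assumption: text after a break is on the next page
            $out .= $page_handler
                ? $page_handler->( $args{hit}, $page )
                : $piece;
        }
        else {
            $out .= $encoder->($piece);
        }
    }
    return $out;
}
```

Because the encoder only ever sees the text between page breaks, the handler's return value (HTML markup, for instance) is inserted as-is and is not encoded a second time in this sketch.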
The create_excerpt method requires a KinoSearch::Doc::HitDoc object as its sole argument. It creates and returns a summary.
A very long custom ellipsis, or two page breaks a few characters apart, can break the page-counting algorithm.
This module requires perl and the following modules, which are available from CPAN:
Number::Range
Hash::Util::FieldHash::Compat
The development version of KinoSearch available at http://www.rectangular.com/svn/kinosearch/trunk, revision 4604 or later. It has only been tested with revision 4625.
Copyright (C) 2008-9 Father Chrysostomos <sprout at, um, cpan.org>
This program is free software; you may redistribute or modify it (or both) under the same terms as perl.
Much of the code in this module is based on revision 3122 of Marvin Humphrey's KinoSearch::Highlight::Highlighter, of which this is a subclass.
KinoSearch::Highlight::Highlighter