NAME

Search::Fulltext::Tokenizer::Ngram - Character n-gram tokenizer for Search::Fulltext

VERSION

version 0.01

SYNOPSIS

  use utf8;
  use Search::Fulltext;
  use Search::Fulltext::Tokenizer::Bigram;
  
  my $searcher = Search::Fulltext->new(
      docs => [
          'ハンプティ・ダンプティ 塀の上',
          'ハンプティ・ダンプティ 落っこちた',
          '王様の馬みんなと 王様の家来みんなでも',
          'ハンプティを元に 戻せなかった',
      ],
      tokenizer => q/perl 'Search::Fulltext::Tokenizer::Bigram::get_tokenizer'/,
  );
  my $hit_document_ids = $searcher->search('ハンプティ');  # [0, 1, 3]

DESCRIPTION

This module provides character N-gram tokenizers for Search::Fulltext.

By default, 1-gram, 2-gram, and 3-gram tokenizers (unigram, bigram, and trigram) are available.
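
As an illustration of what a character N-gram is, here is a minimal sketch in plain Perl. The helper char_ngrams is hypothetical and is not part of this module; the real tokenizer may treat whitespace, punctuation, and token positions differently.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper, NOT part of Search::Fulltext::Tokenizer::Ngram.
# Returns every run of $n consecutive characters found in $text.
sub char_ngrams {
    my ($text, $n) = @_;
    # Index of the last position where a full N-gram still fits.
    my $last_start = length($text) - $n;
    # When $text is shorter than $n the range is empty, so no tokens result.
    return map { substr($text, $_, $n) } 0 .. $last_start;
}

# The five-character query from the SYNOPSIS yields four bigrams.
my @bigrams = char_ngrams('ハンプティ', 2);
# @bigrams is ('ハン', 'ンプ', 'プテ', 'ティ')
```

Because tokens are cut by character position rather than at word boundaries, this style of tokenization suits text without spaces between words, such as the Japanese documents in the SYNOPSIS.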

CREATING AN N-GRAM TOKENIZER FOR N > 3

To use an N-gram tokenizer with N > 3, create one by subclassing Search::Fulltext::Tokenizer::Ngram:

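  package My::Tokenizer::42gram;
  
  use strict;
  use warnings;
  use parent qw/Search::Fulltext::Tokenizer::Ngram/;
  
  # One shared generator configured for N = 42.
  my $iterator_generator = __PACKAGE__->new(42);
  
  # Returns the callback that produces a token iterator for its arguments.
  sub get_tokenizer {
      return sub { $iterator_generator->create_token_iterator(@_) };
  }
  
  1;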

SEE ALSO

Search::Fulltext::Tokenizer::Unigram, Search::Fulltext::Tokenizer::Bigram, Search::Fulltext::Tokenizer::Trigram

AUTHOR

Koichi SATOH <sekia@cpan.org>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2014 by Koichi SATOH.

This is free software, licensed under:

  The MIT (X11) License
