NAME

Lingua::Stem::Snowball - Perl interface to Snowball stemmers.

SYNOPSIS

    use Lingua::Stem::Snowball qw( stem );

    my @words = qw( horse hooves );

    # OO interface:
    my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );
    $stemmer->stem_in_place( \@words ); # qw( hors hoov )

    # plain interface:
    my @stems = stem( 'en', \@words );

DESCRIPTION

Stemming reduces related words to a common root form. For instance, "horse", "horses", and "horsing" all become "hors". Most commonly, stemming is deployed as part of a search application, allowing searches for a given term to match documents which contain other forms of that term.
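
The search use case can be sketched as a tiny inverted index keyed by stem rather than by surface form. This is an illustration only: the toy corpus, the \w+ tokenizer, and the variable names are assumptions and are not part of the module.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

# Hypothetical toy corpus: document id => text.
my %docs = (
    1 => 'Horses need their hooves trimmed',
    2 => 'A horse ran past',
    3 => 'Nothing relevant here',
);

# Index every document under the stems of its words.
my %index;
for my $id ( keys %docs ) {
    my @terms = map { lc $_ } $docs{$id} =~ /(\w+)/g;
    $stemmer->stem_in_place( \@terms );
    $index{$_}{$id} = 1 for @terms;
}

# Stem the query the same way, then look it up.
my $query = $stemmer->stem('horse');
my @hits  = sort { $a <=> $b } keys %{ $index{$query} || {} };
print "documents matching 'horse': @hits\n";
```

Because "horse" and "horses" share a stem, the query should find both document 1 and document 2, even though only document 2 contains the literal query term.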

This module is very similar to Lingua::Stem. However, Lingua::Stem is pure Perl, while Lingua::Stem::Snowball is an XS module which provides a Perl interface to the C version of the Snowball stemmers (http://snowball.tartarus.org).

Supported Languages

The following stemmers are available (as of Lingua::Stem::Snowball 0.94):

    |-----------------------------------------------------------|
    | Language   | ISO code | default encoding | also available |
    |-----------------------------------------------------------|
    | Danish     | da       | ISO-8859-1       | UTF-8          | 
    | Dutch      | nl       | ISO-8859-1       | UTF-8          | 
    | English    | en       | ISO-8859-1       | UTF-8          |
    | Finnish    | fi       | ISO-8859-1       | UTF-8          | 
    | French     | fr       | ISO-8859-1       | UTF-8          |
    | German     | de       | ISO-8859-1       | UTF-8          | 
    | Italian    | it       | ISO-8859-1       | UTF-8          | 
    | Norwegian  | no       | ISO-8859-1       | UTF-8          | 
    | Portuguese | pt       | ISO-8859-1       | UTF-8          | 
    | Spanish    | es       | ISO-8859-1       | UTF-8          | 
    | Swedish    | sv       | ISO-8859-1       | UTF-8          | 
    | Russian    | ru       | KOI8-R           | UTF-8          | 
    |-----------------------------------------------------------|

Benchmarks

Here is a comparison of Lingua::Stem::Snowball and Lingua::Stem, using The Works of Edgar Allan Poe, volumes 1-5 (via Project Gutenberg) as source material. It was produced on a 3.2GHz Pentium 4 running FreeBSD 5.3 and Perl 5.8.7. (The benchmarking script is included in this distribution: bin/benchmark_stemmers.plx.)

    |--------------------------------------------------------------------|
    | total words: 454285 | unique words: 22748                          |
    |--------------------------------------------------------------------|
    | module                        | config        | avg secs | words/s |
    |--------------------------------------------------------------------|
    | Lingua::Stem 0.81             | no cache      | 2.029    | 223881  |
    | Lingua::Stem 0.81             | cache level 2 | 1.280    | 355025  |
    | Lingua::Stem::Snowball 0.94   | stem          | 1.426    | 318636  |
    | Lingua::Stem::Snowball 0.94   | stem_in_place | 0.641    | 708495  |
    |--------------------------------------------------------------------|

METHODS / FUNCTIONS

new

    my $stemmer = Lingua::Stem::Snowball->new(
        lang     => 'es', 
        encoding => 'UTF-8',
    );
    die $@ if $@;

Create a Lingua::Stem::Snowball object. new() accepts the following hash-style parameters:

  • lang: An ISO code taken from the table of supported languages, above.

  • encoding: A supported character encoding.

Be careful with the values you supply to new(). If lang is invalid, Lingua::Stem::Snowball does not throw an exception, but instead sets $@. Also, if you supply an invalid combination of values for lang and encoding, Lingua::Stem::Snowball will not warn you, but the behavior will change: stem() will always return undef, and stem_in_place() will be a no-op.
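
Because neither failure mode raises an exception, it can help to validate the arguments up front. The following is a minimal sketch, not something the module provides: checked_stemmer() is a hypothetical helper that consults stemmers() before calling new(), and clears $@ first on the assumption that a successful new() may leave a stale $@ untouched.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball qw( stemmers );

# Hypothetical helper (not part of Lingua::Stem::Snowball).
sub checked_stemmer {
    my (%args) = @_;
    my $lang = defined $args{lang} ? $args{lang} : '';

    # Reject language codes that stemmers() does not list.
    my %supported = map { $_ => 1 } stemmers();
    die "No Snowball stemmer for '$lang'\n" unless $supported{$lang};

    # Clear any stale error so that a non-empty $@ afterwards can only
    # have been set by new().
    $@ = '';
    my $stemmer = Lingua::Stem::Snowball->new(%args);
    my $error   = $@;
    die "$error\n" if $error;

    return $stemmer;
}

my $stemmer = checked_stemmer( lang => 'es', encoding => 'UTF-8' );
```

This sketch does not detect an invalid combination of lang and encoding; for that, compare the pair against the table of supported languages before constructing the object.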

stem

    @stemmed = $stemmer->stem( WORDS, [IS_STEMMED] );
    @stemmed = stem( ISO_CODE, WORDS, [LOCALE, IS_STEMMED] );

Return lowercased and stemmed output. WORDS may be either a reference to an array of words or a single scalar word.

In scalar context, stem() returns only the first item in the array of stems:

    $stem       = $stemmer->stem($word);
    $first_stem = $stemmer->stem(\@words); # probably not what you want

LOCALE has no effect; it is only there as a placeholder for backwards compatibility (see Changes). IS_STEMMED must be a reference to a scalar; if it is supplied, it will be set to 1 if the output differs from the input in some way, 0 otherwise.
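
As a short sketch of the IS_STEMMED parameter (the sample words and variable names are chosen for illustration only), pass a reference to a scalar and inspect it after the call:

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

# At least one output differs from its input, so the flag is set.
my @words = qw( Horse hooves );
my $changed;
my @stems = $stemmer->stem( \@words, \$changed );
print "stems: @stems (changed: $changed)\n";

# "cat" is already lowercase and has no suffix for the English stemmer
# to strip, so the flag is expected to stay 0.
my $unchanged;
my $stem = $stemmer->stem( 'cat', \$unchanged );
print "stem: $stem (changed: $unchanged)\n";
```

Note that lowercasing alone counts as a difference, since the flag compares the output against the original input.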

stem_in_place

    $stemmer->stem_in_place(\@words);

This is a high-performance, streamlined version of stem() (in fact, stem() calls stem_in_place() internally). It has no return value, instead modifying each item in an existing array of words. The words must already be in lower case.
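
Since stem_in_place() overwrites its input and expects lowercase words, one possible pattern is to lowercase into a copy and stem the copy, which keeps the original tokens available. The variable names below are illustrative, and the plain lc call assumes ASCII input; text in other encodings may need locale-aware or Unicode-aware lowercasing.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

my @tokens = qw( Horses HOOVES horsing );

# Lowercase into a fresh array, then stem that array in place.
my @stems = map { lc $_ } @tokens;
$stemmer->stem_in_place( \@stems );

# The originals are untouched, so the two arrays can be paired up.
for my $i ( 0 .. $#tokens ) {
    print "$tokens[$i] => $stems[$i]\n";
}
```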

lang

    my $lang = $stemmer->lang;
    $stemmer->lang($iso_language_code);

Accessor/mutator for the lang parameter. If there is no stemmer for the supplied ISO code, the language is not changed (but $@ is set).

encoding

    my $encoding = $stemmer->encoding;
    $stemmer->encoding($encoding);

Accessor/mutator for the encoding parameter.
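
The two mutators can be combined to retarget an existing object. The sketch below is illustrative only. It assumes that changing lang does not adjust encoding automatically, so both are set explicitly; the table of supported languages lists only KOI8-R and UTF-8 for Russian.

```perl
use strict;
use warnings;
use Lingua::Stem::Snowball;

my $stemmer = Lingua::Stem::Snowball->new( lang => 'en' );

# An unknown code leaves the language alone and reports through $@.
$@ = '';
$stemmer->lang('xx');
warn "lang not changed: $@\n" if $@;
my $kept = $stemmer->lang;
print "language is still $kept\n";

# Switch language and encoding together so the pair stays valid.
$stemmer->lang('ru');
$stemmer->encoding('UTF-8');
print join( ' / ', $stemmer->lang, $stemmer->encoding ), "\n";
```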

stemmers

    my @iso_codes = stemmers();
    my @iso_codes = $stemmer->stemmers();

Returns a list of all valid language codes.

REQUESTS & BUGS

Please report any requests, suggestions, or bugs via the RT bug-tracking system at http://rt.cpan.org/ or by email to bug-Lingua-Stem-Snowball@rt.cpan.org.

http://rt.cpan.org/NoAuth/Bugs.html?Dist=Lingua-Stem-Snowball is the RT queue for Lingua::Stem::Snowball. Please check to see if your bug has already been reported.

AUTHORS

Lingua::Stem::Snowball was originally developed to provide access to stemming algorithms for the OpenFTS (full text search engine) project (http://openfts.sourceforge.net), by Oleg Bartunov, <oleg at sai dot msu dot su> and Teodor Sigaev, <teodor at stack dot net>.

Currently maintained by Marvin Humphrey <marvin at rectangular dot com>. Previously maintained by Fabien Potencier <fabpot at cpan dot org>.

COPYRIGHT

Copyright 2004-2006

This software may be freely copied and distributed under the same terms and conditions as Perl.

Snowball files and stemmers are covered by the BSD license.

SEE ALSO

http://snowball.tartarus.org, Lingua::Stem.