Logan Bell
and 1 contributor

NAME

Lucy::Analysis::Normalizer - Unicode normalization, case folding and accent stripping

Normalizer is an Analyzer which normalizes tokens to one of the Unicode normalization forms.

SYNOPSIS

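    # Assumed setup for the other two analyzers in the chain.
    my $tokenizer  = Lucy::Analysis::StandardTokenizer->new;
    my $stemmer    = Lucy::Analysis::SnowballStemmer->new( language => 'en' );
    my $normalizer = Lucy::Analysis::Normalizer->new;

    my $polyanalyzer = Lucy::Analysis::PolyAnalyzer->new(
        analyzers => [ $normalizer, $tokenizer, $stemmer ],
    );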

DESCRIPTION

Normalizer can also, optionally, perform Unicode case folding and convert accented characters to their base characters.

If you use highlighting, run Normalizer after tokenization: normalization might add or remove characters, which would shift the character offsets that highlighting relies on.
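
To make the "add or remove characters" point concrete, here is a minimal sketch that uses Perl's core Unicode::Normalize module rather than Lucy itself. The helper name length_change is hypothetical, and the sketch only illustrates how string lengths can change under normalization; it does not show how Lucy records offsets.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFKC NFD);

# Hypothetical helper: character length before and after a transform.
sub length_change {
    my ( $text, $transform ) = @_;
    return ( length($text), length( $transform->($text) ) );
}

# U+FB01 LATIN SMALL LIGATURE FI expands to "fi" under NFKC,
# so one character becomes two.
my @ligature = length_change( "\x{FB01}", sub { NFKC( $_[0] ) } );

# "caf" + U+00E9 decomposes to "cafe" + U+0301 (5 characters);
# dropping the nonspacing mark leaves 4.
my $decomposed = NFD("caf\x{E9}");
( my $stripped = $decomposed ) =~ s/\p{Mn}//g;

printf "ligature: %d -> %d\n", @ligature;
printf "accent:   %d -> %d\n", length($decomposed), length($stripped);
```

Because lengths can change in both directions, offsets computed on normalized text no longer line up with the original text, which is why tokenization should come first when highlighting matters.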

CONSTRUCTORS

new( [labeled params] )

    my $normalizer = Lucy::Analysis::Normalizer->new(
        normalization_form => 'NFKC',
        case_fold          => 1,
        strip_accents      => 0,
    );

  • normalization_form - Unicode normalization form. One of 'NFC', 'NFKC', 'NFD', 'NFKD'. Defaults to 'NFKC'.

  • case_fold - Perform case folding. Defaults to true.

  • strip_accents - Strip accents. Defaults to false.
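
The effect of the three parameters can be approximated in plain Perl with the core Unicode::Normalize module. This is only a rough sketch under stated assumptions, not Lucy's actual implementation: the function name normalize_token and the order in which the steps run are hypothetical, and only the parameter names and defaults are taken from the list above.

```perl
use strict;
use warnings;
use feature qw(fc unicode_strings);    # fc requires Perl 5.16 or later
use Unicode::Normalize qw(NFC NFD NFKC NFKD);

# Map each form name to the matching Unicode::Normalize function.
my %FORM = ( NFC => \&NFC, NFD => \&NFD, NFKC => \&NFKC, NFKD => \&NFKD );

# Hypothetical stand-in for what Normalizer does to a single token.
sub normalize_token {
    my ( $text, %opt ) = @_;
    my $form  = $opt{normalization_form} // 'NFKC';
    my $fold  = $opt{case_fold}          // 1;
    my $strip = $opt{strip_accents}      // 0;
    my $normalize = $FORM{$form} or die "Unknown normalization form: $form";

    if ($strip) {
        # Decompose so accents become separate combining marks, then drop them.
        $text = NFD($text);
        $text =~ s/\p{Mn}//g;
    }
    $text = fc($text) if $fold;    # Unicode case folding
    return $normalize->($text);
}

# With accent stripping on, U+00C9 followed by "cole" comes out as plain ASCII.
print normalize_token( "\x{C9}cole", strip_accents => 1 ), "\n";    # prints "ecole"
```

With the defaults, the same input keeps its accent and is only case folded and normalized to NFKC, giving U+00E9 followed by "cole".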

INHERITANCE

Lucy::Analysis::Normalizer isa Lucy::Analysis::Analyzer isa Lucy::Object::Obj.