Lingua::Stem::Cistem - CISTEM Stemmer for German
use Lingua::Stem::Cistem; my $stemmed_word = Lingua::Stem::Cistem::stem($word); my @segments = Lingua::Stem::Cistem::segment($word); use Lingua::Stem::Cistem qw(:orig); my $stemmed_word = stem($word); my @segments = segment($word); use Lingua::Stem::Cistem qw(:robust); my $stemmed_word = stem_robust($word); my @segments = segment_robust($word);
This is the CISTEM stemmer for German based on the "OFFICIAL IMPLEMENTATION".
Typically stemmers are used in applications like Information Retrieval, Keyword Extraction or Topic Matching.
It applies the CISTEM stemming algorithm to a word, returning the stem of this word.
Now (2019) CISTEM has the best f-score compared to other stemmers for German on CPAN, while being one of the fastest.
The methods in this package keep their original logic and API, only the module name changed from Cistem to Lingua::Stem::Cistem.
Changes in this distribution applied to the "OFFICIAL IMPLEMENTATION":
It is based on the paper
Leonie Weißweiler, Alexander Fraser (2017). Developing a Stemmer for German Based on a Comparative Analysis of Publicly Available Stemmers. In Proceedings of the German Society for Computational Linguistics and Language Technology (GSCL)
which can be read here:
http://www.cis.lmu.de/~weissweiler/cistem/
In the paper, the authors conducted an analysis of publicly available stemmers, developed two gold standards for German stemming and evaluated the stemmers based on the two gold standards. They then proposed the stemmer implemented here and show that it achieves slightly better f-measure than the other stemmers and is thrice as fast as the Snowball stemmer for German while being about as fast as most other stemmers.
Source repository https://github.com/LeonieWeissweiler/CISTEM
Lingua::Stem::Cistem exports no subroutines per default to avoid conflicts with other stemmers.
You can either use the methods without importing the subroutines
use Lingua::Stem::Cistem; my $stem = Lingua::Stem::Cistem::stem($word);
or import some or all of the methods:
use Lingua::Stem::Cistem qw(stem segment); my $stem = stem($word); my @segments = segment($word); use Lingua::Stem::Cistem qw(:all); my $stem = stem($word);
Supported:
:all - imports stem segment stem_robust segment_robust :orig - imports stem segment :robust - imports stem_robust segment_robust
stem($word, $case_insensitivity)
This method takes the word to be stemmed and a boolean specifiying if case-insensitive stemming should be used and returns the stemmed word. If only the word is passed to the method or the second parameter is 0, normal case-sensitive stemming is used, if the second parameter is 1, case-insensitive stemming is used.
Case sensitivity improves performance only if words in the text may be incorrectly upper case. For all-lowercase and correctly cased text, best performance is achieved by using the case-sensitive version.
stem_robust($word, $case_insensitivity, $keep_ge_prefix)
This method works like "stem" with the following differences for robustness:
This should not be necessary, if the input is carefully normalized, tokenized, and filtered.
segment($word, $case_insensitivity)
This method works very similarly to stem. The only difference is that in addition to returning the stem, it also returns the rest that was removed at the end. To be able to return the stem unchanged so the stem and the rest can be concatenated to form the original word, all subsitutions that altered the stem in any other way than by removing letters at the end were left out.
my ($stem, $suffix) = segment($word);
segment_robust($word, $case_insensitivity, $keep_ge_prefix)
This method works exactly like "stem_robust" and returns a list of prefix, stem and suffix:
my ($prefix, $stem, $suffix) = segment_robust($word);
Tests were run using the file goldstandard1.txt (317441 words, 3.76 MB), which can be found here:
https://github.com/LeonieWeissweiler/CISTEM/blob/master/gold_standards/goldstandard1.txt
The test iterates over the words in the file. Times measured include the overhead of startup and iteration.
Platform (only one thread used) Intel Core i7-4770HQ Processor 4 Cores, 8 Threads 2.20 - 3.40 GHz 6 MB Cache 16GB DDR3 RAM MacOS Mojave Version 10.14.4 Perl 5.20.1 +-------------------------------------------------------------+ | source: goldstandard1.txt | words: 317441 | +-------------------------------------------------------------+ | method | version | duration | factor | words/sec | |-------------------------------------------------------------| | stem | official | 2.862s | 1.00 | 110916 | | stem | this v0.01 | 2.678s | 0.94 | 118536 | | stem_robust | this v0.01 | 4.111s | 1.44 | 77217 | | | | | | | | segment | official | 2.594s | 1.00 | 122375 | | segment | this v0.01 | 2.642s | 1.02 | 120151 | | segment_robust | this v0.01 | 4.368s | 1.68 | 72674 | +-------------------------------------------------------------+
http://github.com/wollmers/Lingua-Stem-Cistem
Helmut Wollmersdorfer <helmut@wollmersdorfer.at>
Copyright 2019-2020 Helmut Wollmersdorfer Copyright 2017 Leonie Weißweiler (original version)
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Lingua::Stem::Snowball, Lingua::Stem::UniNE, Lingua::Stem, Lingua::Stem::Patch
To install Lingua::Stem::Cistem, copy and paste the appropriate command in to your terminal.
cpanm
cpanm Lingua::Stem::Cistem
CPAN shell
perl -MCPAN -e shell install Lingua::Stem::Cistem
For more information on module installation, please visit the detailed CPAN module installation guide.