NAME
Lingua::Word::Parser - Parse a word into scored known and unknown parts
VERSION
version 0.0809
SYNOPSIS
  use Lingua::Word::Parser;
  use Data::Dumper;

  # With a database source:
  my $p = Lingua::Word::Parser->new(
      word   => 'abioticaly',
      dbname => 'fragments',
      dbuser => 'akbar',
      dbpass => 's3kr1+',
  );

  # With a file source:
  $p = Lingua::Word::Parser->new(
      word => 'abioticaly',
      file => 'eg/lexicon.dat',
  );

  my $known  = $p->knowns;
  my $combos = $p->power;

  my $score = $p->score;      # Stringified output
  $score = $p->score_parts;   # "Raw" output

  # The best guess is the last sorted scored set:
  print Dumper( $score->{ [ sort keys %$score ]->[-1] } );
DESCRIPTION
A Lingua::Word::Parser breaks a word into known affixes.
A word-part lexicon file must have "regular-expression definition" lines of the form:
  a(?=\w) opposite
  ab(?=\w) away
  (?<=\w)o(?=\w) combining
  (?<=\w)tic possessing
Please see the included eg/lexicon.dat example file.
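To illustrate how such lines behave, here is a minimal sketch in plain Perl that does not use this module. It assumes each line is split into a pattern and a definition at the first run of whitespace. The names parse_lexicon and find_parts are hypothetical helpers written for this example, not part of the module's API.

```perl
use strict;
use warnings;

# Hypothetical in-memory lexicon holding the four example lines above.
my @lexicon_lines = (
    'a(?=\w) opposite',
    'ab(?=\w) away',
    '(?<=\w)o(?=\w) combining',
    '(?<=\w)tic possessing',
);

# Split each line into a pattern and a definition at the first whitespace.
sub parse_lexicon {
    my @lines = @_;
    my %lexicon;
    for my $line (@lines) {
        my ( $pattern, $definition ) = split /\s+/, $line, 2;
        $lexicon{$pattern} = $definition;
    }
    return \%lexicon;
}

# Find every place each pattern matches in the word.
# Returns a list of [ pattern, start offset, matched text, definition ].
sub find_parts {
    my ( $word, $lexicon ) = @_;
    my @found;
    for my $pattern ( sort keys %$lexicon ) {
        while ( $word =~ /$pattern/g ) {
            my $text = substr( $word, $-[0], $+[0] - $-[0] );
            push @found, [ $pattern, $-[0], $text, $lexicon->{$pattern} ];
        }
    }
    return @found;
}

my $lexicon = parse_lexicon(@lexicon_lines);
for my $part ( find_parts( 'abioticaly', $lexicon ) ) {
    printf "%-20s offset %d  '%s' = %s\n", @$part;
}
```

The lookahead and lookbehind assertions only constrain where a part may occur: a(?=\w) matches an "a" that is followed by another word character, so it matches at offsets 0 and 7 of "abioticaly", while (?<=\w)tic matches only when "tic" is preceded by a word character.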
A database lexicon must have records as above, stored in the columns id, affix and definition. Please see the included eg/word_part.sql example file.
METHODS
new
  $p = Lingua::Word::Parser->new(%arguments);

Create a new Lingua::Word::Parser object.
Arguments and defaults:

  word:   undef
  dbuser: undef
  dbpass: undef
  dbname: undef
  dbtype: mysql
  dbhost: localhost
knowns
  $known = $p->knowns;
Find the known word parts and their bitstring masks.
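As an illustration of what a bitstring mask can look like, the sketch below builds a string of 0s and 1s the same length as the word, with 1s over the characters a part covers. This is an assumption about the representation made for the sake of the example; mask_for is a hypothetical helper, not a method of this module.

```perl
use strict;
use warnings;

# Build a mask the same length as the word, with 1s over the matched span.
# ASSUMPTION: masks are strings of '0' and '1', one character per letter.
sub mask_for {
    my ( $word, $offset, $length ) = @_;
    my $mask = '0' x length($word);
    substr( $mask, $offset, $length ) = '1' x $length;
    return $mask;
}

my $word = 'abioticaly';
print mask_for( $word, 0, 2 ), "\n";    # "ab" at offset 0:  1100000000
print mask_for( $word, 4, 3 ), "\n";    # "tic" at offset 4: 0000111000
```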
power
  $combos = $p->power;
Find the set of non-overlapping known word parts by considering the power set of all masks.
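The idea can be sketched in plain Perl as follows. This is a simplified illustration under the assumption that masks are strings of 0s and 1s, and it is not necessarily how the module implements the method. The helpers overlaps and non_overlapping_sets are hypothetical names.

```perl
use strict;
use warnings;

# Two masks overlap when some position is '1' in both.
sub overlaps {
    my ( $m, $n ) = @_;
    for my $i ( 0 .. length($m) - 1 ) {
        return 1 if substr( $m, $i, 1 ) eq '1' && substr( $n, $i, 1 ) eq '1';
    }
    return 0;
}

# Walk the power set of the masks and keep the non-empty subsets
# in which no two masks overlap.
sub non_overlapping_sets {
    my @masks = @_;
    my @sets;
    # Each integer from 1 to 2^n - 1 selects one non-empty subset.
    SUBSET: for my $bits ( 1 .. ( 1 << scalar @masks ) - 1 ) {
        my @subset = map { $masks[$_] } grep { $bits & ( 1 << $_ ) } 0 .. $#masks;
        for my $i ( 0 .. $#subset ) {
            for my $j ( $i + 1 .. $#subset ) {
                next SUBSET if overlaps( $subset[$i], $subset[$j] );
            }
        }
        push @sets, \@subset;
    }
    return @sets;
}

# Masks for "a", "ab", "o" and "tic" in 'abioticaly'.
my @masks = qw(1000000000 1100000000 0001000000 0000111000);
print join( ' + ', @$_ ), "\n" for non_overlapping_sets(@masks);
```

Here "a" and "ab" both claim the first letter, so every subset containing both is discarded. Note that a power set grows as 2^n, so this brute-force walk is only practical for a small number of known parts.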
score
  $score = $p->score;
  $score = $p->score($open_separator, $close_separator);
Score the known vs unknown word part combinations into ratios of characters and chunks, word familiarity, partitions and definitions.
This method sets the score member to a list of hashrefs with keys:
partition
definition
score
familiarity
If not given, the $open_separator and $close_separator are '<' and '>' by default.
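To make the idea of chunk and character ratios concrete, here is a rough sketch in plain Perl. It assumes, purely for illustration, that a partition is a single string in which known parts are wrapped in the separators, such as '<a>bio<tic>aly'. The helper chunk_and_char_counts and the output layout are hypothetical and may not match what the module actually returns.

```perl
use strict;
use warnings;

# Count known and unknown chunks and characters in a marked-up partition.
# ASSUMPTION: known parts are wrapped in the separators, e.g. '<a>bio<tic>aly'.
sub chunk_and_char_counts {
    my ( $partition, $open, $close ) = @_;
    $open  = '<' unless defined $open;
    $close = '>' unless defined $close;
    my ( $o, $c ) = ( quotemeta($open), quotemeta($close) );

    my %count = map { $_ => 0 }
        qw(known_chunks unknown_chunks known_chars unknown_chars);

    # Either a separator-wrapped known part, or a run of unmarked characters.
    while ( $partition =~ /$o(.*?)$c|((?:(?!$o).)+)/gs ) {
        if ( defined $1 ) {
            $count{known_chunks}++;
            $count{known_chars} += length $1;
        }
        else {
            $count{unknown_chunks}++;
            $count{unknown_chars} += length $2;
        }
    }
    return \%count;
}

my $count = chunk_and_char_counts('<a>bio<tic>aly');
printf "%d:%d chunks / %d:%d chars\n",
    @{$count}{qw(known_chunks unknown_chunks known_chars unknown_chars)};
# prints "2:2 chunks / 4:6 chars"
```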
score_parts
  $score_parts = $p->score_parts;
  $score_parts = $p->score_parts($open_separator, $close_separator);
  $score_parts = $p->score_parts($open_separator, $close_separator, $line_terminator);
Score the known vs unknown word part combinations into ratios of characters and chunks, word familiarity, partitions and definitions.
If not given, the $open_separator and $close_separator are '<' and '>' by default.
The $line_terminator can be any string, such as a newline (\n) or an HTML line-break, but is the empty string ('') by default.
SEE ALSO
Lingua::TokenParse - The predecessor of this module.
http://en.wikipedia.org/wiki/Affix is the tip of the iceberg...
https://github.com/ology/Word-Part is a friendly Dancer user interface.
The t/* and eg/* files in this distribution!
AUTHOR
Gene Boggs <gene@cpan.org>
COPYRIGHT AND LICENSE
This software is copyright (c) 2014-2024 by Gene Boggs.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.