WordList - Specification and base class for WordList::*, modules that contain word lists
This document describes version 0.7.11 of WordList (from Perl distribution WordList), released on 2021-09-26.
Use one of the WordList::* modules.
WordList::* modules are modules that contain, well, lists of words. This module, WordList, serves as a base class and establishes conventions for such modules.
WordList is an alternative to Games::Word::Wordlist and Games::Word::Wordlist::*. The main difference is that WordList::* wordlists are read-only/immutable and the modules are designed to have low startup overhead. This makes them more suitable for use in CLI scripts, which often only want to pick a word from one or several lists. See "DIFFERENCES WITH GAMES::WORD::WORDLIST" for more details.
Unless you are defining a dynamic wordlist (see below), words (or phrases) must be put in the __DATA__ section, one per line. Putting the wordlist in the __DATA__ section relieves perl from having to parse the list while loading the module. To search for words or pick some random words from the list, the module also does not need to slurp the whole list into memory (and will not do so unless explicitly instructed).
You must sort your words ascibetically (or by Unicode code point). Sorting makes it more convenient to diff different versions of the module, and makes binary search possible. If your list uses a sort order other than ascibetical, you must set the package variable $SORT to some true value (say, frequency).
There must not be any duplicate entries in the word list.
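The two rules above (ascibetical order, no duplicates) can be checked mechanically before a list is pasted into a module. The following is a minimal sketch; check_wordlist is a hypothetical helper written for this illustration and is not part of WordList.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of WordList): returns a list of problems
# found in a candidate word list, or an empty list if it looks fine.
sub check_wordlist {
    my @words = @_;
    my @problems;
    my %seen;
    for my $i (0 .. $#words) {
        my $w = $words[$i];
        push @problems, "duplicate entry: $w" if $seen{$w}++;
        # 'lt' compares strings by code point as long as
        # 'use locale' is not in effect.
        push @problems, "out of order: $w after $words[$i-1]"
            if $i > 0 && $w lt $words[$i-1];
    }
    return @problems;
}

# Uppercase letters sort before lowercase ones in ascibetical order,
# so only the repeated entry is reported here.
print "$_\n" for check_wordlist(qw(Zebra apple banana banana cherry));
```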
Dynamic and non-deterministic wordlist. A dynamic wordlist must set the package variable $DYNAMIC to either 1 (deterministic) or 2 (non-deterministic). A dynamic wordlist does not put the wordlist in the DATA section; instead, the user relies on first_word() + next_word(), or each_word(), or all_words() to get the list. A deterministic wordlist returns the same list every time each_word() or all_words() is called. A non-deterministic list can return a different list for a different each_word() or all_words() call. See WordListRole::FirstNextResetFromEach, WordListRole::EachFromFirstNextReset, and WordListRole::FromArray if you want to write a dynamic wordlist module. It is possible for a dynamic list to return unordered or duplicate entries, but this is discouraged.
Parameterized wordlist. When instantiating a wordlist class, the user can pass a list of key-value pairs as parameters. Normally only a dynamic wordlist would accept parameters. Parameters are defined in the %PARAMS package variable, a hash with parameter names as keys and parameter specifications as values. A parameter specification follows the function argument metadata specified in Rinci::function.
Examples. Examples can be specified in the @EXAMPLES package variable. The structure is similar to the examples property of Rinci function metadata. For example:
# in lib/WordList/Test/Dynamic/RandomWord/1000.pm
@EXAMPLES = (
    {
        summary => '1000 random words, each 10 to 15 characters long',
        args => {min_len=>10, max_len=>15},
    }
);
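The example above passes min_len and max_len as arguments, so a matching %PARAMS declaration might look roughly like the following sketch. The package name WordList::Test::MyRandomWord, the summaries, the schemas, and the defaults are all assumptions made for illustration; only the parameter names come from the example above.

```perl
use strict;
use warnings;

package WordList::Test::MyRandomWord;   # hypothetical module name

# Keys are parameter names; values follow Rinci function argument
# metadata (summary, schema, default, ...). The concrete values here
# are assumptions for illustration only.
our %PARAMS = (
    min_len => {
        summary => 'Minimum word length',
        schema  => 'int*',
        default => 5,
    },
    max_len => {
        summary => 'Maximum word length',
        schema  => 'int*',
        default => 8,
    },
);

1;
```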
Since this interface is not compatible with that of Games::Word::Wordlist, I also made some other changes:
Namespace is put outside Games::
Because obviously word lists are not only useful for games.
Namespace is more language-neutral and not English-centric
English wordlists are put under WordList::EN::*. Other languages have their own subnamespaces, e.g. WordList::FR::* or WordList::ID::*. Aside from language subnamespaces, there are also other subnamespaces: WordList::Phrase::$LANG::*, WordList::Password::*, WordList::Domain::*, WordList::HTTP::*, etc.
Interface is simpler
This is partly due to the list being read-only. The methods provided are just:
- pick (pick one or several random entries, with or without duplicates)
- word_exists (check whether a word is in the list)
- each_word (run code for each entry)
- all_words (return all the words in a list)
A couple of other functions might be added, with careful consideration.
More extensions
Some roles, subclasses, or alternate implementations are provided. For example, since most wordlists are alphabetically sorted, a binary search can be performed in word_exists(). There is a role, WordListRole::BinarySearch, that does that and can be mixed in. An even faster version of word_exists() using a Bloom filter is offered by WordListRole::Bloom. A faster version of pick() that does random seeking is offered by WordListRole::RandomSeekPick.
If you want to get the word list from another filehandle source, e.g. a gzipped file, you just need to override reset_iterator(). Your reset_iterator() needs to set the 'fh' attribute to the filehandle. The default first_word() calls reset_iterator() and reads a line from the filehandle. The default next_word() just reads another line from the filehandle. each_word() is implemented in terms of first_word() and next_word(), and word_exists(), pick(), and all_words() are implemented in terms of each_word().
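The layering described above (reset_iterator sets 'fh', first_word and next_word read from it, each_word and all_words build on those) can be illustrated with a small stand-in class. This is only a sketch: My::MiniWordList is a hypothetical class written for this example, it reads from an in-memory string instead of a __DATA__ section, and it is not the real WordList implementation.

```perl
use strict;
use warnings;

package My::MiniWordList;   # hypothetical stand-in, NOT the real WordList class

sub new {
    my ($class, %args) = @_;
    return bless { text => $args{text} }, $class;
}

# A subclass would override this to open any other source
# (for example a decompressing filehandle) and store it in 'fh'.
sub reset_iterator {
    my $self = shift;
    open my $fh, '<', \$self->{text}
        or die "Cannot open in-memory file: $!";
    $self->{fh} = $fh;
    return;
}

# Reads one more line from the current filehandle.
sub next_word {
    my $self = shift;
    my $line = readline($self->{fh});
    return undef unless defined $line;
    chomp $line;
    return $line;
}

# Equivalent to reset_iterator followed by next_word.
sub first_word {
    my $self = shift;
    $self->reset_iterator;
    return $self->next_word;
}

# Built on first_word/next_word; stops early when the callback returns -2.
sub each_word {
    my ($self, $code) = @_;
    my $word = $self->first_word;
    while (defined $word) {
        my $res = $code->($word);
        last if defined $res && $res == -2;
        $word = $self->next_word;
    }
    return;
}

# Built on each_word.
sub all_words {
    my $self = shift;
    my @words;
    $self->each_word(sub { push @words, $_[0]; return });
    return @words;
}

package main;

my $wl = My::MiniWordList->new(text => "apple\nbanana\ncherry\n");
print join(" ", $wl->all_words), "\n";

# Stop iterating as soon as 'banana' has been seen.
my @some;
$wl->each_word(sub { push @some, $_[0]; return $_[0] eq 'banana' ? -2 : 0 });
print "@some\n";
```

To read from a different source in this sketch, only reset_iterator would need to change; the other methods keep working because they go through 'fh'.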
Usage:
$wl = WordList::Module->new([ %params ]);
Constructor.
$wl->each_word($code)
Call $code for each word in the list. The code will receive the word as its first argument.
If $code returns -2, iteration exits early.
Another way to iterate the word list is by calling "first_word" to get the first word, then "next_word" repeatedly until you get undef.
Get the next word. See "first_word" for more details.
Reset iterator. Basically "first_word" is equivalent to reset_iterator + "next_word".
@words = $wl->pick([ $num=1 [ , $allow_duplicates=0 ] ])
Examples:
($word) = $wl->pick; @words = $wl->pick(3);
Pick $num (default: 1) random word(s) from the list, without duplicates (unless $allow_duplicates is set to true). If there are fewer than $num words in the list and duplicates are not allowed, only that many will be returned.
The algorithm used is from perlfaq (see perldoc -q "random line"). It scans the whole list once (i.e. it calls each_word() once). The perlfaq algorithm returns a single entry; here it is modified to support returning multiple entries.
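One common way to generalize the single-entry perlfaq technique to several entries is reservoir sampling. The sketch below shows that approach for the no-duplicates case; it is not necessarily how pick() is implemented, and pick_in_one_pass and its iterator argument are hypothetical names introduced for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: pick up to $num entries at distinct positions
# in a single pass. $iter is a code ref that returns the next word,
# or undef when the list is exhausted.
sub pick_in_one_pass {
    my ($iter, $num) = @_;
    $num = 1 unless defined $num;
    my @picked;
    my $seen = 0;
    while (defined(my $word = $iter->())) {
        $seen++;
        if (@picked < $num) {
            # Fill the reservoir with the first $num words.
            push @picked, $word;
        } else {
            # Keep the new word with probability $num/$seen,
            # replacing a uniformly chosen earlier pick.
            my $j = int(rand($seen));
            $picked[$j] = $word if $j < $num;
        }
    }
    return @picked;
}

my @list = qw(apple banana cherry date elderberry fig grape);
my $i = 0;
my @three = pick_in_one_pass(sub { $list[$i++] }, 3);
print "@three\n";
```

With $num set to 1 this reduces to the perlfaq form: each new line replaces the current choice with probability 1/$seen. If the source has fewer than $num entries, all of them are returned.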
$wl->word_exists($word) => bool
Check whether $word is in the list.
The algorithm in this implementation is a linear scan (O(n)). Check out WordListRole::BinarySearch for an O(log n) implementation, or WordListRole::Bloom for an O(1) implementation.
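To show why a sorted list allows an O(log n) lookup, here is a minimal binary search over an in-memory sorted array. It illustrates the idea only; sorted_word_exists is a hypothetical helper, and this is not a description of how WordListRole::BinarySearch is implemented.

```perl
use strict;
use warnings;

# Hypothetical helper: binary search in an array reference that is
# sorted with string comparison (cmp), i.e. by code point.
sub sorted_word_exists {
    my ($words, $target) = @_;
    my ($lo, $hi) = (0, $#$words);
    while ($lo <= $hi) {
        my $mid = int(($lo + $hi) / 2);
        my $cmp = $words->[$mid] cmp $target;
        return 1 if $cmp == 0;
        if ($cmp < 0) { $lo = $mid + 1 } else { $hi = $mid - 1 }
    }
    return 0;
}

my @sorted = sort qw(cherry apple fig banana date);
print sorted_word_exists(\@sorted, 'date')  ? "found\n" : "missing\n";
print sorted_word_exists(\@sorted, 'grape') ? "found\n" : "missing\n";
```

Each iteration halves the remaining range, so a list of a million entries needs about twenty comparisons instead of up to a million.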
$wl->all_words() => list
Return all the words in a list, in order. Note that if the wordlist is very large, you might want to use "each_word" instead to avoid slurping all words into memory.
You probably wrote this:
$word = $wl->pick;
instead of this:
($word) = $wl->pick;
pick() returns a list, and in scalar context it returns the number of elements in that list, which is 1. This is a common context trap in Perl.
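The trap can be reproduced without WordList at all. In the sketch below, fake_pick is a hypothetical stand-in that, like the behavior described above, returns an array, so the calling context decides what the caller receives.

```perl
use strict;
use warnings;

# Hypothetical stand-in for pick(): it returns an array, so the
# calling context decides what the caller receives.
sub fake_pick {
    my @picked = ('apple');
    return @picked;
}

my $count  = fake_pick();   # scalar context: number of elements
my ($word) = fake_pick();   # list context: first element

print "scalar context: $count\n";   # prints "scalar context: 1"
print "list context:   $word\n";    # prints "list context:   apple"
```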
Please visit the project's homepage at https://metacpan.org/release/WordList.
Source repository is at https://github.com/perlancar/perl-WordList.
Related projects: ArrayData, HashData, TableData are newer projects inspired by WordList. I plan to publish newer wordlists as ArrayData::* modules. But WordList will still exist and stabilize its API.
WordListRole::* modules.
WordList::* modules.
CLIs are provided in App::wordlist (wordlist) and App::WordListUtils (e.g. list-wordlist-modules).
perlancar <perlancar@cpan.org>
To contribute, you can send patches by email/via RT, or send pull requests on GitHub.
Most of the time, you don't need to build the distribution yourself. You can simply modify the code, then test via:
% prove -l
If you want to build the distribution (e.g. to try to install it locally on your system), you can install Dist::Zilla, Dist::Zilla::PluginBundle::Author::PERLANCAR, and sometimes one or two other Dist::Zilla plugins and/or Pod::Weaver::Plugin modules. Any additional steps required beyond that are considered a bug and can be reported to me.
This software is copyright (c) 2021, 2020, 2018, 2017, 2016 by perlancar <perlancar@cpan.org>.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
Please report any bugs or feature requests on the bugtracker website https://rt.cpan.org/Public/Dist/Display.html?Name=WordList
When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.