NAME

WordList::ID::Common::Wikipedia5000 - Top 5000 words from Wikipedia Indonesia pages

VERSION

This document describes version 0.002 of WordList::ID::Common::Wikipedia5000 (from Perl distribution WordLists-ID-Common), released on 2017-12-31.

SYNOPSIS

 use WordList::ID::Common::Wikipedia5000;

 my $wl = WordList::ID::Common::Wikipedia5000->new;

 # Pick a (or several) random word(s) from the list
 my $word = $wl->pick;
 my @words = $wl->pick(3);

 # Check if a word exists in the list
 if ($wl->word_exists('foo')) { ... }

 # Call a callback for each word
 $wl->each_word(sub { my $word = shift; ... });

 # Get all the words
 my @all_words = $wl->all_words;

DESCRIPTION

This module contains the 5000 most frequently used Indonesian words in Indonesian Wikipedia pages.

Here is how the list was produced. First, the Indonesian Wikipedia's XML.bz2 dump [1] was downloaded (last downloaded: Dec 30, 2017). Then a couple of ad-hoc, rather simplistic Perl scripts were used to process this large file: one to split the file into individual pages, and another to strip the Wikimedia markup. All-lowercase words were then extracted from these files and merged into a single file. Finally, the list was curated (false positives and misspellings removed) to produce the final top {1000,2500,5000} words.
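
The extraction and ranking steps described above can be illustrated with a minimal sketch. This is not the author's actual script (those live in the Git repository); the helper names count_lowercase_words and top_words, the ASCII-only word pattern, and the sample text are all assumptions made for illustration. The manual curation step is not reproduced here.

```perl
use strict;
use warnings;

# Hypothetical helper: count all-lowercase words in text whose markup
# has already been stripped. A word containing any uppercase letter
# (e.g. "Lagu") is skipped entirely rather than lowercased.
sub count_lowercase_words {
    my ($text) = @_;
    my %freq;
    # A run of lowercase ASCII letters not adjacent to any other letter
    $freq{$1}++ while $text =~ /(?<![A-Za-z])([a-z]+)(?![A-Za-z])/g;
    return \%freq;
}

# Hypothetical helper: take the $n most frequent words (ties broken by
# string order), then return them sorted asciibetically.
sub top_words {
    my ($freq, $n) = @_;
    my @ranked = sort { $freq->{$b} <=> $freq->{$a} || $a cmp $b } keys %$freq;
    $n = @ranked if $n > @ranked;
    return sort @ranked[0 .. $n - 1];
}

# Made-up sample text, for illustration only
my $text = "Lagu ini adalah lagu yang populer. Lagu yang lain adalah lagu daerah.";
my $freq = count_lowercase_words($text);
my @top  = top_words($freq, 3);
print join(",", @top), "\n";   # prints "adalah,lagu,yang"
```

Sorting the selected words with plain sort mirrors the asciibetical ordering that the WordList convention requires, while the frequency ranking is used only to decide which words make the cut.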

Note that Wikipedia article pages are not representative of general Indonesian text: some words are overrepresented, e.g. "lagu" (in articles about particular songs) or "filum".

Some words are derived forms (non-root words), e.g. "makanannya" or "berdasarkan".

The order of the words in this wordlist is asciibetical, as required by the WordList convention. If you want to know the ranks of words by frequency, as well as the scripts used to generate the result, see the devscripts/ and work/ directories in the Git repository.

[1] https://id.wikipedia.org/wiki/Wikipedia:Wikipedia_bahasa_Indonesia_versi_luring

STATISTICS

 +----------------------------------+-------+
 | key                              | value |
 +----------------------------------+-------+
 | avg_word_len                     | 7.444 |
 | longest_word_len                 | 18    |
 | num_words                        | 5000  |
 | num_words_contains_nonword_chars | 0     |
 | num_words_contains_unicode       | 0     |
 | num_words_contains_whitespace    | 0     |
 | shortest_word_len                | 2     |
 +----------------------------------+-------+

The statistics are available in the %STATS package variable.

HOMEPAGE

Please visit the project's homepage at https://metacpan.org/release/WordLists-ID-Common.

SOURCE

Source repository is at https://github.com/perlancar/perl-WordLists-ID-Common.

BUGS

Please report any bugs or feature requests on the bugtracker website https://rt.cpan.org/Public/Dist/Display.html?Name=WordLists-ID-Common

When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.

AUTHOR

perlancar <perlancar@cpan.org>

COPYRIGHT AND LICENSE

This software is copyright (c) 2017 by perlancar@cpan.org.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.