gen_word_break_data.pl - Generate word break table and tests
perl gen_word_break_data.pl [-c] UCD_SRC_DIR
This script generates the tables used to look up Unicode word break properties for the StandardTokenizer. It also converts the word break test suite in the UCD to JSON.
UCD_SRC_DIR should point to a directory containing the files WordBreakProperty.txt, WordBreakTest.txt, and DerivedCoreProperties.txt from the Unicode Character Database available at http://www.unicode.org/Public/6.3.0/ucd/.
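For orientation, data lines in WordBreakProperty.txt take the form "CODEPOINT ; Value # comment" or "FIRST..LAST ; Value # comment". The following is a minimal Python sketch of reading that format. It illustrates the input only and is not taken from the script itself; the function name parse_property_line is hypothetical.

```python
import re

# Matches "XXXX ; Value" or "XXXX..YYYY ; Value" once the trailing
# "# ..." comment has been stripped from the line.
LINE_RE = re.compile(r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)")

def parse_property_line(line):
    """Return (first, last, value) for a data line, or None for comment/blank lines.

    Hypothetical helper for illustration; not part of gen_word_break_data.pl.
    """
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    m = LINE_RE.match(data)
    if m is None:
        raise ValueError(f"unrecognized line: {line!r}")
    first = int(m.group(1), 16)
    # A single code point is treated as a range of length one.
    last = int(m.group(2), 16) if m.group(2) else first
    return first, last, m.group(3)

sample = [
    "# sample lines in the UCD property file format",
    "0022          ; Double_Quote # Po       QUOTATION MARK",
    "0041..005A    ; ALetter # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
]
for line in sample:
    print(parse_property_line(line))
```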
Output files: modules/unicode/ucd/WordBreak.tab and modules/unicode/ucd/WordBreakTest.json.
-c: Show the total table size for different shift values.
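The "shift values" here presumably refer to a two-stage lookup table, in which the high bits of a code point (cp >> shift) select a block and the low bits select an entry within it. That is an assumption about the script's internals, not something this page states. Under that assumption, the Python sketch below shows why total size depends on the shift: larger blocks shrink the index but share less data. All names and the toy property array are hypothetical, and sizes are counted in entries rather than bytes.

```python
def build_two_stage(props, shift):
    """Split props into blocks of 2**shift entries and deduplicate identical blocks.

    Returns (index, data): index[i] is the id of the block covering code points
    i * 2**shift onward, and data holds the unique blocks back to back.
    Assumes len(props) is a multiple of 2**shift.
    """
    block_size = 1 << shift
    seen = {}    # block contents -> block id
    index = []   # stage 1: one block id per block of code points
    data = []    # stage 2: concatenated unique blocks
    for start in range(0, len(props), block_size):
        block = tuple(props[start:start + block_size])
        if block not in seen:
            seen[block] = len(seen)
            data.extend(block)
        index.append(seen[block])
    return index, data

def lookup(index, data, shift, cp):
    """Return the property value for code point cp using the two-stage table."""
    block_id = index[cp >> shift]
    return data[(block_id << shift) + (cp & ((1 << shift) - 1))]

# Toy property array (not real UCD data): 1 for ASCII letters, 2 for ASCII digits.
props = [0] * 0x1000
for cp in list(range(0x41, 0x5B)) + list(range(0x61, 0x7B)):
    props[cp] = 1
for cp in range(0x30, 0x3A):
    props[cp] = 2

for shift in range(3, 9):
    index, data = build_two_stage(props, shift)
    # Every code point must map back to its original property value.
    assert all(lookup(index, data, shift, cp) == props[cp] for cp in range(len(props)))
    print(f"shift={shift} index={len(index)} data={len(data)} total={len(index) + len(data)}")
```

Comparing the totals across shifts is the kind of report that helps pick the most compact layout for a given data set.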