NAME

gen_word_break_data.pl - Generate word break table and tests

SYNOPSIS

    perl gen_word_break_data.pl [-c] UCD_SRC_DIR

DESCRIPTION

This script generates the tables used to look up Unicode word break properties for the StandardTokenizer. It also converts the word break test suite in the UCD to JSON.

UCD_SRC_DIR should point to a directory containing the files WordBreakProperty.txt, WordBreakTest.txt, and DerivedCoreProperties.txt from the Unicode Character Database, available at http://www.unicode.org/Public/6.0.0/ucd/.
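
For orientation, UCD property files such as WordBreakProperty.txt list one code point or code point range per line, followed by a semicolon, the property value, and an optional "#" comment. The following is a minimal sketch of reading such lines. The helper name parse_ucd_line and the sample lines are illustrative assumptions, and the script's actual parsing code may differ.

```perl
use strict;
use warnings;

# Parse one line of a UCD property file such as WordBreakProperty.txt.
# Returns (first_code_point, last_code_point, property_value), or an
# empty list for blank and comment-only lines.
# NOTE: parse_ucd_line is a hypothetical helper, not part of the script.
sub parse_ucd_line {
    my ($line) = @_;
    $line =~ s/#.*//;    # strip the trailing comment, if any
    return
        unless $line =~ /^\s*([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)/;
    my $first = hex($1);
    my $last  = defined $2 ? hex($2) : $first;    # single code point: no range
    return ($first, $last, $3);
}

# Sample lines in the UCD property file format (illustrative only).
my @sample = (
    "# comment-only line",
    "000D          ; CR # Cc       <control-000D>",
    "0041..005A    ; ALetter # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
);

for my $line (@sample) {
    # The list assignment is false when the helper returns an empty list.
    my ($first, $last, $value) = parse_ucd_line($line) or next;
    printf "%04X..%04X %s\n", $first, $last, $value;
}
```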

OUTPUT FILES

    modules/unicode/ucd/WordBreak.tab
    modules/unicode/ucd/WordBreakTest.json

OPTIONS

-c

Show the total table size for different shift values.
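
The text above does not describe the table layout. As a rough illustration of why the shift value matters, the sketch below assumes a two-stage lookup table: the high bits of a code point (cp >> shift) select a block, and each distinct block of 2**shift property values is stored only once. Both that layout and the helper name two_stage_size are assumptions made for this example, and the sizes are counted in table entries, not bytes.

```perl
use strict;
use warnings;

# ASSUMED layout: stage 1 has one entry per block of 2**shift code points,
# stage 2 stores every distinct block once.
# NOTE: two_stage_size is a hypothetical helper, not part of the script.
sub two_stage_size {
    my ($props, $shift) = @_;    # $props: array ref, one value per code point
    my $block_size = 1 << $shift;
    my $num_blocks = 0;
    my %seen;                    # distinct blocks, keyed by their contents
    for (my $start = 0; $start < @$props; $start += $block_size) {
        my $end = $start + $block_size - 1;
        $end = $#$props if $end > $#$props;    # last block may be short
        $seen{ join ',', @{$props}[ $start .. $end ] } = 1;
        $num_blocks++;
    }
    return $num_blocks + scalar(keys %seen) * $block_size;
}

# Toy property array: 64 code points, four of which have a non-default value.
my @props = (0) x 64;
@props[ 16 .. 19 ] = (1) x 4;

for my $shift (1 .. 4) {
    printf "shift %d: %d entries\n", $shift, two_stage_size(\@props, $shift);
}
```

Under these assumptions, a small shift enlarges the first stage while a large shift enlarges each stored block, so comparing the totals across shift values helps pick a compact layout.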