NAME

Lingua::LO::NLP::Syllabify - Segment Lao or mixed-script text into syllables.

FUNCTION

This module implements a purely regular-expression-based algorithm for segmenting Lao text into syllables. It is based on the algorithm described in Phissamay et al., "Syllabification of Lao Script for Line Breaking".

METHODS

new

new( $text, %options )

The constructor takes a mandatory argument containing the text to split, followed by any number of hash-style named options. Currently, the only such option is normalize, which takes a boolean argument and indicates whether to run the text through a normalization function that swaps tone marks and vowels appearing in the wrong order.

Note that in any case the text is first passed through "NFC" in Unicode::Normalize to obtain Normalization Form C (canonical composition). In pure Lao text, this affects only the decomposed form of LAO VOWEL SIGN AM, which is transformed from U+0EB2, U+0ECD to U+0EB3.

get_syllables

get_syllables()

Returns a list of the Lao syllables found in the text passed to the constructor. Any whitespace, non-Lao text, or other material mixed in is silently dropped.
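
As a minimal usage sketch, assuming the module is installed and the constructor is called as a class method in the usual Perl way, the following splits a mixed Lao/English string. The sample text and variable names are illustrative only and do not come from the module's own documentation.

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal Lao text
use Lingua::LO::NLP::Syllabify;

binmode STDOUT, ':encoding(UTF-8)';

# Illustrative input: a Lao greeting followed by non-Lao text.
my $text = "ສະບາຍດີ hello";

# normalize => 1 requests the reordering of misplaced tone marks and vowels.
my $syllabifier = Lingua::LO::NLP::Syllabify->new( $text, normalize => 1 );

# Only Lao syllables come back; the blank and "hello" are dropped.
my @syllables = $syllabifier->get_syllables;
print join( " | ", @syllables ), "\n";
```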

get_fragments

get_fragments()

Returns a complete segmentation of the text passed to the constructor as an array of hashes. Each hash has two keys:

text

The text of the fragment.

is_lao

If true, the fragment is a single valid Lao syllable. If false, it may be whitespace, non-Lao script, or Lao characters that do not form a valid syllable; in short, anything that is not a valid syllable.
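
The fragment structure can be walked as in the following sketch. It assumes that get_fragments returns a list whose elements are hash references with the two keys described above; the sample text and variable names are illustrative only.

```perl
use strict;
use warnings;
use utf8;    # this source file contains literal Lao text
use Lingua::LO::NLP::Syllabify;

binmode STDOUT, ':encoding(UTF-8)';

# Illustrative input mixing Lao script, punctuation and Latin script.
my $text      = "ສະບາຍດີ, Laos!";
my @fragments = Lingua::LO::NLP::Syllabify->new($text)->get_fragments;

# Label each fragment according to its is_lao flag.
# Assumption: every element is a hash reference.
for my $fragment (@fragments) {
    my $kind = $fragment->{is_lao} ? 'syllable' : 'other';
    printf "%-8s [%s]\n", $kind, $fragment->{text};
}
```

Unlike get_syllables, this method keeps the non-syllable fragments, so it is the better fit when the surrounding text has to be preserved.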