Lingua::LO::NLP::Data - Helper module to keep common read-only data


Provides a few functions that return regular expressions for matching Lao syllables and extracting their parts. Instead of being hardcoded as strings, these expressions are constructed from fragments at runtime, trading a small one-time initialization cost for better maintainability.
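
The idea of assembling an expression from fragments can be sketched as follows. This is a minimal illustration, not the module's actual code: the fragment strings, the X placeholder convention and the name build_syllable_re are assumptions made for the example, and the vowel list is cut down to four entries.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical, heavily simplified fragments; the module's real data differs.
my $consonant = '[ກ-ຮ]';                  # rough range of Lao consonants
my $tone      = '[\x{0EC8}-\x{0ECB}]?';   # optional tone mark
my @vowel_templates = ('Xະ', 'Xາ', 'ເX', 'ແX');   # X marks the consonant slot

# Expand every template and join the results into one alternation.
sub build_syllable_re {
    my @alternatives;
    for my $template (@vowel_templates) {
        (my $pattern = $template) =~ s/X/$consonant$tone/;
        push @alternatives, $pattern;
    }
    my $joined = join '|', @alternatives;
    return qr/(?:$joined)/;
}

my $syllable_re = build_syllable_re();   # built once, reused afterwards

# Consonant U+0E81, tone mark U+0EC9, vowel U+0EB2
print "matched\n" if "\x{0E81}\x{0EC9}\x{0EB2}" =~ /^$syllable_re\z/;
```

Adding a vowel then means adding one template to the list rather than editing a long regular expression by hand.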

Also holds common read-only data such as vowel classifications.

You will probably not want to use this module on its own. If you do, see the other Lingua::LO::NLP modules for examples.



get_sylre_basic

Returns a basic regexp that can match a Lao syllable. It consists of a series of alternations and therefore returns the first alternative that matches, which is guaranteed to be neither the longest match nor the appropriate one within a longer sequence of characters. It is still useful as a building block and for verifying single syllables.
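
The first-match behaviour of alternations is a general property of Perl regexps and can be seen with a toy pattern. The Latin-script alternatives below are made up for the illustration and have nothing to do with the module's real expression.

```perl
use strict;
use warnings;

# Toy stand-in for a syllable regexp: two alternatives, shorter one first.
my $toy_re = qr/(?:ka|kan)/;

# Unanchored, the engine stops at the first alternative that succeeds.
my ($first) = 'kan' =~ /($toy_re)/;

# Anchored at both ends, backtracking tries the second alternative,
# which is why such an expression still works for verifying a syllable.
my $verified = 'kan' =~ /^$toy_re\z/ ? 1 : 0;

print "first match: $first, verified: $verified\n";
```

Here $first ends up as "ka" although "kan" would also have matched, while the anchored check succeeds.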


get_sylre_full

In addition to the matching done by get_sylre_basic, this one makes sure that a match is followed by another complete syllable (or what can only be the start of one), a space, the end of the string or line, or a non-Lao character. This ensures correct matching at ambiguous syllable boundaries, where the core consonant of the following syllable could also be the end consonant of the current one.
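
How a lookahead resolves such a boundary can be sketched with a toy grammar in Latin script. This only demonstrates the technique under simplifying assumptions (consonants k and n, a single vowel a, an optional end consonant); it is not the expression the module builds.

```perl
use strict;
use warnings;

# Toy syllable: consonant, vowel, optional end consonant.
my $toy_basic = qr/[kn]a[kn]?/;

# Same, but the match must be followed by the start of another syllable,
# whitespace, the end of the string, or a character outside the toy alphabet.
my $toy_full = qr/$toy_basic(?=[kn]a|\s|\z|[^kna])/;

my @naive   = 'kana' =~ /($toy_basic)/g;   # grabs "n" as an end consonant
my @checked = 'kana' =~ /($toy_full)/g;    # lookahead forces "ka" + "na"

print "naive: @naive\n";
print "checked: @checked\n";
```

The naive version consumes "kan" and leaves a stray "a" behind, whereas the lookahead makes the engine give the "n" back so that it can start the next syllable.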


get_sylre_named

The expression returned is the same as for get_sylre_full but also includes named captures, so that after a successful match the syllable's parts can be retrieved from %+.
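
Reading parts from %+ works as in the following sketch. The capture names (consonant, vowel, end_consonant), the toy pattern and the helper split_toy_syllable are assumptions chosen for the example; the names used by the real expression may differ.

```perl
use strict;
use warnings;

# Toy pattern with named captures; names are hypothetical.
my $toy_named = qr/(?<consonant>[kn])(?<vowel>a)(?<end_consonant>[kn])?/;

# Returns a hash reference with the captured parts, or nothing on failure.
sub split_toy_syllable {
    my ($text) = @_;
    return unless $text =~ /^$toy_named\z/;
    my %parts = %+;    # copy now: %+ changes with the next successful match
    return \%parts;
}

my $parts = split_toy_syllable('kan');
print join(', ', map { "$_=$parts->{$_}" } sort keys %$parts), "\n";
```

Captures that did not take part in the match, such as the end consonant of an open syllable, are simply undefined.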


is_long_vowel( $lao_vowel )

Returns a boolean indicating whether the vowel passed in is long. Consonant placeholders must be included in the form of DOTTED CIRCLE (U+25CC). Note that, for speed, there is no check whether the vowel actually exists in the data, so passing many bogus values may lead to uncontrolled growth of the %VOWEL_LENGTH hash due to autovivification!
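
Callers that handle untrusted input may prefer to validate a vowel before looking it up. The sketch below shows one way to do that with exists. It uses a tiny stand-in table, since the layout of the real %VOWEL_LENGTH hash is not described here, and the wrapper name is_long_vowel_checked is hypothetical.

```perl
use strict;
use warnings;

# Stand-in data: vowel with DOTTED CIRCLE placeholder => true if long.
# The value format is an assumption made for this example.
my %toy_vowel_length = (
    "\x{25CC}\x{0EB0}" => 0,    # ◌ະ short
    "\x{25CC}\x{0EB2}" => 1,    # ◌າ long
    "\x{25CC}\x{0EB4}" => 0,    # ◌ິ short
    "\x{25CC}\x{0EB5}" => 1,    # ◌ີ long
);

# Returns 1 or 0 for known vowels and undef for unknown ones.
# exists() never creates the key, so the table cannot grow.
sub is_long_vowel_checked {
    my ($vowel) = @_;
    return undef unless exists $toy_vowel_length{$vowel};
    return $toy_vowel_length{$vowel} ? 1 : 0;
}

my $size_before = keys %toy_vowel_length;
my $unknown     = is_long_vowel_checked('bogus');
my $size_after  = keys %toy_vowel_length;
print "table size unchanged\n" if $size_before == $size_after;
```

The extra exists test costs a second hash lookup per call, which is the overhead the unchecked function avoids.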


normalize_tone_marks( $text )

Normalizes the order of tone marks in $text. When a combining vowel such as ◌ິ, ◌ຸ or ◌ໍ is used with a tone mark, the two have to be typed in the order consonant-vowel-tonemark, because renderers are supposed to stack above-consonant signs in the order in which they appear in the text, and tone marks are supposed to go on top. As some renderers put tone marks on top no matter what, these sequences are sometimes incorrectly entered as consonant-tonemark-vowel and would thus not be parsed correctly.

This function is just meant for internal use and modifies its argument in place for speed!
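
The reordering and the in-place modification can be sketched as below. This is an illustration under assumptions, not the module's implementation: the character classes (tone marks U+0EC8 to U+0ECB, combining vowels U+0EB1, U+0EB4 to U+0EB9, U+0EBB and U+0ECD) are chosen for the example, and the name normalize_tone_marks_sketch is hypothetical.

```perl
use strict;
use warnings;

# Swap "tone mark, combining vowel" into "combining vowel, tone mark".
# Works on $_[0] directly, so the caller's variable is changed and no
# copy of the text is made.
sub normalize_tone_marks_sketch {
    $_[0] =~ s/([\x{0EC8}-\x{0ECB}])([\x{0EB1}\x{0EB4}-\x{0EB9}\x{0EBB}\x{0ECD}])/$2$1/g;
    return;
}

# Consonant U+0E81, tone mark U+0EC8, vowel U+0EB4: wrong order.
my $text = "\x{0E81}\x{0EC8}\x{0EB4}";
normalize_tone_marks_sketch($text);

print "reordered\n" if $text eq "\x{0E81}\x{0EB4}\x{0EC8}";
```

Because the argument is modified through its alias in @_, a function written this way has to be called with a variable; passing a string literal would die with a "Modification of a read-only value" error.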