NAME

CharClass::Matcher -- Generate C macros that match character classes efficiently

SYNOPSIS

    perl Porting/regcharclass.pl

DESCRIPTION

Dynamically generates macros for detecting special charclasses in latin-1, utf8, and codepoint forms. Macros can be set to return the length (in bytes) of the matched codepoint, or the codepoint itself.

To regenerate regcharclass.h, run this script from perl-root. No arguments are necessary.

Using WHATEVER as an example, the following macros will be produced:

is_WHATEVER(s,is_utf8)
is_WHATEVER_safe(s,e,is_utf8)

Do a lookup as appropriate based on the is_utf8 flag. When possible, comparisons involving octets < 128 are done before checking the is_utf8 flag, hopefully saving time.
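As a rough illustration of the flag-based lookup, the sketch below uses a hypothetical class EXAMPLE containing only U+000A and U+0085. It is written as a function for readability rather than as a macro, and it is not copied from the generated regcharclass.h. It assumes that s points at two or more readable octets whenever is_utf8 is true.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical class EXAMPLE = { U+000A, U+0085 }; illustrative only.
 * Returns the number of octets matched, or 0 if there is no match. */
static inline size_t is_EXAMPLE(const uint8_t *s, int is_utf8)
{
    /* U+000A is below 128, so it is encoded identically in latin-1 and
     * UTF-8 and can be tested before looking at the flag. */
    if (s[0] == 0x0A)
        return 1;

    if (is_utf8)
        /* U+0085 is the two-octet sequence C2 85 in UTF-8. */
        return (s[0] == 0xC2 && s[1] == 0x85) ? 2 : 0;

    /* In latin-1, U+0085 is the single octet 0x85. */
    return s[0] == 0x85 ? 1 : 0;
}
```

With this sketch, the octet 0x0A matches with either flag value, while U+0085 matches only when the flag agrees with the actual encoding of the buffer.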

is_WHATEVER_utf8(s)
is_WHATEVER_utf8_safe(s,e)

Do a lookup assuming the string is encoded in (normalized) UTF8.
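The difference between the plain and _safe forms can be sketched as follows, for a hypothetical class EXAMPLE containing U+0085, U+2028 and U+2029. This is an assumption-laden illustration, not the generated code: it uses functions instead of macros, and it assumes that e points one past the last readable octet, so that the _safe form never reads at or beyond e.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical class EXAMPLE = { U+0085, U+2028, U+2029 }; illustrative only.
 * Both functions return the length in octets of the match, or 0. */

/* Plain form: the caller guarantees that enough octets are readable. */
static inline size_t is_EXAMPLE_utf8(const uint8_t *s)
{
    if (s[0] == 0xC2)
        return s[1] == 0x85 ? 2 : 0;              /* U+0085 = C2 85 */
    if (s[0] == 0xE2 && s[1] == 0x80
        && (s[2] == 0xA8 || s[2] == 0xA9))
        return 3;                                 /* U+2028 / U+2029 */
    return 0;
}

/* Safe form: check how many octets remain before e (assumed to be the
 * end of the buffer) before reading each sequence. */
static inline size_t is_EXAMPLE_utf8_safe(const uint8_t *s, const uint8_t *e)
{
    ptrdiff_t avail = e - s;                      /* octets left before e */

    if (avail >= 2 && s[0] == 0xC2 && s[1] == 0x85)
        return 2;
    if (avail >= 3 && s[0] == 0xE2 && s[1] == 0x80
        && (s[2] == 0xA8 || s[2] == 0xA9))
        return 3;
    return 0;
}
```

A truncated sequence, such as E2 80 with e placed right after the second octet, is rejected by the safe form instead of being read past the end.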

is_WHATEVER_latin1(s)
is_WHATEVER_latin1_safe(s,e)

Do a lookup assuming the string is encoded in latin-1 (aka plain octets).

is_WHATEVER_cp(cp)

Check to see if the string matches a given codepoint (hypothetically a U32). The condition is constructed so as to "break out" as early as possible if the codepoint is out of range of the condition.

IOW:

  (cp==X || (cp>X && (cp==Y || (cp>Y && ...))))

Thus if the codepoint is below X, only two comparisons will be done. This makes matching lookups slower, but non-matching lookups faster.
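To make the nested condition concrete, here is a minimal sketch for a hypothetical class EXAMPLE whose members, in ascending order, are U+000A, U+000D and U+0085. The class and the macro name are invented for illustration and are not taken from regcharclass.h.

```c
#include <stdint.h>

/* Hypothetical class EXAMPLE = { U+000A, U+000D, U+0085 }; illustrative only.
 * Members are tested in ascending order; each "cp > member" test lets the
 * expression stop as soon as cp falls below the next candidate. */
#define is_EXAMPLE_cp(cp)                                   \
    ( (cp) == 0x0A || ( (cp) > 0x0A &&                      \
    ( (cp) == 0x0D || ( (cp) > 0x0D &&                      \
      (cp) == 0x85 ) ) ) )
```

For example, is_EXAMPLE_cp(0x09) stops after the first equality and the first greater-than test, whereas is_EXAMPLE_cp(0x85) has to walk through every level. Because the macro evaluates its argument several times, the argument should be free of side effects.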

Additionally, it is possible to generate what_ variants that return the codepoint read instead of the number of octets read. This is done by suffixing '-cp' to the type description.
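A what_ variant might look roughly like the sketch below, again for a hypothetical class EXAMPLE containing U+0085, U+2028 and U+2029. It is written as a function rather than a macro, and returning 0 when nothing matches is an assumption of this sketch, not something stated here.

```c
#include <stdint.h>

/* Hypothetical class EXAMPLE = { U+0085, U+2028, U+2029 }; illustrative only.
 * Returns the matched codepoint rather than the number of octets read.
 * Assumption: 0 is returned when there is no match. */
static inline uint32_t what_EXAMPLE_utf8(const uint8_t *s)
{
    if (s[0] == 0xC2 && s[1] == 0x85)
        return 0x0085;                      /* C2 85 */

    if (s[0] == 0xE2 && s[1] == 0x80) {
        if (s[2] == 0xA8)
            return 0x2028;                  /* E2 80 A8 */
        if (s[2] == 0xA9)
            return 0x2029;                  /* E2 80 A9 */
    }
    return 0;
}
```

Here the caller learns which member of the class was matched, at the cost of no longer being told how many octets to skip.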

CODE FORMAT

    perltidy -st -bt=1 -bbt=0 -pt=0 -sbt=1 -ce -nwls== "%f"

AUTHOR

Author: Yves Orton (demerphq) 2007

BUGS

No tests directly here (although the regex engine will fail tests if this code is broken). Insufficient documentation and no Getopts handler for using the module as a script.

LICENSE

You may distribute under the terms of either the GNU General Public License or the Artistic License, as specified in the README file.