See also "Character classification" and "Character case changing". Various functions outside this section also work specially with Unicode; search for the string "utf8" in this document to find them.
This is a misleadingly-named synonym for "is_utf8_invariant_string". On ASCII-ish platforms, the name isn't misleading: the ASCII-range characters are exactly the UTF-8 invariants. But EBCDIC machines have more invariants than just the ASCII characters, so is_utf8_invariant_string is preferred.
This is a somewhat misleadingly-named synonym for "is_utf8_invariant_string". is_utf8_invariant_string is preferred, as it indicates under what conditions the string is invariant.
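As a rough illustration of what "invariant" means here, the following is a minimal sketch that assumes an ASCII platform, where a byte is UTF-8 invariant exactly when its high bit is clear. The function name sketch_is_invariant_string is hypothetical; it is a stand-in written for this example, not Perl's implementation, and it does not cover EBCDIC.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in, NOT Perl's implementation.
 * Assumption: ASCII platform, so the invariants are the bytes 0x00..0x7F. */
static int sketch_is_invariant_string(const unsigned char *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (s[i] & 0x80) {
            return 0;   /* high bit set: representation differs under UTF-8 */
        }
    }
    return 1;           /* every byte is the same with or without UTF-8 */
}
```

A string for which this returns 1 could be copied unchanged when converting to or from UTF-8.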
Evaluates to 1 if the representation of code point cp is the same whether or not it is encoded in UTF-8; otherwise evaluates to 0. UTF-8 invariant characters can be copied as-is when converting to/from UTF-8, saving time. cp is Unicode if above 255; otherwise it is platform-native.
Evaluates to 1 if the byte c represents the same character when encoded in UTF-8 as when not; otherwise evaluates to 0. UTF-8 invariant characters can be copied as-is when converting to/from UTF-8, saving time.
In spite of the name, this macro gives the correct result if the input string from which c comes is not encoded in UTF-8.
See "UVCHR_IS_INVARIANT" for checking if a UV is invariant.
You should use this after a call to SvPV() or one of its variants, in case any call to string overloading updates the internal UTF-8 encoding flag.
Evaluates to non-zero if the first few bytes of the string starting at s and looking no further than e - 1 are well-formed UTF-8 that represents one of the Unicode surrogate code points; otherwise it evaluates to 0. If non-zero, the value gives how many bytes starting at s comprise the code point's representation.
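The byte pattern involved can be sketched as follows, assuming an ASCII platform. The surrogates U+D800..U+DFFF all encode as three bytes: 0xED, then a byte in 0xA0..0xBF, then a continuation byte in 0x80..0xBF. The function name sketch_surrogate_len is hypothetical and is not the macro itself; it only mirrors the described return convention (0, or the number of bytes).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in, NOT Perl's macro. Assumes an ASCII platform.
 * Looks at bytes in [s, e) only, i.e. no further than e - 1. */
static size_t sketch_surrogate_len(const unsigned char *s,
                                   const unsigned char *e)
{
    assert(s <= e);
    if (e - s < 3)                      return 0;  /* not enough bytes */
    if (s[0] != 0xED)                   return 0;  /* wrong start byte */
    if (s[1] < 0xA0 || s[1] > 0xBF)     return 0;  /* below U+D800 */
    if (s[2] < 0x80 || s[2] > 0xBF)     return 0;  /* bad continuation */
    return 3;                           /* surrogates always take 3 bytes */
}
```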
Recall that Perl recognizes an extension to UTF-8 that can encode code points larger than the ones defined by Unicode, which are 0..0x10FFFF.
This macro evaluates to non-zero if the first few bytes of the string starting at s and looking no further than e - 1 are from this UTF-8 extension; otherwise it evaluates to 0. If non-zero, the value gives how many bytes starting at s comprise the code point's representation.
0 is returned if the bytes are not well-formed extended UTF-8, or if they represent a code point that cannot fit in a UV on the current platform. Hence this macro can give different results on a machine with a 64-bit word size than on one with a 32-bit word size.
Note that it is illegal to have code points that are larger than what can fit in an IV on the current machine.
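To make the boundary at 0x10FFFF concrete, here is a deliberately simplified sketch, assuming an ASCII platform. It only inspects the leading bytes: U+10FFFF encodes as F4 8F BF BF, so a sequence that starts with F4 followed by a byte in 0x90..0xBF, or with any start byte of 0xF5 or higher, can only encode something above Unicode. The function name sketch_starts_above_unicode is hypothetical. Unlike the macro described above, it returns just 0 or 1 rather than a length, and it does not check well-formedness, overlong forms, or whether the value fits in a UV.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in, NOT Perl's macro. Assumes an ASCII platform.
 * Classifies by leading bytes only; performs no well-formedness checks. */
static int sketch_starts_above_unicode(const unsigned char *s,
                                       const unsigned char *e)
{
    assert(s <= e);
    if (s == e) {
        return 0;                       /* empty input */
    }
    if (s[0] >= 0xF5) {
        return 1;                       /* start byte beyond the Unicode range */
    }
    if (s[0] == 0xF4 && e - s >= 2 && s[1] >= 0x90 && s[1] <= 0xBF) {
        return 1;                       /* F4 90.. is U+110000 and up */
    }
    return 0;
}
```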
Evaluates to non-zero if the first few bytes of the string starting at s and looking no further than e - 1 are well-formed UTF-8 that represents one of the Unicode non-character code points; otherwise it evaluates to 0. If non-zero, the value gives how many bytes starting at s comprise the code point's representation.
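For reference, the Unicode non-characters are U+FDD0..U+FDEF plus every code point whose low 16 bits are 0xFFFE or 0xFFFF, up to U+10FFFF. A minimal sketch of such a check, assuming an ASCII platform and restricting itself to standard one- to four-byte UTF-8, might look like the following. Both function names are hypothetical, and this decoder is an illustration rather than Perl's own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT Perl's decoder: decodes one standard UTF-8
 * character from [s, e). Returns its length in bytes, or 0 if malformed. */
static size_t sketch_decode(const unsigned char *s, const unsigned char *e,
                            uint32_t *cp)
{
    size_t len;
    uint32_t min, v;

    if (s >= e) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { len = 2; min = 0x80;    v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; min = 0x800;   v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; min = 0x10000; v = s[0] & 0x07; }
    else return 0;                          /* stray continuation or > 4 bytes */

    if ((size_t)(e - s) < len) return 0;    /* would read past e - 1 */
    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        v = (v << 6) | (s[i] & 0x3F);
    }
    if (v < min || v > 0x10FFFF) return 0;  /* overlong or above Unicode */
    *cp = v;
    return len;
}

/* Returns the byte length if [s, e) starts with a non-character, else 0. */
static size_t sketch_nonchar_len(const unsigned char *s, const unsigned char *e)
{
    uint32_t cp;
    size_t len;

    assert(s <= e);
    len = sketch_decode(s, e, &cp);
    if (len == 0) return 0;
    if ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE) return len;
    return 0;
}
```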
Evaluates to non-zero if the first few bytes of the string starting at s and looking no further than e - 1 are well-formed UTF-8, as extended by Perl, that represents some code point, subject to the restrictions given by flags; otherwise it evaluates to 0. If non-zero, the value gives how many bytes starting at s comprise the code point's representation. Any bytes remaining before e, but beyond the ones needed to form the first code point in s, are not examined.
If flags is 0, this gives the same results as "isUTF8_CHAR"; if flags is UTF8_DISALLOW_ILLEGAL_INTERCHANGE, this gives the same results as "isSTRICT_UTF8_CHAR"; and if flags is UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE, this gives the same results as "isC9_STRICT_UTF8_CHAR". Otherwise flags may be any combination of the UTF8_DISALLOW_foo flags understood by "utf8n_to_uvchr", with the same meanings.
The three alternative macros are for the most commonly needed validations; they are likely to run somewhat faster than this more general one, as they can be inlined into your code.
Use "is_utf8_string_flags", "is_utf8_string_loc_flags", and "is_utf8_string_loclen_flags" to check entire strings.
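The idea of restricting acceptance through a combination of flags can be sketched at the code point level as follows. The flag names and bit values below (SKETCH_DISALLOW_...) are hypothetical placeholders chosen for this example; they are not the values from Perl's headers. The three categories correspond to the surrogate, non-character, and above-Unicode classes described earlier in this section.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits for illustration only; NOT Perl's definitions. */
enum {
    SKETCH_DISALLOW_SURROGATE = 1 << 0,
    SKETCH_DISALLOW_NONCHAR   = 1 << 1,
    SKETCH_DISALLOW_SUPER     = 1 << 2,
    SKETCH_DISALLOW_ALL       = (1 << 0) | (1 << 1) | (1 << 2)
};

/* Returns 1 if code point cp is acceptable under flags, else 0.
 * With flags == 0 every code point is accepted. */
static int sketch_cp_allowed(uint64_t cp, unsigned flags)
{
    assert((flags & ~(unsigned)SKETCH_DISALLOW_ALL) == 0); /* known bits only */

    if ((flags & SKETCH_DISALLOW_SURROGATE) && cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;
    }
    if ((flags & SKETCH_DISALLOW_SUPER) && cp > 0x10FFFF) {
        return 0;
    }
    if ((flags & SKETCH_DISALLOW_NONCHAR) && cp <= 0x10FFFF
        && ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)) {
        return 0;
    }
    return 1;
}
```

Because each restriction is an independent bit, callers can OR together only the checks they need, which is the same shape of interface the flags parameter above describes.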