
Unicode Support

"Unicode Support" in perlguts has an introduction to this API.

See also "Character classification" and "Character case changing". Various functions outside this section also handle Unicode specially; search for the string "utf8" in this document to find them.

This is a misleadingly-named synonym for "is_utf8_invariant_string". On ASCII-ish platforms, the name isn't misleading: the ASCII-range characters are exactly the UTF-8 invariants. But EBCDIC machines have more invariants than just the ASCII characters, so is_utf8_invariant_string is preferred.

This is a somewhat misleadingly-named synonym for "is_utf8_invariant_string". is_utf8_invariant_string is preferred, as it indicates under what conditions the string is invariant.
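As a rough illustration of what an invariant-string check amounts to on an ASCII platform, the sketch below scans a buffer for any byte at or above 0x80. The name sketch_is_invariant_string is hypothetical and is not part of the Perl API; EBCDIC platforms have a different set of invariants and are not covered here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT the Perl API function.
 * Assumes an ASCII platform, where the UTF-8 invariants are exactly the
 * bytes 0x00..0x7F. Returns 1 if every byte in s[0..len-1] is invariant,
 * otherwise 0. */
static int sketch_is_invariant_string(const uint8_t *s, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0x80) {
            return 0;   /* this byte would change under UTF-8 encoding */
        }
    }
    return 1;
}
```

A string that passes this check has the same bytes whether or not it is treated as UTF-8, so it can be handed to either kind of consumer unchanged.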

Evaluates to 1 if the representation of code point cp is the same whether or not it is encoded in UTF-8; otherwise evaluates to 0. UTF-8 invariant characters can be copied as-is when converting to/from UTF-8, saving time. cp is a Unicode code point if above 255; otherwise it is platform-native.

Evaluates to 1 if the byte c represents the same character when encoded in UTF-8 as when not; otherwise evaluates to 0. UTF-8 invariant characters can be copied as-is when converting to/from UTF-8, saving time.

In spite of the name, this macro gives the correct result even if the input string from which c comes is not encoded in UTF-8.

See "UVCHR_IS_INVARIANT" for checking if a UV is invariant.
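The time saving mentioned above can be seen in a conversion loop. The following is a minimal sketch, assuming an ASCII platform and Latin-1 input; sketch_latin1_to_utf8 is a hypothetical name, not a Perl API function. Invariant bytes take the one-line fast path, and only the others need to be expanded to two bytes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical converter, NOT part of the Perl API.
 * Assumes an ASCII platform and Latin-1 (code points 0..255) input.
 * dst must have room for 2 * len bytes. Returns the number of bytes
 * written to dst. */
static size_t sketch_latin1_to_utf8(const uint8_t *src, size_t len,
                                    uint8_t *dst)
{
    size_t out = 0;
    for (size_t i = 0; i < len; i++) {
        uint8_t c = src[i];
        if (c < 0x80) {
            /* UTF-8 invariant: copy as-is */
            dst[out++] = c;
        } else {
            /* 0x80..0xFF becomes a two-byte sequence */
            dst[out++] = (uint8_t)(0xC0 | (c >> 6));
            dst[out++] = (uint8_t)(0x80 | (c & 0x3F));
        }
    }
    return out;
}
```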

You should use this after a call to SvPV() or one of its variants, in case any call to string overloading updates the internal UTF-8 encoding flag.

Evaluates to non-zero if the first few bytes of the string starting at s and looking no further than e - 1 are well-formed UTF-8 that represents one of the Unicode surrogate code points; otherwise it evaluates to 0. If non-zero, the value gives how many bytes starting at s comprise the code point's representation.
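To make the return convention concrete, here is a minimal sketch of a surrogate test on an ASCII platform, where the surrogates U+D800..U+DFFF encode as the byte 0xED followed by a byte in 0xA0..0xBF and a byte in 0x80..0xBF. The function name is hypothetical; this is not how the Perl macro is implemented.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT the Perl macro. Assumes an ASCII platform.
 * Looks no further than e - 1. Returns 3 (the length of the encoded
 * code point) if s starts with a UTF-8 encoded surrogate, otherwise 0. */
static size_t sketch_surrogate_len(const uint8_t *s, const uint8_t *e)
{
    if (e - s < 3) {
        return 0;   /* not enough bytes available */
    }
    if (s[0] == 0xED
        && s[1] >= 0xA0 && s[1] <= 0xBF
        && s[2] >= 0x80 && s[2] <= 0xBF) {
        return 3;
    }
    return 0;
}
```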

Recall that Perl recognizes an extension to UTF-8 that can encode code points larger than the ones defined by Unicode, which are 0..0x10FFFF.

This macro evaluates to non-zero if the first few bytes of the string starting at s and looking no further than e - 1 are from this UTF-8 extension; otherwise it evaluates to 0. If non-zero, the value gives how many bytes starting at s comprise the code point's representation.

0 is returned if the bytes are not well-formed extended UTF-8, or if they represent a code point that cannot fit in a UV on the current platform. Hence this macro can give different results when run on a 64-bit word machine than on one with a 32-bit word size.

Note that it is illegal to have code points that are larger than what can fit in an IV on the current machine.
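The boundary between Unicode and the extension can be illustrated with a deliberately limited sketch. It assumes an ASCII platform and recognizes only the four-byte forms, which cover 0x110000..0x1FFFFF; the longer forms that the real macro also accepts are omitted, so this hypothetical helper returns 0 for them.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, NOT the Perl macro. Assumes an ASCII platform and
 * handles only four-byte sequences (lead bytes 0xF4..0xF7). Returns 4 if
 * s starts with a well-formed four-byte sequence for a code point above
 * 0x10FFFF, otherwise 0. */
static size_t sketch_above_unicode_len4(const uint8_t *s, const uint8_t *e)
{
    if (e - s < 4) {
        return 0;
    }
    if (s[0] < 0xF4 || s[0] > 0xF7) {
        return 0;   /* not a lead byte this sketch handles */
    }
    for (int i = 1; i < 4; i++) {
        if (s[i] < 0x80 || s[i] > 0xBF) {
            return 0;   /* not a continuation byte */
        }
    }
    if (s[0] == 0xF4 && s[1] < 0x90) {
        return 0;   /* 0x100000..0x10FFFF is still within Unicode */
    }
    return 4;
}
```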

Evaluates to non-zero if the first few bytes of the string starting at s and looking no further than e - 1 are well-formed UTF-8 that represents one of the Unicode non-character code points; otherwise it evaluates to 0. If non-zero, the value gives how many bytes starting at s comprise the code point's representation.
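For reference, the Unicode non-character code points are U+FDD0..U+FDEF plus the last two code points of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, and so on up to U+10FFFF). The sketch below tests an already-decoded code point rather than UTF-8 bytes, so it illustrates the set being matched, not the macro's interface; the function name is hypothetical.

```c
#include <stdint.h>

/* Hypothetical helper, NOT the Perl macro. Takes a decoded code point
 * and returns 1 if it is one of the 66 Unicode non-characters, else 0. */
static int sketch_is_nonchar(uint32_t cp)
{
    if (cp >= 0xFDD0 && cp <= 0xFDEF) {
        return 1;   /* the contiguous block of 32 non-characters */
    }
    if (cp <= 0x10FFFF && (cp & 0xFFFE) == 0xFFFE) {
        return 1;   /* last two code points of a plane */
    }
    return 0;
}
```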

Evaluates to non-zero if the first few bytes of the string starting at s and looking no further than e - 1 are well-formed UTF-8, as extended by Perl, that represents some code point, subject to the restrictions given by flags; otherwise it evaluates to 0. If non-zero, the value gives how many bytes starting at s comprise the code point's representation. Any bytes remaining before e, but beyond the ones needed to form the first code point in s, are not examined.

If flags is 0, this gives the same results as "isUTF8_CHAR"; if flags is UTF8_DISALLOW_ILLEGAL_INTERCHANGE, this gives the same results as "isSTRICT_UTF8_CHAR"; and if flags is UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE, this gives the same results as "isC9_STRICT_UTF8_CHAR". Otherwise flags may be any combination of the UTF8_DISALLOW_foo flags understood by "utf8n_to_uvchr", with the same meanings.

The three alternative macros are for the most commonly needed validations; they are likely to run somewhat faster than this more general one, as they can be inlined into your code.
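The way a flags argument narrows what is accepted can be sketched on decoded code points. The flag names below are hypothetical and are not Perl's UTF8_DISALLOW_foo constants. The sketch assumes that the strict variant excludes surrogates, non-characters, and above-Unicode code points, and that the C9 variant excludes surrogates and above-Unicode code points but accepts non-characters.

```c
#include <stdint.h>

/* Hypothetical flag names for this sketch only; they are NOT Perl's
 * UTF8_DISALLOW_foo constants. */
enum {
    SKETCH_DISALLOW_SURROGATE = 1 << 0,
    SKETCH_DISALLOW_NONCHAR   = 1 << 1,
    SKETCH_DISALLOW_SUPER     = 1 << 2
};

/* Returns 1 if the decoded code point cp is acceptable under flags,
 * otherwise 0. With flags == 0 every code point is accepted. */
static int sketch_cp_allowed(uint32_t cp, unsigned flags)
{
    if ((flags & SKETCH_DISALLOW_SURROGATE)
        && cp >= 0xD800 && cp <= 0xDFFF) {
        return 0;
    }
    if ((flags & SKETCH_DISALLOW_SUPER) && cp > 0x10FFFF) {
        return 0;
    }
    if ((flags & SKETCH_DISALLOW_NONCHAR) && cp <= 0x10FFFF
        && ((cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE)) {
        return 0;
    }
    return 1;
}
```

Under those assumptions, passing all three flags models the strict check, and passing only the surrogate and above-Unicode flags models the C9 check.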

Use "is_utf8_string_flags", "is_utf8_string_loc_flags", and "is_utf8_string_loclen_flags" to check entire strings.