This typedef is used by several core functions that return PV strings, to indicate the UTF-8ness of those strings.

(If you write a new function, you probably should instead return the PV in an SV with the UTF-8 flag of the SV properly set, rather than use this mechanism.)

The possible values this can be are:

UTF8NESS_YES

This means the string definitely should be treated as a sequence of UTF-8-encoded characters.

Most code that needs to handle this typedef should be of the form:

 if (utf8ness_flag == UTF8NESS_YES) {
     /* treat the string as UTF-8, like turning on an SV's UTF-8 flag */
 }

UTF8NESS_NO

This means the string definitely should be treated as a sequence of bytes, not encoded as UTF-8.

UTF8NESS_IMMATERIAL

This means it is equally valid to treat the string as bytes, or as UTF-8 characters; use whichever way you want. This happens when the string consists entirely of characters which have the same representation whether encoded in UTF-8 or not.

UTF8NESS_UNKNOWN

This means it is unknown how the string should be treated. No core function will ever return this value to a non-core caller. Instead, it is used by the caller to initialize a variable to a non-legal value. A typical call will look like:

 utf8ness_t string_is_utf8 = UTF8NESS_UNKNOWN;
 const char * string = foo(arg1, arg2, ..., &string_is_utf8);
 if (string_is_utf8 == UTF8NESS_YES) {
     /* do something for UTF-8 */
 }

The following relationships hold between the enum values:

0 <= enum value <= UTF8NESS_IMMATERIAL

the string may be treated in code as non-UTF8

UTF8NESS_IMMATERIAL <= enum value

the string may be treated in code as encoded in UTF-8
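
The two relationships above can be sketched as a pair of predicates. This is a minimal illustration, not Perl's header: the `sketch_` names are hypothetical, and the numeric values in the local enum are assumptions chosen only so that the stated orderings hold.

```c
/* Local mirror of the enum, for illustration only. The numeric values are
 * assumptions picked to satisfy the relationships described in the text. */
typedef enum {
    SKETCH_UTF8NESS_UNKNOWN    = -1,  /* non-legal initial value */
    SKETCH_UTF8NESS_NO         =  0,
    SKETCH_UTF8NESS_IMMATERIAL =  1,
    SKETCH_UTF8NESS_YES        =  2
} sketch_utf8ness_t;

/* 0 <= value <= IMMATERIAL: the string may be treated as non-UTF-8. */
static int sketch_may_treat_as_bytes(sketch_utf8ness_t v)
{
    return 0 <= (int) v && (int) v <= (int) SKETCH_UTF8NESS_IMMATERIAL;
}

/* IMMATERIAL <= value: the string may be treated as UTF-8. */
static int sketch_may_treat_as_utf8(sketch_utf8ness_t v)
{
    return (int) v >= (int) SKETCH_UTF8NESS_IMMATERIAL;
}
```

With these assumed values, the IMMATERIAL case satisfies both predicates and the UNKNOWN case satisfies neither.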

This is a misleadingly-named synonym for "is_utf8_invariant_string". On ASCII-ish platforms, the name isn't misleading: the ASCII-range characters are exactly the UTF-8 invariants. But EBCDIC machines have more invariants than just the ASCII characters, so is_utf8_invariant_string is preferred.

This is a somewhat misleadingly-named synonym for "is_utf8_invariant_string". is_utf8_invariant_string is preferred, as it indicates under what conditions the string is invariant.
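
To make the notion of an invariant string concrete, here is a rough sketch of the check on ASCII platforms only, where the invariants are exactly the bytes below 0x80. The function name is hypothetical, and the EBCDIC case, where the set of invariants differs, is deliberately not modeled.

```c
#include <stddef.h>

/* Sketch for ASCII platforms only: a string is UTF-8 invariant when every
 * byte is below 0x80. Returns 1 if so, 0 otherwise. An empty string is
 * trivially invariant. */
static int sketch_is_invariant_string(const unsigned char *s, size_t len)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (s[i] >= 0x80) {
            return 0;   /* this byte differs between UTF-8 and non-UTF-8 */
        }
    }
    return 1;
}
```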

Returns the Latin-1 (including ASCII and control characters) equivalent of the input native code point given by ch. Thus, NATIVE_TO_LATIN1(193) on EBCDIC platforms returns 65. Both values represent the character "A" on their respective platforms. On ASCII platforms no conversion is needed, so this macro expands to just its input, adding neither time nor space overhead to the implementation.

For conversion of code points potentially larger than will fit in a character, use "NATIVE_TO_UNI".

Returns the native equivalent of the input Latin-1 code point (including ASCII and control characters) given by ch. Thus, LATIN1_TO_NATIVE(66) on EBCDIC platforms returns 194. Both values represent the character "B" on their respective platforms. On ASCII platforms no conversion is needed, so this macro expands to just its input, adding neither time nor space overhead to the implementation.

For conversion of code points potentially larger than will fit in a character, use "UNI_TO_NATIVE".

Returns the Unicode equivalent of the input native code point given by ch. Thus, NATIVE_TO_UNI(195) on EBCDIC platforms returns 67. Both values represent the character "C" on their respective platforms. On ASCII platforms no conversion is needed, so this macro expands to just its input, adding neither time nor space overhead to the implementation.

Returns the native equivalent of the input Unicode code point given by ch. Thus, UNI_TO_NATIVE(68) on EBCDIC platforms returns 196. Both values represent the character "D" on their respective platforms. On ASCII platforms no conversion is needed, so this macro expands to just its input, adding neither time nor space overhead to the implementation.
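
The numeric examples in the four conversion entries above all fall in the EBCDIC range where the letters "A" through "I" occupy 0xC1 through 0xC9. The sketch below reproduces only that tiny slice of the mapping. Both helper names are hypothetical, the real macros cover the whole character range, and anything outside the nine letters is reported as -1 here rather than guessed.

```c
/* Hypothetical helper: maps EBCDIC 'A'..'I' (0xC1..0xC9) to their Latin-1
 * code points (0x41..0x49). Returns -1 for anything outside that range,
 * because this sketch does not model the rest of the table. */
static int sketch_ebcdic_to_latin1(int ch)
{
    if (ch >= 0xC1 && ch <= 0xC9) {
        return ch - 0xC1 + 0x41;
    }
    return -1;
}

/* Hypothetical inverse: Latin-1 'A'..'I' to EBCDIC, or -1 if not covered. */
static int sketch_latin1_to_ebcdic(int ch)
{
    if (ch >= 0x41 && ch <= 0x49) {
        return ch - 0x41 + 0xC1;
    }
    return -1;
}
```

On an ASCII platform neither helper would be needed, since the conversion is the identity.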

Evaluates to 1 if the representation of code point cp is the same whether or not it is encoded in UTF-8; otherwise evaluates to 0. UTF-8 invariant characters can be copied as-is when converting to/from UTF-8, saving time. cp is treated as Unicode if it is above 255; otherwise it is platform-native.

The maximum width of a single UTF-8 encoded character, in bytes.

NOTE: Strictly speaking Perl's UTF-8 should not be called UTF-8 since UTF-8 is an encoding of Unicode, and Unicode's upper limit, 0x10FFFF, can be expressed with 4 bytes. However, Perl thinks of UTF-8 as a way to encode non-negative integers in a binary format, even those above Unicode.
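
The claim that Unicode's upper limit fits in 4 bytes can be checked with a small length calculation. This sketch covers standard UTF-8 on ASCII platforms only. The function name is hypothetical, and Perl's extended encoding for larger integers is deliberately not modeled, so such inputs return 0.

```c
/* Number of bytes standard UTF-8 needs to encode cp, or 0 if cp lies above
 * Unicode's upper limit of 0x10FFFF (the extended forms are not modeled). */
static int sketch_utf8_len(unsigned long cp)
{
    if (cp < 0x80UL)       return 1;   /* 7 payload bits  */
    if (cp < 0x800UL)      return 2;   /* 11 payload bits */
    if (cp < 0x10000UL)    return 3;   /* 16 payload bits */
    if (cp <= 0x10FFFFUL)  return 4;   /* up to 21 payload bits */
    return 0;
}
```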

The maximum number of UTF-8 bytes a single Unicode character can uppercase/lowercase/titlecase/fold into.

If there is a possibility of malformed input, use instead:

"UTF8_SAFE_SKIP" if you know the maximum ending pointer in the buffer pointed to by s; or
"UTF8_CHK_SKIP" if you don't know it.

It is better to restructure your code so the end pointer is passed down so that you know what it actually is at the point of this call, but if that isn't possible, "UTF8_CHK_SKIP" can minimize the chance of accessing beyond the end of the input buffer.
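
The difference between trusting the lead byte and clamping against a known end pointer can be sketched as follows. These are hypothetical stand-ins, not the real macros: the lead-byte table covers only standard UTF-8 lengths 1 to 4 on ASCII platforms, and returning 0 for an empty range is an assumption of this sketch.

```c
#include <stddef.h>

/* Length implied by the lead byte at s, standard UTF-8 only. Bytes that
 * cannot start a character get length 1 so that callers always advance. */
static size_t sketch_skip(const unsigned char *s)
{
    unsigned char b = *s;

    if (b < 0xC0) return 1;   /* ASCII, or a stray continuation byte */
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    return 1;                 /* extended lead bytes are not modeled */
}

/* Like sketch_skip, but never reports more bytes than remain before e. */
static size_t sketch_safe_skip(const unsigned char *s, const unsigned char *e)
{
    size_t avail, want;

    if (e <= s) return 0;     /* nothing left to read */
    avail = (size_t) (e - s);
    want  = sketch_skip(s);
    return want < avail ? want : avail;
}
```

Given a buffer that ends partway through a three-byte character, the unclamped version would step past the end, while the clamped version stops at e.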

This is a safer version of "UTF8SKIP", but still not as safe as "UTF8_SAFE_SKIP". This version doesn't blindly assume that the input string pointed to by s is well-formed, but verifies that there isn't a NUL terminating character before the expected end of the next character in s. The length UTF8_CHK_SKIP returns stops just before any such NUL.

Perl tends to add NULs, as an insurance policy, after the end of strings in SVs, so it is likely that using this macro will prevent inadvertent reading beyond the end of the input buffer, even if it is malformed UTF-8.

This macro is intended to be used by XS modules where the inputs could be malformed, and it isn't feasible to restructure to use the safer "UTF8_SAFE_SKIP", for example when interfacing with a C library.
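
The NUL-checking behavior described above might look roughly like this. It is a sketch under stated assumptions: the names are hypothetical, only standard UTF-8 lead bytes on ASCII platforms are modeled, and treating a leading NUL as a one-byte character is an assumption of this sketch.

```c
#include <stddef.h>

/* Length implied by a lead byte, standard UTF-8 only (1..4). */
static size_t sketch_lead_len(unsigned char b)
{
    if (b < 0xC0) return 1;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    return 1;   /* extended lead bytes are not modeled */
}

/* Returns the expected length of the character at s, cut short just before
 * any NUL that appears earlier than the expected end. */
static size_t sketch_chk_skip(const unsigned char *s)
{
    size_t want, n;

    if (s[0] == '\0') return 1;   /* assumed: a NUL counts as one byte */
    want = sketch_lead_len(s[0]);
    for (n = 1; n < want; n++) {
        if (s[n] == '\0') break;  /* truncated character: stop here */
    }
    return n;
}
```

For a complete three-byte character the result is 3. If the buffer holds only the first two bytes followed by a terminating NUL, the result is 2, so the caller lands on the NUL instead of beyond it.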

Evaluates to 1 if the byte c represents the same character when encoded in UTF-8 as when not; otherwise evaluates to 0. UTF-8 invariant characters can be copied as-is when converting to/from UTF-8, saving time.

In spite of the name, this macro gives the correct result if the input string from which c comes is not encoded in UTF-8.

See "UVCHR_IS_INVARIANT" for checking if a UV is invariant.
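
As an illustration of why invariants save time during conversion, the following sketch upgrades a Latin-1 buffer to UTF-8, copying invariant bytes unchanged and expanding every other byte to two bytes. It assumes an ASCII platform, where a byte is invariant exactly when it is below 0x80. Both function names are hypothetical.

```c
#include <stddef.h>

/* ASCII-platform assumption: a byte is invariant when it is below 0x80. */
static int sketch_is_invariant(unsigned char c)
{
    return c < 0x80;
}

/* Converts len Latin-1 bytes at src into UTF-8 at dst. The caller must
 * provide room for 2 * len bytes. Returns the number of bytes written. */
static size_t sketch_latin1_to_utf8(const unsigned char *src, size_t len,
                                    unsigned char *dst)
{
    size_t i, out = 0;

    for (i = 0; i < len; i++) {
        unsigned char c = src[i];

        if (sketch_is_invariant(c)) {
            dst[out++] = c;   /* same representation: copy as-is */
        } else {
            /* Two-byte form: 110xxxxx 10xxxxxx */
            dst[out++] = (unsigned char) (0xC0 | (c >> 6));
            dst[out++] = (unsigned char) (0x80 | (c & 0x3F));
        }
    }
    return out;
}
```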

You should use this after a call to SvPV() or one of its variants, in case any call to string overloading updates the internal UTF-8 encoding flag.

Returns a boolean as to whether or not uv is one of the Unicode surrogate code points.

Evaluates to non-zero if the first few bytes of the string starting at s and looking no further than e - 1 are well-formed UTF-8 that represents one of the Unicode surrogate code points; otherwise it evaluates to 0. If non-zero, the value gives how many bytes starting at s comprise the code point's representation.
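
Both surrogate checks can be sketched for ASCII platforms. The surrogates are U+D800 through U+DFFF, which in UTF-8 are the three-byte sequences beginning with 0xED followed by a byte in 0xA0..0xBF. The function names are hypothetical stand-ins for the macros described above.

```c
/* Code point form: surrogates are U+D800..U+DFFF. */
static int sketch_is_surrogate_cp(unsigned long uv)
{
    return uv >= 0xD800UL && uv <= 0xDFFFUL;
}

/* Byte form, ASCII platforms: ED A0..BF 80..BF. Looks no further than
 * e - 1. Returns 3 (the sequence length) on a match, otherwise 0. */
static int sketch_utf8_is_surrogate(const unsigned char *s,
                                    const unsigned char *e)
{
    if (e - s < 3)                      return 0;
    if (s[0] != 0xED)                   return 0;
    if (s[1] < 0xA0 || s[1] > 0xBF)     return 0;
    if (s[2] < 0x80 || s[2] > 0xBF)     return 0;
    return 3;
}
```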

Evaluates to 0xFFFD, the code point of the Unicode REPLACEMENT CHARACTER.

Returns a boolean as to whether or not uv is the Unicode REPLACEMENT CHARACTER.

Evaluates to non-zero if the first few bytes of the string starting at s and looking no further than e - 1 are well-formed UTF-8 that represents the Unicode REPLACEMENT CHARACTER; otherwise it evaluates to 0. If non-zero, the value gives how many bytes starting at s comprise the code point's representation.
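
On ASCII platforms U+FFFD is encoded in UTF-8 as the three bytes EF BF BD, so the byte-level check reduces to a bounded comparison. The function name below is a hypothetical stand-in, and EBCDIC encodings are not modeled.

```c
/* Returns 3 if the bytes at s, looking no further than e - 1, are the
 * UTF-8 encoding of U+FFFD (EF BF BD on ASCII platforms), otherwise 0. */
static int sketch_utf8_is_replacement(const unsigned char *s,
                                      const unsigned char *e)
{
    if (e - s < 3) return 0;   /* not enough bytes available */
    if (s[0] == 0xEF && s[1] == 0xBF && s[2] == 0xBD) {
        return 3;
    }
    return 0;
}
```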

Returns a boolean as to whether or not uv is above the maximum legal Unicode code point of U+10FFFF.

Recall that Perl recognizes an extension to UTF-8 that can encode code points larger than the ones defined by Unicode, which are 0..0x10FFFF.

This macro evaluates to non-zero if the first few bytes of the string starting at s and looking no further than e - 1 are from this UTF-8 extension; otherwise it evaluates to 0. If non-zero, the return is how many bytes starting at s comprise the code point's representation.

0 is returned if the bytes are not well-formed extended UTF-8, or if they represent a code point that cannot fit in a UV on the current platform. Hence this macro can give different results when run on a 64-bit word machine than on one with a 32-bit word size.

Note that it is illegal in Perl to have code points that are larger than what can fit in an IV on the current machine; and illegal in Unicode to have any that this macro matches.

Returns a boolean as to whether or not uv is one of the Unicode non-character code points.

Evaluates to non-zero if the first few bytes of the string starting at s and looking no further than e - 1 are well-formed UTF-8 that represents one of the Unicode non-character code points; otherwise it evaluates to 0. If non-zero, the value gives how many bytes starting at s comprise the code point's representation.
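
For reference, the Unicode non-characters are U+FDD0 through U+FDEF, plus the last two code points of every plane (those ending in FFFE or FFFF) up to U+10FFFF. A code point test along those lines could be sketched as below. The function name is hypothetical, and treating values above U+10FFFF as outside the set is an assumption of this sketch.

```c
/* Returns 1 if uv is a Unicode non-character, otherwise 0. */
static int sketch_is_nonchar(unsigned long uv)
{
    if (uv > 0x10FFFFUL) {
        return 0;   /* above Unicode: not classified here (assumption) */
    }
    if (uv >= 0xFDD0UL && uv <= 0xFDEFUL) {
        return 1;   /* the contiguous block of 32 non-characters */
    }
    /* The last two code points of each plane: xxFFFE and xxFFFF. */
    return (uv & 0xFFFEUL) == 0xFFFEUL;
}
```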
