Unicode Support

PERL_STATIC_INLINE void
Perl_append_utf8_from_native_byte(const U8 byte, U8** dest)
{
    /* Takes an input 'byte' (Latin1 or EBCDIC) and appends it to the UTF-8
     * encoded string at '*dest', updating '*dest' to include it */

    PERL_ARGS_ASSERT_APPEND_UTF8_FROM_NATIVE_BYTE;

    if (NATIVE_BYTE_IS_INVARIANT(byte))
        *((*dest)++) = byte;
    else {
        *((*dest)++) = UTF8_EIGHT_BIT_HI(byte);
        *((*dest)++) = UTF8_EIGHT_BIT_LO(byte);
    }
}
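The function above relies on Perl-internal macros. As a rough, standalone illustration of the same idea, the sketch below assumes an ASCII platform (no EBCDIC) and uses hypothetical names (`append_utf8_from_latin1_byte`, `latin1_to_utf8`) that are not part of Perl's API. Bytes below 0x80 are copied unchanged; every other Latin-1 byte becomes a two-byte UTF-8 sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the Perl function, ASCII platforms only:
 * bytes below 0x80 are invariant, all others expand to two bytes. */
static void append_utf8_from_latin1_byte(uint8_t byte, uint8_t **dest)
{
    assert(dest != NULL && *dest != NULL);
    if (byte < 0x80) {
        *((*dest)++) = byte;
    } else {
        *((*dest)++) = (uint8_t)(0xC0 | (byte >> 6));   /* leading byte */
        *((*dest)++) = (uint8_t)(0x80 | (byte & 0x3F)); /* continuation byte */
    }
}

/* Converts len Latin-1 bytes into out, which must have room for 2 * len
 * bytes. Returns the number of bytes written. */
static size_t latin1_to_utf8(const uint8_t *in, size_t len, uint8_t *out)
{
    uint8_t *d = out;
    for (size_t i = 0; i < len; i++)
        append_utf8_from_latin1_byte(in[i], &d);
    return (size_t)(d - out);
}
```

Passing a pointer to the write cursor lets the caller append byte after byte without tracking how far each call advanced.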

valid_utf8_to_uvchr: Like "utf8_to_uvchr_buf" in perlapi, but should only be called when it is known that the next character in the input UTF-8 string s is well-formed (e.g., it passes "isUTF8_CHAR" in perlapi). Surrogates, non-character code points, and non-Unicode code points are allowed.

Returns TRUE if the first len bytes of the string s are the same regardless of the UTF-8 encoding of the string (or UTF-EBCDIC encoding on EBCDIC machines); otherwise it returns FALSE. That is, it returns TRUE if they are UTF-8 invariant. On ASCII-ish machines, all the ASCII characters and only the ASCII characters fit this definition. On EBCDIC machines, the ASCII-range characters are invariant, but so also are the C1 controls.

If len is 0, it will be calculated using strlen(s) (which means that, if you use this option, s can't have embedded NUL characters and has to have a terminating NUL byte).

See also "is_utf8_string", "is_utf8_string_flags", "is_utf8_string_loc", "is_utf8_string_loc_flags", "is_utf8_string_loclen", "is_utf8_string_loclen_flags", "is_utf8_fixed_width_buf_flags", "is_utf8_fixed_width_buf_loc_flags", "is_utf8_fixed_width_buf_loclen_flags", "is_strict_utf8_string", "is_strict_utf8_string_loc", "is_strict_utf8_string_loclen", "is_c9strict_utf8_string", "is_c9strict_utf8_string_loc", and "is_c9strict_utf8_string_loclen".

Like "is_utf8_invariant_string" but upon failure, stores the location of the first UTF-8 variant character in the ep pointer; if all characters are UTF-8 invariant, this function does not change the contents of *ep.
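The invariance test and its location-reporting variant can be illustrated with a small standalone sketch. It assumes an ASCII platform, where the invariant bytes are exactly those below 0x80, and `invariant_string_loc` is a hypothetical name rather than Perl's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns true if the first len bytes of s are all below 0x80.
 * A len of 0 means "use strlen(s)", mirroring the documented convention.
 * On failure, *ep (if ep is non-NULL) receives the first variant byte;
 * on success, *ep is left untouched. */
static bool invariant_string_loc(const unsigned char *s, size_t len,
                                 const unsigned char **ep)
{
    assert(s != NULL);
    if (len == 0)
        len = strlen((const char *)s);
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0x80) {
            if (ep)
                *ep = s + i;
            return false;
        }
    }
    return true;
}
```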

This function looks at the sequence of bytes between s and e, which are assumed to be encoded in ASCII/Latin1, and returns how many of them would change should the string be translated into UTF-8. Due to the nature of UTF-8, each of these would occupy two bytes instead of the single one in the input string. Thus, this function returns the precise number of bytes the string would expand by when translated to UTF-8.

Unlike most of the other functions that have utf8 in their name, the input to this function is NOT a UTF-8-encoded string. The function name is slightly odd to emphasize this.

This function is internal to Perl because khw thinks that any XS code that would want this is probably operating too close to the internals. Presenting a valid use case could change that.

See also "is_utf8_invariant_string" in perlapi and "is_utf8_invariant_string_loc" in perlapi.
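A minimal sketch of the counting described above, assuming ASCII/Latin-1 input on a non-EBCDIC platform. The name `variant_count` is hypothetical, and the loop is a plain byte-at-a-time version rather than whatever optimized form Perl uses.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the bytes in [s, e) that are 0x80 or above. Each such Latin-1
 * byte needs two bytes in UTF-8, so the result is also the number of
 * extra bytes the UTF-8 form would occupy. */
static size_t variant_count(const unsigned char *s, const unsigned char *e)
{
    assert(s != NULL && e != NULL && s <= e);
    size_t count = 0;
    for (; s < e; s++)
        count += (*s >= 0x80);
    return count;
}
```

A caller could size a destination buffer as `(e - s) + variant_count(s, e)` before converting.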

Returns TRUE if the first len bytes of string s form a valid Perl-extended-UTF-8 string; returns FALSE otherwise. If len is 0, it will be calculated using strlen(s) (which means if you use this option, that s can't have embedded NUL characters and has to have a terminating NUL byte). Note that all characters being ASCII constitute 'a valid UTF-8 string'.

This function considers Perl's extended UTF-8 to be valid. That means that code points above Unicode, surrogates, and non-character code points are considered valid by this function. Use "is_strict_utf8_string", "is_c9strict_utf8_string", or "is_utf8_string_flags" to restrict what code points are considered valid.

See also "is_utf8_invariant_string", "is_utf8_invariant_string_loc", "is_utf8_string_loc", "is_utf8_string_loclen", "is_utf8_fixed_width_buf_flags", "is_utf8_fixed_width_buf_loc_flags", and "is_utf8_fixed_width_buf_loclen_flags".

Returns TRUE if "is_utf8_invariant_string" in perlapi returns FALSE for the first len bytes of the string s, but they are, nonetheless, legal Perl-extended UTF-8; otherwise returns FALSE.

A TRUE return means that at least one code point represented by the sequence either is a wide character not representable as a single byte, or the representation differs depending on whether the sequence is encoded in UTF-8 or not.

See also "is_utf8_invariant_string" in perlapi and "is_utf8_string" in perlapi.

Returns TRUE if the first len bytes of string s form a valid UTF-8-encoded string that is fully interchangeable by any application using Unicode rules; otherwise it returns FALSE. If len is 0, it will be calculated using strlen(s) (which means if you use this option, that s can't have embedded NUL characters and has to have a terminating NUL byte). Note that all characters being ASCII constitute 'a valid UTF-8 string'.

This function returns FALSE for strings containing any code points above the Unicode max of 0x10FFFF, surrogate code points, or non-character code points.

See also "is_utf8_invariant_string", "is_utf8_invariant_string_loc", "is_utf8_string", "is_utf8_string_flags", "is_utf8_string_loc", "is_utf8_string_loc_flags", "is_utf8_string_loclen", "is_utf8_string_loclen_flags", "is_utf8_fixed_width_buf_flags", "is_utf8_fixed_width_buf_loc_flags", "is_utf8_fixed_width_buf_loclen_flags", "is_strict_utf8_string_loc", "is_strict_utf8_string_loclen", "is_c9strict_utf8_string", "is_c9strict_utf8_string_loc", and "is_c9strict_utf8_string_loclen".

Returns TRUE if the first len bytes of string s form a valid UTF-8-encoded string that conforms to Unicode Corrigendum #9; otherwise it returns FALSE. If len is 0, it will be calculated using strlen(s) (which means if you use this option, that s can't have embedded NUL characters and has to have a terminating NUL byte). Note that all characters being ASCII constitute 'a valid UTF-8 string'.

This function returns FALSE for strings containing any code points above the Unicode max of 0x10FFFF or surrogate code points, but accepts non-character code points per Corrigendum #9.

See also "is_utf8_invariant_string", "is_utf8_invariant_string_loc", "is_utf8_string", "is_utf8_string_flags", "is_utf8_string_loc", "is_utf8_string_loc_flags", "is_utf8_string_loclen", "is_utf8_string_loclen_flags", "is_utf8_fixed_width_buf_flags", "is_utf8_fixed_width_buf_loc_flags", "is_utf8_fixed_width_buf_loclen_flags", "is_strict_utf8_string", "is_strict_utf8_string_loc", "is_strict_utf8_string_loclen", "is_c9strict_utf8_string_loc", and "is_c9strict_utf8_string_loclen".

Returns TRUE if the first len bytes of string s form a valid UTF-8 string, subject to the restrictions imposed by flags; returns FALSE otherwise. If len is 0, it will be calculated using strlen(s) (which means if you use this option, that s can't have embedded NUL characters and has to have a terminating NUL byte). Note that all characters being ASCII constitute 'a valid UTF-8 string'.

If flags is 0, this gives the same results as "is_utf8_string"; if flags is UTF8_DISALLOW_ILLEGAL_INTERCHANGE, this gives the same results as "is_strict_utf8_string"; and if flags is UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE, this gives the same results as "is_c9strict_utf8_string". Otherwise flags may be any combination of the UTF8_DISALLOW_foo flags understood by "utf8n_to_uvchr", with the same meanings.

See also "is_utf8_invariant_string", "is_utf8_invariant_string_loc", "is_utf8_string", "is_utf8_string_loc", "is_utf8_string_loc_flags", "is_utf8_string_loclen", "is_utf8_string_loclen_flags", "is_utf8_fixed_width_buf_flags", "is_utf8_fixed_width_buf_loc_flags", "is_utf8_fixed_width_buf_loclen_flags", "is_strict_utf8_string", "is_strict_utf8_string_loc", "is_strict_utf8_string_loclen", "is_c9strict_utf8_string", "is_c9strict_utf8_string_loc", and "is_c9strict_utf8_string_loclen".

Like "is_utf8_string" but stores the location of the failure (in the case of "utf8ness failure") or the location s+len (in the case of "utf8ness success") in the ep pointer.

See also "is_utf8_string_loclen".

Like "is_utf8_string" but stores the location of the failure (in the case of "utf8ness failure") or the location s+len (in the case of "utf8ness success") in the ep pointer, and the number of UTF-8 encoded characters in the el pointer.

See also "is_utf8_string_loc".

Evaluates to non-zero if the first few bytes of the string starting at s and looking no further than e - 1 are well-formed UTF-8, as extended by Perl, that represents some code point; otherwise it evaluates to 0. If non-zero, the value gives how many bytes starting at s comprise the code point's representation. Any bytes remaining before e, but beyond the ones needed to form the first code point in s, are not examined.

The code point can be any that will fit in an IV on this machine, using Perl's extension to official UTF-8 to represent those higher than the Unicode maximum of 0x10FFFF. This makes the macro an efficient way to decide whether the next few bytes in s are legal UTF-8 for a single character.

Use "isSTRICT_UTF8_CHAR" to restrict the acceptable code points to those defined by Unicode to be fully interchangeable across applications; "isC9_STRICT_UTF8_CHAR" to use the Unicode Corrigendum #9 definition of allowable code points; and "isUTF8_CHAR_flags" for a more customized definition.

Use "is_utf8_string", "is_utf8_string_loc", and "is_utf8_string_loclen" to check entire strings.

Note also that a UTF-8 "invariant" character (i.e. ASCII on non-EBCDIC machines) is a valid UTF-8 character.

Evaluates to non-zero if the first few bytes of the string starting at s and looking no further than e - 1 are well-formed UTF-8 that represents some Unicode code point completely acceptable for open interchange between all applications; otherwise it evaluates to 0. If non-zero, the value gives how many bytes starting at s comprise the code point's representation. Any bytes remaining before e, but beyond the ones needed to form the first code point in s, are not examined.

The largest acceptable code point is the Unicode maximum 0x10FFFF, and must not be a surrogate nor a non-character code point. Thus this excludes any code point from Perl's extended UTF-8.

This is used to efficiently decide whether the next few bytes in s are legal Unicode-acceptable UTF-8 for a single character.

Use "isC9_STRICT_UTF8_CHAR" to use the Unicode Corrigendum #9 definition of allowable code points; "isUTF8_CHAR" to check for Perl's extended UTF-8; and "isUTF8_CHAR_flags" for a more customized definition.

Use "is_strict_utf8_string", "is_strict_utf8_string_loc", and "is_strict_utf8_string_loclen" to check entire strings.

Evaluates to non-zero if the first few bytes of the string starting at s and looking no further than e - 1 are well-formed UTF-8 that represents some Unicode non-surrogate code point; otherwise it evaluates to 0. If non-zero, the value gives how many bytes starting at s comprise the code point's representation. Any bytes remaining before e, but beyond the ones needed to form the first code point in s, are not examined.

The largest acceptable code point is the Unicode maximum 0x10FFFF. This differs from "isSTRICT_UTF8_CHAR" only in that it accepts non-character code points. This corresponds to Unicode Corrigendum #9, which said that non-character code points are merely discouraged rather than completely forbidden in open interchange. See "Noncharacter code points" in perlunicode.

Use "isUTF8_CHAR" to check for Perl's extended UTF-8; and "isUTF8_CHAR_flags" for a more customized definition.

Use "is_c9strict_utf8_string", "is_c9strict_utf8_string_loc", and "is_c9strict_utf8_string_loclen" to check entire strings.

Like "is_strict_utf8_string" but stores the location of the failure (in the case of "utf8ness failure") or the location s+len (in the case of "utf8ness success") in the ep pointer.

See also "is_strict_utf8_string_loclen".

Like "is_strict_utf8_string" but stores the location of the failure (in the case of "utf8ness failure") or the location s+len (in the case of "utf8ness success") in the ep pointer, and the number of UTF-8 encoded characters in the el pointer.

See also "is_strict_utf8_string_loc".

Like "is_c9strict_utf8_string" but stores the location of the failure (in the case of "utf8ness failure") or the location s+len (in the case of "utf8ness success") in the ep pointer.

See also "is_c9strict_utf8_string_loclen".

Like "is_c9strict_utf8_string" but stores the location of the failure (in the case of "utf8ness failure") or the location s+len (in the case of "utf8ness success") in the ep pointer, and the number of UTF-8 encoded characters in the el pointer.

See also "is_c9strict_utf8_string_loc".
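To make the ep and el outputs of the `_loclen` variants concrete, here is a standalone sketch of a Corrigendum #9 style check: code points up to 0x10FFFF, no surrogates, non-characters allowed. The names `c9_char_len` and `c9strict_string_loclen` are hypothetical, the length is always explicit (no strlen fallback), and the code is an illustration rather than Perl's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the byte length (1..4) of the well-formed character at s,
 * looking no further than e - 1, or 0 if there is none. */
static size_t c9_char_len(const uint8_t *s, const uint8_t *e)
{
    if (s >= e) return 0;
    uint8_t b0 = s[0];
    size_t need;
    uint32_t cp, min;
    if (b0 < 0x80) return 1;
    else if (b0 >= 0xC2 && b0 <= 0xDF) { need = 2; cp = b0 & 0x1F; min = 0x80; }
    else if ((b0 & 0xF0) == 0xE0)      { need = 3; cp = b0 & 0x0F; min = 0x800; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { need = 4; cp = b0 & 0x07; min = 0x10000; }
    else return 0;              /* stray continuation, C0/C1, or F5..FF */
    if ((size_t)(e - s) < need) return 0;       /* truncated */
    for (size_t i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;    /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (cp < min) return 0;                     /* overlong encoding */
    if (cp > 0x10FFFF) return 0;                /* above the Unicode maximum */
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate */
    return need;
}

/* Validates len bytes at s. *ep receives the failure location, or s + len
 * on success; *el receives the number of characters accepted so far. */
static bool c9strict_string_loclen(const uint8_t *s, size_t len,
                                   const uint8_t **ep, size_t *el)
{
    assert(s != NULL);
    const uint8_t *p = s, *e = s + len;
    size_t chars = 0;
    while (p < e) {
        size_t n = c9_char_len(p, e);
        if (n == 0) break;
        p += n;
        chars++;
    }
    if (ep) *ep = p;
    if (el) *el = chars;
    return p == e;
}
```

Reporting both outputs from a single pass lets a caller learn where the input went wrong and how many characters preceded the failure without rescanning.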

Like "is_utf8_string_flags" but stores the location of the failure (in the case of "utf8ness failure") or the location s+len (in the case of "utf8ness success") in the ep pointer.

See also "is_utf8_string_loclen_flags".

Like "is_utf8_string_flags" but stores the location of the failure (in the case of "utf8ness failure") or the location s+len (in the case of "utf8ness success") in the ep pointer, and the number of UTF-8 encoded characters in the el pointer.

See also "is_utf8_string_loc_flags".

Returns the number of UTF-8 characters between the UTF-8 pointers a and b.

WARNING: use only if you *know* that the pointers point inside the same UTF-8 buffer.
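One way to picture the character count is to count the bytes that begin a character, that is, every byte that is not a continuation byte (10xxxxxx). The sketch below does this for valid UTF-8; the name `utf8_distance_sketch` is hypothetical, and returning a negative value when a precedes b is a choice made for this sketch, not something the text above specifies.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the characters that start in [s, e), assuming valid UTF-8. */
static ptrdiff_t count_chars(const unsigned char *s, const unsigned char *e)
{
    ptrdiff_t n = 0;
    for (; s < e; s++)
        if ((*s & 0xC0) != 0x80)   /* not a continuation byte */
            n++;
    return n;
}

/* Characters between a and b; negative if a comes before b (an
 * assumption of this sketch). Both must point into the same buffer. */
static ptrdiff_t utf8_distance_sketch(const unsigned char *a,
                                      const unsigned char *b)
{
    assert(a != NULL && b != NULL);
    return (a < b) ? -count_chars(a, b) : count_chars(b, a);
}
```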

Return the UTF-8 pointer s displaced by off characters, either forward or backward.

WARNING: do not use the following unless you *know* off is within the UTF-8 data pointed to by s *and* that on entry s is aligned on the first byte of a character or just after the last byte of a character.

Return the UTF-8 pointer s displaced by up to off characters, forward.

off must be non-negative.

s must be before or equal to end.

When moving forward it will not move beyond end.

Will not exceed this limit even if the string is not valid "UTF-8".
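The bounded forward hop can be sketched as follows. This version steps over one start byte and then any continuation bytes, checking the end pointer at every step, so it cannot run past end even on malformed input. The name `hop_forward` is hypothetical, and the technique may differ from Perl's own (which could, for example, rely on a length table).

```c
#include <assert.h>
#include <stddef.h>

/* Moves s forward by up to off characters without passing end. */
static const unsigned char *hop_forward(const unsigned char *s,
                                        ptrdiff_t off,
                                        const unsigned char *end)
{
    assert(off >= 0 && s <= end);
    while (off-- > 0 && s < end) {
        s++;                                    /* step over the start byte */
        while (s < end && (*s & 0xC0) == 0x80)  /* then its continuations */
            s++;
    }
    return s;
}
```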

Return the UTF-8 pointer s displaced by up to off characters, backward.

off must be non-positive.

s must be after or equal to start.

When moving backward it will not move before start.

Will not exceed this limit even if the string is not valid "UTF-8".
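The backward case mirrors the forward one: step back one byte, then keep stepping back while the byte under the pointer is a continuation byte, never going before start. This is a sketch under the same assumptions as any simple byte-scanning approach, with the hypothetical name `hop_back`.

```c
#include <assert.h>
#include <stddef.h>

/* Moves s backward by up to -off characters without passing start. */
static const unsigned char *hop_back(const unsigned char *s,
                                     ptrdiff_t off,
                                     const unsigned char *start)
{
    assert(off <= 0 && s >= start);
    while (off++ < 0 && s > start) {
        s--;                                      /* last byte of previous char */
        while (s > start && (*s & 0xC0) == 0x80)  /* back up to its start byte */
            s--;
    }
    return s;
}
```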

Return the UTF-8 pointer s displaced by up to off characters, either forward or backward.

When moving backward it will not move before start.

When moving forward it will not move beyond end.

Will not exceed those limits even if the string is not valid "UTF-8".

Returns 0 if the sequence of bytes starting at s and looking no further than e - 1 is the UTF-8 encoding, as extended by Perl, for one or more code points. Otherwise, it returns 1 if there exists at least one non-empty sequence of bytes that, when appended to the sequence at s starting at position e, causes the entire sequence to be the well-formed UTF-8 of some code point; otherwise it returns 0.

In other words this returns TRUE if s points to a partial UTF-8-encoded code point.

This is useful when a fixed-length buffer is being tested for being well-formed UTF-8, but the final few bytes in it don't comprise a full character; that is, it is split somewhere in the middle of the final code point's UTF-8 representation. (Presumably when the buffer is refreshed with the next chunk of data, the new first bytes will complete the partial code point.) This function is used to verify that the final bytes in the current buffer are in fact the legal beginning of some code point, so that if they aren't, the failure can be signalled without having to wait for the next read.
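The buffer-tail check can be illustrated with a sketch that follows the standard Unicode rules only. Unlike the function described above, it rejects surrogates and code points above 0x10FFFF instead of accepting Perl's extended UTF-8, so it behaves more like a restricted variant. The name `is_partial_char` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if the bytes in [s, e) are a proper, non-empty prefix of
 * some well-formed standard UTF-8 character (not a complete character). */
static bool is_partial_char(const uint8_t *s, const uint8_t *e)
{
    assert(s != NULL && e != NULL && s <= e);
    size_t have = (size_t)(e - s), need;
    if (have == 0) return false;
    uint8_t b0 = s[0];
    if (b0 >= 0xC2 && b0 <= 0xDF) need = 2;
    else if ((b0 & 0xF0) == 0xE0) need = 3;
    else if (b0 >= 0xF0 && b0 <= 0xF4) need = 4;
    else return false;  /* ASCII is already complete; others start nothing */
    if (have >= need) return false;         /* full character, not partial */
    for (size_t i = 1; i < have; i++)
        if ((s[i] & 0xC0) != 0x80) return false;
    if (have >= 2) {
        /* The second byte already decides several exclusions. */
        uint8_t b1 = s[1];
        if (b0 == 0xE0 && b1 < 0xA0) return false;  /* overlong */
        if (b0 == 0xED && b1 > 0x9F) return false;  /* surrogate */
        if (b0 == 0xF0 && b1 < 0x90) return false;  /* overlong */
        if (b0 == 0xF4 && b1 > 0x8F) return false;  /* above 0x10FFFF */
    }
    return true;
}
```

A reader loop would call this on the unconsumed tail of each chunk and, on a true result, carry those bytes over to the front of the next chunk.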

Like "is_utf8_valid_partial_char", it returns a boolean giving whether or not the input is a valid UTF-8 encoded partial character, but it takes an extra parameter, flags, which can further restrict which code points are considered valid.

If flags is 0, this behaves identically to "is_utf8_valid_partial_char". Otherwise flags can be any combination of the UTF8_DISALLOW_foo flags accepted by "utf8n_to_uvchr". If there is any sequence of bytes that can complete the input partial character in such a way that a non-prohibited character is formed, the function returns TRUE; otherwise FALSE. Non-character code points cannot be determined based on partial character input, but many of the other possible excluded types can be determined from just the first one or two bytes.

Returns TRUE if the fixed-width buffer starting at s with length len is entirely valid UTF-8, subject to the restrictions given by flags; otherwise it returns FALSE.

If flags is 0, any well-formed UTF-8, as extended by Perl, is accepted without restriction. If the final few bytes of the buffer do not form a complete code point, this will return TRUE anyway, provided that "is_utf8_valid_partial_char_flags" returns TRUE for them.

If flags is non-zero, it can be any combination of the UTF8_DISALLOW_foo flags accepted by "utf8n_to_uvchr", and with the same meanings.

This function differs from "is_utf8_string_flags" only in that the latter returns FALSE if the final few bytes of the string don't form a complete code point.

Like "is_utf8_fixed_width_buf_flags" but stores the location of the failure in the ep pointer. If the function returns TRUE, *ep will point to the beginning of any partial character at the end of the buffer; if there is no partial character *ep will contain s+len.

See also "is_utf8_fixed_width_buf_loclen_flags".

Like "is_utf8_fixed_width_buf_loc_flags" but stores the number of complete, valid characters found in the el pointer.

Miscellaneous Functions

Test that the given pv (with length len) doesn't contain any internal NUL characters. If it does, set errno to ENOENT, optionally warn using the syscalls category, and return FALSE.

Return TRUE if the name is safe.

what and op_name are used in any warning.

Used by the IS_SAFE_SYSCALL() macro.
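The core of the check can be sketched with memchr. The name `name_is_safe` is hypothetical, the warning step is omitted, and this sketch reads "internal" as "anywhere before the final byte", so a trailing NUL is tolerated. That reading is an assumption, not something the text above states.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns false and sets errno to ENOENT if pv contains a NUL before
 * its final byte; returns true otherwise. */
static bool name_is_safe(const char *pv, size_t len)
{
    assert(pv != NULL);
    if (len > 1 && memchr(pv, '\0', len - 1) != NULL) {
        errno = ENOENT;
        return false;
    }
    return true;
}
```

Rejecting such names matters because a C-level system call would silently stop at the first NUL and operate on a different path than the caller intended.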

Returns true if the leading len bytes of the strings s1 and s2 are the same case-insensitively; false otherwise. Uppercase and lowercase ASCII range bytes match themselves and their opposite case counterparts. Non-cased and non-ASCII range bytes match only themselves.
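The ASCII-only folding rule described above can be sketched by mapping A-Z to a-z and leaving every other byte alone. The names `ascii_fold` and `fold_eq_ascii` are hypothetical, and the arithmetic assumes an ASCII character set; a production version might use a lookup table instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Folds uppercase ASCII letters to lowercase; other bytes are unchanged. */
static unsigned char ascii_fold(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c + ('a' - 'A')) : c;
}

/* Compares the leading len bytes of s1 and s2, ignoring ASCII case. */
static bool fold_eq_ascii(const char *s1, const char *s2, size_t len)
{
    assert(s1 != NULL && s2 != NULL);
    for (size_t i = 0; i < len; i++)
        if (ascii_fold((unsigned char)s1[i]) != ascii_fold((unsigned char)s2[i]))
            return false;
    return true;
}
```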

Returns true if the leading len bytes of the strings s1 and s2 are the same case-insensitively in the current locale; false otherwise.

The C library strnlen if available, or a Perl implementation of it.

my_strnlen() computes the length of the string, up to maxlen characters. It will never attempt to address more than maxlen characters, making it suitable for use with strings that are not guaranteed to be NUL-terminated.
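A bounded length function of this kind can be sketched with memchr, which never examines more than maxlen bytes. The name `strnlen_sketch` is hypothetical, and this is an illustration of the behavior described above rather than Perl's fallback code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the offset of the first NUL within the first maxlen bytes of
 * str, or maxlen if none is found. */
static size_t strnlen_sketch(const char *str, size_t maxlen)
{
    assert(str != NULL);
    const char *nul = memchr(str, '\0', maxlen);
    return nul ? (size_t)(nul - str) : maxlen;
}
```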