Unicode Support

These are various utility functions for manipulating UTF8-encoded strings. For the uninitiated, this is a method of representing arbitrary Unicode characters as a variable number of bytes, in such a way that characters in the ASCII range are unmodified, and a zero byte never appears within non-zero characters.

THIS FUNCTION SHOULD BE USED IN ONLY VERY SPECIALIZED CIRCUMSTANCES. Instead, almost all code should use "uvchr_to_utf8" or "uvchr_to_utf8_flags".

This function is like them, but the input is a strict Unicode (as opposed to native) code point. Only in very rare circumstances should code not be using the native code point.

For details, see the description for "uvchr_to_utf8_flags".

THIS FUNCTION SHOULD BE USED IN ONLY VERY SPECIALIZED CIRCUMSTANCES.

Most code should use "uvchr_to_utf8_flags"() rather than call this directly.

This function is for code that wants any warning and/or error messages to be returned to the caller rather than be displayed. All messages that would have been displayed if all lexical warnings are enabled will be returned.

It is just like "uvchr_to_utf8_flags" but it takes an extra parameter placed after all the others, msgs. If this parameter is 0, this function behaves identically to "uvchr_to_utf8_flags". Otherwise, msgs should be a pointer to an HV * variable, in which this function creates a new HV to contain any appropriate messages. The hash has three key-value pairs, as follows:

text: The text of the message as a SVpv.
warn_categories: The warning category (or categories) packed into a SVuv.
flag: A single flag bit associated with this message, in a SVuv. The bit corresponds to some bit in the *errors return value, such as UNICODE_GOT_SURROGATE.

It's important to note that specifying this parameter as non-null will cause any warnings this function would otherwise generate to be suppressed, and instead be placed in *msgs. The caller can check the lexical warnings state (or not) when choosing what to do with the returned messages.

The caller, of course, is responsible for freeing any returned HV.

Adds the UTF-8 representation of the native code point uv to the end of the string d; d should have at least UVCHR_SKIP(uv)+1 (up to UTF8_MAXBYTES+1) free bytes available. The return value is the pointer to the byte after the end of the new character. In other words,

    d = uvchr_to_utf8(d, uv);

is the recommended wide native character-aware way of saying

    *(d++) = uv;

This function accepts any code point from 0..IV_MAX as input. IV_MAX is typically 0x7FFF_FFFF in a 32-bit word.

It is possible to forbid or warn on non-Unicode code points, or on those that may be problematic, by using "uvchr_to_utf8_flags".
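As a rough illustration of the calling convention described above (write the bytes at d, return the pointer just past them), here is a minimal standalone sketch. It is not Perl's implementation: the name sketch_encode_utf8 is hypothetical, it assumes an ASCII platform, and it handles only code points up to 0x10FFFF, whereas the real function accepts anything up to IV_MAX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for illustration only; not Perl's code.
 * Encodes cp (0..0x10FFFF) as standard UTF-8 at d and returns the
 * pointer just past the last byte written. The caller must supply
 * at least 4 free bytes. Assumes an ASCII (not EBCDIC) platform. */
unsigned char *sketch_encode_utf8(unsigned char *d, uint32_t cp)
{
    assert(d != NULL);
    assert(cp <= 0x10FFFF);

    if (cp < 0x80) {                 /* ASCII range: unmodified */
        *d++ = (unsigned char)cp;
    } else if (cp < 0x800) {         /* two bytes */
        *d++ = (unsigned char)(0xC0 | (cp >> 6));
        *d++ = (unsigned char)(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {       /* three bytes */
        *d++ = (unsigned char)(0xE0 | (cp >> 12));
        *d++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        *d++ = (unsigned char)(0x80 | (cp & 0x3F));
    } else {                         /* four bytes */
        *d++ = (unsigned char)(0xF0 | (cp >> 18));
        *d++ = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        *d++ = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        *d++ = (unsigned char)(0x80 | (cp & 0x3F));
    }
    return d;
}
```

Calling it repeatedly as d = sketch_encode_utf8(d, cp) appends characters one after another, which is the same pattern the documented function is meant to support.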

    d = uvchr_to_utf8_flags(d, uv, flags);

or, in most cases,

    d = uvchr_to_utf8_flags(d, uv, 0);

This is the Unicode-aware way of saying

    *(d++) = uv;

If flags is 0, this function accepts any code point from 0..IV_MAX as input. IV_MAX is typically 0x7FFF_FFFF in a 32-bit word.

Specifying flags can further restrict what is allowed and not warned on, as follows:

If uv is a Unicode surrogate code point and UNICODE_WARN_SURROGATE is set, the function will raise a warning, provided UTF8 warnings are enabled. If instead UNICODE_DISALLOW_SURROGATE is set, the function will fail and return NULL. If both flags are set, the function will both warn and return NULL.

Similarly, the UNICODE_WARN_NONCHAR and UNICODE_DISALLOW_NONCHAR flags affect how the function handles a Unicode non-character.

And likewise, the UNICODE_WARN_SUPER and UNICODE_DISALLOW_SUPER flags affect the handling of code points that are above the Unicode maximum of 0x10FFFF. Languages other than Perl may not be able to accept files that contain these.

The flag UNICODE_WARN_ILLEGAL_INTERCHANGE selects all three of the above WARN flags; and UNICODE_DISALLOW_ILLEGAL_INTERCHANGE selects all three DISALLOW flags. UNICODE_DISALLOW_ILLEGAL_INTERCHANGE restricts the allowed inputs to the strict UTF-8 traditionally defined by Unicode. Similarly, UNICODE_WARN_ILLEGAL_C9_INTERCHANGE and UNICODE_DISALLOW_ILLEGAL_C9_INTERCHANGE are shortcuts to select the above-Unicode and surrogate flags, but not the non-character ones, as defined in Unicode Corrigendum #9. See "Noncharacter code points" in perlunicode.
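The three problematic classes and the two shortcut groupings can be sketched in standalone C as follows. The SK_ constants and sk_ functions are hypothetical names invented for this example; their values are not those of the real UNICODE_DISALLOW_* flags, and only the DISALLOW half of the behavior is modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch; NOT the values in utf8.h. */
enum {
    SK_DISALLOW_SURROGATE = 1 << 0,
    SK_DISALLOW_NONCHAR   = 1 << 1,
    SK_DISALLOW_SUPER     = 1 << 2,
    /* All three classes: strict UTF-8 as traditionally defined. */
    SK_DISALLOW_ILLEGAL_INTERCHANGE = (1 << 0) | (1 << 1) | (1 << 2),
    /* Corrigendum #9 strictness: non-characters are not forbidden. */
    SK_DISALLOW_ILLEGAL_C9_INTERCHANGE = (1 << 0) | (1 << 2)
};

bool sk_is_surrogate(uint64_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

bool sk_is_nonchar(uint64_t cp)
{
    if (cp > 0x10FFFF)
        return false;
    /* U+FDD0..U+FDEF, plus the last two code points of every plane. */
    return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
}

bool sk_is_super(uint64_t cp)
{
    return cp > 0x10FFFF;   /* above the Unicode maximum */
}

/* Returns false when cp falls in a class that flags disallows. */
bool sk_is_allowed(uint64_t cp, unsigned flags)
{
    assert((flags & ~(unsigned)SK_DISALLOW_ILLEGAL_INTERCHANGE) == 0);
    if ((flags & SK_DISALLOW_SURROGATE) && sk_is_surrogate(cp))
        return false;
    if ((flags & SK_DISALLOW_NONCHAR) && sk_is_nonchar(cp))
        return false;
    if ((flags & SK_DISALLOW_SUPER) && sk_is_super(cp))
        return false;
    return true;
}
```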

Extremely high code points were never specified in any standard, and expressing them requires an extension to UTF-8, which Perl provides. It is likely that programs written in something other than Perl would not be able to read files that contain these; nor would Perl understand files written by something that uses a different extension. For these reasons, there is a separate set of flags that can warn and/or disallow these extremely high code points, even if other above-Unicode ones are accepted. They are the UNICODE_WARN_PERL_EXTENDED and UNICODE_DISALLOW_PERL_EXTENDED flags. For more information see "UTF8_GOT_PERL_EXTENDED". Of course UNICODE_DISALLOW_SUPER will treat all above-Unicode code points, including these, as malformations. (Note that the Unicode standard considers anything above 0x10FFFF to be illegal, but there are standards predating it that allow up to 0x7FFF_FFFF (2**31 - 1).)

A somewhat misleadingly named synonym for UNICODE_WARN_PERL_EXTENDED is retained for backward compatibility: UNICODE_WARN_ABOVE_31_BIT. Similarly, UNICODE_DISALLOW_ABOVE_31_BIT is usable instead of the more accurately named UNICODE_DISALLOW_PERL_EXTENDED. The names are misleading because on EBCDIC platforms, these flags can apply to code points that actually do fit in 31 bits. The new names accurately describe the situation in all cases.

THIS FUNCTION SHOULD BE USED IN ONLY VERY SPECIALIZED CIRCUMSTANCES. Most code should use "utf8_to_uvchr_buf"() rather than call this directly.

Bottom level UTF-8 decode routine. Returns the native code point value of the first character in the string s, which is assumed to be in UTF-8 (or UTF-EBCDIC) encoding, and no longer than curlen bytes; *retlen (if retlen isn't NULL) will be set to the length, in bytes, of that character.

The value of flags determines the behavior when s does not point to a well-formed UTF-8 character. If flags is 0, encountering a malformation causes zero to be returned and *retlen is set so that (s + *retlen) is the next possible position in s that could begin a non-malformed character. Also, if UTF-8 warnings haven't been lexically disabled, a warning is raised. Some UTF-8 input sequences may contain multiple malformations. This function tries to find every possible one in each call, so multiple warnings can be raised for the same sequence.

Various ALLOW flags can be set in flags to allow (and not warn on) individual types of malformations, such as the sequence being overlong (that is, when there is a shorter sequence that can express the same code point; overlong sequences are expressly forbidden in the UTF-8 standard due to potential security issues). Another malformation example is the first byte of a character not being a legal first byte. See utf8.h for the list of such flags. Even if allowed, this function generally returns the Unicode REPLACEMENT CHARACTER when it encounters a malformation. There are flags in utf8.h to override this behavior for the overlong malformations, but don't do that except for very specialized purposes.
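To make the overlong malformation concrete: the two bytes 0xC0 0x80 decode to code point 0, which already has a one-byte encoding, so the sequence is overlong. A minimal standalone sketch of that check follows. The function names are hypothetical, it assumes an ASCII platform and standard UTF-8 (code points up to 0x10FFFF), and it is not how Perl itself detects the condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: the shortest standard UTF-8 length for cp. */
size_t sk_min_utf8_len(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x80)    return 1;
    if (cp < 0x800)   return 2;
    if (cp < 0x10000) return 3;
    return 4;
}

/* A sequence of seq_len bytes that decoded to cp is overlong when a
 * shorter sequence could have expressed the same code point. */
bool sk_is_overlong(size_t seq_len, uint32_t cp)
{
    return seq_len > sk_min_utf8_len(cp);
}
```

For instance, sk_is_overlong(3, 0x2F) is true: a three-byte spelling of "/" is the kind of input that caused the security issues mentioned above.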

The UTF8_CHECK_ONLY flag overrides the behavior when a non-allowed (by other flags) malformation is found. If this flag is set, the routine assumes that the caller will raise a warning, and this function will silently just set *retlen to -1 (cast to STRLEN) and return zero.

Note that this API requires disambiguation between successfully decoding a NUL character and an error return (unless the UTF8_CHECK_ONLY flag is set), as in both cases, 0 is returned, and, depending on the malformation, *retlen may be set to 1. To disambiguate, upon a zero return, see if the first byte of s is 0 as well. If so, the input was a NUL; if not, the input had an error. Or you can use "utf8n_to_uvchr_error".
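The disambiguation rule above can be written down as a small standalone helper. This is only a sketch: sk_classify_return and the SK_ names are hypothetical, ret stands for whatever the decoder returned for the character at s, and the curlen guard is an extra precaution added here so that s is never read when the caller passed a zero length.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SK_WAS_NUL, SK_WAS_ERROR, SK_WAS_OTHER } sk_zero_kind;

/* Hypothetical helper: interpret a decoder return value of ret for
 * the input starting at s with curlen bytes available. */
sk_zero_kind sk_classify_return(const unsigned char *s, size_t curlen,
                                unsigned long ret)
{
    assert(s != NULL);
    if (ret != 0)
        return SK_WAS_OTHER;        /* an ordinary decoded character */
    if (curlen > 0 && s[0] == 0)
        return SK_WAS_NUL;          /* first byte is 0: a real NUL */
    return SK_WAS_ERROR;            /* zero return, non-zero input */
}
```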

Certain code points are considered problematic. These are Unicode surrogates, Unicode non-characters, and code points above the Unicode maximum of 0x10FFFF. By default these are considered regular code points, but certain situations warrant special handling for them, which can be specified using the flags parameter. If flags contains UTF8_DISALLOW_ILLEGAL_INTERCHANGE, all three classes are treated as malformations and handled as such. The flags UTF8_DISALLOW_SURROGATE, UTF8_DISALLOW_NONCHAR, and UTF8_DISALLOW_SUPER (meaning above the legal Unicode maximum) can be set to disallow these categories individually. UTF8_DISALLOW_ILLEGAL_INTERCHANGE restricts the allowed inputs to the strict UTF-8 traditionally defined by Unicode. Use UTF8_DISALLOW_ILLEGAL_C9_INTERCHANGE to use the strictness definition given by Unicode Corrigendum #9. The difference between traditional strictness and C9 strictness is that the latter does not forbid non-character code points. (They are still discouraged, however.) For more discussion see "Noncharacter code points" in perlunicode.

The flags UTF8_WARN_ILLEGAL_INTERCHANGE, UTF8_WARN_ILLEGAL_C9_INTERCHANGE, UTF8_WARN_SURROGATE, UTF8_WARN_NONCHAR, and UTF8_WARN_SUPER will cause warning messages to be raised for their respective categories, but otherwise the code points are considered valid (not malformations). To get a category to both be treated as a malformation and raise a warning, specify both the WARN and DISALLOW flags. (But note that warnings are not raised if lexically disabled nor if UTF8_CHECK_ONLY is also specified.)

Extremely high code points were never specified in any standard, and expressing them requires an extension to UTF-8, which Perl provides. It is likely that programs written in something other than Perl would not be able to read files that contain these; nor would Perl understand files written by something that uses a different extension. For these reasons, there is a separate set of flags that can warn and/or disallow these extremely high code points, even if other above-Unicode ones are accepted. They are the UTF8_WARN_PERL_EXTENDED and UTF8_DISALLOW_PERL_EXTENDED flags. For more information see "UTF8_GOT_PERL_EXTENDED". Of course UTF8_DISALLOW_SUPER will treat all above-Unicode code points, including these, as malformations. (Note that the Unicode standard considers anything above 0x10FFFF to be illegal, but there are standards predating it that allow up to 0x7FFF_FFFF (2**31 - 1).)

A somewhat misleadingly named synonym for UTF8_WARN_PERL_EXTENDED is retained for backward compatibility: UTF8_WARN_ABOVE_31_BIT. Similarly, UTF8_DISALLOW_ABOVE_31_BIT is usable instead of the more accurately named UTF8_DISALLOW_PERL_EXTENDED. The names are misleading because these flags can apply to code points that actually do fit in 31 bits. This happens on EBCDIC platforms, and sometimes when the overlong malformation is also present. The new names accurately describe the situation in all cases.

All other code points corresponding to Unicode characters, including private use and those yet to be assigned, are never considered malformed and never warn.

THIS FUNCTION SHOULD BE USED IN ONLY VERY SPECIALIZED CIRCUMSTANCES. Most code should use "utf8_to_uvchr_buf"() rather than call this directly.

This function is for code that needs to know what the precise malformation(s) are when an error is found. If you also need to know the generated warning messages, use "utf8n_to_uvchr_msgs"() instead.

It is like "utf8n_to_uvchr" but it takes an extra parameter placed after all the others, errors. If this parameter is 0, this function behaves identically to "utf8n_to_uvchr". Otherwise, errors should be a pointer to a U32 variable, which this function sets to indicate any errors found. Upon return, if *errors is 0, there were no errors found. Otherwise, *errors is the bit-wise OR of the bits described in the list below. Some of these bits will be set if a malformation is found, even if the input flags parameter indicates that the given malformation is allowed; those exceptions are noted:

UTF8_GOT_PERL_EXTENDED

The input sequence is not standard UTF-8, but a Perl extension. This bit is set only if the input flags parameter contains either the UTF8_DISALLOW_PERL_EXTENDED or the UTF8_WARN_PERL_EXTENDED flags.

Code points above 0x7FFF_FFFF (2**31 - 1) were never specified in any standard, and so some extension must be used to express them. Perl uses a natural extension to UTF-8 to represent the ones up to 2**36-1, and invented a further extension to represent even higher ones, so that any code point that fits in a 64-bit word can be represented. Text using these extensions is not likely to be portable to non-Perl code. We lump both of these extensions together and refer to them as Perl extended UTF-8. There exist other extensions that people have invented, incompatible with Perl's.

On EBCDIC platforms starting in Perl v5.24, the Perl extension for representing extremely high code points kicks in at 0x3FFF_FFFF (2**30 - 1), which is lower than on ASCII. Prior to that, code points 2**31 and higher were simply unrepresentable, and a different, incompatible method was used to represent code points between 2**30 and 2**31 - 1.

On both platforms, ASCII and EBCDIC, UTF8_GOT_PERL_EXTENDED is set if Perl extended UTF-8 is used.

In earlier Perls, this bit was named UTF8_GOT_ABOVE_31_BIT, which you still may use for backward compatibility. That name is misleading, as this flag may be set when the code point actually does fit in 31 bits. This happens on EBCDIC platforms, and sometimes when the overlong malformation is also present. The new name accurately describes the situation in all cases.

UTF8_GOT_CONTINUATION

The input sequence was malformed in that the first byte was a UTF-8 continuation byte.

UTF8_GOT_EMPTY

The input curlen parameter was 0.

UTF8_GOT_LONG

The input sequence was malformed in that there is some other sequence that evaluates to the same code point, but that sequence is shorter than this one.

Until Unicode 3.1, it was legal for programs to accept this malformation, but it was discovered that this created security issues.

UTF8_GOT_NONCHAR

The code point represented by the input UTF-8 sequence is for a Unicode non-character code point. This bit is set only if the input flags parameter contains either the UTF8_DISALLOW_NONCHAR or the UTF8_WARN_NONCHAR flags.

UTF8_GOT_NON_CONTINUATION

The input sequence was malformed in that a non-continuation type byte was found in a position where only a continuation type one should be. See also "UTF8_GOT_SHORT".

UTF8_GOT_OVERFLOW

The input sequence was malformed in that it is for a code point that is not representable in the number of bits available in an IV on the current platform.

UTF8_GOT_SHORT

The input sequence was malformed in that curlen is smaller than required for a complete sequence. In other words, the input is for a partial character sequence.

UTF8_GOT_SHORT and UTF8_GOT_NON_CONTINUATION both indicate a too short sequence. The difference is that UTF8_GOT_NON_CONTINUATION always indicates that there is an error, while UTF8_GOT_SHORT means that an incomplete sequence was looked at. If no other flags are present, it means that the sequence was valid as far as it went. Depending on the application, this could mean one of three things:

The curlen length parameter passed in was too small, and the function was prevented from examining all the necessary bytes.
The buffer being looked at is based on reading data, and the data received so far stopped in the middle of a character, so that the next read will read the remainder of this character. (It is up to the caller to deal with the split bytes somehow.)
This is a real error, and the partial sequence is all we're going to get.

UTF8_GOT_SUPER

The input sequence was malformed in that it is for a non-Unicode code point; that is, one above the legal Unicode maximum. This bit is set only if the input flags parameter contains either the UTF8_DISALLOW_SUPER or the UTF8_WARN_SUPER flags.

UTF8_GOT_SURROGATE

The input sequence was malformed in that it is for a Unicode UTF-16 surrogate code point. This bit is set only if the input flags parameter contains either the UTF8_DISALLOW_SURROGATE or the UTF8_WARN_SURROGATE flags.

To do your own error handling, call this function with the UTF8_CHECK_ONLY flag to suppress any warnings, and then examine the *errors return.
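For the second UTF8_GOT_SHORT scenario listed above (a read that stopped in the middle of a character), one common approach is to hold back the trailing partial bytes and prepend them to the next read. The standalone sketch below finds how many bytes at the end of a buffer belong to a character that has started but not finished. The function names are hypothetical, standard UTF-8 on an ASCII platform is assumed, and no Perl API is used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: total sequence length implied by a start byte,
 * or 0 if the byte cannot start a standard UTF-8 character. */
size_t sk_expected_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 0;
}

/* Returns how many bytes at the end of buf form the beginning of a
 * character whose remaining bytes have not arrived yet; 0 if the
 * buffer ends on a character boundary (or the tail is not a prefix
 * of any character, which is a real error rather than a short read). */
size_t sk_incomplete_tail(const unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL || len == 0);
    for (i = 1; i <= len && i <= 4; i++) {
        unsigned char b = buf[len - i];
        if ((b & 0xC0) == 0x80)
            continue;                       /* continuation byte */
        /* b is the start byte of the last character in the buffer. */
        return sk_expected_len(b) > i ? i : 0;
    }
    return 0;
}
```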

THIS FUNCTION SHOULD BE USED IN ONLY VERY SPECIALIZED CIRCUMSTANCES. Most code should use "utf8_to_uvchr_buf"() rather than call this directly.

This function is for code that needs to know what the precise malformation(s) are when an error is found, and wants the corresponding warning and/or error messages to be returned to the caller rather than be displayed. All messages that would have been displayed if all lexical warnings are enabled will be returned.

It is just like "utf8n_to_uvchr_error" but it takes an extra parameter placed after all the others, msgs. If this parameter is 0, this function behaves identically to "utf8n_to_uvchr_error". Otherwise, msgs should be a pointer to an AV * variable, in which this function creates a new AV to contain any appropriate messages. The elements of the array are ordered so that the first message that would have been displayed is in the 0th element, and so on. Each element is a hash with three key-value pairs, as follows:

text: The text of the message as a SVpv.
warn_categories: The warning category (or categories) packed into a SVuv.
flag: A single flag bit associated with this message, in a SVuv. The bit corresponds to some bit in the *errors return value, such as UTF8_GOT_LONG.

If the flag UTF8_CHECK_ONLY is passed, no warnings are generated, and hence no AV is created.

The caller, of course, is responsible for freeing any returned AV.

Returns the native code point of the first character in the string s which is assumed to be in UTF-8 encoding; send points to 1 beyond the end of s. *retlen will be set to the length, in bytes, of that character.

If s does not point to a well-formed UTF-8 character and UTF8 warnings are enabled, zero is returned and *retlen is set (if retlen isn't NULL) to -1. If those warnings are off, the computed value, if well-defined (or the Unicode REPLACEMENT CHARACTER if not), is silently returned, and *retlen is set (if retlen isn't NULL) so that (s + *retlen) is the next possible position in s that could begin a non-malformed character. See "utf8n_to_uvchr" for details on when the REPLACEMENT CHARACTER is returned.

Only in very rare circumstances should code need to be dealing in Unicode (as opposed to native) code points. In those few cases, use NATIVE_TO_UNI(utf8_to_uvchr_buf(...)) instead. If you are not absolutely sure this is one of those cases, then assume it isn't and use plain utf8_to_uvchr_buf instead.

Returns the Unicode (not-native) code point of the first character in the string s which is assumed to be in UTF-8 encoding; send points to 1 beyond the end of s. *retlen will be set to the length, in bytes, of that character.

If s does not point to a well-formed UTF-8 character and UTF8 warnings are enabled, zero is returned and *retlen is set (if retlen isn't NULL) to -1. If those warnings are off, the computed value if well-defined (or the Unicode REPLACEMENT CHARACTER, if not) is silently returned, and *retlen is set (if retlen isn't NULL) so that (s + *retlen) is the next possible position in s that could begin a non-malformed character. See "utf8n_to_uvchr" for details on when the REPLACEMENT CHARACTER is returned.

Returns the number of characters in the sequence of UTF-8-encoded bytes starting at s and ending at the byte just before e. If s and e point to the same place, it returns 0 with no warning raised.

If e < s or if the scan would end up past e, it raises a UTF8 warning and returns the number of valid characters.
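One simple way to see what "number of characters" means here is to count the bytes that are not continuation bytes, since every character contributes exactly one start byte. The standalone sketch below does that. It is a different technique from stepping through the string character by character, it assumes well-formed standard UTF-8 on an ASCII platform, it raises no warnings, and the name sk_utf8_char_count is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: counts characters in [s, e) by counting the
 * bytes that are not of the form 10xxxxxx. Returns 0 if e <= s. */
size_t sk_utf8_char_count(const unsigned char *s, const unsigned char *e)
{
    size_t count = 0;

    assert(s != NULL && e != NULL);
    while (s < e) {
        if ((*s & 0xC0) != 0x80)
            count++;
        s++;
    }
    return count;
}
```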

Compares the sequence of characters (stored as octets) in b, blen with the sequence of characters (stored as UTF-8) in u, ulen. Returns 0 if they are equal, -1 or -2 if the first string is less than the second string, +1 or +2 if the first string is greater than the second string.

-1 or +1 is returned if the shorter string was identical to the start of the longer string. -2 or +2 is returned if there was a difference between characters within the strings.
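The five return values can be illustrated with a standalone comparison routine. This is a sketch under stated assumptions, not Perl's code: sk_bytes_cmp_utf8 is a hypothetical name, the byte string is treated as Latin-1 on an ASCII platform, and the UTF-8 string is assumed to be well-formed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: compare bytes b[0..blen) with the characters
 * encoded as UTF-8 in u[0..ulen).
 *  0      : equal
 * -1 / +1 : the shorter string is a prefix of the longer one
 * -2 / +2 : a differing character was found                     */
int sk_bytes_cmp_utf8(const unsigned char *b, size_t blen,
                      const unsigned char *u, size_t ulen)
{
    const unsigned char *bend, *uend;

    assert(b != NULL && u != NULL);
    bend = b + blen;
    uend = u + ulen;

    while (b < bend && u < uend) {
        unsigned cp = *u++;
        if (cp >= 0x80) {
            /* Only two-byte sequences can encode a value <= 0xFF. */
            if ((cp & 0xFC) == 0xC0 && u < uend && (*u & 0xC0) == 0x80) {
                cp = ((cp & 0x1F) << 6) | (unsigned)(*u++ & 0x3F);
            } else {
                /* Code point above 0xFF: greater than any byte. */
                return -2;
            }
        }
        if (*b != cp)
            return *b < cp ? -2 : +2;
        b++;
    }
    if (b == bend && u == uend)
        return 0;
    return b < bend ? +1 : -1;
}
```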

Converts a string "s" of length *lenp from UTF-8 into native byte encoding. Unlike "bytes_to_utf8", this over-writes the original string, and updates *lenp to contain the new length. Returns zero on failure (leaving "s" unchanged) setting *lenp to -1.

Upon successful return, the number of variants in the string can be computed by having saved the value of *lenp before the call, and subtracting the after-call value of *lenp from it.

If you need a copy of the string, see "bytes_from_utf8".
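The in-place downgrade and the variant-count arithmetic described above can be sketched in standalone C. The name sk_utf8_to_bytes is hypothetical, an ASCII platform is assumed (so only the start bytes 0xC2 and 0xC3 introduce characters that fit in one byte), and this is not Perl's implementation. Like the documented function, the sketch returns NULL and sets *lenp to -1 on failure without modifying the string.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of an in-place UTF-8 to Latin-1 downgrade. */
unsigned char *sk_utf8_to_bytes(unsigned char *s, size_t *lenp)
{
    size_t len, i, out = 0;

    assert(s != NULL && lenp != NULL);
    len = *lenp;

    /* Pass 1: verify every character fits in one byte, so that a
     * failure leaves the string untouched. */
    for (i = 0; i < len; i++) {
        if (s[i] < 0x80)
            continue;
        if ((s[i] == 0xC2 || s[i] == 0xC3)
            && i + 1 < len && (s[i + 1] & 0xC0) == 0x80) {
            i++;                    /* skip the continuation byte */
            continue;
        }
        *lenp = (size_t)-1;
        return NULL;
    }

    /* Pass 2: overwrite in place; the output never outruns the input. */
    for (i = 0; i < len; i++) {
        if (s[i] < 0x80) {
            s[out++] = s[i];
        } else {
            s[out++] = (unsigned char)(((s[i] & 0x1F) << 6)
                                       | (s[i + 1] & 0x3F));
            i++;
        }
    }
    *lenp = out;
    return s;
}
```

Each two-byte character shrinks to one byte, so the length before the call minus the length after it equals the number of variant characters.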

Converts a potentially UTF-8 encoded string s of length *lenp into native byte encoding. On input, the boolean *is_utf8p gives whether or not s is actually encoded in UTF-8.

Unlike "utf8_to_bytes" but like "bytes_to_utf8", this is non-destructive of the input string.

Do nothing if *is_utf8p is 0, or if there are code points in the string not expressible in native byte encoding. In these cases, *is_utf8p and *lenp are unchanged, and the return value is the original s.

Otherwise, *is_utf8p is set to 0, and the return value is a pointer to a newly created string containing a downgraded copy of s, and whose length is returned in *lenp, updated. The new string is NUL-terminated. The caller is responsible for arranging for the memory used by this string to get freed.

Upon successful return, the number of variants in the string can be computed by having saved the value of *lenp before the call, and subtracting the after-call value of *lenp from it.

Converts a string s of length *lenp bytes from the native encoding into UTF-8. Returns a pointer to the newly-created string, and sets *lenp to reflect the new length in bytes. The caller is responsible for arranging for the memory used by this string to get freed.

Upon successful return, the number of variants in the string can be computed by having saved the value of *lenp before the call, and subtracting it from the after-call value of *lenp.

A NUL character will be written after the end of the string.

If you want to convert to UTF-8 from encodings other than the native (Latin1 or EBCDIC), see "sv_recode_to_utf8"().
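The upgrade direction can be sketched the same way. The standalone function below is a hypothetical illustration, not Perl's code: it assumes the native encoding is Latin-1 on an ASCII platform and uses malloc, so the caller releases the result with free rather than with Perl's memory routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch: returns a newly allocated, NUL-terminated
 * UTF-8 copy of the Latin-1 bytes s[0..*lenp) and updates *lenp to
 * the new length in bytes. Returns NULL if allocation fails. */
unsigned char *sk_bytes_to_utf8(const unsigned char *s, size_t *lenp)
{
    size_t len, i;
    unsigned char *out, *d;

    assert(s != NULL && lenp != NULL);
    len = *lenp;

    /* Worst case: every byte is a variant and doubles in size. */
    out = malloc(len * 2 + 1);
    if (out == NULL)
        return NULL;

    d = out;
    for (i = 0; i < len; i++) {
        if (s[i] < 0x80) {
            *d++ = s[i];                        /* invariant byte */
        } else {
            *d++ = (unsigned char)(0xC0 | (s[i] >> 6));
            *d++ = (unsigned char)(0x80 | (s[i] & 0x3F));
        }
    }
    *d = '\0';
    *lenp = (size_t)(d - out);
    return out;
}
```

Here the after-call length minus the before-call length gives the number of variants, the mirror image of the downgrade case.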

Instead use "toUPPER_utf8_safe".

Instead use "toTITLE_utf8_safe".

Instead use "toLOWER_utf8_safe".

Instead use "toFOLD_utf8_safe".

Build to the scalar dsv a displayable version of the string spv, length len, the displayable version being at most pvlim bytes long (if longer, the rest is truncated and "..." will be appended).

The flags argument can have UNI_DISPLAY_ISPRINT set to display isPRINT()able characters as themselves, UNI_DISPLAY_BACKSLASH to display the \\[nrfta\\] as the backslashed versions (like "\n") (UNI_DISPLAY_BACKSLASH is preferred over UNI_DISPLAY_ISPRINT for "\\"). UNI_DISPLAY_QQ (and its alias UNI_DISPLAY_REGEX) have both UNI_DISPLAY_BACKSLASH and UNI_DISPLAY_ISPRINT turned on.

The pointer to the PV of the dsv is returned.
