The MUTABLE_*() macros cast pointers to the types shown, in such a way (compiler permitting) that casting away const-ness will give a warning; e.g.:

 const SV *sv = ...;
 AV *av1 = (AV*)sv;        /* BAD:  the const has been silently cast away */
 AV *av2 = MUTABLE_AV(sv); /* GOOD: it may warn */

MUTABLE_PTR is the base macro used to derive new casts. The other already-built-in ones return pointers to what their names indicate.
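
One way such a cast can be made to warn is sketched below in plain C. This is an illustration only, not Perl's definition: my_mutable_ptr and MY_MUTABLE_INT are hypothetical names, and the trick assumed here is simply to route the pointer through a function whose parameter is a non-const void *, so that passing a pointer-to-const is diagnosed by most compilers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a base cast helper: because the parameter is a
 * non-const void *, handing it a pointer-to-const discards a qualifier,
 * which compilers usually report. A plain C cast would stay silent. */
static inline void *my_mutable_ptr(void *p) { return p; }

/* Hypothetical derived cast, built on the base helper. */
#define MY_MUTABLE_INT(p) ((int *)my_mutable_ptr(p))

/* Increments the int behind an opaque pointer and returns the new value. */
int bump(void *opaque)
{
    int *counter = MY_MUTABLE_INT(opaque);   /* fine: opaque is not const */
    assert(counter != NULL);
    return ++*counter;
}

/* Given some "const int *ro":
 *   int *a = (int *)ro;            compiles without any diagnostic
 *   int *b = MY_MUTABLE_INT(ro);   typically draws a discarded-qualifier warning
 */
```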

The *V_FROM_REF macros extract the SvRV() from a given reference SV and return a suitably cast pointer to the referenced SV. When running under -DDEBUGGING, assertions are also applied that check that ref is definitely a reference SV that refers to an SV of the right type.

Cast-to-bool. When Perl could still be compiled with pre-C99 compilers, a (bool) cast didn't necessarily do the right thing, so this macro was created (and made somewhat complicated to work around bugs in old compilers). Now that C99 is used, the macro is no longer required, but it is kept for backwards compatibility.
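
To see why a plain cast could go wrong, consider a minimal sketch that assumes a pre-C99 style boolean which is just a narrow integer type. The names legacy_bool, naive_bool_cast, MY_CBOOL and careful_bool are hypothetical and are not Perl's definitions.

```c
#include <limits.h>

/* Hypothetical pre-C99 style boolean: merely a narrow unsigned integer. */
typedef unsigned char legacy_bool;

/* A plain cast keeps only the low-order bits, so a nonzero value whose low
 * bits are all zero turns into 0. */
legacy_bool naive_bool_cast(unsigned int x)
{
    return (legacy_bool)x;
}

/* Comparing against zero always yields exactly 0 or 1. */
#define MY_CBOOL(x) ((x) != 0)

legacy_bool careful_bool(unsigned int x)
{
    return (legacy_bool)MY_CBOOL(x);
}
```

With a true C99 bool the conversion itself compares against zero, which is why the workaround is no longer needed there.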

These are equivalent to the correspondingly-named C99 typedefs on platforms that have those; they evaluate to int and unsigned int on platforms that don't, so that you can portably take advantage of this C99 feature.

This is a helper macro to avoid preprocessor issues, replaced by nothing unless under DEBUGGING, where it expands to an assert of its argument, followed by a comma (hence the comma operator). If we just used a straight assert(), we would get a comma with nothing before it when not DEBUGGING.
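
The comma trick can be sketched in plain C as follows. MY_DEBUGGING, MY_ASSERT_ and FIRST_CHAR are hypothetical names chosen for this illustration; only the shape of the expansion is the point.

```c
#include <assert.h>
#include <stddef.h>

#define MY_DEBUGGING 1   /* hypothetical switch; remove to drop the checks */

#ifdef MY_DEBUGGING
#  define MY_ASSERT_(cond) assert(cond),   /* note the trailing comma */
#else
#  define MY_ASSERT_(cond)                 /* expands to nothing at all */
#endif

/* Because the helper supplies its own comma, the expansion is either
 * "(assert(...), (s)[0])" or "((s)[0])", never "(, (s)[0])". */
#define FIRST_CHAR(s) (MY_ASSERT_((s) != NULL) (s)[0])

char first_of(const char *s)
{
    return FIRST_CHAR(s);
}
```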

Like "lex_stuff_pvn", but takes a literal string instead of a string/length pair.

Returns two comma-separated tokens: the input literal string and its length. This is a convenience macro which helps out in some API calls. Note that it can't be used as an argument to macros or functions that under some configurations might be macros, which means that it requires the full Perl_xxx(aTHX_ ...) form for any API calls where it's used.
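
A minimal sketch of how a macro can expand to a string/length pair, assuming nothing about Perl's actual definition. MY_STR_WITH_LEN, starts_with and is_http_url are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical equivalent: expands to TWO arguments, the literal and its
 * length without the trailing NUL. The "" s "" concatenation only compiles
 * when s is a string literal. */
#define MY_STR_WITH_LEN(s) ("" s ""), (sizeof(s) - 1)

/* A real function (not a macro) taking a string/length pair. */
int starts_with(const char *buf, const char *prefix, size_t prefix_len)
{
    return strncmp(buf, prefix, prefix_len) == 0;
}

int is_http_url(const char *url)
{
    /* Equivalent to starts_with(url, "http://", 7). */
    return starts_with(url, MY_STR_WITH_LEN("http://"));
}
```

If starts_with were itself a three-parameter macro, the preprocessor would count MY_STR_WITH_LEN("http://") as a single argument before expanding it and reject the call, which is the restriction described above.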

Returns whether or not the perl currently being compiled has the specified relationship to the perl given by the parameters. For example,

 #if PERL_VERSION_GT(5,24,2)
   /* code that will only be compiled on perls after v5.24.2 */
 #else
   /* fallback code */
 #endif

Note that this is usable in making compile-time decisions.

You may use the special value '*' for the final number to mean ALL possible values for it. Thus,

 #if PERL_VERSION_EQ(5,31,'*')

means all perls in the 5.31 series. And

 #if PERL_VERSION_NE(5,24,'*')

means all perls EXCEPT 5.24 ones. And

 #if PERL_VERSION_LE(5,9,'*')

is effectively

 #if PERL_VERSION_LT(5,10,0)

This means you don't have to think so much when converting from the existing deprecated PERL_VERSION to using this macro:

 #if PERL_VERSION <= 9

becomes

 #if PERL_VERSION_LE(5,9,'*')
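
The following sketch shows one way such comparisons, including the '*' wildcard, can be evaluated entirely by the preprocessor. All MY_* names and the 5.36.1 version number are hypothetical; this is not Perl's implementation. In this simplified form, a patch level numerically equal to the character '*' would be mistaken for the wildcard.

```c
/* Hypothetical version of the thing being compiled: 5.36.1. */
#define MY_MAJOR 5
#define MY_MINOR 36
#define MY_PATCH 1

/* Pack the three components into one comparable number. */
#define MY_TO_DECIMAL(j, n, p) (((j) * 1000 + (n)) * 1000 + (p))
#define MY_DECIMAL_VERSION MY_TO_DECIMAL(MY_MAJOR, MY_MINOR, MY_PATCH)

/* '*' as the last argument means "any patch level". */
#define MY_VERSION_EQ(j, n, p)                                            \
    (((p) == '*') ? ((j) == MY_MAJOR && (n) == MY_MINOR)                  \
                  : (MY_DECIMAL_VERSION == MY_TO_DECIMAL(j, n, p)))

/* "<= j.n.*" is the same as "< j.(n+1).0". */
#define MY_VERSION_LE(j, n, p)                                            \
    (((p) == '*') ? (MY_DECIMAL_VERSION < MY_TO_DECIMAL(j, (n) + 1, 0))   \
                  : (MY_DECIMAL_VERSION <= MY_TO_DECIMAL(j, n, p)))

int in_5_36_series(void)
{
#if MY_VERSION_EQ(5, 36, '*')
    return 1;
#else
    return 0;
#endif
}

int at_most_5_9(void)
{
#if MY_VERSION_LE(5, 9, '*')
    return 1;
#else
    return 0;
#endif
}
```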

Character classification

This section is about functions (really macros) that classify characters into types, such as punctuation versus alphabetic, etc. Most of these are analogous to regular expression character classes. (See "POSIX Character Classes" in perlrecharclass.) There are several variants for each class. (Not all macros have all variants; each item below lists the ones valid for it.) None are affected by use bytes, and only the ones with LC in the name are affected by the current locale.

The base function, e.g., isALPHA(), takes any signed or unsigned value, treating it as a code point, and returns a boolean as to whether or not the character represented by it is (or on non-ASCII platforms, corresponds to) an ASCII character in the named class based on platform, Unicode, and Perl rules. If the input is a number that doesn't fit in an octet, FALSE is returned.

Variant isFOO_A (e.g., isALPHA_A()) is identical to the base function with no suffix "_A". This variant is used to emphasize by its name that only ASCII-range characters can return TRUE.

Variant isFOO_L1 imposes the Latin-1 (or EBCDIC equivalent) character set onto the platform. That is, the code points that are ASCII are unaffected, since ASCII is a subset of Latin-1. But the non-ASCII code points are treated as if they are Latin-1 characters. For example, isWORDCHAR_L1() will return true when called with the code point 0xDF, which is a word character in both ASCII and EBCDIC (though it represents different characters in each). If the input is a number that doesn't fit in an octet, FALSE is returned. (Perl's documentation uses a colloquial definition of Latin-1, to include all code points below 256.)
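
The difference between the ASCII-only form and the Latin-1 form can be sketched in plain C. The functions below are hypothetical stand-ins written for an ASCII platform, not Perl's macros; the Latin-1 table is hard-coded from the Unicode word-character property for code points 0x80..0xFF.

```c
/* ASCII-only classification: anything outside the ASCII word characters,
 * including every value above 127, yields false. */
int is_wordchar_a(unsigned long cp)
{
    return (cp >= '0' && cp <= '9') || (cp >= 'A' && cp <= 'Z')
        || (cp >= 'a' && cp <= 'z') || cp == '_';
}

/* Latin-1 classification: ASCII answers are unchanged, and the upper half
 * of the octet range is judged as Latin-1 characters. */
int is_wordchar_l1(unsigned long cp)
{
    if (cp > 0xFF) return 0;                 /* does not fit in an octet */
    if (cp < 0x80) return is_wordchar_a(cp);
    return cp == 0xAA || cp == 0xB5 || cp == 0xBA
        || (cp >= 0xC0 && cp != 0xD7 && cp != 0xF7);
}
```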

Variant isFOO_uvchr is exactly like the isFOO_L1 variant, for inputs below 256, but if the code point is larger than 255, Unicode rules are used to determine if it is in the character class. For example, isWORDCHAR_uvchr(0x100) returns TRUE, since 0x100 is LATIN CAPITAL LETTER A WITH MACRON in Unicode, and is a word character.

Variants isFOO_utf8 and isFOO_utf8_safe are like isFOO_uvchr, but are used for UTF-8 encoded strings. The two forms are different names for the same thing. Each call to one of these classifies the first character of the string starting at p. The second parameter, e, points to anywhere in the string beyond the first character, up to one byte past the end of the entire string. Although both variants are identical, the suffix _safe in one name emphasizes that it will not attempt to read beyond e - 1, provided that the constraint p < e is true (this is asserted in -DDEBUGGING builds). If the UTF-8 for the input character is malformed in some way, the program may croak, or the function may return FALSE, at the discretion of the implementation, and subject to change in future releases.
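
The bounds guarantee can be illustrated with a small plain-C sketch. It is not how Perl implements these macros: utf8_claimed_len and first_char_in_bounds are hypothetical helpers, only standard UTF-8 of up to four bytes is considered, and continuation bytes are not validated.

```c
#include <assert.h>
#include <stddef.h>

/* Number of bytes the lead byte claims for its character, or 0 if the byte
 * cannot start a character. */
size_t utf8_claimed_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

/* True when the first character at p lies entirely within p .. e - 1, so a
 * classifier could examine it without reading at or beyond e. */
int first_char_in_bounds(const unsigned char *p, const unsigned char *e)
{
    size_t need;
    assert(p < e);                           /* start must precede the end */
    need = utf8_claimed_len(*p);
    return need != 0 && need <= (size_t)(e - p);
}
```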

Variant isFOO_LC is like the isFOO_A and isFOO_L1 variants, but the result is based on the current locale, which is what LC in the name stands for. If Perl can determine that the current locale is a UTF-8 locale, it uses the published Unicode rules; otherwise, it uses the C library function that gives the named classification. For example, isDIGIT_LC() when not in a UTF-8 locale returns the result of calling isdigit(). FALSE is always returned if the input won't fit into an octet. On some platforms where the C library function is known to be defective, Perl changes its result to follow the POSIX standard's rules.

Variant isFOO_LC_uvchr acts exactly like isFOO_LC for inputs less than 256, but for larger ones it returns the Unicode classification of the code point.

Variants isFOO_LC_utf8 and isFOO_LC_utf8_safe are like isFOO_LC_uvchr, but are used for UTF-8 encoded strings. The two forms are different names for the same thing. Each call to one of these classifies the first character of the string starting at p. The second parameter, e, points to anywhere in the string beyond the first character, up to one byte past the end of the entire string. Although both variants are identical, the suffix _safe in one name emphasizes that it will not attempt to read beyond e - 1, provided that the constraint p < e is true (this is asserted in -DDEBUGGING builds). If the UTF-8 for the input character is malformed in some way, the program may croak, or the function may return FALSE, at the discretion of the implementation, and subject to change in future releases.

The C suffix in the names was meant to indicate that they correspond to the C language isalnum(3).

Also note that because all ASCII characters are UTF-8 invariant (meaning they have the exact same representation (always a single byte) whether encoded in UTF-8 or not), isASCII will give the correct results when called with any byte in any string, whether or not it is encoded in UTF-8. Similarly, isASCII_utf8 and isASCII_utf8_safe will work properly on any string, whether or not it is encoded in UTF-8.

Returns a boolean indicating whether the specified character is a control character, analogous to m/[[:cntrl:]]/. See the top of this section for an explanation of the variants. On EBCDIC platforms, you almost always want to use the isCNTRL_L1 variant.

Returns a boolean indicating whether the specified character is a digit, analogous to m/[[:digit:]]/. Variants isDIGIT_A and isDIGIT_L1 are identical to isDIGIT. See the top of this section for an explanation of the variants.

See the top of this section for an explanation of the variants.

isWORDCHAR_A, isWORDCHAR_L1, isWORDCHAR_uvchr, isWORDCHAR_LC, isWORDCHAR_LC_uvchr, isWORDCHAR_LC_utf8, and isWORDCHAR_LC_utf8_safe are also as described there, but additionally include the platform's native underscore.

They are provided for backward compatibility, even though a word character includes more than the standard C language meaning of alphanumeric. To get the C language definition, use the corresponding "isALPHANUMERIC" variant.

Character case changing

Perl uses "full" Unicode case mappings. This means that converting a single character to another case may result in a sequence of more than one character. For example, the uppercase of ß (LATIN SMALL LETTER SHARP S) is the two character sequence SS. This presents some complications. The lowercase of all characters in the range 0..255 is a single character, and thus "toLOWER_L1" is furnished. But toUPPER_L1 can't exist, as it couldn't return a valid result for all legal inputs. Instead "toUPPER_uvchr" has an API that does allow every possible legal result to be returned. Likewise no other function that is crippled by not being able to give the correct results for the full range of possible inputs has been implemented here.

These all return the uppercase of a character. The differences are what domain they operate on, and whether the input is specified as a code point (those forms with a cp parameter) or as a UTF-8 string (the others). In the latter case, the code point to use is the first one in the buffer of UTF-8 encoded code points, delineated by the arguments p .. e - 1.

toUPPER and toUPPER_A are synonyms of each other. They return the uppercase of any lowercase ASCII-range code point. All other inputs are returned unchanged. Since these are macros, the input type may be any integral one, and the output will occupy the same number of bits as the input.

There is no toUPPER_L1 nor toUPPER_LATIN1 as the uppercase of some code points in the 0..255 range is above that range or consists of multiple characters. Instead use toUPPER_uvchr.

toUPPER_uvchr returns the uppercase of any Unicode code point. The return value is identical to that of toUPPER_A for input code points in the ASCII range. The uppercase of the vast majority of Unicode code points is the same as the code point itself. For these, and for code points above the legal Unicode maximum, this returns the input code point unchanged. It additionally stores the UTF-8 of the result into the buffer beginning at s, and its length in bytes into *lenp. The caller must have made s large enough to contain at least UTF8_MAXBYTES_CASE+1 bytes to avoid possible overflow.

NOTE: the uppercase of a code point may be more than one code point. The return value of this function is only the first of these. The entire uppercase is returned in s. To determine if the result is more than a single code point, you can do something like this:

 uc = toUPPER_uvchr(cp, s, &len);
 if (len > UTF8SKIP(s)) { /* is multiple code points */ }
 else { /* is a single code point */ }
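
The length comparison can be tried outside Perl with a toy model. Everything here is hypothetical: my_utf8_skip stands in for what UTF8SKIP reports on standard UTF-8, and toy_upper knows only ASCII letters and U+00DF, which is enough to show a single code point uppercasing to two.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes occupied by the UTF-8 character that starts at s. */
size_t my_utf8_skip(const unsigned char *s)
{
    if (*s < 0x80) return 1;
    if (*s < 0xE0) return 2;
    if (*s < 0xF0) return 3;
    return 4;
}

/* Toy uppercase: writes the UTF-8 of the full result to buf, its byte
 * length to *lenp, and returns the first code point of the result. */
unsigned long toy_upper(unsigned long cp, unsigned char *buf, size_t *lenp)
{
    assert(cp < 0x80 || cp == 0xDF);         /* the only inputs modelled */
    if (cp == 0xDF) {                        /* sharp s uppercases to "SS" */
        buf[0] = 'S'; buf[1] = 'S'; *lenp = 2;
        return 'S';
    }
    if (cp >= 'a' && cp <= 'z') cp -= 'a' - 'A';
    buf[0] = (unsigned char)cp; *lenp = 1;
    return cp;
}

/* The check from the text: more bytes in the buffer than the first
 * character occupies means the result is several code points. */
int upper_is_multiple(unsigned long cp)
{
    unsigned char buf[8];
    size_t len;
    (void)toy_upper(cp, buf, &len);
    return len > my_utf8_skip(buf);
}
```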

toUPPER_utf8 and toUPPER_utf8_safe are synonyms of each other. The only difference between these and toUPPER_uvchr is that the source for these is encoded in UTF-8, instead of being a code point. It is passed as a buffer starting at p, with e pointing to one byte beyond its end. The p buffer may certainly contain more than one code point; but only the first one (up through e - 1) is examined. If the UTF-8 for the input character is malformed in some way, the program may croak, or the function may return the REPLACEMENT CHARACTER, at the discretion of the implementation, and subject to change in future releases.

These all return the foldcase of a character. "foldcase" is an internal case for /i pattern matching. If the foldcase of character A and the foldcase of character B are the same, they match caselessly; otherwise they don't.

The differences in the forms are what domain they operate on, and whether the input is specified as a code point (those forms with a cp parameter) or as a UTF-8 string (the others). In the latter case, the code point to use is the first one in the buffer of UTF-8 encoded code points, delineated by the arguments p .. e - 1.

toFOLD and toFOLD_A are synonyms of each other. They return the foldcase of any ASCII-range code point. In this range, the foldcase is identical to the lowercase. All other inputs are returned unchanged. Since these are macros, the input type may be any integral one, and the output will occupy the same number of bits as the input.
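
For the ASCII range, caseless comparison by folding can be sketched as follows. fold_a and eq_caseless_a are hypothetical helpers for an ASCII platform, not Perl's macros.

```c
/* ASCII-only fold: in this range folding is the same as lowercasing, and
 * every other input is returned unchanged. */
unsigned long fold_a(unsigned long cp)
{
    return (cp >= 'A' && cp <= 'Z') ? cp + ('a' - 'A') : cp;
}

/* Two code points match caselessly when their folds are equal. */
int eq_caseless_a(unsigned long a, unsigned long b)
{
    return fold_a(a) == fold_a(b);
}
```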

There is no toFOLD_L1 nor toFOLD_LATIN1 as the foldcase of some code points in the 0..255 range is above that range or consists of multiple characters. Instead use toFOLD_uvchr.

toFOLD_uvchr returns the foldcase of any Unicode code point. The return value is identical to that of toFOLD_A for input code points in the ASCII range. The foldcase of the vast majority of Unicode code points is the same as the code point itself. For these, and for code points above the legal Unicode maximum, this returns the input code point unchanged. It additionally stores the UTF-8 of the result into the buffer beginning at s, and its length in bytes into *lenp. The caller must have made s large enough to contain at least UTF8_MAXBYTES_CASE+1 bytes to avoid possible overflow.

NOTE: the foldcase of a code point may be more than one code point. The return value of this function is only the first of these. The entire foldcase is returned in s. To determine if the result is more than a single code point, you can do something like this:

 uc = toFOLD_uvchr(cp, s, &len);
 if (len > UTF8SKIP(s)) { /* is multiple code points */ }
 else { /* is a single code point */ }

toFOLD_utf8 and toFOLD_utf8_safe are synonyms of each other. The only difference between these and toFOLD_uvchr is that the source for these is encoded in UTF-8, instead of being a code point. It is passed as a buffer starting at p, with e pointing to one byte beyond its end. The p buffer may certainly contain more than one code point; but only the first one (up through e - 1) is examined. If the UTF-8 for the input character is malformed in some way, the program may croak, or the function may return the REPLACEMENT CHARACTER, at the discretion of the implementation, and subject to change in future releases.

These all return the lowercase of a character. The differences are what domain they operate on, and whether the input is specified as a code point (those forms with a cp parameter) or as a UTF-8 string (the others). In the latter case, the code point to use is the first one in the buffer of UTF-8 encoded code points, delineated by the arguments p .. e - 1.

toLOWER and toLOWER_A are synonyms of each other. They return the lowercase of any uppercase ASCII-range code point. All other inputs are returned unchanged. Since these are macros, the input type may be any integral one, and the output will occupy the same number of bits as the input.

toLOWER_L1 and toLOWER_LATIN1 are synonyms of each other. They behave identically to toLOWER for ASCII-range input, but additionally return the lowercase of any uppercase code point in the entire 0..255 range, assuming a Latin-1 encoding (or the EBCDIC equivalent on such platforms).
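
A plain-C sketch of the same idea on an ASCII/Latin-1 platform is shown below. lower_a and lower_l1 are hypothetical helpers; the sketch relies on the layout of Latin-1, where the uppercase letters occupy 0xC0..0xDE (except 0xD7, the multiplication sign) and each lowercase form is 0x20 higher.

```c
/* ASCII-only lowercase; all other inputs are returned unchanged. */
unsigned long lower_a(unsigned long cp)
{
    return (cp >= 'A' && cp <= 'Z') ? cp + 0x20 : cp;
}

/* Latin-1 lowercase: same as lower_a in the ASCII range, and additionally
 * maps the uppercase letters in the upper half of the octet range. */
unsigned long lower_l1(unsigned long cp)
{
    if (cp < 0x80) return lower_a(cp);
    if (cp >= 0xC0 && cp <= 0xDE && cp != 0xD7) return cp + 0x20;
    return cp;
}
```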

toLOWER_LC returns the lowercase of the input code point according to the rules of the current POSIX locale. Input code points outside the range 0..255 are returned unchanged.

toLOWER_uvchr returns the lowercase of any Unicode code point. The return value is identical to that of toLOWER_L1 for input code points in the 0..255 range. The lowercase of the vast majority of Unicode code points is the same as the code point itself. For these, and for code points above the legal Unicode maximum, this returns the input code point unchanged. It additionally stores the UTF-8 of the result into the buffer beginning at s, and its length in bytes into *lenp. The caller must have made s large enough to contain at least UTF8_MAXBYTES_CASE+1 bytes to avoid possible overflow.

NOTE: the lowercase of a code point may be more than one code point. The return value of this function is only the first of these. The entire lowercase is returned in s. To determine if the result is more than a single code point, you can do something like this:

 uc = toLOWER_uvchr(cp, s, &len);
 if (len > UTF8SKIP(s)) { /* is multiple code points */ }
 else { /* is a single code point */ }

toLOWER_utf8 and toLOWER_utf8_safe are synonyms of each other. The only difference between these and toLOWER_uvchr is that the source for these is encoded in UTF-8, instead of being a code point. It is passed as a buffer starting at p, with e pointing to one byte beyond its end. The p buffer may certainly contain more than one code point; but only the first one (up through e - 1) is examined. If the UTF-8 for the input character is malformed in some way, the program may croak, or the function may return the REPLACEMENT CHARACTER, at the discretion of the implementation, and subject to change in future releases.

These all return the titlecase of a character. The differences are what domain they operate on, and whether the input is specified as a code point (those forms with a cp parameter) or as a UTF-8 string (the others). In the latter case, the code point to use is the first one in the buffer of UTF-8 encoded code points, delineated by the arguments p .. e - 1.

toTITLE and toTITLE_A are synonyms of each other. They return the titlecase of any lowercase ASCII-range code point. In this range, the titlecase is identical to the uppercase. All other inputs are returned unchanged. Since these are macros, the input type may be any integral one, and the output will occupy the same number of bits as the input.

There is no toTITLE_L1 nor toTITLE_LATIN1 as the titlecase of some code points in the 0..255 range is above that range or consists of multiple characters. Instead use toTITLE_uvchr.

toTITLE_uvchr returns the titlecase of any Unicode code point. The return value is identical to that of toTITLE_A for input code points in the ASCII range. The titlecase of the vast majority of Unicode code points is the same as the code point itself. For these, and for code points above the legal Unicode maximum, this returns the input code point unchanged. It additionally stores the UTF-8 of the result into the buffer beginning at s, and its length in bytes into *lenp. The caller must have made s large enough to contain at least UTF8_MAXBYTES_CASE+1 bytes to avoid possible overflow.

NOTE: the titlecase of a code point may be more than one code point. The return value of this function is only the first of these. The entire titlecase is returned in s. To determine if the result is more than a single code point, you can do something like this:

 uc = toTITLE_uvchr(cp, s, &len);
 if (len > UTF8SKIP(s)) { /* is multiple code points */ }
 else { /* is a single code point */ }

toTITLE_utf8 and toTITLE_utf8_safe are synonyms of each other. The only difference between these and toTITLE_uvchr is that the source for these is encoded in UTF-8, instead of being a code point. It is passed as a buffer starting at p, with e pointing to one byte beyond its end. The p buffer may certainly contain more than one code point; but only the first one (up through e - 1) is examined. If the UTF-8 for the input character is malformed in some way, the program may croak, or the function may return the REPLACEMENT CHARACTER, at the discretion of the implementation, and subject to change in future releases.

Yields the widest unsigned integer type on the platform, currently either U32 or U64. This can be used in declarations such as

 WIDEST_UTYPE my_uv;

or casts

 my_uv = (WIDEST_UTYPE) val;

The XSUB-writer's interface to the C malloc function.

Memory obtained by this should ONLY be freed with "Safefree".

In 5.9.3, Newx() and friends replaced the older New() API and dropped the first parameter, x, a debug aid which allowed callers to identify themselves. This aid has been superseded by a new build option, PERL_MEM_LOG (see "PERL_MEM_LOG" in perlhacktips). The older API is still there for use in XS modules supporting older perls.

Memory obtained by this should ONLY be freed with "Safefree".

The XSUB-writer's interface to the C malloc function. The allocated memory is zeroed with memzero. See also "Newx".

Memory obtained by this should ONLY be freed with "Safefree".

The XSUB-writer's interface to the C realloc function.

Memory obtained by this should ONLY be freed with "Safefree".


This should ONLY be used on memory obtained using "Newx" and friends.
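
The pairing rule can be illustrated with typed allocation macros layered over the C library. MY_NEWX, MY_NEWXZ and MY_SAFEFREE are hypothetical stand-ins built on malloc, calloc and free; they are not Perl's macros and do not share Perl's out-of-memory behaviour.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical typed wrappers: pointer, element count, element type. */
#define MY_NEWX(ptr, nitems, type) \
    ((ptr) = (type *)malloc((nitems) * sizeof(type)))
#define MY_NEWXZ(ptr, nitems, type) \
    ((ptr) = (type *)calloc((nitems), sizeof(type)))
#define MY_SAFEFREE(ptr) free(ptr)

/* Sums 1..n through a zero-filled scratch array; returns -1 if the
 * allocation fails. */
long sum_1_to_n(size_t n)
{
    long *v, total = 0;
    size_t i;

    MY_NEWXZ(v, n, long);                    /* memory starts out zeroed */
    if (v == NULL) return -1;
    for (i = 0; i < n; i++) v[i] += (long)(i + 1);
    for (i = 0; i < n; i++) total += v[i];
    MY_SAFEFREE(v);                          /* release with the matching macro */
    return total;
}
```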

MoveD is like Move but returns dest. Useful for encouraging compilers to tail-call optimise.

CopyD is like Copy but returns dest. Useful for encouraging compilers to tail-call optimise.
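
Why returning dest helps can be seen in a small sketch. MY_COPYD is a hypothetical macro over memcpy, and its argument order (source, destination, count, type) is an assumption made for this illustration.

```c
#include <string.h>

/* Hypothetical copy macro that evaluates to the destination pointer. */
#define MY_COPYD(src, dst, nitems, type) \
    ((type *)memcpy((dst), (src), (nitems) * sizeof(type)))

/* Because the macro yields dst, the copy is the last thing the function
 * does, which leaves the compiler free to emit a tail call. */
int *clone_into(const int *src, int *dst, size_t n)
{
    return MY_COPYD(src, dst, n, int);
}
```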

The XSUB-writer's interface to the C memzero function. The dest is the destination, nitems is the number of items, and type is the type.

ZeroD is like Zero but returns dest. Useful for encouraging compilers to tail-call optimise.
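
A minimal sketch of the dest, nitems and type parameters, using hypothetical MY_ZERO and MY_ZEROD macros over memset rather than Perl's own definitions:

```c
#include <string.h>

/* Clears nitems elements of the given type starting at dest. */
#define MY_ZERO(dest, nitems, type) \
    ((void)memset((dest), 0, (nitems) * sizeof(type)))

/* Same, but evaluates to dest so the call can be returned directly. */
#define MY_ZEROD(dest, nitems, type) \
    ((type *)memset((dest), 0, (nitems) * sizeof(type)))

struct point { int x, y; };

struct point *reset_points(struct point *pts, size_t n)
{
    return MY_ZEROD(pts, n, struct point);
}
```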

Fill up memory with a byte pattern (a byte repeated over and over again) that hopefully catches attempts to access uninitialized memory.

PoisonWith(0xAB) for catching access to allocated but uninitialized memory.

PoisonWith(0xEF) for catching access to freed memory.
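
The poisoning idea can be sketched with memset. The MY_POISON_* macros and all_bytes_are are hypothetical; the parameter order (destination, count, type, byte) is an assumption for this illustration, while the 0xAB and 0xEF patterns come from the text.

```c
#include <string.h>

/* Fill nitems elements of the given type with one repeated byte. */
#define MY_POISON_WITH(dest, nitems, type, byte) \
    ((void)memset((dest), (byte), (nitems) * sizeof(type)))
#define MY_POISON_NEW(dest, nitems, type) \
    MY_POISON_WITH(dest, nitems, type, 0xAB)
#define MY_POISON_FREE(dest, nitems, type) \
    MY_POISON_WITH(dest, nitems, type, 0xEF)

/* Returns 1 when every one of the n bytes at buf holds the pattern. */
int all_bytes_are(const unsigned char *buf, size_t n, unsigned char pattern)
{
    size_t i;
    for (i = 0; i < n; i++)
        if (buf[i] != pattern) return 0;
    return 1;
}
```

A value such as 0xABABABAB showing up in a debugger is then a strong hint that memory was read before being initialized, and 0xEFEFEFEF that it was read after being freed.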


Returns the number of elements in the input C array (so valid zero-based indices are strictly less than this value).

Returns a pointer to one element past the final element of the input C array.
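
Both ideas can be expressed with sizeof in plain C. MY_ARRAY_LENGTH and MY_ARRAY_END are hypothetical names for this sketch; like any sizeof-based approach, they work only on true arrays, not on pointers.

```c
#include <stddef.h>

#define MY_ARRAY_LENGTH(a) (sizeof(a) / sizeof((a)[0]))
#define MY_ARRAY_END(a)    ((a) + MY_ARRAY_LENGTH(a))

static const int primes[] = { 2, 3, 5, 7, 11 };

size_t count_primes(void)
{
    return MY_ARRAY_LENGTH(primes);
}

/* Walks from the first element up to, but not including, the end pointer. */
int sum_primes(void)
{
    const int *p;
    int total = 0;
    for (p = primes; p < MY_ARRAY_END(primes); p++)
        total += *p;
    return total;
}
```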