Lexer interface

This is the lower layer of the Perl parser, managing characters and tokens.

Pointer to a structure encapsulating the state of the parsing operation currently in progress. The pointer can be locally changed to perform a nested parse without interfering with the state of an outer parse. Individual members of PL_parser have their own documentation.

Creates and initialises a new lexer/parser state object, supplying a context in which to lex and parse from a new source of Perl code. A pointer to the new state object is placed in "PL_parser". An entry is made on the save stack so that upon unwinding the new state object will be destroyed and the former value of "PL_parser" will be restored. Nothing else need be done to clean up the parsing context.

The code to be parsed comes from the line and rsfp parameters. If line is non-null, it provides a string (in SV form) containing code to be parsed. A copy of the string is made, so subsequent modification of line does not affect parsing. If rsfp is non-null, it provides an input stream from which code will be read to be parsed. If both are non-null, the code in line comes first and must consist of complete lines of input, and rsfp supplies the remainder of the source.

The flags parameter is reserved for future use. Currently it is only used by perl internally, so extensions should always pass zero.

Buffer scalar containing the chunk of the text being lexed that is currently under consideration. This is always a plain string scalar (for which SvPOK is true). It is not intended to be used as a scalar by normal scalar means; instead refer to the buffer directly by the pointer variables described below.

The lexer maintains various char* pointers to things in the PL_parser->linestr buffer. If PL_parser->linestr is ever reallocated, all of these pointers must be updated. Don't attempt to do this manually, but rather use "lex_grow_linestr" if you need to reallocate the buffer.

The content of the text chunk in the buffer is commonly exactly one complete line of input, up to and including a newline terminator, but there are situations where it is otherwise. The octets of the buffer may be intended to be interpreted as either UTF-8 or Latin-1. The function "lex_bufutf8" tells you which. Do not use the SvUTF8 flag on this scalar, which may disagree with it.

For direct examination of the buffer, the variable "PL_parser->bufend" points to the end of the buffer. The current lexing position is pointed to by "PL_parser->bufptr". Direct use of these pointers is usually preferable to examination of the scalar through normal scalar means.

Direct pointer to the end of the chunk of text currently being lexed, the end of the lexer buffer. This is equal to SvPVX(PL_parser->linestr) + SvCUR(PL_parser->linestr). A NUL character (zero octet) is always located at the end of the buffer, and does not count as part of the buffer's contents.

Points to the current position of lexing inside the lexer buffer. Characters around this point may be freely examined, within the range delimited by SvPVX("PL_parser->linestr") and "PL_parser->bufend". The octets of the buffer may be intended to be interpreted as either UTF-8 or Latin-1, as indicated by "lex_bufutf8".

Lexing code (whether in the Perl core or not) moves this pointer past the characters that it consumes. It is also expected to perform some bookkeeping whenever a newline character is consumed. This movement can be more conveniently performed by the function "lex_read_to", which handles newlines appropriately.

Interpretation of the buffer's octets can be abstracted out by using the slightly higher-level functions "lex_peek_unichar" and "lex_read_unichar".

Points to the start of the current line inside the lexer buffer. This is useful for indicating at which column an error occurred, and not much else. This must be updated by any lexing code that consumes a newline; the function "lex_read_to" handles this detail.

Indicates whether the octets in the lexer buffer ("PL_parser->linestr") should be interpreted as the UTF-8 encoding of Unicode characters. If not, they should be interpreted as Latin-1 characters. This is analogous to the SvUTF8 flag for scalars.

In UTF-8 mode, it is not guaranteed that the lexer buffer actually contains valid UTF-8. Lexing code must be robust in the face of invalid encoding.

The actual SvUTF8 flag of the "PL_parser->linestr" scalar is significant, but not the whole story regarding the input character encoding. Normally, when a file is being read, the scalar contains octets and its SvUTF8 flag is off, but the octets should be interpreted as UTF-8 if the "use utf8" pragma is in effect. During a string eval, however, the scalar may have the SvUTF8 flag on, and in this case its octets should be interpreted as UTF-8 unless the "use bytes" pragma is in effect. This logic may change in the future; use this function instead of implementing the logic yourself.

Reallocates the lexer buffer ("PL_parser->linestr") to accommodate at least len octets (including terminating NUL). Returns a pointer to the reallocated buffer. This is necessary before making any direct modification of the buffer that would increase its length. "lex_stuff_pvn" provides a more convenient way to insert text into the buffer.

Do not use SvGROW or sv_grow directly on PL_parser->linestr; this function updates all of the lexer's variables that point directly into the buffer.
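Why growing the buffer behind the lexer's back is dangerous can be seen in a small standalone model. The sketch below is not Perl's implementation: struct toy_lexbuf, toy_init and toy_grow are hypothetical names, and plain malloc/realloc stand in for the scalar's storage. It only illustrates the point made above, that when the block moves, every pointer into it has to be re-based.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for the lexer buffer; not Perl's real data structure. */
struct toy_lexbuf {
    char  *start;     /* first octet of the buffer                 */
    size_t alloc;     /* octets allocated, including the final NUL */
    char  *bufptr;    /* current lexing position                   */
    char  *bufend;    /* points at the terminating NUL             */
    char  *linestart; /* start of the current line                 */
};

/* Fill the toy buffer from a C string. Returns 0 on success, -1 on failure. */
int toy_init(struct toy_lexbuf *lb, const char *text)
{
    size_t len = strlen(text);

    lb->start = malloc(len + 1);
    if (lb->start == NULL)
        return -1;
    memcpy(lb->start, text, len + 1);
    lb->alloc = len + 1;
    lb->bufptr = lb->linestart = lb->start;
    lb->bufend = lb->start + len;
    return 0;
}

/* Make room for at least len octets (NUL included) and re-base all pointers. */
char *toy_grow(struct toy_lexbuf *lb, size_t len)
{
    ptrdiff_t ptr_off, end_off, line_off;
    char *fresh;

    assert(lb != NULL && lb->start != NULL);
    if (len <= lb->alloc)
        return lb->start;

    /* Save positions as offsets: the old pointers dangle once realloc moves the block. */
    ptr_off  = lb->bufptr    - lb->start;
    end_off  = lb->bufend    - lb->start;
    line_off = lb->linestart - lb->start;

    fresh = realloc(lb->start, len);
    if (fresh == NULL)
        return NULL;

    lb->start     = fresh;
    lb->alloc     = len;
    lb->bufptr    = fresh + ptr_off;
    lb->bufend    = fresh + end_off;
    lb->linestart = fresh + line_off;
    return fresh;
}
```

A caller that had cached its own copy of bufptr before calling toy_grow would be left with a stale pointer, which is the same hazard the real function exists to contain.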

Insert characters into the lexer buffer ("PL_parser->linestr"), immediately after the current lexing point ("PL_parser->bufptr"), reallocating the buffer if necessary. This means that lexing code that runs later will see the characters as if they had appeared in the input. It is not recommended to do this as part of normal parsing, and most uses of this facility run the risk of the inserted characters being interpreted in an unintended manner.

The string to be inserted is represented by len octets starting at pv. These octets are interpreted as either UTF-8 or Latin-1, according to whether the LEX_STUFF_UTF8 flag is set in flags. The characters are recoded for the lexer buffer, according to how the buffer is currently being interpreted ("lex_bufutf8"). If a string to be inserted is available as a Perl scalar, the "lex_stuff_sv" function is more convenient.
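The insert-and-recode step can be pictured with a standalone model. This is a sketch under stated assumptions, not the real function: struct toy_stuffbuf, toy_stuff and TOY_CAP are hypothetical, the buffer has a fixed capacity, and only the Latin-1 to UTF-8 direction of recoding is implemented (a UTF-8 string offered to a Latin-1 buffer is simply refused here).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_CAP 256

/* Toy lexer buffer with a fixed capacity; not Perl's real data structure. */
struct toy_stuffbuf {
    char   text[TOY_CAP]; /* NUL-terminated contents               */
    size_t len;           /* octets in use, excluding the NUL      */
    size_t pos;           /* offset of the current lexing position */
    int    is_utf8;       /* how the buffer's octets are read      */
};

/*
 * Insert len octets from pv at the lexing position, recoding Latin-1
 * input when the buffer is in UTF-8 mode.
 * Returns 0 on success, -1 if there is no room or the case is unsupported.
 */
int toy_stuff(struct toy_stuffbuf *b, const char *pv, size_t len, int pv_is_utf8)
{
    char   recoded[TOY_CAP];
    size_t n = 0;
    size_t i;

    assert(b->pos <= b->len && b->len < TOY_CAP);
    if (pv_is_utf8 && !b->is_utf8)
        return -1;                  /* downgrading is left out of this sketch */

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)pv[i];

        if (n + 2 > sizeof recoded)
            return -1;
        if (!pv_is_utf8 && b->is_utf8 && c >= 0x80) {
            /* One high Latin-1 octet becomes a two-octet UTF-8 sequence. */
            recoded[n++] = (char)(0xC0 | (c >> 6));
            recoded[n++] = (char)(0x80 | (c & 0x3F));
        } else {
            recoded[n++] = (char)c;
        }
    }
    if (b->len + n + 1 > TOY_CAP)
        return -1;

    /* Shift the unlexed tail (NUL included) right, then drop the new text in. */
    memmove(b->text + b->pos + n, b->text + b->pos, b->len - b->pos + 1);
    memcpy(b->text + b->pos, recoded, n);
    b->len += n;
    return 0;
}
```

The lexing position is left where it was, so the inserted octets are the next thing a later lexing step encounters.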

Insert characters into the lexer buffer ("PL_parser->linestr"), immediately after the current lexing point ("PL_parser->bufptr"), reallocating the buffer if necessary. This means that lexing code that runs later will see the characters as if they had appeared in the input. It is not recommended to do this as part of normal parsing, and most uses of this facility run the risk of the inserted characters being interpreted in an unintended manner.

The string to be inserted is represented by octets starting at pv and continuing to the first nul. These octets are interpreted as either UTF-8 or Latin-1, according to whether the LEX_STUFF_UTF8 flag is set in flags. The characters are recoded for the lexer buffer, according to how the buffer is currently being interpreted ("lex_bufutf8"). If it is not convenient to nul-terminate a string to be inserted, the "lex_stuff_pvn" function is more appropriate.

Insert characters into the lexer buffer ("PL_parser->linestr"), immediately after the current lexing point ("PL_parser->bufptr"), reallocating the buffer if necessary. This means that lexing code that runs later will see the characters as if they had appeared in the input. It is not recommended to do this as part of normal parsing, and most uses of this facility run the risk of the inserted characters being interpreted in an unintended manner.

The string to be inserted is the string value of sv. The characters are recoded for the lexer buffer, according to how the buffer is currently being interpreted ("lex_bufutf8"). If a string to be inserted is not already a Perl scalar, the "lex_stuff_pvn" function avoids the need to construct a scalar.

Discards text about to be lexed, from "PL_parser->bufptr" up to ptr. Text following ptr will be moved, and the buffer shortened. This hides the discarded text from any lexing code that runs later, as if the text had never appeared.

This is not the normal way to consume lexed text. For that, use "lex_read_to".
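The effect of discarding upcoming text can be sketched with a standalone model. The names struct toy_unstuffbuf and toy_unstuff are hypothetical, offsets are used in place of the real char* pointers, and nothing here is taken from Perl's source.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy lexer buffer addressed by offsets; not Perl's real data structure. */
struct toy_unstuffbuf {
    char   text[64]; /* NUL-terminated contents               */
    size_t len;      /* octets in use, excluding the NUL      */
    size_t pos;      /* offset of the current lexing position */
};

/*
 * Remove the octets from the lexing position up to offset `to`.
 * Returns 0 on success, -1 if `to` lies outside the unlexed text.
 */
int toy_unstuff(struct toy_unstuffbuf *b, size_t to)
{
    assert(b->pos <= b->len);
    if (to < b->pos || to > b->len)
        return -1;

    /* Move the tail, including the terminating NUL, down over the gap. */
    memmove(b->text + b->pos, b->text + to, b->len - to + 1);
    b->len -= to - b->pos;
    return 0;
}
```

The lexing position does not move; the text that used to follow the gap is now what sits at that position.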

Consume text in the lexer buffer, from "PL_parser->bufptr" up to ptr. This advances "PL_parser->bufptr" to match ptr, performing the correct bookkeeping whenever a newline character is passed. This is the normal way to consume lexed text.

Interpretation of the buffer's octets can be abstracted out by using the slightly higher-level functions "lex_peek_unichar" and "lex_read_unichar".
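The newline bookkeeping done while consuming text can be modelled in a few lines. This is a standalone sketch: struct toy_reader and toy_read_to are hypothetical names, and the line counter is an assumption about what the bookkeeping involves (the text above only guarantees that the line-start pointer is maintained).

```c
#include <assert.h>
#include <stddef.h>

/* Toy lexer position state; not Perl's real data structure. */
struct toy_reader {
    const char   *bufptr;    /* current lexing position              */
    const char   *bufend;    /* end of the buffer                    */
    const char   *linestart; /* start of the current line            */
    unsigned long line;      /* current line number (an assumption)  */
};

/* Consume text up to ptr, updating the line state for each newline passed. */
void toy_read_to(struct toy_reader *r, const char *ptr)
{
    assert(ptr >= r->bufptr && ptr <= r->bufend);
    while (r->bufptr < ptr) {
        if (*r->bufptr++ == '\n') {
            r->line++;
            r->linestart = r->bufptr; /* the new line starts after the newline */
        }
    }
}
```

With this state kept current, the column of an error is just the distance from linestart to bufptr.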

Discards the first part of the "PL_parser->linestr" buffer, up to ptr. The remaining content of the buffer will be moved, and all pointers into the buffer updated appropriately. ptr must not be later in the buffer than the position of "PL_parser->bufptr": it is not permitted to discard text that has yet to be lexed.

Normally it is not necessary to do this directly, because it suffices to use the implicit discarding behaviour of "lex_next_chunk" and things based on it. However, if a token stretches across multiple lines, and the lexing code has kept multiple lines of text in the buffer for that purpose, then after completion of the token it would be wise to explicitly discard the now-unneeded earlier lines, to avoid future multi-line tokens growing the buffer without bound.
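Discarding already-lexed text and shifting every position to match can be sketched with a standalone model. The names struct toy_discardbuf and toy_discard_to are hypothetical, offsets replace the real char* pointers, and clamping the line start to the front of the buffer is a choice made for this sketch only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy lexer buffer addressed by offsets; not Perl's real data structure. */
struct toy_discardbuf {
    char   text[64];  /* NUL-terminated contents               */
    size_t len;       /* octets in use, excluding the NUL      */
    size_t pos;       /* offset of the current lexing position */
    size_t linestart; /* offset of the start of the line       */
};

/*
 * Drop the first `upto` octets of the buffer and shift every offset.
 * Returns 0 on success, -1 if that would discard text not yet lexed.
 */
int toy_discard_to(struct toy_discardbuf *b, size_t upto)
{
    assert(b->pos <= b->len);
    if (upto > b->pos)
        return -1;

    /* Move the remaining content, including the NUL, to the front. */
    memmove(b->text, b->text + upto, b->len - upto + 1);
    b->len -= upto;
    b->pos -= upto;
    b->linestart = b->linestart >= upto ? b->linestart - upto : 0;
    return 0;
}
```

Calling this after a multi-line token has been completed returns the buffer to roughly one line's worth of text.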

Reads in the next chunk of text to be lexed, appending it to "PL_parser->linestr". This should be called when lexing code has looked to the end of the current chunk and wants to know more. It is usual, but not necessary, for lexing to have consumed the entirety of the current chunk at this time.

If "PL_parser->bufptr" is pointing to the very end of the current chunk (i.e., the current chunk has been entirely consumed), normally the current chunk will be discarded at the same time that the new chunk is read in. If flags includes LEX_KEEP_PREVIOUS, the current chunk will not be discarded. If the current chunk has not been entirely consumed, then it will not be discarded regardless of the flag.

Returns true if some new text was added to the buffer, or false if the buffer has reached the end of the input text.
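The discard rule described above can be modelled on its own. In this sketch the input is an array of lines rather than a real stream, and struct toy_chunker, toy_next_chunk and TOY_KEEP_PREVIOUS are hypothetical stand-ins; only the decision about when the old chunk is dropped is meant to mirror the text.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_KEEP_PREVIOUS 1u /* stands in for LEX_KEEP_PREVIOUS */

/* Toy lexer fed from an array of lines; not Perl's real data structure. */
struct toy_chunker {
    const char **lines;     /* remaining input, terminated by NULL   */
    char         text[128]; /* NUL-terminated buffer contents        */
    size_t       len;       /* octets in use, excluding the NUL      */
    size_t       pos;       /* offset of the current lexing position */
};

/* Append the next line. Returns 1 if text was added, 0 at end of input. */
int toy_next_chunk(struct toy_chunker *c, unsigned flags)
{
    const char *next = *c->lines;
    size_t n;

    if (next == NULL)
        return 0;                        /* end of the input text */

    /* Discard only a fully consumed chunk, and only if not asked to keep it. */
    if (c->pos == c->len && !(flags & TOY_KEEP_PREVIOUS))
        c->len = c->pos = 0;

    n = strlen(next);
    assert(c->len + n < sizeof c->text); /* fixed capacity in this sketch */
    memcpy(c->text + c->len, next, n + 1);
    c->len += n;
    c->lines++;
    return 1;
}
```

A lexer in the middle of a multi-line token would pass the keep flag, so that the start of the token is still in the buffer when the next line arrives.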

Looks ahead one (Unicode) character in the text currently being lexed. Returns the codepoint (unsigned integer value) of the next character, or -1 if lexing has reached the end of the input text. To consume the peeked character, use "lex_read_unichar".

If the next character is in (or extends into) the next chunk of input text, the next chunk will be read in. Normally the current chunk will be discarded at the same time, but if flags includes LEX_KEEP_PREVIOUS then the current chunk will not be discarded.

If the input is being interpreted as UTF-8 and a UTF-8 encoding error is encountered, an exception is generated.

Reads the next (Unicode) character in the text currently being lexed. Returns the codepoint (unsigned integer value) of the character read, and moves "PL_parser->bufptr" past the character, or returns -1 if lexing has reached the end of the input text. To non-destructively examine the next character, use "lex_peek_unichar" instead.

If the next character is in (or extends into) the next chunk of input text, the next chunk will be read in. Normally the current chunk will be discarded at the same time, but if flags includes LEX_KEEP_PREVIOUS then the current chunk will not be discarded.

If the input is being interpreted as UTF-8 and a UTF-8 encoding error is encountered, an exception is generated.
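Peeking and reading one character under either interpretation of the buffer can be illustrated with a standalone decoder. This is a simplified sketch, not the real functions: the names are hypothetical, no further chunk is read when the buffer runs out, a malformed sequence is reported with a distinct return value instead of an exception, and overlong encodings and surrogates are not rejected.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_EOF      (-1L) /* end of the buffer reached        */
#define TOY_BAD_UTF8 (-2L) /* malformed sequence (sketch only) */

/* Toy view of the lexer buffer; not Perl's real data structure. */
struct toy_charbuf {
    const unsigned char *bufptr;  /* current lexing position          */
    const unsigned char *bufend;  /* end of the buffer                */
    int                  is_utf8; /* nonzero: UTF-8, zero: Latin-1    */
};

/* Decode the character at bufptr without consuming it; its length goes to *width. */
long toy_peek_unichar(const struct toy_charbuf *b, size_t *width)
{
    const unsigned char *s = b->bufptr;
    size_t avail, need, i;
    long cp;

    assert(b->bufptr <= b->bufend);
    avail = (size_t)(b->bufend - s);
    if (avail == 0)
        return TOY_EOF;
    if (!b->is_utf8 || s[0] < 0x80) {
        *width = 1;                /* Latin-1 octet or ASCII */
        return s[0];
    }

    if      ((s[0] & 0xE0) == 0xC0) { need = 2; cp = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; cp = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; cp = s[0] & 0x07; }
    else return TOY_BAD_UTF8;      /* stray continuation or invalid lead octet */

    if (avail < need)
        return TOY_BAD_UTF8;       /* sequence is cut off by the buffer end */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return TOY_BAD_UTF8;
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *width = need;
    return cp;
}

/* Decode the character at bufptr and move past it. */
long toy_read_unichar(struct toy_charbuf *b)
{
    size_t width = 0;
    long cp = toy_peek_unichar(b, &width);

    if (cp >= 0)
        b->bufptr += width;
    return cp;
}
```

The same two octets 0xC3 0xA9 yield one character (U+00E9) in UTF-8 mode and two characters in Latin-1 mode, which is why the interpretation flag has to be consulted rather than guessed.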

Reads optional spaces, in Perl style, in the text currently being lexed. The spaces may include ordinary whitespace characters and Perl-style comments. #line directives are processed if encountered. "PL_parser->bufptr" is moved past the spaces, so that it points at a non-space character (or the end of the input text).

If spaces extend into the next chunk of input text, the next chunk will be read in. Normally the current chunk will be discarded at the same time, but if flags includes LEX_KEEP_PREVIOUS then the current chunk will not be discarded.
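Skipping Perl-style space, meaning whitespace plus comments running to the end of the line, can be sketched as a standalone loop. The names struct toy_spacer and toy_read_space are hypothetical; this model works on a single in-memory chunk, does not process #line directives, and keeps a line counter as an assumed form of the newline bookkeeping.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Toy lexer position state; not Perl's real data structure. */
struct toy_spacer {
    const char   *bufptr;    /* current lexing position             */
    const char   *bufend;    /* end of the buffer                   */
    const char   *linestart; /* start of the current line           */
    unsigned long line;      /* current line number (an assumption) */
};

/* Move bufptr past whitespace and '#' comments, tracking newlines. */
void toy_read_space(struct toy_spacer *sp)
{
    const char *s = sp->bufptr;

    assert(s <= sp->bufend);
    while (s < sp->bufend) {
        if (*s == '#') {
            /* A comment runs to the end of the line; the newline is handled below. */
            while (s < sp->bufend && *s != '\n')
                s++;
        } else if (*s == '\n') {
            s++;
            sp->line++;
            sp->linestart = s;
        } else if (isspace((unsigned char)*s)) {
            s++;
        } else {
            break;           /* first non-space character */
        }
    }
    sp->bufptr = s;
}
```

After the call, bufptr rests on the first character that is neither whitespace nor part of a comment, or on bufend if nothing else remains in the chunk.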

Parse a Perl arithmetic expression. This may contain operators of precedence down to the bit shift operators. The expression must be followed (and thus terminated) either by a comparison or lower-precedence operator or by something that would normally terminate an expression such as semicolon. If flags includes PARSE_OPTIONAL then the expression is optional, otherwise it is mandatory. It is up to the caller to ensure that the dynamic parser state ("PL_parser" et al) is correctly set to reflect the source of the code to be parsed and the lexical context for the expression.

The op tree representing the expression is returned. If an optional expression is absent, a null pointer is returned, otherwise the pointer will be non-null.

If an error occurs in parsing or compilation, in most cases a valid op tree is returned anyway. The error is reflected in the parser state, normally resulting in a single exception at the top level of parsing which covers all the compilation errors that occurred. Some compilation errors, however, will throw an exception immediately.

Parse a Perl term expression. This may contain operators of precedence down to the assignment operators. The expression must be followed (and thus terminated) either by a comma or lower-precedence operator or by something that would normally terminate an expression such as semicolon. If flags includes PARSE_OPTIONAL then the expression is optional, otherwise it is mandatory. It is up to the caller to ensure that the dynamic parser state ("PL_parser" et al) is correctly set to reflect the source of the code to be parsed and the lexical context for the expression.

The op tree representing the expression is returned. If an optional expression is absent, a null pointer is returned, otherwise the pointer will be non-null.

If an error occurs in parsing or compilation, in most cases a valid op tree is returned anyway. The error is reflected in the parser state, normally resulting in a single exception at the top level of parsing which covers all the compilation errors that occurred. Some compilation errors, however, will throw an exception immediately.

Parse a Perl list expression. This may contain operators of precedence down to the comma operator. The expression must be followed (and thus terminated) either by a low-precedence logic operator such as "or", or by something that would normally terminate an expression such as semicolon. If flags includes PARSE_OPTIONAL then the expression is optional, otherwise it is mandatory. It is up to the caller to ensure that the dynamic parser state ("PL_parser" et al) is correctly set to reflect the source of the code to be parsed and the lexical context for the expression.

The op tree representing the expression is returned. If an optional expression is absent, a null pointer is returned, otherwise the pointer will be non-null.

If an error occurs in parsing or compilation, in most cases a valid op tree is returned anyway. The error is reflected in the parser state, normally resulting in a single exception at the top level of parsing which covers all the compilation errors that occurred. Some compilation errors, however, will throw an exception immediately.

Parse a single complete Perl expression. This allows the full expression grammar, including the lowest-precedence operators such as "or". The expression must be followed (and thus terminated) by a token that an expression would normally be terminated by: end-of-file, closing bracketing punctuation, semicolon, or one of the keywords that signals a postfix expression-statement modifier. If flags includes PARSE_OPTIONAL then the expression is optional, otherwise it is mandatory. It is up to the caller to ensure that the dynamic parser state ("PL_parser" et al) is correctly set to reflect the source of the code to be parsed and the lexical context for the expression.

The op tree representing the expression is returned. If an optional expression is absent, a null pointer is returned, otherwise the pointer will be non-null.

If an error occurs in parsing or compilation, in most cases a valid op tree is returned anyway. The error is reflected in the parser state, normally resulting in a single exception at the top level of parsing which covers all the compilation errors that occurred. Some compilation errors, however, will throw an exception immediately.
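The four expression parsers above differ only in how far down the precedence table they are willing to go, and each stops at the first operator below its limit. That idea can be shown with a toy precedence-climbing evaluator. Everything here is hypothetical and far smaller than Perl's grammar: three left-associative operators on unsigned decimal numbers, no whitespace, and a computed value returned in place of an op tree.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Binding strength of the toy operators; higher binds tighter, 0 means "not an operator". */
int toy_prec(char op)
{
    switch (op) {
    case '*': return 3;
    case '+': return 2;
    case '<': return 1;
    default:  return 0;
    }
}

/* Read an unsigned decimal number and advance *s past it. */
long toy_number(const char **s)
{
    long v = 0;

    while (isdigit((unsigned char)**s))
        v = v * 10 + (*(*s)++ - '0');
    return v;
}

/*
 * Evaluate an expression using only operators whose precedence is at
 * least `floor`. The first weaker operator, or any non-operator, ends
 * the expression and is left unconsumed at *s.
 */
long toy_parse_expr(const char **s, int floor)
{
    long lhs, rhs;

    assert(s != NULL && *s != NULL);
    lhs = toy_number(s);
    for (;;) {
        char op = **s;
        int  p  = toy_prec(op);

        if (p == 0 || p < floor)
            return lhs;                   /* terminator stays in the input */
        (*s)++;
        rhs = toy_parse_expr(s, p + 1);   /* p + 1 makes the operator left-associative */
        switch (op) {
        case '*': lhs = lhs * rhs; break;
        case '+': lhs = lhs + rhs; break;
        case '<': lhs = lhs < rhs; break;
        }
    }
}
```

Given the input 1+2*3<9, a floor of 2 evaluates 1+2*3 and leaves the position at the < operator for the caller to deal with, while a floor of 1 consumes the comparison as well.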

Parse a single complete Perl code block. This consists of an opening brace, a sequence of statements, and a closing brace. The block constitutes a lexical scope, so "my" variables and various compile-time effects can be contained within it. It is up to the caller to ensure that the dynamic parser state ("PL_parser" et al) is correctly set to reflect the source of the code to be parsed and the lexical context for the statement.

The op tree representing the code block is returned. This is always a real op, never a null pointer. It will normally be a lineseq list, including nextstate or equivalent ops. No ops to construct any kind of runtime scope are included by virtue of it being a block.

If an error occurs in parsing or compilation, in most cases a valid op tree (most likely null) is returned anyway. The error is reflected in the parser state, normally resulting in a single exception at the top level of parsing which covers all the compilation errors that occurred. Some compilation errors, however, will throw an exception immediately.

The flags parameter is reserved for future use, and must always be zero.

Parse a single unadorned Perl statement. This may be a normal imperative statement or a declaration that has compile-time effect. It does not include any label or other affixture. It is up to the caller to ensure that the dynamic parser state ("PL_parser" et al) is correctly set to reflect the source of the code to be parsed and the lexical context for the statement.

The op tree representing the statement is returned. This may be a null pointer if the statement is null, for example if it was actually a subroutine definition (which has compile-time side effects). If not null, it will be ops directly implementing the statement, suitable to pass to "newSTATEOP". It will not normally include a nextstate or equivalent op (except for those embedded in a scope contained entirely within the statement).

If an error occurs in parsing or compilation, in most cases a valid op tree (most likely null) is returned anyway. The error is reflected in the parser state, normally resulting in a single exception at the top level of parsing which covers all the compilation errors that occurred. Some compilation errors, however, will throw an exception immediately.

The flags parameter is reserved for future use, and must always be zero.

Parse a single label, possibly optional, of the type that may prefix a Perl statement. It is up to the caller to ensure that the dynamic parser state ("PL_parser" et al) is correctly set to reflect the source of the code to be parsed. If flags includes PARSE_OPTIONAL then the label is optional, otherwise it is mandatory.

The name of the label is returned in the form of a fresh scalar. If an optional label is absent, a null pointer is returned.

If an error occurs in parsing, which can only occur if the label is mandatory, a valid label is returned anyway. The error is reflected in the parser state, normally resulting in a single exception at the top level of parsing which covers all the compilation errors that occurred.

Parse a single complete Perl statement. This may be a normal imperative statement or a declaration that has compile-time effect, and may include optional labels. It is up to the caller to ensure that the dynamic parser state ("PL_parser" et al) is correctly set to reflect the source of the code to be parsed and the lexical context for the statement.

The op tree representing the statement is returned. This may be a null pointer if the statement is null, for example if it was actually a subroutine definition (which has compile-time side effects). If not null, it will be the result of a "newSTATEOP" call, normally including a nextstate or equivalent op.

If an error occurs in parsing or compilation, in most cases a valid op tree (most likely null) is returned anyway. The error is reflected in the parser state, normally resulting in a single exception at the top level of parsing which covers all the compilation errors that occurred. Some compilation errors, however, will throw an exception immediately.

The flags parameter is reserved for future use, and must always be zero.

Parse a sequence of zero or more Perl statements. These may be normal imperative statements, including optional labels, or declarations that have compile-time effect, or any mixture thereof. The statement sequence ends when a closing brace or end-of-file is encountered in a place where a new statement could have validly started. It is up to the caller to ensure that the dynamic parser state ("PL_parser" et al) is correctly set to reflect the source of the code to be parsed and the lexical context for the statements.

The op tree representing the statement sequence is returned. This may be a null pointer if the statements were all null, for example if there were no statements or if there were only subroutine definitions (which have compile-time side effects). If not null, it will be a lineseq list, normally including nextstate or equivalent ops.

If an error occurs in parsing or compilation, in most cases a valid op tree is returned anyway. The error is reflected in the parser state, normally resulting in a single exception at the top level of parsing which covers all the compilation errors that occurred. Some compilation errors, however, will throw an exception immediately.

The flags parameter is reserved for future use, and must always be zero.