The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Parse::RecDescent - Generate Recursive-Descent Parsers

SYNOPSIS

 use Parse::RecDescent;

 # Generate a parser from the specification in $grammar:

         $parser = new Parse::RecDescent ($grammar);

 # Generate a parser from the specification in $othergrammar

         $anotherparser = new Parse::RecDescent ($othergrammar);


 # Parse $text using rule 'startrule' (which must be
 # defined in $grammar):

        $parser->startrule($text);


 # Parse $text using rule 'otherrule' (which must also
 # be defined in $grammar):

         $parser->otherrule($text);


 # Change the universal token separator (prefix) pattern
 # (the default is: '\s*'):

        $Parse::RecDescent::tokensep = '[ \t]+';


 # Replace productions of existing rules (or create new ones)
 # with the productions defined in $newgrammar:

        $parser->Replace($newgrammar);


 # Extend existing rules (or create new ones)
 # by adding extra productions defined in $moregrammar:

        $parser->Extend($moregrammar);


 # Global flags (useful as command line arguments under -s):

        $::RD_ERRORS       # unless undefined, report fatal errors
        $::RD_WARNINGS     # if defined, also report non-fatal problems
        $::RD_HINTS        # if defined, also suggestion remedies
        $::RD_TRACE        # if defined, also trace parsers' behaviour

DESCRIPTION

Overview

Parse::RecDescent incrementally generates top-down recursive-descent text parsers from simple yacc-like grammar specifications. It provides:

  • Regular expressions or literal strings as terminals (tokens),

  • Multiple (non-contiguous) productions for any rule,

  • Repeated and optional subrules within productions,

  • Full access to Perl within actions specified as part of the grammar,

  • Simple automated error reporting during parser generation and parsing,

  • The ability to commit to, uncommit to, or reject particular productions during a parse,

  • Incremental extension of the parsing grammar (even during a parse),

Using Parse::RecDescent

Parser objects are created by calling Parse::RecDescent::new, passing in a grammar specification (see the following subsections). If the grammar is correct, new returns a blessed reference which can then be used to initiate parsing through any rule specified in the original grammar. A typical sequence looks like this:

        $grammar = q {
                        # GRAMMAR SPECIFICATION HERE
                     };

        $parser = new Parse::RecDescent ($grammar) or die "Bad grammar!\n";

        # acquire $text

        defined $parser->startrule($text) or print "Bad text!\n";

If the rule through which parsing is initiated succeeds, its value (see below) is returned. Failure to generate the original parser or failure to match a test is indicated by returning undef.

Rules

In the grammar from which the parser is built, rules are specified by giving an identifier (which must satisfy /[A-Za-z]w*/), followed by a colon, followed by one or more productions, separated by single vertical bars. Rules are entirely free-format:

        rule1:  production1
             |  production2 |
                production3 | production4

        rule2
                : production1
                | production2

At any point in the grammar previously defined rules may be extended with additional productions. This is achieved by redeclaring the rule with the new productions. Thus:

        rule1: a | b | c
        rule2: d | e | f
        rule1: g | h

is exactly equivalent to:

        rule1: a | b | c | g | h
        rule2: d | e | f

Each production in a rule consists of zero or more items, each of which may be either: the name of another rule to be matched (a "subrule"), a pattern or string literal to be matched directly (a "token"), a block of Perl code to be executed (an "action"), a special instruction to the parser (a "directive"), or a standard Perl comment (which is ignored).

A rule matches a text if one of its productions matches. A production matches if each of its items match consecutive substrings of the text. The productions of a rule being matched are tried in the same order that they appear in the original grammar, and the first matching production terminates the match attempt (successfully). If all productions are tried and none matches, the match attempt fails.

Each successfully matched item in a production is assigned a value, which can be accessed in subsequent actions within the same production (or, in some cases, as the return value of a successful subrule call). Unsuccessful items don't have an associated value, since the failure of an item causes the entire surrounding production to immediately fail. The following sections describe the various types of items and their success values.

Subrules

A subrule which appears in a production is an instruction to the parser to attempt to match the named rule at that point in the text being parsed. If the named subrule is not defined at that point the production containing it immediately fails.

A rule may (recursively) call itself as a subrule, but not as the left-most item in any of its productions (since such a recursion would never terminate).

The value associated with a subrule is the value associated with its $return variable (see "Actions" below), or with the last successfully matched item in the subrule match.

Subrules may also be specified with a trailing repetition specifier, indicating that they are to be (greedily) matched the specified number of times. The available specifiers are:

                subrule(?)      # Match one-or-zero times
                subrule(s)      # Match one-or-more times
                subrule(s?)     # Match zero-or-more times
                subrule(N)      # Match exactly N times for integer N > 0
                subrule(N..M)   # Match between N and M times
                subrule(..M)    # Match between 1 and M times
                subrule(N..)    # Match at least N times

Since a repeated subrule may match many instances of the subrule itself, the value associated with it is not a simple scalar, but rather a reference to a list of scalars, each of which is the value associated with one of the individual subrule matches. In other words in the rule:

                program: statement(s)

the value associated with the repeated subrule "statement(s)" is a reference to an array containing the values matched by each call to the individual subrule "statement".

Tokens

If a quote-delimited string or a Perl regex appears in a production, the parser attempts to match that string or pattern at that point in the text. For example:

                typedef: "typedef" typename identifier ';'

                identifier: /[A-Za-z_][A-Za-z0-9_]*/

As in regular Perl, a single quoted string is uninterpolated, whilst a double-quoted string or a pattern is interpolated (at the time of matching, not when the parser is constructed). Hence, it is possible to define rules in which tokens can be set at run-time:

                typedef: "$::typedefkeyword" typename identifier ';'

                identifier: /$::identpat/

Note that, since each rule is implemented inside a special namespace belonging to its parser, it is necessary to explicitly quantify variables from the main package.

Regex tokens can be specified using just backslashes as delimiters or with the explicit m<delimiter>......<delimiter> syntax:

                typedef: "typedef" typename identifier ';'

                identifier: m{[A-Za-z_][A-Za-z0-9_]*}

A regex of either type can also have any valid trailing parameter(s) (that is, any of [gimsox]):

                typedef: "typedef" typename identifier ';'

                identifier: / [a-z_]            # LEADING ALPHA OR UNDERSCORE
                              [a-z0-9_]*        # THEN DIGITS ALSO ALLOWED
                            /ix                 # CASE/SPACE/COMMENT INSENSITIVE

The value associated with any successfully matched token is a string containing the actual text which was matched by the token.

Token separators

For the purpose of matching, each token in a production is considered to be preceded by a "token separator" - a pattern which must be matched before a token match is attempted. By default, the token separator is optional whitespace (which always matches, at least trivially), but this default may be reset at several different levels.

The variable Parse::RecDescent::tokensep stores the universal token separator, which is the default for all token matches in all parsers build with Parse::RecDescent.

The token separator for an individual parser ($psr) can be reset by changing $psr->{"tokensep"} (or $thisparser->{"tokensep"} within a grammar action - see "Actions below).

Likewise the token separators for individual rules or even individual productions can be changed by altering the corresponding {"tokensep"} member of the appropriate sub-object (see the description of $thisrule and $thisprod in the next section.)

To remove a redefinition of the token separator for a particular, simply assign it an undefined value. Its value will then revert to the that of the next higher level (for example, undefining a production's token separator causes it to default to the enclosing rule's separator which may in turn default to the parser's, etc.). If the universal token separator Parse::RecDescent::tokensep is undefined, its default value reverts to the original default of optional whitespace.

Actions

An action is a block of Perl code which is to be executed (as the block of a do statement) when the parser reaches that point in a production. The action executes within a special namespace belonging to the active parser, so care must be taken in correctly qualifying variable names.

The action is considered to succeed if the final value of the block is defined (that is, if the implied do statement evaluates to a defined value - even one which would be treated as "false"). Note that the value associated with a successful action is also the final value in the block.

Within an action, a number of useful parse-time variables are available in the special parser namespace (there are other variables also accessible, but meddling with them will probably just break your parser. As a general rule, if you avoid referring to unquantified variables - especially those starting with an underscore - inside an action, things should be okay):

@item

The array slice @item[1..$#item] stores the value associated with each item (that is, each subrule, token, or action) in the current production. The analogy is to $1, $2, etc. in a yacc grammar.

The first element ($item[0]) stores the name of the current rule being matched. Note that, for obvious reasons, @item only contains the values of items before the current point in the production.

$return

If a value is assigned to $return within an action, that value is returned if the production containing the action eventually matches successfully. Note that setting $return doesn't cause the current production to succeed. It merely tells it what to return if it does succeed. Hence $return is analogous to $$ in a yacc grammar.

If $return is not assigned within a production, the value of the last component of the production (namely: $item[$#item]) is returned if the production succeeds.

$commit

The current state of commitment to the current production (see "Directives" below).

$text

The remaining (unparsed) text. Changes to $text do not propagate out of the current production.

$thisline

The current line number within the current parse (starting from 1). For efficiency, $thisline is actually a tied hash, and only recomputes the current line number when the variable's value is used. Assignment to $thisline has no effect.

$thisparser

A reference to the Parse::RecDescent object through which parsing was initiated. Almost never used except to reset the default token separator for the entire parser. For example:

                no_tabs: { $thisparser->{tokensep} = '[ \n]*' }
                         rules
                         { $thisparser->{tokensep} = undef }>>

The value of $thisparser propagates down the subrules of a parse but not back up. Hence, you can invoke subrules from another parser for the scope of the current rule as follows:

                rule: subrule1 subrule2
                    | { $thisparser = $::otherparser } <reject>
                    | subrule3 subrule4
                    | subrule5

The result is that the production calls "subrule1" and "subrule2" of the current parser, and the remaining productions call the named subrules from $::otherparser. Note, however that "Bad Things" will happen if ::otherparser isn't a blessed reference and/or doesn't have methods with the same names as the required subrules!

$thisrule

A reference to the Parse::RecDescent::Rule object corresponding to the rule currently being matched. Generally only used to (permanently) change the token separator for a rule. For example:

                commasep: { $thisrule->{tokensep} = '\s*,\s*' } <reject>
                        | cmdcode /[A-Za-z]{20}/ /\d{5,10}/
                        | cmdcode /\d{10}/ /\d{5,10}/

Note that a rule's token separator is only used before the explicit terminals in the rule (such as \d{10} in the above example). Subrules (like "cmdcode") use their own token separators.

$thisprod

A reference to the Parse::RecDescent::Production object corresponding to the production currently being matched. Again, not generally used, except to change the token separator $thisprod->{tokensep} for a single production.

Warning: the parser relies on the information in the this... objects in some non-obvious ways. Tinkering with the other members of these objects will probably cause Bad Things to happen, unless you really know what you're doing.

Look-ahead

If a subrule, token, or action is prefixed by "...", then it is treated as a "look-ahead" request. That means that the current production can (as usual) only succeed if the specified item is matched, but that the matching does not consume any of the text being parsed. This is very similar to the /(?=...)/ look-ahead construct in Perl patterns. Thus, the rule:

                inner_word: word ...word

will match whatever the subrule "word" matches, provided that match is followed by some more text which would subrule "word" would also match (although this second substring is not actually consumed by "inner_word")

Likewise, a "...!" prefix, causes the following item to succeed (without consuming any text) if and only if it would normally fail. Hence, a rule such as:

                identifier: ...!keyword ...!'_' /[A-Za-z_]\w*/

matches a string of characters which satisfies the pattern /[A-Za-z_]\w*/, but only if the same sequence of characters would not match either subrule "keyword" or the literal token '_'.

Sequences of look-ahead prefixes accumulate, multiplying their positive and/or negative senses. Hence:

                inner_word: word ...!......!word

is exactly equivalent the the original example above (a warning is issued in cases like these, since they often indicate something left out, or misunderstood).

Note that actions can also be treated as look-aheads. In such cases, the state of the parser text (in the local variable $text) after the look-ahead action is guaranteed to be identical to its state before the action, regardless of how it's changed within the action (unless you actually undefine $text, in which case you get the disaster you deserve :-).

Directives

Directives are special pre-defined actions which may be used to alter the behaviour of the parser. There are five directives: <commit>, <uncommit>, <reject>, <resync>, and <error>:

Committing and uncommitting

The <commit> and <uncommit> directives permit the recursive descent of the parse tree to be pruned (or "cut") for efficiency. Within a rule, a <commit> directive instructs the rule to ignore subsequent productions if the current production fails. For example:

                command: 'find' <commit> filename
                       | 'open' <commit> filename
                       | 'move' filename filename

Clearly, if the leading token 'find' is matched in the first production but that production fails for some other reason, then the remaining productions cannot possibly match. The presence of the <commit> causes the "command" rule to fail immediately if an invalid "find" command is found, and likewise if an invalid "open" command is encountered.

It is also possible to revoke a previous commitment. For example:

                if_statement: 'if' <commit> condition
                                        'then' block <uncommit>
                                        'else' block
                            | 'if' <commit> condition
                                        'then' block

In this case, a failure to find an "else" block in the first production shouldn't preclude trying the second production, but a failure to find a "condition" certainly should.

As a special case, any production in which the first item is an <uncommit> immediately revokes a preceding <commit> (even though the production would not otherwise have been tried). For example, in the rule:

                request: 'explain' expression
                       | 'explain' <commit> keyword
                       | 'save'
                       | 'quit'
                       | <uncommit> term '?'

if the text being matched was "explain?", and the first two productions failed, then the <commit> in production two would cause productions three and four to be skipped, but the leading <uncommit> in the production five would allow that production to attempt a match.

Note in the preceding example, that the <commit> was only placed in production two. If production one had been:

                request: 'explain' <commit> expression 

then production two would be (inappropriately) skipped if a leading "explain..." was encountered.

Both <commit> and <uncommit> directives always succeed, and their value is always 1.

Rejecting a production

The <reject> directive immediately causes the current production to fail (it is exactly equivalent to, but more obvious than, the action {undef}). A <reject> is useful when it is desirable to get the side effects of the actions in one production, without prejudicing a match by some other production later in the rule. For example, to insert tracing code into the parse:

                complex_rule: { print "In complex rule...\n"; } <reject>
                            
                complex_rule: simple_rule '+' 'i' '*' simple_rule
                            | 'i' '*' simple_rule
                            | simple_rule

It is also possible to specify a conditional rejection, using the form <reject:condition>, which only rejects if the specified condition is true. This form of rejection is exactly equivalent to the action {(condition)?undef:1}>. For example:

                command: save_command
                       | restore_command
                       | <reject: defined $::tolerant> { exit }
                       | <error: Unknown command. Ignored.>

A <reject> directive never succeeds (and hence has no associated value). A conditional rejection may succeed (if its condition is not satisfied), in which case its value is 1.

Resynchronization

The <resync> directive provides a visually distinctive means of consuming some of the text being parsed, usually to skip an erroneous input. In its simplest form <resync> simply consumes text up to and including the next newline ("\n") character, succeeding only if the newline is found, in which case it causes its surrounding rule to return zero on success.

In other words, a <resync> is exactly equivalent to the token /[^\n]*\n/ followed by the action { $return = 0 } (except that productions beginning with a <resync> are ignored when generating error messages). A typical use might be:

                script : command(s)

                command: save_command
                       | restore_command
                       | <error: Unknown command. Ignored.>
                       | <resync> # TRY NEXT LINE, IF POSSIBLE

It is also possible to explicitly specify a resynchronization pattern, using the <resync:pattern> variant. This version succeeds only if the specified pattern matches (and consumes) the parsed text. In other words, <resync:pattern> is exactly equivalent to the token /pattern/ (followed by a { $return = 0 } action). For example, if commands were terminated by newlines or semi-colons:

                command: save_command
                       | restore_command
                       | <error: Unknown command. Ignored.>
                       | <resync:[^;\n]*[;\n]>

The value of a successfully matched <resync> directive (of either type) is the text that it consumed. Note, however, that since the directive also sets $return, a production consisting of a lone <resync> succeeds but returns the value zero (which a calling rule may find useful to distinguish between "true" matches and "tolerant" matches). Remember that returning a zero value indicates that the rule succeeded (since only an undef denotes failure within Parse::RecDescent parsers.

Error handling

The <error> directive provides automatic or user-defined generation of error messages during a parse. In its simplest form <error> unconditionally generates an error message based on the mismatch between the last item expected and the text which cause it to fail. For example, given the rule:

                McCoy: curse ',' name ', I'm a doctor, not a' a_profession '!'
                     | possessive 'dead,' name '!'
                     | <error>

the following strings would produce the following messages:

"Amen, Jim!"
               ERROR (line 1): Invalid McCoy: Expected curse or possessive
                               not found
"Dammit, Jim, I'm a doctor!"
               ERROR (line 1): Invalid McCoy: Expected ", I'm a doctor, not a"
                               but found ", I'm a doctor!" instead
"He's dead,\n"
               ERROR (line 2): Invalid McCoy: Expected name not found
"He's alive!"
               ERROR (line 1): Invalid McCoy: Expected 'dead,' but found
                               "alive!" instead
"Dammit, Jim, I'm a doctor, not a pointy-eared Vulcan!"
               ERROR (line 1): Invalid McCoy: Expected a profession but found
                               "pointy-eared Vulcan!" instead

Note that, when autogenerating error messages, all underscores in any rule name used in a message are replaced by single spaces (for example "a_production" becomes "a production"). Judicious choice of rule names can therefore considerably improve the readability of automatic error messages (as well as the maintainability of the original grammar).

If the automatically generated error is not sufficient, it is possible to provide an explicit message as part of the error directive. For example:

                Spock: "Fascinating ',' (name | 'Captain') '.'
                     | "Highly illogical, doctor."
                     | <error: He never said that!>

which would result in all failures to parse a "Spock" subrule printing the following message:

               ERROR (line <N>): Invalid Spock:  He never said that!

The error message is treated as a "qq{...}" string and interpolated when the error is generated (not when the directive is specified!). Hence:

                <error: Mystical error near "$text">

would correctly insert the ambient text string which caused the error.

There are two other forms of error directive: <error?> and <error?: msg>. These behave just like <error> and <error: msg> respectively, except that they are only triggered if the rule is "committed" at the time they are encountered. For example:

                Scotty: "Ya kenna change the Laws of Phusics," <commit> name
                      | name <commit> ',' 'she's goanta blaw!'
                      | <error?>

will only generate an error for a string beginning with "Ya kenna change the Laws o' Phusics," or a valid name, but which still fails to match the corresponding production. That is, $parser->Scotty("Aye, Cap'ain") will fail silently (since neither production will "commit "the rule on that input), whereas $parser->Scotty("Mr Spock, ah jest kenna do'ut!")> will fail with the error message:

               ERROR (line 1): Invalid Scotty: expected 'she's goanta blaw!'
                               but found 'I jest kenna do'ut!' instead.

since in that case the second production would commit after matching the leading name.

Note that to allow this behaviour, all <error> directives which are the first item in a production uncommit the rule just long enough to allow their production to be attempted (that is, when their production fails, the commitment is reinstated so that subsequent productions are skipped).

In order to permanently uncommit the rule before an error message, it is necessary to put an explicit <uncommit> before the <error>. For example:

                line: 'Kirk:'  <commit> Kirk
                    | 'Spock:' <commit> Spock
                    | 'McCoy:' <commit> McCoy
                    | <uncommit> <error?> <reject>
                    | <resync>

There is one situation in which the output of the various types of error directive is suppressed; namely, when the rule containing them is being parsed as part of a "look-ahead" (see "Look-ahead"). In this case, the error directive will still cause the rule to fail, but will do so silently.

An unconditional <error> directive always fails (and hence has no associated value). This means that encountering such a directive always causes the production containing it to fail. Hence an <error> directive will inevitably be the last (useful) item of a rule (a warning is issued if a production contains items after an unconditional <error> directive).

An <error?> directive will succeed (that is: fail to fail :-), if the current rule is uncommitted when the directive is encountered. In that case the directive's associated value is zero. Hence, this type of error directive can usefully be used before the end of a production. For example:

                command: 'do' <commit> something
                       | 'report' <commit> something
                       | <error?: Syntax error> <error: Unknown command>

Warning: The <error?> directive does not mean "always fail (but do so silently unless committed)". It actually means "only fail (and report) if committed, otherwise succeed". To achieve the "fail silently if uncommitted" semantics, it is necessary to use:

                rule: item <commit> item(s)
                    | <error?> <reject>      # FAIL SILENTLY UNLESS COMMITTED

The level of reporting during both parser construction and parsing is controlled by the presence or absence of five global variables: $::RD_ERRORS, $::RD_WARNINGS, $::RD_HINTS, and <$::RD_TRACE>. If $::RD_ERRORS is defined (and by default, it is) then fatal errors are reported.

Defining any of the remaining variables (which are not defined by default) cumulatively increases the amount of information reported. Defining $::RD_WARN causes non-fatal errors to also be reported, whilst defining $::RD_HINT causes the parser generator to offer more detailed analyses and hints on both errors and warnings.

Defining $::RD_TRACE causes the parser generator and the parser to report their progress to STDERR in excruciating detail. This is mainly useful for debugging a grammar that isn't behaving as you expected it to. To this end, if $::RD_TRACE is defined when a parser is built, any actual parser code which is generated is also written to a file named "RD_TRACE" in the local directory.

Note that the four variables belong to the "main" package, which makes them easier to refer to in the code controlling the parser, and also makes it easy to turn them into command line flags ("-RD_ERROR", "-RD_WARN", "-RD_HINT", "-RD_TRACE") under "perl -s".

Alternations

Alternations are implicit (unnamed) rules defined as part of a production. An alternation is defined as a series of '|'-separated productions inside a pair of round brackets. For example:

                character: 'the' ( good | bad | ugly ) /dude/

Every alternation implicitly defines a new subrule (imaginatively named "implicit_subrule_<N>" for increasing values of <N> starting at "000001"), and inserts a call to that implicit subrule in place of the brackets. Hence the above example is merely a convenient short-hand for:

                character: 'the' implicit_subrule_000001 /dude/

                implicit_subrule_000001: good | bad | ugly

Since alternations are parsed by recursively calling the parser generator, any type(s) of item can appear in an alternation. For example:

                character: 'the' ( 'high' "plains"      # Silent, with poncho
                                 | /no[- ]name/         # Silent, no poncho
                                 | vengeance_seeking    # Poncho-optional
                                 | <error>              
                                 ) drifter

In this case, if an error occurred, the automatically generated message would be:

                ERROR (line <N>): Invalid implicit subrule: Expected
                                  'high' or /no[- ]name/ or generic,
                                  but found "pacifist" instead

Since every implicit subrule actually has a name, it's even possible to extend or replace them:

                parser->Replace("implicit_subrule_000001: 'generic Eastwood'");

Incremental parsing

Parse::RecDescent provides two methods - Extend and <Replace> - which can be used to alter the grammar matched by a parser. Both methods take the same argument as Parse::RecDescent::new, namely a grammar specification string

Parse::RecDescent::Extend interprets the grammar specification and adds any productions it finds to the end of the rules for which they are specified. For example:

                $add = "name: 'Jimmy-Bob' | 'Bobby-Jim'\ndesc: colour /necks?/";
                parser->Extend($add);

adds two productions to the rule "name" (creating it if necessary) and one production to the rule "desc".

Parse::RecDescent::Replace is identical, except that it first resets are rule specified in the additional grammar, removing any existing productions. Hence after:

                $add = "name: 'Jimmy-Bob' | 'Bobby-Jim'\ndesc: colour /necks?/";
                parser->Replace($add);

are are only only valid "name"s and one possible description. adds two productions to the rule "name" (creating it if necessary) and one production to the rule "desc".

A more interesting use of the Extend and Replace methods is to call them inside the action of an executing parser. For example:

                typedef: 'typedef' type_name identifier ';'
                               { $thisparser->Extend("type_name: '$item[3]'") }
                       | <error>

                identifier: ...!type_name /[A-Za-z_]w*/

which automatically prevents type names from being typedef'd, or:

                command: 'map' key_name 'to' abort_key
                               { $thisparser->Replace("abort_key: '$item[2]'") }
                       | 'map' key_name 'to' key_name
                               { map_key($item[2],$item[4]) }
                       | abort_key
                               { exit if confirm("abort?") }

                abort_key: 'q'

                key_name: ...!abort_key /[A-Za-z]/

which allows the user to change the abort key binding, but not to unbind it.

The careful use of such constructs makes it possible to reconfigure a a running parser, eliminating the need for semantic feedback by providing syntactic feedback instead.

A metagrammar for Parse::RecDescent

The following is a specification of grammar format accepted by Parse::RecDescent::new (specified in the Parse::RecDescent grammar format!):

        grammar    : components(s)

        component  : rule | comment

        rule       : "\n" identifier ":" production(s?)
        
        production : items(s) 

        item       : lookahead(?) simpleitem
                   | directive
                   | comment

        lookahead  : '...' | '...!'                   # +'ve or -'ve lookahead

        simpleitem : subrule                          # match another rule
                   | repetition                       # match repeated subrules
                   | terminal                         # match the next input
                   | bracket                          # match alternative items
                   | action                           # do something

        subrule    : identifier                       # the name of the rule

        repetition : subrule howoften

        howoften   : '(?)'                            # 0 or 1 times
                   | '(s?)'                           # 0 or more times
                   | '(s)'                            # 1 or more times
                   | /(\d+)[.][.](/\d+)/              # $1 to $2 times
                   | /[.][.](/\d*)/                   # at most $1 times
                   | /(\d*)[.][.])/                   # at least $1 times

        terminal   : /[/]([\][/]|[^/])*[/]/           # interpolated pattern
                   | /"([\]"|[^"])*"/                 # interpolated literal
                   | /'([\]'|[^'])*'/                 # uninterpolated literal

        action     : '{' {extract_codeblock()} '}'    # embedded Perl code

        bracket    : '(' item(s) production(s?) ')'   # alternative subrules

        directive  : '<commit>'                       # commit to production
                   | '<uncommit>'                     # cancel commitment
                   | '<resync>'                       # skip to newline
                   | '<resync: pattern '>'            # skip <pattern>
                   | '<reject>'                       # fail this production
                   | '<reject:' condition '>'         # fail if <condition>
                   | '<error>'                        # report an error
                   | '<error:' msg '>'                # report error as "<msg>"
                   | '<error?>'                       # error only if committed
                   | '<error?:' msg '>'               #   "    "    "    "

        identifier : /[A-Za-z]\w*/                    # must start with alpha

        comment    : /#[^\n]*/                        # same as Perl

        pattern    : { extract_bracketed('<') }       # allow embedded "<..>"

        condition  : { extract_codeblock('{<') }      # full Perl expression

        msg        : { extract_bracketed('<') }       # allow embedded "<..>"

DIAGNOSTICS

Diagnostics are intended to be self-explanatory (particularly if you use -RD_HINT (under "perl -s") or define $::RD_HINT inside the program). Parse::RecDescent currently diagnoses the following problems:

  • Invalid regular expressions used as pattern terminals.

  • Invalid Perl code in code blocks.

  • Lookahead used in the wrong place or in a nonsensical way.

  • "Obvious" cases of left-recursion.

  • Unrecognisable components in the grammar specification.

  • "Orphaned" rule components specified before the first rule or after an <error> directive.

  • Missing rule definitions (this generates a warning only, since you may be providing them later via Parse::RecDescent::Extend()).

  • Attempts to define rules named 'Replace' or 'Extend', which cannot be called directly through the parser object because of the predefined meaning of Parse::RecDescent::Replace and Parse::RecDescent::Extend. (Only a warning is generated, since such rules can still be used as subrules).

  • Multiple consecutive lookahead specifiers (a warning only, since their effects simply accumulate).

  • Instances where greedy repetition behaviour will almost certainly cause the failure of a production (a warning only - see "On-going Issues" below).

AUTHOR

Damian Conway (damian@cs.monash.edu.au)

BUGS AND IRRITATIONS

There are undoubtedly serious bugs lurking somewhere in this much code :-) Bug reports and other feedback are most welcome.

Ongoing annoyances include:

  • There's no support for parsing directly from an input stream. If and when the Perl Gods give us regular expressions on streams, this should be trivial (ahem!) to implement.

  • The parser generator can get confused if actions aren't properly closed or if they contain particularly nasty Perl syntax errors (particularly unmatched curly brackets).

  • The generator only detects the most obvious form of left recursion (potential recursion on the first subrule in a rule). More subtle forms of left recursion (for example, through the second item in a rule after a "zero" match of a preceding "zero-or-more" repetition, or after a match of a subrule with an empty production) are not found.

  • Instead of complaining about left-recursion, the generator should silently transform the grammar to remove it. Don't expect this feature any time soon as it would require a more sophisticated approach to parser generation than is currently used.

  • The generated parsers don't always run as fast as might be wished.

  • The meta-parser should be bootstrapped using Parse::RecDescent :-)

ON-GOING ISSUES AND FUTURE DIRECTIONS

  1. Repetitions are "incorrigibly greedy" in that they will eat everything they can and won't backtrack if that behaviour causes a production to fail needlessly. So, for example:

                    rule: subrule(s) subrule

    will never succeed, because the repetition will eat all the subrules it finds, leaving none to match the second item. Such constructions are relatively rare (and Parse::RecDescent::new generates a warning whenever they occur) so this may not be a problem, especially since the insatiable behaviour can be overcome "manually" by writing:

                    rule: penultimate_subrule(s) subrule
    
                    penultimate_subrule: subrule ...subrule

    The issue is that this construction is exactly twice as expensive as the original, whereas backtracking would add only 1/N to the cost (for matching N repetitions of subrule). I would welcome feedback on the need for backtracking; particularly on cases where the lack of it makes parsing performance problematical.

  2. Having opened that can of worms, it's also necessary to consider whether there is a need for non-greedy repetition specifiers. Again, it's possible (at some cost) to manually provide the required functionality:

                    rule: nongreedy_subrule(s) othersubrule
    
                    nongreedy_subrule: subrule ...!othersubrule

    Overall, the issue is whether the benefit of this extra functionality outweighs the drawbacks of further complicating the (currently minimalist) grammar specification syntax, and (worse) introducing more overhead into the generated parsers.

  3. An <autocommit> directive would be nice. That is, it would be useful to be able to say:

                    command: <autocommit>
                    command: 'find' name
                           | 'find' address
                           | 'do' command 'at' time 'if' condition
                           | 'do' command 'at' time
                           | 'do' command
                           | unusual_command

    and have the generator work out that this should be "pruned" thus:

                    command: 'find' name
                           | 'find' <commit> address
                           | 'do' <commit> command <uncommit>
                                    'at' time
                                    'if' <commit> condition
                           | 'do' <commit> command <uncommit>
                                    'at' <commit> time
                           | 'do' <commit> command
                           | unusual_command

    There are several issues here. Firstly, should the <autocommit> automatically install an <uncommit> at the start of the last production (on the grounds that the "command" rule doesn't know whether an "unusual_command" might start with "find" or "do") or should the "unusual_command" subgraph be analysed (to see if it might be viable after a "find" or "do")?

    The second issue is how regular expressions should be treated. The simplest approach would be simply to uncommit before them (on the grounds that they might match). Better efficiency would be obtained by analyzing all preceding literal tokens to determine whether the pattern would match them.

    Overall, the issues are: can such automated "pruning" approach a hand-tuned version sufficiently closely to warrant the extra set-up expense, and (more importantly) is the problem important enough to even warrant the non-trivial effort of building an automated solution?

COPYRIGHT

       Copyright (c) 1997, Damian Conway. All Rights Reserved.
     This module is free software. It may be used, redistributed
     and/or modified under the terms of the Perl Artistic License
          (see http://www.perl.com/perl/misc/Artistic.html)

3 POD Errors

The following errors were encountered while parsing the POD:

Around line 665:

Unterminated S<...> sequence

Around line 751:

You forgot a '=back' before '=head2'

Around line 969:

You forgot a '=back' before '=head1'