The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Parse::Marpa - (Alpha) Earley's algorithm with LR(0) precomputation

VERSION

WARNING -- THIS IS A DEVELOPER'S RELEASE, NOT FOR USE BY NON-DEVELOPERS. I use these releases to avail myself of the cpantesters results, and to test the release process itself.

Of course, it's open source, and you're entitled to appoint yourself a developer if you insist on it. But that will usually not be a reasonable thing to do.

Right now the documentation in this release is half-converted from one interface to another, so it's incomplete and inconsistent.

SYNOPSIS

    use 5.010_000;
    use strict;
    use warnings;
    use English;
    use Parse::Marpa;

    # remember to use refs to strings
    my $value = Parse::Marpa::mdl(
        (do { local($RS) = undef; my $source = <DATA>; \$source; }),
        \("2+2*3")
    );
    say $$value;

    __DATA__
    semantics are perl5.  version is 0.205.0.  start symbol is Expression.

    Expression: Expression, /[*]/, Expression.  priority 200.  q{
        $_->[0] * $_->[2]
    }.

    Expression: Expression, /[+]/, Expression.  priority 100.  q{
        $_->[0] + $_->[2]
    }.

    Expression: /\d+/.  q{ $_->[0] }.

DESCRIPTION

Parsing Terminology

The Marpa documents assume that you know the standard concepts and terminology of parsing. This section is intended to be read quickly to serve as a reminder. I put defining uses of terms in boldface, for easy skimming. If you are already comfortable with parsing terminology, you could skip this section entirely.

This terse summary of the vocabulary is aimed at those who are mostly familiar with it already. If you aren't, it's probably not a good idea to try to learn the basics of parsing here. As an introduction, I reccommend Mark Jason Dominus's excellent chapter on parsing in the Perl context. Online, Wikipedia is an excellent place to start.

Basic terms

A grammar is a set of rules. The rules describe a set of strings of symbols. A string of symbols is often called a symbol string. The rules of a grammar are often called productions.

Stages of Parsing

A recognizer is a program that determines whether its input is one of the symbol strings in the set described by the rules of a grammar. A parser is a program which finds the structure of the input according to the rules of a grammar.

The term parsing is used in a strict and a loose sense. Parsing in the loose sense means the entire process of finding a grammar's structure, including a separate recognition phase if the parser has one. (Marpa does.) If a parser has phases, parsing in the strict sense refers to the phase that finds the structure of the input. When this document intends the term parsing in its strict sense, it will speak explicitly of "parsing in the strict sense". Otherwise, the term parsing will mean parsing in the loose sense.

Parsers often use a lexical analyzer to convert raw input, usually input text, into a series of tokens. Each token represents a symbol of the grammar and has a value. The series of symbols represented by the series of tokens becomes the symbol string input seen by the recognizer. The symbol string input is also called the input sentence. A lexical analyzer is often called a lexer or a scanner, and lexical analysis is often called lexing or scanning.

Productions

A standard way of describing rules is Backus-Naur Form, or BNF. In one common way of writing BNF, a production looks like this.

    Expression ::= Term Factor

The equivalent in Marpa's MDL language looks like this:

    Expression: Term, Factor.

In the production above, Expression, Term and Factor are symbols. A production consists of a left hand side and a right hand side. In context-free grammars, like those Marpa parses, the left hand side of a production is always a symbol string of length 1. The right hand side of a production is a symbol string of zero or more symbols. In the example, Expression is the left hand side, and Term and Factor are right hand side symbols.

Left hand side and right hand side are often abbreviated as rhs and lhs. If the rhs of a production has no symbols, the production is called an empty production or an empty rule.

Any symbol which is allowed to occur in the symbol string input is called a terminal symbol. If the symbols in a symbol string are all terminals, that symbol string is also called a sentence.

Derivations

A step of a derivation, or derivation step, is made by taking a symbol string and any production in the grammar whose lhs occurs in that first symbol string, and replacing an occurrence of the lhs symbol in the first symbol string with the rhs of the production. For example, if A, B, C, D, and X are symbols, and

    X: B, C.

is a production, then

    A X D -> A B C D

is a derivation step, with "A X D" as its beginning and "A B C D" as its end or result. We say that the string "A X D" derives the string "A B C D".

A derivation is a sequence of derivation steps. The length of a derivation is its length in steps. A symbol string directly derives another if and only if there is a derivation of length 1 from the first symbol string to the result. A symbol string is said to derive itself in a derivation of length 0. Such a zero length derivation is a trivial derivation.

If a derivation is not trivial or direct, that is, if it has more than one step, then it is an indirect derivation. A derviation which is not trivial, that is, one which has one or more steps, is a non-trivial derivation.

Where the symbol string beginning a derivation consists of a single symbol, we often say that symbol produces the symbol string which results from the derivation. We say that the beginning symbol trivially, non-trivially, directly or indirectly produces the symbol string if the length of the derivation is 0, greater than 0, 1, or greater than 1, just as we do when we say a symbol string derives another symbol string.

When we say that a symbol produces or derives a symbol string, we are taking a top-down point of view. We sometimes take a bottom-up point of view, and say that the symbol matches the symbol string, or that the symbol string matches the symbol.

In any parse, one symbol is distinguished as the start symbol. The parse of an input is successful if and only if the start symbol produces the input according to the grammar.

Nulling

The length of a symbol string is the number of symbols in it. The zero length symbol string is called the empty string. The empty string can be considered to be a sentence, in which case it is the empty sentence. A string of one or more symbols is non-empty. A derivation which produces the empty string is a null derivation. A derivation from the start symbol which produces the empty string is a null parse.

If in a particular grammar, a symbol has a null derivation, it is a nullable symbol. If, in a particular grammar, the only sentence produced by a symbol is the empty sentence, it is a nulling symbol. All nulling symbols are nullable symbols. In any instance where a symbol produces the empty string, it is said to be nulled, or to be a null symbol.

Other Special Cases

If any derivation from the start symbol uses a rule, that rule is called reachable or accessible. A rule that is not accessible is called unreachable or inaccessbile. If any derivation which results in a sentence uses a rule, that rule is said to be productive. A rule that is not productive is called unproductive. A simple case of an unproductive rule is one whose rhs contains a symbol which is not a terminal and not on the lhs of any other rule. A useless rule is one which is inaccessible or unproductive.

If any symbol in the grammar non-trivially produces a symbol string containing itself, the grammar is said to be recursive. If any symbol non-trivially produces a symbol string with itself on the left, the grammar is said to be left-recursive. If any symbol non-trivially produces a symbol string with itself on the right, the grammar is said to be right-recursive. A non-trivial derivation of a symbol string from itself is a cycle. A grammar which contains no cycles is cycle-free.

Structure

The structure of a parse can be represented by as a series of derivation steps from the start symbol to the input. If, for a particular grammar, more than one derivation is possible for an input, that input is said to have an ambiguous parse. An ambiguous grammar is one which for which there is some input that has an ambiguous parse.

Another way to represent structure is as a parse tree. Every symbol used in the parse is represented by a node of the parse tree. Wherever a production is used in the parse, its lhs is represented by a parent node and the rhs symbols are represented by child nodes. The start symbol is the root of the tree. Terminals and symbols on the left hand side of empty productions are leaf nodes. If a node is not a leaf node, it is an inner node. A nulled node is one that represents a nulled symbol.

The node at the root of the tree is the start node. Any node with a symbol being used as a terminal is a terminal node. A node closer to the root node is said to be higher than a node further away.

Semantics

In real life, the structure of a parse is usually a means to an end. Grammars usually have a semantics associated with them, and what the user actually wants is the value of the parse according to the semantics. The tree representation is especially useful when evaluating a parse.

Nulled nodes are treated specially in Marpa, as described below. Nodes which are not nulled are evaluated in standard ways. Every non-null leaf node which represents a terminal symbol has a value associated with it on input. Non-null inner nodes take their semantics from the production whose lhs they represent. The semantics for a production describe how to calculate the value of the node which represents the lhs (the parent node) from the values of zero or more of the nodes which represent the rhs symbols (child nodes). Values are computed recursively, bottom-up. The value of a parse is the value of its start symbol.

Capabilities

Marpa parses any language which can be expressed in cycle-free BNF. (A cycle occurs when a symbol string, after one or more derivation steps, reproduces itself exactly.) Essentially, a cycle is recursion without change. Recursion is highly useful, but cycles always seem to be pathological. Marpa will parse recursive grammars, including left-recursive and right-recursive ones, as long as they are cycle-free.

In formal terms, Marpa parses any cycle-free context-free grammar. Marpa can parse grammars with empty productions and useless productions. Often you cannot express the semantics of a grammar in a natural way without using empty productions. Useless productions cause no major harm, and Marpa is happy to tolerate them.

Ambiguous grammars are a Marpa specialty. (An ambiguous grammar is a grammar which might can parse an input in more than one way.) Ambiguity is often useful even if you are only interested in one parse. An ambiguous grammar is often the easiest and most sensible way to express a language.

Human languages are ambiguous. English sentences are often ambiguous, but human listeners hear the parse that makes most sense. Marpa allows the user to prioritize rules so that a preferred parse is returned first. Marpa can also return all the parses of an ambiguous grammar, if that's what the user prefers.

Marpa incorporates major improvements from recent research into Earley's algorithm. In particular, it combines Earley's with LR(0) precomputation. Marpa also has innovations of its own, including predictive and ambiguous lexing.

The Easy Way

Most of Marpa's capabilities are available using a single static method: Parse::Marpa::mdl. The mdl method requires a grammar description in MDL (the Marpa Description Language) and a string. mdl parses the string according to the MDL description. In scalar context, mdl returns a reference to the value of the first parse. In list context, it returns references to the values of all parses. See below for more detail about the mdl static method.

Semantic Actions

Marpa's semantics are specified with Perl 5 code strings, called actions. Marpa allows lexing actions, a preamble action, rule actions and null symbol actions.

Actions are calculated in a special namespace set aside for that purpose. The preamble action is always run first and can be used to initialize that namespace.

The result of an action is the result from running its Perl 5 code string. From the synopsis, here's a rule for an expression that does addition:

    Expression: Expression, /[+]/, Expression.

and here's its action:

    $_->[0] + $_->[2]

In rule actions, $_ is defined as a reference to an array of the values of the symbols on the left hand side of the rule.

Marpa is targeted to Perl 6. When Perl 6 is ready, Perl 6 code will become its default semantics.

Null Symbol Values

Every symbol has a null symbol value, or more briefly, a null value, and this is used as the value of the symbol when it is nulled. The default null value is a Marpa option (default_null_value). If not explicitly set, default_null_value is a Perl 5 undefined.

Every symbol can have its own null symbol value. In cases where a null symbol derives other null symbols, only the value of the symbol highest in the null derivation is used. For more details, and examples, see "Null Symbol Values" in Parse::Marpa::Evaluator.

Lexing

The easiest way to parse a Perl 5 string in Marpa is to use MDL's default lexing. MDL allows terminals to be defined either as Perl 5 regexes or, for difficult cases, lex actions, which are Perl 5 code. Unlike most parser generators, Marpa does not require that the regular expressions result in a deterministic lexer. It is OK with Marpa if more than one token is possible at a location, or if possible tokens overlap. Ambiguities encountered in lexing are passed up to Marpa's parse engine, and dealt with there.

Marpa allows terminals to appear on the left hand side of rules. Most parsers have a problem with this, but Marpa does not.

Marpa is not restricted to MDL's model of lexing. Advanced users can invent new models of the input, customized to their applications. For more detail see "Tokens and Earlemes" in Parse::Marpa::Grammar.

Lack of Backward Compatibility

While this module is in alpha, versions may not be backward compatible. MDL protects users by requiring the version to be specified, and by insisting on an exact match with Marpa's version number. This strict version regime is the same as that being considered for Perl 6. Nonetheless, Marpa's version matching may become less strict once it goes beta.

How to Read These Documents

The rest of this Description section deals with advanced topics. To create your first Marpa parser, once you've finished with this document, you want to read the Parse::Marpa::Doc::MDL document. That has the details on how to create an MDL grammar description. That should be all you need to get started.

If you want to get into advanced uses of Marpa, the Parse::Marpa::Grammar, Parse::Marpa::Recognizer, and Parse::Marpa::Evaluator documents describe methods which allow you control over the individual phases of the parse, and discuss advanced topics associated with each phase. Parse::Marpa::Doc::Diagnostics describes techniques, named arguments and methods available for debugging and tracing.

Parse::Marpa::Doc::Plumbing documents Marpa's plumbing. Parse::Marpa::MDL documents utilities for converting MDL symbol names to plumbing interface names. Parse::Marpa::Lex documents some lex actions which are used by MDL, and which are available to users for their own lexing.

For advanced diagnostics or for reading Marpa's code, it is necessary to understand Marpa's internals. These are described in Parse::Marpa::Doc::Internals.

Details about sources (books, web pages and articles) referred to in these documents or used in the writing of Marpa are collected in Parse::Marpa::Doc::Bibliography. For those interested in theory, Parse::Marpa::Doc::Algorithm describes Marpa's algorithm, explains how Marpa would not have been possible without the the work of others, and details what is new with Marpa. Parse::Marpa::Doc::To_Do is Marpa's list of things to do.

Phases

The mdl method hides the process of creating Marpa's objects and using Marpa's object methods from the user. But for advanced applications, and for tracing and diagnostics, it is useful to know in detail how Marpa works.

Marpa parsing take place in three phases: grammar creation, input recognition and parse evaluation. For brevity, I'll often speak of the the parse evaluation phase as the evaluation phase, and the input recognition phase as the recognition phase.

Corresponding to the three phases, Marpa has three kinds of object: grammars, recognizers and evaluators. Recognizers are created from grammars and evaluators are created from recognizers.

Grammar objects (Parse::Marpa::Grammar) are created first. They may be created with rules or empty. Rules may be added to them after they have been created. After all the rules have been added, but before it is used to create a recognizer, a grammar must be precomputed. Details on grammar objects and methods can be found at Parse::Marpa::Grammar.

To create a Marpa recognizer object (Parse::Marpa::Recognizer), a Marpa grammar object is required. Once a recognizer object has been created, it can accept input. You can create multiple recognizers from a single grammar, and safely run them simultaneously.

Recognizing an input is answering the "yes" or "no" question: Does the input match the grammar? While recognizing its input, Marpa builds tables. Marpa's evaluation phase works from these tables. Before creation of an evaluation object, the input has not been parsed in the strict sense of the term, that is, its structure according to the grammar has not been determined. For more details on recognizer objects and methods, see Parse::Marpa::Recognizer.

Currently, Marpa fully supports only non-streaming or "offline" input. Marpa will also parse streamed inputs, but the methods to find completed parses in a streamed input are still experimental.

In offline mode, once input is completed, an evaluator object (Parse::Marpa::Evaluator) can be created. For each recognizer, only one evaluator object can be in use at any one time.

An evaluator object is an iterator. If the grammar is ambiguous, the evaluator object can be used to return the values of all the parses. For details on evaluator objects and methods, see Parse::Marpa::Evaluator.

Plumbing and Porcelain

A grammar is specified to Marpa through a grammar interface. There are two kinds of grammar interfaces, porcelain and plumbing. There is only one plumbing interface. As of the moment there is also only one porcelain interface, the Marpa Demonstration Language.

The plumbing is a set of named arguments to the new and set methods of Marpa grammar objects. Porcelain interfaces use the plumbing indirectly. The plumbing is efficient, but MDL is easier to read, write and maintain. Users seeking efficiency are usually better off using compiled MDL. The documentation for the plumbing is Parse::Marpa::Doc::Plumbing.

There can be other porcelain interfaces. Users are encouraged to design their own porcelain. In Marpa's eyes all porcelain will be equal. I call the porcelain that I am delivering with Marpa the Marpa Demonstration Language instead of the "Marpa Language" to emphasize its lack of special status. The documentation for MDL can be found at Parse::Marpa::Doc::MDL.

Namespaces

Actions run in special namespaces unique to each recognizer object. These special namespaces belong entirely to the user.

In the following namespaces, users should use only documented methods:

    Parse::Marpa
    Parse::Marpa::Grammar
    Parse::Marpa::Lex
    Parse::Marpa::MDL
    Parse::Marpa::Recognizer
    Parse::Marpa::Evaluator
    Parse::Marpa::Read_Only

Marpa namespaces and variables not mentioned in this section, should not be relied on or modified. Users should use variables in the the Parse::Marpa::Read_Only namespace on a read-only basis.

The $_ variable is made available to the rule actions, and the $STRING and $START variables to the lex actions, on a read-only basis, except as described in the documentation. Staying read-only can be tricky when dealing with Perl 5 arrays and hashes. Be careful about auto-vivification!

Returns and Exceptions

Most Marpa methods return only if successful. On failure they throw an exception using croak(). If you don't want the exception to be fatal, catch it using eval. A few failures are considered "non-exceptional" and returned. Non-exceptional failures are described in the documentation for the method which returns them.

METHODS

mdl

     $first_result = Parse::Marpa::mdl(\$grammar_description, \$string_to_parse);

     @all_results = Parse::Marpa::mdl(\$grammar_description, \$string_to_parse);

     $first_result = Parse::Marpa::mdl(
         \$grammar_description,
         \$string_to_parse,
         { warnings => 0 }
     );

The mdl static method takes three arguments: a reference to a string containing an MDL description of the grammar; a reference to a string with the text to be parsed; and (optionally) a reference to a hash with options. The available options are described below.

In scalar context, mdl returns a reference to the value of the first parse. In list context, mdl returns a list of references to the values of the parses. If there are no parses, mdl returns undefined in scalar context and the empty list in list context.

Diagnostic Methods

A separate document on diagnostics deals with methods for debugging grammars and parses.

OPTIONS

Marpa has options which control its behavior. These may be set using named arguments when Parse::Marpa::Grammar and Parse::Marpa::Recognizer objects are created, with the Parse::Marpa::Grammar::set method, and with the Parse::Marpa::mdl static method. Except as noted, recognizer objects inherit the Marpa option settings of the grammar from which they were created, and evaluator objects inherit the Marpa option settings of the recognizer from which they were created.

The primary Marpa options are listed below, by argument name, and described. Options for debugging and tracing are described in the separate document on diagnostics.

Porcelain interfaces have their own conventions for Marpa options. The documentation of MDL describes, and the documentation of all porcelain interfaces should describe, which options can be set through that interface, and how.

ambiguous_lex

Treats its value as a boolean. If true, ambiguous lexing is used. Ambiguous lexing means that even if a terminal is matched by a closure or a regex, the search for other terminals at that location continues. If multiple terminals match, all the tokens found are considered for use in the parse. If the parse is ambiguous, it is possible that all the tokens will actually be used. Ambiguous lexing is the default. The ambiguous_lex option cannnot be changed after grammar precomputation.

If ambiguous_lex is false, Marpa does unambiguous lexing, which is the standard in parser generators. With unambiguous lexing, lexing at each location ends when the first terminal matches. The user must ensure the first terminal to match is the correct one. Traditionally, users have done this by making their lex patterns deterministic -- that is, they set their lex patterns up so that every valid input lexes in one and only one way.

Marpa offers users who opt for unambiguous lexing a second alternative. Terminals are tested in order of priority, and the priorities can be set by the user.

code_lines

If there is a problem with user supplied code, Marpa prints the error message and a description of where the code is being used. Marpa also displays the code. The value of code_lines tells Marpa how many lines to print before truncating the code. If it's zero, no code is displayed. If it's negative, all the code is displayed, no matter how long it is. The default is 30 lines. The code_lines option can be changed at any point in a parse.

default_action

Takes as its value a string, which must be code in the current semantics. (Right now Perl 5 is the only semantics available.) This value is used as the action for rules which have no explicitly specified action. If default_action is not set, the default action is to return a Perl 5 undefined. default_action cannot be changed once a recognizer object has been created.

default_lex_prefix

The value must be a regex in the current semantics. (Right now Perl 5 is the only semantics available.) The lexers allow every terminal to specify a lex prefix, a pattern to be matched and discarded before the pattern for the terminal itself is matched. Lex prefixes are often used to handle leading whitespace. default_lex_prefix cannot be changed once a grammar is precomputed.

If a terminal has no lex prefix set, default_lex_prefix is used. When default_lex_prefix is not set, the default lex prefix is equivalent to a regex which matches only the empty string.

default_null_value

The value must be an action, that is, a string containing code in the current semantics. (Right now Perl 5 is the only semantics available.) The null value of a symbol is the symbol's value when it matches the empty string in a parse. Null symbol values are calculated when a recognizer object is created. default_null_value cannot be changed after that point.

Symbols with an explicitly set null action use the value returned by that explicitly set action. Otherwise, if there is a default_null_value action, that action is run when the recognizer is created, and the result becomes that symbol's null value. If there is no default_null_value action, and a symbol has no explicitly set null action, that symbol's null value is a Perl 5 undefined. There's more about null values above and in "Null Symbol Values" in Parse::Marpa::Evaluator.

max_parses

The value must be an integer. If it is greater than zero, evaluators will return no more than that number of parses. If it is zero, there will be no limit on the number of parses returned by an evaluator. The default is for there to be no limit. max_parses can be changed at any point in the parse.

Grammars for which the number of parses grows exponentially with the length of the input are common, and easy to create by mistake. This option is one way to deal with that.

online

A boolean. If true, Marpa runs in online or streaming mode. If false, Marpa runs in offline mode. The online option cannot be changed after a recognizer is created. Offline mode is the default, and the only mode available in this release.

In offline mode, when the first evaluator is created from a recognizer, Marpa assumes that input to the recognizer has ended. The recognizer does some final bookkeeping, and refuses to accept any more input.

opaque

The opaque option is used to set the opacity of an object. A value of 1 marks the object opaque. A value of 0 is the default, and marks it transparent. Not specifying this option and accepting the default behavior is always safe. If an object is been marked opaque, its opacity setting cannot be changed.

Recognizers inherit the opacity marking of the grammar used to create them, if there was one. If a recognizer is created from a grammar without an opacity marking, and no opaque option is specified, the recognizer is marked opaque. Evaluators inherit the opacity marking of the recognizer used to create them.

If an evaluator object is marked transparent, an optimization called "node value memoization" is enabled. An evaluator should be marked transparent only if its semantic actions can be safely memoized.

Marpa marks a grammar opaque internally, if the grammar uses certain kinds of sequence productions. For more details, see "Node Memoization" in Parse::Marpa::Evaluator.

preamble

The value must be a string which contains code in the current semantics. (Right now Perl 5 is the only semantics available.) The preamble is run when the recognizer object is created, in a namespace special to the recognizer object. Rule actions and lex actions also run in this namespace. A preamble may be used to set up globals. The preamble of a recognizer object cannot be changed after the recognizer object has been created.

If multiple preambles are specified as named arguments, the most recent preamble replaces any earlier one. This is consistent with the behavior of other named arguments, but it differs from the behavior of MDL, which creates a preamble by concatenating code strings.

semantics

The value is a string specifying the type of semantics used in the semantic actions. The current default, and the only available semantics at this writing, is perl5. The semantics option cannot be changed after the grammar is precomputed.

trace_file_handle

The value is a file handle. Warnings and trace output go to the trace file handle. By default it's STDERR. The trace file handle can be changed at any point in a parse.

version

If present, the version option must match the current Marpa version exactly. This is because while Marpa is in alpha, features may change dramatically from version to version. The version option cannot be changed after the grammar is precomputed.

warnings

The value is a boolean. If true, it enables warnings about inaccessible and unproductive rules in the grammar. Warnings are written to the trace file handle. By default, warnings are on. Turning warnings on after grammar precomputation is useless, and itself results in a warning.

Inaccessible and unproductive rules sometimes indicate errors in the grammar design. But a user may have plans for them, may wish to keep them as notes, or may simply wish to deal with them later.

IMPLEMENTATION NOTES

Use of object orientation in Marpa is superficial. Only grammars, recognizers and evaluators are objects, and they are not designed to be inherited.

Speed

Speed seems very good for an Earley's implementation. In fact, current performance limits are more often a function of the lexing than of the Marpa parse engine.

Ambiguous Lexing

Ambiguous lexing has a cost, and grammars which can turn ambiguous lexing off can expect to parse twice as fast. Right now when Marpa lexes with multiple regexes at a single location, it uses a series of Perl 5 regex matches, one for each terminal.

There may be a more efficient way to find all the matches in a set of alternatives. A complication is that Marpa does predictive lexing, so that the list of lexables is not known until just before the match is attempted. But I believe that lazy evaluation and memoizing could have big payoffs in the cases of most interest.

The Marpa Demonstration Language

The Marpa Demonstration Language was written to demonstrate Marpa's capabilities, including its lexing capabilities. A porcelain interface which doesn't use ambiguous lexing could easily run faster. One with a customized lexer might be faster yet.

If MDL's parsing speed becomes an issue for a particular grammar, that grammar can be precompiled. Subsequent runs of the precompiled grammar don't incur the overhead of either MDL parsing or precomputation. Marpa parses the user's MDL using a grammar precompiled from an self-describing grammar written in MDL.

Comparison with other Parsers

In thinking about speed, it is helpful to keep in mind Marpa's place in the parsing food chain. Marpa parses grammars that bison, yacc, Parse::Yapp, and Parse::RecDescent cannot parse. For these, Marpa is clearly faster. When it comes to time efficiency, never is not hard to beat.

Marpa allows grammars to be expressed in their most natural form. It's ideal where programmer time is important relative to running time. Right now, special-purpose needs are often addressed with regexes. This works wonderfully if the grammar involved is regular, but across the Internet many man-years are being spent trying to shoehorn non-regular grammars into Perl 5 regexes.

Marpa is a good alternative whenever another parser requires backtracking. Marpa never needs to backtrack. It finds every possible parse the first time through. Backtracking is a gamble, and one you often find you've made against the odds.

Some grammars have constructs to control backtracking. This control comes at a high price. Solutions with these constructs built into them are as unreadable as anything in the world of programming gets, and fragile in the face of change to boot.

If you know you will be writing an LALR grammar or a regular one, that is a good reason not to use Marpa. When a grammar is LALR or regular, Marpa takes advantage of this and runs faster. But such a grammar will run faster yet on a parser designed for it: bison, yacc and Parse::Yapp for LALR; regexes for regular grammars.

Finally, there are the many situations when we want to do some parsing as a one-shot and don't want to have to care what subcategory our grammar falls in. We want to write some quick BNF, do the parsing, and move on. For this, there's Marpa.

DEPENDENCIES

Requires Perl 5.10. Users who want or need the maturity and/or stability of Perl 5.8 or earlier are probably also best off with more mature and stable alternatives to Marpa.

AUTHOR

Jeffrey Kegler

Why is it Called "Marpa"?

Marpa is the name of the greatest of the Tibetan "translators". In his time (the 11th century AD) Indian Buddhism was at its height. A generation of scholars was devoting itself to producing Tibetan versions of Buddhism's Sanskrit scriptures. Marpa became the greatest of them, and today he is known simply as Marpa Lotsawa: "Marpa the Translator".

Translation in the 11th century was not a job for the indoors type. A translator needed to study with the teachers who had the texts and could explain them. That meant going to India. Marpa lived in Tibet's Lhotrak Valley. To get to India, he needed to cross two hundred difficult and lawless miles to the Khala Chela Pass, then scale its three-mile high summit.

The last four hundred miles, from Khala Chela to the great Buddhist center of Nalanda, had turned out to be the most deadly. The first expeditions had descended straight to the hot plains and met disaster. Almost no germs live in the cold, thin air of Tibet. With no immunity to India's diseases, entire expeditions had been wiped out within weeks of arrival.

Blatant Plug

There's more about Marpa in my new novel, The God Proof, in which his studies, travels and adventures are a major subplot. The God Proof centers around Kurt Gödel's proof of God's existence. Yes, that Kurt Gödel, and yes, he really did work out a God Proof (it's in his Collected Works, Vol. 3, pp. 403-404). The God Proof is available as a free download http://www.lulu.com/content/933192, and in print form at Amazon.com: http://www.amazon.com/God-Proof-Jeffrey-Kegler/dp/1434807355.

TO DO

See Parse::Marpa::To_Do.

BUGS

Priority Conflicts

If non-default priorities are given to rules, it's possible two rules with different priorities could wind up in the same SDFA state. Marpa can't proceed when that happens. (See the internals document if you're interested in details about SDFA's.)

I've actually never seen this happen, and one reason the problem is not fixed is that I will need to contrive a case where the problem occurs before I test the fix. But if you're the unlucky first person to encounter this issue, here are the workarounds.

Workaround 1: Marpa will report the rules which caused the conflict. If they can be changed to have the same priority, the problem is solved.

Workaround 2: Instead of using priorities, use multiple parses. That is, instead of using priorities to make the desired parse first in order, allow the "natural" order and iterate through the parses until you get the one you want.

Workaround 3: Make a small change in the grammar. Be aware that the code which creates the SDFA is smart enough that it will probably need to be a real change to the language. Simply writing different rules with the same effect probably won't make the problem go away.

Here's what I think is the fix: Change the SDFA to be a little more non-deterministic, so that there are different SDFA nodes for the different priorities, with empty transitions between them. (Aren't you sorry you asked?) With a fix of this kind, testing examples (even if they were easier to find) is not sufficient to show correctness. I'll need to show that the current and the fixed SDFA's are "equivalent". That demonstration may need to be a mathematical proof. For now, there's the comfort that the problem seems to be quite rare.

Non-intuitive Parse Order in Unusual Cases

This problem occurs when

  • An ambiguous production has more than two nullable symbols on the right hand side; and

  • The order of the parses for that production matters.

This doesn't happen in any practical grammars I've tried. Perhaps it's a unnatural way to set up the semantics. But it certainly does happen in textbook grammars.

A very straightforward workaround is described below. But the problem needs to be fixed before Marpa goes beta.

Details: The problem occurs because these productions are rewritten internally by CHAF. A rightmost parse comes first, as I have documented, but it is a rightmost parse for the grammar as rewritten by CHAF. This is a bug for pendantic reasons, because CHAF rewriting is supposed to be invisible. It's a bug for practical reasons because the CHAF-driven order is not intuitive, and I can't picture it ever being the desired first choice. Priorities are not a workaround, because priorites cannot be set for rules within a CHAF rewrite.

Workaround: Rewrite the rule for which this is a problem. The problem only occurs where a rule is subject to CHAF rewriting, and CHAF rewrites are only done to rules with more than two nullables on the right hand side. It is always possible to break up a rule into other rules such that at most two nullables occur on the right hand side.

End of Line Comment Cannot be Last in MDL Source

A Perl style end of line comment cannot be last thing in MDL source. Workaround: Add a blank line.

What! You Found Even More Bugs!

Please report any bugs or feature requests to bug-parse-marpa at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Parse-Marpa. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Parse::Marpa
    

You can also look for information at:

ACKNOWLEDGMENTS

Marpa is the parser described in Aycock and Horspool 2002. I've made significant changes to it, which are documented separately (Parse::Marpa::Doc::Algorithm). Aycock and Horspool, for their part, built on the algorithm discovered by Jay Earley.

I'm grateful to Randal Schwartz for his encouragement over the years that I've been working on Marpa. My one conversation with Larry Wall about Marpa was brief and long ago, but his openness to the idea was a major encouragement, and his insights into how humans do programming, how they do languages, and how those two endeavors interconnect, a major influence. More recently, Allison Randal and Patrick Michaud have been generous with their very valuable time. They might have preferred that I volunteered as a Parrot cage-cleaner, but if so, they were too polite to say.

Many at perlmonks.org answered questions for me. I used answers from chromatic, Corion, dragonchild, jdporter, samtregar and Juerd, among others, in writing this module. I'm just as grateful to those whose answers I didn't use. My inquiries were made while I was thinking out the code and it wasn't always 100% clear what I was after. If the butt is moved after the round, it shouldn't count against the archer.

In writing the Pure Perl version of Marpa, I benefited from studying the work of Francois Desarmenien (Parse::Yapp), Damian Conway (Parse::RecDescent) and Graham Barr (Scalar::Util). Adam Kennedy patiently corrected me on the finer points of module writing, as well as about some issues where I really should have know better.

COPYRIGHT & LICENSE

Copyright 2007-2008 Jeffrey Kegler, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.