The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Marpa - Parse any Language You Can Describe in BNF

SYNOPSIS

    my $grammar = Marpa::Grammar->new(
        {   start   => 'Expression',
            actions => 'My_Actions',
            default_action => 'first_arg',
            rules   => [
                { lhs => 'Expression', rhs => [qw/Term/] },
                { lhs => 'Term', rhs => [qw/Factor/] },
                { lhs => 'Factor', rhs => [qw/Number/] },
                { lhs => 'Term', rhs => [qw/Term Add Term/], action => 'do_add' },
                {   lhs    => 'Factor',
                    rhs    => [qw/Factor Multiply Factor/],
                    action => 'do_multiply'
                },
            ],
        }
    );

    $grammar->precompute();

    my $recce = Marpa::Recognizer->new( { grammar => $grammar } );

    my @tokens = (
        [ 'Number', 42 ],
        [ 'Multiply', ],
        [ 'Number', 1 ],
        [ 'Add', ],
        [ 'Number', 7 ],
    );

    $recce->tokens( \@tokens );

    sub My_Actions::do_add {
        my ( undef, $t1, undef, $t2 ) = @_;
        return $t1 + $t2;
    }

    sub My_Actions::do_multiply {
        my ( undef, $t1, undef, $t2 ) = @_;
        return $t1 * $t2;
    }

    sub My_Actions::first_arg { shift; return shift; }

    my $value_ref = $recce->value;
    my $value = $value_ref ? ${$value_ref} : 'No Parse';

DESCRIPTION

This is ALPHA software

This is alpha software. There may be bugs. The interface may change. Please be careful. Do not rely on it for anything mission-critical.

Overview

Marpa parses any language whose grammar can be written in BNF. That includes recursive grammars, ambiguous grammars, infinitely ambiguous grammars and grammars with useless or empty productions.

This document contains a top-level overview of the API for the Marpa parse engine. The two examples in the synopsis show the typical flow of calls in the Marpa API. This document will use these examples to describe the basic features of Marpa in semi-tutorial fashion. Marpa's advanced features, and full reference details of all features, can be found in the other Marpa API documents.

The Three Phases

A parser needs to:

  • Accept a grammar.

  • Read input.

  • Return values from the parses, according to a semantics.

In Marpa these three tasks are, for the most part, distinct phases. The Marpa API is structured around them.

Creating a Grammar

The package Marpa::Grammar creates grammars. It also "precomputes" the grammars. Precomputation creates a set of data structures that the recognizer will need.

Recognizing Input

The Marpa::Recognizer package creates recognizers from grammars. A recognizer reads input and produces tables based on the grammar and the input. The information that the evaluators use to find parses is in the recognizer's tables.

Evaluating Parses

Marpa's evaluators find the parse results in the recognizer's tables. If the input did not match the grammar, the tables will contain no parse results, and the evaluators will indicate that. If the input did match the grammar, the tables will contain one or more parse results. There may be more than one parse result, because Marpa allows ambiguous grammars.

Marpa has two evaluators: a Single Parse Evaluator, and a Multi-parse Evaluator. The Single Parse Evaluator returns the value of only one of the parse results. The Multi-parse Evaluator allows the application to iterate parse trees.

For each of its rules, a Marpa application can specify a Perl closure to implement that rule's semantics. For each token, a Marpa application can specify a Perl scalar to be the token's value. The Marpa evaluators apply these semantics recursively to the parse trees. The value of the root node of a parse tree becomes the value of the parse result and is returned to the user.

EXAMPLE 1: A SIMPLE CALCULATOR

The synopsis shows the code for a very simple calculator. It handles only addition and multiplication of integers. This section explains, line by line, how it works.

Marpa::Grammar::new

    my $grammar = Marpa::Grammar->new(
        {   start   => 'Expression',
            actions => 'My_Actions',
            default_action => 'first_arg',
            rules   => [
                { lhs => 'Expression', rhs => [qw/Term/] },
                { lhs => 'Term', rhs => [qw/Factor/] },
                { lhs => 'Factor', rhs => [qw/Number/] },
                { lhs => 'Term', rhs => [qw/Term Add Term/], action => 'do_add' },
                {   lhs    => 'Factor',
                    rhs    => [qw/Factor Multiply Factor/],
                    action => 'do_multiply'
                },
            ],
        }
    );

Marpa grammars are Marpa::Grammar objects. They are created with the Marpa::Grammar::new constructor. The arguments to Marpa::Grammar::new are references to hashes of named arguments. In the key/value pairs of these hashes, the hash key is the name of the argument, and the hash value is the value of the named argument.

The start Named Argument

    start => 'Expression',

The start named argument is required. Its value is a string containing the name of the grammar's start symbol.

Named Arguments for the Semantics

            actions => 'My_Actions',
            default_action => 'first_arg',

The actions and default_action named arguments specify semantics. Their argument values are strings, which rely for their interpretation on the evaluators.

The evaluators will be described later. Peeking ahead, the default_action named argument will be interpreted as an action name. This action name will resolve to a Perl closure that implements the semantics for rules without a per-rule semantics. actions will be the name of a Perl package where Marpa looks for the semantic Perl closures.

The rules Named Argument

    rules => [
        { lhs => 'Expression', rhs => [qw/Term/] },
        { lhs => 'Term',       rhs => [qw/Factor/] },
        { lhs => 'Factor',     rhs => [qw/Number/] },
        { lhs => 'Term', rhs => [qw/Term Add Term/], action => 'do_add' },
        {   lhs    => 'Factor',
            rhs    => [qw/Factor Multiply Factor/],
            action => 'do_multiply'
        },
    ],

The value of the rules named argument is a reference to an array of rule descriptors. In this example, all the rule descriptors are in the "long" form -- they are references to hashes of rule properties. In each key/value pair of a rule descriptor hash, the key is the name of a rule property, and the hash value is the value of that rule property.

The lhs Property

The value of the lhs rule property must be a string containing the name of the rule's left hand side symbol. Every Marpa rule must have a left hand side symbol.

The rhs Property

The value of the rhs property is a reference to an array of strings containing names of the rule's right hand symbols, in order. This array may be zero length, in which case this is an empty rule -- a rule with no symbols on the right hand side. A rule is also empty if there is no rhs specifier in its descriptor. There are no empty rules in this example.

The action Property

The value of the action rule property is a string. What the string means is up to the evaluator. Peeking ahead, each action property string will be interpreted as an action name. This action name will be resolved to a Perl closure that implements the rule's semantics.

Marpa::Grammar::precompute

    $grammar->precompute();

Before a Marpa grammar object can be used by a Marpa recognizer, it must be precomputed. Precomputation compiles data structures that the recognizer will need.

Marpa::Recognizer::new

    my $recce = Marpa::Recognizer->new( { grammar => $grammar } );

Marpa::Recognizer::new creates a new recognizer. Its arguments are references to hashes of named arguments. In this example the only named argument is "grammar", which is required. The value of the grammar named argument must be a precomputed Marpa grammar for the recognizer to use in building its tables.

Marpa::Recognizer::tokens

    my @tokens = (
        [ 'Number', 42 ],
        [ 'Multiply', ],
        [ 'Number', 1 ],
        [ 'Add', ],
        [ 'Number', 7 ],
    );

    $recce->tokens( \@tokens );

Marpa::Recognizer::tokens reads the input to be recognized. Its first (and in our example, only) argument is a reference to an array of token descriptors. Each token descriptor is an array reference.

The first element of a token descriptor is a string containing the token name. The token name must be the name of a valid terminal symbol in the grammar. By default all symbols are valid as terminal symbols, unless a grammar contains empty rules.

The grammars in our examples do not contain empty rules, and therefore we are free to use any symbol in the grammar as a token name. For more on terminals, including how to explicitly mark terminal symbols when the grammar contains empty rules, see "Terminals" in Marpa::Grammar.

The second element of a token descriptor is the token value. A token value must be a Perl scalar, but the rest of its semantics is entirely up to the application. If the token value is omitted, it is a Perl undef. In the calculator example, the values of the "Add" and "Multiply" tokens are never used, so they are allowed to default to undef.

Marpa::Recognizer::value

    my $value_ref = $recce->value;
    my $value = $value_ref ? ${$value_ref} : 'No Parse';

The Marpa::Recognizer::value method returns a reference to the parse result's value, if there was a parse result. If there was no parse result, Marpa::Recognizer::value returns undef.

The Marpa::Recognizer::value method is Marpa's Single Parse evaluator. Marpa has two evaluators: a Single Parse Evaluator and a Multi-parse Evaluator. The Single Parse Evaluator will handle all unambiguous grammars as well as many ambiguous parses. The Single Parse Evaluator is the best evaluator for most situations.

Resolving the Semantics

The first thing Marpa::Recognizer::value needs to do is resolve the semantics. Resolving the semantics means mapping the action names into semantic Perl closures. Semantic Perl closures are Perl closures which directly implement semantics. In this example, the actions named argument is interpreted as a Perl package name. Marpa will look for its semantic Perl closures in that package.

    actions => 'My_Actions',
    { lhs => 'Factor', rhs => [qw/Factor Multiply Factor/], action => 'do_multiply' },

For example, the action property for the above rule is "do_multiply" and the actions named argument to the grammar was "My_Actions". So Marpa looks for a closure whose fully qualified name is My_Actions::do_multiply, which it finds.

    sub My_Actions::do_multiply {
        my ( undef, $t1, undef, $t2 ) = @_;
        return $t1 * $t2;
    }

Rules do not always have action properties. That is the case with these rules in this example:

    { lhs => 'Expression', rhs => [qw/Term/] },
    { lhs => 'Term', rhs => [qw/Factor/] },
    { lhs => 'Factor', rhs => [qw/Number/] },

Where there is no action rule property, Marpa tries to use the lhs property as an action name. When Marpa cannot resolve the lhs property as an action name, it will fall back to using the default action for the grammar.

For example, for the first rule in the above display, Marpa will look for a Perl closure with the fully qualified name "My_Actions::Expression". When Marpa does not find a closure by that name, Marpa will fall back to trying to find a semantic Perl closure using the default action.

The other two rules in the above display work similarly. Marpa will look for Perl closures named named "My_Actions::Term" and "My_Actions::Factor". It will not find them and will fall back to trying to use the default action, as described next.

    default_action => 'first_arg',

The default_action named argument is resolved in the same way as are the action properties of the rules. In this example, default_action is specified as "first_arg" and resolves to My_Actions::first_arg.

Semantic Perl Closures

    sub My_Actions::first_arg { shift; return shift; }
    sub My_Actions::do_add {
        my ( undef, $t1, undef, $t2 ) = @_;
        return $t1 + $t2;
    }

The semantic Perl closures are callbacks, called when each node in a parse tree is evaluated. The callbacks receive one or more arguments. The first argument to a semantic Perl closure is always a per-parse-result object, which the callbacks can use as a scratchpad. In these examples, the per-parse-result object is not used.

For a non-empty rule, the second and any subsequent arguments to the callback are the values, in lexical order, of the symbols on the right hand side of the rule. If the semantic Perl closure is for an empty rule, the per-parse-result object will be its only argument.

Every semantic Perl closure is expected to return a value. With one exception, this value is passed up to a parent node as an argument. The exception is the value for the start rule. The return value for the start rule becomes the value of the parse result.

The default_action named argument is optional. When there is no default action for a grammar, rules without an explicit semantics return a Perl undef.

THE SINGLE PARSE EVALUATOR

In most cases the parse result returned by Marpa's Single Parse Evaluator is all that is wanted. The Single Parse Evaluator is not a separate package. It is implemented as the "Marpa::Recognizer::value" method.

Marpa::Recognizer's Single Parse Evaluator is far more powerful than the evaluator in most parser generators, but it does have limitations.

  • The "Marpa::Recognizer::value" method cannot be used if your grammar is infinitely ambiguous. A grammar is infinitely ambiguous if there is some input for which you could list the parse results forever, with none of the parse results being duplicates.

  • The "Marpa::Recognizer::value" method cannot be used if your parse is ambiguous and you want to see two or more of the possible parse results.

  • The "Marpa::Recognizer::value" method cannot be used if your parse is ambiguous and you want to be able to choose which of the parse results you see.

In all other cases, the "Marpa::Recognizer::value" can be used, will be faster, and will usually be the best choice.

THE MULTI-PARSE EVALUATOR

The Multi-parse Evaluator is for situations in which you cannot use Marpa's Single Parse Evaluator. The Multi-parse Evaluator handles all possible parses -- every possible input to any BNF-expressible grammar. For unambiguous and finitely ambiguous grammars, the Multi-parse Evaluator will iterate all the parse results. The Multi-parse Evaluator also allows the application to control the order in which parse results are produced.

For infinitely ambiguous grammars, the Multi-parse Evaluator will iterate all the non-cyclical parse results. Few users will need or want to work with infinitely ambiguous grammars. Those interested in more on this topic can look at the documentation for Marpa::Evaluator.

Finitely ambiguous grammars can be very useful. They haven't been much used in the past, because few parsers were capable of handling them.

The Single Parse Evaluator will evaluate parses from finitely ambiguous grammars. But the Single Parse Evaluator is limited to returning one parse result, chosen arbitrarily. That is sufficient for a lot of work with ambiguous grammars. When it's not sufficient, the Multi-parse Evaluator is the right tool.

EXAMPLE 2: AN AMBIGUOUS PARSE

This is the same calculator as before, rewritten to be ambiguous. Rather than give multiplication precedence over addition, the rewritten calculator allows any order of operations. In this example, the semantic Perl closures (My_Actions::do_add, etc.) and the @tokens array remain the same as before.

Eliminating precedence makes the grammar shorter, but it also means there can be multiple parse results, and the different parse results can have different values. In this application we decide, for each input, to return the value for every one of the parse results. The Single Parse Evaluator will not do this, so we use the Multi-parse Evaluator.

    my $ambiguous_grammar = Marpa::Grammar->new(
        {   start   => 'E',
            actions => 'My_Actions',
            rules   => [
                [ 'E', [qw/E Add E/],      'do_add' ],
                [ 'E', [qw/E Multiply E/], 'do_multiply' ],
                [ 'E', [qw/Number/],       ],
            ],
            default_action => 'first_arg',
        }
    );

    $ambiguous_grammar->precompute();

    my $ambiguous_recce =
        Marpa::Recognizer->new( { grammar => $ambiguous_grammar } );

    $ambiguous_recce->tokens( \@tokens );

    my $evaler = Marpa::Evaluator->new( { recce => $ambiguous_recce } );

    my @values = ();
    if ($evaler) {
        while ( defined( my $ambiguous_value_ref = $evaler->value() ) ) {
            push @values, ${$ambiguous_value_ref};
        }
    }

Short Form Rule Descriptors

    rules => [
        [ 'E', [qw/E Add E/],      'do_add' ],
        [ 'E', [qw/E Multiply E/], 'do_multiply' ],
        [ 'E', [qw/Number/], ],
    ],

The rule descriptors in the ambiguous example demonstrate the "short" or array form of rule descriptors. Array form rule descriptors are references to arrays. Here the elements are, in order, the lhs property, the rhs property, and the action property.

Marpa::Evaluator::new

    my $evaler = Marpa::Evaluator->new( { recce => $ambiguous_recce } );

Multi-parse evaluators are objects in the Marpa::Evaluator class. They are created with the Marpa::Evaluator::new method. The arguments to this constructor are references to hashes of named arguments. In this example, the only named argument is "recce". The recce named argument is required. It must be the name of the recognizer in whose tables the evaluator will find its parse results.

It is possible that Marpa::Evaluator::new will be able to see immediately that there are no parse results in the recognizer's tables. In that case Marpa::Evaluator::new returns a Perl undef.

Marpa::Evaluator::value

    my @values = ();
    if ($evaler) {
        while ( defined( my $ambiguous_value_ref = $evaler->value() ) ) {
            push @values, ${$ambiguous_value_ref};
        }
    }

The Marpa::Evaluator::value method iterates through the parse results. It returns a reference to the value of the current parse result. At the end of the iteration, after all parse results have been returned, Marpa::Evaluator::value returns undef. If there were no parse results, Marpa::Evaluator::value returns undef the first time that it is called.

ERRORS AND EXCEPTIONS

Methods in the Marpa API do not return errors. When there are errors, Marpa API methods throw an exception.

INHERITANCE

Classes in the Marpa API are not designed to be inherited.

OTHER DOCUMENTS

This document gave a semi-tutorial overview of the entire Marpa API. For full details on Marpa's grammar objects and their methods, see the Marpa::Grammar document. For full details on Marpa's recognizer objects and their methods, see the Marpa::Recognizer document.

Marpa's standard semantics are fully described in the Marpa::Semantics document. Techniques for tracing and for debugging your Marpa grammars are described in the Marpa::Tracing document.

The Marpa::Implementation describes Marpa's internals, focusing on the Marpa tables (Earley sets) and how to read them. This information can be quite useful for users of Marpa who are tracing and debugging grammars they've written.

The Marpa::Evaluator document provides details about the Multi-parse Evaluator. The Marpa::Models document (yet to be written) will provide details about alternative models of the input for Marpa.

Marpa::Parse_Terms is intended as a quick refresher in parsing terminology. My sources, and other useful references, are described in Marpa::Bibliography. Marpa::Algorithm describes the Marpa algorithm itself. It will only be of interest to those with a theoretical bent.

HISTORY

Marpa is a branch from Parse::Marpa. Marpa will, in the near future, replace Parse::Marpa.

AUTHOR

Jeffrey Kegler

Why is it Called "Marpa"?

Marpa is the name of the greatest of the Tibetan "translators". In his time (the 11th century AD) Indian Buddhism was at its height. A generation of scholars was devoting itself to producing Tibetan versions of Buddhism's Sanskrit scriptures. Marpa became the greatest of them, and today is known as Marpa Lotsawa: "Marpa the Translator".

Blatant Plug

Marpa is a character in my novel, The God Proof. The God Proof centers around Kurt Gödel's proof of God's existence. Yes, that Kurt Gödel, and yes, he really did work out a God Proof (it's in his Collected Works, Vol. 3, pp. 403-404). The God Proof is available as a free download (http://www.lulu.com/content/933192). It can be purchased in print form at Amazon.com: http://www.amazon.com/God-Proof-Jeffrey-Kegler/dp/1434807355.

ACKNOWLEDGMENTS

Marpa is derived from the parser described in Aycock and Horspool 2002. I've made significant changes to it, which are documented separately (Marpa::Algorithm). Aycock and Horspool, for their part, built on the algorithm discovered by Jay Earley.

I'm grateful to Randal Schwartz for his support over the years that I've been working on Marpa. My chats with Larry Wall have been few and brief, but his openness to new ideas has been a major encouragement and his insight into the relationship between "natural language" and computer language has been a major influence. More recently, Allison Randal and Patrick Michaud have been generous with their very valuable time. They might have preferred that I volunteered as a Parrot cage-cleaner, but if so, they were too polite to say.

Many at perlmonks.org answered questions for me. I used answers from chromatic, Corion, dragonchild, jdporter, samtregar and Juerd, among others, in writing this module. I'm just as grateful to those whose answers I didn't use. My inquiries were made while I was thinking out the code and it wasn't always 100% clear what I was after. If the butt is moved after the round, it shouldn't count against the archer.

In writing the Pure Perl version of Marpa, I benefited from studying the work of Francois Desarmenien (Parse::Yapp), Damian Conway (Parse::RecDescent) and Graham Barr (Scalar::Util). Adam Kennedy patiently instructed me in module writing, both on the finer points and on issues about which I really should have know better.

My academic sources are described in detail in my bibliography and my (very difficult and sketchy) description of the Marpa algorithm. Nonetheless, here I am to name those who created the algorithms which are the direct and near-direct predecesssors of Marpa -- Jay Earley, Joop Leo, John Aycock and R. Nigel Horspool. Without their work Marpa would not exist.

SUPPORT

Marpa comes without warranty. Support is provided on a volunteer basis through the standard mechanisms for CPAN modules. The Support document has details.

LICENSE AND COPYRIGHT

Copyright 2007-2010 Jeffrey Kegler, all rights reserved. Marpa is free software under the Perl license. For details see the LICENSE file in the Marpa distribution.