Name

Marpa::R2::BNF - BNF Interface

Synopsis

    use Marpa::R2;

    my $grammar = Marpa::R2::Grammar->new(
        {   
            actions        => 'My_Actions',
            default_action => 'do_first_arg',
            source          => \(<<'END_OF_SOURCE'),
    :start ::= Script
    Script ::= Expression+ separator => <op comma> action => do_script
    Expression ::=
        Number
        | (<op lparen>) Expression (<op rparen>) action => do_parens assoc => group
       || Expression (<op pow>) Expression action => do_pow assoc => right
       || Expression (<op times>) Expression action => do_multiply
        | Expression (<op divide>) Expression action => do_divide
       || Expression (<op add>) Expression action => do_add
        | Expression (<op subtract>) Expression action => do_subtract
    END_OF_SOURCE
        }
    );

Overview

This page is the reference for the Marpa's "BNF interface", also called its "Stuifzand interface". The Stuifzand interface allows the grammar to be specified using a string whose text is written in a variation of BNF. BNF source strings are specified using the the source named argument of Marpa::R2's grammars. BNF source strings perform the functions of the rules and start named arguments of Marpa grammars. When the source named argument is used to specify a grammar, the rules and start named arguments should not be used, and vice versa.

The Stuifzand interface is an example of a domain-specific language (DSL). The DSL of this document can (and has been) used to specify other DSL's, and therefore can be seen as a "meta-DSL".

The source string

Structure

Whitespace separates tokens, but is otherwise ignored. Comments are equivalent to whitespace. A hash ("#") character starts a comment, which continues to the end of the line.

The Stuifzand format consists of rule declarations. Each rule declaration consists of, in order:

  • A left hand side (LHS). This will be a symbol or a pseudo-symbol.

  • The declaration operator ("::=").

  • A right hand side (RHS). For each type of rule, the RHS will be described in detail below. In the case of empty rules, the RHS will be empty.

  • An adverb list. Each adverb consists of a keyword, the adverb operator ("=>"), and the adverb's value. The adverb list may be zero length.

Symbol names

Symbol names can be either "bare" or enclosed in angle brackets. Bare symbol names must consist entirely of Perl word characters (alphanumerics, plus the underscore). The angle brackets, if used, serve to "quote" the symbol name, and will not be part of the symbol name.

If angle brackets are used, symbol names may also contain whitespace, as in

    <op comma>

A whitespace sequence inside angle brackets can include any whitespace character that is legal in Perl, including newlines. This allows very long symbol names to be line wrapped, if necessary.

Unlike angle brackets, the whitespace does become part of the symbol name, but it does so in a "normalized" form. Leading and trailing whitespace in the name is discarded, and all other whitespace sequences are converted to a single ASCII space character. This means that

    < op comma  >
    <op   comma>
    <     op comma>

and even

    <op
    comma>

will all be regarded as the same symbol name. The normalized form of that symbol name is <op comma>. The actual name of a symbol is the string inside the brackets of its normalized form.

Types of rules

Rules are of four kinds: the start rule, empty rules, quantified rules and prioritized rules. A rule like the following,

    a ::= b c d

whose RHS consists only of a sequence of symbols, is treated as a prioritized rule with only one alternative and only one priority. In a typical large grammar, most rules are prioritized rules, many of them taking a simple form like that in the display above.

Start rule

Every source string should have a start rule. The LHS of this rule is the :start pseudo-symbol. The RHS must be a single symbol name. The start rule does not accept any adverbs.

Empty rules

An empty rule is a rule with an empty RHS. An empty right hand side is what the name suggests: no symbols or punctuation. An action adverb is allowed in an empty rule, but no others. A empty rule makes its LHS symbol a nullable symbol.

Quantified rules

A rule is quantified if its RHS consists of a single symbol name, followed by a "quantifier". The quantifer is either a star ("*"), or a plus sign ("+") indicating, respectively, that the sequence rule has a minimum length of 0 or 1. Quantified rules allow adverbs.

Prioritized rules

Any rule which is not a start rule, and not empty or quantified, is considered prioritized. The prioritization may be simple, consisting only of a single priority level.

Each prioritized right hand side consists of one or more right hand alternatives (called "alternatives" for brevity). An alternative is a series of one or more symbols, followed by zero or more adverbs.

The alternatives are separated by single or double "or" symbols. This means the alternatives in a prioritized right hand side proceed from tightest (highest) priority to loosest. The double "or" symbol ("||") is the "loosen" operator -- the alternatives after it have a looser (lower) priority than the alternatives before it. The single "or" symbol ("|") is the ordinary "alternative" operator -- alternatives on each side of it have the same priority. Associativity is specified using adverbs, as described below.

Within an alternative, symbols may be enclosed in parentheses. A symbol enclosed in parentheses is hidden from Marpa's semantics. A set of parentheses may contain more than one symbol, in which case the entire sequence of symbols is hidden, as if they had been enclosed in parentheses individually.

Adverbs

Adverbs are allowed for empty rules, quantifiied rules, and for the right hand side alternatives in prioritized rules. Adverbs must occur after the symbols. Adverbs consist of a keyword, the adverb operator ("=>"), and the adverb's value. The keyword must be one of those described in this section. The adverb's value must be as described for each keyword.

action

The value of the action keyword must be a Perl word (a string of alphanumerics or underscores). It will be interpreted as the action name, as described for the action named argument of rule descriptors.

assoc

The assoc adverb is only valid in a prioritized rule. Its value must be one of left, right or group. Its effect will be as described below.

separator

The separator keyword is only valid for a quantified right side, and its value must be a symbol of the grammar. It will be used as the separator, as described for sequence rules.

proper

The proper keyword is only valid for a quantified right side, and its value must be a boolean, in the form of a binary digit (0 or 1). It indicates proper or Perl separation, as described for sequence rules.

Precedence

Marpa's precedence is a generalization beyond the traditional ideas of precedence. Traditional precedence parsing required the classification of operators as postfix, infix, etc. Marpa's precedence parsing is NOT based on the special treatment of operators.

For the purpose of precedence, an operand is an occurrence in a RHS alternative of the LHS symbol. An operator is considered to be anything that is not an operand. The arity of an alternative is the number of operands that it contains. All arities are allowed, from zero to the arbitrary number imposed by system limits such as memory and file size.

For example, in the synopsis, the LHS symbol is Expression. The alternative

        (<op lparen>) Expression (<op rparen>)

contains one occurrence of Expression and therefore has an arity of one. The <op lparen> and <op rparen> are considered operators.

In the RHS alternative

       Expression (<op pow>) Expression

Expression occurs twice, and therefore the arity is 2. <op pow> is considered to be an operator.

Because for this purpose an operator is defined as anything that is not an operand, some symbols are considered operators that would not be in the traditional approach. For example, in the RHS alternative

       Number

there are no occurrences of Expression, so that the alternative has an arity of zero -- it is nullary. The symbol Number is considered to be an operator.

An alternative with arity 0 is nullary. Precedence and associativity are meaningless in this case and will be ignored.

An alternative with arity 1 is unary. Precedence will have effect, but left and right associativity will not.

An alternative with arity 2 is binary. Precedence will have effect, and left and right associativity will behave in the traditional way, The traditional behavior for binary alternatives is exactly as described next for the N-ary case.

An alternative with an arity of N, where N is 2 or greater, is N-ary. Precedence will have effect. For left associativity, only the leftmost operand of an N-ary alternative associates -- operands after the first will have the next-tightest priority level. For right associativity, only the rightmost operand of an N-ary alternative associates -- all operands except the last will have the next-tightest priority level.

Marpa also allows "group" associativity. In "group" associativity, all operands associate at the loosest (lowest) priority. That is, in an alternative with group associativity, each operand may be a full expression of the kind defined by the prioritized rule. "Group" associativity is used, for example, in implementing the traditional function of parentheses in Marpa. Group associativity is meaningless for nullary alternatives, and is ignored.

Caveats

Symbol names

The BNF interface may rewrite its symbol names internally. Here we call the name by which a symbol is specified in the BNF interface string its "external symbol name", and its name as rewritten, its "internal symbol name". In many cases, Marpa::R2 makes sure the user sees only the external names of symbols. For example, the rule() Grammar method returns external symbol names.

However, for some purposes, such as tracing grammars, the use of internal names may sometimes be appropriate. Marpa::R2's use of external and internal symbol names is evolving. When the Stuifzand interface is used, except as documented in this section, the user may see either the internal or external form of a symbol's name. Also, the exact form of the internal name is subject to change.

Precedence: the good news

Marpa's generalization of precedence works for all grammars that can be defined by prioritized rules. It is efficient (linear) for all grammars that could be parsed by the traditional precedence parsing methods. Marpa also allows you to define alternatives not allowed by traditional methods. Many of these are useful, and most of the useful ones can be parsed efficiently.

Precedence and ambiguous grammars

Because of the many forms of recursion allowed, it is possible to define highly ambiguous grammars using the precedence mechanism. This can occur even by accident.

The user should especially be careful with right hand side alternatives in which all the symbols are operands. These can be useful. For example, an implicit operation can be defined using an binary alternative with no non-operands, and this could implement, for example, the standard notation for concatenation or multiplication. But to do this efficiently requires either avoiding ambiguity, or controlling its use carefully.

Marpa does catch the case where an alternative consists only of a single operand -- a "unit rule". This causes a fatal error. Unit rules are easy to define by accident in the Stuifzand interface. The author knows of no practical use for them, and their presence in a grammar is usually unintentional. Note that, in the event an application does find a use for a grammar with unit rules, Marpa's standard interface can parse it.

Copyright and License

  Copyright 2013 Jeffrey Kegler
  This file is part of Marpa::R2.  Marpa::R2 is free software: you can
  redistribute it and/or modify it under the terms of the GNU Lesser
  General Public License as published by the Free Software Foundation,
  either version 3 of the License, or (at your option) any later version.

  Marpa::R2 is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  Lesser General Public License for more details.

  You should have received a copy of the GNU Lesser
  General Public License along with Marpa::R2.  If not, see
  http://www.gnu.org/licenses/.