The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

Name

Marpa::R2::Glade - Low-level interface to Marpa's ASF's

Synopsis

  my $grammar = Marpa::R2::Scanless::G->new(
      {   source => \(<<'END_OF_SOURCE'),
  :start ::= pair
  pair ::= duple | item item
  duple ::= item item
  item ::= Hesperus | Phosphorus
  Hesperus ::= 'a'
  Phosphorus ::= 'a'
  END_OF_SOURCE
      }
  );

  my $slr = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
  $slr->read( \'aa' );
  my $asf = Marpa::R2::ASF->new( { slr => $slr } );
  die 'No ASF' if not defined $asf;
  my $output_as_array = asf_to_basic_tree($asf);
  my $actual_output   = array_display($output_as_array);

The code for asf_to_basic_tree() represents a user-supplied call using the interface described below. An full example of ast_to_basic_tree(), which constructs a Perl array "tree", is given below. array_display() displays the tree in a compact form. The code for it is also given below. The return value of array_display() is as follows:

    Glade 2 has 2 symches
      Glade 2, Symch 0, pair ::= duple
        Glade 6, duple ::= item item
          Glade 8 has 2 symches
            Glade 8, Symch 0, item ::= Hesperus
              Glade 13, Hesperus ::= 'a'
                Glade 15, Symbol 'a': "a"
            Glade 8, Symch 1, item ::= Phosphorus
              Glade 1, Phosphorus ::= 'a'
                Glade 17, Symbol 'a': "a"
      Glade 2, Symch 1, pair ::= item item
        Glade 8 revisited

THIS INTERFACE is ALPHA and EXPERIMENTAL

The interface described in this document is very much a work in progress. It is alpha and experimental. The bad side of this is that it is subject to change from version to version in major ways, rapidly and without notice. The good side is that field is 100% open for users to have feedback into the final interface.

About this document

This document describes the low-level interface to Marpa's abstract syntax forests (ASF's). An abstract syntax forest, as the name suggests, is an abstract syntax tree, generalized to contain the multiple trees that result from an ambiguous parse.

This low-level interface allows the maximum flexiblity in building the forest, but requires the application to do much of the work. A higher-level interface is planned.

Getting around in a parse forest

An abstract syntax forest (ASF) is similar to an abstract syntax tree (AST), but it has an additional ability -- it can represent an ambiguous parse. An ASF is an efficient and practical way to represent multiple AST's. In their structure, ASF's have similarities to ordinary parse trees, but there are very important differences:

  • First, traversing many forests of practical interest requires care if your time is not going to go exponential. This issue exists with trees to some extent, but it is much more severe with parse forests.

  • Second, Marpa parses should not be relied on to be trees in the strict sense -- the underlying algorithm allows parses to contain cycles. Most likely the ability to create parses with cycles will come to the SLIF someday.

  • Third, for a tree there are only two kinds of nodes: rules (interior nodes) and tokens (leaf nodes). Forests also use nodes to represent the various kinds of ambiguity that a parse can encounter, and this makes traversing a forest somewhat more complex.

Ambiguity: factoring versus symches

Ambiguity in a parse can come in two forms, and Marpa's ASF's treat the distinction as important. An ambiguity can be a symbolic choice (symch), or a factoring. Symbolic choices are the kind of ambiguity that springs first to mind -- a choice between rules, or a choice between a rule and token. Factorings occur when only one rule applies, but that rule can divide the input in different ways. I'll give examples below.

Symches and factorings behave very differently:

  • Symches are less common than factorings.

  • Factorings are sometimes of no interest; symches are almost always of interest.

  • Symches almost always have only a small number of alternatives; factorings often have many, sometimes too many to manage.

  • The maximum number of symches is a reasonable finite constant.

  • The maximum number of factorings depends on the length of the string being factored, and can grow arbitrarily large.

Most applications will want to treat symches and factorings differently.

An example of a symch

Here's an example of a symch. The grammar is:

    :start ::= planet
    planet ::= hesperus
    planet ::= phosphorus
    hesperus ::= venus
    phosphorus ::= venus
    venus ~ 'venus'

For the input string 'venus', the forest would look like

    Symbol #0 planet has 2 symches
      Symch #0.0
      GL2 Rule 1: planet ::= hesperus
        GL3 Rule 3: hesperus ::= venus
          GL4 Symbol venus: "venus"
      Symch #0.1
      GL2 Rule 2: planet ::= phosphorus
        GL5 Rule 4: phosphorus ::= venus
          GL6 Symbol venus: "venus"

An example of a factoring

Now for an example of a factoring. Here's the grammar:

    :start ::= top
    top ::= b b
    b ::= a a
    b ::= a
    a ~ 'a'

For the input 'aaa', a parse from this grammar must have two b's, one short (a single 'a') and one long (two a's). But they can be in either order. This is a factoring. Here's Marpa's dump of the forest:

    GL2 Rule 1: top ::= b b
      Factoring #0
        GL3 Rule 3: b ::= a
          GL4 Symbol a: "a"
        GL5 Rule 2: b ::= a a
          GL6 Symbol a: "a"
          GL7 Symbol a: "a"
      Factoring #1
        GL8 Rule 2: b ::= a a
          GL9 Symbol a: "a"
          GL10 Symbol a: "a"
        GL11 Rule 3: b ::= a
          GL12 Symbol a: "a"

The structure of a forest

Representing ambiguity requires new kinds of nodes, nodes which were not necessary in an AST. Marpa's ASF divides its nodes into three types:

  • Glades

    Glade nodes represent tokens, as well as the individual symbols on the RHS of rules. (The term "glade" comes from the idea of a glade as a distinct place in a forest that is open to light.) A glade will have one or more symches as children.

    Each glade corresponds to a unique symbol and a unique span in the input. (A span is duple, consisting of a start location and a length.) For a token, the glade symbol will be the token symbol. For a rule, the glade symbol will be the LHS of the rule.

  • Symches

    Glades contain one or more symbolic choices, which I will call symches. Each symch is either a token or a rule. Since symches occur within a glade, and each glade has a unique symbol, only one of the symches in a glade can be a token symch. There can, however, be many rule symches in a glade -- one for every rule with the glade symbol on its LHS.

    Each symch contains one or more factorings. (The factoring of a token symch will always be trivial.) Symches may have one or more factorings omitted. A symch which omits factorings is said to be truncated. By default, at most 42 factorings are allowed in a symch.

  • Factorings

    As mentioned, a token factoring will always contain exactly one factor. For a rule, a factoring is a way of dividing up the span of the glade among its RHS symbols. Every rule factoring has one or more factors.

    For a rule symch, each "factor" of any of its factorings corresponds to a symbol instance on the RHS of the rule. Each such RHS factor, in fact, a glade. As a glade, each RHS factor will have a unique symbol and input span.

More terminology

Much of the terminology of ASF's has already been explained. The peak of a forest is its topmost node, which will be a glade node. The downglades of another glade, call it glade G, are those glades which are below it, and which have no other nodes except symches and factorings in between themselves and glade G. Glade G is called the upglade of its downglades.

If you can reach glade B from glade A by following a path of downglades, glade A is said to be upper with respect to A and glade B is said to be lower with respect to B.

A glade with no downglades is a terminal glade. A glade with no upglade must be the peak. Choices where there is only one alternative are generally called trivial. A symch is called trivial if it is the only one immediately below its glade. A factoring is called trivial if it is the only one immediately below its symch. A glade is called trivial if it has a trivial symch which has a trivial factoring.

Memoization

When traversing a forest, you need to take steps to avoid traversing the same subtree twice. Memoization can easily be the difference between an program which is unuseably slow even in small cases, and one which is very fast even for large inputs.

Repeated subtraversals happen which two upper glades share the same lower glade. This is a very frequent occurrence in ASF's. Additionally, the underlying algorithm allows cycles, and some day the SLIF will as well. Traversing a cycle will cause an infinite loop unless you memoize.

The example in this POD include memoization scheme which is very simple but adequate for most purposes. The main logic of it is shown here.

The Glade ID

Each glade has a glade ID. This can be relied on to be a non-negative integer. A glade ID may be zero. Glade ID's are obtained from the "peak()" and "factoring_downglades()" methods.

        my ( $asf, $glade, $seen ) = @_;
        return bless ["Glade $glade revisited"], 'My_Revisit'
            if $seen->[$glade];
        $seen->[$glade] = 1;

As you can see, memoization is not hard. Putting memoization in your very first drafts of code will save you time and trouble.

Constructor

new()

  my $asf = Marpa::R2::ASF->new( { slr => $slr } );
  die 'No ASF' if not defined $asf;

Creates a new ASF object. Must be called with a list of one or more hashes of named arguments. Current only one named argument is allowed, the slr argument, and that argument is required. The value of the slr argument must be a SLIF recognizer object.

Returns the new ASF object, or undef if there was a problem.

Forest Methods

These "forest" methods deal with the ASF as a whole, as compared to methods with a focus on specific glades, symches, factorings or factors.

grammar()

    my $grammar     = $asf->grammar();

Returns the SLIF grammar associated with the ASF. This can be convenient when using SLIF grammar methods while examining an ASF. All failures are thrown as exceptions.

peak()

    my $peak = $asf->peak();

Returns the glade ID of the peak. This may be zero. All failures are thrown as exceptions.

Glade Methods

glade_literal()

        my $literal = $asf->glade_literal($glade);

Returns the literal substring of the input associated with the glade. Every glade is associated with a span -- a start location in the input, and a length. On failure, throws an exception.

The literal is determined by the range. Marpa applications are not required to read the input monotonically, and the complications that arise for those which do not are addressed in "Literals and G1 spans" in Marpa::R2::Scanless::R.

glade_symch_count()

    my $symch_count = $asf->glade_symch_count($glade);

Requires a glade ID as its only argument. Returns the number of symches below the glade specified by the argument. On failure, throws an exception.

glade_symbol_id()

    my $symbol_id    = $asf->glade_symbol_id($glade);
    my $display_form = $grammar->symbol_display_form($symbol_id);

Requires a glade ID as its only argument. Returns the symbol ID of the "glade symbol" for the glade specified by the argument. The "glade symbol" always exsits. If the glade is for a token, it is the token symbol. If the glade is for a set of rules, it is the LHS that those rules share in common. On failure, throws an exception.

Symch Methods

symch_rule_id()

    my $rule_id = $asf->symch_rule_id( $glade, $symch_ix );

Requires two arguments: a glade ID and a zero-based symch index. Returns the rule ID of the symch specified by the two arguments. If the symch is for a token, -1 is returned. Returns a Perl undef, if the glade exists, but the symch index is too high. On other failure, throws an exception.

symch_is_truncated()

[ To be written. ]

symch_factoring_count()

    my $factoring_count =
        $asf->symch_factoring_count( $glade, $symch_ix );

Requires two arguments: a glade ID and a zero-based symch index. Returns numbers of factorings in the symch specified by the two arguments. Returns a Perl undef, if the glade exists, but the symch index is too high. On other failure, throws an exception.

Factoring Methods

factoring_downglades()

    my $downglades =
        $asf->factoring_downglades( $glade, $symch_ix,
        $factoring_ix );

Requires three arguments: a glade ID, the zero-based index of a symch and the zero-based index of a factoring. On success, returns a reference to an array. The array contains the glade IDs of the the downglades in the factoring specified by the arguments.

Returns a Perl undef, if the glade and symch exists, but the factoring index is too high. On other failure, throws an exception. In particular, exceptions are thrown if the symch is for a token; and if the glade exists, but the symch index is too high.

The code for the synopsis

The asf_to_basic_tree() code

    sub asf_to_basic_tree {
        my ( $asf, $glade ) = @_;
        my $peak = $asf->peak();
        return glade_to_basic_tree( $asf, $peak, [] );
    } ## end sub asf_to_basic_tree

    sub glade_to_basic_tree {
        my ( $asf, $glade, $seen ) = @_;
        return bless ["Glade $glade revisited"], 'My_Revisit'
            if $seen->[$glade];
        $seen->[$glade] = 1;
        my $grammar     = $asf->grammar();
        my @symches     = ();
        my $symch_count = $asf->glade_symch_count($glade);
        SYMCH: for ( my $symch_ix = 0; $symch_ix < $symch_count; $symch_ix++ ) {
            my $rule_id = $asf->symch_rule_id( $glade, $symch_ix );
            if ( $rule_id < 0 ) {
                my $literal      = $asf->glade_literal($glade);
                my $symbol_id    = $asf->glade_symbol_id($glade);
                my $display_form = $grammar->symbol_display_form($symbol_id);
                push @symches,
                    bless [qq{Glade $glade, Symbol $display_form: "$literal"}],
                    'My_Token';
                next SYMCH;
            } ## end if ( $rule_id < 0 )

            # ignore any truncation of the factorings
            my $factoring_count =
                $asf->symch_factoring_count( $glade, $symch_ix );
            my @symch_description = ("Glade $glade");
            push @symch_description, "Symch $symch_ix" if $symch_count > 1;
            push @symch_description, $grammar->rule_show($rule_id);
            my $symch_description = join q{, }, @symch_description;

            my @factorings = ($symch_description);
            for (
                my $factoring_ix = 0;
                $factoring_ix < $factoring_count;
                $factoring_ix++
                )
            {
                my $downglades =
                    $asf->factoring_downglades( $glade, $symch_ix,
                    $factoring_ix );
                push @factorings,
                    map { glade_to_basic_tree( $asf, $_, $seen ) } @{$downglades};
            } ## end for ( my $factoring_ix = 0; $factoring_ix < $factoring_count...)
            push @symches,
                bless [
                "Glade $glade, symch $symch_ix has $factoring_count factorings",
                @factorings
                ],
                'My_Factorings'
                if $factoring_count > 1;
            push @symches, bless [ @factorings[ 0, 1 ] ], 'My_Rule';
        } ## end SYMCH: for ( my $symch_ix = 0; $symch_ix < $symch_count; ...)
        return bless [ "Glade $glade has $symch_count symches", @symches ],
            'My_Symches'
            if $symch_count > 1;
        return $symches[0];
    } ## end sub glade_to_basic_tree

Because of the blessings in this example, a standard dump of the output array is too cluttered for comfortable reading. The following code displays the output array in a more compact form. This code actually works for all arrays and makes no use of Marpa, but it is included for completeness, and to provide a simple example of array traversal.

The array_display() code

    sub array_display {
        my ($array) = @_;
        my ( undef, @lines ) = @{ array_lines_display($array) };
        my $text = q{};
        for my $line (@lines) {
            my ( $indent, $body ) = @{$line};
            $indent -= 4;
            $text .= ( q{ } x $indent ) . $body . "\n";
        }
        return $text;
    } ## end sub array_display

    sub array_lines_display {
        my ($array) = @_;
        my $reftype = Scalar::Util::reftype($array) // '!undef!';
        return [ [ 0, $array ] ] if $reftype ne 'ARRAY';
        my @lines = ();
        ELEMENT: for my $element ( @{$array} ) {
            for my $line ( @{ array_lines_display($element) } ) {
                my ( $indent, $body ) = @{$line};
                push @lines, [ $indent + 2, $body ];
            }
        } ## end ELEMENT: for my $element ( @{$array} )
        return \@lines;
    } ## end sub array_lines_display

Details

This section contains some mathematical elaborations of the above. These details are segregated because they are not essential to using this interface, and while some readers find them more helpful than distracting, for others it is the reverse.

Why are token symches always trivial?

A token symch is always trivial: that is, it contains exactly one factoring. Proof: Each factoring is a division of the input among the factors. A token is a single symbol, and therefore a token factoring has exactly one factor. When there is only factor, there is only one way of dividing up the input among the factors. Therefore there is only one possible factoring, and a token symch must be trivial. QED.

An alternative way of defining glade terminology

Here's a way of defining some of the above terms which is more elegant, but also more mathematical. First, define the glade length from glades A to glade B in an ASF as the number of glades on the shortest path from A to B, not including glade A. (Recall that paths are directional.) If there is no path between glades A and B, the glade length is undefined. Glade B is a downglade of glade A, and a glade A is an upglade of glade B, if and only if the glade length from A to B is 1.

A glade A is upper with respect to glade B, and a glade B is lower with respect to glade A, if and only if the glade length from A to B is defined.

A peak of an ASF is a node without upglades. By construction of the ASF, there is only one peak. A glade is trivial if and only if it has exactly one downglade.

Copyright and License

  Copyright 2013 Jeffrey Kegler
  This file is part of Marpa::R2.  Marpa::R2 is free software: you can
  redistribute it and/or modify it under the terms of the GNU Lesser
  General Public License as published by the Free Software Foundation,
  either version 3 of the License, or (at your option) any later version.

  Marpa::R2 is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  Lesser General Public License for more details.

  You should have received a copy of the GNU Lesser
  General Public License along with Marpa::R2.  If not, see
  http://www.gnu.org/licenses/.