The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Marpa::R3::External::Low - Low-level external scanning

About this document

The alternative input models described in this document are an advanced technique. If you are starting out with Marpa, you probably want to ignore this document. If you are an experienced Marpa user, it is still safe to ignore this document, but you might find the possibilities it discusses interesting.

This is not the document to which to turn first. Before reading this document, you should be familiar with Marpa::R3::External::Basic.

The low level

The high level scanning methods read a lexeme and add that lexeme to the Earley sets in a single operation. This is very natural, and usually what the user wants. But sometimes the user wants to add more than one lexeme at a location, or to use lexemes of varying length. For this, the low level methods exist.

The low level methods separate the external scanning of a lexeme into two operations: lexeme_alternative() and lexeme_alternative_literal() read a lexeme and add it to the pending lexeme queue. lexeme_complete() adds the lexemes at the next (or "closest") location in the pending lexeme queue to the Earley sets. In the pending lexeme queue, location is measured in "earlemes".

Earlemes

Earlemes are an idea of location, used by the low-level scanning routines. Earlemes are named after Jay Earley, the inventor of the first algorithm in Marpa's lineage. Earlemes are not tied to physical location, and they are not tied to Earley set (G1) location, so they can be very flexible. Earlemes exist to provide a "middle" ground.

Intuitively, Earlemes behave the way that a zero-based idea of parse location should. Each Earley set has a unique earleme location, and higher numbered Earley sets are guaranteed to have higher numbered locations. Earley set 0 is always at earleme 0.

Technically, earleme location is an order-preserving function of Earley set ID. That implies that if Earley set N is at earleme X and Earley set N+1 is at earleme Y, then X<Y.

Earlemes also exist in the standard (default) input model, but in the standard input model, earleme location is always precisely the same as Earley set ID (aka G1 location). The standard input model is also called the lexeme-per-earleme model. Most of the Marpa::R3 documents assume the standard input model, and therefore have no need to mention earlemes.

Alternative input models

By explicitly using earlemes in the low level scanning methods, applications can use input models other than the standard one. An alternative input model is anything that is not the default, lexeme-per-earleme model. Marpa allows variable-length lexemes and ambiguous lexemes, and an alternative input model is any input model which

  • allows a lexeme whose length is not exactly 1, or

  • allows Earley sets which have more than one lexeme.

Many concepts, such as parsing location, parse ambiguity, parse exhaustion, and the end of parsing, become more complicated when alternative input models are involved. In the main document for the recognizer these concepts were explained on the assumption that the default input model was in use. This document revises those explanations as necessary to take into account the alternative input models.

The most popular alternative input model ties location to the physical input. That is the character-per-earleme model. The strict version of this would restrict physical input to a single block, with no jumps backward. This is the way most applications use physical input, but earlemes allow for a more complicated relationship between parse location and physical input.

There are many possibilities for alternative models. They are explored in detail in Marpa::R3::External::Model.

How low level scanning works.

Low level scanning divides scanning into two operations. First, it reads a lexeme and adds it to a queue (the "pending lexeme queue"). Second, it creates a new Earley set, pops a set of lexemes from the pending lexeme queue, and adds those lexemes to the Earley set, "completing" that Earley set.

Reading a lexeme

Whenever lexeme_alternative() reads a lexeme, it assigns that lexeme a start and an end earleme. The start earleme of a lexeme is the current earleme. The current earleme is the earleme of the current Earley set.

lexeme_alternative() specifies a lexeme length, either explicitly or implicitly. Lexeme length is always an integer greater than zero. The end earleme of a lexeme is the start earleme plus its length. In other words, if S is the start earleme, and L is the lexeme length, then the end earleme is S+L.

(Note that everything said in this section applies to the lexeme_alternative_literal() method as well. For brevity, we will speak only of the lexeme_alternative() method.)

When lexeme_alternative() reads a lexeme, it places it into the pending lexeme queue, according to its end earleme. The pending lexeme queue is a queue of pending lexeme sets. All the lexemes in a pending lexeme set share the same end earleme. Therefore we can think of, and sometimes will speak of, the pending lexeme queue as a queue of earlemes.

The pending lexeme queue is ordered from the closest earleme to the furthest earleme. The closest earleme is the earleme in the pending lexeme queue that is "closest" to the current earleme -- that is, it is the lowest numbered earleme in the queue. The furthest earleme is the earleme in the pending lexeme queue that is "furthest" from the current earleme -- in other words, it is the highest numbered earleme in the queue.

In addition to tracking the current earleme, the Marpa::R3 recognizer always has a closest and furthest earleme. As a special case, if the pending lexeme queue is empty, the closest and furthest earlemes are both the same as the current earleme.

Let a non-empty pending lexeme queue be

    lexset[1] ... lexset[n],

where n >= 1. Then lexset[1] is the set of lexemes at the closest earleme, and lexset[n] is the set of lexemes at the furthest earleme. If n = 1, then the closest and furthest earlemes are equal.

Completing the Earley set

When the lexeme_complete() method is called, it creates a new Earley set, and pops the closest pending lexeme set, adding those lexemes to the Earley set. It then "completes" the Earley set, fixing its context, so that no more lexemes can be added.

The earleme number of the new Earley set will be the end earleme of the lexemes that were added to it. It is a fatal error if lexeme_complete() is called when the pending lexeme queue is empty.

Pending lexemes and the standard input model

Note that the process described in this section also occurs in the standard input model, but in that case it is invisible. In the standard input model, if the current earleme is c, the lexeme being read starts at c. All lexemes in the standard input model have length 1, so that the lexeme will end at earleme c + 1.

The lexeme being read is added to the pending lexeme queue, which must be empty. (See "Alternative models and high level scanning".) A lexeme set for earleme c + 1 is created whose only element is the one lexeme.

Next, an Earley set is created. If the ID of the last Earley set was esid, the ID of the new Earley set will be esid + 1. The lexeme set for earleme c + 1 is popped from the pending lexeme queue, its lexeme is added to the new Earley set, and the new Earley set is completed. Only at this point might control be returned to the application using the standard input model.

In the standard input model, the earleme of an Earley set is always the same as its Earley set ID (aka G1 location). A simple mathematical induction shows this. Current earleme and Earley set ID both begin at 0. And in the standard input model, they are always incremented together.

Again, in the standard input model, this whole process occurs without returning control to the user. Therefore, from the point of view of an application using the standard input model, the pending lexeme queue is always empty, and the closest and furthest earlemes are always equal to the current earleme. Because, in the standard input model, the current earleme will always be the same as the Earley set ID, applications using the standard input model can ignore the pending lexeme queue. Applications can also treat current, closest, and furthest earleme as irrelevant.

Alternative models and high level scanning

A recognizer can alternate calls to the high level external scanning methods methods with calls to the low level scanning methods, with one restriction: If a low level scan at a particular earleme has been started with lexeme_alternative(), or lexeme_alternative_literal(), it must be completed with a call to lexeme_complete() before any high level scanning method can be called. Otherwise a fatal error results.

Methods

closest_earleme()

    my $closest_earleme = $recce->closest_earleme();

Returns the closest earleme.

current_earleme()

    my $current_earleme = $recce->current_earleme();

Returns the current earleme.

earleme()

  my $current_earley_set = $recce->g1_pos();
  $current_earleme = $recce->earleme($current_earley_set);

Given an Earley set ID as its argument, the earleme() recognizer method returns the corresponding earleme. Every integer in the range from 0 to the ID of the current Earley set is a valid Earley set ID, and every valid Earley set ID corresponds to an earleme. If the argument of earleme() is greater than the current Earley set ID, earleme() returns Perl undef.

There is currently no method that translates from earleme to Earley set. Earley set to earleme translation is a well-behaved one-to-one (injective) function in all input models -- for every Earley set there is a earleme, and every earleme is mapped to by at most one Earley set. Earleme to Earley set translation is far less well-behaved. In many input models, it is a partial function -- there are some earlemes that are in the valid range of earlemes but do not map to any Earley set. (See "Empty earlemes".)

As an alternative to the earleme() method, an application can implement its own Earley set to earleme mapping. This can allows an application to take advantage of what it knows about its choice of input model.

furthest_earleme()

    my $furthest_earleme = $recce->furthest_earleme();

Returns the furthest earleme.

g1_pos()

  my $current_earley_set = $recce->g1_pos();

The g1_pos() method returns the ID of the current Earley set. The full description of the g1_pos() method is elsewhere.

lexeme_alternative()

    my $ok = $recce->lexeme_alternative( $symbol_name, $value );
    if (not defined $ok) {
        my $literal = $recce->literal( $block_id, $offset, $length );
        die qq{Parser rejected symbol named "$symbol_name" },
            qq{at position $offset, before lexeme "$literal"};
    }
    $ok = $recce->lexeme_alternative( 'A', 42, 2 );

lexeme_alternative() takes up to three arguments. The first two are, in order, $symbol_name and $value. $symbol_name is required and must be the symbol name of the lexeme to be read starting at the current earleme. $value is optional, and specifies the value of the lexeme. If $value is missing, the value of the lexeme will be a Perl undef.

The third, optional, argument specifies the length of the lexeme in earlemes. Earleme length must be always be greater than or equal to 1. By default, the specified earleme length is 1.

The lexeme_alternative() method adds the lexeme described to the pending lexeme set for its end earleme, creating a new pending lexeme set if necessary. The end earleme is C+L, where C is the current earleme, and L is the specified earleme length.

lexeme_alternative() has a soft failure if it rejects $symbol_name. All other failures are hard failures.

Return values: Returns undef if the lexeme was rejected. On success, returns a value reserved for future use. The value on success will not necessarily be a Perl true, so that apps testing for rejection must test for a Perl undef explicitly. Failures are thrown as exceptions.

lexeme_alternative_literal()

    my $ok = $recce->lexeme_alternative_literal($symbol_name);
    die qq{Parser rejected lexeme "$long_name" at position $start_of_lexeme, before "},
        $recce->literal( $main_block, $start_of_lexeme, 40 ), q{"}
            if not defined $ok;
    $ok = $recce->lexeme_alternative_literal( 'A', 3 );

lexeme_alternative_literal() takes up to two arguments, call them $symbol_name and $earleme_length. $symbol_name is required and specifies the symbol name of a lexeme.

$earleme_length is optional, and specifies the earleme length of the lexeme. By default, the specified earleme length is 1. Earleme length must be always be greater than or equal to 1.

For the lexeme specified by lexeme_alternative_literal(), the value of the lexeme will be the same as its literal equivalent. This literal equivalent will be set by the next call to lexeme_complete().

The lexeme_alternative_literal() method adds the specified lexeme to the pending lexeme set for its end earleme, creating a new pending lexeme set if necessary. The end earleme is C+L, where C is the current earleme, and L is the specified earleme length.

lexeme_alternative_literal() has a soft failure if it rejects $symbol_name. All other failures are hard failures.

lexeme_alternative_literal() and lexeme_alternative() differ from each other only in their arguments, and in how they set the value of the lexeme.

Return values: Returns undef if the lexeme was rejected. On success, returns a value reserved for future use. The value on success will not necessarily be a Perl true, so that apps testing for rejection must test for a Perl undef explicitly. Failures are thrown as exceptions.

lexeme_complete()

    my $new_offset = $recce->lexeme_complete( $block_id, $offset, $length );

The lexeme_complete() method accepts three optional arguments. Call them, in order, $block_id, $offset and $length. These specify a block span. A zero $length is acceptable.

lexeme_complete() reads lexemes from the pending lexeme queue. It will read those lexemes into the pending lexeme set for the closest earleme. If the pending lexeme queue is empty, a hard failure is thrown.

The specified block span is used to set the literal equivalent for the set of alternative lexemes completed by the lexeme_complete() call. If $block_id is not defined, the specified block is the current block. If $offset is not defined, the specified offset is the current offset of the specified block. If $length is not defined, the specified length is the end-of-block of the specified block, less the specified offset -- in other words, the specified block span includes the entire remaining specified block.

Return values: On success, lexeme_complete() returns the new current location. All failures in lexeme_complete() are hard failures. Failure is always thrown.

Catching up

An application "catches up" to the furthest earleme by calling the lexeme_complete method until the furthest earleme is equal to the current earleme. In the standard input model, the furthest earleme is always equal to the current earleme, so the need to "catch up" never arises.

Circumstances in which an application may want to catch up to the furthest earleme include finding the intended end of parsing, ensuring all desired events are generated, and ensuring that a parse is exhausted.

End of parsing

The default end of parsing is always at the current earley set, and therefore at the current earleme. This means that lexemes in the pending lexeme queue will be ignored in evaluating the parse.

"Ignoring" lexemes may be the intended behavior. If not, the application must ensure that processing of input catches up to the furthest earleme.

Parse exhaustion

A parse is never exhausted as long as there are lexemes in the pending lexeme queue. In the standard input model, there never are lexemes in the pending lexeme queue, so that a parse is exhausted if there are no acceptable lexemes at the current Earley set. (The current Earley set is always at the current earleme.)

In alternative input models, there may be lexemes in the pending lexeme queue. Therefore, in alternative input models, a parse may find no acceptable lexemes at the current earleme, but still not be exhausted. Applications which need to ensure that a parse is run until exhaustion will need to make sure that they catch up to the furthest earleme.

Ambiguous lexing

Marpa allows ambiguous lexemes. Several Marpa lexemes can start at a single parsing location. Ambiguous lexemes can be of various lengths. Lexemes can also overlap.

Potentially ambiguous lexing occurs when more than one lexeme ends at a single earleme. When potentially ambiguous lexing occurs, it becomes possible for there to be more than one sequence of lexemes.

An actual lexical ambiguity only occurs if more than one of the potential lexeme sequences is consistent with the grammar. If there is no actual lexical ambiguity, Marpa will use the only lexeme choice that is consistent with the grammar.

When lexing is actually ambiguous, Marpa will use all the alternatives consistent with the grammar. When the lexing in a parse is actually ambiguous, the parse will be ambiguous. From the point of view of Marpa's semantics, ambiguities caused by lexing look the same as ambiguities caused by an ambiguous grammar.

In the standard terminology, if a grammar produces more than one parse tree for any input, then that grammar must be ambiguous. In Marpa this is not strictly true. In Marpa, if the input is ambiguous, even an unambiguous grammar can produce more than one parse.

Duplicate lexemes

A duplicate lexeme is a lexeme of the same type and the same length as another lexeme that was read at the same earleme. Duplicate lexemes are impossible in the standard input model. This is because in the standard input model only one lexeme can be read at each earleme.

In alternative models, more than one lexeme may be read at an earleme, and duplicates are possible. Marpa detects duplicate lexemes and treats them as "hard errors" -- Marpa throws an exception when it sees a duplicate lexeme. Marpa's assumption is that duplicate lexemes indicate an error at the application level.

An application can retry input after a duplicate lexeme, if it catches the exception. In the future, if recovery from duplicate lexemes is found to be a useful technique, Marpa may provide an option to change its behavior, so that a soft failure is returned when there is a duplicate lexeme.

Empty earlemes

As mentioned, there is a function from Earley sets to earlemes. In the standard input model, this function has an inverse -- there is a function from earlemes to Earley sets.

In alternative input models, some earlemes may not correspond to an Earley set, so there is not necessarily a function from earlemes to Earley sets.

In particular, in the character-per-earleme model, every character is treated as being exactly one earleme in length. If a lexeme is more than one character in length, that lexeme will span earlemes. The earlemes in between the start earleme of the lexeme and the end earleme of the lexeme may not correspond to any Earley set -- they may be "empty".

An one example, consider a straightforward character-per-earleme implementation of a grammar for a language that allows comments. No lexemes will start at any earlemes which correspond to character locations inside the comment, so there will be no Earley sets at those earlemes. The earlemes between the start and end earlemes of the comment will be "empty".

Equivalents of high level methods

Low level equivalent of lexeme_read_block()

In terms of low-level external scanning methods,

    $recce->lexeme_read_block($symbol, $start, $length, $value)

is roughly equivalent to

    sub read_block_equivalent {
        my ( $recce, $symbol_name, $value, $block_id, $offset, $length ) = @_;
        return if not defined $recce->lexeme_alternative( $symbol_name, $value );
        return $recce->lexeme_complete( $block_id, $offset, $length );
    }

Low level equivalent of lexeme_read_literal()

In terms of low-level external scanning methods, lexeme_read_literal() is roughly equivalent to

    sub read_literal_equivalent_lo {
        my ( $recce, $symbol_name, $block_id, $offset, $length ) = @_;
        return if not defined $recce->lexeme_alternative_literal( $symbol_name );
        return $recce->lexeme_complete( $block_id, $offset, $length );
    }

Low level equivalent of lexeme_read_string()

In terms of low-level external scanning methods, lexeme_read_string() is roughly equivalent to

    sub read_string_equivalent_lo {
        my ($recce, $symbol_name, $string) = @_;
        my ($save_block) = $recce->block_progress();
        my $lexeme_block = $recce->block_new( \$string );
        return if not defined $recce->lexeme_alternative( $symbol_name, $string );
        my $return_value = $recce->lexeme_complete( $lexeme_block );
        $recce->block_set($save_block);
        return $return_value;
    }

The example just above shows the value of the lexeme being set to a string in the lexeme_alternative() call. Note that this is not efficient for very long strings.

Copyright and License

  Copyright 2018 Jeffrey Kegler
  This file is part of Marpa::R3.  Marpa::R3 is free software: you can
  redistribute it and/or modify it under the terms of the GNU Lesser
  General Public License as published by the Free Software Foundation,
  either version 3 of the License, or (at your option) any later version.

  Marpa::R3 is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  Lesser General Public License for more details.

  You should have received a copy of the GNU Lesser
  General Public License along with Marpa::R3.  If not, see
  http://www.gnu.org/licenses/.