The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

Name

Marpa::R2::Scanless::R - Scanless interface recognizers

Synopsis

    my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );
    my $self = bless { grammar => $grammar }, 'My_Actions';
    $self->{recce} = $recce;
    local $My_Actions::SELF = $self;

    if ( not defined eval { $recce->read($p_input_string); 1 }
        )
    {
        ## Add last expression found, and rethrow
        my $eval_error = $EVAL_ERROR;
        chomp $eval_error;
        die $self->show_last_expression(), "\n", $eval_error, "\n";
    } ## end if ( not defined eval { $event_count = $recce->read...})

    my $value_ref = $recce->value();
    if ( not defined $value_ref ) {
        die $self->show_last_expression(), "\n",
            "No parse was found, after reading the entire input\n";
    }
    package My_Actions;

    our $SELF;
    sub new { return $SELF }

    sub do_parens    { shift; return $_[1] }
    sub do_add       { shift; return $_[0] + $_[2] }
    sub do_subtract  { shift; return $_[0] - $_[2] }
    sub do_multiply  { shift; return $_[0] * $_[2] }
    sub do_divide    { shift; return $_[0] / $_[2] }
    sub do_pow       { shift; return $_[0]**$_[2] }
    sub do_first_arg { shift; return shift; }
    sub do_script    { shift; return join q{ }, @_ }

About this document

This page is the reference document for the recognizer objects of Marpa's SLIF (Scanless interface).

Internal and external scanning

The Scanless interface is so-called because it does not require the application to supply a scanner (lexer). The SLIF contains its own lexer, one whose use is integrated into its syntax. In this document, use of the SLIF's internal scanner is called internal scanning.

The SLIF allows applications that find it useful to do their own scanning. When an application bypasses the SLIF's internal scanner and does its own scanning, this document calls it external scanning. An application can use external scanning to supplement internal scanning, or to replace the SLIF's internal scanner entirely.

Locations

Input stream locations

An input stream location is the offset of a codepoint in the input stream. When the input stream is being treated as a string, input stream location corresponds to Perl's pos() location. In this document, the word "location" refers to location in the input stream unless otherwise specified.

Negative locations

Several methods allow locations and lengths to be specified as negative numbers. A negative location is a location counted from the end, so that -1 means location before the last character of the string, -2 the location before the second to last character of a string, etc. A negative length indicates a distance to a location counted from the end. A length of -1 indicates the distance to the end of the string, -2 indicates the distance to the location just before the last character of the string, etc.

For example, suppose that we are dealing with input stream locations. The span (0, -1) is the entire input stream. The span (-1, -1) is the last character of input stream. The span (-2, -1) is the last two characters of the input stream. The span (-2, 1) is the second to last character of the input stream.

G1 locations

In addition to input stream location, the SLIF also tracks G1 location. G1 location starts at zero, and increases by exactly one as each lexeme is read. G1 location is usually not the same as input stream location. There is also a concept of G1 length, which is simply length calculated in terms of G1 locations.

G1 location can be ignored most of the time, but it does become relevant to a small degree when dealing with ambiguous terminals, and to a greater degree when tracing the G1 grammar. (For those more familiar with Marpa's internals, the G1 location is the G1 Earley set index.)

Current location

The SLIF tracks the current location in the input stream, more usually simply called the current location. Locations are zero-based, so that location 0 is the start of the input stream. A location is said to point to the character after it, if there is such a character. For example, location 0 points to the first character of the input stream, unless the stream is of zero length, in which case there is no first character.

A current location equal to the length of the input stream indicates EOS (end of stream). In a zero length stream, location 0 is EOS. The EOS location never points to a character.

In the SLIF, when the current input stream location moves, it does not necessarily advance -- it can skip forward, or can be positioned to an earlier location. The application can skip sections of the input stream. The application is also free to revisit spans of the input stream as often as it wants.

Here are the guarantees:

  • Initially, the current location is 0.

  • The current location will never be negative.

  • The current location will never be greater than EOS.

How internal scanning works

The SLIF always starts scanning using the read() method. Pedantically, this means scanning always begins with a phase of internal scanning. But that first phase may be of zero length, and after that, internal scanning does not have to be resumed.

Internal scanning can be resumed with the resume() method. Both the read() and resume() methods require the application to specify a span in the input stream. The read() method sets the input stream, and that input stream is the one used by all resume() method calls for that recognizer.

In what follows, the term "internal scanning method" refers to either the read() or the resume() method. After an internal scanning method, the current location will indicate how far in the input stream the internal scanning method actually read. If the internal scanning method was paused, the current location will be the one at which it paused. If the internal scanning method reaches EOS, the current location will be EOS. The return value of the read() and the resume() method is the current location.

The following is a list of the events which cause internal scanning to pause or to end. More than one of these events may occur at a location, in which case the one listed first is the one that applies.

  • An internal scanning method will pause when it encounters certain lexemes, as specified in the DSL using the pause adverb. The pause lexeme will be the one of those which caused the pause. (A pause can be caused by more than one lexeme.) The pause span will be the span of the input stream occupied by the pause lexeme.

  • Scanning always ends at EOS. Reaching EOS is not considered a pause.

  • If its $length argument was non-negative, internal scanning will pause when the current location is $start+$length. The pause lexeme and pause span will be undefined.

  • If its $length argument was negative, an internal scanning method will pause when the current location is $length + 1 + length $input_string. The pause lexeme and pause span will be undefined.

  • Named events pause internal scanning. Named events are described in the SLIF DSL. Named events will always have an event name, and may be queried using the the Scanless recognizer's event() method.

The input stream

For error message and other purposes, even external lexemes are required to correspond to a span of the input stream. An external scanner must set up a relationship to the input stream, even if that relationship is completely artificial.

One way to do this is to put a artificial preamble in front of the input stream. For example, the first 7 characters of the input stream could be an preamble containing the characters "NO TEXT". This preamble could be immediately followed by what is seen as the text from a more natural point of view. In this case, the initial call to the read() method could take the form $slr->read($input_string, 7). Lexemes corresponding to the artificial preamble would be read using a method call similar to $slr->lexeme_read($symbol_name, 0, 7, $value).

Constructor

    my $recce = Marpa::R2::Scanless::R->new( { grammar => $grammar } );

The new() method is the constructor for SLIF recognizers. The new() constructor accepts a hash of named arguments. The grammar named argument is required. All other named arguments are optional.

The following named arguments are allowed:

grammar

The new method is required to have a grammar named argument. Its value must be an SLIF grammar object.

ranking_method

The value must be a string: one of "none", "rule", or "high_rule_only". When the value is "none", Marpa returns the parse results in arbitrary order. This is the default. The ranking_method named argument is not allowed once evaluation has begun.

The "rule" and "high_rule_only" ranking methods allows the user to control the order in which parse results are returned by the value method, and to exclude some parse results from the parse series. For details, see the document on parse order.

too_many_earley_items

The too_many_earley_items argument is optional, and very few applications will need it. If specified, it sets the Earley item warning threshold to a value other than its default. If an Earley set becomes larger than the Earley item warning threshold, a recognizer event is generated, and a warning is printed to the trace file handle.

Marpa parses from any BNF, and can handle grammars and inputs which produce very large Earley sets. But parsing that involves very large Earley sets can be slow. Large Earley sets are something most applications can, and will wish to, avoid.

By default, Marpa calculates an Earley item warning threshold for the G1 recognizer based on the size of the G1 grammar, and for each G0 recognizer based on the size of the G0 grammar. The default thresholds will never be less than 100. The default is the result of considerable experience and almost all users will be happy with it.

If the Earley item warning threshold is changed from its default, the change applies to both G0 and G1 -- currently there is no way to set them separately. If the Earley item warning threshold is set to 0, no recognizer event is generated, and warnings about large Earley sets are turned off. An Earley item threshold warning almost always indicates a serious issue, and turning these warnings off will rarely be what an application wants.

trace_terminals

If non-zero, traces the lexemes -- those tokens passed from the G0 parser to the G1 parser. This named argument is the best way to follow what the G0 parser is doing, and it is also very helpful for tracing the G1 parser.

trace_values

This named argument is passed on to the G1 recognizer. See "trace_values" in Marpa::R2::Recognizer

trace_file_handle

The value is a file handle. Trace output and warning messages go to the trace file handle. By default the trace file handle is STDERR.

Basic mutators

read()

    $recce->read($p_input_string);
    $recce->read( \$string, 0, 0 );

Given a pointer to an input stream, read() parses it according to the grammar. Only a single call to read() is allowed for a scanless recognizer.

read() recognizes optional second and third arguments. The second argument is a location in the input stream at which internal scanning will start. The third argument is the length of the section of the input stream to be scanned before pausing. The default start location is zero. The default length is -1. Negative locations and lengths have the standard interpretation, as described above.

Start location and length can both be zero. This pauses internal scanning immediately and can be used to hand complete control of scanning over to an external scanner.

Completion named events can occur during the read() method. When a named event occurs, the read() method pauses. Named events can be queried using the Scanless recognizer's event() method. The read() method also pauses as specified with the Scanless DSL's pause adverb.

On failure, throws an exception. The call is considered successful if it ended because a parse was found, or because internal scanning was paused. On success, read() returns the location in the input stream at which internal scanning ended. This value may be zero.

value()

    my $value_ref = $recce->value();

Has the same effect as a value() call on the G1 recognizer. See "value()" in Marpa::R2::Recognizer.

Mutators for external scanning

activate()

        $slr->activate($_, 0) for @events;

The activate() method allows the recognizer to deactivate and reactivate named events. Named events allow the recognizer to stop for external scanning at conveniently defined locations. Named events can be defined for the prediction and completion of non-zero-length symbols, and nulled events can be defined to trigger when zero-length symbols are recognized.

The activate() method takes two arguments. The first is the name of an event, and the second (optional) argument is 0 or 1. If the argument is 0, the event is deactivated. If the argument is 1, the event is reactivated. An argument of 1 is the default. but, since an SLIF recognizer always starts with all defined events activated, 0 will probably be more common as the second argument to activate()

Location 0 events are triggered in the SLIF recognizer's constructor, before the activate() method can be called. This means that currently there is no way to deactivate location zero events.

The overhead imposed by events can be reduced by using the activate() method. But making many calls to the the activate() method purely for efficiency purposes will be counter-productive. Also, deactivated events still impose some overhead, so if an event is never used it should be commented out in the SLIF DSL.

lexeme_alternative()

            if ( not defined $recce->lexeme_alternative($token_name) ) {
                die
                    qq{Parser rejected token "$long_name" at position $start_of_lexeme, before "},
                    substr( $string, $start_of_lexeme, 40 ), q{"};
            }

The lexeme_alternative() method allows an external scanner to read ambiguous tokens. Most applications will prefer the simpler lexeme_read().

lexeme_alternative() takes one or two arguments. The first required argument, which is required, is the name of a symbol to be read at the current location. The second argument, which is optional, is the value of the symbol. The value argument is interpreted as described for lexeme_read().

Any number of tokens may be read using lexeme_alternative() without advancing the current location. This allows an application to use ambiguous tokens. To complete reading at a G1 location, and advance the current G1 location to the next G1 location, use the lexeme_complete() method.

On success, returns a non-negative number. Returns undef if the token was rejected. Failures are thrown as exceptions.

lexeme_complete()

            next TOKEN
                if $recce->lexeme_complete( $start_of_lexeme,
                        ( length $lexeme ) );

The lexeme_complete() method allows an external scanner to read ambiguous tokens. Most applications will prefer the simpler lexeme_read().

The lexeme_complete() method requires two arguments, a input stream start location and a length. These are interpreted as described for the corresponding second and third arguments to lexeme_read(). The lexeme_complete() method completes the reading of alternative tokens at the current G1 location, and advances the current G1 location by one. Current location in the input stream is moved to the location after the new lexeme, as indicated by the arguments.

Completion named events can occur during the lexeme_complete() method. Named events can be queried using the Scanless recognizer's event() method.

Return value: On success, lexeme_complete() returns a number greater than 0. On unthrown failure, lexeme_complete() returns 0.

lexeme_read()

   $re->lexeme_read('lstring', $start, $length, $value) // die;

The lexeme_read() method reads a single, unambiguous, lexeme. It takes four arguments, only the first of which is required. The first argument is the lexeme's symbol name. The second and third arguments specify the span in the input stream to be associated with the lexeme. The last argument indicates its value.

The second and third arguments are, respectively, the start and length of a span in the input stream. The start defaults to the current location. If the pause span is defined, and the start of the pause lexeme is the same as the current location, length defaults to the length of the pause span. Otherwise length defaults to -1.

Negative values are allowed and are interpreted as described above. This span will be treated as the section of the input stream that corresponds to the tokens read at the current location. This correspondence may be artificial, but a span must always be specified.

The fourth argument specifies the value of the lexeme. If the value argument is omitted, the token's value will be a string containing the corresponding substring of the input stream. Omitting the value argument does not have the same effect as passing an explicit Perl undef. If the value argument is an explicit Perl undef, the value of the lexeme will be a Perl undef.

    $slr->lexeme_read($symbol, $start, $length, $value)

is the equivalent of

    $slr->lexeme_alternative($symbol, $value)
    $slr->lexeme_complete($start, $length)

Current location in the input stream is moved to the place where read() paused or, if it never pauses, to $start+$length. Current G1 location is advanced by one.

Completion named events can occur during the lexeme_read() method. Named events can be queried using the Scanless recognizer's event() method.

Return value: On success, read() returns the new current location. This will never be location zero, because lexemes cannot be zero length. If the token was rejected, returns a Perl undef. On other unthrown failure, returns 0.

resume()

    my $re = Marpa::R2::Scanless::R->new( { grammar => $self->{grammar} } );
    my $length = length $string;
    for ( my $pos = $re->read(\$string); $pos < $length; $pos = $re->resume()) {
       my ($start, $length) = $re->pause_span();
       my $value = substr $string, $start+1, $length-2;
       $value = decode_string($value) if -1 != index $value, '\\';
       $re->lexeme_read('lstring', $start, $length, $value) // die;
    }
    my $value_ref = $re->value();
    return ${$value_ref};

The resume() method takes two arguments, a start location and a length. The default start location is the current location. The default length is -1. Negative arguments are interpreted as described above.

The resume() method resumes the SLIF's internal scanning, as described above.

Completion named events can occur during the resume() method. When a named event occurs, the resume() method pauses. Named events can be queried using the Scanless recognizer's event() method. The resume() method also pauses as specified with the Scanless DSL's pause adverb.

On success, resume() moves the current location to where it paused, or to the EOS. The return value is the new current location. On unthrown failure, resume() return a Perl undef.

Accessors

current_g1_location()

    my $current_g1_location = $recce->current_g1_location();

Returns the current G1 location.

event()

        EVENT: for (
            my $event_ix = 0;
            my $event    = $slr->event($event_ix);
            $event_ix++
            )
        {
            my ($name) = @{$event};
            if ( $name eq 'subtext' ) {
                push @actual_events, show_last_subtext($slr);
                next EVENT;
            }
        } ## end for ( my $event_ix = 0; my $event = $slr->event($event_ix...))

The event() method requires one argument, an event index. It returns a descriptor of the named event with that index, or a Perl undef if there is no such event.

The named event descriptor is a reference to an array of one or more elements. The first element of every named event descriptor is a string containing the name of the event. There may be other elements of a named event descriptor, as described for the type of named event. Named events are described in the SLIF DSL.

Events occur during the the Scanless recognizer's read(), resume(), lexeme_complete(), and lexeme_read() methods. Any subsequent call to an SLIF recognizer mutator may clear the list of triggered events, The assumption is that an application interested in events will call the event() method almost as soon as control is returned to it.

Named events are returned in order by type. Completion events are first. They are followed by the nulled events. These are in turn followed by prediction events. Within each type, the order of events is arbitrary.

Applications may find it convenient to turn specific events off, temporarily or permanently. Events may be activated or deactivated with the SLIF recognizer's activate() method.

g1_location_to_span()

        my ( $span_start, $span_length ) =
            $recce->g1_location_to_span($g1_location);

G1 locations do not correspond to a single input stream location, but to a span of them. The g1_location_to_span() method returns an array of two elements, representing a span in the input stream. The first element of the array is the input stream location where the span starts. The second element of the array is the length of the span. As a special case, the input stream span for G1 location 0 is always (0,0).

Sometimes it is convenient to think of G1 location as corresponding to a single input stream location. When this is the case, what is usually intended is the last input stream location of the span. The last input stream location of the span will always be $span_start+$span_length.

last_completed()

    sub show_last_expression {
        my ($self) = @_;
        my $recce = $self->{recce};
        my ( $g1_start, $g1_length ) = $recce->last_completed('Expression');
        return 'No expression was successfully parsed' if not defined $g1_start;
        my $last_expression = $recce->substring( $g1_start, $g1_length );
        return "Last expression successfully parsed was: $last_expression";
    } ## end sub show_last_expression
    my ( $g1_start, $g1_length ) = $recce->last_completed('Expression');

Given the name of a symbol, returns the start G1 location and the length in G1 locations of the most recent match. If there was more than one most recent match, it returns the longest. If there was no match, returns the empty array in array context and a Perl false in scalar context.

line_column()

       my ($start, $length) = $re->pause_span();
       my ($line, $column) = $re->line_column($start);

The line_column() method accepts one, optional, argument: a location in the input stream. The location defaults to the current location. line_column() returns the corresponding line and column position, as a 2-element array. The first element of the array is the line position, and the second element is the column position.

Numbering of lines and columns is 1-based, following UNIX editor tradition. A line is considered to end with any newline sequence as defined in the Unicode Specification 4.0.0, Section 5.8. Specifically, a line ends with one of the following:

  • a LF (line feed U+000A);

  • a CR (carriage return, U+000D), when it is not followed by a LF;

  • a CRLF sequence (U+000D,U+000A);

  • a NEL (next line, U+0085);

  • a VT (vertical tab, U+000B);

  • a FF (form feed, U+000C);

  • a LS (line separator, U+2028) or

  • a PS (paragraph separator, U+2029).

literal()

   my $literal_string = $re->literal($start, $length);

The literal() method accepts two arguments, the start location and length of a span in the input stream. It returns the substring of the input stream corresponding to that span.

pause_lexeme()

   my $lexeme = $re->pause_lexeme();

The pause_lexeme() method accepts no arguments, and returns the name of the lexeme which caused the most recent pause. The pause lexeme is initially undefined and it is reset to undefined at the beginning of each call to the read() or resume() methods.

More than one lexeme may cause a pause. When this is the case, all the causal lexemes will be acceptable to the G1 grammar, and all the causal lexemes will have the same lexeme priority. When more than one lexeme causes a pause, the choice of pause lexeme is arbitary. Applications may not rely on a particular choice, or on that choice being repeated, even when the choice is made in similar or identical circumstances.

Not every pause is caused by a lexeme. A pause often occurs because of the length argument of an internal scanning method. When the most recent pause was not caused by a lexeme, the pause lexeme is undefined. pause_lexeme() returns a Perl undef when the pause lexeme is undefined.

pause_span()

   my ($start, $length) = $re->pause_span();

The pause_span() method accepts no arguments, and returns the "pause span" as a 2-element array. The "pause span" is the start location and length of the lexeme which caused the most recent pause. The pause span is initially undefined and it is reset to undefined at the beginning of each call to the read() or resume() methods.

A pause is not always caused by a lexeme -- internal scanning may be paused because of the length argument of an internal scanning method. When the most recent pause was not caused by a lexeme, no span can be associated with it, and the pause span is undefined. pause_span() returns a Perl undef if the pause span is undefined.

progress()

    my $progress_output = $recce->progress();

Returns an array that describes the progress of a parse at a location. With no argument, progress() reports progress at the current location. If a G1 location is given as its argument, progress() reports progress at that G1 location. The G1 location may be negative. An argument of -X will be interpreted as location N+X+1, where N is the current G1 location. In other words, an argument of -1 indicates the current G1 location, an argument of -2 indicates the G1 location just before the current one, etc.

The SLIF progress method is the equivalent of the identically named progress() method for Marpa::R2::Recognizer objects in the standard interface. Details on progress reports can be found in their own document.

The progress reports returned by the progress() method identify rules by their G1 rule ID. G1 rules IDs can be converted to a list of the rule's symbols using the rule() method of the SLIF grammar.

show_progress()

    my $show_progress_output = $recce->show_progress();

Has the same effect as a show_progress() call on the G1 recognizer. See "show_progress()" in Marpa::R2::Recognizer. If locations are specified as arguments to show_progress(), they need to be G1 locations.

substring()

    my $last_expression = $recce->substring( $g1_start, $g1_length );

Given a G1 span -- that is, a G1 start location and a length in G1 locations -- the substring() method returns a substring of the input stream. A G1 length of zero will produce the zero-length string.

The substring of the input stream is determined on the assumption that the application, while reading the sequence of G1 locations from $g1_start to $g1_start+$g1_length, read forward smoothly in the input stream. This is the usual case, but applications are free to jump around in the input stream.

Here, in precise terms, is how substring() determines the string to be returned: The string will start at the first input stream location in the span for G1 location $g1_start+1. The end of the string will be at the last input stream location in the span for G1 location $g1_start+$g1_length. When an application moves backward in the input, the end of the string, as calculated above, may be before the start of the string. When the end of a string is before its start, substring() returns the zero-length string.

Applications are free to skip forward in the stream, jump backward, reread the same input multiple times, etc., etc. Applications which do such things, but which also wants to associate spans of G1 locations with the input stream, will need to reassemble the input based on their own ideas. The "literal()" method can assist in this process.

Deprecated methods

last_completed_range()

Use of this method is discouraged in favor of "last_completed()". Given the name of a symbol, last_completed_range() returns the G1 start and G1 end locations of the most recent match. If there was more than one most recent match, last_completed_range() returns the longest. If there was no match, last_completed_range() returns the empty array in array context and a Perl false in scalar context.

range_to_string()

Use of this method is discouraged in favor of "substring()". Given a G1 start and a G1 end location, range_to_string() returns the substring of the input stream that is between the two. The range_to_string() method assumes that the application read forward smoothly in the input stream, while reading the sequence of G1 locations. When that is not the case, range_to_string() behaves in much the same way as described above for "substring()".

Copyright and License

  Copyright 2013 Jeffrey Kegler
  This file is part of Marpa::R2.  Marpa::R2 is free software: you can
  redistribute it and/or modify it under the terms of the GNU Lesser
  General Public License as published by the Free Software Foundation,
  either version 3 of the License, or (at your option) any later version.

  Marpa::R2 is distributed in the hope that it will be useful,
  but WITHOUT ANY WARRANTY; without even the implied warranty of
  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
  Lesser General Public License for more details.

  You should have received a copy of the GNU Lesser
  General Public License along with Marpa::R2.  If not, see
  http://www.gnu.org/licenses/.