Name

Marpa::R3::External::Basic - External scanning, basics

Synopsis

    my @pause_location;
    my $recce = Marpa::R3::Recognizer->new(
        {
            grammar        => $parser->{grammar},
            event_handlers => {
                'before lstring' => sub {
                    ( undef, undef, undef, @pause_location ) = @_;
                    'pause';
                },

            }
        }
    );
    my $length = length $string;
    for (
        my $pos = $recce->read( \$string );
        $pos < $length;
        $pos = $recce->resume()
        )
    {
        my $start = $pause_location[1];
        my $lexeme_length = $pause_location[2];
        # The literal includes the surrounding quotes; the value does not.
        my $value = substr $string, $start + 1, $lexeme_length - 2;
        $value = decode_string($value) if -1 != index $value, '\\';
        $recce->lexeme_read_block( 'lstring', $value, undef, $start, $lexeme_length ) // die;
    } ## end for ( my $pos = $recce->read( \$string ); $pos < $length...)
    my $per_parse_arg = bless {}, 'MarpaX::JSON::Actions';
    my $value_ref = $recce->value($per_parse_arg);
    return ${$value_ref};

About this document

This page describes external scanning. By default, Marpa::R3 scans based on the L0 grammar in its DSL. This DSL-driven scanning is called internal scanning.

But many applications find it useful or necessary to do their own scanning in procedural code. In Marpa::R3 this is called external scanning. External scanning can be used as a replacement for internal scanning. Marpa::R3 also allows applications to switch back and forth between internal and external scanning.

Lexemes

In external scanning, the app controls tokenization directly. External scanning might also be called one-by-one scanning because, in external scanning, the app feeds lexemes to Marpa::R3 one-by-one. This differs from internal scanning -- in internal scanning Marpa::R3 tokenizes a string for the app.
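
As a rough illustration of what "one-by-one" means, the sketch below tokenizes a string in plain Perl and records each lexeme as a symbol name plus an offset and a length. The helper scan_spans() and the symbol names "number" and "op" are hypothetical, and no Marpa::R3 calls are made. An app doing external scanning would pass each such span to one of the methods described later in this document.

```perl
use strict;
use warnings;

# Hypothetical hand-written scanner for a toy arithmetic input.
# Returns a list of [ $symbol_name, $offset, $length ] triples.
sub scan_spans {
    my ($input) = @_;
    my @spans;
    pos($input) = 0;
    while ( pos($input) < length $input ) {
        next if $input =~ m/\G\s+/gc;    # skip whitespace between lexemes
        my $start = pos $input;
        if ( $input =~ m/\G\d+/gc ) {
            push @spans, [ 'number', $start, pos($input) - $start ];
        }
        elsif ( $input =~ m/\G[-+*\/]/gc ) {
            push @spans, [ 'op', $start, pos($input) - $start ];
        }
        else {
            die "No lexeme at offset $start\n";
        }
    }
    return @spans;
}
```

The /gc modifiers keep the match position in place when a pattern fails, so each alternative is tried at the same offset.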

Every lexeme has three things associated with it:

  1. A symbol name, which is required. The symbol name must be the name of a lexeme in both the L0 and G1 grammars. The symbol name tells the parser which symbol represents this lexeme to the Marpa semantics. The symbol name, in other words, connects the lexeme to the grammar.

  2. A symbol value, or simply a value, which may be undefined. The value of the lexeme is also seen by the semantics.

  3. A literal equivalent, which is required and must be a span in the input. The literal equivalent is needed for the messages produced by tracing, debugging, error reporting, etc. If more than one lexeme ends at the same G1 location -- which can happen if lexemes are ambiguous -- all of the lexemes must have the same literal equivalent.
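
The constraint on ambiguous lexemes can be pictured with a small check, written in plain Perl rather than against the Marpa::R3 API. The hash layout and the helper same_literal_equivalent() are assumptions made for this sketch only. Each lexeme carries its symbol name, its value, and its literal equivalent as a block span.

```perl
use strict;
use warnings;

# Hypothetical check: do all lexemes proposed at one G1 location
# share the same literal equivalent (block id, offset and length)?
sub same_literal_equivalent {
    my (@lexemes) = @_;
    my %seen;
    for my $lexeme (@lexemes) {
        my $span = join q{,}, @{$lexeme}{qw(block_id offset length)};
        $seen{$span} = 1;
    }
    return scalar( keys %seen ) <= 1;
}
```

Two lexemes with different symbol names, such as a keyword and an identifier, pass this check as long as they cover the same span of the input.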

High level and low level methods

In scanning externally, you can use high level or low level methods. The simpler methods, and the ones which most users will want, are the high level methods. They are described in this document. The low level external scanning methods are described in Marpa::R3::External::Low.

High level methods in general

The high level scanning methods have almost all of their behaviors in common. For convenience, therefore, the usual behaviors of the high level methods are described in this section, and exceptions to these behaviors are noted in the descriptions of the individual methods.

Every call of an external scanning method is made with a specified block span. How the block span is specified varies by method. Unless the call results in a hard failure, on return it leaves a valid current block span.

For the purposes of this section, let the specified block span be <$block_id, $offset, $length>. Also for the purposes of this section, we will define eolexeme, or "end of lexeme", as $offset + $length.

External scanning can succeed or fail. If external scanning fails, the failure may be hard or soft. The only soft failure that can occur in external scanning is the rejection of a lexeme.

Successful reads

An external scanning method reads a lexeme if it is successful and no pre-lexeme event occurs. In the case of a successful read, the current block is set to $block_id; the offset of the current block is set to eolexeme; and the eoread of the current block will not be changed.

Also in the case of a successful read, the current G1 location is advanced by one. The lexeme just read will start at the previous G1 location and end at the new current G1 location. When we speak simply of the G1 location of a lexeme, we refer to its end location, so that the G1 location of the lexeme is considered to be the new current G1 location.

A parse event may occur during a successful read. The parse event may call an event handler. The event handler will see the block and G1 locations as just described.

Pre-lexeme events

An external scanning method may succeed with a pre-lexeme event. In that case, the current block is set to $block_id. The offset of the current block is set to the event location, which will be the same as $offset. The eoread of the current block will not be changed.

If a pre-lexeme event does occur, no lexeme is read. The current G1 location will remain where it was before external scanning.

The pre-lexeme parse event may call an event handler. The event handler will see the block and G1 locations as just described.

Soft failure

If a high level external scanning method rejects a lexeme, then that method results in a soft failure. In this case the current block data remains unchanged.

On soft failure, no lexeme is read. The current G1 location will remain where it was before the method call.
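
The bookkeeping for the three outcomes described so far can be summarized in a toy model. This is plain Perl, not Marpa::R3 code: the hash keys and the helper apply_outcome() are hypothetical names chosen for this sketch, and the model says nothing about how Marpa::R3 stores this data internally.

```perl
use strict;
use warnings;

# Toy model (not Marpa::R3 code) of the current block data and the
# current G1 location after a high level external scanning call
# made with the block span <$block_id, $offset, $length>.
sub apply_outcome {
    my ( $state, $outcome, $block_id, $offset, $length ) = @_;
    my %new = %{$state};    # eoread is copied and never changed
    if ( $outcome eq 'read' ) {
        $new{block_id} = $block_id;
        $new{offset}   = $offset + $length;    # eolexeme
        $new{g1}       = $state->{g1} + 1;     # one lexeme was read
    }
    elsif ( $outcome eq 'pre-lexeme event' ) {
        $new{block_id} = $block_id;
        $new{offset}   = $offset;              # the event location
    }
    elsif ( $outcome ne 'soft failure' ) {
        die "Unmodelled outcome: $outcome\n";
    }

    # On soft failure nothing was modified, so the copy equals the input.
    return \%new;
}
```

For example, starting from offset 10 at G1 location 3, a successful read of the span <1, 10, 5> leaves the offset at 15 and the G1 location at 4, while a pre-lexeme event or a soft failure for the same span leaves both where they were.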

Hard failure

Any failure in external scanning, other than lexeme rejection, is a hard failure. In the case of a hard failure, no guarantee is made about the current block data, or about the current G1 location.

On hard failure, Marpa::R3 will attempt to leave the block location at an "error location" -- a location as relevant as possible to the error. Marpa::R3 will attempt to leave the current G1 location valid and unchanged.

High-level mutators

Most applications doing external scanning will want to use the high-level methods. The $recce->lexeme_read_string() method allows the reading of a string, where the string is both the literal equivalent of the lexeme and its value for semantics. The $recce->lexeme_read_literal() method is similar, but the string is specified as a block span.

The $recce->lexeme_read_block() method is the most general of the high-level external scanning methods. lexeme_read_block() allows the app to specify the literal equivalent and the value separately.

lexeme_read_block()

    my $ok = $recce->lexeme_read_block($symbol_name, $value,
        $main_block, $start_of_lexeme, $lexeme_length);
    die qq{Parser rejected lexeme "$long_name" at position $start_of_lexeme, before "},
      $recce->literal( $main_block, $start_of_lexeme, 40 ), q{"}
          if not defined $ok;

lexeme_read_block() is the basic method for external scanning. It takes five arguments, only the first of which is required. Call them, in order, $symbol_name, $value, $block_id, $offset, and $length.

The $symbol_name argument is the symbol name of the lexeme to scan. The $value argument will be the value of the lexeme. If $value is missing or undefined, the value of the lexeme will be a Perl undef. The $block_id, $offset, and $length arguments are the literal equivalent of the lexeme, as a block span. lexeme_read_block() is a high level method and details of its behavior are as described above.

Return values: On success, lexeme_read_block() returns the new current offset. Soft failure occurs if and only if the lexeme was rejected. On soft failure, lexeme_read_block() returns a Perl undef. Other failures are thrown as exceptions.
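
Since soft failure is reported as a Perl undef and other failures are thrown, an app may want to tell the three outcomes apart in one place. The wrapper below is one possible way to do that. The name try_read_block() and the outcome strings are hypothetical, and the only recognizer call used is lexeme_read_block() with the five arguments described above.

```perl
use strict;
use warnings;

# Hypothetical wrapper: classify the result of one lexeme_read_block()
# call as 'ok', 'soft failure' (lexeme rejected) or 'hard failure'.
sub try_read_block {
    my ( $recce, $symbol_name, $value, $block_id, $offset, $length ) = @_;
    my $new_offset;
    my $lived = eval {
        $new_offset = $recce->lexeme_read_block( $symbol_name, $value,
            $block_id, $offset, $length );
        1;
    };
    return ( 'hard failure', $@ ) if not $lived;    # exception was thrown
    return ( 'soft failure', undef ) if not defined $new_offset;
    return ( 'ok', $new_offset );                   # new current offset
}
```

A caller might try an alternative symbol name after a 'soft failure' and rethrow or report the message after a 'hard failure'.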

lexeme_read_literal()

    my $ok = $recce->lexeme_read_literal($symbol_name, $main_block, $start_of_lexeme, $lexeme_length);
    die qq{Parser rejected lexeme "$long_name" at position $start_of_lexeme, before "},
       $recce->literal( $main_block, $start_of_lexeme, 40 ), q{"}
           if not defined $ok;

lexeme_read_literal() takes four arguments, only the first of which is required. Call them, in order, $symbol_name, $block_id, $offset, and $length. The $symbol_name argument is the symbol name of the lexeme to scan. The $block_id, $offset, and $length arguments are the literal equivalent of the lexeme, as a block span. The value of the lexeme will be the same as its literal equivalent. lexeme_read_literal() is a high level method and details of its behavior are as described above.

    $recce->lexeme_read_literal($symbol_name, $block_id, $offset, $length)

is roughly equivalent to

    sub read_literal_equivalent_hi {
        my ( $recce, $symbol_name, $block_id, $offset, $length ) = @_;
        my $value = $recce->literal( $block_id, $offset, $length );
        return $recce->lexeme_read_block( $symbol_name, $value, $block_id, $offset, $length );
    }

Return values: On success, lexeme_read_literal() returns the new current offset. Soft failure occurs if and only if the lexeme was rejected. On soft failure, lexeme_read_literal() returns a Perl undef. Other failures are thrown as exceptions.
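
When the lexemes have already been located in an input block, feeding them to the recognizer is a simple loop over spans. The driver below is a sketch under stated assumptions: feed_spans() and the span layout are hypothetical, the grammar is assumed to declare no pre-lexeme events for these symbols, and a rejected lexeme is treated as fatal.

```perl
use strict;
use warnings;

# Hypothetical driver: feed precomputed spans to the recognizer one by one.
# Each element of @{$spans} is [ $symbol_name, $offset, $length ].
sub feed_spans {
    my ( $recce, $block_id, $spans ) = @_;
    my $count = 0;
    for my $span ( @{$spans} ) {
        my ( $symbol_name, $offset, $length ) = @{$span};
        my $new_offset =
          $recce->lexeme_read_literal( $symbol_name, $block_id, $offset,
            $length );

        # undef signals a soft failure: the lexeme was rejected
        die qq{Parser rejected lexeme "$symbol_name" at offset $offset\n}
          if not defined $new_offset;
        $count++;
    }
    return $count;    # number of lexemes read
}
```

Because lexeme_read_literal() takes the value from the span itself, the loop never needs to copy text out of the input block.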

lexeme_read_string()

    my $ok = $recce->lexeme_read_string( $symbol_name, $lexeme );
    die qq{Parser rejected lexeme "$long_name" at position $start_of_lexeme, before "},
      $recce->literal( $main_block, $start_of_lexeme, 40 ), q{"}
         if not defined $ok;

The lexeme_read_string() method takes 2 arguments, both required. Call them, in order, $symbol_name and $string. $symbol_name is the symbol name of the lexeme to be read. $string is a string which becomes both the value of the lexeme and its literal equivalent. lexeme_read_string() is a high level method and, with two important exceptions, the details of its behavior are as described above.

The first difference is that, on success, lexeme_read_string() creates a new input text block, using $string as its text. We'll call this block the "per-string block". The literal equivalent of the lexeme will be the per-string block, starting at offset 0 and ending at eoblock.

The second difference is that, after a successful call to lexeme_read_string(), the per-string block does not become the new current block. The current block data after a call to lexeme_read_string() will be the same as it was before the call to lexeme_read_string().

For most purposes, then, the per-string block is invisible to the app that called lexeme_read_string(). Apps which trace or keep track of the details of the input text blocks may notice the additional block. Also, event handlers which trigger during the lexeme_read_string() method will see the per-string block.

    $recce->lexeme_read_string($symbol, $string)

is roughly equivalent to

    sub read_string_equivalent_hi {
        my ( $recce, $symbol_name, $string ) = @_;
        my ($save_block) = $recce->block_progress();
        my $new_block = $recce->block_new( \$string );
        my $return_value = $recce->lexeme_read_literal( $symbol_name, $new_block );
        $recce->block_set($save_block);
        return $return_value;
    }

lexeme_read_string() is not designed for very long values of $string. For efficiency with long strings, use the equivalent in terms of lexeme_read_literal(), as just shown. lexeme_read_literal() sets the value of the lexeme to a span of an input text block, while lexeme_read_string() sets the value of the lexeme to a string. Marpa::R3 optimizes lexeme values when they are literals in its input text blocks.

Return values: On success, lexeme_read_string() returns the new current offset. Soft failure occurs if and only if the lexeme was rejected. On soft failure, lexeme_read_string() returns a Perl undef. Other failures are thrown as exceptions.

COPYRIGHT AND LICENSE

  Marpa::R3 is Copyright (C) 2018, Jeffrey Kegler.

  This module is free software; you can redistribute it and/or modify it
  under the same terms as Perl 5.10.1. For more details, see the full text
  of the licenses in the directory LICENSES.

  This program is distributed in the hope that it will be
  useful, but without any warranty; without even the implied
  warranty of merchantability or fitness for a particular purpose.