++ed by:
DRTECH AZAWAWI PERLANCAR DOUGDUDE

4 PAUSE users
1 non-PAUSE user.

Author image Jeffrey Kegler
and 1 contributors

NAME

Marpa::API::Recognizer - Marpa Recognizer Objects

SYNOPSIS

    my $recce = Marpa::Recognizer->new( { grammar => $grammar } );

    my @tokens = (
        [ 'Number', 42 ],
        [ 'Multiply', ],
        [ 'Number', 1 ],
        [ 'Add', ],
        [ 'Number', 7 ],
    );

    $recce->tokens( \@tokens );

DESCRIPTION

The Marpa::Recognizer::new constructor creates a recognizer object from a precomputed Marpa grammar object. The Marpa::Recognizer::tokens method recognizes tokens.

Pedantically, recognition and evaluation are distinct things. Recognition is determining whether there is a parse. Evaluation is determining the value of a parse. But pure recognition in the pedantic sense is of little practical interest. Marpa::Recognizer incorporates many evaluation functions.

In particular, Marpa::Recognizer has a built-in "Single Parse Evaluator", which works on unambiguous grammars. The Single Parse Evaluator also works for ambiguous parses if the grammar is not infinitely ambiguous. The Single Parse Evaluator only returns one parse. For ambiguous grammars, the Single Parse Evaluator arbitrarily selects one of the parses.

The Single Parse Evaluator is usually the best choice when there is only one parse. The Single Parse Evaluator is also usually the best choice for an application that only needs one parse, and is not fussy about which one.

Location

In traditional parsing, location is position in a token stream. This document will assume that Marpa is using the traditional, token-stream, model of the input, unless it states otherwise. Marpa supports other input models, which are discussed in a separate document.

The current parse location, or current earleme is the location at which the next input is expected. Intuitively, the current earleme can be thought of as the recognizer's current position. Marpa will have completed its tables (or Earley sets) out to the current earleme, so the current earleme is often referred to as the last completed earleme.

At times, this document will refer to the location of the furthest earleme. Users sticking to the token-stream model can ignore the furthest earleme -- for them it will always be the same as the current earleme. Alternative input models in which the furthest earleme is important, and the reasons for the term earleme itself, are presented in the separate document on alternative input models.

CONSTRUCTOR

new

The new method's arguments are references to hashes of named arguments. In the key/value pairs of these hashes, the key is the argument name, and the hash value is the value of the argument. The new method either returns a new parse object or throws an exception. Details of the named arguments are below.

MUTATORS

end_input

Indicates that input is finished. The end_input method takes no arguments. Calling end_input is not necessary in the recognizer's default mode. The end_input method returns a Perl true value on success. On failure, it throws an exception.

After a successful call to the end_input method, the current earleme will be the same as the furthest earleme. Any further calls to tokens will cause an exception to be thrown. The end_input method can only usefully be called once per recognizer. Subsequent calls to the end_input method have no effect and return a Perl true.

set

The set method's arguments are references to hashes of named arguments. It can be used to set or change named arguments after the recognizer has been created. Details of the named arguments are below.

strip

To optimize reading input and building tables, the recognizer creates several large internal data structures. These are no longer needed when input is finished.

The strip method removes the recognizer's internal data structures, leaving only the earley sets and other data that is needed for the evaluation phase. Stripping a recognizer greatly reduces the amount of memory it uses. Attempting to strip a recognizer before input is finished will cause an exception.

tokens

    my @tokens = (
        [ 'Number', 42 ],
        [ 'Multiply', ],
        [ 'Number', 1 ],
        [ 'Add', ],
        [ 'Number', 7 ],
    );

    $recce->tokens( \@tokens );

The tokens method takes two arguments. The first is a reference to an array of token descriptors. The second, optional, argument is a reference to an index into that array, used when the call is interactive.

Tokens descriptors are references to arrays. The elements of the token descriptor array are, in order, the token name, the token value, the token length and the token offset. The token name must be the name of a symbol which is valid as a terminal in the grammar. In a token descriptor, only the token name is required -- the other elements may omitted.

The token value can be any Perl scalar. If omitted, the token value is a Perl undef.

The token length and the token offset arguments are not used in the traditional, token-stream, model of the input. Their use is described in the document on alternative input models.

Parsing may reach a point where it is exhausted. "Exhausted" means that there is no parse at the current location, and no possible input could produce any parses at a subsequent location. If a recognizer is not exhausted, it is active.

In the evaluators, the default end of parsing is the end of input, and that means that if the evaluators are called using the default end of parsing, they will find no parses in an exhausted recognizer. The only way an exhausted recognizer can produce a valid parse is if end of parsing is set to a location before the end of input.

If the recognizer is exhausted, in scalar context, the tokens method returns undef. If the recognizer is exhausted, in array context, the tokens method returns an empty array.

If the recognizer is active, in scalar context, the tokens method returns the number of the current earleme. If the recognizer is active, in array context, the tokens method returns an array of two elements. The first element of the return array is the number of the current earleme. The second element is a reference to a list of expected tokens.

The list of expected tokens returned by the tokens method is used for prediction-driven lexing. It is an array of strings. The strings are the names of the symbols that will be valid as token names at the current parse location. For more detail on how to use this list, see the section on interactive input.

value

The value mutator implements the Single Parse Evaluator. It is described in its own section.

ACCESSOR

check_terminal

Returns a Perl true when its argument is the name of a terminal symbol. Otherwise, returns a Perl false. Not usually needed, but in unusual sitations a lexer may find this the easiest way to determine if a symbol is a terminal.

TRACE ACCESSORS

show_earley_sets

    print $recce->show_earley_sets()
        or Carp::croak "print failed: $OS_ERROR";

Returns a multi-line string listing every Earley item in every Earley set.

NAMED ARGUMENTS

grammar

The grammar named argument is required. Its value must be a precomputed Marpa grammar object.

mode

The mode named argument is optional. If present, it must be a string, either "default" or "stream".

In default mode (which is the default), only one call to the tokens method is allowed for a recognizer object. The input is automatically finished after that call, and calling the end_input method to indicate that input is finished is not necessary.

In stream mode, the tokens method may be called repeatedly. To indicate that input is finished, it is necessary to call the end_input method.

too_many_earley_items

The too_many_earley_items argument is optional. If specified, it sets the earley items warning threshold. If an earley set becomes larger than the earley items warning threshold, a warning is printed to the trace file handle.

Marpa parses from any BNF, and can handle grammars and inputs which produce large earley sets. But parsing that involves large earley sets can be slow. Large earley sets will be something most users can work around and will wish to avoid.

By default, the earley items warning threshold is set to a number calculated based on the size of the grammar, but never less than 100. If the earley items warning threshold is set to 0, warnings about large earley sets are turned off.

trace_earley_sets

A boolean. If true, causes each earley set to be written to the trace file handle as it is completed.

trace_file_handle

The value is a file handle. Trace output and warning messages go to the trace file handle. By default the trace file handle is inherited from the grammar used to create the recognizer.

trace_terminals

Traces terminals as they are accepted or rejected by the recognizer. Very handy in debugging, and often useful even when the problem is not in the lexing. Requires little knowledge of Marpa internals to interpret.

warnings

The value is a boolean. Warnings are written to the trace file handle. By default, the recognizer's warnings are on. Usually, an application will want to leave them on.

INTERACTIVE INPUT

    RECCE_RESPONSE: for ( my $token_ix = 0;; ) {

        my ( $current_earleme, $expected_tokens ) =
            $recce->tokens( \@tokens, \$token_ix );

        last RECCE_RESPONSE if $token_ix > $#tokens;

        fix_things( \@tokens, $expected_tokens )
            or Carp::croak(q{Don't know how to fix things});

    } ## end for ( my $token_ix = 0;; )

Marpa allows prediction-driven lexing. That is, Marpa can tell the lexer what symbols will be acceptable as tokens at the next location in the parse. This can be very useful.

Interactive input takes place, like all input, via the tokens method. When a second argument is given to the tokens method, it is interactive. Interactive calls to the tokens are only allowed in stream mode.

In interactive input, the first argument is a reference to a token descriptor array, and the second argument is a reference to a token descriptor index. The tokens method will read token descriptors starting at the token descriptor index. Elements in the token descriptor array before the token descriptor index are ignored.

In non-interactive calls, if the tokens method sees a token descriptor for an invalid token, it throws an exception. In interactive calls, tokens returns and leaves the token descriptor index at the location where it had a problem. If, in an interactive call, tokens reaches the end of the token descriptor array, the token descriptor index will be set to one past the end of the token descriptor array.

The behavior of the token descriptor index is designed to facilitate loops like the one in the example above. In that loop, the token descriptor index is initialized to zero, and is never directly changed after that. The loop ends which the token descriptor index points past the end of the token descriptor array.

The return values for interactive calls to tokens are the same as the return values for non-interactive calls. Interactive calls to tokens can be made in scalar context, but most often applications will want to use the list of expected tokens, which is only available in array context.

Marpa's HTML parser, Marpa::UrHTML, is an example of interactive input applied to a non-trivial, real-life application. When an interactive tokens call returns due to a problem with a token descriptor, Marpa::UrHTML tries to "fix" things in two ways.

First, it looks at the expected tokens list, trying to find a expected token that it can create as a "virtual" token. If Marpa::UrHTML finds an acceptable virtual token, it will create it "on the fly". Effectively, Marpa::UrHTML is supplying tokens that the parser wants but which are missing in the physical input. Marpa::UrHTML inserts virtual tokens into the input with separate, non-interactive, calls to tokens. After inserting the virtual tokens, Marpa::UrHTML loops back to the main, interactive, tokens call, resuming the original input at the point where it left off.

Second, and if the first technique fails, Marpa::UrHTML will change the next token descriptor in the token descriptor array so that it fits the parser's expectations. It then loops back to the main interactive call to tokens, again resuming where it left off.

The recognizer keeps no state for parts of the token descriptor array which it has yet to reach. This means that an application can be aggressive about adding input on the fly. An application is also free to change the portion of the token descriptors array which has yet to be read. Multiple interactive calls to tokens can be "in progress" at once. Those interactive calls can be interspersed with as many non-interactive calls as the user wishes.

SINGLE PARSE EVALUATOR

    my $value_ref = $recce->value;
    my $value = $value_ref ? ${$value_ref} : 'No Parse';

The Single Parse Evaluator is implemented as the value method call. Its arguments are zero or more hashes of named arguments. It returns a reference to the value of a parse, or undef if there was no parse.

If the parse is not ambiguous, the reference returned will be to the value of the only valid parse. If the parse is ambiguous, the reference returned will be to the value of a parse chosen arbitrarily.

These are the named arguments available to the value method call:

end

value's end named argument is the earleme (that is, the location) where the parse should end. The default is for the parse to end where the input did, so that the parse returned is of the entire input.

closure

value's closure named argument is a reference to a hash. In the key/value pairs of this hashes, the key must be an action name from the action properties of a rule; from the default_action named argument of the grammar; from the ranking_action property of a rule; or from the ranking_action property of a symbol. The value in the hashes key/value pair must be a CODE ref.

When an action name is specified in the closure named argument, the usual action resolution mechanism of the semantics is bypassed. The value of a hash entry in the closure named argument can be an anonymous subroutine. The most common use of the closure named argument is to allow anonymous subroutines to be semantics actions. For more details, see the document on semantics.

trace_actions

value's trace_values named argument is a boolean. If the boolean value is true, Marpa traces the resolution of semantic actions. A boolean value of false turns tracing off, which is the default. The traces are written to the trace file handle.

trace_values

value's trace_values named argument is a numeric trace level. If the numeric trace level is 1 or more, Marpa traces values as they are computed in the evaluation stack. A trace level of 0 turns value tracing off, which is the default. The traces are written to the trace file handle.

Tracing becomes more verbose as the levels increase. At a level of 3, trace_values becomes extremely verbose, writing out the entire stack every time it changes.

LICENSE AND COPYRIGHT

Copyright 2007-2010 Jeffrey Kegler, all rights reserved. Marpa is free software under the Perl license. For details see the LICENSE file in the Marpa distribution.