Marpa::Recognizer - Marpa Recognizer Objects
my $recce = Marpa::Recognizer->new( { grammar => $grammar } ); my @tokens = ( [ 'Number', 42 ], [ 'Multiply', ], [ 'Number', 1 ], [ 'Add', ], [ 'Number', 7 ], ); $recce->tokens( \@tokens );
The Marpa::Recognizer::new constructor creates a recognizer object from a precomputed Marpa grammar object. The Marpa::Recognizer::tokens method recognizes tokens.
Marpa::Recognizer::new
Marpa::Recognizer::tokens
Pedantically, recognition and evaluation are distinct things. Recognition is determining whether there is a parse. Evaluation is determining the value of a parse. In practice the two are usually intertwined. A lot of what Marpa::Recognizer does can be regarded as "evaluation".
In particular, Marpa::Recognizer has a built-in Single Parse Evaluator, which works on unambiguous grammars. The Single Parse Evaluator also works for ambiguous parses if the grammar is not infinitely ambiguous. The Single Parse Evaluator only returns one parse. For ambiguous grammars, the Single Parse Evaluator arbitrarily selects a parse.
The Single Parse Evaluator is usually the best choice when there is only one parse. The Single Parse Evaluator is also usually the best choice for an application that only needs one parse, and is not fussy about which one.
In traditional parsing, location is position in a token stream. This document will assume that Marpa is using the traditional, token-stream, model of the input, unless it states otherwise. Marpa supports other input models, which are discussed in a separate document.
The current parse location, or current earleme, is the location at which the next input is expected. Intuitively, the current earleme can be thought of as the recognizer's current position. Alternative input models, and the reasons for the term earleme itself, are presented in the document on alternative input models.
Recognition may reach a point where the recognizer is exhausted. A recognizer is exhausted if there is no parse at the current location, and if no possible additional input could produce parses at any subsequent location. If a recognizer is not exhausted, it is active.
The evaluators look for a parse from the parse start location to the parse end location. By default, the parse start location is the beginning of the input. By default, the parse end location is the end of the input. In other words, the default is to parse the entire input.
When an evaluator is called using the default parse end location, the evaluator will find no parses in an exhausted recognizer. It may be possible for an exhausted recognizer to produce a valid parse when the parse end location is set to a location before the end of input.
Failed parses do not necessarily show up in the recognizer. Despite its name, Marpa's recognizer does not actually indicate whether or not it recognized its input. Recognition is not decided until an evaluator is called. There are two reasons for this.
The location of the end of parsing is not known for certain until the evaluator is called. Whether an input is recognized or not cannot be decided until the parse end location is known.
Recognition for its own sake is an academic issue. Applications are concerned about whether there were any parse results. And producing parse results is the job of the evaluators.
The new method's arguments are references to hashes of named arguments. In the key/value pairs of these hashes, the key is the argument name, and the hash value is the value of the argument. The new method either returns a new recognizer object or throws an exception. The named arguments are described in a section below.
new
Indicates that input is finished. The end_input method takes no arguments. Calling end_input is not useful in the recognizer's default mode. The end_input method returns a Perl true value on success. On failure, it throws an exception.
end_input
default
In stream mode, the end_input method is needed to indicate the end of input. Calling end_input multiple times on the same recognizer object is harmless, but useless. The second and subsequent calls will return a Perl true, but will have no effect. Once end_input has been called, calls to tokens on the same recognizer object will cause an exception to be thrown.
stream
tokens
The set method's arguments are references to hashes of named arguments. The set method can be used to set or change named arguments after the recognizer has been created. Details of the named arguments are below.
set
The recognizer generates a considerable amount of internal data while building tables. These internal data are no longer needed when input is finished and the recognizer's tables are complete.
The strip method deletes much of the recognizer's internal data, but leaves the tables and other data needed for the evaluation phase. Stripping a recognizer greatly reduces the amount of memory it uses. Attempting to strip a recognizer before input is finished will cause an exception.
strip
my @tokens = ( [ 'Number', 42 ], [ 'Multiply', ], [ 'Number', 1 ], [ 'Add', ], [ 'Number', 7 ], ); $recce->tokens( \@tokens );
The tokens method takes two arguments. The first is a reference to an array of token descriptors. The second, optional, argument is a reference to an index into that array, used when the call is interactive.
Tokens descriptors are also references to arrays. The elements in each token descriptor are, in order:
Token name
Required. Must be the name of a valid terminal symbol. For details about terminal symbols, see "Terminals" in Marpa::Grammar.
Token value
Optional. Can be any Perl scalar. If omitted, the token value is a Perl undef.
undef
Token length
Optional, and not used in the traditional, token-stream, model of the input. Token length is described in the document on alternative input models.
Token offset
Optional, and not used in the traditional, token-stream, model of the input. Token offset is described in the document on alternative input models.
In scalar context, if the recognizer remains active, the tokens method returns the number of the current earleme. In array context, if the recognizer remains active, the tokens method returns an array of two elements. The first element of the returned array is the number of the current earleme. The second element is a reference to a list of expected tokens.
The list of expected tokens returned by the tokens method is used for prediction-driven lexing. It is an array of strings. The strings are the names of the symbols that are valid as token names at the current parse location. For detail on how to use the list of expected tokens, see the section on interactive input.
When a call to the tokens method in scalar context exhausts the recognizer, the tokens method returns undef. When a call to the tokens method in array context exhausts the recognizer, the tokens method returns the empty array.
The value mutator implements the Single Parse Evaluator. It is described in its own section.
Returns a Perl true when its argument is the name of a terminal symbol. Otherwise, returns a Perl false. Not usually needed, but in unusual sitations a lexer may find this the easiest way to determine if a symbol is a terminal.
print $recce->show_earley_sets() or Carp::croak "print failed: $OS_ERROR";
Returns a multi-line string listing every Earley item in every Earley set.
The grammar named argument is required. Its value must be a precomputed Marpa grammar object.
grammar
The mode named argument is optional. If present, it must be a string, either "default" or "stream".
mode
In stream mode, the tokens method may be called repeatedly. To indicate that input is finished, it is necessary to call the end_input method.
In default mode, only one call to the tokens method is allowed for each recognizer object. The input is automatically finished after that one call. In default mode, the end_input method is not useful. As the name indicates, default mode is the default.
The too_many_earley_items argument is optional. If specified, it sets the earley item warning threshold. If an earley set becomes larger than the earley item warning threshold, a warning is printed to the trace file handle.
too_many_earley_items
Marpa parses from any BNF, and can handle grammars and inputs which produce large earley sets. But parsing that involves large earley sets can be slow. Large earley sets are something most users can, and will wish to, avoid.
By default, Marpa calculates an earley item warning threshold based on the size of the grammar. This default, calculated, threshold will never be less than 100. If the earley item warning threshold is set to 0, warnings about large earley sets are turned off. For details about earley sets, see the implementation document.
A boolean. If true, causes each earley set to be written to the trace file handle as it is completed. For details about earley sets, see the implementation document.
The value is a file handle. Traces and warning messages go to the trace file handle. By default the trace file handle is inherited from the grammar used to create the recognizer.
Traces terminals as they are accepted or rejected by the recognizer. Very handy in debugging, and often useful even when the problem is not in the lexing. Requires little knowledge of Marpa internals to interpret.
The value is a boolean. Warnings are written to the trace file handle. By default, the recognizer's warnings are on. Usually, an application will want to leave them on.
RECCE_RESPONSE: for ( my $token_ix = 0;; ) { my ( $current_earleme, $expected_tokens ) = $recce->tokens( \@tokens, \$token_ix ); last RECCE_RESPONSE if $token_ix > $#tokens; fix_things( \@tokens, $expected_tokens ) or Carp::croak(q{Don't know how to fix things}); } ## end for ( my $token_ix = 0;; )
Marpa allows prediction-driven lexing. That is, Marpa can tell the lexer which symbols are acceptable as tokens at the next location in the parse. This can be very useful.
Interactive input takes place, like all input, via the tokens method. If a call of the tokens method passes a second argument, that call of the tokens method is interactive. Interactive calls to the tokens method are allowed only on recognizers that are in stream mode.
In interactive input, the first argument is a reference to a token descriptor array, and the second argument is a reference to a token descriptor index. The tokens method will read token descriptors starting at the token descriptor index. Token descriptors before the token descriptor index are ignored.
In interactive calls, if a token descriptor has a token name that is not a valid terminal, the tokens method returns, leaving the token descriptor index where it was when tokens had the problem. (In non-interactive calls, an invalid token name causes an exception to be thrown.) If tokens reads the entire token descriptor array without a problem, the token descriptor index will be set to one past the end of the token descriptor array.
The behavior of the token descriptor index is designed to facilitate loops like the one in the example above. In that loop, the token descriptor index is initialized to zero, and is never directly modified after that. The loop ends when the token descriptor index points past the end of the token descriptor array.
The return values for interactive calls to tokens are the same as the return values for non-interactive calls. Interactive calls to tokens can be made in scalar context, but most often applications will want to use the list of expected tokens, which is only returned in array context.
Recognizers in stream mode are very flexible. Both interactive and non-interactive tokens method calls can be made to the same recognizer. A single recognizer can read input from any number of token descriptor arrays. Between tokens's calls, the application is allowed to edit the token descriptor array, and to change the token descriptor index.
This flexiblity is possible because the recognizer is stateless with respect to token descriptor arrays and their token descriptor indexes. The tokens method does not keep track of which token descriptor arrays it has seen. It does not keep track of which token descriptors it has already read. It does not keep track of where the token descriptor indexes have been in the past. Statelessness means that, regardless of what an application does with its token descriptor arrays, it cannot make them inconsistent with the recognizer's internal data.
Marpa's HTML parser, Marpa::UrHTML, is an example of how this flexibility can help with a non-trivial, real-life application. When an interactive tokens call returns due to an invalid token name, Marpa::UrHTML tries to fix things in two ways.
First, it looks at the expected tokens list for the name of a token that will work as a "virtual" token. If Marpa::UrHTML finds an acceptable virtual token name, it will create a token descriptor for it "on the fly". In effect, Marpa::UrHTML supplies tokens that the parser wants but which are missing in the physical input. Marpa::UrHTML inserts the virtual tokens into the input using non-interactive tokens method calls. After inserting the virtual tokens, Marpa::UrHTML loops back to the main, interactive, tokens call, resuming the original input at the point where it left off.
Second, under some circumstances Marpa::UrHTML will change the next token descriptor in the token descriptor array to fit the parser's expectations. Marpa::UrHTML then loops back to the main interactive call to tokens. When the main interactive call to tokens resumes, the token descriptor index is unchanged, but points to an altered token descriptor.
my $value_ref = $recce->value; my $value = $value_ref ? ${$value_ref} : 'No Parse';
The Single Parse Evaluator is implemented as the value method call. Its arguments are zero or more hashes of named arguments. It returns a reference to the value of a parse, or undef if there was no parse.
value
If the parse is not ambiguous, the reference returned will be to the value of the only valid parse. If the parse is ambiguous, the reference returned will be to the value of a parse chosen arbitrarily.
These are the named arguments available to the value method call:
The value method's end named argument specifies the parse end location. The default is for the parse to end where the input did, so that the parse returned is of the entire input.
end
The value method's closures named argument is a reference to a hash. In the key/value pairs of this hash, the key must be an action name. The hash value must be a CODE ref.
closures
Sources of action names include
The action properties of rules
action
The default_action named argument of grammars
default_action
The lhs properties of rules
lhs
The ranking_action properties of rules
ranking_action
The ranking_action properties of symbols
For its new method, the action_object named argument of grammars
action_object
When an action name is a key in the closures named argument, the usual action resolution mechanism of the semantics is bypassed. A common use of the closures named argument is to allow anonymous subroutines to be semantic actions. For more details, see the document on semantics.
The value method's trace_actions named argument is a boolean. If the boolean value is true, Marpa traces the resolution of action names to Perl closures. A boolean value of false turns tracing off, which is the default. Traces are written to the trace file handle.
trace_actions
The value method's trace_values named argument is a numeric trace level. If the numeric trace level is 1, Marpa traces values as they are computed in the evaluation stack. A trace level of 0 turns value tracing off, which is the default. Traces are written to the trace file handle.
trace_values
Copyright 2007-2010 Jeffrey Kegler, all rights reserved. Marpa is free software under the Perl license. For details see the LICENSE file in the Marpa distribution.
To install Marpa, copy and paste the appropriate command in to your terminal.
cpanm
cpanm Marpa
CPAN shell
perl -MCPAN -e shell install Marpa
For more information on module installation, please visit the detailed CPAN module installation guide.