
NAME

Marpa::Recognizer - Marpa Recognizer Objects

SYNOPSIS

    my $recce = Marpa::Recognizer->new( { grammar => $grammar } );

    my $fail_offset = $recce->text('2-0*3+1');
    if ( $fail_offset >= 0 ) {
        Marpa::exception("Parse failed at offset $fail_offset");
    }

    my $recce = Marpa::Recognizer->new( { grammar => $grammar } );

    my $op     = $grammar->get_symbol('Op');
    my $number = $grammar->get_symbol('Number');

    my @tokens = (
        [ $number, 2,    1 ],
        [ $op,     q{-}, 1 ],
        [ $number, 0,    1 ],
        [ $op,     q{*}, 1 ],
        [ $number, 3,    1 ],
        [ $op,     q{+}, 1 ],
        [ $number, 1,    1 ],
    );

    TOKEN: for my $token (@tokens) {
        next TOKEN if $recce->earleme($token);
        Marpa::exception( 'Parsing exhausted at character: ', $token->[1] );
    }

    $recce->end_input();

DESCRIPTION

Marpa parsing takes place in three major phases: grammar creation, input recognition and parse evaluation. Once a grammar object has rules, a recognizer object can be created from it. The recognizer accepts input and can be used to create a Marpa evaluator object.

Tokens and Earlemes

Marpa allows ambiguous tokens. Several Marpa tokens can start at a single parsing location. Marpa tokens can be of various lengths. Marpa tokens can even overlap.

For most parsers, position is location in a token stream. To deal with variable-length and overlapping tokens, Marpa needs a more flexible idea of location. Marpa's idea of position is location in an earleme stream. Earlemes are named after Jay Earley, the inventor of the first algorithm in Marpa's lineage.

While scanning, Marpa keeps track of the current earleme. Earlemes in an earleme stream start at earleme 0 and increase numerically. The earleme immediately following earleme 0 is earleme 1, the earleme immediately following earleme 1 is earleme 2, and so on. The earleme immediately following earleme N is always earleme N+1.

Distances in the earleme stream are what you'd expect. The distance between earleme X and earleme Y is the absolute value of the difference between X and Y, |X-Y|. The distance from earleme 3 to earleme 6, for example, is 3 earlemes.

Whenever a token is given to Marpa to be scanned, it starts at the current earleme. In addition to the type and value of the token, Marpa must be told the token's length in earlemes. The length of a Marpa token must be greater than zero. This earleme length will become the distance from the start of the token to the end of the token.

The start of the token is put at the current earleme. If the length of the token is L, and the number of the current earleme is C, the end of the token will be at the earleme number C+L.
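
The arithmetic above is small enough to restate in a few lines of Perl. This is only a sketch: the helper names token_end and earleme_distance are made up for illustration and are not part of the Marpa API.

```perl
use strict;
use warnings;

# Hypothetical helpers, not part of Marpa. They only restate the arithmetic.

# End earleme of a token of length $length that starts at earleme $current.
sub token_end {
    my ( $current, $length ) = @_;
    die "Token length must be greater than zero\n" if $length <= 0;
    return $current + $length;
}

# Distance between two earlemes, |X-Y|.
sub earleme_distance {
    my ( $x, $y ) = @_;
    return abs( $x - $y );
}

# A token of length 3 that starts at earleme 3.
my $end      = token_end( 3, 3 );
my $distance = earleme_distance( 3, $end );
print "ends at earleme $end, distance $distance\n";
```

With C = 3 and L = 3, the token ends at earleme 6, which is 3 earlemes from its start.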

The One-Character-Per-Earleme Model

Many different models of the relationship between tokens and earlemes are possible, but two are particularly important. One is the one-token-per-earleme model. The other is the one-character-per-earleme model. If you do your lexing with the text method, you will use a one-character-per-earleme model.

Using the text method, Marpa receives the input as the series of strings and string references provided in one or more calls to the text method. The raw input can be thought of as the concatenation of these strings, even though the strings are not physically concatenated. When the text method is used, character position in this raw input will correspond exactly one-to-one with the earleme position.

Every character will be treated as being exactly one earleme in length. Any token which is more than one character in length will span earlemes.

It is common, when a one-character-per-earleme model of input is used, for there to be many earlemes at which no tokens start. For example, in a standard implementation of a grammar for a language which allows comments, no tokens will start at any earleme which corresponds to a character location inside a comment.
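
To make the model concrete, the following sketch lexes a short string and records, for each token, the earleme where it starts and its length in earlemes. The toy_lex function, its token types and its comment syntax are assumptions made for this example. It does not show how Marpa's own lexers work.

```perl
use strict;
use warnings;

# Toy lexer, an assumption for illustration only.
# Returns [ type, value, start earleme, length in earlemes ] for each token.
sub toy_lex {
    my ($input) = @_;
    my @tokens;
    pos($input) = 0;
    while ( pos($input) < length $input ) {
        my $start = pos $input;    # character offset, which is also the earleme
        if ( $input =~ /\G(\d+)/gc ) {
            push @tokens, [ 'Number', $1, $start, length $1 ];
        }
        elsif ( $input =~ /\G([-+*])/gc ) {
            push @tokens, [ 'Op', $1, $start, 1 ];
        }
        elsif ( $input =~ /\G(?:\s+|#[^\n]*)/gc ) {
            next;    # whitespace and comments: no token starts here
        }
        else {
            die "Cannot lex at character $start\n";
        }
    }
    return @tokens;
}

my $input  = '12+345 # sum';
my @tokens = toy_lex($input);

# Earlemes at which no token starts.
my %starts = map { $_->[2] => 1 } @tokens;
my @empty  = grep { !$starts{$_} } 0 .. length($input) - 1;

printf "%s '%s' starts at earleme %d, length %d\n", @{$_} for @tokens;
print "no token starts at earlemes: @empty\n";
```

In this input, tokens start only at earlemes 0, 2 and 3. The second character of 12, the later characters of 345, and everything from the space before the comment onward are earlemes where nothing starts.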

Other Models

Marpa is not restricted to the one-character-per-earleme model. Most parser generators treat location as position in a token stream. In Marpa, this corresponds to a one-token-per-earleme model.
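
In a one-token-per-earleme model, every token has a length of exactly one earleme. A minimal sketch of that convention follows. The one_token_per_earleme helper is hypothetical, and plain strings stand in for the symbol cookies that get_symbol would return.

```perl
use strict;
use warnings;

# Illustrative helper, not part of Marpa: give every token a length of one
# earleme, so that token number K spans earleme K to earleme K+1.
sub one_token_per_earleme {
    my @pairs = @_;    # each element is [ symbol, value ]
    return map { [ $_->[0], $_->[1], 1 ] } @pairs;
}

# Plain strings stand in for the cookies returned by get_symbol.
my @tokens = one_token_per_earleme(
    [ 'Number', 2 ],
    [ 'Op',     q{-} ],
    [ 'Number', 0 ],
);

for my $ix ( 0 .. $#tokens ) {
    my ( $symbol, $value, $length ) = @{ $tokens[$ix] };
    printf "%s '%s' spans earleme %d to %d\n",
        $symbol, $value, $ix, $ix + $length;
}
```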

If you use the earleme method, you can structure your input in almost any way you like. There are only four restrictions:

  1. Scanning always starts at earleme 0.

  2. Earleme N is always scanned immediately before earleme N+1. In other words, the earlemes are scanned one by one in increasing numerical order.

  3. When an earleme is scanned, all tokens starting at that earleme must be added. It is perfectly acceptable for there to be no tokens starting at a given earleme. However, once earleme N is scanned, it is no longer possible to add a token starting at any of the earlemes from 0 to N.

  4. With every token, a length in earlemes must be given, and this length cannot be zero or negative.
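
One way to respect these restrictions is to arrange the tokens, before scanning, into one list per earleme. The sketch below does that. The earleme_schedule helper and its input layout are assumptions made for this example, not part of Marpa, and plain strings again stand in for symbol cookies.

```perl
use strict;
use warnings;

# Illustrative helper, not part of Marpa. Each input token is
# [ symbol, value, start earleme, length ]. The result holds one array ref
# per earleme, beginning with earleme 0. Each array ref lists the
# [ symbol, value, length ] triples of the tokens that start at that earleme.
sub earleme_schedule {
    my @positioned = @_;
    my @schedule;
    for my $token (@positioned) {
        my ( $symbol, $value, $start, $length ) = @{$token};
        die "Start earleme cannot be negative\n" if $start < 0;
        die "Token length must be greater than zero\n" if $length <= 0;
        push @{ $schedule[$start] }, [ $symbol, $value, $length ];
    }

    # Earlemes at which no token starts get an empty list.
    for my $ix ( 0 .. $#schedule ) {
        $schedule[$ix] ||= [];
    }
    return @schedule;
}

my @schedule = earleme_schedule(
    [ 'Number', 12,   0, 2 ],
    [ 'Op',     q{+}, 2, 1 ],
    [ 'Number', 345,  3, 3 ],
);

for my $earleme ( 0 .. $#schedule ) {
    printf "earleme %d: %d token(s) start here\n",
        $earleme, scalar @{ $schedule[$earleme] };
}
```

Each element of the schedule holds the tokens for one earleme, in increasing order, so walking the array from index 0 visits the earlemes in the order the restrictions require.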

Exhaustion

At the start of parsing, the furthest earleme is earleme 0. When a token is recognized, its end earleme is determined by adding the token length to the current earleme. If the new token's end earleme is after the furthest earleme, the furthest earleme is set at the new token's end earleme.

If, after scanning all the tokens at an earleme, the current earleme has reached the furthest earleme, no more successful parses are possible. At this point, the recognizer is said to be exhausted. A recognizer is active if and only if it is not exhausted.

Parsing is said to be exhausted when the recognizer is exhausted. Parsing is said to be active when the recognizer is active.

Exhausted parsing does not mean failed parsing. In particular, parsing is often exhausted at the point of a successful parse. An exhausted recognizer may also contain successful parses prior to the current earleme.

Conversely, active parsing does not mean successful parsing. A recognizer remains active as long as some potential input might produce a successful parse. This does not mean that it ever will.

Marpa parsing can remain active even if no token is found at the current earleme. In the one-character-per-earleme model, the current earleme might fall in the middle of a previously recognized token and parsing will remain active at least until the end of that token is reached. In the one-character-per-earleme model, stretches where no token either starts or ends can be many earlemes in length.
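
The furthest-earleme rule can be modeled in a few lines. The sketch below is a toy model, not Marpa's implementation: it looks only at token lengths, ignores the grammar entirely, and the exhausted_at function is a name invented for this example.

```perl
use strict;
use warnings;

# Toy model of the furthest-earleme rule only. The argument list holds one
# array ref per earleme, each listing the lengths of the tokens that start
# there. Returns the earleme at which the model becomes exhausted, or
# nothing if it is still active after the last earleme given.
sub exhausted_at {
    my @schedule = @_;
    my $furthest = 0;
    for my $current ( 0 .. $#schedule ) {
        for my $length ( @{ $schedule[$current] } ) {
            my $end = $current + $length;
            $furthest = $end if $end > $furthest;
        }

        # All tokens at this earleme are scanned. If the current earleme
        # has caught up with the furthest earleme, the model is exhausted.
        return $current if $current >= $furthest;
    }
    return;
}

# A length-3 token at earleme 0 keeps the model active across earlemes 1
# and 2, where no token starts. A new token at earleme 3 extends it again.
my $still_active = exhausted_at( [3], [], [], [1] );

# A length-2 token at earleme 0, then nothing: exhausted at earleme 2.
my $exhausted = exhausted_at( [2], [], [] );

print defined $still_active ? "exhausted at $still_active\n" : "still active\n";
print "exhausted at earleme $exhausted\n";
```

The first call shows parsing staying active through earlemes where no token starts. The second shows an earleme with no tokens exhausting the model, because the current earleme has reached the furthest earleme.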

Cloning

The new constructor requires a grammar to be specified in one of its arguments. By default, the new constructor clones the grammar object. This is done so that recognizers do not interfere with each other by modifying the same data. Cloning is the default behavior, and is always safe.

While safe, cloning does impose an overhead in memory and time. This can be avoided by setting the clone option to 0 in the new constructor. Not cloning is safe if you know that the grammar object will not be shared by another recognizer or used by more than one evaluator.

It is very common for a Marpa program to have simple flows of data, where no more than one recognizer is created from any grammar, and no more than one evaluator is created from any recognizer. When this is the case, cloning is unnecessary.

METHODS

new

    my $recce = Marpa::Recognizer->new(
        {    grammar      => $grammar,
             lex_preamble => $new_lex_preamble,
        }
    );

The new method's one, required, argument is a hash reference of named arguments. The new method either returns a new recognizer object or throws an exception. Either the stringified_grammar or the grammar named argument must be specified, but not both. A recognizer is created with the current earleme set at earleme 0.

If the grammar option is specified, its value must be a grammar object with rules defined. By default, the grammar is cloned for use in the recognizer.

If the stringified_grammar option is specified, its value must be a Perl 5 string containing a stringified Marpa grammar, as produced by Marpa::Grammar::stringify. It will be unstringified for use in the recognizer. When the stringified_grammar option is specified, the resulting grammar is never cloned, regardless of the setting of the clone argument.

If the clone argument is set to 1, and the grammar argument is not in stringified form, new clones the grammar object. This prevents recognizers from interfering with each other's data. This is the default and is always safe. If clone is set to 0, the recognizer will work directly with the grammar object which was its argument. See above for more detail.

Marpa options can also be named arguments to new. For details of the Marpa options, see Marpa::Doc::Options.

text

    my $fail_offset = $recce->text('2-0*3+1');
    if ( $fail_offset >= 0 ) {
        Marpa::exception("Parse failed at offset $fail_offset");
    }

Extends the parse using the one-character-per-earleme model. The one, required, argument must be a string or a reference to a string which contains text to be parsed. If all the input was successfully consumed, the text method returns a negative number. The return value is -1 if parsing was exhausted after consuming the entire input. The return value is -2 if parsing was still active after consuming the entire input.

If parsing was exhausted before all the input was consumed, the text method returns the number of characters that were consumed before parsing was exhausted. If text is called on an exhausted recognizer, so that none of the input can be consumed, the return value is 0. Failures, other than exhausted recognizers, are thrown as exceptions.
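
The return values described above can be summarized in a small dispatch routine. This is a sketch only: describe_text_result is a hypothetical helper, not part of Marpa, and it takes the return value as a plain number rather than calling the text method itself.

```perl
use strict;
use warnings;

# Illustrative helper, not part of Marpa: describe a return value of the
# text method, following the rules stated above.
sub describe_text_result {
    my ($result) = @_;
    return 'all input consumed, parsing exhausted'      if $result == -1;
    return 'all input consumed, parsing still active'   if $result == -2;
    return "parsing exhausted after $result characters" if $result >= 0;
    die "Unexpected return value: $result\n";
}

for my $result ( -2, -1, 0, 4 ) {
    print "$result: ", describe_text_result($result), "\n";
}
```

A caller that only needs to know whether all the input was consumed can test for a negative return value, as the example at the start of this section does.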

Terminals are recognized in the text using the lexers that were specified in the porcelain or the plumbing. The earleme length of each token is set to the length of the token in characters. (If a token has a "lex prefix", the length of the lex prefix counts as part of the token length.)

Subsequent calls to text on the same recognizer always advance the earleme numbering monotonically. The cth character, where the count c includes all characters from any previous calls to the text method for this recognizer, will start at earleme c-1 and will end at earleme c.
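
The numbering across calls can be worked out with simple bookkeeping. The following sketch computes, for each string that would be passed to a successive text call, the earleme where its first character starts and the earleme where its last character ends. The chunk_spans helper is made up for this example and does not call Marpa.

```perl
use strict;
use warnings;

# Illustrative helper, not part of Marpa. Each argument is a string or a
# reference to a string, in the order it would be passed to the text method.
# Returns one [ start earleme, end earleme ] pair per argument.
sub chunk_spans {
    my @chunks   = @_;
    my $consumed = 0;    # characters seen in earlier chunks
    my @spans;
    for my $chunk (@chunks) {
        my $text = ref $chunk ? ${$chunk} : $chunk;
        push @spans, [ $consumed, $consumed + length $text ];
        $consumed += length $text;
    }
    return @spans;
}

my @spans = chunk_spans( '2-0', \'*3+1' );
printf "chunk spans earleme %d to %d\n", @{$_} for @spans;
```

Here the first string covers earlemes 0 to 3 and the second covers earlemes 3 to 7. The fourth character overall, the first of the second string, starts at earleme 3 and ends at earleme 4.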

How a string is divided up among calls to the text method makes no difference in the earleme location of individual characters, but it can affect the recognition of terminals by the lexers. If the characters from a single terminal are split between two text calls, the lexers will fail to recognize that terminal. Terminals cannot span calls to the text method.

earleme

    my $a = $grammar->get_symbol('a');
    $recce->earleme( [ $a, 'a', 1 ] ) or Marpa::exception('Parsing exhausted');

The earleme method takes zero or more arguments. Each argument represents a token which starts at the current earleme. Because ambiguous lexing is allowed, more than one token may start at each earleme, in which case there will be one argument per token. Because tokens can span earlemes, it is possible that no tokens start at an earleme, in which case the call to earleme will have zero arguments.

After adding the tokens to the recognizer, the earleme method determines whether the recognizer is active or exhausted. If the recognizer is still active, the earleme method moves the current earleme forward by one, and the earleme method returns 1. If the recognizer is exhausted, the current earleme stays where it is, and the earleme method returns 0. The earleme method throws an exception on failure. Any attempt to add more input to an exhausted recognizer will fail.

Each token argument is a reference to a three element array. The first element is a "cookie" for the token's symbol, as returned by the Marpa::Grammar::get_symbol method or the get_symbol method of a porcelain interface. The second element is the token's value in the parse, and may be any value legal in Perl 5, including undefined. The third is the token's length in earlemes.

While the recognizer is active, an earleme remains the current earleme during only one call of the earleme method. All tokens starting at that earleme must be added in that call. The first time that the earleme method is called in a recognizer, the current earleme is at earleme 0.

Once a recognizer is exhausted, the current earleme never moves and no more input can be added. It is possible for a call to earleme with no arguments to exhaust the recognizer. This happens if earleme is called with zero arguments when the current earleme reaches the furthest earleme.

earleme is the low-level token input method. Unlike text, the earleme method assumes no particular model of the input. It is up to the user to define the relationship between tokens and earlemes.

end_input

    $recce->end_input();

Used to indicate the end of input. Tells the recognizer that no new tokens will be added, or, in other words, that no tokens will start at or after the current earleme. The end_input method takes no arguments.

The end_input method does not change the location of the furthest earleme. After a successful call to the end_input method, the current earleme will be positioned at the furthest earleme. Since positioning the current earleme at the furthest earleme leaves the recognizer exhausted, any further calls to text will return 0, and any further calls to earleme will throw an exception.

The end_input method returns a Perl true value on success. On failure, it throws an exception. The end_input method can only usefully be called once per recognizer, but the method is idempotent. Subsequent calls to the end_input method will have no effect and will return a Perl true.

stringify

    my $stringified_recce = $recce->stringify();

The stringify method takes as its single argument a recognizer object and converts it into a string. It returns a reference to the string. The string is created using Data::Dumper. On failure, stringify throws an exception.

unstringify

    $recce = Marpa::Recognizer::unstringify( $stringified_recce, $trace_fh );

    $recce = Marpa::Recognizer::unstringify($stringified_recce);

The unstringify static method takes a reference to a stringified recognizer as its first argument. Its second, optional, argument is a file handle. The file handle argument will be used both as the unstringified recognizer's trace file handle, and for any trace messages produced by unstringify itself. unstringify returns the unstringified recognizer object unless it throws an exception.

If the trace file handle argument is omitted, it defaults to STDERR and the unstringified recognizer's trace file handle reverts to the default for a new recognizer, which is also STDERR. The trace file handle argument is necessary because in the course of stringifying, the recognizer's original trace file handle may have been lost.

clone

    my $cloned_recce = $recce->clone();

The clone method creates a useable copy of a recognizer object. It returns a successfully cloned recognizer object, or throws an exception.

SUPPORT

See the support section in the main module.

AUTHOR

Jeffrey Kegler

LICENSE AND COPYRIGHT

Copyright 2007 - 2009 Jeffrey Kegler

This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0.