- ABOUT THIS DOCUMENT
- AMBIGUOUS LEXING
- THE CHARACTER-PER-EARLEME MODEL
- OTHER INPUT MODELS
- LICENSE AND COPYRIGHT
Marpa::Models - Alternative Input Models
The alternative input models described in this document are a very advanced technique. This document may safely be ignored by ordinary Marpa users.
Marpa's default input model is the traditional one -- a token stream. Token streams are so standard in parsing applications that most texts and documentation do not take the trouble to define them. A token stream is input structured as a sequence of tokens, where tokens occupy successive locations and are all of the same length. Conventionally, all tokens are of length 1, and the token stream starts at location 0. Following this convention, the Nth token starts at location N-1 and ends at location N. For example, the first token starts at location 0 and ends at location 1. The second token starts at location 1 and ends at location 2.
The basic unit of location in Marpa is the earleme. Earlemes are named after Jay Earley. Internally, each earleme corresponds exactly and directly to an Earley set. Every token has a start earleme and an end earleme. A token is said to be at an earleme if that earleme is its start earleme.
The token stream model may also be called the token-per-earleme model. The Nth token starts at earleme N-1 and ends at earleme N. The first token starts at earleme 0 and ends at earleme 1, and therefore is said to be at earleme 0. The second token is at earleme 1 and ends at earleme 2.
In the token stream model, token location and earleme location correspond directly on a one-to-one basis. It can be useful to have the structure of the input relate to earleme location in other ways. One such alternative, which is useful and has been tested, is the character-per-earleme model, discussed below.
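The token-per-earleme numbering can be illustrated with a small sketch. It is written in Python purely as a model of the arithmetic and does not call Marpa; the function name `token_stream_spans` is hypothetical.

```python
def token_stream_spans(tokens):
    """Return (token, start_earleme, end_earleme) for a plain token stream.

    Illustrative model only: every token has length 1 and the
    stream starts at earleme 0.
    """
    spans = []
    for n, token in enumerate(tokens, start=1):
        # The Nth token starts at earleme N-1 and ends at earleme N.
        spans.append((token, n - 1, n))
    return spans


spans = token_stream_spans(["a", "+", "b"])
for token, start, end in spans:
    print(f"{token!r} is at earleme {start} and ends at earleme {end}")
```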
Alternative models are implemented using the optional third and fourth parameters of the token descriptors. The array of token descriptors is an argument to the Marpa Recognizer's tokens method.
Token length is the optional third element of the token descriptor. By default, it is 1, which is the correct value for the token stream model. Its value can be any integer greater than zero. Marpa does not allow zero-length tokens in any input model.
The token offset is the fourth element of each token's descriptor. It controls the current earleme location. The current earleme location is the earleme which will be the start earleme of the next token to be added. When parsing begins, the current earleme location is earleme 0.
When token offset is left undefined, it defaults to 1. This is the correct value for the token stream model. Negative token offsets are not allowed.
Zero token offsets are allowed, and will cause multiple tokens to start at the same location. This in turn will cause lexing to be ambiguous. Marpa supports ambiguous lexing.
Ambiguous lexing occurs when several different sequences of tokens are possible. Potentially ambiguous lexing occurs in any parse where multiple tokens start at a single earleme. If only one of the potential token choices is consistent with the grammar and the rest of the input, there is no actual ambiguity, and the one consistent token choice is the one that will be used.
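The following sketch models how token length and token offset might determine earleme locations. It is a Python model under stated assumptions, not Marpa code: it assumes the token is placed at the current earleme before the offset is applied, and it treats the first two descriptor elements as a symbol name and a value. The name `place_tokens` is hypothetical.

```python
def place_tokens(descriptors):
    """Model earleme placement for a list of token descriptors.

    Each descriptor is a sequence whose optional third element is the
    token length and whose optional fourth element is the token offset.
    The first two elements are assumed to be a symbol name and a value.
    Returns a list of (name, start_earleme, end_earleme).
    """
    current = 0  # parsing begins at earleme 0
    placed = []
    for descriptor in descriptors:
        name = descriptor[0]
        # Both length and offset default to 1 when missing or undefined.
        length = descriptor[2] if len(descriptor) > 2 and descriptor[2] is not None else 1
        offset = descriptor[3] if len(descriptor) > 3 and descriptor[3] is not None else 1
        if not isinstance(length, int) or length <= 0:
            raise ValueError("token length must be an integer greater than zero")
        if not isinstance(offset, int) or offset < 0:
            raise ValueError("token offset must be a non-negative integer")
        # Assumption: the token starts at the current earleme,
        # and only then does the offset advance the current earleme.
        placed.append((name, current, current + length))
        current += offset
    return placed


placed = place_tokens([
    ("A", "a"),        # defaults: length 1, offset 1
    ("B", "b", 1, 0),  # offset 0: the next token also starts at earleme 1
    ("C", "bd", 2, 1), # a longer alternative starting at the same earleme
    ("D", "d"),
])
print(placed)
```

In this model, tokens B and C both start at earleme 1, which is the situation that makes lexing potentially ambiguous.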
If more than one of these potential alternative lexings is consistent with the grammar and the rest of the input, then lexing is actually ambiguous. If the lexing in a parse is actually ambiguous, the parse will be ambiguous, even if the grammar is not.
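To make potential ambiguity concrete, the sketch below enumerates every token sequence that spans the input from earleme 0 to the final earleme. It is a Python model only: it ignores the grammar entirely, so it lists potential lexings, and deciding which of them are actually consistent is the parser's job. The name `candidate_lexings` and the `(name, start, end)` span format are assumptions made for illustration.

```python
def candidate_lexings(spans, final_earleme):
    """List every sequence of token names covering earlemes 0..final_earleme.

    `spans` is a list of (name, start_earleme, end_earleme) with end > start.
    The grammar is not consulted, so these are potential lexings only.
    """
    by_start = {}
    for name, start, end in spans:
        by_start.setdefault(start, []).append((name, end))

    def extend(earleme):
        # Reaching the final earleme completes one lexing.
        if earleme == final_earleme:
            return [[]]
        results = []
        for name, end in by_start.get(earleme, []):
            for rest in extend(end):
                results.append([name] + rest)
        return results

    return extend(0)


# Two tokens start at earleme 1, so two lexings are possible.
example_spans = [("A", 0, 1), ("B", 1, 2), ("C", 1, 3), ("D", 2, 3)]
lexings = candidate_lexings(example_spans, 3)
print(lexings)
```

If the grammar accepts only one of the listed sequences, the lexing is not actually ambiguous; if it accepts more than one, the parse is ambiguous.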
Marpa deals with ambiguity due to lexing in the same way that it deals with grammatical ambiguity. When the Single Parse Evaluator is used, Marpa arbitrarily chooses one of the possible token choices. When the Multi-parse Evaluator is used, Marpa allows the user to iterate through alternative lexings.
Not yet written.
Not yet written.
Copyright 2007-2010 Jeffrey Kegler, all rights reserved. Marpa is free software under the Perl license. For details see the LICENSE file in the Marpa distribution.