- ABOUT THIS DOCUMENT
- AMBIGUOUS LEXING
- THE CHARACTER-PER-EARLEME MODEL
- OTHER INPUT MODELS
- LICENSE AND COPYRIGHT
Marpa::Models - Other Input Models (Advanced)
The alternative input models described in this document are an advanced technique. Ordinary Marpa users may safely ignore this document.
Marpa's default input model is the traditional one -- a token stream. Token streams are so standard in parsing applications that most texts do not take the trouble to define the term. A token stream is input structured as a sequence of tokens, where each token occupies one location and every location has a token. In the token stream model, all tokens are of the same length.
Conventionally, all tokens are of length 1, and the token stream starts at location 0. Following this convention, the Nth token would start at location N-1 and end at location N. For example, the first token would start at location 0 and end at location 1. The second token would start at location 1 and end at location 2.
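The location arithmetic of this convention can be sketched in a few lines of Python. This is only an illustration of the numbering, not part of Marpa; the function name `token_stream_spans` is hypothetical.

```python
def token_stream_spans(tokens):
    """Pair each token with its (start, end) locations in a token stream.

    Assumes the conventional model: every token has length 1 and the
    stream starts at location 0.
    """
    return [(tok, i, i + 1) for i, tok in enumerate(tokens)]


spans = token_stream_spans(["a", "+", "b"])
# The first token spans 0..1, the second 1..2, the third 2..3.
for tok, start, end in spans:
    print(tok, start, end)
```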
The basic idea of location in Marpa is the earleme. Earlemes are named after Jay Earley. Internally, each earleme corresponds exactly and directly to an Earley set. Every token has a start earleme and an end earleme.
The token stream model may also be called the token-per-earleme model. In the token stream model, token location and earleme location correspond directly, on a one-to-one basis. It can be useful to have the structure of the input relate to earleme location in other ways. One such alternative, which is useful and has been tested, is the character-per-earleme model, discussed below.
Alternative models are implemented using the optional third and fourth elements of the token descriptors. Token descriptors are passed in the arguments to the Marpa Recognizer's tokens method.
Token length is the optional third element of the token descriptor. By default, it is 1, which is the correct value for the token stream model. Its value can be any integer greater than zero. Marpa does not allow zero length tokens in any input model.
The token offset is the fourth element of each token's descriptor. This is the offset to be added to the current earleme location. The current earleme location is the earleme which will be the start earleme of the next token to be added. When parsing begins, the current earleme location is earleme 0.
When token offset is left undefined, it defaults to 1. That means the next token will be added at another earleme location, one after the current one. This is the correct value for the token stream model.
Negative token offsets are not allowed. Zero token offsets are allowed, and will cause multiple tokens to start at the same location. This in turn will cause lexing to be potentially ambiguous. Marpa supports ambiguous lexing.
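The rules for token length and token offset can be modeled with a small Python sketch. It is a simulation of the arithmetic described above, not a call into Marpa. The function name `place_tokens` is hypothetical, and the layout of the first two descriptor elements (a token name and a token value) is an assumption; this document only specifies the third element (length) and the fourth (offset).

```python
def place_tokens(descriptors):
    """Compute (name, start_earleme, end_earleme) for each descriptor.

    Each descriptor is a tuple of up to four elements:
    (name, value, length, offset). The first two are assumed here;
    length and offset both default to 1 when omitted or None.
    """
    current = 0  # parsing begins at earleme 0
    placed = []
    for desc in descriptors:
        name = desc[0]
        length = desc[2] if len(desc) > 2 and desc[2] is not None else 1
        offset = desc[3] if len(desc) > 3 and desc[3] is not None else 1
        if length <= 0:
            raise ValueError("token length must be greater than zero")
        if offset < 0:
            raise ValueError("token offset must not be negative")
        placed.append((name, current, current + length))
        current += offset  # start earleme of the next token
    return placed


# Defaults reproduce the token stream model.
print(place_tokens([("a", "x"), ("b", "y")]))

# A zero offset makes the next token start at the same earleme.
print(place_tokens([
    ("word", "ab", 2, 0),
    ("letter", "a", 1, 1),
    ("letter", "b", 1, 1),
]))
```

In the second call, `word` and the first `letter` both start at earleme 0, which is the situation that makes lexing potentially ambiguous.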
Ambiguous lexing occurs when several different sequences of tokens are possible. Potentially ambiguous lexing occurs in any parse where multiple tokens start at a single earleme. An actual ambiguity only occurs if more than one of the potential token choices is consistent with the grammar and the input. If there is no actual ambiguity, Marpa will use the only token choice which is consistent with the grammar and the input.
When lexing is actually ambiguous, Marpa will use all consistent alternatives. When the lexing in a parse is actually ambiguous, the parse will be ambiguous. This means that in Marpa, a parse can be ambiguous, even when the grammar is not ambiguous.
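To make "several different sequences of tokens" concrete, the following Python sketch enumerates every token sequence that covers the input from earleme 0 to a given end earleme. It is an illustration only and does not show how Marpa works internally: it ignores the grammar entirely, so it lists potential lexings, whereas Marpa keeps only those consistent with the grammar. The function name `lexings` and the `(name, start, end)` span format are hypothetical.

```python
def lexings(spans, end):
    """List every sequence of token names that tiles earlemes 0..end.

    `spans` is a list of (name, start_earleme, end_earleme) triples.
    Every token must have positive length, so each step advances and
    the recursion always terminates.
    """
    by_start = {}
    for name, start, stop in spans:
        if stop <= start:
            raise ValueError("token length must be greater than zero")
        by_start.setdefault(start, []).append((name, stop))

    def walk(pos):
        if pos == end:
            yield []
            return
        for name, stop in by_start.get(pos, []):
            for rest in walk(stop):
                yield [name] + rest

    return list(walk(0))


# Two tokens start at earleme 0, so two lexings are possible.
spans = [("word", 0, 2), ("letter", 0, 1), ("letter", 1, 2)]
for sequence in lexings(spans, 2):
    print(sequence)
```

If the grammar accepted only one of the two sequences, the lexing would be potentially ambiguous but not actually ambiguous.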
Marpa's semantics deal with ambiguity due to lexing in the same way that Marpa deals with grammatical ambiguity. When the Single Parse Evaluator is used, Marpa arbitrarily chooses one of the possible token choices. When the Multi-parse Evaluator is used, Marpa allows the user to iterate through alternative lexings.
Not yet written.
Not yet written.
Copyright 2007-2010 Jeffrey Kegler, all rights reserved. Marpa is free software under the Perl license. For details see the LICENSE file in the Marpa distribution.