NAME
Marpa::R2::Algorithm - The Earley algorithm
Description
For most purposes, the user does not need to know anything about the Earley algorithm or Marpa algorithm, and can ignore this document. This document describes those aspects of the Earley and Marpa algorithms which are relevant in certain cases, when using Marpa. These cases include tracing and debugging and use of the recognizer's ranking methods.
Resources for readers who want to know the details of the Marpa algorithm that are beyond the scope of this document are given below. A reader of this document should be familiar with the basic parsing terminology described in Marpa::R2::Vocabulary.pod.
Marpa is based on Jay Earley's algorithm for parsing. The idea behind Earley's algorithm is that you can parse by building a table of rules and where you are in those rules. "Where" means two things: location in the rule relative to the rule's symbols, and location relative to the parse's input stream.
Dotted Rules
To understand this document, it is important to understand what a dotted rule is. An acquaintance with dotted rules is also important in understanding Marpa's progress reports. Dotted rules are thoroughly described in the progress report documentation. This section repeats the main ideas from the perspective of this document.
Recall that a rule is a LHS (left hand side) and a RHS (right hand side). The LHS is always exactly one symbol. The RHS is zero or more symbols. Consider the following example of a rule, given in the syntax of Marpa's DSL.
S ::= A B C
Dotted rules are used to track the progress of a parse through a rule. Within a rule, position relative to the symbols of the rule is traditionally indicated with a dot. In fact, the symbol-relative rule position is usually called the dot location. Taken as a pair, a rule and a dot location are called a dotted rule. The symbol before the dot is called the predot symbol.
In parsing this rule, we can be at any of four possible locations. One location is at the beginning, before all of the symbols. The other three locations are immediately after each of the rule's three symbols. The following is an example of a dotted rule, with the dot after the second RHS symbol.
S ::= A B . C
In this rule, B is the predot symbol. This rule indicates that we have recognized an A, followed by a B, but that we have yet to see a C.
When the dot is after the last symbol of the RHS, the dotted rule is called a completed rule, or a completion. A completion indicates that a rule has been fully recognized.
Here is the completion for the above rule:
S ::= A B C .
In this completion example, the symbol C is the predot symbol.
In every grammar, a special rule is reserved as the start rule. When the rule of a dotted rule is the start rule, and the dot is before the first symbol of the RHS, the dotted rule is the start dotted rule. The start dotted rule does not have a predot symbol.
When the dot is before the first symbol of the RHS, and the rule is not the start rule, the dotted rule is called a predicted rule, or a prediction. Here is the prediction of the rule we've been using for our examples:
S ::= . A B C
A prediction indicates that we have not yet recognized any of the symbols in the rule. All a prediction does is to predict that the rule will occur. In predictions, there is no predot symbol.
Earley items
The dotted rules contain all but one piece of the information that Marpa needs to track. The missing piece is the second of the two "wheres": where in the input stream. To associate input stream location and dotted rules, Marpa uses what are now called Earley items.
A convenient way to think of an Earley item is as a triple, or 3-tuple, consisting of dotted rule, origin and current location. The origin is the location in the input stream where the dotted rule starts. The current location is the location in the input stream which corresponds to the dot position.
Two noteworthy consequences follow from the way in which origin and current location are defined. First, if a dotted rule is a prediction, then origin and current location will always be the same. Second, the location where a rule ends is not tracked unless the dotted rule is a completion. If its dotted rule is not a completion, an Earley item does not tell us if a rule will ever be completed, much less at which location.
Confluences
In every parse, the Marpa algorithm starts with
an Earley table containing a start Earley item, and
a sequence of sets of tokens.
The Marpa algorithm then performs one of a set of operations until no more Earley items can be added to the Earley table. (We will not describe these operations in detail here. See below.)
If we count addition of the start Earley item as an "operation", this means that every Earley item has at least one reason for being in the Earley table. (Duplicate Earley items are not added to the table, so that, in an ambiguous parse, a single Earley item may have more than one reason for being in the table.)
For describing the reasons for the presence of Earley items in the Earley table, we use hydrological terminology. Intuitively, the tokens and/or Earley items are "upstream" from the Earley items they cause.
Every Earley item has one or more confluences which record the reason for that Earley item's presence in the Earley table. Each confluence has two inflows, but both of the inflows may be ill-defined.
The first inflow is the mainstem. When well-defined, a mainstem is always an Earley item. The second inflow is the tributary. When well-defined, a tributary is an Earley item or a token.
The start Earley item always has exactly one confluence. Both inflows of its confluence are ill-defined.
A predicted Earley item may have many confluences. The mainstem of a predicted Earley item is always another Earley item. The tributary of a predicted Earley item is always ill-defined.
An Earley item whose predot symbol is a terminal is a scanned Earley item. A scanned Earley item may have multiple confluences. The mainstem of a scanned Earley item is always another Earley item. The tributary of a scanned Earley item is always a token.
An Earley item whose predot symbol is a non-terminal is a reduction. A reduction may have many confluences. The mainstem of a reduction is always another Earley item. The tributary of a reduction is always a completed Earley item.
Resources on the Marpa algorithm
This document is not intended as an full explanation of the workings of the Marpa algorithm. Those who want to learn more about the Marpa algorithm, and who are not already familiar with Earley's algorithm, should consult an introductory presentation of it. One good start is the Wikipedia article https://en.wikipedia.org/wiki/Earley_parser. For the workings of the Marpa algorithm there are two arxiv.org papers.
Earley item correctness
For the definition of Earley item correctness, we show symbols in angle brackets, for example <A>, and sentential forms in double angle brackets, for example <<alpha>>.
Without loss of generality, we call our grammar G and we call our input W. The input is a sequence of tokens. W[i] is the i'th terminal of W, so that the first terminal of W is W[0]. We write W[i...j] for the segment of W whose first term is W[i] and whose last term is W[j]. As a special case, W[a...b] is the empty string if b is less than a.
Let <<alpha>> =>* <<beta>> mean that, using grammar G, the sentential form <<alpha>> derives the sentential form <<beta>> in zero or more steps. We recall from the vocabulary document that a sentential form is a sequence of zero or more symbols from the grammar. In a derivation, a single symbol represents the sentential form of length 1 consisting only of that symbol.
Also without loss of generality, we let the grammar G be such that we have the following:
Symsis the set of symbols inG.Exactly one of the symbols in
Symsis distinguished as the "start symbol". Call the "start symbol",<Start>.Let
Termsbe a subset ofSymscalled the "terminals" ofG. The symbols of the tokens in the inputWare elements ofTerms.Gcontains the rule<A> ::= <<beta>> <<gamma>>where
<<beta>>and<<gamma>>are sentential forms.
We say the Earley item
[ [ <A> ::= <<beta>> . <<gamma>> ], origin, current ]
is correct if and only if, where <<alpha>> and <<delta>> are sentential forms, we have all of the following:
1. <Start> => <<alpha>> <A> <<delta>>
2. <<alpha>> => W[0...(origin-1)]
3. <<beta>> => W[origin...(current-1)]
A proof that the Earley algorithm adds all correct Earley items, and only correct Earley items, to the Earley table can be found in Jay Earley's original paper, and in Aho and Ullmann's 1972 textbook.
Copyright and License
Copyright 2022 Jeffrey Kegler
This file is part of Marpa::R2. Marpa::R2 is free software: you can
redistribute it and/or modify it under the terms of the GNU Lesser
General Public License as published by the Free Software Foundation,
either version 3 of the License, or (at your option) any later version.
Marpa::R2 is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
Lesser General Public License for more details.
You should have received a copy of the GNU Lesser
General Public License along with Marpa::R2. If not, see
http://www.gnu.org/licenses/.