NAME

Marpa::Doc::Internals - Marpa Internals

OVERVIEW

Knowledge of Marpa internals is not necessary to use Marpa. But you can make better use of Marpa if you understand how it works. This document assumes that you have read the other Marpa documentation, and that you have a general knowledge of parsing. All the examples of diagnostic output in this document assume that Marpa's strip option has been turned off.

EARLEY SETS

To speak pedantically, the algorithm in Earley's 1970 parsing article does not actually parse in the strict sense. It is a recognizer, not a parser.

Earley items are built as tokens are recognized. If an Earley item of the correct form is built when all of an input has been processed, that input is recognized as one of those described by the grammar. Once the recognition phase is done, the phase that does parsing in the strict sense uses the Earley items.

Every Earley item has a current earleme. An Earley item's current earleme is also called its dot earleme, its current location, or its dot location. The set of all the Earley items with the same dot location is called an Earley set.

In addition to its current earleme, each Earley item has a QDFA state, and an origin. The origin of an Earley item can also be called its start location. Here's a representation of an Earley item, as you might see it in debugging output from Marpa:

    S5@6-7

This Earley item is for QDFA state 5 (that is the "S5" part). The "@6-7" part says that this Earley item originates at earleme 6 and is current at earleme 7. The number of an Earley item's current, or dot, earleme is always the same as the number of the Earley set that contains the item.

(Note to experts: Those familiar with Earley parsing will note that S5@6-7 does not look like a traditional Earley item. QDFA states are not traditional -- they are my elaboration of an invention by Aycock and Horspool. The use of the at-sign as punctuation follows Grune and Jacobs. Finally, in the traditional notation for an Earley item, the dot earleme is not included at all -- you were supposed to figure out from the context which Earley set an Earley item belongs to.)
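
To make the notation concrete, here is a minimal sketch, in plain Perl, of an Earley item as it has been described so far. This is an illustration only, not Marpa's actual internal data structure: an item is reduced to a QDFA state number, an origin earleme, and a current (dot) earleme, plus a helper that formats the item in the S5@6-7 style of the debugging output.

    use strict;
    use warnings;

    # Hypothetical Earley item: a QDFA state number, an origin (start) earleme,
    # and a current (dot) earleme.
    sub earley_item {
        my ( $state, $origin, $current ) = @_;
        return { state => $state, origin => $origin, current => $current };
    }

    # Format an item the way the debugging output does: S<state>@<origin>-<current>
    sub item_name {
        my ($item) = @_;
        return sprintf 'S%d@%d-%d', @{$item}{qw(state origin current)};
    }

    print item_name( earley_item( 5, 6, 7 ) ), "\n";    # prints "S5@6-7"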

QDFA STATES

I will mention QDFA's (Quasi-Deterministic Finite Automata) frequently and NFA's (Non-deterministic Finite Automata) sometimes. All you need to know about NFA's is that the QDFA's are built from them. NFA state numbers sometimes appear in the diagnostic outputs. They can be ignored.

You will need to know more about QDFA's. Let's start with another example of an Earley item:

    S1@2-2

This states that this item is for QDFA state 1; that it originates at earleme 2; and that it is currently at (or has its dot position at) earleme 2. We can get a description of the QDFA states with the show_QDFA method. Here's what it says about QDFA state 1:

    S1: predict; 1,5
    e -> . e op e
    e -> . number
     <e> => S3
     <number> => S4

For those who like the big picture, or who simply like to skim ahead, the entire show_QDFA output, along with the other trace and debugging outputs for this example, is in the appendices.

The first line of show_QDFA's description of QDFA state 1 begins with its name: S1. Next comes the predict tag. That indicates this item was created by a prediction. That's one of the four ways an Earley item can be created. There's more on that below.

The two numbers after the "predict" tag on the first line may be ignored. They are the numbers of the NFA states that were combined into this QDFA state. Marpa uses QDFA states to track the parse. Every QDFA state has one or more LR(0) items. In this example, the second and third lines are LR(0) items. LR(0) items are rules with a dot added to indicate how far recognition has proceeded into the rule.

A dot between two symbols indicates that all symbols before the dot have been recognized, but that none of the symbols after the dot have been recognized. A dot at the beginning of an LR(0) item indicates that none of its symbols have yet been recognized. A dot at the end indicates that the rule is complete -- all of its symbols have been recognized.

The location of the dot is called the dot position. The symbol before the dot position in an LR(0) item is called the pre-dot symbol. The symbol after the dot position in an LR(0) item is called the post-dot symbol.

The last two lines in the display show transitions. The first line says that, on seeing an e, you transition to QDFA state 3. The second states that, on seeing a number, you transition to QDFA state 4.
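
For illustration, an LR(0) item can be modeled as a rule plus a dot position. The sketch below is not Marpa's representation, but it shows how the pre-dot and post-dot symbols fall out of the dot position.

    use strict;
    use warnings;

    # A rule is an lhs and a list of rhs symbols; an LR(0) item adds a dot position,
    # which counts how many rhs symbols have been recognized so far.
    my $rule     = { lhs => 'e', rhs => [ 'e', 'op', 'e' ] };
    my $lr0_item = { rule => $rule, dot => 2 };              # e -> e op . e

    sub pre_dot_symbol {
        my ($item) = @_;
        return $item->{dot} > 0 ? $item->{rule}{rhs}[ $item->{dot} - 1 ] : undef;
    }

    sub post_dot_symbol {
        my ($item) = @_;
        return $item->{rule}{rhs}[ $item->{dot} ];    # undef when the rule is complete
    }

    print 'pre-dot symbol:  ', pre_dot_symbol($lr0_item),  "\n";    # op
    print 'post-dot symbol: ', post_dot_symbol($lr0_item), "\n";    # e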

It's important not to confuse Earley items with LR(0) items. Earley items are built from one or more LR(0) items. In traditional Earley parsing, each Earley item contained one and only one LR(0) item. This made the internals simpler, but was not as efficient. Marpa combines LR(0) items into QDFA states based on ideas in Aycock and Horspool 2002.

Every LR(0) item is a statement that a rule has been recognized as far as the dot position. The dot can be at the beginning of the rule in an LR(0) item, indicating that the part of the rule recognized so far is zero length. An LR(0) item with the dot at the beginning of the rule is a prediction that the rule might be recognized.

Every QDFA state contains one or more LR(0) items and is therefore a set of statements about rule recognition. An Earley item is present in the Earley sets if and only if every rule in every LR(0) item in the QDFA state of the Earley item has been recognized as far as its dot position.

Earley items fix the location in the earleme stream where LR(0) items begin and end. Recognition of each LR(0) item began at the origin of the Earley item, and ended at the dot location of the Earley item. Put another way, the origin of each LR(0) item corresponds to the origin of the Earley item, and the dot position of each LR(0) item corresponds to the dot location of the Earley item.

Some LR(0) items are predictions -- statements that their rules could start at the dot location of the Earley item. In predictions, no symbols have been recognized, and the origin and the dot location of the Earley item will be the same. Our example is a prediction -- both origin and dot location are at earleme 2. We can see that this is the case from the "2-2" in the description of Earley item S1@2-2.

HOW EARLEY SETS ARE BUILT

New items come into the Earley sets in four ways: scanning, completion, prediction, and initialization.

Scanning

Scanning adds Earley items to indicate which tokens have been seen in the input, and where. Suppose the Earley item S1@2-2 (mentioned above) is present at earleme 2. Marpa knows to instruct the lexer to look for a number token because there is a transition from S1 to S4 on number.

If the lexer finds a number token with a length (in earlemes) of 1 at earleme 2, a new Earley item S4@2-3 is added at earleme 3. (Earleme 3 because 3 is 2+1, the current earleme plus the token length.)

Here (again from show_QDFA) is the description of QDFA state 4:

    S4: 6
    e -> number .

We can see that there is only one LR(0) item in QDFA state 4, and that it has the dot pointer at the end of its rule. An LR(0) item with the dot pointer at the end of its rule is a completed rule.

Marpa calls the item that was looking for the scanned symbol the predecessor of the item added by the scan. In this example, S1@2-2 is the predecessor of S4@2-3. Any item which is added to an Earley set based on a QDFA state transition from a predecessor is called that predecessor's successor. In this example, S4@2-3 is the successor of S1@2-2.

An Example

Here's what S4@2-3 looks like in the Earley sets, after a successful scan of the input "2+2".

    S4@2-3  [p=S1@2-2; s=number; t=\'2']

After the name of the Earley item is an Earley item source choice, shown in brackets. When the context makes the meaning clear, an Earley item source choice may be called a source or a source choice. This parse is not ambiguous, so all Earley items have at most one source choice entry.

In this case the source choice entry is for a token choice. Token choices are shown as [p=predecessor; s=symbol; t=token_value] entries. In the above example, the token choice [p=S1@2-2; s=number; t=\'2'] indicates that the token is a number with a value of "2", and that, when this token choice is selected, the predecessor of S4@2-3 is S1@2-2.
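
Here is a sketch of the scanning step, in the same hypothetical hash representation used above. It is not Marpa's code; it only shows the bookkeeping: the successor's QDFA state comes from the transition table, the successor keeps its predecessor's origin, its dot earleme is the predecessor's dot earleme plus the token length, and a token source choice is recorded.

    use strict;
    use warnings;

    # Predecessor item S1@2-2 and one transition from the QDFA: S1 goes to S4 on "number".
    my $predecessor = { state => 1, origin => 2, current => 2, sources => [] };
    my %transition  = ( 1 => { number => 4 } );

    # Hypothetical scan: the lexer found a token of the given symbol, value and length.
    sub scan {
        my ( $pred, $symbol, $value, $length ) = @_;
        my $to_state = $transition{ $pred->{state} }{$symbol};
        return unless defined $to_state;
        return {
            state   => $to_state,
            origin  => $pred->{origin},               # the successor keeps its predecessor's origin
            current => $pred->{current} + $length,    # dot earleme moves forward by the token length
            sources =>
                [ { predecessor => $pred, symbol => $symbol, token => $value } ],
        };
    }

    my $successor = scan( $predecessor, 'number', '2', 1 );
    printf "S%d\@%d-%d\n", @{$successor}{qw(state origin current)};    # S4@2-3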

Completion

Whenever an Earley item with a completed rule is added to an Earley set, other Earley items may be added to that Earley set as a result. The item with the completed rule is called the completion, or the cause, and any new item added as a result is called an effect.

Let's look at one example of a completion "cause and effect". As the completion in our example, we'll use S4@2-3, the Earley item in our example of a scanned item. Its state was QDFA state 4. QDFA state 4 contained the LR(0) item

    e -> number .

The dot is at the far right, so it's a completed rule whose left hand side is e.

S4@2-3 begins at earleme 2 and has a complete rule with e as its lhs, and therefore any Earley item which is looking for an e beginning at earleme 2 is in luck. S5@0-2 is such an item. QDFA state 5 looks like this:

    S5: 3
    e -> e op . e
     <e> => S6

This says that in QDFA state 5, when you see an e, you transition to QDFA state 6. Here's S6@0-3, the item which is the effect of the completion:

    S6@0-3 [p=S5@0-2; c=S4@2-3]

For completion items, the terms predecessor and successor mean pretty much what they did in token scanning. Since S6@0-3 comes from moving the dot forward in a rule in S5@0-2, we say S5@0-2 is the predecessor of S6@0-3, and that S6@0-3 is the successor of S5@0-2.

The same Earley item can be added to the same Earley set for multiple reasons. An item can be added because of one or more scanned tokens; because of one or more completions; or because of any combination of these.

Traditionally in Earley's algorithm, the same item is not added twice. Where an item comes from is not important in recognition. But parsing in the strict sense means determining the structure of the input, and that requires us to know why an Earley item was added to an Earley set. Each reason for the existence of an Earley item in an Earley set is a different choice that can be made when it comes time to create a parse structure.

Marpa uses the Earley item source choices to track the reasons why Earley items were added. Aycock and Horspool don't call them source choices. But Marpa's method is derived from the one Aycock and Horspool describe in their 2002 article.

We've already seen a token source choice in the scanning example. Earley item S6@0-3 has another example of a source choice: [p=S5@0-2; c=S4@2-3]. This is another kind of source choice, a completion source choice. This source choice says that one reason for S6@0-3 to exist was the completion in S4@2-3, with S5@0-2 as the predecessor. Since this parse is unambiguous, there are no other source choices.
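
The completion step can be sketched in the same hypothetical terms. When a completed rule with lhs e originating at earleme 2 is added, the Earley set at earleme 2 is searched for items whose QDFA state transitions on e, and each such item becomes the predecessor of an effect carrying a completion source choice. The field names below are inventions for the sketch, not Marpa's.

    use strict;
    use warnings;

    # One transition from the QDFA, as in the show_QDFA output: S5 goes to S6 on "e".
    my %transition = ( 5 => { e => 6 } );

    # The cause: S4@2-3, a completed rule with lhs "e", originating at earleme 2.
    my $cause = { state => 4, origin => 2, current => 3, lhs => 'e' };

    # The Earley set at the cause's origin (earleme 2) -- abbreviated to S5@0-2 only.
    my @set_at_origin = ( { state => 5, origin => 0, current => 2 } );

    my @effects;
    for my $pred (@set_at_origin) {
        my $to_state = $transition{ $pred->{state} }{ $cause->{lhs} };
        next unless defined $to_state;
        push @effects, {
            state   => $to_state,
            origin  => $pred->{origin},      # the effect keeps its predecessor's origin
            current => $cause->{current},    # and is current where the completion ends
            sources => [ { predecessor => $pred, cause => $cause } ],
        };
    }
    printf "S%d\@%d-%d\n", @{ $effects[0] }{qw(state origin current)};    # S6@0-3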

Prediction

A third source of Earley items is prediction. Whenever an Earley item is expecting to see an e, for example, it can also expect to see the start of all the rules that have an e on their left hand side.

Recall QDFA state 5:

    S5: 3
    e -> e op . e
     <e> => S6

QDFA state 5 is expecting an e. This means that the rules

    e -> . e op e
    e -> . number

can also be expected to originate at this earleme. Marpa is smart about grouping rules into QDFA states and both these rules are in QDFA state 1:

    S1: predict; 1,5
    e -> . e op e
    e -> . number
     <e> => S3
     <number> => S4

Since S5@0-2 is at earleme 2, and it is expecting the rules which originate in QDFA state 1, there will be an Earley item for QDFA state 1 at earleme 2:

    S1@2-2

In predicted items, either the dot is at the beginning of the rule, or else all the symbols to the left of the dot are nulling symbols. Either way, the origin and dot earlemes will always be the same.

Source choices are not recorded for predicted items. Predicted items are essential to get rules started in the Earley sets. But they are not actually needed when the time comes for parsing and evaluation. That means they don't need source choice entries.
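
A sketch of prediction, again with hypothetical tables: each QDFA state that expects symbols has a predicted QDFA state (in the example, state 5 predicts state 1), and the predicted Earley item is added at the current earleme with an empty list of source choices.

    use strict;
    use warnings;

    # Hypothetical table: the QDFA state predicted by each QDFA state, if any.
    my %predicted_state = ( 5 => 1 );    # state 5 (e -> e op . e) predicts state 1

    my $item    = { state => 5, origin => 0, current => 2 };    # S5@0-2
    my $predict = $predicted_state{ $item->{state} };
    if ( defined $predict ) {
        my $predicted_item = {
            state   => $predict,
            origin  => $item->{current},    # predictions originate where they are made,
            current => $item->{current},    # so origin and dot earleme are always the same
            sources => [],                  # no source choices are recorded for predictions
        };
        printf "S%d\@%d-%d\n", @{$predicted_item}{qw(state origin current)};    # S1@2-2
    }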

Initialization

One or two Earley items are put into Earley set 0 to start things off. In our example, the initial Earley items are

    S0@0-0
    S1@0-0

QDFA state 0 contains the start rules, with the dot pointer at the beginning. Only Earley set 0 will contain an Earley item for QDFA state 0. S1@0-0 contains the rules predicted by S0@0-0.

Ambiguous Parsing

If a parse is ambiguous, one or more scans, one or more completions, or a combination of scans and completions, will try to add at least one of the Earley items more than once. Marpa does not add an Earley item more than once. Instead of creating a second Earley item with the same QDFA state and origin at the current earleme, Marpa adds another source choice to the existing Earley item.

When the time comes to produce parses from the Earley item, more than one rule may apply. This is because Marpa's Earley items correspond to QDFA states instead of to a single rule. Every pairing of a source choice with an applicable rule is an alternative that can be selected when the choices are made that produce a parse. For more detail on how ambiguous parses are evaluated, including how Marpa determines which rules are applicable, see the section on creating the parse bocage, below.

HOW A SUCCESSFUL PARSE IS RECOGNIZED

As mentioned, what's usually called Earley's algorithm is just a recognizer, an algorithm to build the Earley sets. The input is recognized successfully if, at the earleme location designated as the end of parsing, there is an Earley item for a completed start rule state. A completed start rule state is a QDFA state containing a completed start rule. A completed start rule is the LR(0) item for the start rule that has its dot position at the end of that rule.

At most two of the QDFA states can be completed start rule states. One is a special case. In grammars which allow a null parse, the start state, QDFA state 0, is also a completed start rule state.

The other completed start rule state contains the LR(0) item which has Marpa's internal start rule with the dot pointer at the end of the rule. The lhs of the internal start rule is a special internal start symbol. The rhs of the internal start rule is the grammar's original start symbol. QDFA state 2 is the completed start rule state for our example parse:

    S2: 8
    e['] -> e .

A parse starting at earleme S is successful at earleme N if it contains an Earley item for a completed start rule state with an origin earleme of S and a dot earleme of N. Here's the Earley item which makes the parse in our example, which started at earleme 0, successful at earleme 3:

    S2@0-3 [p=S0@0-0; c=S6@0-3]
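
The test itself is simple. Here is a sketch, using the abbreviated item representation from the earlier sketches and the data of the example; only the states and origins matter for the test.

    use strict;
    use warnings;

    # Completed start rule states for the example grammar: QDFA state 2.
    # (QDFA state 0 would also qualify if the grammar allowed a null parse.)
    my %is_completed_start_state = ( 2 => 1 );

    # Earley set 3, reduced to states and origins.
    my @earley_set_3 = (
        { state => 4, origin => 2 },
        { state => 6, origin => 0 },
        { state => 3, origin => 2 },
        { state => 2, origin => 0 },    # S2@0-3 -- the item that makes the parse successful
        { state => 3, origin => 0 },
    );

    my $start_earleme = 0;
    my ($success) = grep {
        $is_completed_start_state{ $_->{state} } && $_->{origin} == $start_earleme
    } @earley_set_3;

    print $success ? "input recognized\n" : "input rejected\n";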
      

THE PARSE BOCAGE

After a successful parse has been recognized, it can be evaluated. The main data structure of a Marpa evaluator is a parse bocage.

Each Marpa evaluator is created with parse criteria. The parse criteria are the start and end earlemes of the parse, and the parse's top symbol.

Default values are often used for these parse criteria. The default start earleme is earleme 0. The default end earleme is the recognizer's current earleme. This will usually be the furthest earleme reached by any token in the parse. The default top symbol is the start symbol of the grammar.

The parse bocage stores all parses meeting the evaluator's parse criteria. In other general parsers, a parse forest has been used to store ambiguous parses. Marpa's parse bocage is an adaptation of a parse forest. Parse bocages differ from parse forests in three ways. First, parse bocages may contain cycles. Second, and-nodes and or-nodes strictly alternate in a parse bocage. That is, every direct child of an and-node is an or-node, and every direct child of an or-node is an and-node. Third, and-nodes in a parse bocage have at most two rhs symbols.

Parse forests and parse bocages both consist of and-nodes and or-nodes. In both parse forests and parse bocages, and-nodes contain pieces of parse trees, while or-nodes represent choices among the pieces.

In parse forests, the and-nodes directly represent productions of the original grammar. Marpa's parse bocages contain the original grammar in Marpa Bocage Form (MBF), which represents the original grammar's productions in pieces with at most two symbols on the rhs. In restricting the rhs to at most two symbols, MBF is very similar to Chomsky Normal Form (CNF).

Marpa uses MBF for three reasons. First, MBF can handle parses with cycles.

Second, MBF is essentially the form in which the grammar is found in the Earley items. Every source choice of an Earley item contains at most two links. These links can be treated as symbols on the rhs of a production. So the parse as found in the Earley items is in a form with at most two symbols on the right hand side.

Third, MBF easily produces binary trees. Binary trees are fast and easy to traverse and manipulate. The effort spent to rewrite the parse data found in the Earley sets back into the productions of the original grammar would result in a slower and more cumbersome data structure.

Or-Nodes and And-Nodes

Or-nodes and and-nodes strictly alternate in MBF. An or-node, called the start or-node, is always at the top of a Marpa bocage. Every child of an or-node is an and-node. Every child of an and-node is an or-node.

An or-node is trivial if it has only one child and-node. A trivial or-node represents a choice from a list containing only a single alternative. Marpa or-nodes are often trivial.

In MBF, an or-node represents a decision point in the parse, even though often it's a trivial decision point. There is an or-node for every symbol of every rule that might be used in a parse.

Every and-node represents either a production of the original grammar, or a piece of a production of the original grammar. And-nodes represent specific alternatives.

Every or-node is a parent of one or more and-nodes. Every and-node either has a token value or is the parent of a cause or-node. Any and-node may also be the parent of a predecessor or-node. When there is a predecessor child or-node, it always comes first, before the token or the cause or-node.
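
As a sketch (these nested hashes are an illustration, not Marpa's internal structures), here is the shape of a fragment of the example bocage: a trivial or-node whose single and-node child has both a predecessor or-node and a cause or-node.

    use strict;
    use warnings;

    # A fragment of the bocage for the input "2+2".  Or-nodes hold a list of
    # and-node children; and-nodes hold an optional predecessor or-node and
    # either a token value or a cause or-node, never both.
    my $or_node = {
        name     => 'S6@0-3L0',
        children => [
            {   # the single and-node child of this trivial or-node
                predecessor => { name => 'S5@0-2R0:2', children => [] },
                cause       => { name => 'S4@2-3L0',   children => [] },
            },
        ],
    };

    my ($and_node) = @{ $or_node->{children} };
    print 'predecessor: ', $and_node->{predecessor}{name}, "\n";    # S5@0-2R0:2
    print 'cause:       ', $and_node->{cause}{name},       "\n";    # S4@2-3L0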

Or-Nodes and their Child Criteria

Each or-node corresponds to an Earley item and child criteria. The or-node's corresponding Earley item is determined when the or-node is created. The child criteria for the or-node are set at the same time. The or-node's child criteria determine which and-nodes can be children of the or-node.

There are two kinds of or-node, each with different child criteria. The start or-node, and all cause or-nodes, are closure or-nodes. All other or-nodes are kernel or-nodes. All kernel or-nodes are predecessor or-nodes, and vice versa.

Closure or-nodes are so called because they represent points at which a complete rule or rules have been recognized. That is, they represent the "closing" of a rule. The child criterion for a closure or-node is a symbol, called its child lhs symbol.

Kernel or-nodes represent decision points in the interior of a rule. The child criteria for a kernel or-node are a rule and dot position in that rule.

No or-nodes represent the beginning of a rule. Because Marpa rewrites grammars so that there are no empty rules, no closure or-nodes can represent a point at the beginning of a rule. And kernel or-nodes are not created for the dot position at the beginning of a rule.

The symbol before the dot position in a kernel or-node's rule is the or-node's pre-dot symbol. The symbol after the dot position in a kernel or-node's rule is the or-node's post-dot symbol. Since kernel or-nodes never have dot positions at the end or the beginning of their rules, they always have both a pre-dot and post-dot symbol.

The Relationship between Or-Nodes and Earley Items

If the post-dot symbol of a kernel or-node is nulling, it is a mortar or-node. Every other kernel or-node, and every closure or-node, is a brick or-node.

Marpa has no LR(0) items with nulling post-dot symbols. But Marpa requires there to be an or-node for every symbol in every rule that might be used in any possible parse, including nulling symbols and the symbols before nulling symbols. This is why there are mortar or-nodes. The idea behind mortar or-nodes is that they fill the cracks in between brick or-nodes.

For brick kernel or-nodes the QDFA state of the corresponding Earley item will have an LR(0) item with a rule and dot position which exactly match the rule and dot position of the kernel or-node's child criteria. For mortar or-nodes, the QDFA state of the corresponding Earley item will have an LR(0) item with a rule which is the same as the rule in the or-node's child criteria, and the only symbols between the dot position of the or-node and the dot position of that LR(0) item will be nulling symbols.

How the Parse Bocage is Built

Initializing the Parse Bocage

Creation of a parse bocage starts with the creation of the start or-node. The start or-node is a closure or-node. The child lhs symbol of the start or-node will be the top symbol in the parse. For the parse to proceed, the start or-node's corresponding Earley item needs to contain a QDFA state which contains at least one completed rule which has the child lhs symbol as its lhs. The parse will start at the origin of the Earley item, and the parse will end at the current earleme of the Earley item.

In normal use, the child lhs symbol of the start or-node is set to the start symbol of the grammar. The corresponding Earley item for the start or-node normally is set to be the Earley item which has its current earleme at the end of parsing, which has its origin at earleme 0, and which has a completed start rule state.

The parse bocage is created by processing or-nodes recursively. At each step there is a list of or-nodes remaining to be processed. At the first step, the list of unprocessed or-nodes contains only the start or-node.

Creating the And-Node Children for a Parent Or-Node

The loop that creates a parse bocage repeatedly takes an or-node from a list of or-nodes yet to be processed, and adds that or-node to the parse bocage. It then creates that or-node's and-node children and links them into the parse bocage. The and-node children will often have their own child or-nodes. These new cause and predecessor or-nodes must be added to the list of or-nodes yet to be processed.

In describing the construction of the parse bocage, I'll call the or-node currently being processed, the parent or-node, or simply the parent. I'll call any and-node child that is created for this parent, an and-node child, or where it's clear, simply a child. I'll call an or-node child of an and-node child, a grandchild. An and-node child may have a predecessor grandchild and a cause grandchild.

For every or-node there is a set of applicable rules, which are determined by the or-node's Earley item and its child criteria. For kernel or-nodes there is only one applicable rule, the rule in the child criteria. For closure or-nodes, the applicable rules are taken from the QDFA state of the corresponding Earley item. A rule is applicable if it is a rule in a completed LR(0) item in the QDFA state of the corresponding Earley item, and if the rule's lhs is the same as the child lhs symbol of the or-node's child criteria. There is always at least one applicable rule.

For a parent or-node, a child and-node is created for every pairing of an applicable rule with a source choice of the corresponding Earley item. This means that the number of and-node children will be the number of applicable rules, times the number of source choices. There is always at least one and-node child for any parent or-node.
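
A sketch of that pairing (the rule and source choice values are placeholders): the and-node children of a parent or-node are the cross product of its applicable rules and the source choices of its Earley item.

    use strict;
    use warnings;

    # Placeholder applicable rules and source choices for some parent or-node.
    my @applicable_rules = ( 'rule 0', 'rule 1' );
    my @source_choices   = ('source choice 0');

    # One and-node child per (applicable rule, source choice) pair.
    my @and_node_children;
    for my $rule (@applicable_rules) {
        for my $choice (@source_choices) {
            push @and_node_children, { rule => $rule, source_choice => $choice };
        }
    }

    # 2 applicable rules x 1 source choice = 2 and-node children
    print scalar @and_node_children, " and-node children\n";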

In addition to an applicable rule and a source choice, every and-node corresponds to a dot position. If the parent or-node was a kernel or-node, the dot position will be the same as the dot position of the parent or-node's child criteria. If the parent or-node was a closure or-node, the dot position will be at the end of the applicable rule.

Every and-node has a pre-dot symbol. Where the parent or-node is a kernel or-node, this is because kernel or-nodes are never positioned at the beginning of rules. Where the parent or-node is a closure or-node, this is because there are no empty rules after Marpa rewrites its grammar.

If the and-node's pre-dot symbol is nulling, the and-node is created with a token and a predecessor grandchild. The value of the token is the null value of the pre-dot symbol. The predecessor grandchild will be a mortar or-node. The corresponding Earley item of the predecessor grandchild will be the Earley item of the parent or-node. The child criteria for the predecessor grandchild will be the and-node's rule, with a dot position before the and-node's pre-dot symbol.

If the and-node's pre-dot symbol is non-nulling, its token, cause and predecessor are created based on the and-node's corresponding source choice. If it was a token source choice, the and-node will have a token, which will take its value from the source choice token. If it was a completion source choice, the and-node will have a cause grandchild. The corresponding Earley item for the cause grandchild will be the Earley item which was the cause in the source choice. The child lhs symbol for the cause grandchild will be the pre-dot symbol of the and-node.

If the and-node's pre-dot symbol is non-nulling, a predecessor grandchild or-node is created if and only if its corresponding source choice had a predecessor item. The predecessor grandchild will be created as an or-node child of the and-node child. If a predecessor grandchild is created, its corresponding Earley item will be the predecessor item of the and-node child's corresponding source choice. The child criteria for the predecessor grandchild will be the and-node child's rule, with a dot position before the and-node child's pre-dot symbol.

Completing the Parse Bocage

Once an or-node has been processed, and its and-node children created and linked into the parse bocage, Marpa checks to see if any more or-nodes remain on the list to be processed. As the last section showed, when Marpa creates and-node children, it often creates predecessor and cause grandchildren. These grandchild or-nodes must be added to the list of or-nodes yet to be processed. Or-nodes are pulled from the list and processed until there are no more.

Even when the parse is infinitely ambiguous, Marpa is guaranteed to reach a point where there are no more or-nodes on the list of unprocessed or-nodes. Token and-nodes without predecessors are called leaf and-nodes. Leaf and-nodes do not have any child nodes. Eventually every recursion through the Earley sets reaches a leaf and-node. At this point, the parse bocage has been successfully created.

Naming Or-Nodes

The name of an or-node is the name of its corresponding Earley item, followed by a label describing its child criteria. For a kernel or-node, the label consists of the letter "R", and the number of the or-node's rule and its dot position, colon-separated. The "R" in the label is a mnemonic for "rule".

For example, S3@0-1R0:1 is the name of a kernel or-node. The corresponding Earley item is S3@0-1. The child criteria are rule 0 at dot position 1 (e -> e . op e).

For closure or-nodes, the child criteria label is the letter "L", followed by the number of the child lhs symbol. The "L" in the symbol or-node's label is a mnemonic for "left hand side". An example of a closure or-node is S6@0-3L0. In this case the matching Earley item is S6@0-3, and the child lhs symbol is e, which was symbol 0. (Symbol numbers for the example may be found in the appendix, in the show_symbols output.)

Naming And-Nodes

An and-node is always a child of exactly one or-node. The name of an and-node is the name of the parent or-node, followed by a numerical index in square brackets. The child and-nodes are numbered starting at 0. For example, S6@0-3L0[0] is the first child of the or-node named S6@0-3L0.
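
Both naming conventions are easy to reproduce; the helpers below are a sketch of the formats only, built on the hypothetical fields used in the earlier sketches.

    use strict;
    use warnings;

    sub earley_item_name { return sprintf 'S%d@%d-%d', @_ }

    # Kernel or-node: Earley item name, then "R<rule number>:<dot position>".
    sub kernel_or_node_name {
        my ( $item_name, $rule, $dot ) = @_;
        return sprintf '%sR%d:%d', $item_name, $rule, $dot;
    }

    # Closure or-node: Earley item name, then "L<lhs symbol number>".
    sub closure_or_node_name {
        my ( $item_name, $lhs_symbol ) = @_;
        return sprintf '%sL%d', $item_name, $lhs_symbol;
    }

    # And-node: parent or-node name, then a zero-based index in square brackets.
    sub and_node_name {
        my ( $or_node_name, $index ) = @_;
        return sprintf '%s[%d]', $or_node_name, $index;
    }

    print kernel_or_node_name( earley_item_name( 3, 0, 1 ), 0, 1 ), "\n";    # S3@0-1R0:1
    print and_node_name( closure_or_node_name( earley_item_name( 6, 0, 3 ), 0 ), 0 ),
        "\n";                                                                # S6@0-3L0[0]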

The Parse Bocage Grammar

Marpa typically displays a parse bocage as a parse bocage grammar (PBG). It may seem pointless to have come this far in the parsing, only to write out another grammar. But there is a big difference between the original grammar and the parse bocage grammar. The original grammar describes all parses which are possible for any input. The PBG describes only those parses which are possible for the specific input that was accepted by the recognizer.

Examples of Parse Bocage Productions

    S2@0-3L3o0 -> S2@0-3L3o0a0

A parse bocage grammar always starts with the start or-production. The above is the start production for our example. The left hand side is an or-node which matches a start rule completion item. The completed start rule state in our example is QDFA state 2, and the input in our example was 3 earlemes long. So the start Earley item is S2@0-3. The start symbol is e['], which is symbol number 3. So the child criterion label is "L3".

Start or-productions are always trivial. The rhs of the start or-production is the and-node which is its first and only child. It has index 0.

Or-nodes appear in the PBG as one or more or-productions. Each or-production has the name of an or-node as its lhs. Also, each or-production has the name of one of the or-node's child and-nodes as the only symbol on its rhs.

In many cases, the or-nodes are trivial -- they have exactly one child. Trivial or-nodes have only one or-production in the PBG. In an unambiguous parse, like the parse of our example, all the or-nodes are trivial.

Even when they are non-trivial, or-productions contain little information, and show_bocage does not show or-productions unless called with the verbose option. Or-productions are necessary for the PBG to be a real BNF grammar, but all the information in them is easily deduced from the names of the and-nodes on the lhs of the and-productions.

For every and-node on the lhs of an and-production, there will always be an or-production with the unbracketed name of the and-node as the or-production's lhs, and with the and-node itself as the or-production's only rhs symbol. For example, from the name of the and-node S2@0-3L3[0], we can deduce the existence of the or-production in the example above.

Here is the and-production which is the start or-node's only child:

    S2@0-3L3o0a0 -> S6@0-3L0o1
        rule 2: e['] -> e .

Parse bocage and-productions always have one or two rhs symbols. Either a token or a closure or-node will be on the right hand side. One or the other will always be present, but never both. Additionally, there may be a kernel or-node. If a kernel or-node is present, its name is always the first symbol on the rhs.

In this example, the only symbol on the rhs is a closure or-node: S6@0-3L0. The next line contains a dotted rule which shows the rule and dot position for this and-node.

The last line tells us the length of the rule's rhs, and that there is a Perl closure corresponding to this rule. Perl closures and closure or-nodes are not to be confused. The word "closure" in "closure or-node" refers to its being the last in a rule. That is to say, the or-node "closes" the rule. A closure or-node may or may not have a corresponding Perl closure.

    S5@0-2R0:2o2 -> S5@0-2R0:2o2a2
    S5@0-2R0:2o2a2 -> S3@0-1R0:1o4 '+'
        rule 0: e -> e op . e

This example shows an and-node with a kernel or-node and a token on the rhs of its rule. The and-node is at position 2 of rule 0, as shown by its child criteria label ("R0:2") and in the dotted rule on the third line. The pre-dot symbol ("op") matched the token ("+"). The kernel or-node (S3@0-1R0:1) points to the part of the bocage which derives the symbols before the pre-dot symbol in rule 0. In this case there is only one symbol ("e") before the pre-dot symbol.

    S4@0-1L0o5 -> S4@0-1L0o5a5
    S4@0-1L0o5a5 -> '2'
        rule 1: e -> number .

The and-production above has only a token on its rhs. There is no kernel or-node, because the pre-dot symbol for this and-production is the first one of its rule -- in fact, it is the only symbol in its rule.

S4@0-1L0 is a closure or-node with "L0" as its child criteria. Applicable rules will be found in QDFA state 4, and must be completed rules with symbol number 0 on the lhs. Symbol number 0 is e. There is only one LR(0) item in QDFA state 4, but it is, as required, a completed rule with e as its lhs.

THE PARSE TREES

Parse trees have been eliminated. The following needs to be rewritten.

Marpa evaluates using an evaluation stack, which is initialized to empty. Every time Marpa evaluates a node, it pushes the result onto the evaluation stack. If a node needs values as input, perhaps because a Perl closure requires arguments, the values are popped from the evaluation stack. When Marpa reaches the root tree-node and pushes its computed value onto the stack, that value will be the only item on the stack. That one remaining stack item becomes the value of the parse.
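
Here is a sketch of that stack discipline for the example input "2+2". The two closures are stand-ins for the grammar's 'number' and 'op' actions, which are named but not shown in the MDL grammar in the appendix; they build the "text==value" strings that appear in the trace_values output below.

    use strict;
    use warnings;

    # Stand-ins for the user's actions: they build "text==value" strings.
    my %action = (
        'e -> number' => sub { my ($n) = @_; return "$n==$n" },
        'e -> e op e' => sub {
            my ( $left, $op, $right ) = @_;
            my ( $lt, $lv ) = $left  =~ /\A(.*)==(\d+)\z/;
            my ( $rt, $rv ) = $right =~ /\A(.*)==(\d+)\z/;
            my $value = $op eq '+' ? $lv + $rv
                      : $op eq '*' ? $lv * $rv
                      :              $lv - $rv;
            return "($lt$op$rt)==$value";
        },
    );

    my @stack;

    # Pop a rule's rhs values off the stack, apply its action, push the result.
    sub reduce_rule {
        my ( $rule, $rhs_length ) = @_;
        my @args = splice @stack, -$rhs_length;
        push @stack, $action{$rule}->(@args);
        return;
    }

    push @stack, '2';                   # token value, as for S4@0-1L0o5a5
    reduce_rule( 'e -> number', 1 );    # '2==2'
    push @stack, '+';                   # token value, as for S5@0-2R0:2o2a2
    push @stack, '2';                   # token value, as for S4@2-3L0o3a3
    reduce_rule( 'e -> number', 1 );    # '2==2'
    reduce_rule( 'e -> e op e', 3 );    # '(2+2)==4'
    print $stack[0], "\n";              # the single remaining value is the parse value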

The evaluation process can be followed by setting the trace_values option. The trace_values output for our example is in an appendix. Here's the portion of that output that evaluates the nodes for the second number and for the addition:

    Popping 1 values to evaluate S4@2-3L0o3a3, rule: 1: e -> number
    Calculated and pushed value: '2==2'
    Popping 3 values to evaluate S6@0-3L0o1a1, rule: 0: e -> e op e
    Calculated and pushed value: '(2+2)==4'

GRAMMAR REWRITING

Marpa rewrites grammars, adding internal symbols and rules in the process. This rewriting does not affect the semantics, but it does show up when you examine the internals.

Marpa's internal symbols have tags at the end, enclosed in square brackets. This means all Marpa internal symbols end in a right square bracket.

Adding a Start Rule

Many parsers add their own start rule and their own start symbol to grammars. Marpa is no exception. The new start symbol is the old one with "[']" suffixed. We saw a Marpa internal start symbol above: e[']. If the grammar allows a null parse, there will also be a nulling start symbol, with "['][]" suffixed.

Eliminating Proper Nullable Symbols

Nulling symbols are those which always produce the empty sentence. Nullable symbols are those which sometimes produce the empty sentence. Non-nullable symbols are those which never produce the empty sentence.

Pedantically, all nulling symbols are also nullable symbols. A proper nullable is any nullable symbol which is not a nulling symbol. In other words, a proper nullable is a symbol that can produce either the empty sentence or a non-empty sentence.

Nullable symbols were a problem in previous versions of Earley parsers. Aycock and Horspool 2002 outlined a new approach for dealing with them. I use their ideas with modifications of my own.

Marpa rewrites its grammar to eliminate proper nullables. It does this by turning every proper nullable into two symbols: a non-nullable variant and a nulling variant. The non-nullable variant keeps the original symbol's name, but is no longer allowed to appear in places where it might be nulled. The name of the nulling variant is that of the original symbol with the nulling tag ("[]") suffixed. When the nulling variant is used, it must be nulled.

The newly introduced nulling symbols will not appear on any left hand sides, with one exception: grammars that allow a null parse will have a nulling start rule. Except for the nulling start symbol, Marpa marks nulling symbols internally and recognizes them directly, without the need of empty rules.
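
The split itself can be sketched in a line or two of Perl; the naming follows the tags described above, though the code is an illustration rather than Marpa's own.

    use strict;
    use warnings;

    # Split a proper nullable into a non-nullable variant (the original name)
    # and a nulling variant (the original name with the "[]" tag suffixed).
    sub split_proper_nullable {
        my ($symbol_name) = @_;
        return ( $symbol_name, $symbol_name . '[]' );
    }

    my ( $non_nullable, $nulling ) =
        split_proper_nullable('action-sentence:optional');
    print "$non_nullable\n";    # action-sentence:optional
    print "$nulling\n";         # action-sentence:optional[]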

Rules with proper nullables on the rhs have to be replaced with new rules covering every possible combination of the non-nullable and nulling variants. That rewrite is described in the next section.

CHAF Rewriting

To deal with the splitting of rhs proper nullables into two symbols, one non-nullable and one nulling, Aycock and Horspool created new rules covering all possible rhs combinations of non-nullable and nulling symbols. This factoring was exponential in the worst case. I don't like leaving exponential explosions in an algorithm, even unlikely ones. And I suspect that the generation of all possible combinations for an arbitrarily long right hand side might be inefficient in practice.

A result due to Chomsky shows that any grammar can be rewritten as a grammar with at most two symbols on the right hand side. Relaxing Chomsky's rewrite to allow right hand sides with any number of symbols, but at most two proper nullables, produces a rewrite I call CHAF (Chomsky-Horspool-Aycock Form).

CHAF changes the worst case to linear, and in practical cases lowers the multiplier. Here's an example of a CHAF rewrite from Marpa's own self-grammar. First, the rule:

    production paragraph:
        non structural production sentences,
        production sentence,
        non structural production sentences,
        optional action sentence,
        non structural production sentences.

This rule contains four proper nullables, reinforcing my fear that grammars written as test cases won't be the only ones with lots of proper nullables on the right hand side. Both "non structural production sentences" and "optional action sentence" are proper nullables, and "non structural production sentences" appears three times.

Here is the output from show_rules, showing what Marpa did with this rule:

    11: production-paragraph
            -> non-structural-production-sentences
            production-sentence
            non-structural-production-sentences
            action-sentence:optional
            non-structural-production-sentences /* !useful */

    94: production-paragraph ->
        non-structural-production-sentences
        production-sentence
        production-paragraph[R11:2][x5b] /* vrhs real=2 */
    95: production-paragraph ->
        non-structural-production-sentences
        production-sentence
        non-structural-production-sentences[]
        action-sentence:optional[]
        non-structural-production-sentences[]
    96: production-paragraph ->
        non-structural-production-sentences[]
        production-sentence
        production-paragraph[R11:2][x5b] /* vrhs real=2 */
    97: production-paragraph ->
        non-structural-production-sentences[]
        production-sentence
        non-structural-production-sentences[]
        action-sentence:optional[]
        non-structural-production-sentences[]

    98: production-paragraph[R11:2][x5b] ->
        non-structural-production-sentences
        production-paragraph[R11:3][x5c] /* vlhs vrhs real=1 */
    99: production-paragraph[R11:2][x5b] ->
        non-structural-production-sentences
        action-sentence:optional[]
        non-structural-production-sentences[] /* vlhs real=3 */
    100: production-paragraph[R11:2][x5b] ->
        non-structural-production-sentences[]
        production-paragraph[R11:3][x5c] /* vlhs vrhs real=1 */

    101: production-paragraph[R11:3][x5c] ->
        action-sentence:optional
        non-structural-production-sentences /* vlhs real=2 */
    102: production-paragraph[R11:3][x5c] ->
        action-sentence:optional
        non-structural-production-sentences[] /* vlhs real=2 */
    103: production-paragraph[R11:3][x5c] ->
        action-sentence:optional[]
        non-structural-production-sentences /* vlhs real=2 */

Rule 11 is the original rule. Because Marpa has rewritten it, the rule is marked !useful, telling later stages in the precomputation to ignore it. Marpa breaks Rule 11 up into three pieces, each with no more than two proper nullables. Rules 94 to 97 are the first piece, with the first two symbols from Rule 11. Rules 98 to 100 are the second, with the 3rd symbol. Rules 101 to 103 are the third, with the 4th and 5th symbols from Rule 11.

Each piece is factored, so that every combination of nulling and non-nullable symbols is included. New symbols are introduced to be the left hand sides of the pieces. The tag "[R11:3]" indicates that this symbol is the left hand side for the piece of Rule 11 which begins at right hand symbol 3 (the first symbol is symbol 0). The tags beginning with an "x", like "[x5b]", are arbitrary hex values, inserted to guarantee that the new symbols are unique.

This rule is a worst case for CHAF, because the last three symbols of the right hand side are all proper nullables. That means that the last two pieces of the original rule can be either empty or non-empty, and therefore that both of the newly created lhs symbols are also proper nullables.

With the CHAF rewrite, there are now a total of 6 proper nullables: the original 4 plus the 2 symbols newly created to serve as left hand sides. This is why, in order to have only 2 proper nullables per piece, the original rule needed to be divided into 3 pieces. The newly created lhs symbols, because they are proper nullables, need to be split into nulling and non-nullable variants and factored, just like the proper nullables in the original rule.

Nonetheless this factoring can be done with 10 rules in CHAF, while the original Aycock-Horspool factoring (NNF) required 16. After more than 4 proper nullables, the advantage of CHAF becomes overwhelming. With 5 proper nullables, there would be 13 rules for CHAF versus 32 for NNF. With 6 proper nullables, 16 versus 64.
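
As a check on those numbers, here is a sketch that tabulates the comparison. The NNF count is 2 to the power of the number of proper nullables; the CHAF counts simply follow the pattern of the figures quoted above (3 more rules per additional proper nullable), which should be read as an observation about those figures rather than a general derivation.

    use strict;
    use warnings;

    # NNF factors every combination of nulling and non-nullable variants: 2**n rules.
    # The CHAF counts follow the figures quoted in the text: 10, 13, 16 rules
    # for 4, 5 and 6 proper nullables, an increase of 3 per additional nullable.
    for my $proper_nullables ( 4 .. 6 ) {
        my $nnf_rules  = 2**$proper_nullables;
        my $chaf_rules = 3 * $proper_nullables - 2;    # reproduces the quoted counts
        printf "%d proper nullables: CHAF %2d rules, NNF %2d rules\n",
            $proper_nullables, $chaf_rules, $nnf_rules;
    }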

The user's semantics are preserved, because Marpa, while splitting the rule into pieces and factoring the pieces, inserts logic to gather and preserve the values of child nodes. These values are presented to the user's actions as if no CHAF rewrite had occurred.

Converting Sequence Productions to BNF

Internally, Marpa converts productions specified as sequences into BNF productions. The conversion is done in a standard way. For example,

    paragraphs: empty line separated paragraph sequence.

becomes

    1: paragraphs -> paragraph[Seq:1-*][Sep:empty_line][x5] /* vrhs discard_sep real=0 */
    2: paragraphs -> paragraph[Seq:1-*][Sep:empty_line][x5] empty-line /* vrhs discard_sep real=1 */
    3: paragraph[Seq:1-*][Sep:empty_line][x5] -> paragraph /* vlhs real=1 */
    4: paragraph[Seq:1-*][Sep:empty_line][x5]
        -> paragraph empty-line paragraph[Seq:1-*][Sep:empty_line][x5] /* vlhs vrhs real=2 */

In the added symbol, the tag "[Seq:1-*]" indicates that this is a symbol for a sequence of one or more symbols, and the tag "[Sep:empty_line]" indicates that the sequence is empty_line separated.

Here's another example, this time of a sequence without a separator:

    definition paragraph: definition sequence.

is written as the following BNF:

    8: definition-paragraph -> definition[Seq:1-*][xa] /* vrhs real=0 */
    9: definition[Seq:1-*][xa] -> definition /* vlhs real=1 */
    10: definition[Seq:1-*][xa]
        -> definition definition[Seq:1-*][xa] /* vlhs vrhs real=1 */

SELF-TUTORIALS

If you want to investigate internals more on your own, here are two "self-tutorials", which should make you pretty much an expert.

First, go through the show_earley_sets output in the appendix. For each Earley item in it, reason out how it came to be there.

Second, take the grammar used in the example and run it on the input text "2+2*3". While the parse in this document was not ambiguous, the grammar was. The grammar's ambiguity reveals itself when there is more than one operation in the input string. Get the show_earley_sets output from any one of the ambiguous parses and reason out how all of its Earley items came to be.

APPENDIX: DATA FOR THE EXAMPLE

Below are the MDL grammar and outputs for the example used in this document. The input text was "2+2".

THE FOLLOWING DISPLAY NEEDS TO CHANGE SO THAT IT NO LONGER USES MDL.

MDL Grammar

    semantics are perl5.  version is 0.001_019.

    default action is 'default_action'.

    start symbol is E.

    E: E, Op, E.  'op'.

    E: Number. 'number'.

    Number matches qr/\d+/.
 
    Op matches qr/[-+*]/.

show_symbols Output

    0: e, lhs=[0 1] rhs=[0 2]
    1: op, lhs=[] rhs=[0] terminal
    2: number, lhs=[] rhs=[1] terminal
    3: e['], lhs=[2] rhs=[]

show_rules Output

    0: e -> e op e
    1: e -> number
    2: e['] -> e /* vlhs real=1 */

show_QDFA Output

    Start States: S0; S1
    S0: 7
    e['] -> . e
     <e> => S2
    S1: predict; 1,5
    e -> . e op e
    e -> . number
     <e> => S3
     <number> => S4
    S2: 8
    e['] -> e .
    S3: 2
    e -> e . op e
     <op> => S1; S5
    S4: 6
    e -> number .
    S5: 3
    e -> e op . e
     <e> => S6
    S6: 4
    e -> e op e .

show_earley_sets Output

    Current Earley Set: 5; Furthest: 3
    Earley Set 0
    S0@0-0
    S1@0-0
    Earley Set 1
    S4@0-1 [p=S1@0-0; s=number; t=\'2']
    S2@0-1 [p=S0@0-0; c=S4@0-1]
    S3@0-1 [p=S1@0-0; c=S4@0-1]
    Earley Set 2
    S5@0-2 [p=S3@0-1; s=op; t=\'+']
    S1@2-2
    Earley Set 3
    S4@2-3 [p=S1@2-2; s=number; t=\'2']
    S6@0-3 [p=S5@0-2; c=S4@2-3]
    S3@2-3 [p=S1@2-2; c=S4@2-3]
    S2@0-3 [p=S0@0-0; c=S6@0-3]
    S3@0-3 [p=S1@0-0; c=S6@0-3]

show_bocage Output

    parse count: 0
    S2@0-3L3o0 -> S2@0-3L3o0a0
    S2@0-3L3o0a0 -> S6@0-3L0o1
        rule 2: e['] -> e .
    S6@0-3L0o1 -> S6@0-3L0o1a1
    S6@0-3L0o1a1 -> S5@0-2R0:2o2 S4@2-3L0o3
        rule 0: e -> e op e .
    S5@0-2R0:2o2 -> S5@0-2R0:2o2a2
    S5@0-2R0:2o2a2 -> S3@0-1R0:1o4 '+'
        rule 0: e -> e op . e
    S4@2-3L0o3 -> S4@2-3L0o3a3
    S4@2-3L0o3a3 -> '2'
        rule 1: e -> number .
    S3@0-1R0:1o4 -> S3@0-1R0:1o4a4
    S3@0-1R0:1o4a4 -> S4@0-1L0o5
        rule 0: e -> e . op e
    S4@0-1L0o5 -> S4@0-1L0o5a5
    S4@0-1L0o5a5 -> '2'
        rule 1: e -> number .

trace_values Output

    Pushed value from S4@0-1L0o5a5: '2'
    Popping 1 values to evaluate S4@0-1L0o5a5, rule: 1: e -> number
    Calculated and pushed value: '2==2'
    Pushed value from S5@0-2R0:2o2a2: '+'
    Pushed value from S4@2-3L0o3a3: '2'
    Popping 1 values to evaluate S4@2-3L0o3a3, rule: 1: e -> number
    Calculated and pushed value: '2==2'
    Popping 3 values to evaluate S6@0-3L0o1a1, rule: 0: e -> e op e
    Calculated and pushed value: '(2+2)==4'
    New Virtual Rule: S2@0-3L3o0a0, rule: 2: e['] -> e
    Symbol count is 1, now 1 rules
    (2+2)==4

SUPPORT

See the support section in the main module.

AUTHOR

Jeffrey Kegler

LICENSE AND COPYRIGHT

Copyright 2007 - 2009 Jeffrey Kegler

This program is free software; you can redistribute it and/or modify it under the same terms as Perl 5.10.0.