- NAME
- OVERVIEW
- DESCRIPTION
- THE BASE EXAMPLE
- AN EARLY WARNING
- THE trace_terminals OUTPUT
- THE show_progress OUTPUT
- COMPLICATIONS
- APPENDIX: FULL TRACES AND CODE FOR THE BASE EXAMPLE
- LICENSE AND COPYRIGHT

# NAME

Marpa::Debug - Marpa Grammar Debugging

# OVERVIEW

This document describes Marpa's more powerful general-use tracing and debugging techniques in detail. It assumes that you have written a grammar for your Marpa application, and that something is going wrong. It is a very good idea to read the overview document for tracing problems before reading this one.

# DESCRIPTION

To read the `show_progress`

output, it is important to have a basic idea of what Earley items are, and of what the information in them means. Everything that the user needs to know is explained in this section.

## Dotted Rules

The idea behind Earley's algorithm is that you can parse by building a table of rules and where you are in those rules. "Where" means two things: location in the rule relative to the rule's symbols, and location relative to the parse's input stream.

Let's look at an example of a rule in a context-free grammar. Here's the rule for assignment from the Perl distribution's `perly.y`

` termbinop ::= term ASSIGNOP term`

In parsing this rule, we can be at any of four possible locations. One location is at the beginning, before all of the symbols. The other three locations are immediately after each of the rule's three symbols.

Within a rule, position relative to the symbols of the rule is traditionally indicated with a dot. In fact, the symbol-relative rule position is very often called the **dot location**. Taken as a pair, a rule and a dot location are called a **dotted rule**.

Here's our rule with a dot location indicated:

` termbinop ::= · term ASSIGNOP term`

The dot location in this dotted rule is at the beginning. A dot location at the beginning of a dotted rule means that we have not recognized any symbols in the rule yet. All we are doing is predicting that the rule will occur. A dotted rule with the dot before all of its symbols is called a **prediction**.

Here's another dotted rule:

` termbinop ::= term · ASSIGNOP term`

In this dotted rule, we are saying we have seen a `term`

, but have not yet recognized an `ASSIGNOP`

. (`ASSIGNOP`

is `perly.y`

's internal name for the assignment operator. In plain Perl terms, this is the "`=`

" character.)

There's another special kind of dotted rule, a completion. A **completion** is a dotted rule with the dot after all of the symbols. Here's the completed version of the example we've been using:

` termbinop ::= term ASSIGNOP term ·`

A completion indicates that a rule has been fully recognized.

## Earley Items

In dotted rules, Earley's algorithm has all but one piece of the information it needs to track. The final piece is the second of the two "wheres": where in the input stream. To associate input stream location and dotted rules, Earley's algorithm uses what are now called Earley items.

A convenient way to think of an **Earley item** is as a triple, or 3-tuple, consisting of dotted rule, origin and current location. The **origin** is the location in the input stream where the dotted rule starts. The **current location** is the location in the input stream which corresponds to the dot position.

Two noteworthy consequences follow from way in which origin and current location are defined. First, if a dotted rule is a prediction, origin and current location will always be the same. Second, the input stream location where a rule ends is not necessarily tracked, unless the dotted rule is a completion. In other cases, we don't necessarily know the location at which (or even if) the rule will be completed.

# THE BASE EXAMPLE

For this example of debugging, I've taken a very common example of a grammar and deliberately introduced a problem. (All the code and the full trace outputs for this base example are in the Appendix.) I've commented out the correct start rule:

` ## { lhs => 'Expression', rhs => [qw/Term/] },`

and replaced it with another start rule, one which will cause problems:

` { lhs => 'Expression', rhs => [qw/Factor/] },`

In what follows, we'll pretend we don't already know where the problem is, and use the Marpa diagnostics and tracing facilities to "discover" it.

# AN EARLY WARNING

Right off the bat, we get two warning messages:

```
Inaccessible symbol: Add
Inaccessible symbol: Term
```

If we were alert, these would be enough to tell us there is a serious problem. "Inaccessible" symbols are symbols which cannot be reached from the start symbol. This means that the grammar will never produce them, and that parses will never find them in the input.

Since `Add`

and `Term`

are both important symbols in our application, that should tell us our grammar has a serious problem. In fact, these warning messages would often be enough to point us to the error. But, in order to look at more of Marpa's tracing facilities, let's pretend we have not had our morning coffee, and we miss the significance of these warning messages.

# THE trace_terminals OUTPUT

Before looking at Marpa's progress reports, it usually best to orient yourself by looking at the output from `trace_terminals`

. Typically, you will be interested in what were the last tokens to be accepted, and perhaps what tokens the recognizer was looking for when it didn't find what it wanted. Sometimes that information alone is enough to make it clear where the problem is.

The full `trace_terminals`

output for this base example is in the Appendix. We see that the recognizer seems to accept "`42*1`

" but it fails when it confronts the plus sign ("`+`

"). The last two lines are:

```
Accepted "Number" at 2-3
Expecting "Multiply" at 3
```

A note in passing: Marpa shows the location of the tokens it accepts as a range of locations. For `Number`

, the range is "`2-3`

", indicating that earleme 2 is the start location and earleme 3 is the end location. That level of detail will seem like overkill in ordinary applications, where every token has length 1. But Marpa allows other input models, and in those models the information about start and end location of the token is important.

Returning to the problem at hand: debugging the grammar. We notice that at earleme 3 we are expecting a `Multiply`

operator, but not an `Add`

operator. That should strike us as strange, and send us back to the grammar. But for the sake of our example we will assume that we are slow on the uptake today, and this does not clue us in. We move on.

# THE show_progress OUTPUT

The last tool that you should ever need for debugging Marpa grammars are its progress reports. The progress reports show Earley items in their original form. Marpa does not actually use Earley's original items directly. But they underlie Marpa's algorithm and Marpa reconstructs them for `show_progress`

.

In the Appendix, progress reports for the entire parse are shown. Our base example in this document is a very small one, so that's a reasonable thing to do in this case. If a parse is at all large, you'll often need to be selective.

Of most interest is the progress report for the "current" Earley set. In our example that's earleme 3. The current Earley set is in the return value of the `tokens`

method. But it may be easier to simply call `show_progress`

without arguments. By default, `show_progress`

prints out only the progress reports for the current earleme.

Here are the progress reports for the current earleme, earleme 3, from our example.

```
BUILDING @0-3 Factor -> Factor . Multiply Factor
BUILDING @2-3 Factor -> Factor . Multiply Factor
COMPLETED @0-3 0: Expression -> Factor
COMPLETED @2-3 2: Factor -> Number
COMPLETED @0-3 4: Factor -> Factor Multiply Factor
COMPLETED @0-3 5: Expression['] -> Expression
```

As an aside, we see that there is a completed rule for `Expression[']`

, which is Marpa's special start symbol. That indicates that if our input ended at earleme 3, it would be a valid sentence in the language of our grammar.

More relevant for our purposes, we see that we have completed rules for `Expression`

, and `Factor`

, as expected. We also see two Earley items that show that we are in the progress of building another `Factor`

, and that it is expecting a `Multiply`

symbol. This is not the rule we want, but it explains why the `trace_terminals`

output showed that the recognizer was expecting a `Multiply`

symbol.

What we want to know is, why is the recognizer **not** expecting an `Add`

symbol? Looking back at the grammar, we see that only one rule uses the `Add`

symbol: the rule "`Term ::= Term Add Term`

". The next step is to look at the Earley items for this rule. But there is a problem. We don't find any.

Next, we ask ourselves, what is the earliest place the "`Term ::= Term Add Term`

" rule should be appearing? The answer is that there should be a prediction of "`Term ::= Term Add Term`

" at earleme 0. So we look at the predictions at earleme 0.

```
PREDICTING @0 0: Expression -> Factor
PREDICTING @0 2: Factor -> Number
PREDICTING @0 4: Factor -> Factor Multiply Factor
PREDICTING @0 5: Expression['] -> Expression
```

No "`Term ::= Term Add Term`

" rule. We are not even predicting a "`Term ::= Term Add Term`

" rule. We look back at the grammar, and start from the beginning.

```
{ lhs => 'Expression', rhs => [qw/Factor/] },
{ lhs => 'Term', rhs => [qw/Factor/] },
{ lhs => 'Factor', rhs => [qw/Number/] },
{ lhs => 'Term',
rhs => [qw/Term Add Term/],
action => 'do_add'
},
{ lhs => 'Factor',
rhs => [qw/Factor Multiply Factor/],
action => 'do_multiply'
},
```

Our special start symbol is `Expression[']`

and we do see a prediction for the rule with `Expression[']`

on the left hand side. It in turn predicts an `Expression`

symbol, and we do see a prediction for the rule with `Expression`

on the left hand side. `Expression`

in turn predicts a `Factor`

symbol, and we do see a prediction for both of the rules with `Factor`

on the left hand side.

But we don't see anything predicting `Term`

. And then it hits us. If we start at the start symbol, and follow the rules of our grammar, we will never get to a `Term`

symbol. That's what the warning message was trying to tell us.

This makes us reread our grammar, and see that out `Expression ::= Factor`

rule is wrong. It should be `Expression ::= Term`

. We have found the source of our problem.

# COMPLICATIONS

Internally, Marpa rewrites Earley items and grammars. `show_progress`

hides most of this from the user. But some aspects of Marpa's rewrites are relevant and useful to know.

## Special Symbols

Marpa uses a few special symbols internally which it is useful for the the user of `show_progress`

to be aware of. To distinguish them, Marpa's internal symbols end in a right square bracket ("`]`

"). No user-defined symbol is allowed to end in a right square bracket.

One of these special symbols is Marpa's special start symbol, which always ends in "`[']`

". Marpa augments all of its grammars with a special start rule, which will have the special start symbol on its left hand side. We saw this above with the `Expression['] ::= Expression`

rule.

If the empty, or null, string is a sentence in the language of the grammar, Marpa will add a special empty start rule. The special empty start rule will have its own special null start symbol on its left hand side. The special null start symbol ends in "`['][]`

".

## Empty Rules

You may notice that any empty rules in your grammar never appear in the traces. That is because Marpa allows at most one empty rule in any grammar, and that rule is the one Marpa adds itself: the special empty start rule. Marpa removes all empty and nulling rules in the original grammar, after noting that the left hand side of an nulling rule must be treated as a nulling symbol.

Marpa's removal of nulling rules is recursive, at it needs to be. Removing rules that are nulling reveals that the left hand side symbol of those rules is also nulling. This in turn can reveal other nulling rules.

## Sequences

Marpa allows the user to explicitly specify sequences, rather than write them out in BNF. Marpa is able to optimize explicitly specified sequences. For the actual parsing, Marpa rewrites sequences into BNF.

In the Earley items, the rules will have been translated into BNF, so that is how they appear in `show_progress`

. Marpa's rewritten sequence rules take much the same form that a programmer's rewritten rules would, if she had to do the rewrite by hand.

Here's are the rules of a Marpa grammar, with a sequence:

```
my $grammar = Marpa::Grammar->new(
{ start => 'Document',
strip => 0,
lhs_terminals => 0,
rules => [
{ lhs => 'Document', rhs => [qw/Stuff/], min => 1 },
],
}
);
```

And here is how Marpa translates this sequence:

```
PREDICTING @0 0: Document -> Stuff[Seq:1-*]
PREDICTING @0 1: Stuff[Seq:1-*] -> Stuff
PREDICTING @0 2: Stuff[Seq:1-*] -> Stuff[Seq:1-*] Stuff
```

## Leo Items

In the original Earley's algorithm, left recursion was very efficient, but right recursion required a completion Earley item for every step of the recursion in every Earley set where the right recursion might end. That made right recursion take O(*n***2) time and space, where *n* was the length of the right recursion.

Joop Leo figured out a way to get around this. Leo noted that most of the completions were trivial, in the sense that their only purpose was to create the topmost completion in the chain. He figured out how to get directly to the topmost completion and let that topmost completion stand in for the long chain of trivial completions that produced it. This reduced the time and space required to O(*n*). Marpa uses Leo's methods.

The topmost completions are labeled "`(Leo)`

" as we'll see below. These topmost completions stand in for zero or more trivial completions in the same Earley set.

Here is a example. First, the right recursive grammar:

```
my $grammar = Marpa::Grammar->new(
{ start => 'S',
strip => 0,
lhs_terminals => 0,
rules => [
{ lhs => 'S', rhs => [qw/Top_sequence/] },
{ lhs => 'Top_sequence', rhs => [qw/Top Top_sequence/] },
{ lhs => 'Top_sequence', rhs => [qw/Top/] },
{ lhs => 'Top', rhs => [ qw/Upper_Middle/ ] },
{ lhs => 'Upper_Middle', rhs => [ qw/Lower_Middle/ ] },
{ lhs => 'Lower_Middle', rhs => [ qw/Bottom/ ] },
{ lhs => 'Bottom', rhs => [ qw/T/ ] },
],
}
);
```

And here is the progress report at earleme 20:

```
COMPLETED @0-20 0: S -> Top_sequence
COMPLETED @0-20 (Leo) 1: Top_sequence -> Top Top_sequence
COMPLETED @19-20 2: Top_sequence -> Top
COMPLETED @19-20 (Leo) 3: Top -> Upper_Middle
COMPLETED @19-20 (Leo) 4: Upper_Middle -> Lower_Middle
COMPLETED @19-20 6: Bottom -> T
COMPLETED @0-20 7: S['] -> S
```

### The First Leo Topmost Completion

In these reports we see three Leo topmost completions. Let's look at the first one:

` COMPLETED @0-20 (Leo) 1: Top_sequence -> Top Top_sequence`

This is the top of a chain of 19 Earley items, all for the same rule: "`Top_sequence ::= Top Top_sequence`

". This topmost one spans the earleme range from 0 to 20 ("`@0-20`

"). The other 18 in the chain are trivial completions, completions whose only purpose is to produce the topmost completion. The next one down in the chain (not shown in the `show_progress`

output) is identical, except for the origin, and so would look like this:

` COMPLETED @1-20 1: Top_sequence -> Top Top_sequence`

Similarly, the remaining 17 Earley items would differ from the topmost one only in their origin, each one starting one earleme later than the one above it. The bottom of the chain is shown in the `show_progress`

output and is a completion of a different rule:

` COMPLETED @19-20 (Leo) 3: Top -> Upper_Middle`

This example illustrates the improvement introduced by the Leo method. In this example, without the Leo methods, there would have been 20 items in the chain at earleme 20. At earleme 100 there would be 100 items, and so forth -- at earleme *n* there would be *n* items, all but two of them trivial.

By earleme 100, there would have been a total (in all Earley sets up to that point) of over 4800 trivial Earley items. By earleme 1000, there would have been over 498,000, which would have made parsing with this grammar impractical for many applications. With the Leo methods, the number of Earley items in each set stays below a small constant.

### The Second Leo Topmost Completion

The second Leo topmost completion was the bottom of the chain for the first topmost completion. It illustrates something to watch for:

` COMPLETED @19-20 (Leo) 3: Top -> Upper_Middle`

The chain of completions which the topmost item replaces can contain zero trivial completions. In other words, sometimes no completions are actually eliminated. That is the case here. This topmost completion stands in only for itself.

### The Third Leo Topmost Completion

Our third Leo topmost completion illustrates a final point to watch for:

` COMPLETED @19-20 (Leo) 4: Upper_Middle -> Lower_Middle`

Unlike the second Leo completion, this one does stand in for an trivial completion, one that was eliminated. If this trivial completion had appeared in the `show_progress`

output, it would have looked like this:

` COMPLETED @19-20 4: Lower_Middle -> Bottom`

For the first Leo topmost completion, all the trivial completions that were eliminated were for the same rule as the topmost completion. That is **not** always the case. Leo's methods catches and optimizes right recursion, even when it is indirect, and passes through a series of different rules. In this example, the trivial completion which was eliminated is not for the same rule as the rule in the Leo topmost completion which stands in for it.

### Futures

In future versions of Marpa, `show_progress`

may say more about with the missing trivial completions. It is not yet clear how much information, and what kind of information, to provide. Simply printing out all the eliminated, trivial completions does not seem to really be an option. When a right recursion is thousands of steps long, which it can easily be, there will often be millions of trivial completions. Nonetheless, more can be done in letting the user know about eliminated, trivial completions.

# APPENDIX: FULL TRACES AND CODE FOR THE BASE EXAMPLE

Below are the code, the trace outputs and the progress report for the base example used in this document.

## Code for the example

```
my $grammar = Marpa::Grammar->new(
{ start => 'Expression',
actions => 'My_Actions',
default_action => 'first_arg',
strip => 0,
rules => [
## This is a deliberate error in the grammar
## The next line should be:
## { lhs => 'Expression', rhs => [qw/Term/] },
## I have changed the Term to 'Factor' which
## will cause problems.
{ lhs => 'Expression', rhs => [qw/Factor/] },
{ lhs => 'Term', rhs => [qw/Factor/] },
{ lhs => 'Factor', rhs => [qw/Number/] },
{ lhs => 'Term',
rhs => [qw/Term Add Term/],
action => 'do_add'
},
{ lhs => 'Factor',
rhs => [qw/Factor Multiply Factor/],
action => 'do_multiply'
},
],
}
);
```

```
$grammar->precompute();
my @tokens = (
[ 'Number', 42 ],
[ 'Multiply', q{*} ],
[ 'Number', 1 ],
[ 'Add', q{+} ],
[ 'Number', 7 ],
);
sub My_Actions::do_add {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 + $t2;
}
sub My_Actions::do_multiply {
my ( undef, $t1, undef, $t2 ) = @_;
return $t1 * $t2;
}
sub My_Actions::first_arg { shift; return shift; }
my $recce = Marpa::Recognizer->new(
{ grammar => $grammar, trace_terminals => 2, mode => 'stream' } );
my $token_ix = 0;
my ( $current_earleme, $expected_tokens ) =
$recce->tokens( \@tokens, \$token_ix );
if ( $token_ix <= $#tokens ) {
$progress_report = $recce->show_progress( 0, $current_earleme );
}
```

## Trace Output

```
Inaccessible symbol: Add
Inaccessible symbol: Term
Setting trace_terminals option
Expecting "Number" at earleme 0
Expecting "Expression" at earleme 0
Expecting "Factor" at earleme 0
Accepted "Number" at 0-1
Expecting "Multiply" at 1
Accepted "Multiply" at 1-2
Expecting "Number" at 2
Expecting "Factor" at 2
Accepted "Number" at 2-3
Expecting "Multiply" at 3
Rejected "Add" at 3
```

## Progress Output

```
PREDICTING @0 0: Expression -> Factor
PREDICTING @0 2: Factor -> Number
PREDICTING @0 4: Factor -> Factor Multiply Factor
PREDICTING @0 5: Expression['] -> Expression
BUILDING @0-1 Factor -> Factor . Multiply Factor
COMPLETED @0-1 0: Expression -> Factor
COMPLETED @0-1 2: Factor -> Number
COMPLETED @0-1 5: Expression['] -> Expression
PREDICTING @2 2: Factor -> Number
PREDICTING @2 4: Factor -> Factor Multiply Factor
BUILDING @0-2 Factor -> Factor Multiply . Factor
BUILDING @0-3 Factor -> Factor . Multiply Factor
BUILDING @2-3 Factor -> Factor . Multiply Factor
COMPLETED @0-3 0: Expression -> Factor
COMPLETED @2-3 2: Factor -> Number
COMPLETED @0-3 4: Factor -> Factor Multiply Factor
COMPLETED @0-3 5: Expression['] -> Expression
```

# LICENSE AND COPYRIGHT

Copyright 2007-2010 Jeffrey Kegler, all rights reserved. Marpa is free software under the Perl license. For details see the LICENSE file in the Marpa distribution.