The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Parse::Eyapp::debuggingtut - Solving ambiguities and fixing lexical, syntactic and semantic errors

INTRODUCTION

The sources of error when programming with eyapp are many and various. Some of them are minor, as having a nonterminal without production rules or a terminal that is never produced by the lexical analyzer. These kind of errors can be caught with the help of the %strict directive.

In the following sections we will discuss three main kind of errors that correspond to three development stages:

  • Conflict errors:

    Conflicts with the grammar: the grammar is ambiguous or is not clear - perhaps due to the fact that eyapp uses only a lookahead symbol - which sort of tree must be built for some inputs

  • Tree building errors:

    There are no conflicts but the parser does not build the syntax tree as expected. May be it rejects correct sentences or accepts incorrect ones. Or may be it accepts correct ones but the syntax tree has not the shape we want (i.e. we have a precedence problem).

  • Semantic errors:

    We have solved the conflicts and trees are satisfactory but we have errors inside the semantic actions.

Each time you discover an error write a test that covers that error. Section "TREE EQUALITY" deals with the problem of checking if the generated abstract syntax tree has the correct shape and attributes.

As Andreas Zeller points out in his article "Beautiful Debugging" finding the causes of a failing program must follow the scientific method:

1. Observe the failure (there are conflicts or ambiguity, there are precedence problems, there are semantic errors, the output is wrong)
3. Based on your hypothesis make predictions
3. Using appropriate input tests and the available tools (eyapp -v option, yydebug, the Perl debugger, etc.) see if your predictions hold. Reject your hypothesis if they don't hold.
4. Repeat the last two steps until your hypothesis is confirmed. The hypothesis then becomes a theory.
5. Convert the knowledge and informal tests developed during this process in a formal test that covers the failure

THE %strict DIRECTIVE

By default, identifiers appearing in the rule section will be classified as terminal if they don't appear in the left hand side of any production rules.

The directive %strict forces the declaration of all tokens. The following eyapp program issues a warning:

  pl@nereida:~/LEyapp/examples/eyapplanguageref$ cat -n bugyapp2.eyp
       1  %strict
       2  %%
       3  expr: NUM;
       4  %%
  pl@nereida:~/LEyapp/examples/eyapplanguageref$ eyapp bugyapp2.eyp
  Warning! Non declared token NUM at line 3 of bugyapp2.eyp

To keep silent the compiler declare all tokens using one of the token declaration directives (%token, %left, etc.)

  pl@nereida:~/LEyapp/examples/eyapplanguageref$ cat -n bugyapp3.eyp
       1  %strict
       2  %token NUM
       3  %%
       4  expr: NUM;
       5  %%
  pl@nereida:~/LEyapp/examples/eyapplanguageref$ eyapp bugyapp3.eyp
  pl@nereida:~/LEyapp/examples/eyapplanguageref$ ls -ltr | tail -1
  -rw-r--r-- 1 pl users 2395 2008-10-02 09:41 bugyapp3.pm

It is a good practice to use %strict at the beginning of your grammar.

CONFLICTS AND AMBIGUITIES

Understanding Priorities

Token and production priorities are used to solve conflicts. Recall the main points of yacc-like parsers related to priorities:

  • The directives

                %left
                %right
                %nonassoc

    can be used in the head section to declare the priority of a token

  • The later the declaration line the higher the priority

  • The precedence of a production rule (right hand side) is the precedence of the last token in the right hand side

  • In a shift-reduce conflict the default action is to shift. This action can be changed if the production and the token have explicit priorities

  • If the precedence of the production rule is higher the shift-reduce conflict is solved in favor of the reduction

  • If the precedence of the token is higher the shift-reduce conflict is solved in favor of the shift

  • If the precedence of the token is the same than the precedence of the rule, and is left the shift-reduce conflict is solved in favor of the reduction

  • If the precedence of the token is the same than the precedence of the rule, and is right the shift-reduce conflict is solved in favor of the shift

  • If the precedence of the token is the same than the precedence of the rule, and is nonassoc the presence of a shift-reduce conflict means an error. This is used to describe operators, like the operator .LT. in FORTRAN, that may not associate with themselves. That is, because

                                 A .LT. B .LT. C

    is invalid in FORTRAN, .LT. would be described with the keyword %nonassoc in eyapp.

  • The default precedence of a production can be changed using the %prec TOKEN directive. Now the rule has the precedence and associativity of the specified TOKEN.

The program Precedencia.eyp illustrates the way priorities work in eyapp:

  pl@europa:~/LEyapp/examples/debuggingtut$ eyapp -c Precedencia.eyp
  %token NUM
  %left '@'
  %right '&' dummy
  %tree

  %%

  list:
      | list '\n'
      | list e
  ;
  e:
        %name NUM
        NUM
      | %name AMPERSAND
        e '&' e
      | %name AT
        e '@' e %prec dummy
  ;

  %%

See an execution:

  pl@europa:~/LEyapp/examples/debuggingtut$ ./Precedencia.pm
  Expressions. Press CTRL-D (Unix) or CTRL-Z (Windows) to finish:
  2@3@4
  2@3&4
  2&3@4
  2&3&4
  <CTRL-D>
  AT(AT(NUM(TERMINAL[2]),NUM(TERMINAL[3])),NUM(TERMINAL[4]))
  AT(NUM(TERMINAL[2]),AMPERSAND(NUM(TERMINAL[3]),NUM(TERMINAL[4])))
  AT(AMPERSAND(NUM(TERMINAL[2]),NUM(TERMINAL[3])),NUM(TERMINAL[4]))
  AMPERSAND(NUM(TERMINAL[2]),AMPERSAND(NUM(TERMINAL[3]),NUM(TERMINAL[4])))

See if you are able to understand the output:

  • 2@3@4: The phrase is interpreted as (2@3)@4 since the rule e '@' e has the precedence of the token dummy which is stronger that then priority of token @. The conflict is solved in favor of the reduction

  • 2@3&4: The rule e '@' e has the precedence of dummy which is the same than the token &. The associativity decides. Since they were declared %right the conflict is solved in favor of the shift. The phrase is interpreted as 2@(3&4)

  • 2&3@4: The rule e '&' e has more precedence than the token @. The phrase is interpreted as (2&3)@4

  • 2&3&4: Both the rule and the token have the same precedence. Since they were declared %right, the conflict is solved in favor of the shift. The phrase is interpreted as 2&(3&4)

An eyapp Program with Errors

The following simplified eyapp program has some errors. The generated language is made of lists of declarations (D stands for declaration) followed by lists of sentences (S stands for statement) separated by semicolons:

  pl@europa:~/LEyapp/examples/debuggingtut$ cat -n Debug.eyp
     1  # Debug.eyp
     2  # compile it eyapp -b Debug.eyp -o Debug.pl to produce an
     3  # executable
     4
     5  %token D S
     6
     7  %{
     8  our $VERSION = '0.01';
     9  %}
    10
    11  %%
    12  p:
    13      ds ';' ss
    14    | ss
    15  ;
    16
    17  ds:
    18      D ';' ds
    19    | D
    20      {
    21        print "Reducing by rule:\n";
    22        print "\tds -> D\n";
    23        $_[1];
    24      }
    25  ;
    26
    27  ss:
    28      S ';' ss
    29    | S
    30  ;
    31
    32  %%
    33
    34  my $tokenline = 1;
    35
    36  sub _Error {
    37    my $parser = shift;
    38
    39    my ($token) = $parser->YYCurval;
    40    my ($what) = $token ? "input: '$token'" : "end of input";
    41    die "Syntax error near $what line num $tokenline\n";
    42  }
    43
    44  my $input;
    45
    46  sub _Lexer {
    47
    48    for ($input) {
    49      s{^(\s)}{} and $tokenline += $1 =~ tr{\n}{};
    50      return ('',undef) unless $_;
    51      return ($1,$1) if s/^(.)//;
    52    }
    53    return ('',undef);
    54  }
    55
    56  sub Run {
    57
    58    $input = <>;
    59
    60    my $self = __PACKAGE__->new();
    61
    62    return $self->YYParse( yylex => \&_Lexer, yyerror => \&_Error,
    63                           yydebug => 0x1F
    64    );
    65  }
    66
    67  Run() unless caller;

Focusing in the Grammar

Sometimes the presence of actions, attribute names and support code makes more difficult the readability of the grammar. You can use the -c option of eyapp, to see only the syntactic parts:

  $ eyapp -c examples/debuggingtut/Debug.eyp
  %token D S

  %%

  p:
        ds ';' ss
      | ss
  ;
  ds:
        D ';' ds
      | D
  ;
  ss:
        S ';' ss
      | S
  ;

  $

It is clear now that the language generated by this grammar is made of non empty sequences of D followed by non empty sequences of <S> separated by semicolons.

Detecting Conflicts

When compiling this grammar, eyapp produces a warning message announcing the existence of a conflict:

  pl@nereida:~/LEyapp/examples$ eyapp Debug.eyp
  1 shift/reduce conflict (see .output file)
  State 4: shifts:
    to state    8 with ';'

Studying the .output file

The existence of warnings triggers the creation of a file Debug.output containing information about the grammar and the syntax analyzer.

Let us see the contents of the Debug.output file:

  pl@nereida:~/LEyapp/examples$ cat -n Debug.output
     1  Warnings:
     2  ---------
     3  1 shift/reduce conflict (see .output file)
     4  State 4: shifts:
     5    to state    8 with ';'
     6
     7  Conflicts:
     8  ----------
     9  State 4 contains 1 shift/reduce conflict
    10
    11  Rules:
    12  ------
    13  0:      $start -> p $end
    14  1:      p -> ds ';' ss
    15  2:      p -> ss
    16  3:      ds -> D ';' ds
    17  4:      ds -> D
    18  5:      ss -> S ';' ss
    19  6:      ss -> S
    20
    21  States:
    22  -------
    23  State 0:
    24
    25          $start -> . p $end      (Rule 0)
    26
    27          D       shift, and go to state 4
    28          S       shift, and go to state 1
    29
    30          p       go to state 2
    31          ss      go to state 3
    32          ds      go to state 5
    33
    ..  .........................................
    55  State 4:
    56
    57          ds -> D . ';' ds        (Rule 3)
    58          ds -> D .       (Rule 4)
    59
    60          ';'     shift, and go to state 8
    61
    62          ';'     [reduce using rule 4 (ds)]
    63
    ..  .........................................
    84  State 8:
    85
    86          ds -> D ';' . ds        (Rule 3)
    87
    88          D       shift, and go to state 4
    89
    90          ds      go to state 11
    91
    ..  .........................................
   112  State 12:
   113
   114          p -> ds ';' ss .        (Rule 1)
   115
   116          $default        reduce using rule 1 (p)
   117
   118
   119  Summary:
   120  --------
   121  Number of rules         : 7
   122  Number of terminals     : 4
   123  Number of non-terminals : 4
   124  Number of states        : 13

The parser generated by Parse::Eyapp is based on a deterministic finite automaton. Each state of the automaton remembers what production rules are candidates to apply and what have been seen from the right hand side of the production rule. The problem, according to the warning, occurs in state 4. State 4 contains:

    55  State 4:
    56
    57          ds -> D . ';' ds        (Rule 3)
    58          ds -> D .       (Rule 4)
    59
    60          ';'     shift, and go to state 8
    61
    62          ';'     [reduce using rule 4 (ds)]
    63

An state is a set of production rules with a marker (the dot in rules 3 and 4) somewhere in its right hand side. If the parser is in state 4 is because the production rules ds -> D ';' ds and ds -> D are potential candidates to build the syntax tree. That they will win or not depends on what will happen next when more input is processed.

The dot that appears on the right hand side means position in our guessing. The fact that ds -> D .';' ds is in state 4 means that if the parser is in state 4 we have already seen D and we expect to see a semicolon followed by ds (or something derivable from ds). If such thing happens this production will be the right one (will be the handle in the jargon). The comment

    60          ';'     shift, and go to state 8

means that if the next token is a semicolon the next state will be state 8:

    84  State 8:
    85
    86          ds -> D ';' . ds        (Rule 3)
    87
    88          D       shift, and go to state 4
    89
    90          ds      go to state 11
    

As we see state 8 has the item ds -> D ';' . ds which means that we have already seen a D and a semicolon.

The fact that ds -> D . is in state 4 means that we have already seen D and since the dot is at the end of the rule, this production can be the right one, even if a semicolon is just waiting in the input. An example that it will be correct to "reduce" by the rule ds -> D . in the presence of a semicolon is given by the input D ; S. A rightmost derivation for such input is:

  p => ds ; ss => ds ; S => D ; S

that is processed by the LALR(1) algorithm following this sequence of actions:

 +----------+---------+---------+
 | rule     | read    | input   |
 |          |         | D ; S $ |
 |          | D       |   ; S $ |
 | ds->d    | ds      |   ; S $ |
 |          | ds ;    |     S $ |
 |          | ds ; S  |       $ |
 | ss->s    | ds ; ss |       $ |
 | p->ds;ss | p       |         |
 +----------+---------+---------+

Since it is correct to reduce in some cases by the production ds -> D . and others in which is correct to shift the semicolon, eyapp complains about a shift/reduce conflict with ';'. State 4 has two rules that compete to be the right one:

  pl@nereida:~/LEyapp/examples$ eyapp Debug.eyp
  1 shift/reduce conflict (see .output file)

We can guess that the right item (the rules with the dot, i.e. the states of the automaton are called LALR(0) items in the yacc jargon) is ds -> D .';' ds and shift to state 8 consuming the semicolon, expecting to see something derivable from ds later or guess that ds -> D . is the right LR(0) item and reduce for such rule. This is the meaning of the comments in state 4:

    60          ';'     shift, and go to state 8
    61
    62          ';'     [reduce using rule 4 (ds)]

To illustrate the problem let us consider the phrases D;S and D;D;S.

For both phrases, after consuming the D the parser will go to state 4 and the current token will be the semicolon.

For the first phrase D;S the correct decision is to use rule 4 ds -> D (to reduce in the jargon). For the second phrase D;D;S the correct decision is to follow rule 3 ds -> D . ';' ds.

The parser generated by eyapp would be able to know which rule is correct for each case if it were allowed to look at the token after the semicolon: if it is a S is rule 4, if it is a D is rule 3. But the parsers generated by Eyapp do not lookahead more than the next token (this is what the "1" means when we say that Parse::Eyapp parsers are LALR(1)) and therefore is not in condition to decide which production rule applies.

Unfortunately this is the sort of conflict that can't be solved by assigning priorities to the productions and tokens as it was done for the calculator example in Parse::Eyapp::eyappintro. If we run the analyzer it will refuse to accept correct entries like D;D;S:

  pl@europa:~/LEyapp/examples/debuggingtut$ eyapp -b -o debug.pl Debug.eyp
  1 shift/reduce conflict (see .output file)
  State 4: shifts:
    to state    8 with ';'
  pl@europa:~/LEyapp/examples/debuggingtut$ ./debug.pl
  D;D;S
  ----------------------------------------
  In state 0:
  Stack:[0]
  Need token. Got >D<
  Shift and go to state 4.
  ----------------------------------------
  In state 4:
  Stack:[0,4]
  Need token. Got >;<
  Shift and go to state 8.
  ----------------------------------------
  In state 8:
  Stack:[0,4,8]
  Need token. Got >D<
  Shift and go to state 4.
  ----------------------------------------
  In state 4:
  Stack:[0,4,8,4]
  Need token. Got >;<
  Shift and go to state 8.
  ----------------------------------------
  In state 8:
  Stack:[0,4,8,4,8]
  Need token. Got >S<
  Syntax error near input: 'S' line num 1

The default parsing action is to shift the token ; giving priority to the production

           ds -> D . ';' ds

over the production

           ds -> D .

Since no ds production starts with S, the presence of S is (erroneously) interpreted as an error.

The Importance of the FOLLOW Set

You may wonder why the productions

  ss:
        S ';' ss
      | S
  ;

do not also produce a shift-reduce conflict with the semicolon. This is because the reduction by ss -> S always corresponds to the last S in a derivation:

   ss => S ; ss => S ; S ; ss => S ; S; S

and thus, the reduction by ss -> S only occurs in the presence of the end of input token and never with the semicolon. The FOLLOW set of a syntactic variable is the set of tokens that may appear next to such variable in some derivation. While the semicolon ; is in the FOLLOW of dd, it isn't in the FOLLOW of ss.

Solving Shift-Reduce Conflicts by Factorizing

To solve the former conflict the Eyapp programmer has to reformulate the grammar modifying priorities and reorganizing the rules. Rewriting the recursive rule for ds to be let recursive solves the conflict:

  pl@nereida:~/LEyapp/examples$ sed -ne '/^ds:/,/^;/p' Debug1.eyp | cat -n
   1  ds:
   2            ds ';' D
   3          | D
   4      {
   5        print "Reducing by rule:\n";
   6        print "\tds -> D\n";
   7        $_[1];
   8      }
   9  ;

Now, for any phrase matching the pattern D ; ... the action to build the tree is to reduce by ds -> D.

The rightmost reverse derivation for D;D;S is:

             Derivation                 |             Tree
  --------------------------------------+-----------------------------
  D;D;S <= ds;D;S <= ds;S <= ds;ss <= p |  p(ds(ds(D),';',D),';',ss(S))

while the rightmost reverse derivation for D;S is:

             Derivation                 |             Tree
  --------------------------------------+-----------------------------
  D;S <= ds;S <= ds;ss <= p             |      p(ds(D),';',ss(S))

When we recompile the modified grammar no warnings appear:

  pl@nereida:~/LEyapp/examples$ eyapp Debug1.eyp
  pl@nereida:~/LEyapp/examples$ 

Solving Shift-Reduce Conflicts By Looking Ahead

The problem here is that Eyapp/Yapp/Yacc etc. produce LALR(1) parsers. They only look the next token. We can decide how to solve the conflict by rewriting the lexical analyzer to peer forward what token comes after the semicolon: it now returns SEMICOLONS if it is an S and SEMICOLOND if it is an D. Here is a solution based in this idea:

  Eyapp/examples/debuggingtut$ cat -n DebugLookForward.eyp
     1  /*VIM: set ts=2 */
     2  %token D S
     3  %token SEMICOLONS SEMICOLOND
     4
     5  %{
     6  our $VERSION = '0.01';
     7  %}
     8
     9  %%
    10  p:
    11      ds SEMICOLONS ss
    12    | ss
    13  ;
    14
    15  ds:
    16      D SEMICOLOND ds
    17    | D
    18      {
    19        print "Reducing by rule:\n";
    20        print "\tds -> D\n";
    21        $_[1];
    22      }
    23  ;
    24
    25  ss:
    26      S SEMICOLONS ss
    27    | S
    28  ;
    29
    30  %%
    31
    32  my $tokenline = 1;
    33
    34  sub _Error {
    35    my $parser = shift;
    36    my ($token) = $parser->YYCurval;
    37    my ($what) = $token ? "input: '$token'" : "end of input";
    38    die "Syntax error near $what line num $tokenline\n";
    39  }
    40
    41  my $input;
    42
    43  sub _Lexer {
    44
    45    for ($input) {
    46       s{^(\s)}{} and $tokenline += $1 =~ tr{\n}{};
    47       return ('',undef) unless $_;
    48       return ($1,$1) if s/^([sSDd])//;
    49       return ('SEMICOLOND', 'SEMICOLOND') if s/^(;)\s*(D)/D/;
    50       return ('SEMICOLONS', 'SEMICOLONS') if s/^(;)\s*(S)/S/;
    51       die "Syntax error at line num $tokenline:".substr($_,0,10)."\n";
    52    }
    53    return ('',undef);
    54  }
    55
    56  sub Run {
    57    my ($self) = shift;
    58
    59    $input = shift;
    60
    61    return $self->YYParse( yylex => \&_Lexer, yyerror => \&_Error,
    62                           #yydebug => 0x1F
    63    );
    64  }

ERRORS DURING TREE CONSTRUCTION

Let us write the typical client program for the parser in Debug1:

  pl@nereida:~/LEyapp/examples$ cat -n usedebug1.pl
   1  #!/usr/bin/perl -w
   2  # usetreebypass.pl prueba2.exp
   3  use strict;
   4  use Debug1;
   5
   6  sub slurp_file {
   7    my $fn = shift;
   8    my $f;
   9
  10    local $/ = undef;
  11    if (defined($fn)) {
  12      open $f, $fn  or die "Can't find file $fn!\n";
  13    }
  14    else {
  15      $f = \*STDIN;
  16    }
  17    my $input = <$f>;
  18    return $input;
  19  }
  20
  21  my $input = slurp_file( shift() );
  22
  23  my $parser = Debug1->new();
  24
  25  $parser->Run($input);

When executing the program we observe an abnormal behavior:

  examples/debuggingtut$ usedebug1.pl
  D

  ;

  D

  ;
  S
  Reducing by rule:
          ds -> D
  Syntax error near end of input line num 2

We activate the option yydebug => 0xF in the call to a YYParser. The integer parameter yydebug of new and YYParse controls the level of debugging. Different levels of verbosity can be obtained by setting the bits of this argument. It works as follows:

     /============================================================\
     | Bit Value  | Outputs                                       |
     |------------+-----------------------------------------------|
     |  0x01      |  Token reading (useful for Lexer debugging)   |
     |------------+-----------------------------------------------|
     |  0x02      |  States information                           |
     |------------+-----------------------------------------------|
     |  0x04      |  Driver actions (shifts, reduces, accept...)  |
     |------------+-----------------------------------------------|
     |  0x08      |  Parse Stack dump                             |
     |------------+-----------------------------------------------|
     |  0x10      |  Error Recovery tracing                       |
     \============================================================/

Let us see what happens when the input is D;S. We have introduced some white spaces and carriage returns between the terminals:

  pl@nereida:~/LEyapp/examples$ usedebug1.pl
  D

  ;

  S

  ----------------------------------------
  In state 0:
  Stack:[0]
  Need token. Got >D<
  Shift and go to state 4.
  ----------------------------------------
  In state 4:
  Stack:[0,4]
  Don't need token.
  Reduce using rule 4 (ds --> D): Reducing by rule:
          ds -> D
  Back to state 0, then go to state 5.
  ----------------------------------------
  In state 5:
  Stack:[0,5]
  Need token. Got ><
  Syntax error near end of input line num 1

What's going on? After reading the carriage return

   Need token. Got >D<

the parser receives an end of file. ¿Why?. Something is going wrong in the communications between lexical analyzer and parser. Let us review the lexical analyzer:

  pl@nereida:~/LEyapp/examples$ sed -ne '/sub.*_Lexer/,/^}/p' Debug1.eyp | cat -n
   1  sub _Lexer {
   2
   3          for ($input) {
   4             s{^(\s)}{} and $tokenline += $1 =~ tr{\n}{};
   5             return ('',undef) unless $_;
   6             return ($1,$1) if s/^(.)//;
   7          }
   8          return ('',undef);
   9  }

The error is at line 4. Only a single white space is eaten! The second white in the input does not match lines 5 and 6 and the contextualizing for finishes. Line 8 the returns the ('',undef) signaling the end of input.

Let us write a new version Debug2.eyp that fixes the problem:

  pl@nereida:~/LEyapp/examples$ sed -ne '/sub.*_Lexer/,/^}/p' Debug2.eyp | cat -n
   1  sub _Lexer {
   2
   3          for ($input) {
   4             s{^(\s+)}{} and $tokenline += $1 =~ tr{\n}{};
   5             return ('',undef) unless $_;
   6             return ($1,$1) if s/^(.)//;
   7          }
   8          return ('',undef);
   9  }

Now the analysis seems to work:

  pl@nereida:~/LEyapp/examples$ usedebug2.pl
  D

  ;

  S

  ----------------------------------------
  In state 0:
  Stack:[0]
  Need token. Got >D<
  Shift and go to state 4.
  ----------------------------------------
  In state 4:
  Stack:[0,4]
  Don't need token.
  Reduce using rule 4 (ds --> D): Reducing by rule:
          ds -> D
  Back to state 0, then go to state 5.
  ----------------------------------------
  In state 5:
  Stack:[0,5]
  Need token. Got >;<
  Shift and go to state 8.
  ----------------------------------------
  In state 8:
  Stack:[0,5,8]
  Need token. Got >S<
  Shift and go to state 1.
  ----------------------------------------
  In state 1:
  Stack:[0,5,8,1]
  Need token. Got ><
  Reduce using rule 6 (ss --> S): Back to state 8, then go to state 10.
  ----------------------------------------
  In state 10:
  Stack:[0,5,8,10]
  Don't need token.
  Reduce using rule 1 (p --> ds ; ss): Back to state 0, then go to state 2.
  ----------------------------------------
  In state 2:
  Stack:[0,2]
  Shift and go to state 7.
  ----------------------------------------
  In state 7:
  Stack:[0,2,7]
  Don't need token.
  Accept.

THE LR PARSING ALGORITHM: UNDERSTANDING THE OUTPUT OF yydebug

The YYParse methods implements the generic LR parsing algorithm. It very much works Parse::Yapp::YYParse and as yacc/bison yyparse. It accepts almost the same arguments as Class->new (Being Class the name of the generated class).

The parser uses two tables and a stack. The two tables are called the action table and the goto table. The stack is used to keep track of the states visited.

At each step the generated parser consults the action table and takes one decision: To shift to a new state consuming one token (and pushing the current state in the stack) or to reduce by some production rule. In the last case the parser pops from its stack as many states as symbols are on the right hand side of the production rule. Here is a Perl/C like pseudocode summarizing the activity of YYParse:

     1   my $parser = shift; # The parser object
     2   push(@stack, $parser->{startstate});
     3   $b = $parser->YYLexer(); # Get the first token
     4   FOREVER: {
     5     $s = top(0);  # Get the state on top of the stack
     6     $a = $b;
     7     switch ($parser->action[$s->state][$a]) {
     8       case "shift t" : 
     9         my $t;
    10         $t->{state} = t;
    11         $t->{attr}  = $a->{attr};
    12         push($t); 
    13         $b = $parser->YYLexer(); # Call the lexical analyzer
    14         break;
    15       case "reduce A->alpha" : 
    16         # Call the semantic action with the attributes of the rhs as args
    17         my $semantic  = $parser->Semantic{A ->alpha}; # The semantic action
    18         my $r;
    19         $r->{attr} = $semantic->($parser, top(|alpha|-1)->attr, ... , top(0)->attr); 
    20  
    21         # Pop as many states as symbols on the rhs of A->alpha
    22         pop(|alpha|);  
    23  
    24         # Goto next state 
    25         $r->{state} = $parser->goto[top(0)][A]; 
    26         push($r); 
    27         break;
    28       case "accept" : return (1); 
    29       default : $parser->YYError("syntax error"); 
    30     }
    31     redo FOREVER;
    32   }

Here |alpha| stands for the length of alpha. Function top(k) returns the state in position k from the top of the stack, i.e. the state at depth k. Function pop(k) extracts k states from the stack. The call $state->attr returns the attribute associated with $state. The call $parser->Semantic{A ->alpha} returns the semantic action associated with production A ->alpha.

Let us see a trace for the small gramar in examples/debuggingtut/aSb.yp:

  pl@nereida:~/LEyapp/examples$ /usr/local/bin/paste.pl aSb.yp aSb.output | head -5
  %%                                             | Rules:
  S:                 { print "S -> epsilon\n" }  | ------
      |   'a' S 'b'  { print "S -> a S b\n" }    | 0:    $start -> S $end
  ;                                              | 1:    S -> /* empty */
  %%                                             | 2:    S -> 'a' S 'b'

The tables in file aSb.output describe the actions and transitions to take:

  pl@nereida:~/LEyapp/examples$ cat -n aSb.output
     .  .........................................
     7  States:
     8  -------
     9  State 0:
    10
    11          $start -> . S $end      (Rule 0)
    12
    13          'a'     shift, and go to state 2
    14
    15          $default        reduce using rule 1 (S)
    16
    17          S       go to state 1
    18
    19  State 1:
    20
    21          $start -> S . $end      (Rule 0)
    22
    23          $end    shift, and go to state 3
    24
    25  State 2:
    26
    27          S -> 'a' . S 'b'        (Rule 2)
    28
    29          'a'     shift, and go to state 2
    30
    31          $default        reduce using rule 1 (S)
    32
    33          S       go to state 4
    34
    35  State 3:
    36
    37          $start -> S $end .      (Rule 0)
    38
    39          $default        accept
    40
    41  State 4:
    42
    43          S -> 'a' S . 'b'        (Rule 2)
    44
    45          'b'     shift, and go to state 5
    46
    47  State 5:
    48
    49          S -> 'a' S 'b' .        (Rule 2)
    50
    51          $default        reduce using rule 2 (S)
    52
    53
    54  Summary:
    55  --------
    56  Number of rules         : 3
    57  Number of terminals     : 3
    58  Number of non-terminals : 2
    59  Number of states        : 6

When executed with yydebug set and input aabb we obtain the following output:

  pl@nereida:~/LEyapp/examples$ use_aSb.pl
  ----------------------------------------
  In state 0:
  Stack:[0]
  aabb                       <----------- user input
  Need token. Got >a<
  Shift and go to state 2.  
  ----------------------------------------
  In state 2:
  Stack:[0,2]
  Need token. Got >a<
  Shift and go to state 2.
  ----------------------------------------
  In state 2:
  Stack:[0,2,2]
  Need token. Got >b<
  Reduce using rule 1 (S --> /* empty */): S -> epsilon
  Back to state 2, then go to state 4.

The output S-> epsilon is consequence of the semantic action associated with such production rule.

  ----------------------------------------
  In state 4:
  Stack:[0,2,2,4]
  Shift and go to state 5.
  ----------------------------------------
  In state 5:
  Stack:[0,2,2,4,5]
  Don't need token.
  Reduce using rule 2 (S --> a S b): S -> a S b
  Back to state 2, then go to state 4.

As a result of reducing by rule 2 the semantic action is executed

                { print "S -> a S b\n" }

and the three last visited states are popped from the stack, and the stack becomes [0,2]. But that means that we are now in state 2 seeing a S. If you look at the table above being in state2 and seeing a S we go to state 4.

  ----------------------------------------
  In state 4:
  Stack:[0,2,4]
  Need token. Got >b<
  Shift and go to state 5.
  ----------------------------------------
  In state 5:
  Stack:[0,2,4,5]
  Don't need token.
  Reduce using rule 2 (S --> a S b): S -> a S b
  Back to state 0, then go to state 1.
  ----------------------------------------
  In state 1:
  Stack:[0,1]
  Need token. Got ><
  Shift and go to state 3.
  ----------------------------------------
  In state 3:
  Stack:[0,1,3]
  Don't need token.
  Accept.

ERRORS INSIDE SEMANTIC ACTIONS

A third type of error occurs when the code inside a semantic action doesn't behave as expected.

The semantic actions are translated in anonymous methods of the parser object. Since they are anonymous we can't use breakpoints as

  b subname # stop when arriving at sub ''name''

or

  c subname  # contine up to reach sub ''name''

Furthermore the file loaded by the client program is the generated .pm. The code in Debug.pm is alien to us - Was automatically generated by Parse::Eyapp - and it can be difficult to find where our inserted semantic actions are.

To watch the execution of a semantic action is simple: We use the debugger f file.eyp option to switch the viewing filename to our grammar file.

  pl@nereida:~/LEyapp/examples$ perl -wd usedebug2.pl

  Loading DB routines from perl5db.pl version 1.28
  Editor support available.

  Enter h or `h h' for help, or `man perldebug' for more help.

  main::(usedebug2.pl:21):        my $input = slurp_file( shift() );
    DB<1> f Debug2.eyp
  1       2       #line 3 "Debug2.eyp"
  3
  4:      our $VERSION = '0.01';
  5
  6       7       8       9       10

Now we can set a breakpoint at any line of our grammar file. Thus the 18 in the command b 18 refers to line 18 in Debug2.eyp. The command l shows the corresponding lines of the .eyp file

    DB<2> b 18
    DB<3> l
  11      12      13      14      15      16      #line 17 "Debug2.eyp"
  17
  18:b          print "Reducing by rule:\n";
  19:           print "\tds -> D\n";
  20:           $_[1];

We issue now the command c (continue). The execution continues up to line 18 of Debug2.eyp:

    DB<3> c
  D

  ;

  S

  Debug2::CODE(0x85129d8)(Debug2.eyp:18):
  18:           print "Reducing by rule:\n";
    DB<3> n
  Reducing by rule:

Now we can issue any debugger commands (like x, p, etc.) to investigate the internal state of our program and determine what are the reasons for any abnormal behavior.

  Debug2::CODE(0x85129d8)(Debug2.eyp:19):
  19:           print "\tds -> D\n";
    DB<3> x $_[0]{GRAMMAR}
  0  ARRAY(0x8538360)
     0  ARRAY(0x855aa88)
        0  '_SUPERSTART'
        1  '$start'
        2  ARRAY(0x855ab60)
           0  'p'
           1  '$end'
        3  0
     1  ARRAY(0x855a890)
        0  'p_1'
        1  'p'
        2  ARRAY(0x855a8fc)
           0  'ds'
           1  ';'
           2  'ss'
        3  0
     2  ARRAY(0x855a800)
        0  'p_2'
        1  'p'
        2  ARRAY(0x855a830)
           0  'ss'
        3  0
     3  ARRAY(0x855a764)
        0  'ds_3'
        1  'ds'
        2  ARRAY(0x855a7a0)
           0  'ds'
           1  ';'
           2  'D'
        3  0
     4  ARRAY(0x85421d4)
        0  'ds_4'
        1  'ds'
        2  ARRAY(0x855a6e0)
           0  'D'
        3  0
     5  ARRAY(0x8538474)
        0  'ss_5'
        1  'ss'
        2  ARRAY(0x854f9c8)
           0  'S'
           1  ';'
           2  'ss'
        3  0
     6  ARRAY(0x85383b4)
        0  'ss_6'
        1  'ss'
        2  ARRAY(0x85383f0)
           0  'S'
        3  0
    DB<4>                   

Using a second c the execution continues until reaching the end of the program:

    DB<3> c
  Debugged program terminated.  Use q to quit or R to restart,
    use o inhibit_exit to avoid stopping after program termination,
    h q, h R or h o to get additional info.
    DB<3>                                   

SOLVING DIFFICULT CONFLICTS AND AMBIGUITIES

Reduce-Reduce Conflicts

Most of the time reduce-reduce conflicts are due to some ambiguity in the grammar. In this example the programmer has attempted to define a language made of mixed lists IDs and NUMbers :

  pl@europa:~/LEyapp/examples/debuggingtut$ cat -n typicalrr.eyp
     1  %token ID NUM
     2
     3  %%
     4  s:
     5        /* empty */
     6      | s ws
     7      | s ns
     8  ;
     9  ws:
    10        /* empty */
    11      | ws ID
    12  ;
    13  ns:
    14        /* empty */
    15      | ns NUM
    16  ;
    17
    18  %%

The grammar has several reduce-reduce conflicts:

  pl@europa:~/LEyapp/examples/debuggingtut$ eyapp typicalrr.eyp
  3 shift/reduce conflicts (see .output file)
  State 2: reduce by rule 3: s -> s ns (default action)
  State 2: shifts:
    to state    5 with NUM  and 3 reduce/reduce conflicts

In fact we have infinite ways to rightmost derive the empty string.

The problem is easily solved designing an equivalent non ambiguous grammar:

  pl@europa:~/LEyapp/examples/debuggingtut$ cat -n correcttypicalrr.eyp
     1  %token ID NUM
     2
     3  %%
     4  s:
     5        /* empty */
     6      | s ID
     7      | s NUM
     8  ;
     9
    10  %%

Of course you can also try to disambiguate using priorities. In the file examples/debuggingtut/typicalrrwithprec.eyp there is a solution:

  lusasoft@LusaSoft:~/src/perl/Parse-Eyapp/examples/debuggingtut$ eyapp -c typicalrrwithprec.eyp
  # This example illustrates how to express EOI in the header section: use ''
  # For the original grammar
  # see file
  #   typicalrr.eyp
  # For an alternative solution see file
  #   correcttypicalrr.eyp
  %right LNUM
  %right NUM
  %right ID
  %right '' # The string '' refers to the 'End of Input' token
  %tree bypass

  %%

  s:
        %name EMPTY
        /* empty */%prec ''
      | %name WORDS
        s ws
      | %name NUMS
        s ns
  ;
  ns:
        %name EMPTYNUM
        /* empty */%prec NUM
      | %name NUMS
        NUM ns
  ;
  ws:
        %name EMPTYID
        /* empty */%prec LNUM
      | %name IDS
        ID ws
  ;

  %%

Observe the use of %right '' in the header section: it gives a priority to the end-of-input token.

Reduce-Reduce Conflicts with Unambiguous Grammars

Though not so common, it may occur that a reduce-reduce conflict is not due to ambiguity but to the limitations of the LALR(1) algorithm. The following example illustrates the point:

  pl@europa:~/LEyapp/examples/debuggingtut$ cat -n rrconflictnamefirst.eyp
     1  %token VAR ',' ':'
     2
     3  %{
     4  use base q{Tail};
     5  %}
     6
     7  %%
     8  def:    param_spec return_spec ','
     9          ;
    10  param_spec:
    11               type
    12          |    name_list ':' type
    13          ;
    14  return_spec:
    15               type
    16          |    name ':' type
    17          ;
    18  name:        VAR
    19          ;
    20  type:        VAR
    21          ;
    22  name_list:
    23               name
    24          |    name ',' name_list
    25          ;
    26  %%
    27
    28  __PACKAGE__->main unless caller();

This non ambiguous grammar generates a language of sequences like

                 a, b : e f : e,

The conflict is due to the final comma in:

      def:    param_spec return_spec ','

If you suppress such comma there is no conflict (try it). When compiling with eyapp we get the warning:

  pl@europa:~/LEyapp/examples/debuggingtut$ eyapp rrconflictnamefirst.eyp
  1 reduce/reduce conflict

Editing the .output file we can see the conflict is in state 2:

   46 State 2:
   47
   48         name -> VAR .   (Rule 6)
   49         type -> VAR .   (Rule 7)
   50
   51         ','     [reduce using rule 7 (type)]
   52         VAR     reduce using rule 7 (type)
   53         $default        reduce using rule 6 (name)

If we look at the grammar we can see that a reduction by

                   type -> VAR .

may occur with a comma as incoming token but only after the reduction by param_spec has taken place. The problem is that the automaton forgets about it. Look the automaton transitions in the .outputfile. By making explicit the difference between the first and second type we solve the conflict:

  pl@europa:~/LEyapp/examples/debuggingtut$ cat -n rrconflictnamefirst_fix1.eyp
     1  %token VAR ',' ':'
     2
     3  %{
     4  use base q{Tail};
     5  %}
     6
     7  %%
     8  def:    param_spec return_spec ','
     9          ;
    10  param_spec:
    11               type
    12          |    name_list ':' type
    13          ;
    14  return_spec:
    15               typeafter
    16          |    name ':' typeafter
    17          ;
    18  name:        VAR
    19          ;
    20  type:        VAR
    21          ;
    22  typeafter:        VAR
    23          ;
    24  name_list:
    25               name
    26          |    name ',' name_list
    27          ;
    28  %%
    29
    30  __PACKAGE__->main unless caller();

A reduce-reduce conflict is solved in favor of the first production found in the text. If we execute the grammar with the conflict ./rrconflictnamefirst.pm, we get the correct behavior:

  pl@europa:~/LEyapp/examples/debuggingtut$ eyapp -b '' rrconflictnamefirst.eyp
  1 reduce/reduce conflict
  pl@europa:~/LEyapp/examples/debuggingtut$ ./rrconflictnamefirst.pm
  Expressions. Press CTRL-D (Unix) or CTRL-Z (Windows) to finish:
  a,b:c d:e,
  <CTRL-D>
  $

The program accepts the correct language - in spite of the conflict - due to the fact that the production

                      name:        VAR

is listed first.

The parser rejects the correct phrases if we swap the order of the productions writing the type: VAR production first,

  pl@europa:~/LEyapp/examples/debuggingtut$ ./reducereduceconflict.pm
  Expressions. Press CTRL-D (Unix) or CTRL-Z (Windows) to finish:
  a,b:c d:e,
  <CTRL-D>

  Syntax error near input: ',' (lin num 1).
  Incoming text:
  ===
  b:c d
  ===
  Expected one of these terminals: VAR

Files reducereduceconflict_fix1.eyp and reducereduceconflict_fix2.eyp offer other solutions to the problem.

TOKENS DEPENDING ON THE SYNTACTIC CONTEXT

Usually there is a one-to-one relation between a token and a regexp. Problems arise, however when a token's type depends upon contextual information. An example of this problem comes from PL/I, where statements like this are legal:

         if then=if then if=then

In PL/I this problem arises because keywords like if are not reserved and can be used in other contexts. This simplified grammar illustrates the problem:

  examples/debuggingtut$ eyapp -c PL_I_conflict.eyp
  # This grammar deals with the famous ambiguous PL/I phrase:
  #                if then=if then if=then
  # The (partial) solution uses YYExpect in the lexical analyzer to predict the token
  # that fulfills the parser expectatives.
  # Compile it with:
  # eyapp -b '' PL_I_conflict.eyp
  # Run it with;
  # ./PL_I_conflict.pm -debug
  %strict
  %token ID
  %tree bypass

  %%

  stmt:
        ifstmt
      | assignstmt
  ;
  # Exercise: change this production
  #     for 'if' expr 'then' stmt
  # and check with input 'if then=if then if=then'. The problem arises again
  ifstmt:
        %name IF
        'if' expr 'then' expr
  ;
  assignstmt:
        id '=' expr
  ;
  expr:
        %name EQ
        id '=' id
      | id
  ;
  id:
        %name ID
        ID
  ;

  %%

If the token ambiguity depends only in the syntactic context, the problem can be alleviated using the YYExpect method. In case of doubt, the lexical analyzer calls the YYExpect method to know which of the several feasible tokens is expected by the parser:

  examples/debuggingtut$ sed -ne '/sub lex/,/^}/p' PL_I_conflict.eyp
  sub lexer {
    my $parser = shift;

    for ($parser->{input}) {    # contextualize
      m{\G\s*(\#.*)?}gc;

      m{\G([a-zA-Z_]\w*)}gc and do {
        my $id = $1;

        return ('if',   'if')   if ($id eq 'if')   && is_in('if', $parser->YYExpect);
        return ('then', 'then') if ($id eq 'then') && is_in('then', $parser->YYExpect);

        return ('ID', $id);
      };

      m{\G(.)}gc         and return ($1, $1);

      return('',undef);
    }
  }

Here follows an example of execution:

  examples/debuggingtut$ eyapp -b '' PL_I_conflict.eyp
  examples/debuggingtut$ ./PL_I_conflict.pm
  Expressions. Press CTRL-D (Unix) or CTRL-Z (Windows) to finish:
  if then=if then if=then
  IF(EQ(ID,ID),EQ(ID,ID))

LEXICAL TIE-INS

A lexical tie-in is a flag which is set to alter the behavior of the lexical analyzer. It is a way to handle context-dependency.

The Parsing of C

The C language has a context dependency: the way an identifier is used depends on what its current meaning is. For example, consider this:

  T(x);

This looks like a function call statement, but if T is a typedef name, then this is actually a declaration of x. How can a parser for C decide how to parse this input?

Here is another example:

  {
    T * x;
    ...
  }

What is this, a declaration of x as a pointer to T, or a void multiplication of the variables T and x?

The usual method to solve this problem is to have two different token types, ID and TYPENAME. When the lexical analyzer finds an identifier, it looks up in the symbol table the current declaration of the identifier in order to decide which token type to return: TYPENAME if the identifier is declared as a typedef, ID otherwise. See the ANSI C parser example in the directory examples/languages/C/ansic.eyp

A Simple Example

In the "Calc"-like example in examples/debuggintut/SemanticInfoInTokens.eyp we have a language with a special construct hex (hex-expr). After the keyword hex comes an expression in parentheses in which all integers are hexadecimal. In particular, strings in /[A-F0-9]+/ like A1B must be treated as an hex integer unless they were previously declared as variables:

  %strict

  %token ID INT INTEGER
  %syntactic token HEX

  %right '='
  %left '+'

  %{
  my %st;
  %}

  %tree bypass alias

  %%
  stmt:
      decl <* ';'> expr <%name EXPS + ';'>
        {
          $_[2]->{st} = { %st };
          $_[2];
        }
  ;

  decl:
      INT ID <+ ','>
        {
          $st{$_->{attr}} = 1 for $_[2]->children();
        }
  ;

  expr:
      %name ID
      ID
    | %name NUM
      INTEGER
    | %name HEX
      HEX '(' { $_[0]->{HEXFLAG} = 1; } $expr ')'
        {
          $_[0]->{HEXFLAG} = 0;
          $expr;
        }
    | %name ASSIGN
      $ID '=' expr
        {
          my $parser = shift;
          my $t = $parser->YYBuildAST(@_);
          # Retype left (TERMINAL) child as ID
          $t->ID->type('ID');
          $t;
        }
    | %name PLUS
      expr '+' expr
  ;

Here the lexical analyzer looks at the value of the attribute HEXFLAG; when it is nonzero, all integers are parsed in hexadecimal, and tokens starting with letters are parsed as integers if possible.

  sub lexer {
    my $parser = shift;
    my $hexflag = $parser->{HEXFLAG};

    for ($parser->{input}) {    # contextualize
      m{\G\s*(\#.*)?}gc;

      m{\G(HEX\b|INT\b)}igc and return (uc($1), $1);

      m{(\G\d+)}gc and return ('INTEGER', $hexflag? hex($1) : $1);


      m{\G([a-zA-Z_]\w*)}gc and do {
        my $match = $1;
        $hexflag and !exists($st{$match}) 
          and $match =~ m{^([A-F0-9]+)$}gc 
            and return ('INTEGER', hex($match));
        return ('ID', $1);
      };

      m{\G(.)}gc         and return ($1, $1);

      return('',undef);
    }
  }

Let us see an example of execution:

  examples/debuggingtut$ cat inputforsemanticinfo2.txt
  int A2
  A2 = HEX(A23);
  A2 = HEX(A2)

  examples/debuggingtut$ ./SemanticInfoInTokens.pm -file inputforsemanticinfo2.txt
  EXPS(ASSIGN(ID[A2],NUM[2595]),ASSIGN(ID[A2],ID[A2]))

The first hex expression HEX(A23) is interpreted as the number 2595 while the second HEX(A2) refers to previously declared variable A2.

An alternative solution to this problem that does not make use of lexical tie-ins - but still uses an attribute HEXFLAG for communication between different semantic actions - can be found in the file examples/debuggintut/DynamicvsTieIns.eyp.

References about Lexical tie-ins

For more about lexical tie-ins see also

SOLVING CONFLICTS WITH THE POSTPONED CONFLICT STRATEGY

Yacc-like parser generators provide ways to solve shift-reduce mechanims based on token precedence. No mechanisms are provided for the resolution of reduce-reduce conflicts. The solution for such kind of conflicts is to modify the grammar. The strategy I present here provides a way to broach conflicts that can't be solved using static precedences.

Introduction: The C++ Ambiguity

The C++ syntax does not disambiguate between expression statements and declaration statements. The ambiguity arises when an expression statement has a function-style cast as its left-most subexpression. (Since C does not support function-style casts, this ambiguity does not occur in C programs.)

For example,

     int (x) = y+z;

parses as either an expr or a stmt.

If the statement can be interpreted both as a declaration and as an expression, the statement is interpreted as a declaration statement.

The following expressions disambiguate into expression statements because the declarator is followed by an operator different from the assignment operator.

    type_spec(i)++;             // expression statement
    type_spec(i,3)<<d;          // expression statement
    type_spec(i)->l=24;         // expression statement

Where type_spec stands for a type specifier.

In the following examples, the interpretation as declaration works, and consequently the statements are interpreted as declarations:

    type_spec(*i)(int);         // declaration
    type_spec(j)[5];            // declaration
    type_spec(m) = { 1, 2 };    // declaration
    type_spec(a);               // declaration
    type_spec(*b)();            // declaration
    type_spec(c)=23;            // declaration
    type_spec(d),e,f,g=0;       // declaration
    type_spec(h)(e,3);          // declaration

An Ambiguous Grammar

The simple grammar in examples/debuggingtut/SimplifiedCplusplusAmbiguity.eyp illustrates the problem of parsing C++:

  examples/debuggingtut$ eyapp -c SimplifiedCplusplusAmbiguity.eyp
  %strict
  %token ID INT NUM
  %right '='
  %left '+'

  %%

  prog:
        /* empty */
      | prog stmt
  ;
  stmt:
        expr ';'
      | decl
  ;
  expr:
        ID
      | NUM
      | INT '(' expr ')' /* typecast */
      | expr '+' expr
      | expr '=' expr
  ;
  decl:
        INT declarator ';'
      | INT declarator '=' expr ';'
  ;
  declarator:
        ID
      | '(' declarator ')'
  ;

  %%

The grammar is ambiguous since an input like:

               int (x) = 4;

can be interpreted as a decl or an expr.

The eyapp compiler warn us of the presence of reduce/reduce conflict:

  examples/debuggingtut$ eyapp -v SimplifiedCplusplusAmbiguity.eyp
  1 reduce/reduce conflict

when we look at the .output file we see that the reduce-reduce conflict is at state 18:

  examples/debuggingtut$ head -12 SimplifiedCplusplusAmbiguity.output
  Warnings:
  ---------
  1 reduce/reduce conflict

  Conflicts:
  ----------
  Conflict in state 15 between rule 8 and token '+' resolved as reduce.
  Conflict in state 15 between rule 8 and token '=' resolved as reduce.
  Conflict in state 17 between rule 9 and token '+' resolved as shift.
  Conflict in state 17 between rule 9 and token '=' resolved as shift.
  State 18 contains 1 reduce/reduce conflict

When we look at the description of the involved state, we see the reasons for the conflict:

  examples/debuggingtut$ sed -ne '/^State 18:/,/^State/p' SimplifiedCplusplusAmbiguity.output
  State 18:

          expr -> ID .    (Rule 5)
          declarator -> ID .      (Rule 12)

          ')'     [reduce using rule 12 (declarator)]
          $default        reduce using rule 5 (expr)

  State 19:

The conflict means that once the parser has seen the ID and is in the presence of the closing parenthesis ')', it is incapable to decide whether to reduce by rule 12 or rule 5.

As we said, the C++ disambiguation rule is: take it as a declaration if it looks as a declaration, otherwise is an expression. But we see that interpretation as decl will succeed if declarator is followed - after the sequence of closing parenthesis - by one of the two tokens a ; or =. That can be traced during parsing time. Parse::Eyapp provides the mechanisms to change the parsing actions at parsing time.

The Postponed Conflict Resolution Strategy

The postponed conflict strategy presented here can be used whenever there is a shift-reduce or reduce-reduce conflict that can not be solved using static precedences but that can be solved using information obtained at parsing time.

Postponed Conflict Resolution: Reduce-Reduce Conflicts

Let us assume we have a reduce-reduce conflict between to productions

                      A -> alpha .
                      B -> beta .

for some token @. Let also assume that production

                      A -> alpha

has name ruleA and production

                      B -> beta 

has name ruleB.

The postponed conflict resolution strategy consists in modifying the conflictive grammar by adding at the points of conflict, in this case at then end of the involved productions a new syntactic variable IsAorB:

                      A -> alpha IsAorB .
                      B -> beta  IsAorB .

The reduce-reduce conflict is now postponed after IsAorB is seen. The new syntactic variable IsAorB will be called the conflict name. The new introduced syntactic variable has only one empty production:

                     IsAorB -> /* empty */

The programmer associates with that production a semantic action whose mission is to solve the conflict by dynamically changing the parsing table like this:

                     IsAorB -> /* empty */ 
                       {
                          my $self = shift;

                          if (looks_like_A($self)) {
                            $self->YYSetReduce('@', 'ruleA' );
                          }
                          else {
                            $self->YYSetReduce('@', 'ruleB' );
                          }
                       }

The semantic action associated with the conflict name receives the name of conflict handler.

An Example of Dynamic Conflict Resolution

The Eyapp program examples/debuggingtut/Cplusplus.eyp solves the conflicts in examples/debuggingtut/SimplifiedCplusplusAmbiguity.eyp by using the Postponed Conflict Resolution strategy:

  examples/debuggingtut$ head -67 Cplusplus.eyp | cat -n
     1  # See http://www.gnu.org/software/bison/manual/html_mono/bison.html#GLR-Parsers
     2  %strict
     3  %token ID INT NUM
     4
     5  %right '='
     6  %left '+'
     7
     8  %tree bypass
     9
    10  %expect 0 1 # 0 shift-reduce conflicts, 1 reduce-reduce conflict
    11
    12  %%
    13  prog:
    14      %name EMPTY
    15      /* empty */
    16    | %name PROG
    17      prog stmt
    18  ;
    19
    20  stmt:
    21      %name EXP
    22      expr ';'
    23    | %name DECL
    24      decl
    25  ;
    26
    27  expr:
    28      %name EXPID
    29      ID decexpconflict
    30    | %name NUM
    31      NUM
    32    | %name TYPECAST
    33      INT '(' expr ')' /* typecast */
    34    | %name PLUS
    35      expr '+' expr
    36    | %name ASSIGN
    37      expr '=' expr
    38  ;
    39
    40  decl:
    41      %name DECLARATOR
    42      INT declarator ';'
    43    | %name DECLARATORINIT
    44      INT declarator '=' expr ';'
    45  ;
    46
    47  declarator:
    48      %name DECID
    49      ID decexpconflict
    50    | '(' declarator ')'
    51  ;
    52
    53  decexpconflict:
    54      /* empty. Just to solve the reduce-reduce conflict */
    55        {
    56          my $self = shift;
    57
    58          if ($self->{INPUT} =~ m{^[)\s]*[;=]\s*}) {
    59            $self->YYSetReduce(')', 'DECID' );
    60          }
    61          else {
    62            $self->YYSetReduce(')', 'EXPID' );
    63          }
    64        }
    65  ;
    66
    67  %%

The aforementioned conflict in examples/debuggingtut/SimplifiedCplusplusAmbiguity.eyp:

  examples/debuggingtut$ sed -ne '/^State 18:/,/^State/p' SimplifiedCplusplusAmbiguity.output
  State 18:

          expr -> ID .    (Rule 5)
          declarator -> ID .      (Rule 12)

          ')'     [reduce using rule 12 (declarator)]
          $default        reduce using rule 5 (expr)

  State 19:

is now delayed just after seen decexpconflict:

  examples/debuggingtut$ eyapp -vb '' Cplusplus.eyp
  1 reduce/reduce conflict
  examples/debuggingtut$ sed -ne '/^State 27:/,/^State/p' Cplusplus.output
  State 27:

          expr -> ID decexpconflict .     (Rule 5)
          declarator -> ID decexpconflict .       (Rule 12)

          ')'     [reduce using rule 12 (declarator)]
          $default        reduce using rule 5 (expr)

  State 28:

The call

         $self->YYSetReduce(')', 'DECID' );

inside the semantic action associated with decexpconflict sets the parsing action to reduce by the production with name DECID, i.e.

            declarator -> ID 

when the incoming input is followed by a semicolon or an equal. Otherwise the action taken is to reduce by

            expr -> ID

The program now successfully solves the ambiguity:

  examples/debuggingtut$ Cplusplus.pm
  int(x)+2;
  PROG(EMPTY,EXP(TYPECAST(TERMINAL[int],EXPID[x]),NUM[2]))
  debuggingtut$ Cplusplus.pm
  int(x)=2;
  PROG(EMPTY,DECL(TERMINAL[int],DECID[x],NUM[2]))
  examples/debuggingtut$ Cplusplus.pm
  int(x);
  PROG(EMPTY,DECL(TERMINAL[int],DECID[x]))

Using YYSetLRAction and YYNextState

The modulino glrexpressions.eyp illustrates an alternative way to apply the postponed conflict resolution strategy to the aforementioned problem:

  examples/debuggingtut$ head -69 glrexpressions.eyp | cat -n
     1  # See http://www.gnu.org/software/bison/manual/html_mono/bison.html#GLR-Parsers
     2  %strict
     3  %token ID INT NUM
     4
     5  %right '='
     6  %left '+'
     7
     8  %{
     9  my $input;
    10  %}
    11
    12  %tree bypass
    13  %%
    14  prog:
    15      %name EMPTY
    16      /* empty */
    17    | %name PROG
    18      prog stmt
    19  ;
    20
    21  stmt:
    22      %name EXP
    23      expr ';'
    24    | %name DECL
    25      decl
    26  ;
    27
    28  expr:
    29      %name EXPID
    30      ID decexpconflict
    31    | %name NUM
    32      NUM
    33    | %name TYPECAST
    34      INT '(' expr ')' /* typecast */
    35    | %name PLUS
    36      expr '+' expr
    37    | %name ASSIGN
    38      expr '=' expr
    39  ;
    40
    41  decl:
    42      %name DECLARATOR
    43      INT declarator ';'
    44    | %name DECLARATORINIT
    45      INT declarator '=' expr ';'
    46  ;
    47
    48  declarator:
    49      %name DECID
    50      ID decexpconflict
    51    | '(' declarator ')'
    52  ;
    53
    54  decexpconflict:
    55      /* empty. Just for hacking the LALR tables */
    56        {
    57          my $self = shift;
    58
    59          my $conflictstate = $self->YYNextState();
    60          if ($input =~ m{^[)\s]*[;=]\s*}) {
    61            $self->YYSetLRAction($conflictstate, ')', 'DECID' );
    62          }
    63          else {
    64            $self->YYSetLRAction($conflictstate, ')', 'EXPID' );
    65          }
    66        }
    67  ;
    68
    69  %%

Line 59 uses the method YYNextState to compute the state after the reduction for the production rule

      decexpconflict -> /* empty */

which is precisely the conflict state. If the incoming input is a sequence of parenthesis followed by either a semicolon or an equal we call to the method YYSetLRAction to set a reduction by the rule

     declarator -> ID

for that state and token ')', otherwise we indicate a reduction by the rule:

     expr -> ID    

Postponed Conflict Resolution: Shift-Reduce Conflicts

The program in examples/debuggingtut/DynamicallyChangingTheParser2.eyp illustrates how the postponed conflict strategy is used for shift-reduce conflicts. This is an extension of the grammar in examples/debuggingtut/Debug.eyp. The generated language is constituted by sequences like:

    { D; D; S; S; S; } {D; S} { S }

As you remember the conflict was:

  examples/debuggingtut$ sed -ne '/^State 13:/,/^State/p' DynamicallyChangingTheParser2.output
  State 13:

          ds -> D conflict . ';' ds       (Rule 6)
          ds -> D conflict .      (Rule 7)

          ';'     shift, and go to state 16

          ';'     [reduce using rule 7 (ds)]

  State 14:

The conflict handler (lines 50-56 below) sets the LR action to reduce by the production with name LAST_D

                 ds -> D

in the presence of token ';' if indeed is the last 'D'. The semantic action associated with the production LAST_D (lines 34-37) restores the former shift action and proceeds to call Parse::Eyapp::Driver::YYBuildAST in order to continue with the building of the abstract syntax tree:

  examples/debuggingtut$ cat -n DynamicallyChangingTheParser2.eyp
     1  # See section 'Hacking the Parsing Tables: ACTION and GOTOs' in
     2  # http://search.cpan.org/perldoc?Parse::Eyapp::debuggingtut
     3  #
     4  # See also: Debug.eyp Debug1.eyp Debug2.eyp  LookForward.eyp
     5  # DynamicallyChangingTheParser.eyp This example illustrates how to dynamically
     6  # change the behavior of the parser
     7
     8  %token D S
     9
    10  %{
    11  our $VERSION = '0.01';
    12  %}
    13
    14  %tree bypass
    15
    16  %%
    17  p: %name PROG
    18      block +
    19  ;
    20
    21  block:
    22      %name BLOCK
    23      '{' ds ';' ss '}'
    24    | %name SS
    25      '{' ss '}'
    26  ;
    27
    28  ds:
    29      %name MORE_Ds
    30      D conflict ';' ds
    31    | %name LAST_D
    32      D conflict
    33        {
    34           # Recover former LALR action
    35           $_[0]->YYRestoreLRAction('conflict', ';');
    36           # Do whatever you want
    37           goto &Parse::Eyapp::Driver::YYBuildAST;
    38        }
    39  ;
    40
    41  ss:
    42      %name SS
    43      S ';' ss
    44    | %name S
    45      S
    46  ;
    47
    48  conflict:
    49      /* empty. Just for dynamic precedence */
    50        {
    51          my $self = shift;
    52
    53          $self->YYSetReduce(';', 'LAST_D' ) if ($self->{INPUT} =~ m{^;\s*S});
    54
    55          undef; # skip this node in the AST
    56        }
    57  ;
    58
    59  %%
    60
    61  sub _Error {
    62    my $parser = shift;
    63
    64    my ($token) = $parser->YYCurval;
    65    my ($what) = $token ? "input: '$token'" : "end of input";
    66    warn "Syntax error near $what\n";
    67  }
    68
    69  sub _Lexer {
    70    my $self = shift;
    71
    72    for ($self->{INPUT}) {
    73      s{^(\s*)}{};
    74
    75      return ('',undef) unless $_;
    76
    77      return ($1,$1) if s/^(.)//;
    78    }
    79    return ('',undef);
    80  }
    81
    82  sub Run {
    83    my $debug = shift || 0;
    84    $debug = 0x1F if $debug;
    85
    86    my $self = __PACKAGE__->new();
    87    $self->{INPUT} = <STDIN>;
    88
    89    print
    90      $self->YYParse(
    91        yylex => \&_Lexer,
    92        yyerror => \&_Error,
    93        yydebug => $debug,
    94      )->str."\n";
    95  }
    96
    97  Run(@ARGV) unless caller;

TREE EQUALITY

The more the time invested writing tests the less the time spent debugging. This section deals with the Parse::Eyapp::Node method equal which can be used to test that the trees have the shape we expect.

$node->equal

A call $tree1->equal($tree2) compare the two trees $tree1 and $tree2. Two trees are considered equal if their root nodes belong to the same class, they have the same number of children and the children are (recursively) equal.

In Addition to the two trees the programmer can specify pairs attribute_key => equality_handler:

  $tree1->equal($tree2, attr1 => \&handler1, attr2 => \&handler2, ...)

In such case the definition of equality is more restrictive: Two trees are considered equal if

  • Their root nodes belong to the same class,

  • They have the same number of children

  • For each of the specified attributes occur that for both nodes the existence and definition of the key is the same

  • Assuming the key exists and is defined for both nodes, the equality handlers return true for each of its attributes and

  • The respective children are (recursively) equal.

An attribute handler receives as arguments the values of the attributes of the two nodes being compared and must return true if, and only if, these two attributes are considered equal. Follows an example:

  examples/Node$ cat -n equal.pl
     1  #!/usr/bin/perl -w
     2  use strict;
     3  use Parse::Eyapp::Node;
     4
     5  my $string1 = shift || 'ASSIGN(VAR(TERMINAL))';
     6  my $string2 = shift || 'ASSIGN(VAR(TERMINAL))';
     7  my $t1 = Parse::Eyapp::Node->new($string1, sub { my $i = 0; $_->{n} = $i++ for @_ });
     8  my $t2 = Parse::Eyapp::Node->new($string2);
     9
    10  # Without attributes
    11  if ($t1->equal($t2)) {
    12    print "\nNot considering attributes: Equal\n";
    13  }
    14  else {
    15    print "\nNot considering attributes: Not Equal\n";
    16  }
    17
    18  # Equality with attributes
    19  if ($t1->equal($t2, n => sub { return $_[0] == $_[1] })) {
    20    print "\nConsidering attributes: Equal\n";
    21  }
    22  else {
    23    print "\nConsidering attributes: Not Equal\n";
    24  }

When the former program is run without arguments produces the following output:

  examples/Node$ equal.pl

  Not considering attributes: Equal

  Considering attributes: Not Equal

Using equal During Testing

During the development of your compiler you add new stages to the existing ones. The consequence is that the AST is decorated with new attributes. Unfortunately, this implies that tests you wrote using is_deeply and comparisons against formerly correct abstract syntax trees are no longer valid. This is due to the fact that is_deeply requires both tree structures to be equivalent in every detail and that our new code produces a tree with new attributes.

Instead of is_deeply use the equal method to check for partial equivalence between abstract syntax trees. You can follow these steps:

  • Dump the tree for the source inserting Data::Dumper statements

  • Carefully check that the tree is really correct

  • Decide which attributes will be used for comparison

  • Write the code for the expected value editing the output produced by Data::Dumper

  • Write the handlers for the attributes you decided. Write the comparison using equal.

Tests using this methodology will not fail even if later code decorating the AST with new attributes is introduced.

See an example that checks an abstract syntax tree produced by the simple compiler (see examples/typechecking/Simple-Types-XXX.tar.gz) for a really simple source:

  Simple-Types/script$ cat prueba27.c
  int f() {
  }

The first thing is to obtain a description of the tree, that can be done executing the compiler under the control of the Perl debugger, stopping just after the tree has been built and dumping the tree with Data::Dumper:

  pl@nereida:~/Lbook/code/Simple-Types/script$ perl -wd usetypes.pl prueba27.c
  main::(usetypes.pl:5):  my $filename = shift || die "Usage:\n$0 file.c\n";
    DB<1> c 12
  main::(usetypes.pl:12): Simple::Types::show_trees($t, $debug);
    DB<2> use Data::Dumper
    DB<3> $Data::Dumper::Purity = 1
    DB<4> p Dumper($t)
  $VAR1 = bless( {
                   ..............................................
                 }, 'PROGRAM' );
  ...............................................................

Once we have the shape of a correct tree we can write our tests:

  examples/Node$ cat -n testequal.pl
     1  #!/usr/bin/perl -w
     2  use strict;
     3  use Parse::Eyapp::Node;
     4  use Data::Dumper;
     5  use Data::Compare;
     6
     7  my $debugging = 0;
     8
     9  my $handler = sub {
    10    print Dumper($_[0], $_[1]) if $debugging;
    11    Compare($_[0], $_[1])
    12  };
    13
    14  my $t1 = bless( {
    15                   'types' => {
    16                                'CHAR' => bless( { 'children' => [] }, 'CHAR' ),
    17                                'VOID' => bless( { 'children' => [] }, 'VOID' ),
    18                                'INT' => bless( { 'children' => [] }, 'INT' ),
    19                                'F(X_0(),INT)' => bless( {
    20                                   'children' => [
    21                                      bless( { 'children' => [] }, 'X_0' ),
    22                                      bless( { 'children' => [] }, 'INT' ) ]
    23                                 }, 'F' )
    24                              },
    25                   'symboltable' => { 'f' => { 'type' => 'F(X_0(),INT)', 'line' => 1 } },
    26                   'lines' => 2,
    27                   'children' => [
    28                                   bless( {
    29                                            'symboltable' => {},
    30                                            'fatherblock' => {},
    31                                            'children' => [],
    32                                            'depth' => 1,
    33                                            'parameters' => [],
    34                                            'function_name' => [ 'f', 1 ],
    35                                            'symboltableLabel' => {},
    36                                            'line' => 1
    37                                          }, 'FUNCTION' )
    38                                 ],
    39                   'depth' => 0,
    40                   'line' => 1
    41                 }, 'PROGRAM' );
    42  $t1->{'children'}[0]{'fatherblock'} = $t1;
    43
    44  # Tree similar to $t1 but without some attributes (line, depth, etc.)
    45  my $t2 = bless( {
    46                   'types' => {
    47                                'CHAR' => bless( { 'children' => [] }, 'CHAR' ),
    48                                'VOID' => bless( { 'children' => [] }, 'VOID' ),
    49                                'INT' => bless( { 'children' => [] }, 'INT' ),
    50                                'F(X_0(),INT)' => bless( {
    51                                   'children' => [
    52                                      bless( { 'children' => [] }, 'X_0' ),
    53                                      bless( { 'children' => [] }, 'INT' ) ]
    54                                 }, 'F' )
    55                              },
    56                   'symboltable' => { 'f' => { 'type' => 'F(X_0(),INT)', 'line' => 1 } },
    57                   'children' => [
    58                                   bless( {
    59                                            'symboltable' => {},
    60                                            'fatherblock' => {},
    61                                            'children' => [],
    62                                            'parameters' => [],
    63                                            'function_name' => [ 'f', 1 ],
    64                                          }, 'FUNCTION' )
    65                                 ],
    66                 }, 'PROGRAM' );
    67  $t2->{'children'}[0]{'fatherblock'} = $t2;
    68
    69  # Tree similar to $t1 but without some attributes (line, depth, etc.)
    70  # and without the symboltable and types attributes used in the comparison
    71  my $t3 = bless( {
    72                   'types' => {
    73                                'CHAR' => bless( { 'children' => [] }, 'CHAR' ),
    74                                'VOID' => bless( { 'children' => [] }, 'VOID' ),
    75                                'INT' => bless( { 'children' => [] }, 'INT' ),
    76                                'F(X_0(),INT)' => bless( {
    77                                   'children' => [
    78                                      bless( { 'children' => [] }, 'X_0' ),
    79                                      bless( { 'children' => [] }, 'INT' ) ]
    80                                 }, 'F' )
    81                              },
    82                   'children' => [
    83                                   bless( {
    84                                            'symboltable' => {},
    85                                            'fatherblock' => {},
    86                                            'children' => [],
    87                                            'parameters' => [],
    88                                            'function_name' => [ 'f', 1 ],
    89                                          }, 'FUNCTION' )
    90                                 ],
    91                 }, 'PROGRAM' );
    92
    93  $t3->{'children'}[0]{'fatherblock'} = $t2;
    94
    95  # Without attributes
    96  if (Parse::Eyapp::Node::equal($t1, $t2)) {
    97    print "\nNot considering attributes: Equal\n";
    98  }
    99  else {
   100    print "\nNot considering attributes: Not Equal\n";
   101  }
   102
   103  # Equality with attributes
   104  if (Parse::Eyapp::Node::equal(
   105        $t1, $t2,
   106        symboltable => $handler,
   107        types => $handler,
   108      )
   109     ) {
   110        print "\nConsidering attributes: Equal\n";
   111  }
   112  else {
   113    print "\nConsidering attributes: Not Equal\n";
   114  }
   115
   116  # Equality with attributes
   117  if (Parse::Eyapp::Node::equal(
   118        $t1, $t3,
   119        symboltable => $handler,
   120        types => $handler,
   121      )
   122     ) {
   123        print "\nConsidering attributes: Equal\n";
   124  }
   125  else {
   126    print "\nConsidering attributes: Not Equal\n";
   127  }

The code defining tree $t1 was obtained from an output using Data::Dumper. The code for trees $t2 and $t3 was written using cut-and-paste from $t1. They have the same shape than $t1 but differ in their attributes. Tree $t2 shares with $t1 the attributes symboltable and types used in the comparison and so equal returns true when compared. Since $t3 differs from $t1 in the attributes symboltable and types the call to equal returns false.

FORMATTING Parse::Eyapp PROGRAMS

I use these rules for indenting Parse::Eyapp programs:

  • Use uppercase identifiers for tokens, lowecase identifiers for syntactic variables

  • The syntactic variable that defines the rule must be at in a single line at the leftmost position:

      synvar:
           'a' othervar 'c'
        |  'b' anothervar SOMETOKEN
      ;

    The separation bar | goes indented relative to the left side of the rule. Each production starts two spaces from the bar. The first right hand side is aligned with the rest.

  • The semicolon ; must also be in its own line at column 0

  • If there is an empty production it must be the first one and must be commented

      syntacvar:
           /* empty */
        |  'a' othervar 'c'
        |  'b' anothervar 
      ;
  • Only very short semantic actions can go in the same line than the production. Semantic actions requiring more than one line must go in its own indented block like in:

      exp:
          $NUM            { $NUM->[0] }
        | $VAR
           {
             my $id = $VAR->[0];
             my $val = $s{$id};
             $_[0]->semantic_error("Accessing undefined variable $id at line $VAR->[1].\n")
             unless defined($val);
             return $val;
           }
        | $VAR '=' $exp   { $s{$VAR->[0]} = $exp }
        | exp.x '+' exp.y { $x + $y }
        | exp.x '-' exp.y { $x - $y }
        | exp.x '*' exp.y { $x * $y }
        | exp.x '/'.barr exp.y
          {
             return($x/$y) if $y;
             $_[0]->semantic_error("Illegal division by zero at line $barr->[1].\n");
             undef
          }
        | '-' $exp %prec NEG  { -$exp }
        | exp.x '^' exp.y     { $x ** $y }
        | '(' $exp ')'        { $exp }
      ;
    
     

SEE ALSO

REFERENCES

CONTRIBUTORS

AUTHOR

Casiano Rodriguez-Leon (casiano@ull.es)

ACKNOWLEDGMENTS

This work has been supported by CEE (FEDER) and the Spanish Ministry of Educacion y Ciencia through Plan Nacional I+D+I number TIN2005-08818-C04-04 (ULL::OPLINK project http://www.oplink.ull.es/). Support from Gobierno de Canarias was through GC02210601 (Grupos Consolidados). The University of La Laguna has also supported my work in many ways and for many years.

A large percentage of code is verbatim taken from Parse::Yapp 1.05. The author of Parse::Yapp is Francois Desarmenien.

I wish to thank Francois Desarmenien for his Parse::Yapp module, to my students at La Laguna and to the Perl Community. Thanks to the people who have contributed to improve the module (see "CONTRIBUTORS" in Parse::Eyapp). Thanks to Larry Wall for giving us Perl. Special thanks to Juana.

LICENCE AND COPYRIGHT

Copyright (c) 2006-2008 Casiano Rodriguez-Leon (casiano@ull.es). All rights reserved.

Parse::Yapp copyright is of Francois Desarmenien, all rights reserved. 1998-2001

These modules are free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

1 POD Error

The following errors were encountered while parsing the POD:

Around line 840:

Non-ASCII character seen before =encoding in '¿Why?.'. Assuming CP1252