The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Parse::Eyapp::eyappintro - An introduction to Parse::Eyapp

SYNOPSIS

  use Parse::Eyapp;
  use Parse::Eyapp::Treeregexp;

  sub TERMINAL::info {
    $_[0]{attr}
  }

  my $grammar = q{
    %right  '='     # Lowest precedence
    %left   '-' '+' # + and - have more precedence than = Disambiguate a-b-c as (a-b)-c
    %left   '*' '/' # * and / have more precedence than + Disambiguate a/b/c as (a/b)/c
    %left   NEG     # Disambiguate -a-b as (-a)-b and not as -(a-b)
    %tree           # Let us build an abstract syntax tree ...

    %%
    line:
        exp <%name EXPRESSION_LIST + ';'>
          { $_[1] } /* list of expressions separated by ';' */
    ;

    /* The %name directive defines the name of the class to 
       which the node being built belongs */
    exp:
        %name NUM
        NUM
      | %name VAR
        VAR
      | %name ASSIGN
        VAR '=' exp
      | %name PLUS
        exp '+' exp
      | %name MINUS
        exp '-' exp
      | %name TIMES
        exp '*' exp
      | %name DIV
        exp '/' exp
      | %name UMINUS
        '-' exp %prec NEG
      | '(' exp ')'
          { $_[2] }  /* Let us simplify a bit the tree */
    ;

    %%
    sub _Error { die "Syntax error near ".($_[0]->YYCurval?$_[0]->YYCurval:"end of file")."\n" }

    sub _Lexer {
      my($parser)=shift; # The parser object

      for ($parser->YYData->{INPUT}) { # Topicalize
        m{\G\s+}gc;
        $_ eq '' and return('',undef);
        m{\G([0-9]+(?:\.[0-9]+)?)}gc and return('NUM',$1);
        m{\G([A-Za-z][A-Za-z0-9_]*)}gc and return('VAR',$1);
        m{\G(.)}gcs and return($1,$1);
      }
      return('',undef);
    }

    sub Run {
        my($self)=shift;
        $self->YYParse( yylex => \&_Lexer, yyerror => \&_Error, );
    }
  }; # end grammar

  our (@all, $uminus);

  Parse::Eyapp->new_grammar( # Create the parser package/class
    input=>$grammar,
    classname=>'Calc', # The name of the package containing the parser
    firstline=>7       # String $grammar starts at line 7 (for error diagnostics)
  );
  my $parser = Calc->new();                # Create a parser
  $parser->YYData->{INPUT} = "2*-3+b*0;--2\n"; # Set the input
  my $t = $parser->Run;                    # Parse it!
  local $Parse::Eyapp::Node::INDENT=2;
  print "Syntax Tree:",$t->str;

  # Let us transform the tree. Define the tree-regular expressions ..
  my $p = Parse::Eyapp::Treeregexp->new( STRING => q{
      { #  Example of support code
        my %Op = (PLUS=>'+', MINUS => '-', TIMES=>'*', DIV => '/');
      }
      constantfold: /TIMES|PLUS|DIV|MINUS/:bin(NUM($x), NUM($y))
        => {
          my $op = $Op{ref($bin)};
          $x->{attr} = eval  "$x->{attr} $op $y->{attr}";
          $_[0] = $NUM[0];
        }
      uminus: UMINUS(NUM($x)) => { $x->{attr} = -$x->{attr}; $_[0] = $NUM }
      zero_times_whatever: TIMES(NUM($x), .) and { $x->{attr} == 0 } => { $_[0] = $NUM }
      whatever_times_zero: TIMES(., NUM($x)) and { $x->{attr} == 0 } => { $_[0] = $NUM }
    },
    OUTPUTFILE=> 'main.pm'
  );
  $p->generate(); # Create the tranformations

  $t->s($uminus); # Transform UMINUS nodes
  $t->s(@all);    # constant folding and mult. by zero

  local $Parse::Eyapp::Node::INDENT=0;
  print "\nSyntax Tree after transformations:\n",$t->str,"\n";

INTRODUCTION TO PARSING WITH Parse::Eyapp

Parsing is the activity of producing a syntax tree from an input stream. The program example in the synopsis section shows an example of parsing. The variable $grammar contains a context free eyapp grammar defining the language of lists of arithmetic expressions. A context free grammar is a mathematical device to define languages. The grammar for the example in the synopsis section is:

   line: exp <+ ';'>  
   ;

   exp:
       NUM            
     | VAR         
     | VAR '=' exp
     | exp '+' exp    
     | exp '-' exp 
     | exp '*' exp
     | exp '/' exp 
     | '-'  exp 
     |   '(' exp ')' 
   ;

A grammar generates a language. A grammar is defined by a set of production rules. A production rule has two components: a left hand side which is a syntactic variable or non terminal and a right hand side which is a phrase made of syntactic variables and terminals. The left hand side (lhs) and the right hand side (rhs) are usually separated by an arrow like in:

                                    exp -> VAR = exp

A note: the production rule

                        line: exp <+ ';'>

is not really a production rule but an abbreviation for two productions. It stands for:

                        line : exp
                             | line ';' exp
                        ;

A terminal or token never appears on the left hand side of a production rule. The phrases of the language are those obtained successively applying the production rules of the grammar until no more rules can be applied. The successive substitutions must start from the start symbol of the grammar (line in the example). Such legal sequence of substitutions is known as a derivation. The following is an example of a legal derivation (the big arrow => is read derives):

  line => exp => VAR = exp => VAR = exp + exp => VAR = exp + NUM => VAR = VAR + NUM

thus the phrase VAR = VAR + NUM belongs to the language generated by the former grammar. A derivation like can be seen as a tree. For instance, the former derivation is equivalent (has the same information) than the tree:

                        line(exp(VAR, '=', exp(exp(VAR), '+',  exp(NUM))))

Such a tree is called a syntax tree for the input VAR = VAR + NUM. A grammar is said to be ambiguous if there are phrases in the generated language that have more than one syntax tree. The grammar in the synopsis example is ambiguous. Here is an alternative tree for the same phrase VAR = VAR + NUM:

                        line(exp(exp(VAR, '=', exp(VAR)), '+', exp(NUM)))

Parsers created by eyapp do not deal directly with the input. Instead they expect the input to be processed by a lexical analyzer. The lexical analyzer parses the input and produces the next token. A token is a pair. The first component is the name of the token (like NUM or VAR) and the second is its attribute (i.e. the information associated with the token, like that the value is 4 for a NUM or the identifier is temperature for a VAR). Tokens are usually defined using regular expressions. Thus the token NUM is characterized by /[0-9]+(?:\.[0-9]+)?/ and the token VAR by /[A-Za-z][A-Za-z0-9_]*/. The subroutine _Lexer in the tail section is a typical example of a typical lexical analyzer:

    sub _Lexer {
      my($parser)=shift; # The parser object

      for ($parser->YYData->{INPUT}) { # Topicalize
        m{\G\s+}gc;     # skip whites
        $_ eq '' and return('',undef);
        m{\G([0-9]+(?:\.[0-9]+)?)}gc and return('NUM',$1);
        m{\G([A-Za-z][A-Za-z0-9_]*)}gc and return('VAR',$1);
        m{\G(.)}gcs and return($1,$1);
      }
      return('',undef);
    }

The input was saved in the YYData->{INPUT} section of the $parser object. The for loop is a false for: its goal is to make $_ an alias of $parser->YYData->{INPUT}. To catch the next pattern we use the anchor \G. The \G anchor matches at the point where the previous /g match left off. Normally, when a scalar m{}g match fails, the match position is reset and \G will start matching at the beginning of the string. The c option causes the match position to be retained following an unsuccessful match. The couple ('',undef) signals the end of the input.

Parse::Eyapp can analyze your grammar and produce a parser from your grammar. Actually Parse::Eyapp is a translation scheme analyzer. A translation scheme scheme is a context free grammar where the right hand sides of the productions have been augmented with semantic actions (i.e. with chunks of Perl code):

                                A -> alpha { action(@_) } beta

The analyzer generated by Eyapp executes { action(@_) } after all the semantic actions associated with alpha have been executed and before the execution of any of the semantic actions associated with beta.

Notice that ambiguous grammars produce ambiguous translation schemes: since a phrase may have two syntactic trees it will be more than one tree-traversing and consequently more than one way to execute the embedded semantic actions. Certainly different execution orders will usually produce different results. Thus, syntactic ambiguities translate onto semantic ambiguities. That is why it is important to resolve all the ambiguities and conflicts that may arise in our grammar. This is the function of the %left and %right declarations on the header section:

  my $grammar = q{
    # header section
    %right  '='     # Lowest precedence
            %left   '-' '+' # + and - have more precedence than = Disambiguate a-b-c as (a-b)-c
            %left   '*' '/' # * and / have more precedence than + Disambiguate a/b/c as (a/b)/c
            %left   NEG     # Disambiguate -a-b as (-a)-b and not as -(a-b)
            %tree           # Let us build an abstract syntax tree ...

            %%
            .... # body section
            %%
            .... # tail section
  };

Priority can be assigned to tokens by using the %left and %right declarations. Tokens in lines below have more precedence than tokens in line above. By giving token '+' more precedence than token '=' we solve the ambiguity for phrases like VAR = VAR + NUM. The tree

line(exp(VAR, '=', exp(exp(VAR), '+', exp(NUM))))

will be built, discarding the other tree. Since priority means earlier evaluation and the evaluation of semantic actions is bottom up, the deeper the associated subtree the higher the priority.

In a translation scheme the embedded actions modify the attributes associated with the symbols of the grammar.

A -> alpha { action(@_) } beta

Each symbol on the right hand side of a production rule has an associated scalar attribute. In eyapp the attributes of the symbol to the left of action are passed as arguments to action (in the example, those of alpha). These arguments are preceded by a reference to the syntax analyzer object. There is no way inside an ordinary eyapp program for an intermediate action to access the attributes of the symbols on its right, i.e. those associated with the symbols of beta. This restriction is lifted if you use the %metatree directive to build a full translation scheme. See Parse::Eyapp::translationschemestut to know more about full translation schemes.

Actions on the right hand side counts as symbols and so they can be referenced by its positional argument in later actions in the same production rule. For intermediate actions, the value returned by the action is the attribute associated with such action. For an action at the end of the rule:

                                A -> alpha { lastaction(@_) } 

the returned value constitutes the attribute of the left hand side of the rule (the attribute of A in this case). The action at the end of the right hand side is called the action associated with the production rule. When no explicit action has been associated with a production rule the default action applies. In Parse::Eyapp the programmer can define what is the default action.

A very special action is "build the node associated with this production rule" which is performed by the YYBuildAST method of the parser object:

                %default action { goto &Parse::Eyapp::Driver::YYBuildAST }

The %tree directive used in the "SYNOPSIS" example is an abbreviation for this and has the effect of building an abstract syntax tree for the input.

The call to

       Parse::Eyapp->new_grammar( # Create the parser package/class
          input=>$grammar,
          classname=>"Calc", # The name of the package containing the parser
        );

compiles $grammar and produces a new class Calc containing a LALR parser for such grammar. The call

                          $parser = Calc->new()

creates a parser object for the language generated by $grammar. Using the YYParse of the parser object:

        $self->YYParse( yylex => \&_Lexer, yyerror => \&_Error, )

YYParse is called with arguments a reference to the lexical analyzer and a reference to the error diagnostic subroutine _Error. Such subroutine will be called by YYParse when an error occurs. Is therefore convenient to give a meaningful diagnostic:

    sub _Error { 
      die "Syntax error near "
      .($_[0]->YYCurval?$_[0]->YYCurval:"end of file")."\n" 
    }

The parser method YYCurval returns the value of the current token. A more accurate error diagnostic subroutine can be obtained if the lexical analyzer is modified so that tokens keep the line number where they start (i.e. the token is a pair (TOKENNAME, [ ATTRIBUTE, LINENUMBER]). In such case the _Error subroutine can be rewritten as:

  sub _Error {
    my($token)=$_[0]->YYCurval;
    my($what)= $token ? "input: '$token->[0]' in line $token->[1]" : "end of input";
    my @expected = $_[0]->YYExpect();
    my $expected = @expected? "Expected one of these tokens: '@expected'":"";

    croak "Syntax error near $what. $expected\n";
  }

The YYExpect method returns the set of tokens that were expected when the error occurred.

The input in

                $parser->YYData->{INPUT} 

is then analyzed by YYParse and an abstract syntax tree is built. The tree rooted on a Parse::Eyapp::Node can be displayed using the method str:

                    local $Parse::Eyapp::Node::INDENT=2;
                    print "Syntax Tree:",$t->str;

The following is the description of the syntax tree produced by the call $t->str for the list of expressions "2*-3+b*0;--2\n";:

  pl@nereida:~/LEyapp/examples$ synopsis.pl
  Syntax Tree:
  EXPRESSION_LIST(
    PLUS(
      TIMES(
        NUM(
          TERMINAL[2]
        ),
        UMINUS(
          NUM(
            TERMINAL[3]
          )
        ) # UMINUS
      ) # TIMES,
      TIMES(
        VAR(
          TERMINAL[b]
        ),
        NUM(
          TERMINAL[0]
        )
      ) # TIMES
    ) # PLUS,
    UMINUS(
      UMINUS(
        NUM(
          TERMINAL[2]
        )
      ) # UMINUS
    ) # UMINUS
  ) # EXPRESSION_LIST

Did you notice that the TERMINAL nodes appear decorated with its attribute? This is because each time the Parse::Eyapp::Node method str visits a node checks if the node has a method info (i.e. $node->can(info)). If so, the info method is called and the string returned is concatenated in the description string. This is the reason for these three lines at the beginning of the "SYNOPSIS" example:

                    sub TERMINAL::info {
                      $_[0]{attr}
                    }

Parse::Eyapp not only gives support to parsing but to later phases of the translation process: tree transformations and scope analysis (scope analysis is the task to find which definition applies to an use of an object in the source). The program in the synopsis section shows an example of a tree transformation specification. Tree transformations are specified using a language called Tree regular expressions. The transformation object is created by the constructor Parse::Eyapp::Treeregexp->new.

  my $p = Parse::Eyapp::Treeregexp->new( STRING => q{
      { #  Example of support code
        my %Op = (PLUS=>'+', MINUS => '-', TIMES=>'*', DIV => '/');
      }
      constantfold: /TIMES|PLUS|DIV|MINUS/:bin(NUM($x), NUM($y))
        => {
          my $op = $Op{ref($bin)};
          $x->{attr} = eval  "$x->{attr} $op $y->{attr}";
          $_[0] = $NUM[0];
        }
      uminus: UMINUS(NUM($x)) => { $x->{attr} = -$x->{attr}; $_[0] = $NUM }
      zero_times_whatever: TIMES(NUM($x), .) and { $x->{attr} == 0 } => { $_[0] = $NUM }
      whatever_times_zero: TIMES(., NUM($x)) and { $x->{attr} == 0 } => { $_[0] = $NUM }
    },
  );

The set of transformations specified in the example are

  • The transformation constantfold produces constant folding i.e. trees of expressions like 3*2+4 are reduced to the tree for 10

          { #  Example of support code
            my %Op = (PLUS=>'+', MINUS => '-', TIMES=>'*', DIV => '/');
          }
          constantfold: /TIMES|PLUS|DIV|MINUS/:bin(NUM($x), NUM($y))
            => {
              my $op = $Op{ref($bin)};
              $x->{attr} = eval  "$x->{attr} $op $y->{attr}";
              $_[0] = $NUM[0];
            }

    Here constantfold is the name of the transformation. The treeregexp compiler will produce an object $constantfold implementing the transformation. After the name comes the tree pattern:

              /TIMES|PLUS|DIV|MINUS/:bin(NUM($x), NUM($y))

    It matches any subtree rooted in any node belonging to one of these classes: TIMES or PLUS or DIV or MINUS that has two children belonging to the NUM class. The Perl code after the big arrow is executed on any matching subtree. We can refer to the root of the subtree using the variable $bin. We can also refer to the child of the first NUM node using $x. In the same way $y refers to the child of the second NUM node. Since there are two NUM nodes in the pattern, we refer to them inside the transformation part using the array @NUM:

                             $_[0] = $NUM[0];

    The action uses and eval and the hash %Op to compute the corresponding reduction of the two nodes. The hash %Op was defined in a previous section containing support code. You can insert in any place of a treeregexp program such support code by surrounding it with curly brackets. The subtree that matched (that is in $_[0]) is substituted by its left child:

                             $_[0] = $NUM[0];
  • The transformations zero_times_whatever and whatever_times_zero produce the simplification of trees corresponding to multiplications by zero. Trees for expressions like (a+b)*0 or 0*(b-4) are reduced to the tree for 0.

      zero_times_whatever: TIMES(NUM($x), .) and { $x->{attr} == 0 } => { $_[0] = $NUM }

    Here zero_times_whatever is the name of the transformation. The pattern TIMES(NUM($x), .) matches any TIMES node with two children and whose first child belongs to the NUM class. The dot matches any subtree, indicating that we don't care what sort of tree the right child is. The third component

                       { $x->{attr} == 0 }

    is the semantic pattern. If both the shape pattern and the semantic pattern apply the action after the arrow is applied. The subtrees is substituted by its left child.

  • The transformation uminus simplifies the tree for unary minus of constant expressions.

          uminus: UMINUS(NUM($x)) => { $x->{attr} = -$x->{attr}; $_[0] = $NUM }

    It matches trees rooted in a UMINUS node whose only child is a NUM node. In such case the sign of the number that is the attribute of the TERMINAL node is changed and the tree is substituted by its single child.

The call

                  $p->generate(); 

compiles the transformation specification producing a set of transformations $constantfold, $zero_times_whatever, whatever_times_zero and $uminus. Transformations are Parse::Eyapp::YATW objects. The list variable @all refer to the whole set of Parse::Eyapp::YATW transformations.

The nodes of the abstract syntax tree are objects. The class (NUM, TIMES, UMINUS, etc.) defines the type of node. All node classes inherit from the class Parse::Eyapp::Node. Parse::Eyapp::Node provides a set of methods to manipulate nodes. Among these methods are str, m and s. The m and s methods resemble the matching and substitution operators for regular expressions. But instead of regular expressions they work with tree transformations or treeregexps or, more precisely with Parse::Eyapp::YATW objects. By calling:

                          $t->s($uminus);

subtrees like

                  UMINUS(UMINUS(NUM(TERMINAL[2])))

are simplified to

                                   NUM(TERMINAL[2])

The call to

                           $t->s(@all);

applies the whole set of transformations. The transformations in @all are iteratively applied to the tree $t until no transformation succeeds: Yes, that means that a inappropriate set of transformations my hang your program.

Thus, the former syntax tree for "2*-3+b*0;--2\n"; becomes:

  EXPRESSION_LIST(NUM(TERMINAL[-6]),NUM(TERMINAL[2]))

The analyzer has been able to optimize - at compile time - the computation of these two expressions

                    2*-3+b*0;
                    --2

reducing them to the computation of:

                     -6;
                      2

SEE ALSO

REFERENCES

CONTRIBUTORS

AUTHOR

Casiano Rodriguez-Leon (casiano@ull.es)

ACKNOWLEDGMENTS

This work has been supported by CEE (FEDER) and the Spanish Ministry of Educacion y Ciencia through Plan Nacional I+D+I number TIN2005-08818-C04-04 (ULL::OPLINK project http://www.oplink.ull.es/). Support from Gobierno de Canarias was through GC02210601 (Grupos Consolidados). The University of La Laguna has also supported my work in many ways and for many years.

A large percentage of code is verbatim taken from Parse::Yapp 1.05. The author of Parse::Yapp is Francois Desarmenien.

I wish to thank Francois Desarmenien for his Parse::Yapp module, to my students at La Laguna and to the Perl Community. Thanks to the people who have contributed to improve the module (see "CONTRIBUTORS" in Parse::Eyapp). Thanks to Larry Wall for giving us Perl. Special thanks to Juana.

LICENCE AND COPYRIGHT

Copyright (c) 2006-2008 Casiano Rodriguez-Leon (casiano@ull.es). All rights reserved.

Parse::Yapp copyright is of Francois Desarmenien, all rights reserved. 1998-2001

These modules are free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.