The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

pct_optable_guide.pod - A Guide to Using an Operator Parsing Table in PGE-based grammars.

DESCRIPTION

This document describes how to use an operator parsing table in grammars written for the Parrot Grammar Engine. The mk_language_shell.pl script will generate a very simple grammar that already includes an operator table. This document may help you to understand the syntax of defining an optable.

WHY AN OPTABLE?

Even in simple languages such as languages/abc, parse trees can become very big for even very simple expressions.

Consider the following example grammar:

 token addop { '+' | '-' }
 token mulop { '*' | '/' }

 rule expression {
    <mulexpr> [<addop> <mulexpr>]*
 }

 rule mulexpr {
    <primary> [<mulop> <primary>]*
 }

 rule primary {
    | <integer>
    ...
 }

An expresson such as 1 + 2 will result in a parse tree like this:

 expression
   mulexpr
      primary
         integer
            1
   addop
      +
   mulexpr
      primary
         integer
            2

This may not seem very big, but remember it's only for 1 + 2! Also note that you have to add at least one rule per precedence level, so the rules for parsing simple statements can increase the size of your grammar dramatically.

In order to prevent very big parse trees (which is rather inefficient), it's much better to do bottom-up parsing for expressions such as these.

HOW TO USE AN OPTABLE?

Perl 6 rules are run as ordinary subroutines (well, in fact, they are methods of the class representing the grammar you're writing), and parsing will be done by invoking a rule's subrules. A subrule can also contain subrules. Therefore, the generated parser is a top-down parser, starting at the TOP rule, and invoking rules that represent lower parts of the grammar.

Switching Top-down and Bottom-up parsing

An optable parses operators bottom-up, which is the other well-known parsing approach, implemented popular parser generators such as Yacc. At some point, when using an optable, your parser must switch from top-down parsing to bottom-up, and after parsing the operators (and related operands), the parser switches back from bottom-up to top-down again.

Buttom up parsing is very well suited for expressions that consist of terms, unary and binary operators with no two terms in a row.

In order to define the entry point of the bottom-up, operator parsing table, you should define a rule. Below is an example that states that an expression is parsed by the operator parser:

 rule expression is optable { ... }

The { ... } is legal Perl 6 syntax. See Synopsis 6: subroutines for details.

In order to define what an operand is, a special rule is define, named term:.

 proto 'term:' is tighter('infix:*')
               is parsed(&simple_expression) { ... }

This rule states that whenever an operand is expected, it is parsed by the rule named simple_expression. In other words, this is the point where the parser switches back from bottom-up to top-down.

Be sure to add the tighter clause, to make sure that your terms are parsed correctly. The grammar compiler (PGE) will not notify you if you don't do this.

Defining operators

In between these two rules defining the entry point and the exit point of the operator parser, operators are declared, along with their precedence. This typically look as follows:

 proto 'infix:+' is looser('infix:*') { ... }

This defines the symbol + as an infix operator. The is looser specification defines the precedence of the + operator in relation to the * operator.

Operator precedence

Operator precedence can be expressed using the following traits:

  • looser( OP )

    States that the precedence of the current operator being defined is looser than op.

  • equiv( OP )

    States that the precedence of the current operator being defined is equal to the precedence of operator OP.

  • tighter( OP )

    States that the precedence of the current operator being defined is tighter than the precedence of the operator OP.

Be sure that OP is defined before referencing it in a operator precedence statement as discussed in this section.

Where's the operator?

In Perl 6, operators are just subroutines with special names and scoping (see Synopsis 6: subroutines).

When operators such as infix:+ are defined, the bottom-up parser will generate a call instruction to a subroutine with the same name, that is, infix:+. Therefore, you need to write a subroutine for each operator you define. The subroutine for the infix:+ operator could look like this:

 .sub 'infix:+'
    .param pmc a      # left operand
    .param pmc b      # right operand
    n_add $P0, a, b   # create a new PMC object with the value a+b
    .return ($P0)
 .end

Whenever an expression such as 42 + 1 is parsed, this will result in a call to infix:+ with the operands 42 and 1. The infix:+ sub will create a new object, and assign the result of 42 + 1 to it, after which this new object is returned. Note that the n_ prefix of the add instruction implies that a new object is created.

You might think that it's somewhat overkill to write and call a subroutine for such a simple operation, and that this could be done simpler. Well, you're right. If the implementation of the + operator is as simple as in this case, there's a simpler way to implement the exact same behavior. This can be accomplished using a the pirop trait, discussed below.

Infix, prefix, postfix

Besides infix operators, operators can be prefix or postfix Operators can even be declared to be both prefix and infix. A common example is the - operator. This is fine, because each operator has a certain precedence, which is kept track of by the bottom-up parser.

In order to define a prefix ++ operator, you could write this:

 proto 'prefix:++' ... # specify precedence

Besides these well-known mathematical operators, some languages have built-in operators with normal names. For instance, Ruby defines an operator called defined?, which queries its operand whether it is defined. Defining such special operators are not special in any way, just declared them as any other operator:

 proto 'prefix:defined?' ...

Note that defined? is a prefix operator, which would look like this in code:

 defined? $x

Ternary operators

Most languages only have one ternary operator: ? :. An example is shown below:

 $income > 100000 ? print("I'm rich!") : print("I'll keep my job for now")

To declare a ternary operator, you write:

 proto ternary:<? :> is looser('infix:+') { ... }

Circumfix operators

Brackets can be considered operators as well. There are two types:

  • circumfix

    circumfix operators can enclose other terms and operators. For instance many languages define a rule expression similar to the following:

     expression: literal
               | identifier
               | '(' expression ')'
               | expression ['+'|'-'|...] expression

    When defining an operator table, the third alternative specifying a parenthesized expression can be expressed as follows:

     proto circumfix:<( )> ...

    This means that operands (which are parsed through the special term: rule) can be enclosed in parenthesis, like so:

     (1 + 2) * 3

    This is legal, because the ( ) is just another operator handled by the operator table.

  • postcircumfix

    Postcircumfix brackets are useful for subroutine invocation or indexing. Of course, this fully depends on the specification of your language. Sometimes, you need a different rule to define subroutine invocation syntax. This is the case when arguments can be other objects than operands of normal operators (which, again, are defined by the c<term:> rule).

    An example to handle indexing (assuming the index is an operand as any other operator's operand) is this:

     proto postcircumfix:<[ ]> ...

    which, given the following rules:

     rule expression is optable { ... }
    
     rule 'term:' (...) is parsed(&simple_expression) { ... }
    
     rule simple_expression {
        | <literal>
        | <ident>
     }

    allows us to write this:

     foo["hello"]

    Here, "hello" is a literal and foo is an identifier.

The assoc trait

Operators have a certain associacity. For instance, the + operator is usually said to be left associated, while the exponential operator (often written as ^) is usually right associated. Consider this example:

 10 - 5 - 2

Should this be parsed as:

 ((10 - 5) - 2)  # 5-2=3

or as:

 (10 - (5 - 2))  # 10-3=7

In other words, does the - associate with the left or right operand? According to standard rules, the - operator is left associated, which means the first option is correct.

This associecity is expressed using the assoc trait, like so:

 proto 'infix:-' is assoc('left') ...

If you don't specify the association, it defaults to left. right association works the other way around. In mathematics, the power operator is right associated.

There is a third associacity in Perl 6. Generally a list (using the comma operator) is not nested at all:

 1, 2, 3, 4

should neither be parsed as

 (((1, 2), 3), 4)

nor as

 (1, (2, (3, 4)))

You can achieve that with the is assoc('none') trait.

The pirop trait

Some operators can be perfectly mapped to a specific Parrot instruction, for instance the n_add op that we introduced earlier. By default, an operator is implemented as a subroutine call, which is obviously not as efficient as a single Parrot instruction. Therefore, you can specify the Parrot instruction using the pirop trait. You can do this as follows:

 proto 'infix:+' ... is pirop('n_add') { ... }

This will not work for all Parrot ops, though. Certain instructions leave their result in an I register (one of the four types of Parrot registers). PCT currently only supports PMC registers (P registers). Operators such as == and != must therefore be implemented as subroutine calls.

The pasttype trait

Some operators have behavior that can be implemented by certain PAST nodes. For instance, many languages define the semantics of the and operator to be the following:

 A and B

 1 evaluate A
 2 if result(1) is false return false
 3 evaluate B
 4 return result(3)

As soon as the result of A is found to be false, there is no need to evaluate B, as the final result can never be true (as this is the and operator). So, B is evaluated only if A is true.

This is very similar to the PAST::Op node with the if :pasttype:

 if (A) {
   B
 }

Therefore, an operator such as and can be implemented as an if statement. In order to specify that the and operator must be handled as a PAST::Op( :pasttype('if') ) node, you can use the pasttype trait, like so:

 proto 'infix:and' ... is pasttype('if') { ... }

The or operator is similar to the semantics of a PAST::Op( :pasttype('unless') node:

 A or B

 1 evaluate A
 2 if result(1) is true return true
 3 evaluate B
 4 return result(3)

So, unless A is true, evaluate B. Hence, the or operator could be implemented like so:

 proto 'infix:or' ... is pasttype('unless') { ... }

Special characters

Some operators, such as the shift operators contain the "<" or ">" character. As operators can be specified as, for instance, infix:<+>, defining an operator such as "<<" will make the rule parser (the parser generator) confused.

You have two options. Either use quoted names for the operator subroutine names, like so:

 proto 'infix:>>' is equiv('infix:<<') { ... }

Or, use so-called French quotes to do this. This looks as follows:

 proto infix:«>>»  is equiv(infix:«<<») { ... }

FAQ

  • Why are some operators quoted and some not?

    That's a matter of taste. You can define the operator + in various ways:

     proto infix:+
     proto infix:<+>
     proto 'infix:+'

    Note that 'infix:<+>' is not allowed.

  • I get an error message saying:

     "Can only replace inside string or index after end of string"

    This error occurs if the operator table contains a misspelling somewhere. Make sure that operators containing the character "<" or ">" are quoted using French quotes (« and »), or quote the proto name, as in 'infix:<'. The best solution to solve this is to comment out the whole table except for a single operator, and try to find the error using a binary search. Or, better yet, set up your operator table incrementally, while keeping tests passing.

  • Help! I defined my operator table, but some operators are not recognized.

    Be sure that there are no spelling errors. For instance, on first sight the following looks correct:

     proto 'infix:-' is equal('infix:+') { ... }

    However, equal is not a valid precedence specifier, it should be spelled as equiv. The PGE will not warn you for such errors.

  • How many operator tables can I use in my language?

    {{ XXX I think one only? Anybody? }}

  • How does an optable parser work?

    If you really want to know, check out compiler literature. It's something with stacks and emitting instructions for operators with tighter precedence (and their operands) first.

SEE ALSO

1 POD Error

The following errors were encountered while parsing the POD:

Around line 389:

Non-ASCII character seen before =encoding in 'infix:«>>»'. Assuming UTF-8