
So far we've talked a lot about low-level Parrot programming with PIR and PASM. However, the true power of Parrot is its ability to host programs written in high-level languages such as Perl 6, Python, Ruby, Tcl, and PHP. To write code in these languages, developers need compilers that convert the language into PIR or PASM (or even directly into Parrot bytecode). People who have worked on compilers before may expect us to use terms like "lex and yacc" here, but we promise that we won't.

Instead of the traditional lexical analyzers and parser-generators that have been the mainstay of compiler designers for decades, Parrot uses an advanced set of parsing tools called the Parrot Compiler Tools (PCT). PCT uses a subset of the Perl 6 programming language called Not Quite Perl (NQP) and an implementation of the Perl 6 Grammar Engine (PGE) to build compilers for Parrot. Instead of using traditional low-level languages to write compilers, we can use a modern dynamic language like Perl 6 to write them. On a more interesting note, this means that the Perl 6 compiler is itself being written in Perl 6, a mind-boggling process known as bootstrapping.

PCT Overview

PCT is a collection of classes which handle the creation of a compiler and driver program for a high-level language. The PCT::HLLCompiler class handles building the compiler front end while the PCT::Grammar and PCT::Grammar::Actions classes handle building the parser and lexical analyzer. Creating a new HLL compiler is as easy as subclassing these three entities with methods specific to that high-level language.

Grammars and Action Files

Creating a compiler using PCT requires three basic files, plus any additional files needed to implement the language's logic and library:

    * A main file

    The main file should contain the :main function that serves as the driver program for the compiler. Here, a new PCT::HLLCompiler object is instantiated, libraries are loaded, and any necessary special global variables are created. The driver program is typically written in PIR, although thankfully it tends to be very short; most of the action happens elsewhere.

    * A parser file

    The grammar for the high-level language is specified using the Perl 6 grammar engine (PGE) and is stored in a .pg file. This file should subclass the PCT::Grammar class and implement all the rules necessary to parse the language.

    * An actions file

    Actions files are written in NQP. They take the match objects generated by the grammar and convert them into an Abstract Syntax Tree (AST), which PCT then converts into PIR for compilation and execution. The PIR implementation of these AST nodes is called the Parrot Abstract Syntax Tree (PAST).
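A minimal driver along these lines might look like the following PIR sketch. Real generated drivers, such as those produced by make_language_shell.pl, typically split setup into a separate :init sub and also load the compiled grammar and actions files; 'MyLang' and its Grammar/Actions package names here are placeholders, not from any real language.

 .sub 'main' :main
    .param pmc args

    # Load the Parrot Compiler Toolkit runtime.
    load_bytecode 'PCT.pbc'

    # Create and configure the compiler object.
    .local pmc compiler
    compiler = new ['PCT';'HLLCompiler']
    compiler.'language'('MyLang')
    compiler.'parsegrammar'('MyLang::Grammar')
    compiler.'parseactions'('MyLang::Grammar::Actions')

    # Let the standard driver handle flags, input files, and the REPL.
    compiler.'command_line'(args)
 .end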

make_language_shell.pl

Parsing Fundamentals

Compilers typically consist of three components: the lexical analyzer, the parser, and the code generator. The lexical analyzer converts the HLL input file into individual tokens. A token may be an individual punctuation mark ("+"), an identifier ("myVar"), a keyword ("while"), or any other artifact that cannot be sensibly broken down further. The parser takes a stream of these input tokens and attempts to match them against a given pattern, or grammar. The matching process arranges the input tokens into an abstract syntax tree (AST), a form that the computer can easily work with. The AST is passed to the code generator, which converts it into code in the target language. For something like the GCC C compiler, the target language is machine code. For PCT and Parrot, the target language is PIR and PBC.
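In PCT the lexical level is not a separate program: token definitions are written in the same grammar as the parser's rules. As an illustrative sketch (the names and patterns here are examples, not taken from any real grammar), the token categories above might be expressed as:

 token identifier  { <alpha> \w* }      # e.g. "myVar"
 token keyword     { 'while' | 'if' }   # reserved words
 token punctuation { '+' | '-' | '*' }  # single-character operators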

Parsers come in two general varieties: top-down and bottom-up. Top-down parsers start with a top-level rule, a rule which is supposed to represent the entire input. The parser attempts to match various combinations of subrules until the entire input is matched. Bottom-up parsers, on the other hand, start with individual tokens from the lexical analyzer and attempt to combine them into larger and larger patterns until they produce a top-level token.

PGE itself is a top-down parser, although it also contains a bottom-up operator precedence parser for things like mathematical expressions, where bottom-up methods are more efficient. We'll discuss both, and the methods for switching between the two, throughout this chapter.

Driver Programs

HLLCompiler class

Parrot Grammar Engine

The Parrot Grammar Engine is an implementation of the Perl 6 grammar syntax in Parrot. The grammar engine in Perl 6 is much more advanced and flexible than the regular expression syntax of Perl 5. Since most other languages that implement regular expressions use a copy (or a subset) of Perl 5's regular expressions, PGE is much more powerful than those, too.

PGE uses a recursive descent algorithm (it also uses an operator precedence parser, when you ask it nicely), which should be very familiar to users of the Perl 5 module Parse::RecDescent. In fact, both were originally designed by the same developer, Damian Conway. Recursive descent, for those who get into the algorithmic details, is a top-down parsing algorithm, unlike the bottom-up LALR algorithm used in parser-generators like yacc and Bison. Most programmers won't notice the difference between the two for most applications, although there are some specific places where the behavior will differ.

Rules and Actions

A recursive descent parser, like the one used in PGE, is a top-down parser. This means it attempts to start at the highest-level rule and work its way down to the individual input tokens in order to match the given input. Once the parser has matched the entire input (a source code file, or a line of input at the terminal in interactive mode), the parse is considered successful and the generated AST is delivered to the code generator for conversion into PIR.

The parser is formed by creating a grammar. Grammars consist of three primary components: rules, tokens, and protoregexes.

    * Rules

    Rules are the most basic grammar element. Rules may call subrules, and whitespace in a rule's pattern is significant in a special way: it implicitly matches optional whitespace in the input between elements.

    * Tokens

    Tokens are like rules without the implicit whitespace handling: whitespace in a token's pattern is not significant, so any literal whitespace to be matched must be written explicitly. Tokens also do not backtrack once part of the pattern has matched.

    * Protoregexes

    A protoregex is like a rule or a token, except that it can be overloaded dynamically: alternatives may be declared separately, and new alternatives can be added at run time.

When a rule matches a sequence of input tokens, an associated method in NQP is called to convert that match into an AST node. This node is then inserted into the parse tree.
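For example, here is a sketch of what such an action method looks like in NQP. The PAST::Val node and its options are typical of PCT, but the details are illustrative rather than taken from any particular compiler: the method builds a node from the matched text and attaches it with make.

 method persons_name($/) {
     # ~$/ stringifies the match object; build a constant node
     # from it and attach it to the match as this rule's AST.
     make PAST::Val.new( :value(~$/), :returns('String'), :node($/) );
 }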

Basic Rules

Let's start off with a simple rule:

 rule persons_name {
    <first_name> <last_name>
 }

We also define the two name tokens as:

 token first_name { <alpha>+ }
 token last_name { <alpha>+ }

The special token <alpha> is a built-in construct that matches only uppercase and lowercase letters. The "+" after the <alpha> tag is a short way of saying "one or more". Our rule persons_name would match something like "Darth Vader" (actually, it matches a lot of things that aren't people's names) but wouldn't match something like "C-3PO". Notice that the rule above would match "Jar Jar Binks", but not the way you would expect: it would match the first "Jar" as <first_name> and the second "Jar" as <last_name>, and wouldn't match "Binks" at all.

In PGE, the top-level rule, which starts the match process and must match successfully in order for the compiler to continue, is always called TOP. Without a TOP rule, your compiler won't do anything (actually, it will print out an error saying that you have no TOP rule, and then it will exit, but that isn't anything that we would want it to do).

Here's an example TOP rule that uses our definition for <persons_name> above and matches a file that contains a comma-separated list of names:

 rule TOP {
    [ <persons_name> ',' ]*
 }

This example shows another new construct, the square brackets. Square brackets group things together. The star at the end means that we match the group inside the brackets zero or more times. This is similar to the plus, except the plus matches one or more times. Notice, however, that the above rule expects a comma after every name, including the last, so we would need input like:

 Darth Vader, Luke Skywalker,

Instead of something more natural like:

 Darth Vader, Luke Skywalker

We can modify the rule a little bit so that it always ends with a name instead of a comma:

 rule TOP {
    [ <persons_name> ',' ]* <persons_name>
 }

Now we don't need a trailing comma, but at the same time we can't match an empty file, because the rule always expects at least one name at the end. If we still want to match empty files successfully, we need to make the whole rule optional:

 rule TOP {
    [ [ <persons_name> ',' ]* <persons_name> ]?
 }

We've grouped the whole rule together in another set of brackets, and put a "?" question mark at the end. The question mark means zero or one of the prior item.

The symbols "*" (zero or more), "+" (one or more), and "?" (zero or one) are called quantifiers; they allow an item in the rule to match a variable number of times. These aren't the only quantifiers, but they are the most common. We will talk about other quantifiers later on.
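As a preview, Perl 6 regex syntax also defines a counted repetition quantifier. In PGE-era syntax it is written with braces; treat the following as a sketch and verify it against your version of PGE, since this syntax has changed over Perl 6's history:

 token area_code  { \d**{3} }          # exactly three digits
 token short_name { <alpha>**{1..8} }  # between one and eight letters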

Calling Actions

We haven't covered actions yet, but it's still worth explaining now how we will call them when we are ready. We call an action by inserting the {*} token into a rule. When the {*} token is encountered, PGE calls the associated action method with the current match object as an argument. Let's take our persons_name rule from above and sprinkle it liberally with action calls:

 rule persons_name {
    {*} <first_name> {*} <last_name> {*}
 }

The first call to the action method contains an empty match object, because the parser hasn't had a chance to match anything yet. The second call contains only the first name of the match. The third and final call contains both the matched first and last names. Notice that if the match fails halfway through, we still call the actions for the parts that succeeded, but not the actions after the failure. So, if we try to match the string "Leia", the action method is called before matching begins and again after the first name matches. When the rule tries to match the last name, it fails because no last name is provided, and the third action method call is never made.

Alternations and Keys

In addition to subrules, groups, and quantifiers, we can also write alternations between either-or options. The vertical bar token "|" separates options of which only one may match. Here's an example:

 rule hero {
    ['Luke' | 'Leia'] 'Skywalker'
 }

This rule will match either "Luke Skywalker" or "Leia Skywalker" but won't match "Luke Leia Skywalker" or anything else, for that matter. With things like alternations, if we want to call an action method it's helpful to distinguish which combination we matched:

 rule hero {
    [
      'Luke' {*}    #= Luke
    | 'Leia' {*}    #= Leia
    ]
    'Skywalker'
 }

This is the same rule, except now it passes two arguments to its action method: the match object and the name of the person who was matched.
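On the NQP side, that key arrives as a second argument to the action method. Here is a sketch of the shape such a method takes; the method body is illustrative only:

 method hero($/, $key) {
     # $key holds the text after the #= marker: 'Luke' or 'Leia'
     if $key eq 'Luke' {
         # build and make the AST node for Luke here
     }
     else {
         # build and make the AST node for Leia here
     }
 }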

Warning: Left Recursion

Getting into all the nitty-gritty theory behind parsers is well beyond the scope of this book. However, there is one potential pitfall that developers should be made aware of that is not immediately obvious. Like functions in ordinary procedural or functional languages, the methods in the PGE parser grammar can call themselves recursively. Consider the following rules derived in part from the grammar for the C programming language:

 rule if_statement {
    'if' <condition> '{' <statement>* '}' <else_block>?
 }

 rule statement {
    <if_statement> | <expression>
 }

 rule else_block {
    'else' '{' <statement>* '}'
 }

Notice that an if_statement can contain a list of statements, and that each statement may itself be an if_statement? This is called recursion, and is part of where the "recursive descent" algorithm gets its name.

Now, let's look at a more direct example of a comma-separated list of integer digits to form an array. We can define this recursively as follows:

 rule list {
     <list> ',' <digit> | <digit>
 }

The intention is that if there is only one digit, we match the second option in the alternation, and if there are more digits we can match them recursively through the first alternative. However, take a close look at the insidious result. The recursive descent parser enters the list rule. Its first option is to enter the list rule again, so it does. Recursive descent is a depth-first algorithm, and it will continue to descend down a particular path until it finds a successful match or a match failure. In this case, it matches list, and then it matches list again, and again, and so on. What we have created is an infinite loop pattern called left recursion.

Left recursion arises when the left-most item of an alternative is a recursive call to the rule itself, either directly or indirectly through another rule. The rule above can be fixed by making the recursion right-recursive instead:

 rule list {
    <digit> ',' <list> | <digit>
 }

Now the rule must consume a digit (and a comma) before it recurses, so every recursive call is guaranteed to advance through the input, and the parser can no longer loop forever. Note that merely reordering the alternatives, as in <digit> | <list> ',' <digit>, is not a real fix: the first alternative would match only a single digit, and when no digit is present the parser falls through to the left-recursive alternative and loops just as before.

Here is a trickier example, where the left recursion is hidden from view:

 rule term {
    <expression> '*' <term> | <digit>
 }

 rule expression {
    <term> '+' <expression> | <term>
 }

This is a very limited subset of the mathematical expressions we might like to write, and even in this small subset we have the same problem: to match a term, the parser first tries to match an expression, which in turn tries to match a term, which tries to match an expression ...
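As with the simpler list example, the cure is to ensure that every recursive path consumes some input before it recurses. One conventional restructuring (a sketch; real expression grammars in PGE usually hand this job to the operator precedence parser instead) is:

 rule term {
    <digit> '*' <term> | <digit>
 }

 rule expression {
    <term> '+' <expression> | <term>
 }

Now term recurses only into itself, and only after consuming a digit; expression's left-most call is to term, which must consume a digit before doing anything else, so the parser always advances.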

Left recursion is not the only problem you can run into with a recursive descent grammar, but it is one that is likely to come up relatively often for new language designers, and one that does not always generate useful error messages.

Operator Precedence Parser

Actions and NQP

NQP Basics

The Match Object $/

Making Trees
