The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

Parrot Grammar Engine (PGE)

This is an initial implementation of a regular expression/rules/grammar engine designed to run in Parrot. It's definitely a work in progress, and much of the current implementation is designed simply to "bootstrap" us along (i.e., some parts such as the parser and generator are expected to be discarded). The current work is also largely incomplete -- although it has support for groups (capturing and non-capturing), quantifiers, alterations, etc., many of the standard assertions and character classes are not implemented yet but will be coming soon.

In addition there's some experimental work here in being able to parse and convert Perl *5* regular expressions into PIR subroutines, but the focus should be on building the perl6 capabilities first and we'll fold in perl 5 syntax a bit later.

Installation

PGE assumes that it is part of the parrot distribution in the compilers/pge directory. Simply type make in this directory to build the pge.so shared library it needs -- pge.so is then placed in runtime/parrot/dynext. In some environments you may need to set LD_LIBRARY_PATH to search this correctly.

The distribution comes with a small demo.pir program that gives an example of using PGE. To run the demo, simply do parrot demo.pir. The demo understands the following commands:

    rule <pattern>    - a perl 6 rule to be matched
    text              - a text string to match against previously entered rule
    glob <pattern>    - a glob (wildcard) pattern to be matched
    print             - display the PIR cod generated for current rule
    trace             - toggle pattern execution tracing
    next              - repeat last match on target string

If you get errors about "cannot open shared object file", that usually means that Parrot was unable to locate the pge.so object file.

Using PGE

Because I couldn't find a lot of detailed information about writing a Parrot compiler in C, I went ahead and wrote this using Parrot's native call interface. Essentially the library provides a pge_p6rule_pir() function that builds a PIR code string from a given rule string, this PIR code string can then be sent to the PIR compiler to get a final subroutine. (Having the PIR code string easily available and printable is also very handy for debugging as we build the rest of the grammar engine.)

There's a PGE.pir library in runtime/parrot/library that provides PIR-compatible calling interfaces to the compiler. Here's a short example for compiling a regular expression and then matching it against another string:

    load_bytecode "library/PGE.pir"          
    .local pmc p6rule_compile
    find_global p6rule_compile, "PGE", "p6rule"  # get the compiler

    .local string pattern       
    .local pmc rulesub                     
    pattern = "^(From|Subject):"           # pattern to compile
    rulesub = p6rule_compile(pattern)     # compile it to rulesub

    .local pmc match
    $S0 = "From: pmichaud@pobox.com"       # target string
    match = rulesub($S0)                   # execute rule on target string

  match_loop:
    unless match goto match_fail           # if match fails stop
    print "match succeeded\n"

    match."_print"()                       # display captures ($0, $1, etc.)

    match."_next"()                        # find the next match
    goto match_loop

  match_fail:
    print "match failed\n"            

One can also get the intermediate PIR code that PGE generates for the rule subroutine -- just use

    (rulesub, $S0) = p6rule_compiler(target)

and you can print/inspect the contents of $S0 to see the generated code.

Known Limitations

At the moment, PGE is not very aware of Unicode strings -- it's pretty much limited to 8-bit Latin-1 characters in patterns. This is because the PGE parser is still in C, and because support for character class operations in Parrot is still being worked out. Eventually these things will be resolved and we'll fix PGE up for Unicode then.

Most of the backslash sequences (\d, \D, \N, etc.) are unimplemented, although backslashes do properly escape metacharacters in literals.

PGE doesn't (yet) properly handle nested repetitions of zero-length patterns in groups -- that's coming next.

This is just the first-cut framework for building the remainder of the engine, so many items (assertions, lookaround, conjunctions, closures, character classes, and hypotheticals) just aren't implemented yet. They're on their way!

Also, many well-known optimizations (e.g., Boyer-Moore) aren't implemented yet -- my primary goals at this point are to "release early, often" and to get sufficient features in place so that more people can be testing and building upon the engine.

Lastly, error handling needs to be improved, but this will likely be decided as we discover how PGE integrates with the rest of Parrot.

Implementation notes

Basically, PGE consists of a parser and a PIR code generator written in C. The parser takes a string representing a P6 rule and builds a parse tree for the rule. The generator then produces a PIR subroutine (matching engine) that can match strings according to the components of the rule.

The matching engine uses bsr/ret for its internal subroutine calls (also optimized for tailcalls) and then uses Parrot calling conventions for all interfaces with external callers/callees. The compiler is designed such that we can switch the matching engine to use PCC for its internal calls at some point in the future.

PGE also uses Parrot coroutines for the matching engine, so that after a successful match is found, the next match within the same string can be found by simply returning control to the matching coroutine (which then picks up from where it had previously left off until another match is discovered).