This is the second implementation of a regular expression/rules/grammar engine designed to run in Parrot. It's still a work in progress, and some parts of the implementation are designed simply to "bootstrap" us along (i.e., some parts such as the parser and generator are expected to be discarded). The current work is also largely incomplete -- although it has support for groups (capturing and non-capturing), quantifiers, alterations, etc., many of the standard assertions and character classes are not implemented yet but will be coming soon.
The previous version of PGE used a C-based parser and generator, but after the capture semantics were redesigned (Winter 2005) Pm decided that it would be better to just write the parser and generator in Parrot. Thus, the current version.
In addition we'll be looking at writing a parser and compiler for Perl *5* regular expressions, but the focus for the time being is (obviously) on Perl 6.
PGE assumes that it is part of the parrot distribution in the compilers/pge directory. Simply type
make in this directory to build the various *.pbc files and install them into runtime/parrot/library.
The distribution comes with a small demo.pir program that gives an example of using PGE. To run the demo, simply do
parrot demo.pir. The demo understands the following commands:
rule pattern - compile a Perl 6 rule from "pattern" save name - save the current rule as "name" text - a text string to match against previously entered rule pir - display the PIR cod generated for current rule exp - display the expression tree for the current rule trace - toggle pattern execution tracing next - repeat last match on target string
Once PGE is compiled and installed, you generally load it using the load_bytecode operation, as in
This imports the
PGE::p6rule subroutine, which can be used to compile Perl 6 rules. A sample compile sequence would be:
.local pmc p6rule_compile find_global p6rule_compile, "PGE", "p6rule" # get the compiler .local string pattern .local pmc rulesub pattern = "^(From|Subject):" # pattern to compile rulesub = p6rule_compile(pattern) # compile it to rulesub
Then, to match a target string we simply call the subroutine to get back a
.local pmc match $S0 = "From: email@example.com" # target string match = rulesub($S0) # execute rule
The Match object is true if it successfully matched, and contains the strings and subpatterns that were matched as part of the capture. The
dump method can be used to quickly view the results of the match:
match_loop: unless match goto match_fail # if match fails stop print "match succeeded\n" match."dump"() # display matches match."next"() # find the next match goto match_loop match_fail: print "match failed\n"
One can also get the intermediate PIR code that PGE generates for the rule subroutine -- just use
(rulesub, $S0) = p6rule_compiler(target)
and you can print/inspect the contents of $S0 to see the generated code.
Since the Parrot rewrite, PGE knows and uses as much of Unicode strings as Parrot does.
Some backslashes aren't implemented yet, although the major ones are (\d, \s, \n, \D, \S, \N).
PGE doesn't (yet) properly handle nested repetitions of zero-length patterns in groups -- that's coming soon.
This is just the first-cut framework for building the remainder of the engine, so many items (lookaround, conjunctions, closures, and hypotheticals) just aren't implemented yet. They're on their way!
Also, many well-known optimizations (e.g., Boyer-Moore) aren't implemented yet -- my primary goals at this point are to "release early, often" and to get sufficient features in place so that more people can be testing and building upon the engine.
Lastly, error handling needs to be improved, but this will likely be decided as we discover how PGE integrates with the rest of Parrot.
Basically, PGE is a compiler just like any other, except that its "language" is the Perl 6 rules syntax and its output is a subroutine that can match strings. So, PGE consists of a series of parsers (for each pattern matching language), an intermediate expression format, and a code generator.
The generated code uses bsr/ret for its internal subroutine calls (also optimized for tailcalls) and then uses Parrot calling conventions for all interfaces with external callers/callees. This should give some performance improvements.
PGE also uses Parrot coroutines for the matching engine, so that after a successful match is found, the next match within the same string can be found by simply returning control to the matching coroutine (which then picks up from where it had previously left off until another match is discovered).
The code still needs a fair amount of commenting. In general, if you have a question about a particular section of code, send Pm an email and he'll write the comments for it.
Patrick Michaud (firstname.lastname@example.org) is the author and maintainer. Patches and suggestions should be sent to the Perl 6 compiler list (email@example.com).