The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Parse::Gnaw - Define a grammar and create a parser by calling nothing but perl subroutines.

SYNOPSIS

Gnaw is a perl module which implements full regular expressions and full text parsing grammars using nothing but pure perl code limited to subroutine closures and basic perl variables such as scalars, hashes, arrays, and references.

You write your grammar in pure perl. There is no intermediate "parser language" that then gets interpreted into something executable.

When you do a "use Parse::Gnaw", the Gnaw module will import a number of functions directly into your namespace. Yes, this is completely bad form for normal modules. But this is not a normal module. The imported subroutines include regular expression and parsing functions for matching, quantifiers, literals, alternations, character classes, and so on. You build up your grammar by calling these functions. The final call will return a code reference. This code reference is your grammar.

When you dereference that grammar, if it is a "match" function, then you pass in the string you want to parse.

        use Parse::Gnaw;

        # create the grammar
        my $grammar = match('hello');

        # apply the grammar to a string
        if($grammar->('hello world')) {
                print "match\n";
        } else {
                print "no match";
        }

You can also create the grammar and execute it in one step:

        my $texttoparse = "howdy partner";

        if(match('hello', 'world')->($texttoparse)) {
                print "match\n";
        } else {
                print "no match\n";
        }

Note the above example translated into perls regular expression syntax would look something like this:

        my $texttoparse = "howdy partner";

        if($texttoparse =~ m{hello\s*world}) {
                print "match\n";
        } else {
                print "no match\n";
        }

You can build up more complicated grammars fairly easily. This one looks for a sentence about fruits.

        $grammar = match(ql('I would like to buy'), some('a', qa('banana apple pear peach')));

The "match" function looks for a match to the grammar in the string being parsed.

The "ql" function (quoted literal) allows you to put a sequence of literals into a single string. It the splits the string up into individual literals much like perls "qw" function does. Then it puts them into a grammar sequence for you. This saves you from putting quotes around each individual literal.

The "some" function is a quantifier looking for "1 or more" of whatever it surrounds, in this case, a sequence of the literal "a" followed by an alternation of different possible fruit.

The "qa" function (quoted alternation) takes a single string and splits it into individual words, also similar to perls "qw" function. The "qa" function then takes those individual words and creates an alternation that tries to match any individual word in the string as a valid alternate.

BETA

Please note that this is a beta release. This is more of a proof of concept than something ready for production code or for massive grammars. The interfaces may change completely in the future. When the interfaces have settled, I will release this as a version 1.0+ module. Until then, please do not use this to develop some gigantic parser when the grammar may have to completely change.

EXPORT

Currently, EVERYTHING is exported into the callers namespace. Yes, I know, bad programmer. No biscuit.

The problem is I did not want grammars to be weighed down with a bunch of package prefixes at every step.

        my $grammar = Parse::Gnaw::match(Parse::Gnaw::a(Parse::Gnaw::l('hello'),Parse::Gnaw::l('howdy'));

Gets rather tedious.

Hopefully, no one will have collisions with subroutine names. If they do, then it might make sense to create a new package, declare the grammar in there, then simply pull out the code reference and use that in the rest of your code.

        package mynormalpackage;

        our $grammar;

        {
                package mylocalpackage;

                use Parse::Gnaw; # all subs imported into mylocalpackage

                $mynormalpackage::grammar = match('hello', 'world');

        }


        $grammar->('howdy');

Again, everything may change tomorrow as I sort this out.

ok function

In the examples provided, the "ok" function is used repeatedly. This allows the examples to be used directly in the Parse::Gnaw tests directly. The "ok" function is defined there by the Test::More package. Here is a stub for the "ok" function so that you can use the examples independent of Test::More.

        use Parse::Gnaw;

        sub ok {
                my $boolean = shift(@_);
        
                if($boolean) {
                        print "pass";
                } else {
                        print "FAIL";
                }
                print " : ";
        
                my $msg = shift(@_);
        
                if(defined($msg)) {
                        chomp($msg);
                        print $msg;
                }
        
                print "\n";
        }

        my $grammar; # declare it here so other examples can just use it.

The above code is assumed to be contained in all the examples that follow.

OVERVIEW

So, again, when you do a

        use Parse::Gnaw;

in your code, a number of subroutines are imported into your namespace. You can then call these subroutines to build up a grammar. The grammar is held in the form of a code reference. You then dereference the coderef to call the parser and provide it with the string you want to parse. The dereferenced code reference that is the parser will return a boolean to indicate whether a match was found or not.

        $grammar = match('hello', 'world');
        
        ok($grammar->('hello world')==1), "match");
        ok($grammar->('howdy world')==0), "NO match");

        > pass : match
        > pass : NO match

If you want your grammar to match a literal string, simply pass that string into the "parse" or "match" function. Strings passed into the Gnaw functions are interpreted to mean that they are literals to look for in the text being parsed.

Top Level and Low Level Functions

The Gnaw module provides a number of "top level" functions to create your grammars. These top level functions return code references which are your grammar. You then dereference these code references to call the grammar and pass in the string you want to parse.

Only top level functions return grammar code references. They include:

        match   Find the first match of the grammar in the string.
                Skip any text until we find a match.

        parse   Parse the string from the beginning only. 
                Do not skip any text looking for a match.

        swap    Find the first match of the grammar in the string.
                Skip any text until we find a match.
                Upon finding the first match, replace the matched text
                with something else.

                !!!THE "swap" FUNCTION IS CURRENTLY NOT IMPLEMENTED!!!

These functions return code references that are the grammar, code references you dereference to run the grammar on a string.

        my $grammar = match('hello', 'world');
        ok($grammar->("these are not the droids you are looking for")==0, 'no match');

Low level functions return code references that are called "stitchers". These are low level subroutines used by Parse::Gnaw to build up the grammar. Most of the low level and high level Gnaw functions know how to handle stitchers. A "stitcher" is not a grammar, and if you try to dereference a coderef from a low level function, your script will die with strange errors.

        my $stitcher = a('hello', 'howdy');
        
        # WRONG! Don't do it!
        $stitcher->("these are not the droids you are looking for");  

Always make sure you use a top level function as your final function of your grammar.

Literals

Most of the Parse::Gnaw subroutines will recognize strings passed in as parameters as string literals to become part of the grammar. We already saw this in the above example:

        $grammar = match('hello', 'world');

This example will create a grammar that looks for the literal 'hello' followed by the literal 'world'.

Actually, the grammar will look for the literal 'hello', it will then Skip over any text that the Skip function tells it to Skip (by default, it will skip over any whitespace), and then look for the literal 'world'.

See the "Skip" function for more information.

Strings passed into Parse::Gnaw functions are perl strings, meaning that they follow the normal perl rules for string evaluation. i.e. Single quotes do NOT get evaluated. Double quotes do.

If you want to match a tab character, then be sure to use "\t" instead of '\t'.

List Context

Before we go too far into talking about Gnaw, we need to address a meta issue around Parse::Gnaw.

The thing to keep in mind is that all Parse::Gnaw subroutines are...

... perl subroutines.

This means that they are susceptible to list context issues that occur in perl when you call the functions, and that they are susceptible to list context issues and return mechanism issues when you declare your own subroutines to wrap up a subrule of your grammar.

When you call any perl subroutine, the parameters passed in are flattend into one gigantic list by the list context mechanism in perl.

        # all parameters are flattened. subroutine receives 6 parameters.
        my_perl_sub( (1,2,3),('a', 'b', 'c') ); 

When the my_perl_sub subroutine is called, perl takes all the parameters and flattens them into a single list context. This means in the above example that my_perl_sub receives 6 parameters:

        1, 2, 3, 'a', 'b', 'c'

This also means that my_perl_sub has no idea that the numbers had been grouped together in one set of parens and the letters had been grouped together in another set of parens. The my_perl_sub subroutine sees nothing but a flat list of parameters being passed in.

If you want to call a subroutine and have it treat parameters in groups, then the most common way to do that is to group the parameters you want grouped by putting them into arrays and passing references to those arrays.

        # my_perl_sub receives two parameters, two array references.
        my_perl_sub( [1,2,3] , ['a', 'b', 'c'] ); 

This exact same list context mechanism affects Parse::Gnaw. And the solution within Parse::Gnaw is the same as for general perl: wrap groups of items in anonymous arrays.

Example. You want a grammar with two possible alternates. The first alternate is "hello" followed by "world". The second alternate is "howdy" followed by "planet". If you try to put them into the alternate function (the "a" function), you might try this:

        $grammar = match( a( 
                        "hello", "world",
                        "howdy", "planet"
        ));

The problem is that the "a" function has no way of knowing how it is supposed to group the parameters. Instead, the above example will look for four possible alternates:

        "hello"
        or
        "world"
        or
        "howdy"
        or
        "planet"

The solution is to group the parameters with an anonymous array reference. The above example can be rewritten as follows:

        $grammar = match( a( ["hello", "world"],["howdy", "planet"]     ));

This will look for two possible alternates:

        "hello" "world"
        or
        "howdy" "planet"

Most of the Parse::Gnaw functions (such as the "a" (alternate) function) are smart enough to recognize array references and treat them as a series of grammar components meant to be addressed as a single unit.

declare subroutines for your own sub-rules

The other issue around perl list context and Parse::Gnaw occurs around perl and its "return" function in subroutines. In Parse::Gnaw, if you have a complex grammar, you can break it up into rules which take the form of perl subroutines you declare. The subroutine is some subset of the whole grammar, and must return something that the rest of the grammar can use.

This means you must be aware of perls "return" functionality when declaring subroutines as your rules.

For example, you can declare the following subroutine in any perl code:

        sub mysub { 'hello', 'there' };

When you call a perl subroutine, it will return whatever was the last expression evaluated. And in the above example, "there" is the last expression evaluated, and "there" is what mysub will return.

        my $retval = tester;
        print "retval is '$retval'\n";

        > retval is 'there'

Again this is an issue relating to perls list context. If you want a subroutine to return a list of items, then enclose them in parenthesis to force list context on the return value.

        sub fruit { ('apple', 'banana', 'pear') } # parens force list context

You could then call this subroutine and pass it to the Parse::Gnaw "a" function as a list of possible alternates.

        $grammar = match(a( fruit ));

However, if you want to declare a subroutine which is intended to contain a sequence of items, then you will want to wrap them in an anonymous array so that a single array reference is returned and is treated as a sequence by the rest of the grammar.

        # square brackets groups return value
        sub greetworld { ['hello', 'world'] }   
        sub greetplanet{ ['howdy', 'planet'] }

        $grammar = match(a( greetworld, greetplanet ));

Note that the two subroutines define a sequence of literals that get passed into the "a" function. To keep the "a" function from flattening the return value of the two subroutines, the subroutines need to group their sequences into array references.

Always keep in mind that Parse::Gnaw functions are perl functions and are susceptible to list context issues relating to the things being passed into a function as well as what is returned by the function.

As a general rule, when declaring subroutines to describe pieces of your grammar, if that subroutine contains more than one grammar component, you should wrap the return value in parenthesis if you intend to use the components in list context (a list of different alternates to pass in to "a"), or you should wrap the return value in square brackets if you intend to use the components as one sequence of components.

If the subroutine contains ONLY one Parse::Gnaw grammar component, then you do not have to wrap that component in parens or square brackets. Parse::Gnaw low level functions always return one scalar, a stitcher code reference that can be used by other Parse::Gnaw functions.

Since Parse::Gnaw grammar functions always return one scalar, you do not have to worry about forcing scalar context with square brackets. Just call the Gnaw function and perl will return the stitcher for you.

        # return value of 'a' function is a scalar, 
        # (a scalar containing a code ref that is a stitcher function)
        # therefore, subroutine "fruit" doesn't have to put 
        # the "a" function in square brackets to force scalar context on
        # return value.
        sub fruit { a('apple', 'banana', 'pear') }

summary of return values in subroutines/subrules

Always using parens or square brackets in your subroutines will show how you intend the subroutine to be used in a larger grammar.

        # parens forces return in list context. items may be used in parallel
        sub fruit { ('apple', 'pear', 'banana') } 

        # square brackets forces return in scalar context. items are seqential.
        sub greeting { ['may', 'the', 'force', 'be', 'with', 'you'] }

Do not use a wrapper (no parens, no square brackets) if subroutine only calls a single Gnaw grammar component.

        # 'name' subroutine only calls one function 'a'. don't have to wrap.
        sub name { a('Alice', 'Bob', 'Charlie') }

Now back to the grammar functions.

cc (character classes)

The Parse::Gnaw "cc" function is used to declare a character class. The input to the "cc" function is a single string which lists all the characters that are to be part of the character class. The class will match any single character that is listed in the string.

        $grammar = match( cc('0123456789') ); # any single digit. 0,1,2,3,4,5,6,7,8,9

The "cc" function recognizes the '-' character as a meta character which operates the same as the '-' in perl character classes, creating a character class with all the characters between the two characters (inclusive) separated by the '-' character.

        $grammar = match( cc('0-9') ); # any single digit. 0,1,2,3,4,5,6,7,8,9

If you want the '-' character to be part of your character class, make it the first character in the string.

        $grammar = match( cc('-aeiou') ); # a hyphen or any vowel

Character class shortcuts are provided for common character classes.

        ccdigit => 0-9
        ccword => a-zA-Z0-9_
        ccspace => \n \t \r \f or literal space

        $grammar = match( ccdigit ); # any single digit. 0,1,2,3,4,5,6,7,8,9

Note that the "cc" function and its related shortcuts are one of the functions in Parse::Gnaw that do NOT handle array references, nor do they handle multiple strings being passed into them in list context. The only valid parameter to pass to the "cc" function is a single string containing all the characters that you want in the character class.

The cc shortcuts take no parameters at all.

CC (inverted character class)

The Parse::Gnaw "CC" function is an inverted character class. The input is a single string which defines the characters that you do NOT want to be part of the character class. Note that the "CC" function is uppercase and the "cc" function is lower case.

        $grammar = match( CC('0-9') ); # any single character EXCEPT a digit

Short cuts are provided for common inverted character classes:

        CCDIGIT => anything BUT 0-9
        CCWORD => anything BUT a-zA-Z0-9_
        CCSPACE => anything BUT \n \t \r \f or literal space

        $grammar = match( CCDIGIT ); # any single character EXCEPT a digit

Note that the "CC" function has the same parameter restrictions as "cc". Only one string containing the characters defining the NOT character class. No more than a single parameter. No array references. The shortcuts take no parameters at all.

thing

The Parse::Gnaw "thing" function is a special character class that matches any single character. It is equivalent to the perl regular expression '.' character. But perl does not allow '.' to be the name of a subroutine, so I called it "thing". Part of the reason for this name was because of how it works well with quantifier shortcuts of "any" and "some", described below.

The "thing" function takes no parameters.

        $grammar=match('b', thing, 'b');

        ok($grammar->('bob')==1, "bob matches");
        ok($grammar->('bb')==0, "bb does not match");

a (alternation)

The "a" function is an alternation function. It interprets each of its parameters to be one of several possible alternates.

The following will look for one of any possible fruit in your text.

        $grammar=match(a('apple', 'banana', 'pear', 'peach'));

        ok($grammar->("she is the apple of my eye")==1, "found apple");

Because listing a bunch of alternate literals is fairly common, the "qa" function allows you to do that with fewer quotation marks. The "qa" function takes a single string, breaks it up into individual literals based on whitespace, and then treats each literal as an alternate in the grammar.

The above example could be shortened to this:

        $grammar = match(qa('apple banana pear peach'));

        ok($grammar->("she is the apple of my eye")==1, "found apple");
        

Grouping of a sequence of components into one alternate is done by putting the sequence in an array reference, see "list context" issues described above for a more detailed explanation of why this is.

The following example looks for two possible alternates. The first alternate is the literal 'hello' followed by the literal 'world'. The second alternate is the literal 'howdy' followed by 'partner'.

        $grammar = match(a( ['hello', 'world'], ['howdy', 'partner'] ));
        ok($grammar->("why howdy partner !")==1, "found alternate");

The alternates can be any sequence of grammar components wrapped in an array reference.

g and t (greedy and thrifty quantifiers)

The "g" and "t" functions are quantifier functions. They allow you to define a grammar that looks for some pattern that repeats for some number of times.

The format for a quantifier call looks like this:

        g ( [min, max(?)], grammar component(, component)* )

The first parameter to the "g" or "t" function is an array reference. This contains the parameters that control how the quantifier works. The "min" value determines the minimum number of times the pattern must match before considered a pass. The "max" value is optional. If max is not defined, it is assumed to be infinite, and the parser will attempt to fit as many patterns into memory as can fit.

The remaining one or more parameters define the pattern to which the quantifier will be applied. If more than one are supplied, they will be applied as one large sequence.

The difference between "g" and "t" is how quickly the quantifier will try to match. The "g" function is greedy and will try to match as many patterns as possible. If it fails to match, the greedy quantifier will start backing off pattern counts and try to let the rest of the grammar find a match.

The "t" function is thrifty and will try to match as few patterns as possible. If the rest of the grammar fails to match, the thrifty pattern will try to match another pattern, before trying the rest of the grammar.

        # seventeen or more 'g' or 'a' or 't' or 'c' characters.
        $grammar = match( g([17], cc('gatc')) ); 

        # one or more "word" characters
        $grammar = match( g([1], ccword) );

        # zero or more "digit" characters
        $grammar = match( g([0], ccdigit) );

        # at least 3 and up to 8 letters
        $grammar = match( g([3,8], cc('a-zA-Z')));

Note that the "g" and "t" quantifiers are functions that can accept one or more parameters passed into them. These parameters can be other grammar components, string literals, or array references for grouped bits of grammar.

        # one or more 'hello' followed by 'world'
        $grammar = match( g([1], 'hello', 'world') )

Shorcuts are provided for g([0],...) and g([1], ...), called "any" and "some" respectively

any and some functions

Because greedy quantifiers of "zero or more" and "one or more" are fairly common, shortcuts 'any' and 'some' have been provided.

        $grammar = match( any(ccdigit) ); # zero or more digits

        $grammar = match( some(ccword) ); # one or more word characters.

The "any" and "some" functions take any number of input parameters, and the parameters can be string literals, array references, and other grammar components.

        # one or more of "hello" followed by "world"
        $grammar = match( some('hello', 'world') );

anything and something functions

The character set "thing" matches any single character. The "any" and "some" functions are quantifiers of zero or more and one or more. These two functions are commonly combined, therefore a shortcut has been provided:

        anything  => g([0], thing)
        something => g([1], thing)

You can use these in your grammars to make them slightly more readable.

        # literal 'hello' followed by 1 or more characters, followed by
        # literal 'how', literal 'are', and literal 'you'
        $grammar = match( 'hello', something, 'how', 'are', 'you' );

get

The "get" function is used to capture the text that matches a specific portion of the grammar and pass it into a user defined subroutine.

The first parameter of the "get" function can be one of several possible things.

(1) a scalar reference

(2) an array reference

(3) a subroutine reference.

If the first parameter is a scalar reference, the scalar will be used to hold whatever matched the "get" function.

        my $name;

        $grammar = match('hello', get(\$name, some(ccword)), '!' );

        $grammar->('hello Alice !');

        print "name is '$name'\n";

        > name is 'Alice'

If your grammar gets more than one match while it parses, the scalar reference only holds the last match. If you use the same grammar to parse different strings, scalars are initialized to whatever value they held when the grammar was generated, and this value is overwritten only if a match occurs.

        my $name;

        $grammar = match('hello', get(\$name, some(ccword)), '!' );
        $grammar->('hello Alice !');
        $grammar->('hello Bob !');
        $grammar->('hello Charlie !');

        print "name is '$name'\n";

        > name is 'Charlie'

If the first parameter is an array reference, then every match by the "get" function will get pushed into the array. If you call the same grammar multiple times, the array is emptied at the start of parsing. Note this behaviour is different from scalars.

        my @name;

        $grammar = match(some('hello', get(\@name, some(ccword)), '!') );
        $grammar->('hello Alice ! hello Bob ! hello Charlie !');

        print "name is ". Dumper \@name;

        > name is $VAR1 = [
        >       'Alice',
        >       'Bob',
        >       'Charlie'
        > ];

If the first parameter is a subroutine reference, the parser will call this subroutine on every match and pass in as the first parameter the value of the string that matched. You may then do whatever you wish with this string inside the subroutine you define.

        my $name = sub{
                my ($string) = @_;
                # write $string to a file, or whatever
                print "called subroutine, received '$string'\n";
        };

        $grammar = match('hello', get($name, some(ccword)), '!' );

        $grammar->('hello Alice !');

        > called subroutine, received 'Alice'

The remaining parameters of the "get" function are one or more grammar components. Literals, character classes, alternations, quantifiers, array refs used for grouping sequences of grammar components, etc.

The "get" function can be nested, like nested capturing parens in perl regular expressions.

        my ($g1, $g2, $g3, $g4, $g5, $g6, $g7, $g8);

        $grammar = parse( 
                get(\$g8,
                        get(\$g1, get(\$g2, thing), get(\$g3, thing) ),
                        get(\$g4, get(\$g5, thing), get(\$g6, thing), get(\$g7, thing) ),
                ) 
        
        );

        $grammar->('abcdefghijklmnop');
        
        ok($g2 eq 'a', "g2");
        ok($g3 eq 'b', "g3");
        ok($g1 eq 'ab', "g1");
        ok($g5 eq 'c', "g5");
        ok($g6 eq 'd', "g6");
        ok($g7 eq 'e', "g7");
        ok($g4 eq 'cde', "g4");
        ok($g8 eq 'abcde', "g8");
        

c (capture)

For short grammars or regular expressions, the "get" function might be a bit of needless overhead. For those who want to save a bit of typing, the "c" function is available.

The "c" function is somewhat like "capture parens" in perl regular expressions in that it captures whatever that matches inside the parens and puts it into a numbered variable. The "c" function uses numbered variables that look like this:

        $c1 $c2 $c3 and so on.

You can have up to 99 capture variables. The variables are declared and imported into your namespace when you use the Parse::Gnaw package.

An exmaple using the C function might look like this:

        $grammar = parse( 
                c( 
                        c( c(thing),c(thing) ),
                        c( c(thing), c(thing), c(thing) ),
                )       
        );

        $grammar->('abcdefghijklmnop');

        ok($c1 eq 'abcde',      "checking c1 ");
        ok($c2 eq 'ab',         "checking c2 ");
        ok($c3 eq 'a',          "checking c3 ");
        ok($c4 eq 'b',          "checking c4 ");
        ok($c5 eq 'cde',        "checking c5 ");
        ok($c6 eq 'c',          "checking c6 ");
        ok($c7 eq 'd',          "checking c7 ");
        ok($c8 eq 'e',          "checking c8 ");
        

One significant difference between "c" numbered variables and perl regular expression capture parens is that the "c" function puts the captured results into the numbered variable that corresponds to the number of left parens that actually match the particular string being parsed.

This shows up when your grammar uses the "a" (alternation) function. In the below example, we have a grammar that uses an alternation to look for one of two possible sentences, and then capture different bits of each sentence.

        $grammar = parse( 
                a(  
                        ['hello', c('world'), 'my', 'name', 'is', c(something)],
                        ['howdy', c(something)]
                ) 
        
        );

        ok($grammar->('hello world my name is Alice')==1, "1 match");

        ok($c1 eq 'world', 'c1 is world');
        ok($c2 eq 'Alice', 'c2 is alice');

        ok($grammar->('howdy Bob')==1, "1 match");
                
        ok($c1 eq 'Bob', 'c1 is Bob');

With the first string we parse, the first alternate matches, therefore the first "c" function is assigned to "$c1" and the second "c" function is assigned to "$c2".

With the second string we parse, the second alternate matches and the c(something) function after 'howdy' is assigned to "$c1". The two "c" functions in the first alternate were not part of the correct interpretation, and therefore were not used to count the numbered variables.

now versus defer

The "now" function allows you to define a subroutine to be called every time the parser hits that part of the grammar. The code subroutine will be called whether that part of the grammar eventually matches the text being parsed or not.

The "defer" function allows you to define a subroutine to be called only if that branch of the grammar matches the text being parsed. The subroutine handed to the "defer" function is called at the end of parsing when a match occurs or when the "commit" function is called.

This example uses the "now" function:

        sub howdy { [ now(sub{print"trying howdy\n"}), 'howdy' ] }

        sub hi    { [ now(sub{print"trying hi   \n"}), 'hi' ] }

        $grammar = match(a(howdy, hi);

        $grammar->("hi");

The above example will print

        trying howdy
        trying hi

Even though the alternative "howdy" ended up failing.

The next example uses the "defer" function instead:

        sub howdy { [ defer(sub{print"trying howdy\n"}), 'howdy' ] }

        sub hi    { [ defer(sub{print"trying hi   \n"}), 'hi' ] }

        $grammar = match(a(howdy, hi);

        $grammar->("hi");

The above example with print only

        trying hi

Because "howdy" ended up failing, the defered subroutine for "howdy" did not get called.

Because the "defer" function is more likely to be useful in more grammars, the short cut to do a "defer" function is to simply pass in a subroutine to any of the other grammar functions. Any code reference will be assumed to be a "defer".

Rewriting the "defer" example above but with the shorthand would look like this:

        sub howdy { [ sub{print"trying howdy\n"}, 'howdy' ] }

        sub hi    { [ sub{print"trying hi   \n"}, 'hi' ] }

        $grammar = match(a(howdy, hi);

        $grammar->("hi");

As before, the above example with print only

        trying hi

The "defer" function is useful for doing things only when the parser knows it has a correct interpretation of the text being parsed. The "get" function is a "defer" function that has been wrapped.

The "now" function is useful for doing something even if the interpretation might fail later. One example is printing debug statements of a grammar to understand where ther parser is going as it works through a piece of text. The "commit" function is a "now" function that has been wrapped with extra features.

commit

The "commit" function is used to commit the parser to a particular interpretation of a grammar. If the parser is working a grammar and applying it to a text, and the parser hits a "commit" function, then that path in the grammar must match. If the grammar fails to match after a "commit", rather than try any other possible interpretations, the parser will quit with an error.

The "commit" function will cause any currently scheduled "defer" functions to be executed by the parser. The "commit" function can also trigger the text being parsed to be deleted up to the current character being parsed.

Use "commit" if your grammar contains a reserved keyword and you want to force the parser to interpret the text as a keyword. Use the "commit" function if you are parsing a large chunk of text and you want to be able to remove the previous text from memory.

        $grammar = match(some('keyword', commit, 'blah', 'blah', 'blah'));

EXAMPLES

trekisms

The following grammar shows how one might identify dialogue and attribute it to a particular fictional character.

The "attributeline" subroutine allowed us to bundle the functionality of the output into one spot and then easily use that functionality at several different points in the grammar.

        my $captured="";

        sub attributeline { 
                my ($name, $restofgrammar) = @_;
                my $callback = sub{$captured.=sprintf ("%8s says: %s\n", $name,shift(@_));};
                my $stitcher = get($callback, $restofgrammar);
                return $stitcher;
        }

        sub trekname { qa('Jim Captain Spock Bones Doctor Scotty') } 
        sub occupation {a('ditch digger', 'bricklayer', 'mechanic')}
        sub mccoy_job { [ql("I'm a doctor, not a"), occupation, a('!', '.')] }
        sub mccoy_diag { [ "He's", 'dead', ',', trekname, a('!', '.') ] }
        sub mccoy_rant1 { [ql('You green-blooded Vulcan'), a('!', '.') ] }
        sub mccoy_isms {
                attributeline('McCoy', a(mccoy_job, mccoy_diag, mccoy_rant1)) 
        }

        sub spock_awe {['Fascinating', ',', trekname, '.']}
        sub spock_logic {['Highly', 'illogical',',', trekname, '.']}
        sub spock_sensors { [ql("It's life ,"), trekname, ql(', but not as we know it .')]}
        sub spock_isms {
                attributeline('Spock', a(spock_awe, spock_logic, spock_sensors))
        }
        
        sub kirk_dipolomacy1 {ql('We come in peace .')}
        sub kirk_dipolomacy2 {ql('Shoot to kill .')}
        sub kirk_to_scotty {ql('I need warp speed now, Scotty !')}
        sub kirk_to_spock {ql('What is it , Spock ?')}
        sub kirk_to_bones {ql('Just fix him , Bones')}
        sub kirk_solution {ql('Activate ship self-destruct mechanism .')}
        sub kirk_isms {
                attributeline('Kirk', a(
                        kirk_dipolomacy1, 
                        kirk_dipolomacy2,
                        kirk_to_scotty,
                        kirk_to_spock,  
                        kirk_to_bones,  
                        kirk_solution
                ))
        }

        sub scotty_phy101 {ql('Ya kenna change the laws of physics .')}
        sub time_units {qa('minutes hours days weeks')}
        sub scotty_estimate {[ ql("I'll have it ready for you in three"), time_units, '.' ]}
        
        sub scotty_isms {attributeline('Scotty', a(scotty_phy101, scotty_estimate))}
        
        
        sub alien_isms {attributeline('alien', 'weeboo')}
        
        
        sub trek_isms {a(mccoy_isms, spock_isms, kirk_isms, scotty_isms, alien_isms )}
        sub trek_script {some(trek_isms)}       
        
        $grammar = parse(  trek_script );
                
        my $script = <<'SCRIPT';
        What is it, Spock?
        It's life, Jim, but not as we know it.
        We come in peace.
        weeboo
        Shoot to kill.
        weeboo
        I need warp speed now, Scotty!
        I'll have it ready for you in three minutes.
        weeboo
        I need warp speed now, Scotty!
        Ya kenna change the laws of physics.    
        weeboo
        weeboo
        Shoot to kill.
        Shoot to kill.
        I'm a doctor, not a bricklayer.
        Highly illogical, Doctor.
        You green-blooded Vulcan.
        Shoot to kill.
        Shoot to kill.
        He's dead, Jim.
        Activate ship self-destruct mechanism.
        Highly illogical, Captain.
        SCRIPT
        ;

        #print "script is '$script'\n";

        ok($grammar->( $script )==1, "1 match");

        my $expected =  <<'EXPECTED';
            Kirk says: What is it, Spock?
           Spock says: It's life, Jim, but not as we know it.
            Kirk says: We come in peace.
           alien says: weeboo
            Kirk says: Shoot to kill.
           alien says: weeboo
            Kirk says: I need warp speed now, Scotty!
          Scotty says: I'll have it ready for you in three minutes.
           alien says: weeboo
            Kirk says: I need warp speed now, Scotty!
          Scotty says: Ya kenna change the laws of physics.
           alien says: weeboo
           alien says: weeboo
            Kirk says: Shoot to kill.
            Kirk says: Shoot to kill.
           McCoy says: I'm a doctor, not a bricklayer.
           Spock says: Highly illogical, Doctor.
           McCoy says: You green-blooded Vulcan.
            Kirk says: Shoot to kill.
            Kirk says: Shoot to kill.
           McCoy says: He's dead, Jim.
            Kirk says: Activate ship self-destruct mechanism.
           Spock says: Highly illogical, Captain.
        EXPECTED
        ;
        
                
        
        ok($captured eq $expected, "checking captured string matches expected");

The above example is included as a test in the t directory of the tarball from CPAN.

NOT IMPLEMENTED YET

recursive

The "recursive" function has not been implemented yet.

If you define a grammar which is recursive, you must wrap the recursive call in the "recursive" function.

If you declare a recursive grammar without using a "recursive" function, then perl will recursively call the subroutine until it dies a horrible death.

        sub expression;
        sub expression{ a([ '(' expression ')'], 'a+b') }

        # this call to "expression" will recursively call itself
        # until perl crashes.
        my $grammar = match(expression); 

The above example will crash. Note that it crashes during the "generate" stage of creating the grammar. The grammar isnt even being used to try to match a string yet. The mere act of declaring a recursive grammar without the "recursive" function will cause a crash.

What you need to do is wrap any recursive call to a rule in the "recursive" function.

        sub expression;
        sub expression{ a([ '(' recursive(\&expression) ')'], 'a+b') }

        my $grammar = match(expression);

INTERNALS

These are notes to developers who may need to understand how the internals of Parse::Gnaw operate so they can create advanced grammars.

Also, these are notes to myself to remember how things function.

Text being parsed.

The text being parsed is held internally in a linked list. This allows Parse::Gnaw to operate on a small subset of the entire string being parsed. Theoretically, Parse::Gnaw can parse an infinite amount of text as long as it can be parsed in chunks small enough that the rules can disambiguate text that fits in memory.

Each element in the linked list is an array.

Array stores the following data:

        0: numeric indicator of what type of element this is
        1: payload
        2: previous element pointer
        3: next element pointer
        4: used for debugging

There are several "types" of elements as indicated by the value in index 0 of an element:

        0: element has been deleted. If code tries to use 
                a deleted element, this should act as a 
                flag that something went wrong
        1: element holds a single text character. 
                letter is stored in payload index.
        2: element holds a "marker". Markers are placed between
                text elements by the parser to keep track of
                where it is in the code, where captured strings
                are located, where to fallback to in the text
                if a particular branch of a parse fails, etc.
                A marker contains no payload.
        3: element contains a callback to be executed on
                successful completion of parsing or a commit.
                Payload contains code ref to callback.
        4: element is a head/tail endpoint

The linked list starts and ends with two signpost elements: 'head' and 'tail'. No text is ever stored in these signpost elements. Any text being parsed is always contained between these two signpost elements.

The 'current' pointer always points to the text that is about to be processed.

When the __gnaw__parse_failed function is called, it will pop the last fallback position off the fallback array, which will change the next instruction to be executed and will also move the text pointer back to teh fallback marker. The last thing the __gnaw__parse_failed function does is delete the fallback marker and any other elements in the text linked list that are not text.

When any instruction moves the text pointer forward, it will also delete any old markers it encounters.

The current pointer may point to the tail signpost element. This may mean that we are at the end of the string, or it may mean that the next time we try to get the current letter that we need to insert more text from a file or something.

All interfaces to the linked list should be through the subroutines designed for working with the linked list. These start with the prefix "__gnaw__" to help prevent subroutine colisions.

All parse level commands should manipulate text based on markers.

No user-level or grammar-level commands are currently supported. You can always call the __gnaw__* functions, but they are currently subject to change.

Generating the Parser

tbd

AUTHOR

Greg London, <email at greglondon.com>

BUGS

Please report any bugs or feature requests to bug-parse-gnaw at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Parse-Gnaw. I will be notified, and then you will automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Parse::Gnaw

You can also look for information at:

ACKNOWLEDGEMENTS

COPYRIGHT & LICENSE

Copyright 2008 Greg London, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.