The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Gnaw - Define parse grammars using perl subroutine calls. No intermediate grammar languages.

VERSION

Version 0.10

SYNOPSIS

Gnaw is a perl module which implements full regular expressions and full text parsing grammars using nothing but pure perl code limited to subroutine closures, exception trapping via eval, and basic perl variables such as scalars, hashes, and arrays.

You write your grammar in pure perl. There is no intermediate "parser language" that then gets interpreted into something executable.

When you do a "use Gnaw", the Gnaw module will import a number of functions directly into your namespace. Yes, this is completely bad form for normal modules. But this is not a normal module. The imported subroutines include regular expression and parsing equivalents for matching, quantifiers, literals, alternations, character classes, and so on. You build up your grammar by calling these functions. The final call will return a code reference. This code reference is your grammar.

When you dereference that grammar, if it is a "match" grammar (i.e. $string =~ m//) then you pass in the string you want to parse.

        use Gnaw;

        # create the grammar
        my $grammar = match(lit('hello'));

        # apply the grammar to a string
        if($grammar->('hello world')) {
                print "match\n";
        } else {
                print "no match";
        }

Please note that this is a beta release. This is more of a proof of concept than something ready for production code or for massive grammars. The interfaces may change completely in the future. When the interfaces have settled, I will release this as a version 1.0+ module. Until then, please do not use this to develop some gigantic parser when the grammar may have to completely change.

EXPORT

Currently, EVERYTHING is exported into the callers namespace. Yes, I know, bad programmer. No biscuit.

The problem is I did not want grammars to be weighed down with a bunch of package prefixes at every step.

        my $grammar = Gnaw::match(Gnaw::alternation(Gnaw::lit('hello'),Gnaw::lit('howdy'));

Gets rather tedious.

Hopefully, no one will have collisions with subroutine names. If they do, then it might make sense to create a new package, declare the grammar in there, then simply pull out the code reference and use that in the rest of your code.

Again, everything may change tomorrow as I sort this out.

FUNCTIONS

match

This is equivalent to the "m" part of a perl regexp of $string\=~m//. The match function takes a grammar and attempts to find the first match within the string. If a match is found, the function returns true (1), else it returns false (0). The match function takes a series() of grammar components such as lit, set, quantifier, etc. The match function returns a coderef to the grammar. The "match" function should have no other grammar components outside of it. When calling the grammar, dereference the coderef returned by "match" and pass it the string you want to apply to the grammar.

        # create the grammar
        my $grammar = match(lit('hello'));
        
        # apply the grammar to a string
        if($grammar->('hello world')) {
                print "match\n";
        } else {
                print "no match";
        }

series

The "series" function is a gnaw grammar component which takes a series of other grammar components. This is the only way to define a grammar with one component occurring after another. The "series" function takes a series of other grammar components and returns a coderef to that portion of the grammar.

The "series" function returns a coderef that is used in part of a larger grammar.

        # look for a series of two literals, "hello" followed by "world"
        my $grammar = match( series(lit('hello'), lit('world')) );
        
        # apply the grammar to a string
        if($grammar->('hello world')) {
                print "match\n";
        } else {
                print "no match";       
        }       

lit

The "lit" function is a gnaw grammar component which applies a literal string value to the string being parsed. The literal value may be a single character or more than one character.

The "lit" function returns a coderef that is used in part of a larger grammar.

        # look for the literal string "hello" in the string being parsed
        my $grammar = match(lit('hello'));
        
        # apply the grammar to a string
        if($grammar->('hello world')) {
                print "match\n";
        } else {
                print "no match";
        }

set

The "set" function is a gnaw grammar component which applies a character class or character set to the string being parsed. Since "class" is a perl reserved word, gnaw uses the word "set" for character set. The "set" function takes a string which describes the character class. The "set" function parses one metacharacter within the string and that is a '-' character. This is used to define a range of charcters. All digits can be described as "0-9". All letters can be described as "a-zA-Z". If you want the "-" character to be part of the set itself, make it the first character in the string you pass into set.

The "set" function returns a coderef that is used in part of a larger grammar.

        # look for an x, y, or z within the string being parsed
        my $grammar = match(set("xyz"));
        
        # apply the grammar to a string
        if($grammar->('hello world')) {
                print "match\n";
        } else {
                print "no match";
        }

SET

The "SET" function is a gnaw grammar component which applies a NEGATIVE character class or NEGATIVE character set to the string being parsed. The "SET" function takes a string which describes the character class. The "SET" function parses one metacharacter within the string and that is a '-' character. This is used to define a range of charcters. All digits can be described as "0-9". All letters can be described as "a-zA-Z". If you want the "-" character to be part of the set itself, make it the first character in the string you pass into set.

The "SET" function returns a coderef that is used in part of a larger grammar.

        # look for anything other than an x, y, or z in the string being parsed.
        my $grammar = match(SET("xyz"));
        
        # apply the grammar to a string
        if($grammar->('hello world')) {
                print "match\n";
        } else {
                print "no match";
        }

set_digit

The "set_digit" is a shortcut equivalent to set('0-9').

SET_DIGIT

The "SET_DIGIT" is a shortcut equivalent to SET('0-9'). i.e. [^0-9]

set_whitespace

The "set_whitespace" is a shortcut equivalent to set("\t\n\r\f").

SET_WHITESPACE

The "SET_WHITESPACE" is a shortcut equivalent to SET("\t\n\r\f"). i.e. [^\t\n\r\f]

set_identifier

The "set_identifier" is a shortcut equivalent to set('a-zA-Z0-9_').

SET_IDENTIFIER

The "SET_IDENTIFIER" is a shortcut equivalent to SET('a-zA-Z0-9_'). i.e. [^a-zA-Z0-9_]

thing

The "thing" function is a gnaw grammar component which matches any single character in the string being parsed. It is equivalent to the '.' operator in normal regular expression format. I would have called it "character" but that is a bit long and "char" is usually a reserved word.

        my $grammar = match(lit('b'), thing, lit('b'));

        # these will all match.
        $grammar->("bob");
        $grammar->("bib");
        $grammar->("bub");

        # this will fail to match.
        $grammar->("bb");

The "thing" function returns a coderef that is used in part of a larger grammar.

alternation

The "alternation" function is a gnaw grammar component which applies one of several possible alternatives to the string being parsed. The "alternation" function will attempt each possible alternative in the order it is passed into the function as a parameter. Each alternative must be a single command.

        # look for people we know
        my $grammar = match(alternation(lit('alice'), lit('bob')));
        
        # apply the grammar to a string
        if($grammar->('hello alice')) {
                print "match\n";
        } else {
                print "no match";
        }

If an alternative needs to be made of more than one grammar command, either bundle them together using a "series" function, or create a named subroutine that will act as a named rule for your grammar, and call that subroutine as one of your alternates.

        # one alternative will be a series of two literals, 'hello' followed by 'world'.
        # create a subroutine that will contain this rule.
        sub greet_all { series(lit('hello'), lit('world'));}
        
        # another alternative will be a series of two literals, "howdy" followed by "partner"
        sub greet_one { series(lit('howdy'), lit('partner'));}
        
        # look for either greeting      
        my $grammar = match(alternation(greet_all, greet_one));

        # apply the grammar to a string
        if($grammar->('hello world')) {
                print "match\n";
        } else {
                print "no match";
        }

The "alternation" function returns a coderef that is used in part of a larger grammar.

quantifier

The "quantifier" function is a gnaw grammar component which receives a single grammar component and attempts to apply that command repeatedly to the string being parsed. The parameters passed into the "quantifier" function are defined as follows:

quantifier( thrifty_or_greedy , single_grammar_component , minimum , maximum );

The thrifty_or_greedy parameter is a 't' or a 'g' to indicate whether the single command is applied as a thrifty or greedy quantifier. It is an optional parameter and may be skipped. If no thrifty_or_greedy parameter is provided, the quantifier will assume greedy.

The single_grammar_component is a single gnaw grammar component. If the quantifier is to be applied to more than one command in a series, either bundle those commands up in a series() function or bundle them up in a named subroutine that can act as a separate rule.

The minimum parameter is the minimum number of times the component must successfully apply to the string being parsed for the grammar to be considered successful. The smallest legal value for "minimum" is zero. You must provide a minimum value of some kind.

The maximum parameter is the maximum number of times the component must successfully apply to the string being parsed for the grammar to be considered successful. The maximum value is optional. if no maximum value is provided, the parser will try as many as possible.

The "quantifier" function also allows some shortcuts instead of the numeric "minimum" and "maximum" parameters.

'*' means "0 or more"

's' and '+' means "1 or more"

        # look for 3 to 7 letter 'a'. Use a thrifty search.
        my $grammar = match(quantifier('t', lit('a'), 3,7));

        # look for 3 or more letter 'a'. still thrifty.
        my $grammar = match(quantifier('t', lit('a'), 3));

        # look for 3 or more letter 'a'. greedy search.
        my $grammar = match(quantifier('g', lit('a'), 3));

        # look for 3 or more letter 'a'. greedy search.
        my $grammar = match(quantifier(lit('a'), 3));

        # look for 1 or more letter 'a'. greedy search.
        my $grammar = match(quantifier(lit('a'), 's'));

        # look for 1 or more letter 'a'. greedy search.
        my $grammar = match(quantifier(lit('a'), '+'));

        # look for 0 or more letter 'a'. greedy search.
        my $grammar = match(quantifier(lit('a'), '*'));

The "quantifier" function returns a coderef that is used in part of a larger grammar.

thrifty

The "thrifty" function is a shortcut to the "quantifier" function with the thrifty/greedy parameter forced to thrifty.

greedy

The "greedy" function is a shortcut to the "quantifier" function with the thrifty/greedy parameter forced to greedy.

some

The "some" function is a shortcut to a "quantifier" function set to greedy, and the quantity set to "1 or more".

any

The "any" function is a shortcut to a "quantifier" function set to greedy, and the quantity set to "0 or more".

something

The "something" function is a shortcut to a "quantifier" function set to greedy, the quantity set to "1 or more", and the command set to "thing". This is equivalent to the '.+' operator in the usual regular expression syntax.

anything

The "anything" function is a shortcut to a "quantifier" function set to greedy, the quantity set to "0 or more", and the command set to "thing". This is equivalent to the '.+' operator in the usual regular expression syntax.

callback

If you want to call a user-defined callback any time the parser hits a specific point of the grammar, simply insert a reference to a subroutine in that location in the grammar and it will be called every time the parser hits that location, whether the grammar ends up succeeding later or not.

        # want between 7 and 9 letter 'a'. and every time we try, print the letter "X".
        my $grammar = match(quantifier('t', series(lit('a'), sub{print"X\n";}) , 7,9));

        # this will not match, but it will print out an "X" for every time 
        # "quantifier" tried to match before the parser fails.
        $grammar->('aaaaa');

The "callback" function takes a single code reference to any user-defined subroutine and calls that coderef only if the parser succeeds. Success is defined as either (1) reaching the end of the grammar and successfully matching the string being parsed or (2) the grammar executes a "commit" function, which is defined later.

Note that "callback" is a scheduled call and only makes the actual call when grammar succeeds.

        # look for a series of two literals, "hello" followed by "world", 
        my $grammar = match(greedy( series(lit('a'), callback(sub{print"X\n";})), 7,9) );

        # this will fail to match and no callback will be called.
        $grammar->('aaaaa');

        # this will match and all the callbacks will be called at the end.
        $grammar->('aaaaaaaaaaaaaaaa');

capture

The "capture" function is a specialized version of "callback". The user passes two parameters into "capture", (1) a grammar component and (2) a callback, a code reference to a user-defined subroutine which will be called when the grammar succeeds.

The callback will receive a copy of the string containing whatever was captured within the given grammar component. This will be passed via the @_ variable.

        my $capture_callback = sub{
                my ($string) = @_;
                print "captured string is '$string'\n";
        };


        # capture the first "a" we find.
        my $grammar = match(capture(lit('a'), $capture_callback)));

        # this will not match, the capture callback will not be called.
        $grammar->('xxxxx');

        # this will match and the capture callback will be called upon success.
        $grammar->('aaaaa');

Note that "capture" creates a scheduled call and only makes the actual call when the grammar succeeds.

consumable

The "consumable" function is useful for parsing large portions of text. The "consumable" function receives a single grammar component and whenever that component successfully parses, even though the full grammar has not finished, then "consumable" will remove any internal data structures related to the command. If there are any callbacks scheduled inside of that command, they will be executed before being deleted.

If you are parsing or matching a small block of text, then "consumable" might not be useful for your application.

If you have a hierarchical grammar and all the subrules from some level on down are wrapped as "consumable", then theoretically, that grammar could parse an infinite amount of text without running out of memory. As long as the consumable subrules match blocks of text that fit in memory, then the parser will operate on one consumable block, and then delete it from memory when that block matches.

Note that "consumable" acts as an immediately executed code reference rather than a scheduled callback in the sense that it gets called as soon as the single grammar command it contains succeeds.

If the grammar later fails and part of the reason was because that consumable portion did not actually match, then the parser will not be able to retry the grammar from prior to the consumable portion. That portion is gone.

When the single grammar command succeeds, the "consumable" function deletes the text currently matching that single grammar command and it deletes the portion of the call tree that corresponds to everything executed as part of that single grammar component.

The "consumable" function returns a coderef that is used in part of a larger grammar.

AUTHOR

Greg London, <email at greglondon.com>

BUGS

Please note this is a beta release. Do not use this for production code. This is a proof-of-concept piece of code, and the actual interface is still completely up in the air. Do not create any massively long grammars with this parser as the parser rules themselves may change completely.

If you do find bugs, please be kind.

Please report any bugs or feature requests to bug-gnaw at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Gnaw. I will be notified, and then you will automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Gnaw

You can also look for information at:

ACKNOWLEDGEMENTS

COPYRIGHT & LICENSE

Copyright 2008 Greg London, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.