The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Gnaw - The great new Gnaw!

VERSION

Version 0.01

SYNOPSIS

Gnaw is a perl module which implements full regular expressions and full text parsing grammars using nothing but pure perl code limited to subroutine closures, exception trapping via eval, and basic perl variables such as scalars, hashes, and arrays.

You write your grammar in pure perl. There is no intermediate "parser language" that then gets interpreted into something executable.

When you do a "use Gnaw", the Gnaw module will import a number of functions directly into your namespace. Yes, this is completely bad form for normal modules. But this is not a normal module. The imported subroutines include regular expression and parsing equivalents for matching, quantifiers, literals, alternations, character classes, and so on. You build up your grammar by calling these functions. The final call will return a code reference. This code reference is your grammar.

When you dereference that grammar, if it is a "match" grammar (i.e. $string =~ m//) then you pass in the string you want to parse.

        use Gnaw;

        # create the grammar
        my $grammar = match(lit('hello'));

        # apply the grammar to a string
        if($grammar->('hello world')) {
                print "match\n";
        } else {
                print "no match";
        }

Please note that this is a beta release. This is more of a proof of concept than something ready for production code or for massive grammars. The interfaces may change completely in the future. When the interfaces have settled, I will release this as a version 1.0+ module. Until then, please do not use this to develop some gigantic parser when the grammar may have to completely change.

EXPORT

Currently, EVERYTHING is exported into the callers namespace. Yes, I know, bad programmer. No biscuit.

The problem is I did not want grammars to be weighed down with a bunch of package prefixes at every step.

        my $grammar = Gnaw::match(Gnaw::lit('hello'));

Gets rather tedious.

Hopefully, no one will have collisions with subroutine names. If they do, then it might make sense to create a new package, declare the grammar in there, then simply pull out the code reference and use that in the rest of your code.

Again, everything may change tomorrow as I sort this out.

FUNCTIONS

match This is equivalent to the "m" part of a perl regexp of $string=~m//. The match function takes a grammar and attempts to find the first match within the string. If a match is found, the function returns true (1), else it returns false (0). The match function takes a series() of grammar components such as lit, set, quantifier, etc. The match function returns a coderef to the grammar. The "match" function should have no other grammar components outside of it. When calling the grammar, dereference the coderef returned by "match" and pass it the string you want to apply to the grammar.

# create the grammar my $grammar = match(lit('hello'));

# apply the grammar to a string if($grammar->('hello world')) { print "match\n"; } else { print "no match"; }

series The "series" function is a gnaw grammar component which takes a series of other grammar components. This is the only way to define a grammar with one component occurring after another. The "series" function takes a series of other grammar components and returns a coderef to that portion of the grammar.

The "series" function returns a coderef that is used in part of a larger grammar.

# look for a series of two literals, "hello" followed by "world", in the string being parsed. my $grammar = match( series(lit('hello'), lit('world')) );

# apply the grammar to a string if($grammar->('hello world')) { print "match\n"; } else { print "no match"; }

lit The "lit" function is a gnaw grammar component which applies a literal string value to the string being parsed. The literal value may be a single character or more than one character.

The "lit" function returns a coderef that is used in part of a larger grammar.

# look for the literal string "hello" in the string being parsed my $grammar = match(lit('hello'));

# apply the grammar to a string if($grammar->('hello world')) { print "match\n"; } else { print "no match"; }

set The "set" function is a gnaw grammar component which applies a character class or character set to the string being parsed. Since "class" is a perl reserved word, gnaw uses the word "set" for character set. The "set" function takes a string which describes the character class. The "set" function parses one metacharacter within the string and that is a '-' character. This is used to define a range of charcters. All digits can be described as "0-9". All letters can be described as "a-zA-Z". If you want the "-" character to be part of the set itself, make it the first character in the string you pass into set.

The "set" function returns a coderef that is used in part of a larger grammar.

# look for an x, y, or z within the string being parsed my $grammar = match(set("xyz"));

# apply the grammar to a string if($grammar->('hello world')) { print "match\n"; } else { print "no match"; }

set_n The "set_n" function is a gnaw grammar component which applies a NEGATIVE character class or NEGATIVE character set to the string being parsed. The "set_n" function takes a string which describes the character class. The "set_n" function parses one metacharacter within the string and that is a '-' character. This is used to define a range of charcters. All digits can be described as "0-9". All letters can be described as "a-zA-Z". If you want the "-" character to be part of the set itself, make it the first character in the string you pass into set.

The "set_n" function returns a coderef that is used in part of a larger grammar.

# look for anything other than an x, y, or z in the string being parsed. my $grammar = match(set_n("xyz"));

# apply the grammar to a string if($grammar->('hello world')) { print "match\n"; } else { print "no match"; }

set_digit The "set_digit" is a shortcut equivalent to set('0-9').

set_DIGIT The "set_digit" is a shortcut equivalent to set_n('0-9').

set_whitespace The "set_whitespace" is a shortcut equivalent to set("\t\n\r\f").

set_WHITESPACE The "set_WHITESPACE" is a shortcut equivalent to set_n("\t\n\r\f").

set_identifier The "set_identifier" is a shortcut equivalent to set('a-zA-Z0-9_').

set_IDENTIFIER The "set_IDENTIFIER" is a shortcut equivalent to set_n('a-zA-Z0-9_').

alternation The "alternation" function is a gnaw grammar component which applies one of several possible alternatives to the string being parsed. The "alternation" function will attempt each possible alternative in the order it is passed into the function as a parameter. Each alternative must be a single command.

# look for people we know my $grammar = match(alternation(lit('alice'), lit('bob')));

# apply the grammar to a string if($grammar->('hello alice')) { print "match\n"; } else { print "no match"; }

If an alternative needs to be made of more than one grammar command, either bundle them together using a "series" function, or create a named subroutine that will act as a named rule for your grammar, and call that subroutine as one of your alternates.

# one alternative will be a series of two literals, 'hello' followed by 'world'. # create a subroutine that will contain this rule. sub greet_all { series(lit('hello'), lit('world'));}

# another alternative will be a series of two literals, "howdy" followed by "partner" sub greet_one { series(lit('howdy'), lit('partner'));}

# look for either greeting my $grammar = match(alternation(greet_all, greet_one));

# apply the grammar to a string if($grammar->('hello world')) { print "match\n"; } else { print "no match"; }

The "alternation" function returns a coderef that is used in part of a larger grammar.

quantifier The "quantifier" function is a gnaw grammar component which receives a single grammar component and attempts to apply that command repeatedly to the string being parsed. The parameters passed into the "quantifier" function are defined as follows:

quantifier( thrifty_or_greedy , single_grammar_component , minimum , maximum );

The thrifty_or_greedy parameter is a 't' or a 'g' to indicate whether the single command is applied as a thrifty or greedy quantifier. It is an optional parameter and may be skipped.

The single_grammar_component is a single gnaw grammar component. If the quantifier is to be applied to more than one command in a series, either bundle those commands up in a series() function or bundle them up in a named subroutine that can act as a separate rule.

The minimum parameter is the minimum number of times the component must successfully apply to the string being parsed for the grammar to be considered successful. The smallest legal value for "minimum" is zero. You must provide a minimum value of some kind.

The maximum parameter is the maximum number of times the component must successfully apply to the string being parsed for the grammar to be considered successful. The maximum value is optional. if no maximum value is provided, the parser will try as many as possible.

The "quantifier" function also allows some shortcuts instead of the numeric "minimum" and "maximum" parameters.

'*' means "0 or more" 's' and '+' means "1 or more"

# look for 3 to 7 letter 'a'. Use a thrifty search. my $grammar = match(quantifier('t', lit('a'), 3,7));

# look for 3 or more letter 'a'. still thrifty. my $grammar = match(quantifier('t', lit('a'), 3));

# look for 3 or more letter 'a'. greedy search. my $grammar = match(quantifier('g', lit('a'), 3));

# look for 3 or more letter 'a'. greedy search. my $grammar = match(quantifier(lit('a'), 3));

# look for 1 or more letter 'a'. greedy search. my $grammar = match(quantifier(lit('a'), 's'));

# look for 1 or more letter 'a'. greedy search. my $grammar = match(quantifier(lit('a'), '+'));

# look for 0 or more letter 'a'. greedy search. my $grammar = match(quantifier(lit('a'), '*'));

The "quantifier" function returns a coderef that is used in part of a larger grammar.

thrifty The "thrifty" function is a shortcut to the "quantifier" function with the thrifty/greedy parameter forced to thrifty.

greedy The "greedy" function is a shortcut to the "quantifier" function with the thrifty/greedy parameter forced to greedy.

callback If you want to call a user-defined callback any time the parser hits a specific point of the grammar, simply insert a reference to a subroutine in that location in the grammar and it will be called every time the parser hits that location, whether the grammar ends up succeeding later or not.

# want between 7 and 9 letter 'a'. and every time we try, print the letter "X". my $grammar = match(quantifier('t', series(lit('a'), sub{print"X\n";}) , 7,9));

# this will not match, but it will print out an "X" for every time "quantifier" tried to match before the parser fails. $grammar->('aaaaa');

The "callback" function takes a single code reference to any user-defined subroutine and calls that coderef only if the parser succeeds. Success is defined as either (1) reaching the end of the grammar and successfully matching the string being parsed or (2) the grammar executes a "commit" function, which is defined later.

Note that "callback" is a scheduled call and only makes the actual call when grammar succeeds.

# look for a series of two literals, "hello" followed by "world", in the string being parsed. my $grammar = match(greedy( series(lit('a'), callback(sub{print"X\n";})), 7,9) );

# this will fail to match and no callback will be called. $grammar->('aaaaa');

# this will match and all the callbacks will be called at the end. $grammar->('aaaaaaaaaaaaaaaa');

capture The "capture" function is a specialized version of "callback". The user passes two parameters into "capture", (1) a grammar component and (2) a callback, a code reference to a user-defined subroutine which will be called when the grammar succeeds.

The callback will receive a copy of the string containing whatever was captured within the given grammar component. This will be passed via the @_ variable.

my $capture_callback = sub{ my ($string) = @_; print "captured string is '$string'\n"; };

# capture the first "a" we find. my $grammar = match(capture(lit('a', $capture_callback));

# this will not match, the capture callback will not be called. $grammar->('xxxxx');

# this will match and the capture callback will be called upon success. $grammar->('aaaaa');

Note that "capture" creates a scheduled call and only makes the actual call when the grammar succeeds.

capture NOTE: THE "commit" FUNCTION DOES NOT WORK RIGHT NOW.

In fact, it makes the parser go all wonky. I believe to get "commit" to work I will have to rewrite a bunch of other bits in the parser. But, when "commit" does hopefully work, this is what it will do:

xxxxxxxxxxxxxxxxxxxxxxxxx

The "commit" function can be placed anywhere within a grammar where the user has decided that once the parser has reached this location, no other interpretation of the string is possible. The parser has found the correct parse of the string to this point.

The "commit" function will cause any callbacks and any captures to execute upon reaching that point in the grammar.

The "commit" function also causes a number of things to occur behind the scenes which mean that un-committing is not possible. Internal data structures which have accumulated during the parsing of the string will be deleted. If you have a "commit" function in your grammar, if your parser hits that point, you can not uncommit at a later time.

# the callback will schedule the "x" to be printed on success, # but the "commit" function immediately after will be treated as a success # and cause the callback to be called. my $grammar = match(greedy( series(lit('a'), callback(sub{print"X\n";}), commit) 7,9) );

# this will fail to match but the callbacks will be called because "commit" forces them. $grammar->('aaaaa');

AUTHOR

Greg London, <email at greglondon.com>

BUGS

Please note this is a beta release. Do not use this for production code. This is a proof-of-concept piece of code, and the actual interface is still completely up in the air. Do not create any massively long grammars with this parser as the parser rules themselves may change completely.

If you do find bugs, please be kind.

Please report any bugs or feature requests to bug-gnaw at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Gnaw. I will be notified, and then you will automatically be notified of progress on your bug as I make changes.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc Gnaw

You can also look for information at:

ACKNOWLEDGEMENTS

COPYRIGHT & LICENSE

Copyright 2008 Greg London, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.