Ron Savage
and 1 contributors

NAME

Regexp::Parsertron - Parse a Perl regexp into a data structure of type Tree

Warning: Development version. See "Version Numbers" for details.

Synopsis

Sample Code

This is scripts/synopsis.pl:

        #!/usr/bin/env perl

        use v5.10;
        use strict;
        use warnings;

        use Regexp::Parsertron;

        # ---------------------

        my($re)     = qr/Perl|JavaScript/i;
        my($parser) = Regexp::Parsertron -> new(verbose => 1);

        # Return 0 for success and 1 for failure.

        my($result)  = $parser -> parse(re => $re);
        my($node_id) = 5; # Obtained from displaying and inspecting the tree.

        print "Calling append(text => '|C++', uid => $node_id) \n";

        $parser -> append(text => '|C++', uid => $node_id);
        $parser -> print_raw_tree;
        $parser -> print_cooked_tree;

        my($as_string) = $parser -> as_string;

        print "Original:    $re. Result: $result (0 is success) \n";
        print "as_string(): $as_string \n";

        $result = $parser -> validate;

        print "validate():  Result: $result (0 is success) \n";

        # Return 0 for success and 1 for failure.

        $parser -> reset;
        $parser -> verbose(0);

        $re     = qr/Perl|JavaScript|(?:Flub|BCPL)/i;
        $result = $parser -> parse(re => $re);

        print "\nAdd complexity to the regexp by parsing a new regexp \n";

        $parser -> print_raw_tree;

And its output:

        Test count: 1. Parsing (in qr/.../ form): '(?^i:Perl|JavaScript)'.
        Root. Attributes: {text => "Root", uid => "0"}
            |--- open_parenthesis. Attributes: {text => "(", uid => "1"}
            |    |--- query_caret. Attributes: {text => "?^", uid => "2"}
            |    |--- flag_set. Attributes: {text => "i", uid => "3"}
            |    |--- colon. Attributes: {text => ":", uid => "4"}
            |    |--- string. Attributes: {text => "Perl|JavaScript", uid => "5"}
            |--- close_parenthesis. Attributes: {text => ")", uid => "6"}

        Calling append(text => '|C++', uid => 5)
        Root. Attributes: {text => "Root", uid => "0"}
            |--- open_parenthesis. Attributes: {text => "(", uid => "1"}
            |    |--- query_caret. Attributes: {text => "?^", uid => "2"}
            |    |--- flag_set. Attributes: {text => "i", uid => "3"}
            |    |--- colon. Attributes: {text => ":", uid => "4"}
            |    |--- string. Attributes: {text => "Perl|JavaScript|C++", uid => "5"}
            |--- close_parenthesis. Attributes: {text => ")", uid => "6"}

        Name                            Uid  Text
        ----                            ---  ----
        open_parenthesis                  1  (
        query_caret                       2  ?^
        flag_set                          3  i
        colon                             4  :
        string                            5  Perl|JavaScript|C++
        close_parenthesis                 6  )
        Original:    (?^i:Perl|JavaScript). Result: 0 (0 is success)
        as_string(): (?^i:Perl|JavaScript|C++)
        validate():  Result: 0 (0 is success)

        Adding complexity to the regexp by parsing a new regexp:
        Root. Attributes: {text => "Root", uid => "0"}
            |--- open_parenthesis. Attributes: {text => "(", uid => "1"}
            |    |--- query_caret. Attributes: {text => "?^", uid => "2"}
            |    |--- flag_set. Attributes: {text => "i", uid => "3"}
            |    |--- colon. Attributes: {text => ":", uid => "4"}
            |    |--- string. Attributes: {text => "Perl|JavaScript|", uid => "5"}
            |    |--- colon_prefix. Attributes: {text => "(?:", uid => "6"}
            |    |    |--- string. Attributes: {text => "Flub|BCPL", uid => "7"}
            |    |--- close_parenthesis. Attributes: {text => ")", uid => "8"}
            |--- close_parenthesis. Attributes: {text => ")", uid => "9"}

Note: The 1st tree is printed due to verbose => 1 in the call to "new([%opts])", while the 2nd is due to the call to "print_raw_tree()". The columnar output is due to the call to "print_cooked_tree()".

Tutorial

o Start with a simple program and a simple regexp

This code, scripts/tutorial.pl, is a cut-down version of scripts/synopsis.pl:

        #!/usr/bin/env perl

        use v5.10;
        use strict;
        use warnings;

        use Regexp::Parsertron;

        # ---------------------

        my($re)     = qr/Perl|JavaScript/i;
        my($parser) = Regexp::Parsertron -> new(verbose => 1);

        # Return 0 for success and 1 for failure.

        my($result) = $parser -> parse(re => $re);

        print "Original:  $re. Result: $result. (0 is success) \n";

Running it outputs:

        Test count: 1. Parsing (in qr/.../ form): '(?^i:Perl|JavaScript)'.
        Root. Attributes: {text => "Root", uid => "0"}
            |--- open_parenthesis. Attributes: {text => "(", uid => "1"}
            |    |--- query_caret. Attributes: {text => "?^", uid => "2"}
            |    |--- flag_set. Attributes: {text => "i", uid => "3"}
            |    |--- colon. Attributes: {text => ":", uid => "4"}
            |    |--- string. Attributes: {text => "Perl|JavaScript", uid => "5"}
            |--- close_parenthesis. Attributes: {text => ")", uid => "6"}

        Original:  (?^i:Perl|JavaScript). Result: 0. (0 is success)
o Examine the tree and determine which nodes you wish to edit

The nodes are uniquely identified by their uids.

o Proceed as does scripts/synopsis.pl

Add these lines to the end of the tutorial code, and re-run:

        my($node_id) = 5; # Obtained from displaying and inspecting the tree.

        $parser -> append(text => '|C++', uid => $node_id);
        $parser -> print_raw_tree;

The extra output, showing the change to node uid == 5, is:

        Root. Attributes: {text => "Root", uid => "0"}
            |--- open_parenthesis. Attributes: {text => "(", uid => "1"}
            |    |--- query_caret. Attributes: {text => "?^", uid => "2"}
            |    |--- flag_set. Attributes: {text => "i", uid => "3"}
            |    |--- colon. Attributes: {text => ":", uid => "4"}
            |    |--- string. Attributes: {text => "Perl|JavaScript|C++", uid => "5"}
            |--- close_parenthesis. Attributes: {text => ")", uid => "6"}
o Test also with "prepend(%opts)" and "set(%opts)"

See t/get.set.t for sample code.

o Since everything works, make a cup of tea

The Edit Methods

The edit methods simply means any one or more of these methods, which can all change the text of a node:

o "append(%opts)"
o "prepend(%opts)"
o "set(%opts)"

The edit methods are exercised in t/get.set.t, as well as scripts/synopsis.pl (above).

Description

Parses a regexp into a tree object managed by the Tree module, and provides various methods for updating and retrieving that tree's contents.

This module uses Marpa::R2 and Moo.

Distributions

This module is available as a Unix-style distro (*.tgz).

See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing distros.

Installation

Install Regexp::Parsertron as you would any Perl module:

Run:

        cpanm Regexp::Parsertron

or run:

        sudo cpan Regexp::Parsertron

or unpack the distro, and then use:

        perl Makefile.PL
        make (or dmake or nmake)
        make test
        make install

Constructor and Initialization

new() is called as my($parser) = Regexp::Parsertron -> new(k1 => v1, k2 => v2, ...).

It returns a new object of type Regexp::Parsertron.

Key-value pairs accepted in the parameter list (see corresponding methods for details [e.g. "re([$regexp])"]):

o re => $regexp

The does() method of Scalar::Does is called to see what re is. If it's already of the form qr/$re/, then it's processed as is, but if it's not, then it's transformed using qr/$re/.

Warning: Currently, the input is expected to have been pre-processed by Perl via qr/$regexp/.

Default: ''.

o verbose => $integer

Takes values 0, 1 or 2, which print more and more progress reports.

Used for debugging.

Default: 0 (print nothing).

Methods

append(%opts)

Append some text to the text of a node.

%opts is a hash with these (key => value) pairs:

o text => $string

The text to append.

o uid => $uid

The uid of the node to update.

The code calls die() if %opts does not have these 2 keys, or if either value is undef.

See scripts/synopsis.pl for sample code.

Note: Calling append() never changes the uids of nodes, so repeated calling of append() with the same uid will apply more and more updates to the same node.

See also "prepend(%opts)", "set(%opts)" and t/get.set.t.

as_string()

Returns the parsed regexp as a string. The string contains all edits applied with "The Edit Methods".

find($target)

Returns an arrayref of node uids whose text contains the given string.

If the arrayref is empty, there were no matches.

The Perl function index() is used here to test for $target being a substring of the text associated with each node.

The code calls die() if $target is undef.

See t/get.set.t for sample usage of find().

See "search($target)" for a regexp-based test. See also "get($uid)".

get($uid)

Get the text of the node with the given $uid.

The code calls die() if $uid is undef, or outside the range 1 .. $self -> uid. The latter value is the highest uid so far assigned to any node.

Returns undef if the given $uid is not found.

See also "find($target)".

new([%opts])

Here, '[]' indicate an optional parameter.

See "Constructor and Initialization" for details on the parameters accepted by "new()".

parse([%opts])

Here, '[]' indicate an optional parameter.

Parses the regexp supplied with the parameter re in the call to "new()" or in the call to "re($regexp)", or in the call to parse(re => $regexp) itself. The latter takes precedence.

The hash %opts takes the same (key => value) pairs as "new()" does.

See "Constructor and Initialization" for details.

prepend(%opts)

Prepend some text to the text of a node.

%opts is a hash with these (key => value) pairs:

o text => $string

The text to prepend.

o uid => $uid

The uid of the node to update.

The code calls die() if %opts does not have these 2 keys, or if either value is undef.

Note: Calling prepend() never changes the uids of nodes, so repeated calling of prepend() with the same uid will apply more and more updates to the same node.

See also "append(%opts)", "set(%opts)", and t/get.set.t.

Prints, in a pretty format, the tree built from parsing.

See the </Synopsis> for sample output.

See also "print_raw_tree".

Prints, in a simple format, the tree built from parsing.

See the </Synopsis> for sample output.

See also "print_cooked_tree".

re([$regexp])

Here, '[]' indicate an optional parameter.

Gets or sets the regexp to be processed.

Note: re is a parameter to "new([%opts])".

reset()

Resets various internal things, except test_count.

Used basically for debugging.

search($target)

Returns an arrayref of node uids whose text contains the given string.

If the arrayref is empty, there were no matches.

$target is converted to a regexp if a simple string is passed in.

The code calls die() if $target is undef.

See t/search.t for sample usage of search().

See "find($target)" for a non-regexp search. See also "get($uid)".

set(%opts)

Set the text of a node to $opt{text}.

%opts is a hash with these (key => value) pairs:

o text => $string

The text to use to overwrite the text of the node.

o uid => $uid

The uid of the node to update.

The code calls die() if %opts does not have these 2 keys, or if either value is undef.

See also "append(%opts)" and "prepend(%opts)".

tree()

Returns an object of type Tree. Ignore the root node.

Each node's meta() method returns a hashref of information about the node. See the "What is the format of the nodes in the tree built by this module?" for details.

See also the source code for "print_cooked_tree()" and "print_raw_tree()" for ideas on how to use this object.

uid()

Returns the last-used uid.

Each node in the tree is given a uid, which allows methods like "append(%opts)" to work.

verbose([$integer])

Here, '[]' indicate an optional parameter.

Gets or sets the verbosity level, within the range 0 .. 2. Higher numbers print more progress reports.

Used basically for debugging.

Note: verbose is a parameter to "new([%opts])".

warning_str()

Returns the last Marpa warning.

In short, Marpa will always report 'Marpa parse exhausted' in warning_str() if the parse is not ambiguous, but do not worry - this is not an error.

See After calling parse(), warning_str() contains the string '... Parse ambiguous ...' and Is this a (Marpa) exhaustion-hating or exhaustion-loving app?.

FAQ

Can I add a subtree to the tree?

Not yet.

There is a private method, _add_daughter(), which I could make public, if I felt it was safe to do so.

Why does the BNF not accept an empty regexp?

Simple answer: Changing the BNF to handle this creates a massive problem elsewhere in the BNF.

Complex answer:

The BNF contains this countable rule to allow patterns to be juxtaposed without '|', say, to separate them:

        global_sequence ::= pattern_type+

And in turn (further toward the leaves of the tree of BNF), I then use:

        pattern_sequence ::= pattern_set+

To allow an empty regexp would mean changing this rule to:

        pattern_sequence ::= pattern_set*

But that makes this rule nullable, and Marpa rejects the global_sequence rule on the grounds that a countable rule is not allowed to be nullable. ATM I cannot see a way of rewriting the rules to avoid this problem. But I'm hopeful such a rewrite is possible.

Why does the code sometimes not store '|' - as in qr/(Perl|JavaScript/) - in its own node?

It could be done by, for example, splitting such a string into three nodes, 'Perl', '|', 'Javascript'. But does that offer any benefit?

It makes processing by the user more complex because then if they wish to edit the list of alternatives, they might have to edit two or three nodes instead of one. Here, editing means perhaps replacing any existing string with the empty string.

Further, to extend the list of alternatives, the user will be confused by not being sure if they should change 'Javascript' to 'Javascript|C' or if they have to add two nodes, containing '|' and 'C'. And ATM adding nodes is contraindicated!

Despite this, when the input stream triggers two events, string and vertical_bar, simultaneously because the '|' is at the start of a string, special code in the private method _validate_event() does put '|' in its own node. IOW the BNF does not do the work, which is really what I would prefer.

Does this module ever use \Q...\E to quote regexp metacharacters?

No.

What is the format of the nodes in the tree built by this module?

Each node's name is the name of the Marpa-style event which was triggered by detection of some text within the regexp.

Each node's meta() method returns a hashref with these (key => value) pairs:

o text => $string

This is the text within the regexp which triggered the event just mentioned.

o uid => $integer

This is the unique id of the 'current' node.

This uid is often used by you to specify which node to work on.

See t/get.set.t and t/simple.t for sample code.

The code never changes the uid of a node.

See also the source code for "print_cooked_tree()" and "print_raw_tree()" for ideas on how to use the tree.

See the "Synopsis" for sample code and a report after parsing a tiny regexp.

Does the root node in the tree ever hold useful information?

No. Always ignore it.

Why does the BNF never use the lexeme adverb priority?

Because with Marpa::R2 the priority is only used when lexemes are the same length.

See FAQ #140.

Does this module interpret regexps in any way?

No. You have to run your own Perl code to do that. This module just parses them into a data structure.

And that really means this module does not match the regexp against anything. If I appear to do that while debugging new code, you can't rely on that appearing in production versions of the module.

Does this module rewrite regexps?

No, unless you call one of "The Edit Methods".

Does this module handle both Perl 5 and Perl 6?

No. It will only handle Perl 5 syntax.

Does this module handle regexps for various versions of Perl5?

Not yet. Version-dependent regexp syntax will be supported for recent versions of Perl. This is done by having tokens within the BNF which are replaced at start-up time with version-dependent details.

There are no such tokens at the moment.

All debugging is done assuming the regexp syntax as documented online. See "References" for the urls in question.

So which version of Perl is supported?

The code is expected to work for Perls back to V 5.14.0, which is when stringification of regexps changed. See "References" below for more.

I'm (2018-01-14) using Perl V 5.20.2 and making the BNF match the Perl regexp docs listed in "References" below.

The program t/perl-5.21.11.t reads the file 'xt/author/re_tests' which I copied from the source code of Perl V 5.21.11. This test is the one which currently provides 858 passing tests out of the 1027 tests which pass for me using prove -lv t.

Could Perl and this module generate different parses of the same regexp?

Absolutely! There is no escape from this fact simply because the code used in each program bears no relationship to the code in the other one.

The real question is: How do we make the code in each program accept and reject exactly the same regexps as the code in the other program. I think trial-and-error is all we have available to us for dealing with this issue.

After calling parse(), warning_str() contains the string '... Parse ambiguous ...'

This is almost certainly an error with the BNF, although of course it may be an error with an exceptionally-badly formed regexp.

See examples/ambiguous.pl and this email thread.

See examples/commit.pl and this email thread.

In such cases the code dies, as of V 1.04.

Please report it via https://rt.cpan.org/Public/Dist/Display.html?Name=Regexp-Parsertron, and include the regexp in the report. Thanx!

Is this a (Marpa) exhaustion-hating or exhaustion-loving app?

Exhaustion-loving.

See https://metacpan.org/pod/distribution/Marpa-R2/pod/Exhaustion.pod#Exhaustion

Will this code be modified to run under Marpa::R3 when the latter is stable?

Yes.

What is the purpose of this module?

o To provide a stand-alone parser for regexps
o To help me learn more about regexps
o To become, I hope, a replacement for the horrendously complex Regexp::Assemble

Who crafted the BNF?

I did.

Scripts

This diagram indicates the flow of logic from script to script:

        xt/author/re_tests
        |
        V
        xt/author/generate.tests.pl
        |
        V
        xt/authors/perl-5.21.11.tests
        |
        V
        perl -Ilib t/perl-5.21.11.t > xt/author/perl-5.21.11.log 2>&1

If xt/author/perl-5.21.11.log only contains lines starting with 'ok', then all Perl and Marpa errors have been hidden, so t/perl-5.21.11.t is ready to live in t/. Before that time it lives in xt/author/.

TODO

o How to best define 'code' in the BNF.
o I could traverse the tree and store a pointer to each node in an array

This would mean fast access to nodes in random order. But is there any point? Yes, it would speed up various methods. Specifically, any module which calls traverse() on the tree object would benefit.

o Allow users to add nodes and hence subtrees to the tree

References

https://www.rexegg.com/regex-lookarounds.html. Mastering Lookahead and Lookbehind.

http://www.pcre.org/. PCRE - Perl Compatible Regular Expressions.

http://perldoc.perl.org/perlre.html. This is the definitive document.

http://perldoc.perl.org/perlrecharclass.html#Extended-Bracketed-Character-Classes.

http://perldoc.perl.org/perlretut.html. Samples with commentary.

http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators

http://perldoc.perl.org/perlrequick.html

http://perldoc.perl.org/perlrebackslash.html

http://perldoc.perl.org/perl5140delta.html#Regular-Expressions. This is when stringification changed to return (?^...) rather than (?-xism...).

https://www.endpoint.com/blog/2018/01/23/regular-expression-inconsistencies-with-unicode

http://www.nntp.perl.org/group/perl.perl5.porters/2016/02/msg234642.html. Regular Expression Inconsistencies With Unicode.

https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

https://code.activestate.com/lists/perl5-porters/209610/

https://stackoverflow.com/questions/46200305/a-strict-regular-expression-for-matching-chemical-formulae

See Also

Graph::Regexp

Regexp::Assemble

Regexp::Debugger

Regexp::ERE

Regexp::Keywords

Regexp::Lexer

Regexp::List

Regexp::Optimizer

Regexp::Parser

Regexp::SAR. This is vaguely a version of Set::FA::Element.

Regexp::Stringify

Regexp::Trie

And many others...

Machine-Readable Change Log

The file Changes was converted into Changelog.ini by Module::Metadata::Changes.

Version Numbers

Version numbers < 1.00 represent development versions. From 1.00 up, they are production versions.

CPAN Tester Results

http://fast-matrix.cpantesters.org/?dist=Regexp-Parsertron

Repository

https://github.com/ronsavage/Regexp-Parsertron

Support

Email the author, or log a bug on RT:

https://rt.cpan.org/Public/Dist/Display.html?Name=Regexp::Parsertron.

Author

Regexp::Parsertron was written by Ron Savage <ron@savage.net.au> in 2011.

Marpa's homepage: http://savage.net.au/Marpa.html.

My homepage.

Copyright

Australian copyright (c) 2016, Ron Savage.

        All Programs of mine are 'OSI Certified Open Source Software';
        you can redistribute them and/or modify them under the terms of
        The Artistic License 2.0, a copy of which is available at:
        http://opensource.org/licenses/alphabetical.