Regexp::Parsertron - Parse a Perl regexp into a data structure of type Tree
Regexp::Parsertron
Warning: Development version. See "Version Numbers" for details.
This is scripts/synopsis.pl:
#!/usr/bin/env perl use v5.10; use strict; use warnings; use Regexp::Parsertron; # --------------------- my($re) = qr/Perl|JavaScript/i; my($parser) = Regexp::Parsertron -> new(verbose => 1); # Return 0 for success and 1 for failure. my($result) = $parser -> parse(re => $re); my($node_id) = 5; # Obtained from displaying and inspecting the tree. print "Calling append(text => '|C++', uid => $node_id) \n"; $parser -> append(text => '|C++', uid => $node_id); $parser -> print_raw_tree; $parser -> print_cooked_tree; my($as_string) = $parser -> as_string; print "Original: $re. Result: $result (0 is success) \n"; print "as_string(): $as_string \n"; $result = $parser -> validate; print "validate(): Result: $result (0 is success) \n"; # Return 0 for success and 1 for failure. $parser -> reset; $parser -> verbose(0); $re = qr/Perl|JavaScript|(?:Flub|BCPL)/i; $result = $parser -> parse(re => $re); print "\nAdd complexity to the regexp by parsing a new regexp \n"; $parser -> print_raw_tree;
And its output:
Test count: 1. Parsing (in qr/.../ form): '(?^i:Perl|JavaScript)'. Root. Attributes: {text => "Root", uid => "0"} |--- open_parenthesis. Attributes: {text => "(", uid => "1"} | |--- query_caret. Attributes: {text => "?^", uid => "2"} | |--- flag_set. Attributes: {text => "i", uid => "3"} | |--- colon. Attributes: {text => ":", uid => "4"} | |--- string. Attributes: {text => "Perl|JavaScript", uid => "5"} |--- close_parenthesis. Attributes: {text => ")", uid => "6"} Calling append(text => '|C++', uid => 5) Root. Attributes: {text => "Root", uid => "0"} |--- open_parenthesis. Attributes: {text => "(", uid => "1"} | |--- query_caret. Attributes: {text => "?^", uid => "2"} | |--- flag_set. Attributes: {text => "i", uid => "3"} | |--- colon. Attributes: {text => ":", uid => "4"} | |--- string. Attributes: {text => "Perl|JavaScript|C++", uid => "5"} |--- close_parenthesis. Attributes: {text => ")", uid => "6"} Name Uid Text ---- --- ---- open_parenthesis 1 ( query_caret 2 ?^ flag_set 3 i colon 4 : string 5 Perl|JavaScript|C++ close_parenthesis 6 ) Original: (?^i:Perl|JavaScript). Result: 0 (0 is success) as_string(): (?^i:Perl|JavaScript|C++) validate(): Result: 0 (0 is success) Adding complexity to the regexp by parsing a new regexp: Root. Attributes: {text => "Root", uid => "0"} |--- open_parenthesis. Attributes: {text => "(", uid => "1"} | |--- query_caret. Attributes: {text => "?^", uid => "2"} | |--- flag_set. Attributes: {text => "i", uid => "3"} | |--- colon. Attributes: {text => ":", uid => "4"} | |--- string. Attributes: {text => "Perl|JavaScript|", uid => "5"} | |--- colon_prefix. Attributes: {text => "(?:", uid => "6"} | | |--- string. Attributes: {text => "Flub|BCPL", uid => "7"} | |--- close_parenthesis. Attributes: {text => ")", uid => "8"} |--- close_parenthesis. Attributes: {text => ")", uid => "9"}
Note: The 1st tree is printed due to verbose => 1 in the call to "new([%opts])", while the 2nd is due to the call to "print_raw_tree()". The columnar output is due to the call to "print_cooked_tree()".
This code, scripts/tutorial.pl, is a cut-down version of scripts/synopsis.pl:
#!/usr/bin/env perl use v5.10; use strict; use warnings; use Regexp::Parsertron; # --------------------- my($re) = qr/Perl|JavaScript/i; my($parser) = Regexp::Parsertron -> new(verbose => 1); # Return 0 for success and 1 for failure. my($result) = $parser -> parse(re => $re); print "Original: $re. Result: $result. (0 is success) \n";
Running it outputs:
Test count: 1. Parsing (in qr/.../ form): '(?^i:Perl|JavaScript)'. Root. Attributes: {text => "Root", uid => "0"} |--- open_parenthesis. Attributes: {text => "(", uid => "1"} | |--- query_caret. Attributes: {text => "?^", uid => "2"} | |--- flag_set. Attributes: {text => "i", uid => "3"} | |--- colon. Attributes: {text => ":", uid => "4"} | |--- string. Attributes: {text => "Perl|JavaScript", uid => "5"} |--- close_parenthesis. Attributes: {text => ")", uid => "6"} Original: (?^i:Perl|JavaScript). Result: 0. (0 is success)
The nodes are uniquely identified by their uids.
Add these lines to the end of the tutorial code, and re-run:
my($node_id) = 5; # Obtained from displaying and inspecting the tree. $parser -> append(text => '|C++', uid => $node_id); $parser -> print_raw_tree;
The extra output, showing the change to node uid == 5, is:
Root. Attributes: {text => "Root", uid => "0"} |--- open_parenthesis. Attributes: {text => "(", uid => "1"} | |--- query_caret. Attributes: {text => "?^", uid => "2"} | |--- flag_set. Attributes: {text => "i", uid => "3"} | |--- colon. Attributes: {text => ":", uid => "4"} | |--- string. Attributes: {text => "Perl|JavaScript|C++", uid => "5"} |--- close_parenthesis. Attributes: {text => ")", uid => "6"}
See t/get.set.t for sample code.
The edit methods simply means any one or more of these methods, which can all change the text of a node:
The edit methods are exercised in t/get.set.t, as well as scripts/synopsis.pl (above).
Parses a regexp into a tree object managed by the Tree module, and provides various methods for updating and retrieving that tree's contents.
This module uses Marpa::R2 and Moo.
This module is available as a Unix-style distro (*.tgz).
See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing distros.
Install Regexp::Parsertron as you would any Perl module:
Perl
Run:
cpanm Regexp::Parsertron
or run:
sudo cpan Regexp::Parsertron
or unpack the distro, and then use:
perl Makefile.PL make (or dmake or nmake) make test make install
new() is called as my($parser) = Regexp::Parsertron -> new(k1 => v1, k2 => v2, ...).
new()
my($parser) = Regexp::Parsertron -> new(k1 => v1, k2 => v2, ...)
It returns a new object of type Regexp::Parsertron.
Key-value pairs accepted in the parameter list (see corresponding methods for details [e.g. "re([$regexp])"]):
The does() method of Scalar::Does is called to see what re is. If it's already of the form qr/$re/, then it's processed as is, but if it's not, then it's transformed using qr/$re/.
does()
re
qr/$re/
Warning: Currently, the input is expected to have been pre-processed by Perl via qr/$regexp/.
Default: ''.
Takes values 0, 1 or 2, which print more and more progress reports.
Used for debugging.
Default: 0 (print nothing).
Append some text to the text of a node.
%opts is a hash with these (key => value) pairs:
The text to append.
The uid of the node to update.
The code calls die() if %opts does not have these 2 keys, or if either value is undef.
die()
See scripts/synopsis.pl for sample code.
Note: Calling append() never changes the uids of nodes, so repeated calling of append() with the same uid will apply more and more updates to the same node.
append()
uid
See also "prepend(%opts)", "set(%opts)" and t/get.set.t.
Returns the parsed regexp as a string. The string contains all edits applied with "The Edit Methods".
Returns an arrayref of node uids whose text contains the given string.
If the arrayref is empty, there were no matches.
The Perl function index() is used here to test for $target being a substring of the text associated with each node.
index()
The code calls die() if $target is undef.
See t/get.set.t for sample usage of find().
find()
See "search($target)" for a regexp-based test. See also "get($uid)".
Get the text of the node with the given $uid.
The code calls die() if $uid is undef, or outside the range 1 .. $self -> uid. The latter value is the highest uid so far assigned to any node.
Returns undef if the given $uid is not found.
See also "find($target)".
Here, '[]' indicate an optional parameter.
See "Constructor and Initialization" for details on the parameters accepted by "new()".
Parses the regexp supplied with the parameter re in the call to "new()" or in the call to "re($regexp)", or in the call to parse(re => $regexp) itself. The latter takes precedence.
parse(re => $regexp)
The hash %opts takes the same (key => value) pairs as "new()" does.
%opts
See "Constructor and Initialization" for details.
Prepend some text to the text of a node.
The text to prepend.
Note: Calling prepend() never changes the uids of nodes, so repeated calling of prepend() with the same uid will apply more and more updates to the same node.
prepend()
See also "append(%opts)", "set(%opts)", and t/get.set.t.
Prints, in a pretty format, the tree built from parsing.
See the </Synopsis> for sample output.
See also "print_raw_tree".
Prints, in a simple format, the tree built from parsing.
See also "print_cooked_tree".
Gets or sets the regexp to be processed.
Note: re is a parameter to "new([%opts])".
Resets various internal things, except test_count.
Used basically for debugging.
$target is converted to a regexp if a simple string is passed in.
See t/search.t for sample usage of search().
search()
See "find($target)" for a non-regexp search. See also "get($uid)".
Set the text of a node to $opt{text}.
The text to use to overwrite the text of the node.
See also "append(%opts)" and "prepend(%opts)".
Returns an object of type Tree. Ignore the root node.
Each node's meta() method returns a hashref of information about the node. See the "What is the format of the nodes in the tree built by this module?" for details.
meta()
See also the source code for "print_cooked_tree()" and "print_raw_tree()" for ideas on how to use this object.
Returns the last-used uid.
Each node in the tree is given a uid, which allows methods like "append(%opts)" to work.
Gets or sets the verbosity level, within the range 0 .. 2. Higher numbers print more progress reports.
Note: verbose is a parameter to "new([%opts])".
verbose
Returns the last Marpa warning.
In short, Marpa will always report 'Marpa parse exhausted' in warning_str() if the parse is not ambiguous, but do not worry - this is not an error.
See After calling parse(), warning_str() contains the string '... Parse ambiguous ...' and Is this a (Marpa) exhaustion-hating or exhaustion-loving app?.
Not yet.
There is a private method, _add_daughter(), which I could make public, if I felt it was safe to do so.
_add_daughter()
Simple answer: Changing the BNF to handle this creates a massive problem elsewhere in the BNF.
Complex answer:
The BNF contains this countable rule to allow patterns to be juxtaposed without '|', say, to separate them:
global_sequence ::= pattern_type+
And in turn (further toward the leaves of the tree of BNF), I then use:
pattern_sequence ::= pattern_set+
To allow an empty regexp would mean changing this rule to:
pattern_sequence ::= pattern_set*
But that makes this rule nullable, and Marpa rejects the global_sequence rule on the grounds that a countable rule is not allowed to be nullable. ATM I cannot see a way of rewriting the rules to avoid this problem. But I'm hopeful such a rewrite is possible.
global_sequence
It could be done by, for example, splitting such a string into three nodes, 'Perl', '|', 'Javascript'. But does that offer any benefit?
It makes processing by the user more complex because then if they wish to edit the list of alternatives, they might have to edit two or three nodes instead of one. Here, editing means perhaps replacing any existing string with the empty string.
Further, to extend the list of alternatives, the user will be confused by not being sure if they should change 'Javascript' to 'Javascript|C' or if they have to add two nodes, containing '|' and 'C'. And ATM adding nodes is contraindicated!
Despite this, when the input stream triggers two events, string and vertical_bar, simultaneously because the '|' is at the start of a string, special code in the private method _validate_event() does put '|' in its own node. IOW the BNF does not do the work, which is really what I would prefer.
string
vertical_bar
_validate_event()
No.
Each node's name is the name of the Marpa-style event which was triggered by detection of some text within the regexp.
name
text
Each node's meta() method returns a hashref with these (key => value) pairs:
This is the text within the regexp which triggered the event just mentioned.
This is the unique id of the 'current' node.
This uid is often used by you to specify which node to work on.
See t/get.set.t and t/simple.t for sample code.
The code never changes the uid of a node.
See also the source code for "print_cooked_tree()" and "print_raw_tree()" for ideas on how to use the tree.
See the "Synopsis" for sample code and a report after parsing a tiny regexp.
No. Always ignore it.
priority
Because with Marpa::R2 the priority is only used when lexemes are the same length.
See FAQ #140.
No. You have to run your own Perl code to do that. This module just parses them into a data structure.
And that really means this module does not match the regexp against anything. If I appear to do that while debugging new code, you can't rely on that appearing in production versions of the module.
No, unless you call one of "The Edit Methods".
No. It will only handle Perl 5 syntax.
Not yet. Version-dependent regexp syntax will be supported for recent versions of Perl. This is done by having tokens within the BNF which are replaced at start-up time with version-dependent details.
There are no such tokens at the moment.
All debugging is done assuming the regexp syntax as documented online. See "References" for the urls in question.
The code is expected to work for Perls back to V 5.14.0, which is when stringification of regexps changed. See "References" below for more.
I'm (2018-01-14) using Perl V 5.20.2 and making the BNF match the Perl regexp docs listed in "References" below.
The program t/perl-5.21.11.t reads the file 'xt/author/re_tests' which I copied from the source code of Perl V 5.21.11. This test is the one which currently provides 858 passing tests out of the 1027 tests which pass for me using prove -lv t.
Absolutely! There is no escape from this fact simply because the code used in each program bears no relationship to the code in the other one.
The real question is: How do we make the code in each program accept and reject exactly the same regexps as the code in the other program. I think trial-and-error is all we have available to us for dealing with this issue.
This is almost certainly an error with the BNF, although of course it may be an error with an exceptionally-badly formed regexp.
See examples/ambiguous.pl and this email thread.
See examples/commit.pl and this email thread.
In such cases the code dies, as of V 1.04.
Please report it via https://rt.cpan.org/Public/Dist/Display.html?Name=Regexp-Parsertron, and include the regexp in the report. Thanx!
Exhaustion-loving.
See https://metacpan.org/pod/distribution/Marpa-R2/pod/Exhaustion.pod#Exhaustion
Yes.
I did.
This diagram indicates the flow of logic from script to script:
xt/author/re_tests | V xt/author/generate.tests.pl | V xt/authors/perl-5.21.11.tests | V perl -Ilib t/perl-5.21.11.t > xt/author/perl-5.21.11.log 2>&1
If xt/author/perl-5.21.11.log only contains lines starting with 'ok', then all Perl and Marpa errors have been hidden, so t/perl-5.21.11.t is ready to live in t/. Before that time it lives in xt/author/.
This would mean fast access to nodes in random order. But is there any point? Yes, it would speed up various methods. Specifically, any module which calls traverse() on the tree object would benefit.
traverse()
https://www.rexegg.com/regex-lookarounds.html. Mastering Lookahead and Lookbehind.
http://www.pcre.org/. PCRE - Perl Compatible Regular Expressions.
http://perldoc.perl.org/perlre.html. This is the definitive document.
http://perldoc.perl.org/perlrecharclass.html#Extended-Bracketed-Character-Classes.
http://perldoc.perl.org/perlretut.html. Samples with commentary.
http://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators
http://perldoc.perl.org/perlrequick.html
http://perldoc.perl.org/perlrebackslash.html
http://perldoc.perl.org/perl5140delta.html#Regular-Expressions. This is when stringification changed to return (?^...) rather than (?-xism...).
https://www.endpoint.com/blog/2018/01/23/regular-expression-inconsistencies-with-unicode
http://www.nntp.perl.org/group/perl.perl5.porters/2016/02/msg234642.html. Regular Expression Inconsistencies With Unicode.
https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
https://code.activestate.com/lists/perl5-porters/209610/
https://stackoverflow.com/questions/46200305/a-strict-regular-expression-for-matching-chemical-formulae
Graph::Regexp
Regexp::Assemble
Regexp::Debugger
Regexp::ERE
Regexp::Keywords
Regexp::Lexer
Regexp::List
Regexp::Optimizer
Regexp::Parser
Regexp::SAR. This is vaguely a version of Set::FA::Element.
Regexp::Stringify
Regexp::Trie
And many others...
The file Changes was converted into Changelog.ini by Module::Metadata::Changes.
Version numbers < 1.00 represent development versions. From 1.00 up, they are production versions.
http://fast-matrix.cpantesters.org/?dist=Regexp-Parsertron
https://github.com/ronsavage/Regexp-Parsertron
Email the author, or log a bug on RT:
https://rt.cpan.org/Public/Dist/Display.html?Name=Regexp::Parsertron.
Regexp::Parsertron was written by Ron Savage <ron@savage.net.au> in 2011.
Marpa's homepage: http://savage.net.au/Marpa.html.
My homepage.
Australian copyright (c) 2016, Ron Savage.
All Programs of mine are 'OSI Certified Open Source Software'; you can redistribute them and/or modify them under the terms of The Artistic License 2.0, a copy of which is available at: http://opensource.org/licenses/alphabetical.
To install Regexp::Parsertron, copy and paste the appropriate command in to your terminal.
cpanm
CPAN shell
perl -MCPAN -e shell install Regexp::Parsertron
For more information on module installation, please visit the detailed CPAN module installation guide.