Text::Balanced::Marpa - Extract delimited text sequences from strings
    #!/usr/bin/env perl

    use strict;
    use warnings;

    use Text::Balanced::Marpa ':constants';

    # -----------

    my($count)  = 0;
    my($parser) = Text::Balanced::Marpa -> new
    (
        open    => ['<:' ,'[%'],
        close   => [':>', '%]'],
        options => nesting_is_fatal | print_warnings,
    );
    my(@text) =
    (
        q|<: a :>|,
        q|a [% b <: c :> d %] e|,
        q|a <: b <: c :> d :> e|, # nesting_is_fatal triggers an error here.
    );

    my($result);

    for my $text (@text)
    {
        $count++;

        print "Parsing |$text|\n";

        $result = $parser -> parse(\$text);

        print join("\n", @{$parser -> tree2string}), "\n";
        print "Parse result: $result (0 is success)\n";

        if ($count == 3)
        {
            print "Deliberate error: Failed to parse |$text|\n";
            print 'Error number: ', $parser -> error_number, '. Error message: ', $parser -> error_message, "\n";
        }

        print '-' x 50, "\n";
    }
See scripts/synopsis.pl.
This is the printout of synopsis.pl:
    Parsing |<: a :>|
    Parsed text:
    root. Attributes: {}
        |--- open. Attributes: {text => "<:"}
        |    |--- string. Attributes: {text => " a "}
        |--- close. Attributes: {text => ":>"}
    Parse result: 0 (0 is success)
    --------------------------------------------------
    Parsing |a [% b <: c :> d %] e|
    Parsed text:
    root. Attributes: {}
        |--- string. Attributes: {text => "a "}
        |--- open. Attributes: {text => "[%"}
        |    |--- string. Attributes: {text => " b "}
        |    |--- open. Attributes: {text => "<:"}
        |    |    |--- string. Attributes: {text => " c "}
        |    |--- close. Attributes: {text => ":>"}
        |    |--- string. Attributes: {text => " d "}
        |--- close. Attributes: {text => "%]"}
        |--- string. Attributes: {text => " e"}
    Parse result: 0 (0 is success)
    --------------------------------------------------
    Parsing |a <: b <: c :> d :> e|
    Error: Parse failed. Opened delimiter <: again before closing previous one
    Text parsed so far:
    root. Attributes: {}
        |--- string. Attributes: {text => "a "}
        |--- open. Attributes: {text => "<:"}
             |--- string. Attributes: {text => " b "}
    Parse result: 1 (0 is success)
    Deliberate error: Failed to parse |a <: b <: c :> d :> e|
    Error number: 2. Error message: Opened delimiter <: again before closing previous one
    --------------------------------------------------
Text::Balanced::Marpa provides a Marpa::R2-based parser for extracting delimited text sequences from strings.
See the "FAQ" for various topics, including:
See t/utf8.t.
See t/escapes.t.
See t/colons.t.
See t/escapes.t and t/perl.delimiters.t.
See scripts/traverse.pl.
See t/colons.t and t/percents.t.
See scripts/traverse.pl and t/html.t.
In the same vein, see t/angle.brackets.t, for code where the delimiters are just '<' and '>'.
See t/multiple.delimiters.t.
See t/skip.prefix.t.
See t/silly.delimiters.
This module is available as a Unix-style distro (*.tgz).
See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing distros.
Install Text::Balanced::Marpa as you would any Perl module:
Run:
cpanm Text::Balanced::Marpa
or run:
sudo cpan Text::Balanced::Marpa
or unpack the distro, and then either:
    perl Build.PL
    ./Build
    ./Build test
    sudo ./Build install
or:
    perl Makefile.PL
    make (or dmake or nmake)
    make test
    make install
new() is called as my($parser) = Text::Balanced::Marpa -> new(k1 => v1, k2 => v2, ...).
It returns a new object of type Text::Balanced::Marpa.
Key-value pairs accepted in the parameter list (see corresponding methods for details [e.g. "text([$stringref])"]):
close => $arrayref
An arrayref of strings, each one a closing delimiter.
The # of elements must match the # of elements in the 'open' arrayref.
See the "FAQ" for details and warnings.
A value for this option is mandatory.
Default: None.
length => $integer
The maximum length of the input string to process.
This parameter works in conjunction with the pos parameter.
See the "FAQ" for details.
Default: Calls Perl's length() function on the input string.
next_few_limit => $integer
This controls how many characters are printed when displaying 'the next few chars'.
It only affects debug output.
Default: 20.
open => $arrayref
An arrayref of strings, each one an opening delimiter.
options => $integer
This allows you to turn on various options, by combining the option flags described in the "FAQ".
Default: 0 (nothing is fatal).
pos => $integer
The offset within the input string at which to start processing.
This parameter works in conjunction with the length parameter.
Note: The first character in the input string is at pos == 0.
Default: 0.
text => $stringref
Default: \''.
Returns a string containing the grammar constructed based on user input.
Get the arrayref of closing delimiters.
See also "open()".
'close' is a parameter to "new()". See "Constructor and Initialization" for details.
Returns a hashref, where the keys are delimiters and the values are either 'open' or 'close'.
Returns a hashref where the keys are opening and closing delimiters, and the values are the # of times each delimiter appears in the input stream.
The value is incremented for each opening delimiter and decremented for each closing delimiter.
Returns the last error or warning message set when the code died.
Error messages always start with 'Error: '. Messages never end with "\n".
Parsing error strings is not a good idea, even though this module's format for them is fixed.
See "error_number()".
Returns the last error or warning number set.
Warnings have values < 0, and errors have values > 0.
If the value is > 0, the message has the prefix 'Error: ', and if the value is < 0, it has the prefix 'Warning: '. If this is not the case, it's a reportable bug.
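The sign convention above can be sketched in a few lines of plain Perl. The helper classify_error_number() is hypothetical and is not part of Text::Balanced::Marpa; it only assumes that 0 means nothing has been recorded, that negative values are warnings, and that positive values are errors.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Balanced::Marpa.
# Maps a value such as the one returned by error_number() to a category:
# > 0 is an error, < 0 is a warning, and 0 is assumed to mean 'nothing set'.
sub classify_error_number
{
	my($number) = @_;

	return $number > 0 ? 'error' : $number < 0 ? 'warning' : 'none';
}

for my $number (0, -2, 2)
{
	print "$number => ", classify_error_number($number), "\n";
}
```

In real code, the argument would come from $parser -> error_number after a call to parse().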
Possible values for error_number() and error_message():
This is the default value.
If "error_number()" returns 1, it's an error, and if it returns -1 it's a warning.
You can set the option overlap_is_fatal to make it fatal.
If "error_number()" returns 2, it's an error, and if it returns -2 it's a warning.
You can set the option nesting_is_fatal to make it fatal.
This message is only produced when the parse is ambiguous.
If "error_number()" returns 3, it's an error, and if it returns -3 it's a warning.
You can set the option ambiguity_is_fatal to make it fatal.
This preempts some types of sabotage.
This message can never be just a warning message.
This limitation is due to the syntax of Marpa's DSL.
If "error_number()" returns 6, it's an error, and if it returns -6 it's a warning.
You can set the option exhaustion_is_fatal to make it fatal.
Marpa has triggered an event and its name is not in the hash of event names derived from the BNF.
The code is written to handle single events at a time, or in rare cases, 2 events at the same time. But here, multiple events have been triggered and the code cannot handle the given combination.
See "error_message()".
Here, the [] indicate an optional parameter.
Get or set the escape char.
Returns a string consisting of the node's name and, optionally, its attributes.
Possible keys in the $options hashref:
If 1, the node's attributes are not included in the string returned.
Default: 0 (include attributes).
Calls "hashref2string($hashref)".
Called by "node2string($options, $is_last_node, $node, $vert_dashes)".
You would not normally call this method.
If you don't wish to supply options, use format_node({}, $node).
Returns the given hashref as a string.
Called by "format_node($options, $node)".
Returns a hashref where the keys are event names and the values are 1.
Get or set the length of the input string to process.
See also the "FAQ" and "pos([$integer])".
'length' is a parameter to "new()". See "Constructor and Initialization" for details.
Returns a hashref where the keys are opening delimiters and the values are the corresponding closing delimiters.
See "Constructor and Initialization" for details on the parameters accepted by "new()".
Returns a substring of $s, starting at $offset, for use in debug messages.
See next_few_limit([$integer]).
Get or set the number of characters called 'the next few chars', which are printed during debugging.
'next_few_limit' is a parameter to "new()". See "Constructor and Initialization" for details.
Returns a string of the node's name and attributes, with a leading indent, suitable for printing.
Ignore the parameter $vert_dashes. The code uses it as temporary storage.
Calls "format_node($options, $node)".
Called by "tree2string($options, [$some_tree])".
Get the arrayref of opening delimiters.
See also "close()".
'open' is a parameter to "new()". See "Constructor and Initialization" for details.
Get or set the option flags.
For typical usage, see scripts/synopsis.pl.
'options' is a parameter to "new()". See "Constructor and Initialization" for details.
This is the only method the user needs to call. All data can be supplied when calling "new()".
You can of course call other methods (e.g. "text([$stringref])" ) after calling "new()" but before calling parse().
Note: If a stringref is passed to parse(), it takes precedence over any stringref passed to new(text => $stringref), and over any stringref passed to "text([$stringref])". Further, the stringref passed to parse() is passed to "text([$stringref])", meaning any subsequent call to text() returns the stringref passed to parse().
See scripts/samples.pl.
Returns 0 for success and 1 for failure.
If the value is 1, you should call "error_number()" to find out what happened.
Get or set the offset within the input string at which to start processing.
See also the "FAQ" and "length([$integer])".
'pos' is a parameter to "new()". See "Constructor and Initialization" for details.
Get or set a reference to the string to be parsed.
'text' is a parameter to "new()". See "Constructor and Initialization" for details.
Returns an object of type Tree, which holds the parsed data.
Obviously, it only makes sense to call tree() after calling parse().
See scripts/traverse.pl for sample code which processes this tree's nodes.
Here, the [] represent an optional parameter.
If $some_tree is not supplied, uses the calling object's tree ($self -> tree).
Returns an arrayref of lines, suitable for printing. These lines do not end in "\n".
Draws a nice ASCII-art representation of the tree structure.
The tree looks like:
    Root. Attributes: {# => "0"}
        |--- I. Attributes: {# => "1"}
        |    |--- J. Attributes: {# => "3"}
        |    |    |--- K. Attributes: {# => "3"}
        |    |--- J. Attributes: {# => "4"}
        |         |--- L. Attributes: {# => "5"}
        |              |--- M. Attributes: {# => "5"}
        |                   |--- N. Attributes: {# => "5"}
        |                        |--- O. Attributes: {# => "5"}
        |--- H. Attributes: {# => "2"}
        |    |--- J. Attributes: {# => "3"}
        |    |    |--- K. Attributes: {# => "3"}
        |    |--- J. Attributes: {# => "4"}
        |         |--- L. Attributes: {# => "5"}
        |              |--- M. Attributes: {# => "5"}
        |                   |--- N. Attributes: {# => "5"}
        |                        |--- O. Attributes: {# => "5"}
        |--- D. Attributes: {# => "6"}
        |    |--- F. Attributes: {# => "8"}
        |         |--- G. Attributes: {# => "8"}
        |--- E. Attributes: {# => "7"}
        |    |--- F. Attributes: {# => "8"}
        |         |--- G. Attributes: {# => "8"}
        |--- B. Attributes: {# => "9"}
             |--- C. Attributes: {# => "9"}
Or, without attributes:
    Root
        |--- I
        |    |--- J
        |    |    |--- K
        |    |--- J
        |         |--- L
        |              |--- M
        |                   |--- N
        |                        |--- O
        |--- H
        |    |--- J
        |    |    |--- K
        |    |--- J
        |         |--- L
        |              |--- M
        |                   |--- N
        |                        |--- O
        |--- D
        |    |--- F
        |         |--- G
        |--- E
        |    |--- F
        |         |--- G
        |--- B
             |--- C
Example usage:
print map("$_\n", @{$tree -> tree2string});
Can be called with $some_tree set to any $node, and will print the tree assuming $node is the root.
If you don't wish to supply options, use tree2string({}, $node).
Possible keys in the $options hashref (which defaults to {}):
Calls "node2string($options, $is_last_node, $node, $vert_dashes)".
See "error_message()" and "error_number()".
By backslash-escaping the first character of all open and close delimiters which appear in the text.
As an example, if the delimiters are '<:' and ':>', this means you have to escape all the '<' chars and all the colons in the text.
The backslash is preserved in the output.
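As a rough illustration of that escaping rule, here is a minimal sketch in plain Perl. The helper escape_delimiter_chars() is hypothetical and is not provided by this module. It assumes the text does not already contain backslash escapes, and it simply puts a backslash before every character which is the first character of any delimiter.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Balanced::Marpa.
# Assumes $text contains no backslash escapes yet.
sub escape_delimiter_chars
{
	my($text, @delimiters) = @_;

	# Collect the first character of every non-empty delimiter.
	my(%first);

	$first{substr($_, 0, 1)} = 1 for grep { length($_) } @delimiters;

	return $text if (! %first);

	# Build a character class from those characters,
	# then prefix each matching character with a backslash.
	my($class) = join('', map { quotemeta($_) } sort keys %first);

	$text =~ s/([$class])/\\$1/g;

	return $text;
}

my($escaped) = escape_delimiter_chars('a <: b :> c', '<:', ':>');

print "$escaped\n";
```

With the delimiters '<:' and ':>', every '<' and every colon gets a backslash, while '>' is left alone because it is not the first character of either delimiter.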
The recognizer - an object of type Marpa::R2::Scanless::R - is called in a loop, like this:
    for
    (
        $pos = $self -> recce -> read($stringref, $pos, $length);
        $pos < $length;
        $pos = $self -> recce -> resume($pos)
    )

"pos([$integer])" and "length([$integer])" can be used to initialize $pos and $length.
See https://metacpan.org/pod/distribution/Marpa-R2/pod/Scanless/R.pod#read for details.
Yes. See t/escapes.t, t/multiple.quotes.t and t/utf8.t.
See t/perl.delimiters.t.
Don't do that.
To make the code work, you would have to manually call "validate_open_close()". But even then a lot of things would have to be re-initialized to give the code any hope of working.
And that raises the question: Should the tree of text parsed so far be destroyed and re-initialized?
Each of these parameters takes an arrayref as a value.
The # of elements in the 2 arrayrefs must be the same.
The 1st element in the 'open' arrayref is the 1st user-chosen opening delimiter, and the 1st element in the 'close' arrayref must be the corresponding closing delimiter.
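The positional pairing described above can be sketched in plain Perl. The helper pair_delimiters() is hypothetical and only illustrates the rule; the module does its own validation and may report a mismatch differently.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Balanced::Marpa.
# Pairs the Nth opening delimiter with the Nth closing delimiter.
sub pair_delimiters
{
	my($open, $close) = @_;

	die "Error: 'open' and 'close' must have the same # of elements\n"
		if (scalar @$open != scalar @$close);

	my(%pair);

	# Hash slice: keys are opening delimiters, values are the closing ones.
	@pair{@$open} = @$close;

	return \%pair;
}

my($pair) = pair_delimiters(['<:', '[%'], [':>', '%]']);

print "$_ is closed by $$pair{$_}\n" for sort keys %$pair;
```

Because the pairing is by position, swapping the order of the elements in one arrayref but not the other would silently pair the wrong delimiters.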
It is possible to use a delimiter which is part of another delimiter.
See scripts/samples.pl. It uses both '<' and '<:' as opening delimiters and their corresponding closing delimiters are '>' and ':>'. Neat, huh?
Firstly, to make these constants available, you must say:
use Text::Balanced::Marpa ':constants';
Secondly, more detail on errors and warnings can be found at "error_number()".
Thirdly, for usage of these option flags, see t/angle.brackets.t, t/colons.t, t/escapes.t, t/multiple.quotes.t, t/percents.t and scripts/samples.pl.
Now the flags themselves:
This is the default.
Its value is 0.
Print extra stuff if this flag is set.
Its value is 1.
Print various warnings if this flag is set:
Ambiguity is not, in and of itself, an error. But see the ambiguity_is_fatal option, below.
It's tempting to call this option warnings, but Perl already has use warnings, so I didn't.
Its value is 2.
This means overlapping delimiters cause a fatal error.
So, setting overlap_is_fatal means '{Bold [Italic}]' would be a fatal error.
I use this example since it gives me the opportunity to warn you: this will not do what you want if you try to use the delimiters '<' and '>' for HTML. That is, '<i><b>Bold Italic</i></b>' is not an error, because what overlap are '<b>' and '</i>' BUT THEY ARE NOT THE DELIMITERS. The delimiters are '<' and '>', ok? See also t/html.t.
Its value is 4.
This means nesting of identical opening delimiters is fatal.
So, using nesting_is_fatal means 'a <: b <: c :> d :> e' would be a fatal error.
Its value is 8.
This makes "error_number()" return 3 rather than -3.
Its value is 16.
This makes "error_number()" return 6 rather than -6.
Its value is 32.
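Since each flag is a distinct power of 2, flags combine with bitwise OR and are tested with bitwise AND. The sketch below uses hypothetical local copies of some of the values listed above, so that it runs without the module installed. In real code you would import the module's own constants with use Text::Balanced::Marpa ':constants'. The helper has_flag() is also hypothetical.

```perl
use strict;
use warnings;

# Hypothetical local copies of the documented flag values.
# Real code should import them from the module instead.
use constant
{
	print_warnings      => 2,
	overlap_is_fatal    => 4,
	nesting_is_fatal    => 8,
	ambiguity_is_fatal  => 16,
	exhaustion_is_fatal => 32,
};

# Hypothetical helper: true if $flag is set within $options.
sub has_flag
{
	my($options, $flag) = @_;

	return ($options & $flag) == $flag ? 1 : 0;
}

# The same combination as used in the synopsis.
my($options) = nesting_is_fatal | print_warnings;

printf "Combined value: %d\n", $options;
printf "nesting_is_fatal set: %d\n", has_flag($options, nesting_is_fatal);
printf "overlap_is_fatal set: %d\n", has_flag($options, overlap_is_fatal);
```

The combined value is what gets passed as the 'options' parameter to new().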
See "Synopsis".
See scripts/traverse.pl. It is a copy of t/html.t with tree-walking code instead of test code.
The parsed output is held in a tree managed by Tree.
The tree always has a root node, which has nothing to do with the input data. So, even an empty input string will produce a tree with 1 node. This root has an empty hashref associated with it.
Nodes have a name and a hashref of attributes.
The name indicates the type of node. Names are one of these literals:
For 'open' and 'close', the delimiter is given by the value of the 'text' key in the hashref.
The (key => value) pairs in the hashref are:
If the node name is 'open' or 'close', $string is the delimiter.
If the node name is 'text', $string is the verbatim text from the document.
Verbatim means, for example, that backslashes in the input are preserved.
Try:
perl -Ilib scripts/samples.pl info
The tree does not preserve the nested nature of HTML/XML.
Post-processing (valid) HTML could easily generate another view of the data.
But anyway, to get perfect HTML you'd be grabbing the output of Marpa::R2::HTML, right?
See scripts/traverse.pl and t/html.t for a trivial HTML parser.
http://savage.net.au/Marpa.html.
That page has a long list of links.
This runs both standard and author tests:
shell> perl Build.PL; ./Build; ./Build authortest
See https://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2014/11/delimiter.html.
Perhaps this could be a sub-class?
Text::Balanced.
Tree and Tree::Persist.
MarpaX::Demo::SampleScripts - for various usages of Marpa::R2, but not of this module.
The file CHANGES was converted into Changelog.ini by Module::Metadata::Changes.
Version numbers < 1.00 represent development versions. From 1.00 up, they are production versions.
Thanks to Jeffrey Kegler, who wrote Marpa and Marpa::R2.
And thanks to rns (Ruslan Shvedov) for writing the grammar for double-quoted strings used in MarpaX::Demo::SampleScripts's scripts/quoted.strings.02.pl. I adapted it to HTML (see scripts/quoted.strings.05.pl in that module), and then incorporated the grammar into GraphViz2::Marpa, and - after more extensions - into this module.
Lastly, thanks to Robert Rothenberg for Const::Exporter, a module which works the same way Perl does.
https://github.com/ronsavage/Text-Balanced-Marpa
Email the author, or log a bug on RT:
https://rt.cpan.org/Public/Dist/Display.html?Name=Text::Balanced::Marpa.
Text::Balanced::Marpa was written by Ron Savage <ron@savage.net.au> in 2014.
Marpa's homepage: http://savage.net.au/Marpa.html.
My homepage: http://savage.net.au/.
Australian copyright (c) 2014, Ron Savage.
All Programs of mine are 'OSI Certified Open Source Software'; you can redistribute them and/or modify them under the terms of The Artistic License 2.0, a copy of which is available at: http://opensource.org/licenses/alphabetical.