NAME

Text::Balanced::Marpa - Extract delimited text sequences from strings

Synopsis

#!/usr/bin/env perl

use strict;
use warnings;

use Text::Balanced::Marpa ':constants';

# -----------

my($count)  = 0;
my($parser) = Text::Balanced::Marpa -> new
(
	open    => ['<:' ,'[%'],
	close   => [':>', '%]'],
	options => nesting_is_fatal | print_warnings,
);
my(@text) =
(
	q|<: a :>|,
	q|a [% b <: c :> d %] e|,
	q|a <: b <: c :> d :> e|, # nesting_is_fatal triggers an error here.
);

my($result);

for my $text (@text)
{
	$count++;

	print "Parsing |$text|\n";

	$result = $parser -> parse(text => \$text);

	print join("\n", @{$parser -> tree -> tree2string}), "\n";
	print "Parse result: $result (0 is success)\n";

	if ($count == 3)
	{
		print "Deliberate error: Failed to parse |$text|\n";
		print 'Error number: ', $parser -> error_number, '. Error message: ',
				$parser -> error_message, "\n";
	}

	print '-' x 50, "\n";
}

See scripts/synopsis.pl.

This is the printout of synopsis.pl:

Parsing |<: a :>|
Parsed text:
root. Attributes: {}
   |--- open. Attributes: {text => "<:"}
   |   |--- string. Attributes: {text => " a "}
   |--- close. Attributes: {text => ":>"}
Parse result: 0 (0 is success)
--------------------------------------------------
Parsing |a [% b <: c :> d %] e|
Parsed text:
root. Attributes: {}
   |--- string. Attributes: {text => "a "}
   |--- open. Attributes: {text => "[%"}
   |   |--- string. Attributes: {text => " b "}
   |   |--- open. Attributes: {text => "<:"}
   |   |   |--- string. Attributes: {text => " c "}
   |   |--- close. Attributes: {text => ":>"}
   |   |--- string. Attributes: {text => " d "}
   |--- close. Attributes: {text => "%]"}
   |--- string. Attributes: {text => " e"}
Parse result: 0 (0 is success)
--------------------------------------------------
Parsing |a <: b <: c :> d :> e|
Error: Parse failed. Opened delimiter <: again before closing previous one
Text parsed so far:
root. Attributes: {}
   |--- string. Attributes: {text => "a "}
   |--- open. Attributes: {text => "<:"}
       |--- string. Attributes: {text => " b "}
Parse result: 1 (0 is success)
Deliberate error: Failed to parse |a <: b <: c :> d :> e|
Error number: 2. Error message: Opened delimiter <: again before closing previous one
--------------------------------------------------

See also scripts/tiny.pl and scripts/traverse.pl.

Description

Text::Balanced::Marpa provides a Marpa::R2-based parser for extracting delimited text sequences from strings. The text outside the delimiters, the text inside them, and the delimiters themselves are all stored as nodes in a tree managed by Tree.

Nested strings, with the same or different delimiters, are stored as daughters of the nodes which hold the delimiters.
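A minimal sketch of walking that tree, assuming the object returned by tree() is a Tree node offering traverse(), is_root(), depth(), value() and meta(), and that each node's meta hashref carries the text key shown in the synopsis printout. The @lines array and the indentation scheme are illustrative only.

```perl
use strict;
use warnings;

use Text::Balanced::Marpa;

my($parser) = Text::Balanced::Marpa -> new
(
	open  => ['<:', '[%'],
	close => [':>', '%]'],
);
my($text)   = 'a [% b <: c :> d %] e';
my($result) = $parser -> parse(text => \$text);
my(@lines);

if ($result == 0)
{
	# Assumption: traverse() in list context returns the nodes in pre-order.
	for my $node ($parser -> tree -> traverse)
	{
		next if ($node -> is_root);

		# The node's value is its type (open, close or string),
		# and the matched text is stored under the 'text' key of meta().
		my($indent) = '  ' x ($node -> depth() - 1);

		push @lines, $indent . $node -> value() . ': |' . ${$node -> meta()}{text} . '|';
	}
}

print map{"$_\n"} @lines;
```

Nested text ends up indented under the delimiter that opened it, mirroring the daughters in the tree2string() printout.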

This module is a companion to Text::Delimited::Marpa. The differences are discussed in the "FAQ" below.

See the "FAQ" for various topics, including:

o UTF8 handling

See t/utf8.t.

o Escaping delimiters within the text

See t/escapes.t.

o Options to make nested and/or overlapped delimiters fatal errors

See t/colons.t.

o Using delimiters which are part of another delimiter

See t/escapes.t and t/perl.delimiters.

o Processing the tree-structured output

See scripts/traverse.pl.

o Emulating Text::Xslate's use of '<:' and ':>'

See t/colons.t and t/percents.t.

o Implementing a really trivial HTML parser

See t/html.t.

In the same vein, see t/angle.brackets.t, for code where the delimiters are just '<' and '>'.

o Handling multiple sets of delimiters

See t/multiple.delimiters.t.

o Skipping (leading) characters in the input string

See t/skip.prefix.t.

o Implementing hard-to-read text strings as delimiters

See t/silly.delimiters.

Distributions

This module is available as a Unix-style distro (*.tgz).

See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing distros.

Installation

Install Text::Balanced::Marpa as you would any Perl module:

Run:

cpanm Text::Balanced::Marpa

or run:

sudo cpan Text::Balanced::Marpa

or unpack the distro, and then either:

perl Build.PL
./Build
./Build test
sudo ./Build install

or:

perl Makefile.PL
make (or dmake or nmake)
make test
make install

Constructor and Initialization

new() is called as my($parser) = Text::Balanced::Marpa -> new(k1 => v1, k2 => v2, ...).

It returns a new object of type Text::Balanced::Marpa.

Key-value pairs accepted in the parameter list (see corresponding methods for details [e.g. "text([$stringref])"]):

o close => $arrayref

An arrayref of strings, each one a closing delimiter.

The # of elements must match the # of elements in the 'open' arrayref.

See the "FAQ" for details and warnings.

A value for this option is mandatory.

Default: None.

o length => $integer

The maximum length of the input string to process.

This parameter works in conjunction with the pos parameter.

length can also be used as a key in the hash passed to "parse([%hash])".

See the "FAQ" for details.

Default: Calls Perl's length() function on the input string.

o next_few_limit => $integer

This controls how many characters are printed when displaying 'the next few chars'.

It only affects debug output.

Default: 20.

o open => $arrayref

An arrayref of strings, each one an opening delimiter.

The # of elements must match the # of elements in the 'close' arrayref.

See the "FAQ" for details and warnings.

A value for this option is mandatory.

Default: None.

o options => $bit_string

This allows you to turn on various options.

options can also be used as a key in the hash passed to "parse([%hash])".

Default: 0 (nothing is fatal).

See the "FAQ" for details.

o pos => $integer

The offset within the input string at which to start processing.

This parameter works in conjunction with the length parameter.

pos can also be used as a key in the hash passed to "parse([%hash])".

See the "FAQ" for details.

Note: The first character in the input string is at pos == 0.

Default: 0.

o text => $stringref

This is a reference to the string to be parsed. A stringref is used to avoid copying what could potentially be a very long string.

text can also be used as a key in the hash passed to "parse([%hash])".

Default: \''.
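Since options and text can be passed to parse() as well as to new(), one parser object might be reused with different settings per call. The following is a sketch under that assumption; the input string is the third one from the synopsis, and the variable names are illustrative.

```perl
use strict;
use warnings;

use Text::Balanced::Marpa ':constants';

# Construct with the default options (0, i.e. nothing is fatal).
my($parser) = Text::Balanced::Marpa -> new
(
	open  => ['<:', '[%'],
	close => [':>', '%]'],
);
my($text) = 'a <: b <: c :> d :> e';

# Assumption: options given here apply to this call to parse().
my($result) = $parser -> parse(text => \$text, options => nesting_is_fatal);

print "Parse result: $result (0 is success)\n";
print 'Error number: ', $parser -> error_number, "\n";
```

Going by the synopsis printout for the same input, this call would be expected to return 1 and leave error_number() at 2.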

Methods

bnf()

Returns a string containing the grammar constructed based on user input.

close()

Get the arrayref of closing delimiters.

delimiter_action()

Returns a hashref, where the keys are delimiters and the values are either 'open' or 'close'.

delimiter_frequency()

Returns a hashref where the keys are opening and closing delimiters, and the values are the # of times each delimiter appears in the input stream.

The value is incremented for each opening delimiter and decremented for each closing delimiter.
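As an illustration of these two accessors, the sketch below prints each delimiter with its action and its tally after a parse. It assumes both hashrefs are keyed by the delimiter strings exactly as supplied to new(); the fallback 'n/a' is only a guard for this example.

```perl
use strict;
use warnings;

use Text::Balanced::Marpa;

my($parser) = Text::Balanced::Marpa -> new
(
	open  => ['<:', '[%'],
	close => [':>', '%]'],
);
my($text)   = 'a [% b <: c :> d %] e';
my($result) = $parser -> parse(text => \$text);
my($action) = $parser -> delimiter_action;
my($count)  = $parser -> delimiter_frequency;

for my $delimiter (sort keys %$action)
{
	# Guard against a delimiter having no entry in the frequency hashref.
	my($tally) = defined $$count{$delimiter} ? $$count{$delimiter} : 'n/a';

	print "$delimiter => $$action{$delimiter}. Tally: $tally\n";
}
```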

error_message()

Returns the last error or warning message set.

Error messages always start with 'Error: '. Messages never end with "\n".

Parsing error strings is not a good idea, even though this module's format for them is fixed.

See "error_number()".

error_number()

Returns the last error or warning number set.

Warnings have values < 0, and errors have values > 0.

If the value is > 0, the message has the prefix 'Error: ', and if the value is < 0, it has the prefix 'Warning: '. If this is not the case, it's a reportable bug.
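Because the sign of the number carries the severity, callers can branch on it instead of parsing the message. A small sketch follows; classify_status() is a hypothetical helper, not part of this module.

```perl
use strict;
use warnings;

# Hypothetical helper: map a value from error_number() to a severity.
sub classify_status
{
	my($number) = @_;

	return 'ok'      if ($number == 0);
	return 'warning' if ($number < 0);
	return 'error';
}

# Sample values only; in real code pass $parser -> error_number.
for my $number (0, -2, 2)
{
	printf "%2d => %s\n", $number, classify_status($number);
}
```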

Possible values for error_number() and error_message():

o 0 => ""

This is the default value.

o 1/-1 => "Last open delimiter: $lexeme_1. Unexpected closing delimiter: $lexeme_2"

If "error_number()" returns 1 it's an error, and if it returns -1 it's a warning.

You can set the option overlap_is_fatal to make it fatal.

o 2/-2 => "Opened delimiter $lexeme again before closing previous one"

If "error_number()" returns 2 it's an error, and if it returns -2 it's a warning.

You can set the option nesting_is_fatal to make it fatal.

o 3/-3 => "Ambiguous parse. Status: $status. Terminals expected: a, b, ..."

This message is only produced when the parse is ambiguous.

If "error_number()" returns 3 it's an error, and if it returns -3 it's a warning.

You can set the option ambiguity_is_fatal to make it fatal.

o 4 => "Backslash is forbidden as a delimiter character"

This preempts some types of sabotage.