NAME

Text::Balanced::Marpa - Extract delimited text sequences from strings

Synopsis

        #!/usr/bin/env perl

        use strict;
        use warnings;

        use Text::Balanced::Marpa ':constants';

        # -----------

        my($count)  = 0;
        my($parser) = Text::Balanced::Marpa -> new
        (
                open    => ['<:' ,'[%'],
                close   => [':>', '%]'],
                options => nesting_is_fatal | print_warnings,
        );
        my(@text) =
        (
                q|<: a :>|,
                q|a [% b <: c :> d %] e|,
                q|a <: b <: c :> d :> e|, # nesting_is_fatal triggers an error here.
        );

        my($result);

        for my $text (@text)
        {
                $count++;

                print "Parsing |$text|\n";

                $result = $parser -> parse(text => \$text);

                print join("\n", @{$parser -> tree -> tree2string}), "\n";
                print "Parse result: $result (0 is success)\n";

                if ($count == 3)
                {
                        print "Deliberate error: Failed to parse |$text|\n";
                        print 'Error number: ', $parser -> error_number, '. Error message: ',
                                        $parser -> error_message, "\n";
                }

                print '-' x 50, "\n";
        }

See scripts/synopsis.pl.

This is the printout of synopsis.pl:

        Parsing |<: a :>|
        Parsed text:
        root. Attributes: {}
           |--- open. Attributes: {text => "<:"}
           |   |--- string. Attributes: {text => " a "}
           |--- close. Attributes: {text => ":>"}
        Parse result: 0 (0 is success)
        --------------------------------------------------
        Parsing |a [% b <: c :> d %] e|
        Parsed text:
        root. Attributes: {}
           |--- string. Attributes: {text => "a "}
           |--- open. Attributes: {text => "[%"}
           |   |--- string. Attributes: {text => " b "}
           |   |--- open. Attributes: {text => "<:"}
           |   |   |--- string. Attributes: {text => " c "}
           |   |--- close. Attributes: {text => ":>"}
           |   |--- string. Attributes: {text => " d "}
           |--- close. Attributes: {text => "%]"}
           |--- string. Attributes: {text => " e"}
        Parse result: 0 (0 is success)
        --------------------------------------------------
        Parsing |a <: b <: c :> d :> e|
        Error: Parse failed. Opened delimiter <: again before closing previous one
        Text parsed so far:
        root. Attributes: {}
           |--- string. Attributes: {text => "a "}
           |--- open. Attributes: {text => "<:"}
               |--- string. Attributes: {text => " b "}
        Parse result: 1 (0 is success)
        Deliberate error: Failed to parse |a <: b <: c :> d :> e|
        Error number: 2. Error message: Opened delimiter <: again before closing previous one
        --------------------------------------------------

See also scripts/tiny.pl and scripts/traverse.pl.

Description

Text::Balanced::Marpa provides a Marpa::R2-based parser for extracting delimited text sequences from strings. The text outside and inside the delimiters, and the delimiters themselves, are all stored as nodes in a tree managed by Tree.

Nested strings, with the same or different delimiters, are stored as daughters of the nodes which hold the delimiters.

This module is a companion to Text::Delimited::Marpa. The differences are discussed in the "FAQ" below.

See the "FAQ" for various topics, including:

o UTF8 handling

See t/utf8.t.

o Escaping delimiters within the text

See t/escapes.t.

o Options to make nested and/or overlapped delimiters fatal errors

See t/colons.t.

o Using delimiters which are part of another delimiter

See t/escapes.t and t/perl.delimiters.

o Processing the tree-structured output

See scripts/traverse.pl.

o Emulating Text::Xslate's use of '<:' and ':>'

See t/colons.t and t/percents.t.

o Implementing a really trivial HTML parser

See t/html.t.

In the same vein, see t/angle.brackets.t for code where the delimiters are just '<' and '>'.

o Handling multiple sets of delimiters

See t/multiple.delimiters.t.

o Skipping (leading) characters in the input string

See t/skip.prefix.t.

o Implementing hard-to-read text strings as delimiters

See t/silly.delimiters.

Distributions

This module is available as a Unix-style distro (*.tgz).

See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing distros.

Installation

Install Text::Balanced::Marpa as you would any Perl module:

Run:

        cpanm Text::Balanced::Marpa

or run:

        sudo cpan Text::Balanced::Marpa

or unpack the distro, and then either:

        perl Build.PL
        ./Build
        ./Build test
        sudo ./Build install

or:

        perl Makefile.PL
        make (or dmake or nmake)
        make test
        make install

Constructor and Initialization

new() is called as my($parser) = Text::Balanced::Marpa -> new(k1 => v1, k2 => v2, ...).

It returns a new object of type Text::Balanced::Marpa.

Key-value pairs accepted in the parameter list (see corresponding methods for details [e.g. "text([$stringref])"]):

o close => $arrayref

An arrayref of strings, each one a closing delimiter.

The # of elements must match the # of elements in the 'open' arrayref.

See the "FAQ" for details and warnings.

A value for this option is mandatory.

Default: None.

o length => $integer

The maximum length of the input string to process.

This parameter works in conjunction with the pos parameter.

length can also be used as a key in the hash passed to "parse([%hash])".

See the "FAQ" for details.

Default: Calls Perl's length() function on the input string.

o next_few_limit => $integer

This controls how many characters are printed when displaying 'the next few chars'.

It only affects debug output.

Default: 20.

o open => $arrayref

An arrayref of strings, each one an opening delimiter.

The # of elements must match the # of elements in the 'close' arrayref.

See the "FAQ" for details and warnings.

A value for this option is mandatory.

Default: None.

o options => $bit_string

This allows you to turn on various options.

options can also be used as a key in the hash passed to "parse([%hash])".

Default: 0 (nothing is fatal).

See the "FAQ" for details.

o pos => $integer

The offset within the input string at which to start processing.

This parameter works in conjunction with the length parameter.

pos can also be used as a key in the hash passed to "parse([%hash])".

See the "FAQ" for details.

Note: The first character in the input string is at pos == 0.

Default: 0.

o text => $stringref

This is a reference to the string to be parsed. A stringref is used to avoid copying what could potentially be a very long string.

text can also be used as a key in the hash passed to "parse([%hash])".

Default: \''.
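Two of the constraints on 'open' and 'close' can be checked before calling new(): the arrayrefs must have the same # of elements, and (per error number 4 under "error_number()") a backslash is forbidden in a delimiter. The following is only a sketch of such a check. The helper delimiter_problems() is hypothetical and is not part of Text::Balanced::Marpa, and its messages are not the module's own.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Balanced::Marpa.
# Returns a list of problems found in the two delimiter arrayrefs.
# An empty list means neither documented rule was violated.

sub delimiter_problems
{
	my($open, $close) = @_;
	my(@problem);

	# Rule 1: 'open' and 'close' must have the same # of elements.

	if (scalar(@$open) != scalar(@$close) )
	{
		push @problem, 'open and close must have the same number of elements';
	}

	# Rule 2: A backslash is forbidden as a delimiter character.

	for my $delimiter (@$open, @$close)
	{
		if (index($delimiter, '\\') >= 0)
		{
			push @problem, "Backslash found in delimiter |$delimiter|";
		}
	}

	return @problem;
}

# One closing delimiter is missing here.

print "$_\n" for delimiter_problems(['<:', '[%'], [':>']);

# The opening delimiter contains a backslash.

print "$_\n" for delimiter_problems(['\\{'], ['}']);
```

An empty return list only means these two rules are satisfied. The module may apply further checks of its own, so the result of new() and parse() is still what counts.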

Methods

bnf()

Returns a string containing the grammar constructed based on user input.

close()

Get the arrayref of closing delimiters.

delimiter_action()

Returns a hashref, where the keys are delimiters and the values are either 'open' or 'close'.

delimiter_frequency()

Returns a hashref where the keys are opening and closing delimiters, and the values are the # of times each delimiter appears in the input stream.

The value is incremented for each opening delimiter and decremented for each closing delimiter.

error_message()

Returns the last error or warning message set.

Error messages always start with 'Error: '. Messages never end with "\n".

Parsing error strings is not a good idea, even though this module's format for them is fixed.

See "error_number()".

error_number()

Returns the last error or warning number set.

Warnings have values < 0, and errors have values > 0.

If the value is > 0, the message has the prefix 'Error: ', and if the value is < 0, it has the prefix 'Warning: '. If this is not the case, it's a reportable bug.
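Since the sign of the number carries the severity, calling code can branch on it rather than inspect the message text. A minimal sketch follows. The helper severity() is hypothetical and is not provided by the module. It relies only on the sign convention described above.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Text::Balanced::Marpa.
# Maps a value as returned by error_number() to a severity label:
# > 0 is an error, < 0 is a warning, and 0 means nothing was reported.

sub severity
{
	my($number) = @_;

	return 'error'   if ($number > 0);
	return 'warning' if ($number < 0);
	return 'ok';
}

print "$_ => ", severity($_), "\n" for (0, -2, 2);
```

With a parser object in hand, the call would be severity($parser -> error_number).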

Possible values for error_number() and error_message():

o 0 => ""

This is the default value.

o 1/-1 => "Last open delimiter: $lexeme_1. Unexpected closing delimiter: $lexeme_2"

If "error_number()" returns 1 it's an error, and if it returns -1 it's a warning.

You can set the option overlap_is_fatal to make it fatal.

o 2/-2 => "Opened delimiter $lexeme again before closing previous one"

If "error_number()" returns 2 it's an error, and if it returns -2 it's a warning.

You can set the option nesting_is_fatal to make it fatal.

o 3/-3 => "Ambiguous parse. Status: $status. Terminals expected: a, b, ..."

This message is only produced when the parse is ambiguous.

If "error_number()" returns 3 it's an error, and if it returns -3 it's a warning.

You can set the option ambiguity_is_fatal to make it fatal.

o 4 => "Backslash is forbidden as a delimiter character"

This preempts some types of sabotage.