NAME

MarpaX::Demo::StringParser - Conditional preservation of whitespace while parsing

Synopsis

Typical usage:

        perl -Ilib scripts/parse.pl -d '[node]{color:blue; label: "Node name"}' -r 1 -v 1 -t output.tokens

The following refer to data shipped with the distro:

        perl -Ilib scripts/parse.pl -i data/node.04.ge -r 1 -t node.04.tokens
        diff data/node.04.tokens node.04.tokens

You can use scripts/parse.sh to simplify this process:

        scripts/parse.sh data/node.04.ge node.04.tokens -r 1

See the demo page for sample output.

Also, there is an article based on this module.

Description

This module demonstrates how to use Marpa::R2's capabilities to have Marpa repeatedly pass control back to code in your own module, during the parse, to handle certain cases where you don't want Marpa's default processing to occur.

Specifically, it deals with the classic case of when you wish to preserve whitespace in some contexts, but also want Marpa to discard whitespace in all other contexts.

Note that this module's usage of Marpa's adverbs event and pause should be regarded as an intermediate/advanced technique. For people just beginning to use Marpa, use of the action adverb is the recommended technique.

The article mentioned above discusses important issues regarding the timing sequence of pauses and actions.

All this assumes a relatively recent version of Marpa, one in which its Scanless interface (SLIF) is implemented. All my development was done using Marpa::R2 V 2.064000.

Lastly, MarpaX::Demo::StringParser is a cut-down version of Graph::Easy::Marpa V 2.00. It provides a Marpa-based parser for parts of Graph::Easy::Marpa-style graph definitions, whereas Graph::Easy::Marpa itself handles the whole Graph::Easy::Marpa language.

See "What is the Graph::Easy::Marpa language?" in Graph::Easy::Marpa::Parser for details. And see below, "What is the grammar parsed by this module?", for details of the parts supported by this module.

In pragmatic terms, the code in the current module was developed for inclusion in Graph::Easy::Marpa, which in turn is a pre-processor for the DOT language.

Installation

Install MarpaX::Demo::StringParser as you would for any Perl module:

Run:

        cpanm MarpaX::Demo::StringParser

or run:

        sudo cpan MarpaX::Demo::StringParser

or unpack the distro, and then either:

        perl Build.PL
        ./Build
        ./Build test
        sudo ./Build install

or:

        perl Makefile.PL
        make (or dmake or nmake)
        make test
        make install

Scripts Shipped with this Module

All scripts are shipped in the scripts/ directory.

o copy.config.pl

This is for use by the author. It just copies the config file out of the distro, so the script generate.index.pl (which uses HTML template stuff) can find it.

o find.config.pl

This cross-checks the output of copy.config.pl.

o ge2tokens.pl

This transforms all data/*.ge files into their corresponding data/*.tokens files.

o generate.demo.sh

This runs:

o perl -Ilib scripts/ge2tokens.pl
o perl -Ilib ~/bin/ge2svg.pl

See the article mentioned in the Synopsis for details on this script. Briefly, it is not included in the distro because it has Graph::Easy::Marpa::Renderer::GraphViz2 as a pre-req.

o perl -Ilib scripts/generate.index.pl

And then generate.demo.sh copies the demo output to my dev web server's doc root, where I can cross-check it.

o generate.index.pl

This constructs a web page containing all the html/*.svg files.

o parse.pl

This runs a parse on a single input file. Run 'parse.pl -h' for details.

o parse.sh

This simplifies running parse.pl.

o pod2html.sh

This converts all lib/*.pm files into their corresponding *.html versions, for proof-reading and uploading to my real web site.

Constructor and Initialization

new() is called as my($parser) = MarpaX::Demo::StringParser -> new(k1 => v1, k2 => v2, ...).

It returns a new object of type MarpaX::Demo::StringParser.

Key-value pairs accepted in the parameter list (see corresponding methods for details [e.g. description($graph)]):

o description => '[node.1]->[node.2]'

Specify a string for the graph definition.

You are strongly encouraged to surround this string with '...' to protect it from your shell if using this module directly from the command line.

See also the input_file key which reads the graph from a file.

The description key takes precedence over the input_file key.

Default: ''.

o input_file => $graph_file_name

Read the graph definition from this file.

See also the description key to read the graph from the command line.

The whole file is slurped in as a single graph.

The first few lines of the file can start with /^\s*#/, and will be discarded as comments.

The description key takes precedence over the input_file key.

Default: ''.

o report_tokens => $Boolean

When set to 1, calls "report()" to print the items recognized by the parser.

Default: 0.

o token_file => $file_name

The name of the CSV file in which parsed tokens are to be saved.

If '', the file is not written.

Default: ''.

o verbose => $integer

Prints more (1, 2) or fewer (0) progress messages.

Default: 0.
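
As a rough illustration of how the input_file option treats a file (leading comment lines are discarded and the rest is slurped in as a single graph), consider the sketch below. It is not the module's own code: the sub name slurp_graph and the in-memory sample lines are hypothetical, and joining lines with a single space is an assumption based on the "Line-breaks" entry in the FAQ.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the lines of a *.ge file into one graph string.
sub slurp_graph
{
    my(@line) = @_;

    chomp @line;

    # Discard only the leading run of comment lines.
    shift @line while (@line && $line[0] =~ /^\s*#/);

    # Join the remaining lines, replacing each line-break with a single space.
    return join(' ', @line);
}

# Sample data standing in for the contents of a file.
my(@sample) = ("# A comment\n", "  # Another\n", "[node.1]\n", "-> {label: a # b}\n", "[node.2]\n");
my($graph)  = slurp_graph(@sample);

print "$graph\n";
```

Note that the '#' inside the label survives, since only the first few lines of the file are examined for comments.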

Methods

attribute_list($attribute_list)

Returns nothing.

Processes the attribute string found when Marpa pauses during the processing of a set of attributes.

Then, pushes these attributes onto a stack.

The stack's elements are documented below in "FAQ" under "How is the parsed graph stored in RAM?".
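
To make the splitting step concrete, here is a simplified sketch of breaking an attribute string into key/value pairs while respecting escaped and quoted ';' characters, following the rules given in the FAQ. It is an illustration only, not the code of attribute_list(): the sub name split_attributes is hypothetical, and quotes and escapes are deliberately left in place in the values.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's attribute_list(): split the text
# found between '{' and '}' into [key, value] pairs.
sub split_attributes
{
    my($text)  = @_;
    my(@field) = ('');

    # Tokens: an escaped char, a quoted run, a ';' separator, plain text,
    # or (as a fallback) any single char, e.g. an unbalanced quote.
    while ($text =~ /\G(\\.|"(?:\\.|[^"\\])*"|'(?:\\.|[^'\\])*'|;|[^\\"';]+|.)/gcs)
    {
        if ($1 eq ';')
        {
            push @field, ''; # An unprotected ';' starts a new pair.
        }
        else
        {
            $field[-1] .= $1;
        }
    }

    my(@pair);

    for my $field (@field)
    {
        # Keys match /^[a-zA-Z_]+$/. Leading and trailing spaces are dropped.
        push @pair, [$1, $2] if ($field =~ /^\s*([a-zA-Z_]+)\s*:\s*(.*?)\s*$/s);
    }

    return @pair;
}

my(@pair) = split_attributes(q|color: red; label: Green node; tooltip: "a; b"; comment: c\;d;|);

print "$_->[0] => $_->[1]\n" for @pair;
```

The trailing ';' produces an empty field, which is skipped, matching the rule that a trailing ';' is optional.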

description([$graph])

Here, the [] indicate an optional parameter.

Gets or sets the graph string to be parsed.

See also the "input_file([$graph_file_name])" method.

The value supplied to the description() method takes precedence over the value read from the input file.

Also, description is an option to new().

edge($edge_name)

Returns nothing.

Processes the edge name string returned by Marpa::R2 when it pauses during the processing of '->' or '--'.

Pushes this edge name onto a stack.

The stack's elements are documented below in the "FAQ" under "How is the parsed graph stored in RAM?".

find_terminator($stringref, $target, $start)

Returns the offset into $stringref at which the $target is found.

$stringref is a reference to the input string (stream).

$target is a regexp specifying the closing delimiter to search for.

For attributes, it is qr/}/, and for nodes, $target is qr/]/.

$start is the offset into $stringref at which to start searching. It's assumed to be pointing to the opening delimiter when this method is called, since the value of $start is set by Marpa when it pauses based on the 'pause => before' construct in the grammar.

The return value allows the calling code to extract the substring between the opening and closing delimiters, and to process it in either "attribute_list($attribute_list)" or "node($node_name)".
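
The search can be pictured with the following simplified sketch. It is a stand-in under stated assumptions, not the module's find_terminator(): the name find_close is hypothetical, it returns -1 when nothing is found (an assumption), and it treats backslash escapes and quoted runs as protecting the delimiter, per the rules in the FAQ.

```perl
use strict;
use warnings;

# Hypothetical stand-in for find_terminator(): scan from the opening
# delimiter at $start and return the offset of the first unescaped,
# unquoted char matching $target, or -1 if there is none.
sub find_close
{
    my($stringref, $target, $start) = @_;
    my($quote) = '';

    for (my $i = $start + 1; $i < length($$stringref); $i++)
    {
        my($char) = substr($$stringref, $i, 1);

        if ($char eq '\\')
        {
            $i++; # Skip the escaped char.
        }
        elsif ($quote)
        {
            $quote = '' if ($char eq $quote); # Closing quote.
        }
        elsif ($char =~ /["']/)
        {
            $quote = $char; # Opening quote.
        }
        elsif ($char =~ $target)
        {
            return $i;
        }
    }

    return -1;
}

my($input) = q|[node "a]b" \] x] -> [node.2]|;
my($start) = 0;
my($end)   = find_close(\$input, qr/]/, $start);

# Extract the substring between the opening and closing delimiters.
my($name) = substr($input, $start + 1, $end - $start - 1);

print "Offset: $end. Node name: <$name>\n";
```

Here the first two ']' characters are skipped, one because it is inside double-quotes and one because it is escaped, so the node name is everything up to the third.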

format_token($item)

Returns a string containing a nicely formatted version of the keys and values of the hashref $item.

$item must be an element of the stack of tokens output by the parse.

The stack's elements are documented below in "FAQ" under "How is the parsed graph stored in RAM?".

generate_token_file($file_name)

Returns nothing.

Writes a CSV file of tokens output by the parse if new() was called with the token_file option.

get_graph_from_command_line()

If the caller has requested a graph be parsed from the command line, with the description option to new(), get it now.

Called as appropriate by run().

get_graph_from_file()

If the caller has requested a graph be parsed from a file, with the input_file option to new(), get it now.

Called as appropriate by run().

grammar()

Returns an object of type Marpa::R2::Scanless::G.

graph_text([$graph])

Here, the [] indicate an optional parameter.

Returns the value of the graph definition string, from either the command line or a file.

input_file([$graph_file_name])

Here, the [] indicate an optional parameter.

Gets or sets the name of the file to read the graph definition from.

See also the "description([$graph])" method.

The whole file is slurped in as a single graph.

The first few lines of the file can start with /^\s*#/, and will be discarded as comments.

The value supplied to the description() method takes precedence over the value read from the input file.

Also, input_file is an option to new().

node($node_name)

Returns nothing.

Processes the node name string returned by Marpa::R2 when it pauses during the processing of '[' ... ']'.

Then, pushes this node name onto a stack.

The stack's elements are documented below in the "FAQ" under "How is the parsed graph stored in RAM?".
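
The whitespace and quoting rules for node names (see the FAQ) can be sketched as below. This is an illustrative helper only, assuming that trimming happens before quote stripping; the sub name clean_name is hypothetical, and backslash escapes are left untouched.

```perl
use strict;
use warnings;

# Hypothetical helper: tidy the raw text found between '[' and ']'.
sub clean_name
{
    my($name) = @_;

    $name =~ s/^\s+|\s+$//g;        # Drop leading and trailing spaces.
    $name =~ s/^(["'])(.*)\1$/$2/s; # Strip one balanced pair of quotes.

    return $name;
}

for my $raw ('     From here     ', 'node 1', '"[node]"', q|' padded '|, '')
{
    printf "<%s> => <%s>\n", $raw, clean_name($raw);
}
```

Internal spaces are kept in every case, and the spaces inside the quoted sample survive because the quotes protect them. The empty string corresponds to the anonymous node '[]'.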

process()

Returns the result of calling Marpa's value() method.

Does the real work. Called by run() after processing the user's options.

recce()

Returns an object of type Marpa::R2::Scanless::R.

renumber_items()

Ensures each item in the stack has a sequential number 1 .. N.
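
A minimal sketch of the idea, using a plain Perl array as a stand-in for the Set::Array-managed stack (the sample items and their initial counts are made up for illustration):

```perl
use strict;
use warnings;

# A plain array stands in for the stack of items.
my(@item) =
(
    {count => 7, name => 'node.1', type => 'node',      value => ''},
    {count => 7, name => 'color',  type => 'attribute', value => 'red'},
    {count => 9, name => '->',     type => 'edge',      value => ''},
);

# Give each item a sequential number 1 .. N, in stack order.
my($count) = 0;

$_->{count} = ++$count for @item;

printf "%d: %s\n", $_->{count}, $_->{type} for @item;
```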

report()

Reports (prints) the list of items recognized by the parser.

report_tokens([0 or 1])

Here, the [] indicate an optional parameter.

Gets or sets the value which determines whether or not to report the items recognized by the parser.

Also, report_tokens is an option to new().

run()

This is the only method the caller needs to call. All parameters are supplied to new().

Returns 0 for success and 1 for failure.

verbose([0 .. 2])

Here, the [] indicate an optional parameter.

Gets or sets the value which determines how many progress reports are printed.

Also, verbose is an option to new().

FAQ

What is the grammar parsed by this module?

It's a cut-down version of the Graph::Easy::Marpa language. See "What is the Graph::Easy::Marpa language?" in Graph::Easy::Marpa::Parser.

Firstly, a summary:

        Element        Syntax
        ---------------------
        Edge names     Either '->' or '--'
        ---------------------
        Node names     1: Delimited by '[' and ']'.
                       2: May be quoted with " or '.
                       3: Escaped characters, using '\', are allowed.
                       4: Internal spaces in node names are preserved even if not quoted.
        ---------------------
        Attributes     1: Delimited by '{' and '}'.
                       2: Within that, any number of "key : value" pairs separated by ';'.
                       3: Values may be quoted with " or ' or '<...>' or '<<table>...</table>>'.
                       4: Escaped characters, using '\', are allowed.
                       5: Internal spaces in attribute values are preserved even if not quoted.
        ---------------------

Note: Both edges and nodes can have attributes.

Note: HTML-like labels trigger special-case processing in Graphviz. See "Why doesn't the parser handle my HTML-style labels?" below.

Demo page:

        http://savage.net.au/Perl-modules/html/marpax.demo.stringparser/
        Graph::Easy::Marpa: http://savage.net.au/Perl-modules/html/graph.easy.marpa/

The latter page utilizes the entire Graph::Easy::Marpa language. See "What is the Graph::Easy::Marpa language?" in Graph::Easy::Marpa::Parser.

And now the details:

o Attributes

Both nodes and edges can have any number of attributes.

Attributes are delimited by '{' and '}'.

These attributes are listed immediately after their owning node or edge.

Each attribute consists of a key:value pair, where ':' must appear literally.

These key:value pairs must be separated by the ';' character. A trailing ';' is optional.

The values for 'key' are reserved words used by Graphviz's attributes. These keys match the regexp /^[a-zA-Z_]+$/.

For the 'value', any printable character can be used.

Some escape sequences have a special meaning within Graphviz.

E.g. if you use [node name] {label: \N}, then if that graph is input to Graphviz's dot, \N will be replaced by the name of the node.

Some literals - ';', '}', '<', '>', '"', "'" - can be used in the attribute's value, but they must satisfy one of these conditions. They must be:

o Escaped using '\'.

Eg: \;, \}, etc.

o Placed inside " ... "
o Placed inside ' ... '
o Placed inside <...>

This does not mean you can use <<Some text>>. See the next point.

o Placed inside <<table> ... </table>>

Using this construct allows you to use HTML entities such as &amp;, &lt;, &gt; and &quot;.

Internal spaces are preserved within an attribute's value, but leading and trailing spaces are not (unless quoted).

Samples:

        [node.1] {color: red; label: Green node}
        -> {penwidth: 5; label: From Here to There}
        [node.2]
        -> {label: "A literal semicolon '\;' in a label"}

Note: That '\;' does not actually need those single-quote characters, since it is within a set of double-quotes.

Note: Attribute values quoted with a balanced pair of single- or double-quotes will have those quotes stripped.

o Comments

The first few lines of the input file can start with /^\s*#/, and will be discarded as comments.

o Daisy-chains

See Wikipedia for the origin of this term.

o Edges

Edges can be daisy-chained by juxtaposition, or by using a comma (','), newline, space, or attributes ('{...}') to separate them.

Hence both of these are valid: '->,->{color:green}' and '->{color:red}->{color:green}'.

See data/edge.02.ge and data/edge.06.ge.

o Groups

Groups can be daisy-chained by juxtaposition, or by using a newline or space to separate them.

o Nodes

Nodes can be daisy-chained by juxtaposition, or by using a comma (','), newline, space, or attributes ('{...}') to separate them.

Hence all of these are valid: '[node.1][node.2]' and '[node.1],[node.2]' and '[node.1]{color:red}[node.2]'.

o Edges

Edge names are either '->' or '--'.

No other edge names are accepted.

Note: The syntax for edges is just a visual clue for the user. Whether the graph is directed or undirected depends on the value of the 'directed' attribute present (explicitly or implicitly) in the input stream. Nevertheless, usage of '->' or '--' must match the nature of the graph, or Graphviz will issue a syntax error.

Samples:

        ->
        --

o Graphs

Graphs are sequences of nodes and edges, in any order.

The sample given just above for attributes is in fact a single graph.

A sample:

        [node]
        [node] ->
        -> {label: Start} -> {color: red} [node.1] {color: green} -> [node.2]
        [node.1] [node.2] [node.3]

For more samples, see the data/*.ge files shipped with the distro.

o Line-breaks

These are converted into a single space.

o Nodes

Nodes are delimited by '[' and ']'.

Within those, any printable character can be used for a node's name.

Some literals - ']', '"', "'" - can be used in the node's name, but they must satisfy one of these conditions. They must be:

o Escaped using '\'

Eg: \].

o Placed inside " ... "
o Placed inside ' ... '

Internal spaces are preserved within a node's name, but leading and trailing spaces are not (unless quoted).

Lastly, the node's name can be empty. I.e.: You use '[]' in the input stream to create an anonymous node.

Samples:

        []
        [node.1]
        [node 1]
        [[node\]]
        ["[node]"]
        [     From here     ] -> [     To there     ]

Note: Node names quoted with a balanced pair of single- or double-quotes will have those quotes stripped.

Does this module handle utf8?

Yes. See the last sample on the demo page.

How is the parsed graph stored in RAM?

Items are stored in an arrayref managed by Set::Array. This arrayref is available via the "items()" method.

Each element in the array is a hashref, listed here in alphabetical order by type.

Note: Items are numbered from 1 up.

o Attributes

An attribute can belong to a node or an edge. An attribute definition of '{color: red;}' would produce a hashref of:

        {
                count => $n,
                name  => 'color',
                type  => 'attribute',
                value => 'red',
        }

An attribute definition of '{color: red; shape: circle}' will produce 2 hashrefs, i.e. 2 sequential elements in the arrayref:

        {
                count => $n,
                name  => 'color',
                type  => 'attribute',
                value => 'red',
        }

        {
                count => $n + 1,
                name  => 'shape',
                type  => 'attribute',
                value => 'circle',
        }

Attribute hashrefs appear in the arrayref immediately after the item (edge or node) to which they belong.

o Edges

An edge definition of '->' would produce a hashref of:

        {
                count => $n,
                name  => '->',
                type  => 'edge',
                value => '',
        }

o Nodes

A node definition of '[Name]' would produce a hashref of:

        {
                count => $n,
                name  => 'Name',
                type  => 'node',
                value => '',
        }

A node can have a definition of '[]', which means it has no name. Such nodes are called anonymous (or invisible) because while they take up space in the output stream, they have no printable or visible characters if the output stream is turned into a graph by Graphviz's dot program.

Each anonymous node will produce at least these 2 hashrefs, one for the node itself and one for its default attribute:

        {
                count => $n,
                name  => '',
                type  => 'node',
                value => '',
        }

        {
                count => $n + 1,
                name  => 'color',
                type  => 'attribute',
                value => 'invis',
        }

You can of course give your anonymous nodes any attributes, but they will be forced to have these attributes.

E.g. If you give it a color, that would become element $n + 2 in the arrayref, and hence that color would override the default color 'invis'. See the output for data/node.03.ge on the demo page.

Node names are case-sensitive in dot, but that does not matter within the context of this module.
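
Since attribute hashrefs always follow their owner, a consumer of the item list can regroup them with a single pass. The sketch below assumes a plain array of hashrefs shaped like those above; the sample data and the @owner structure are hypothetical, and this is not code from the module.

```perl
use strict;
use warnings;

# Hand-built items, as if parsed from: [] {color: red} -> {label: Start} [node.2]
my(@item) =
(
    {count => 1, name => '',       type => 'node',      value => ''},
    {count => 2, name => 'color',  type => 'attribute', value => 'invis'},
    {count => 3, name => 'color',  type => 'attribute', value => 'red'},
    {count => 4, name => '->',     type => 'edge',      value => ''},
    {count => 5, name => 'label',  type => 'attribute', value => 'Start'},
    {count => 6, name => 'node.2', type => 'node',      value => ''},
);

my(@owner);

for my $item (@item)
{
    if ($item->{type} eq 'attribute')
    {
        # Attach to the most recent node or edge. Later values replace
        # earlier ones for the same key.
        $owner[-1]{attr}{$item->{name} } = $item->{value} if (@owner);
    }
    else
    {
        push @owner, {type => $item->{type}, name => $item->{name}, attr => {} };
    }
}

for my $owner (@owner)
{
    my($attr) = join('; ', map("$_: $owner->{attr}{$_}", sort keys %{$owner->{attr} }) );

    printf "%-4s <%s> {%s}\n", $owner->{type}, $owner->{name}, $attr;
}
```

Because the user-supplied color comes after the default 'invis' in the list, processing the items in order gives the anonymous node the color 'red'.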

Why doesn't the parser handle my HTML-style labels?

Traps for young players:

o The <br /> component must include the '/'
o If any tag's attributes use double-quotes, they will be doubled in the CSV output file

That is, just like double-quotes everywhere else.

See http://www.graphviz.org/content/dot-language for details of Graphviz's HTML-like syntax.

See data/table.*.ge for a set of examples.
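
The doubling of double-quotes mentioned above is the usual CSV quoting convention. A minimal sketch of that convention follows; the sub name csv_field is hypothetical, the sample label is made up, and this is not necessarily how the module itself writes the token file.

```perl
use strict;
use warnings;

# Quote one CSV field: wrap it in double-quotes and double any embedded ones.
sub csv_field
{
    my($text) = @_;

    $text =~ s/"/""/g;

    return qq|"$text"|;
}

my($label) = q|<<table border="0"><tr><td>a<br />b</td></tr></table>>|;

print csv_field($label), "\n";
```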

Why do I get error messages like the following?

        Error: <stdin>:1: syntax error near line 1
        context: digraph >>>  Graph <<<  {

Graphviz reserves some words as keywords, meaning they can't be used as an ID, e.g. for the name of the graph. So, don't do this:

        strict graph graph{...}
        strict graph Graph{...}
        strict graph strict{...}
        etc...

Likewise for non-strict graphs, and digraphs. You can however add double-quotes around such reserved words:

        strict graph "graph"{...}

Even better, use a more meaningful name for your graph...

The keywords are: node, edge, graph, digraph, subgraph and strict. Compass points are not keywords.

See keywords in the discussion of the syntax of DOT for details.
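
If graph names are generated by a program, a small guard can add the double-quotes automatically. The sketch below is illustrative only: the sub name safe_id is hypothetical, and it checks nothing except the keywords listed above (other characters that would need quoting in DOT are not handled).

```perl
use strict;
use warnings;

# The keywords listed above. Graphviz treats them case-insensitively,
# which is why 'Graph' fails just as 'graph' does.
my(%keyword) = map(($_ => 1), qw/node edge graph digraph subgraph strict/);

# Hypothetical helper: double-quote a graph name if it is a keyword.
sub safe_id
{
    my($name) = @_;

    return $keyword{lc $name} ? qq|"$name"| : $name;
}

print 'strict graph ', safe_id($_), "{...}\n" for qw/graph Graph strict my_graph/;
```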

Where are the action subs named in the grammar?

In MarpaX::Demo::StringParser::Actions.

What is the homepage of Marpa?

http://jeffreykegler.github.io/Ocean-of-Awareness-blog/.

How do I reconcile Marpa's approach with classic lexing and parsing?

I've included in a recent article a section called Constructing a Mental Picture of Lexing and Parsing which is aimed at helping us think about this issue.

How did you generate the html/*.svg files?

With a private script which uses Graph::Easy::Marpa::Renderer::GraphViz2 V 2.00. This script is not shipped in order to avoid a dependency on that module. Also, another private script which validates Build.PL and Makefile.PL would complain about the missing dependency.

See the demo page for details.

Machine-Readable Change Log

The file Changes was converted into Changelog.ini by Module::Metadata::Changes.

Version Numbers

Version numbers < 1.00 represent development versions. From 1.00 up, they are production versions.

Repository

https://github.com/ronsavage/MarpaX-Demo-StringParser

Support

Email the author, or log a bug on RT:

https://rt.cpan.org/Public/Dist/Display.html?Name=MarpaX::Demo::StringParser.

Author

MarpaX::Demo::StringParser was written by Ron Savage <ron@savage.net.au> in 2013.

Home page: http://savage.net.au/.

Copyright

Australian copyright (c) 2013, Ron Savage.

        All Programs of mine are 'OSI Certified Open Source Software';
        you can redistribute them and/or modify them under the terms of
        The Artistic License, a copy of which is available at:
        http://www.opensource.org/licenses/index.html