NAME

MarpaX::Languages::SVG::Parser - A nested SVG parser, using XML::SAX and Marpa::R2

Synopsis

        #!/usr/bin/env perl

        use strict;
        use warnings;

        use MarpaX::Languages::SVG::Parser;

        # ---------------------------------

        my(%option) =
        (
                input_file_name => 'data/ellipse.01.svg',
        );
        my($parser) = MarpaX::Languages::SVG::Parser -> new(%option);
        my($result) = $parser -> run;

        die "Parse failed\n" if ($result == 1);

        for my $item (@{$parser -> items -> print})
        {
                print sprintf "%-16s  %-16s  %s\n", $$item{type}, $$item{name}, $$item{value};
        }

This script ships as scripts/synopsis.pl. Run it as:

        shell> perl -Ilib scripts/synopsis.pl

See also scripts/parse.file.pl for code which takes command line parameters. For help, run:

        shell> perl -Ilib scripts/parse.file.pl -h

Description

MarpaX::Languages::SVG::Parser uses XML::SAX and Marpa::R2 to parse SVG into an array of hashrefs.

XML::SAX parses the input file, and then certain tags' attribute values are parsed by Marpa::R2. The attribute values treated specially each have their own BNFs. This is why it's called nested parsing.

Examples of these special cases are the path's 'd' attribute and the 'transform' attribute of various tags.

SVG renderings of the attribute-specific BNFs ship as html/*.svg.

See the "FAQ" for details.

Installation

Install MarpaX::Languages::SVG::Parser as you would for any Perl module:

Run:

        cpanm MarpaX::Languages::SVG::Parser

or run:

        sudo cpan MarpaX::Languages::SVG::Parser

or unpack the distro, and then either:

        perl Build.PL
        ./Build
        ./Build test
        sudo ./Build install

or:

        perl Makefile.PL
        make (or dmake or nmake)
        make test
        make install

Constructor and Initialization

new() is called as my($parser) = MarpaX::Languages::SVG::Parser -> new(k1 => v1, k2 => v2, ...).

It returns a new object of type MarpaX::Languages::SVG::Parser.

Key-value pairs accepted in the parameter list (see also the corresponding methods [e.g. "input_file_name([$string])"]):

o input_file_name => $string

The name of the input file to be parsed.

When calling "run(%args)" this is an SVG file (e.g. data/*.svg).

But when calling "test(%args)", this is a text file (e.g. data/*.dat).

This option is mandatory.

Default: ''.

o logger => aLog::HandlerObject

By default, an object of type Log::Handler is created which prints to STDOUT, but given the default setting (maxlevel => 'notice'), nothing is actually printed.

See maxlevel and minlevel below.

Set logger to '' (the empty string) to stop a logger being created.

Default: undef.

o maxlevel => logOption1

This option affects Log::Handler objects.

See the Log::Handler::Levels docs.

Since the "report()" method is always called and outputs at log level info, the first of the following commands produces no output, whereas the second lists all the parse results. The third adds a little extra output.

        shell> perl -Ilib scripts/parse.file.pl -i data/ellipse.01.svg
        shell> perl -Ilib scripts/parse.file.pl -i data/ellipse.01.svg -max info
        shell> perl -Ilib scripts/parse.file.pl -i data/ellipse.01.svg -max debug

The extra output produced by debug includes the input file name and the string which Marpa::R2 is trying to parse. This helps debug the BNFs themselves.

Default: 'notice'.

o minlevel => logOption2

This option affects Log::Handler objects.

See the Log::Handler::Levels docs.

Default: 'error'.

No lower levels are used.

o output_file_name => $string

The name of the CSV file to be written.

Note: This name is only used when calling "run(%args)". It is of course ignored when calling "test(%args)".

If not set, nothing is written.

See data/circle.01.csv and data/utf8.01.csv, which were created by running:

        shell> perl -Ilib scripts/parse.file.pl -i data/circle.01.svg -o data/circle.01.csv
        shell> perl -Ilib scripts/parse.file.pl -i data/utf8.01.svg   -o data/utf8.01.csv

Default: ''.

Methods

attribute($attribute)

Get or set the name of the attribute being processed.

This is only used in testing, in calls from scripts/test.file.pl and (indirectly) scripts/test.fileset.pl.

It is needed because the test files, data/*.dat, do not contain tag/attribute names, and hence the code needs to be told explicitly which attribute it is parsing.

Note: attribute is a parameter to new().

input_file_name([$string])

Here, the [] indicate an optional parameter.

Get or set the name of the file to parse.

When calling "run(%args)" this is an SVG file (e.g. data/*.svg).

But when calling "test(%args)", this is a text file (e.g. data/*.dat).

Note: input_file_name is a parameter to new().

item_count([$new_value])

Here, the [] indicate an optional parameter.

Get or set the counter used to populate the count key in the hashref in the array of parsed tokens.

Used internally.

See the "FAQ" for details.

items()

Returns the instance of Set::Array which manages the array of hashrefs holding the parsed tokens.

$object -> items -> print returns an array ref.

See "Synopsis" in MarpaX::Languages::SVG::Parser for sample code.

See also "new_item($type, $name, $value)".

log($level, $s)

Calls $self -> logger -> log($level => $s) if ($self -> logger).

logger([$log_object])

Here, the [] indicate an optional parameter.

Get or set the log object.

$log_object must be a Log::Handler-compatible object.

To disable logging, just set logger to the empty string.

Note: logger is a parameter to new().

maxlevel([$string])

Here, the [] indicate an optional parameter.

Get or set the value used by the logger object.

This option is only used if an object of type Log::Handler is created. See Log::Handler::Levels.

Note: maxlevel is a parameter to new().

minlevel([$string])

Here, the [] indicate an optional parameter.

Get or set the value used by the logger object.

This option is only used if an object of type Log::Handler is created. See Log::Handler::Levels.

Note: minlevel is a parameter to new().
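How maxlevel and minlevel interact can be sketched as follows. This is only an illustration, assuming Log::Handler uses the usual syslog-style ordering in which debug is the least severe level. The %severity table and the is_logged() helper are made up for this sketch and are not part of the module.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Syslog-style severities: the smaller the number, the more severe.
my %severity =
(
	emergency => 0, alert   => 1, critical => 2, error => 3,
	warning   => 4, notice  => 5, info     => 6, debug => 7,
);

# Hypothetical helper: returns 1 if a message at $level lies within
# the window [minlevel .. maxlevel], and 0 otherwise.
sub is_logged
{
	my($level, $minlevel, $maxlevel) = @_;

	return $severity{$level} >= $severity{$minlevel}
		&& $severity{$level} <= $severity{$maxlevel} ? 1 : 0;
}

# minlevel stays at the documented default, 'error'.
for my $maxlevel (qw(notice info debug))
{
	my(@shown) = grep { is_logged($_, 'error', $maxlevel) } qw(error notice info debug);

	print "maxlevel => $maxlevel shows: @shown\n";
}
```

Under these assumptions, the info-level output of "report()" only appears once maxlevel is raised to info or debug, which matches the behaviour described for the -max option.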

new()

This method is auto-generated by Moo.

new_item($type, $name, $value)

Pushes another hashref onto the stack managed by $self -> items.

See the "FAQ" for details.

output_file_name([$string])

Here, the [] indicate an optional parameter.

Get or set the name of the (optional) CSV file to write.

Note: output_file_name is a parameter to new().

report()

Prints a nicely-formatted report of the items array via the logger.

run(%args)

The method which does all the work.

%args is a hash which is currently not used.

Returns 0 for a successful parse and 1 for failure.

The code dies if Marpa::R2 itself can't parse the given string.

See also "test(%args)".

save()

Save the parsed tokens to a CSV file, but only if an output file name was provided in the call to "new()" or to "output_file_name([$string])".

test(%args)

This method is used by scripts/test.file.pl, and hence indirectly by scripts/test.fileset.pl (which calls scripts/test.file.pl), to run tests.

%args is a hash which is currently not used.

Returns 0 for a successful parse and 1 for failure.

See also "run(%args)".

Files Shipped with this Module

Data Files

These are all shipped in the data/ directory.

o *.log

The logs of running this on each *.svg file:

        shell> perl -Ilib scripts/parse.file.pl -i data/ellipse.02.svg -max debug > data/ellipse.02.log

The *.log files are generated by scripts/svg2.log.pl.

o circle.01.csv

Output from scripts/parse.file.pl

o circle.01.svg

Test data for scripts/parse.file.pl

o d.bnf

This is the grammar for the 'd' attribute of the 'path' tag.

Note: The module does not read this file. A copy of the grammar is stored at the end of the source code for MarpaX::Languages::SVG::Parser::SAXHandler, and read by Data::Section::Simple.

o d.*.dat

Fake data to test d.bnf.

Input for scripts/test.file.pl.

o html/d.svg

This is the graph of the grammar d.bnf.

It was generated by scripts/bnf2graph.pl.

o ellipse.*.svg

Test data for scripts/parse.file.pl

o line.01.svg

Test data for scripts/parse.file.pl

o points.bnf

This grammar is for both the polygon and polyline 'points' attributes.

o points.*.dat

Fake data to test points.bnf.

Input for scripts/test.file.pl.

o polygon.01.svg

Test data for scripts/parse.file.pl

o polyline.01.svg

Test data for scripts/parse.file.pl

o preserveAspectRatio.bnf

This grammar is for the 'preserveAspectRatio' attribute of various tags.

o preserveAspectRatio.*.dat

Fake data to test preserveAspectRatio.bnf.

Input for scripts/test.file.pl.

o preserveAspectRatio.01.svg

Test data for scripts/parse.file.pl

o html/preserveAspectRatio.svg

This is the graph of the grammar preserveAspectRatio.bnf.

It was generated by scripts/bnf2graph.sh.

o rect.*.svg

Test data for scripts/parse.file.pl

o transform.bnf

This grammar is for the 'transform' attribute of various tags.

o transform.*.dat

Fake data to test transform.bnf.

Input for scripts/test.file.pl.

o utf8.01.csv

Output from scripts/parse.file.pl

o utf8.01.log

The log of running:

        shell> perl -Ilib scripts/parse.file.pl -i data/utf8.01.svg -max debug > data/utf8.01.log

o utf8.01.svg

Test data for scripts/parse.file.pl

o viewBox.bnf

This grammar is for the 'viewBox' attribute of various tags.

o viewBox.*.dat

Fake data to test viewBox.bnf.

Input for scripts/test.file.pl.

o html/viewBox.svg

This is the graph of the grammar viewBox.bnf.

It was generated by scripts/bnf2graph.sh.

Scripts

These are all shipped in the scripts/ directory.

o bnf2graph.pl

Finds all data/*.bnf files and converts them into html/*.svg.

        shell> perl -Ilib scripts/bnf2graph.pl

Requires MarpaX::Grammar::GraphViz2.

o copy.config.pl

This is for use by the author. It just copies the config file out of the distro, so the script generate.demo.pl (which uses HTML template stuff) can find it.

o find.config.pl

This cross-checks the output of copy.config.pl.

o float.pl

This was posted by Jean-Damien Durand on the Marpa Google Group, as a demonstration of a grammar for parsing floats and hex numbers.

o generate.demo.pl

Run by generate.demo.sh.

Input files are data/*.bnf and html/*.svg. Output file is html/*.html.

o generate.demo.sh

Runs generate.demo.pl and then copies html/* to my web server's doc dir ($DR).

o number.pl

This also was posted by Jean-Damien Durand on the Marpa Google Group, as a demonstration of a grammar for parsing floats and integers, and binary, octal and hex numbers.

o parse.file.pl

This is the script you'll probably use most frequently. Run with '-h' for help.

o pod2html.sh

This lets me quickly proof-read edits to the docs.

o svg2log.pl

Runs parse.file.pl on each data/*.svg file and saves the output in data/*.log.

o synopsis.pl

The code as per the "Synopsis".

o t/test.fake.data.t

A test script. It parses data/*.dat, which are not SVG files, but just contain attribute value data.

o t/test.real.data.t

A test script. It parses data/*.svg, which are SVG files, and compares the output with the shipped files data/*.log.

o test.file.pl

This runs the code on a single test file (data/*.dat, not an svg file). Try:

        shell> perl -Ilib scripts/test.file.pl -a d -i data/d.30.dat -max debug

o test.fileset.pl

This runs the code on a set of files (data/d.*.dat, data/points.*.dat or data/transform.*.dat). Try:

        shell> perl -Ilib scripts/test.fileset.pl -a transform -max debug

o t/version.t

A test script.

FAQ

See also "FAQ" in MarpaX::Languages::SVG::Parser::Actions.

What exactly does this module do?

It parses SVG files (using XML::SAX), and applies special parsing (using Marpa::R2) to certain attributes of certain tags.

The output is an array of hashrefs, whose structure is described below.

Which SVG attributes are treated specially by this module?

o d

This is the 'd' attribute of the 'path' tag.

o points

This is the 'points' attribute of both the 'polygon' and 'polyline' tags.

o preserveAspectRatio

Various tags can have the 'preserveAspectRatio' attribute.

o transform

Various tags can have the 'transform' attribute.

o viewBox

Various tags can have a 'viewBox' attribute.

Each of these special cases has its own Marpa-style BNF.

SVG renderings of the attribute-specific BNFs ship as html/*.svg.

Where are the specs for SVG and the BNFs?

W3C's SVG specs. In particular, see paths and shapes.

The BNFs have been translated into the syntax used by Marpa::R2. See Marpa::R2::Scanless::DSL for details.

These BNFs are actually stored at the end of the source code of MarpaX::Languages::SVG::Parser::SAXHandler, and loaded one at a time into Marpa using that fine module Data::Section::Simple.

Also, the BNFs are shipped in data/*.bnf, and in html/*.svg.

Is the stuff at the start of the SVG file preserved in the array?

If by 'stuff' you mean:

        <?xml version="1.0" standalone="no"?>
        <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN"
                "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">

Then, no.

I could not get the xml_decl etc events to fire using XML::SAX V 0.99 and XML::SAX::ParserFactory V 1.01.

Why don't you capture comments?

Because Perl instantly segfaults if I try. Code tried in SAXHandler.pm:

        sub comment
        {
                my($self, $element) = @_;
                my($comment) = $$element{Data};

                $self -> log(debug => "Comment: $comment");  # Prints ok.
                $self -> new_item('comment', '-', $comment); # Segfaults.

        }       # End of comment.

Hence - No comment.

How do I get access to this array?

The "Synopsis" contains a runnable program, which ships as scripts/synopsis.pl.

How is the parser's output stored in RAM?

It is stored in an array of hashrefs managed by the Set::Array module.

The hashref structure is documented in the next item.

Using Set::Array is much simpler than using an arrayref. Compare:

        $self -> items -> push
                ({
                        count => $self -> item_count,
                        name  => $name,
                        type  => $type,
                        value => $value,
                });

With:

        $self -> items([]);
        ...
        my($araref) = $self -> items;
        push @$araref,
                {
                        count => $self -> item_count,
                        name  => $name,
                        type  => $type,
                        value => $value,
                };
        $self -> items($araref);

What exactly is the structure of the hashrefs output by the parser?

Firstly, since the following text may be confusing, the very next item in this FAQ, "Annotated output", is designed to clarify things.

Also, it may be necessary to study data/*.log to fully grasp this structure.

Each hashref has these (key => value) pairs:

o count => $integer

This simply counts the number of the hashref within the array, starting from 1.

o name => $string

o tags and attributes

If the type's value matches /^(attribute|tag)$/, then this is the tag name or attribute name from the SVG.

Note: The SAX parser used, XML::SAX, outputs these names with a '{}' prefix. The code strips this prefix.

However, for other items, where the '{...}' is not empty, the specific string is left intact. See data/utf8.01.log for this sample:

        Item  Type              Name              Value
           1  tag               svg               open
           2  attribute         {http://www.w3.org/2000/xmlns/}xlink  http://www.w3.org/1999/xlink
        ...

You have been warned.

o Parser-generated tokens

In the case that this current array element has been generated by parsing the value of the attribute, the name's value depends on the value of the type field.

In all such cases, the array contains a hashref whose type is 'raw', whose name is the attribute's name, and whose value is the attribute's original, unparsed value.

The elements which follow the one of type 'raw' are the output of Marpa parsing that value.

o type => $string

This key can take the following values:

o attribute

This is an attribute for the most-recently opened tag.

The name and value fields are for an attribute which has not been specially parsed.

The next element in the array is necessarily another token from the SVG.

See raw for the other case (i.e. compared to attribute).

o boolean

The value must be 0 or 1.

The name field in this case will be a counter of parameters for the preceding command (see next point).

o command

The name field is the letter (Mm, ..., Zz) for the command itself. In these cases, the value is '-'.

Note: As of V 1.01, in the hashref returned by the action sub command, the value is actually an arrayref of the command's parameters. In V 1.00, the name was '-' and the value was the command letter. This change was made when I stopped pushing hashrefs onto a stack, and converted the return value of the sub from scalar to hashref.

o content

This is the text content for the most recently opened, but still unclosed, tag. It may be the empty string. Likewise, it may contain any number of newlines, since it's copied faithfully from the input *.svg file.

It will actually be followed by an array element flagging the closing of the tag it belongs to.

o float

Any float.

The name field in this case will be a counter of parameters for the preceding command.

o integer

Any integer, but probably always 0, because of the way Marpa handles the BNF.

The name field in this case will be a counter of parameters for the preceding command.

o raw

The name and value fields are for an attribute which has been specially parsed.

The next element in the array is necessarily not another token from the SVG.

Rather, the array elements following this one are output from the Marpa-based parse of the value in the current hashref's value key.

What this means is that if you are scanning the array, and detect a type of raw, all elements in the array (after this one), up to the next item of type =~ /^(attribute|content|raw|tag)$/, must be parameters output from the parse of the value in the current hashref's value key.

There is one exception to the claim that 'The next element in the array is necessarily not another token from the SVG.' Consider:

        <polygon points="350,75  379,161 469,161 397,215
        423,301 350,250 277,301 303,215 231,161 321,161z" />

The 'z' (which itself takes no parameters) at the end of the points is the last thing output for this tag, so the close tag item will be the next array element.

See attribute for the other case (i.e. compared to raw).

o tag

The name and value fields are for a tag.

The name is the name of the tag, and the value is 'open' or 'close'.

o value => $string

The interpretation of this string depends on the value of the type key. Basically:

In the case of tags, this string is either 'open' or 'close'.

In the case of attributes, it is the attribute's value.

In the case of parsed attributes, it is '-' for an SVG command, or else the value of one of that command's parameters.

See the next FAQ item for details.
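The handling of the '{}' prefix described under name can be sketched like this. It is a minimal illustration, not the module's actual code: the helper strip_empty_prefix() is hypothetical, and '{}svg' and '{}width' are made-up sample names in the form the text describes.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Sample names in the '{namespace}local-name' form. The first 2 are
# invented for illustration. The 3rd appears in data/utf8.01.log.
my(@sax_names) = ('{}svg', '{}width', '{http://www.w3.org/2000/xmlns/}xlink');

# Hypothetical helper: remove only an empty '{}' prefix.
# A non-empty '{...}' prefix is left intact.
sub strip_empty_prefix
{
	my($name) = @_;

	$name =~ s/^\{\}//;

	return $name;
}

printf "%s\n", strip_empty_prefix($_) for @sax_names;
```

The first 2 names come out as 'svg' and 'width', while the 3rd is printed unchanged.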

Annotated output

Here is a fragment of data/ellipse.02.svg:

        <path d="M300,200 h-150 a150,150 0 1,0 150,-150 z"
                fill="red" stroke="blue" stroke-width="5" />

And here is the output from the built-in reporting mechanism (see data/ellipse.02.log):

        Item  Type        Name              Value
           1  tag         svg               open
                ...
          27  tag         path              open
          28  raw         d                 M300,200 h-150 a150,150 0 1,0 150,-150 z
          29  command     M                 -
          30  float       1                 300
          31  float       2                 200
          32  command     h                 -
          33  float       1                 -150
          34  command     a                 -
          35  float       1                 150
          36  float       2                 150
          37  integer     3                 0
          38  boolean     4                 1
          39  boolean     5                 0
          40  float       6                 150
          41  float       7                 -150
          42  command     z                 -
          43  attribute   fill              red
          44  attribute   stroke            blue
          45  attribute   stroke-width      5
          46  content     path
          47  tag         path              close
                ...
          66  tag         svg               close

Let's go thru it:

o Item 27 is the open tag for the path

        Type:  tag
        Name:  path
        Value: open

o Item 28 is the path's 1st attribute, 'd'

        Type:  raw
        Name:  d
        Value: M300,200 h-150 a150,150 0 1,0 150,-150 z

But since the type is raw we know both that it's an attribute, and that it must be followed by the parsed output of that value.

Note: Attributes are reported in sorted order, but the parameters after parsing the attributes' values cannot be, because drawing the coordinates of the value is naturally order-dependent.

o Item 29

        Type:  command
        Name:  M
        Value: '-'

This in turn is followed by its respective parameters, if any.

Note: 'Z' and 'z' have no parameters.

o Items 30 .. 31

Two floats. Commas are discarded in the parsing of all special values.

Also, you'll notice they are numbered for your convenience by the name key in their hashrefs.

o Item 32

        Type:  command
        Name:  h
        Value: '-'

o Item 33

This is the float which belongs to 'h'.

o Item 34

        Type:  command
        Name:  a
        Value: '-'

o Items 35 .. 41

The 7 parameters of the 'a' command. You'll notice the parser calls 0 an integer rather than a float. SVG does not care, and neither should you. But, since the code knows it is, it might as well tell you.

The two Boolean flags are picked up explicitly, and the code tells you that, too.

o Item 42

        Type:  command
        Name:  z
        Value: '-'

As stated, it has no following parameters.

o Items 43 .. 46

Items 43 .. 45 are the remaining attributes of the 'path'. None of these are treated specially. Item 46 is the (empty) text content of the 'path'.

o Item 47 is the close tag for the path

        Type:  tag
        Name:  path
        Value: close

And, yes, this does mean self-closing tags, such as 'path', have 2 items in the array, with values of 'open' and 'close'. This allows code scanning the array to know absolutely where the data for the tag finishes.
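To make the walk-through concrete, here is a minimal sketch of code scanning such an array and regrouping each command with its parameters. It does not use the module: items 27 .. 47 from the report above are hard-coded as plain hashrefs, and the variable names are chosen just for this example.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# Items 27 .. 47 from the report above, as [type, name, value] triples.
my(@items) = map { +{type => $_->[0], name => $_->[1], value => $_->[2]} }
(
	[tag       => 'path',         'open'],
	[raw       => 'd',            'M300,200 h-150 a150,150 0 1,0 150,-150 z'],
	[command   => 'M',            '-'],
	[float     => 1,              '300'],
	[float     => 2,              '200'],
	[command   => 'h',            '-'],
	[float     => 1,              '-150'],
	[command   => 'a',            '-'],
	[float     => 1,              '150'],
	[float     => 2,              '150'],
	[integer   => 3,              '0'],
	[boolean   => 4,              '1'],
	[boolean   => 5,              '0'],
	[float     => 6,              '150'],
	[float     => 7,              '-150'],
	[command   => 'z',            '-'],
	[attribute => 'fill',         'red'],
	[attribute => 'stroke',       'blue'],
	[attribute => 'stroke-width', '5'],
	[content   => 'path',         ''],
	[tag       => 'path',         'close'],
);

my(@commands);      # One string per command, e.g. 'M 300 200'.
my($in_parse) = 0;  # True while inside the output which follows a 'raw' item.

for my $item (@items)
{
	if ($$item{type} =~ /^(attribute|content|raw|tag)$/)
	{
		# A token from the SVG itself. Only 'raw' starts parsed output.

		$in_parse = $$item{type} eq 'raw';
	}
	elsif ($in_parse && ($$item{type} eq 'command') )
	{
		push @commands, $$item{name};
	}
	elsif ($in_parse && @commands)
	{
		# A boolean, float or integer: a parameter of the latest command.

		$commands[-1] .= ' ' . $$item{value};
	}
}

print "$_\n" for @commands;
```

The test on the type key is the same regexp given in the description of raw. Attributes whose parsed output contains no command items would need different handling; this sketch simply skips their parameters.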

Why did you use XML::SAX::ParserFactory to parse the SVG?

I find the SAX mechanism for handling XML particularly easy to work with.

I did start with XML::Rules, a great module, for the debugging of the BNFs, but the problem is that too many tags shared attributes (see 'transform' etc above), which made the code awkward.

Also, that module triggers a callback for closing a tag before triggering the call to process the attributes defined by the opening of that tag. This adds yet more complexity.

How are file encodings handled?

I let File::Slurper choose the encoding.

For output, scripts/parse.file.pl uses the pragma:

        use open qw(:std :utf8); # Undeclared streams in UTF-8.

This is needed if reading files encoded in utf-8, such as data/utf8.01.svg, and at the same time trying to print the parsed results to the screen by calling "maxlevel([$string])" with $string set to info or debug.

Without this pragma, data/utf8.01.svg gives you the dread 'Wide character in print...' message.

The pragma is not in the module because it's global, and the end user's program may not want it at all.

In t/test.real.data.t, the call to Path::Tiny's spew() was changed to spew_utf8(), so that UTF-8 in the log is written in raw mode. As a result, the test logs created under Debian and shipped can be safely compared with the logs written when the code is tested under MS Windows.

Lastly, I have unilaterally set the utf8 attribute used by Log::Handler. This is harmless for non-utf-8 files, and is vital for data/utf8.01.svg and similar end-user files. It allows the log output (STDOUT) to be redirected. And indeed, this is what some of the tests do.
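If the global pragma is not wanted in an end-user program, one alternative is to attach an encoding layer to a single filehandle. The following is only a sketch of that idea, and is not what scripts/parse.file.pl does. It writes to an in-memory file so the result can be inspected, and the sample string is invented.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

# A sample string containing a code point above 255, which is what
# triggers 'Wide character in print' on a handle without a layer.
my($wide) = "Caf\x{e9} \x{263a}";

# Attach a UTF-8 layer to this one handle only.
my($buffer) = '';

open(my $fh, '>:encoding(UTF-8)', \$buffer) || die "Can't open in-memory file: $!";
print $fh $wide;
close $fh;

printf "%d characters were written as %d bytes\n", length($wide), length($buffer);
```

For a real program the same layer can be applied to STDOUT with binmode(STDOUT, ':encoding(UTF-8)'), which affects only that handle.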

TODO

This lists some possibly nice-to-have items, but none of them are important:

o Store BNFs in an array

This could be done by reading them once using Data::Section::Simple, in MarpaX::Languages::SVG::Parser::SAXHandler, and caching them, rather than re-reading them each time a BNF is required.

o Re-write grammars to do left-recursion

Well, Jeffrey suggested this, but I don't have the skills (yet).
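The caching idea in the first item could look something like the sketch below. It is not implemented in the module. It uses a hash rather than an array, and load_bnf() is a hypothetical stand-in for reading one grammar via Data::Section::Simple.

```perl
#!/usr/bin/env perl

use strict;
use warnings;

my(%bnf_cache);       # Attribute name => BNF text.
my($load_count) = 0;  # Counts how often the (pretend) expensive read happens.

# Hypothetical stand-in for reading one BNF from the end of the source file.
sub load_bnf
{
	my($attribute) = @_;

	$load_count++;

	return "# BNF for $attribute\n";
}

# Return the BNF for an attribute, reading it only on first use.
sub bnf_for
{
	my($attribute) = @_;

	$bnf_cache{$attribute} //= load_bnf($attribute);

	return $bnf_cache{$attribute};
}

my(@requests) = qw(d transform d d viewBox transform);

bnf_for($_) for @requests;

printf "Lookups: %d. Loads: %d\n", scalar(@requests), $load_count;
```

With the 6 requests shown, each of the 3 distinct grammars is loaded once.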

Machine-Readable Change Log

The file Changes was converted into Changelog.ini by Module::Metadata::Changes.

Version Numbers

Version numbers < 1.00 represent development versions. From 1.00 up, they are production versions.

Repository

https://github.com/ronsavage/MarpaX-Languages-SVG-Parser

Support

Email the author, or log a bug on RT:

https://rt.cpan.org/Public/Dist/Display.html?Name=MarpaX::Languages::SVG::Parser.

Credits

The BNFs are partially based on the W3C's SVG specs, and partially (for numbers) on 2 programs posted by Jean-Damien Durand to the Marpa Google group. The thread is titled 'Space (\s) problems with my grammar'.

Note: Some posts (as of 2013-10-16) in that thread can't be displayed. This may be a temporary issue. See scripts/float.pl and scripts/number.pl for Jean-Damien's original code, which were of considerable help to me.

Specifically, I use number.pl for integers and floats, with these adjustments:

o The code did not handle negative numbers, but an optional sign was already defined, so that was easy
o The code did not handle 0
o The code included hex and octal and binary numbers, which I did not need

Author

MarpaX::Languages::SVG::Parser was written by Ron Savage <ron@savage.net.au> in 2013.

Home page: http://savage.net.au/.

Copyright and Licence

Australian copyright (c) 2013, Ron Savage.

        All Programs of mine are 'OSI Certified Open Source Software';
        you can redistribute them and/or modify them under the terms of
        The Artistic License 2.0, a copy of which is available at:
        http://www.opensource.org/licenses/index.html