MarpaX::Languages::SVG::Parser - A nested SVG parser, using XML::SAX and Marpa::R2
    #!/usr/bin/env perl

    use strict;
    use warnings;

    use MarpaX::Languages::SVG::Parser;

    # ---------------------------------

    my(%option) =
    (
        input_file_name => 'data/ellipse.01.svg',
    );
    my($parser) = MarpaX::Languages::SVG::Parser -> new(%option);
    my($result) = $parser -> run;

    die "Parse failed\n" if ($result == 1);

    for my $item (@{$parser -> items -> print})
    {
        print sprintf "%-16s %-16s %s\n", $$item{type}, $$item{name}, $$item{value};
    }
This script ships as scripts/synopsis.pl. Run it as:
shell> perl -Ilib scripts/synopsis.pl
See also scripts/parse.file.pl for code which takes command line parameters. For help, run:
shell> perl -Ilib scripts/parse.file.pl -h
MarpaX::Languages::SVG::Parser uses XML::SAX and Marpa::R2 to parse SVG into an array of hashrefs.
XML::SAX parses the input file, and then certain tags' attribute values are parsed by Marpa::R2. The attribute values treated specially each have their own BNFs. This is why it's called nested parsing.
Examples of these special cases are the path's 'd' attribute and the 'transform' attribute of various tags.
The SVG versions of the attribute-specific BNFs are here.
See the "FAQ" for details.
Install MarpaX::Languages::SVG::Parser as you would for any Perl module:
Run:
cpanm MarpaX::Languages::SVG::Parser
or run:
sudo cpan MarpaX::Languages::SVG::Parser
or unpack the distro, and then either:
    perl Build.PL
    ./Build
    ./Build test
    sudo ./Build install
or:
    perl Makefile.PL
    make (or dmake or nmake)
    make test
    make install
new() is called as my($parser) = MarpaX::Languages::SVG::Parser -> new(k1 => v1, k2 => v2, ...).
It returns a new object of type MarpaX::Languages::SVG::Parser.
Key-value pairs accepted in the parameter list (see also the corresponding methods [e.g. "input_file_name([$string])"]):
input_file_name names the input file to be parsed.
When calling "run(%args)" this is an SVG file (e.g. data/*.svg).
But when calling "test(%args)", this is a text file (e.g. data/*.dat).
This option is mandatory.
Default: ''.
By default, the logger is an object of type Log::Handler which prints to STDOUT, but given the default setting (maxlevel => 'notice'), nothing is actually printed.
See maxlevel and minlevel below.
Set logger to '' (the empty string) to stop a logger being created.
Default: undef.
The maxlevel option affects Log::Handler objects.
See the Log::Handler::Levels docs.
Since the "report()" method is always called and outputs at log level info, the first of the following commands produces no output, whereas the second lists all the parse results. The third adds a tiny bit to the output.
    shell> perl -Ilib scripts/parse.file.pl -i data/ellipse.01.svg
    shell> perl -Ilib scripts/parse.file.pl -i data/ellipse.01.svg -max info
    shell> perl -Ilib scripts/parse.file.pl -i data/ellipse.01.svg -max debug
The extra output produced by debug includes the input file name and the string which Marpa::R2 is trying to parse. This helps debug the BNFs themselves.
Default: 'notice'.
The minlevel option affects Log::Handler objects.
Default: 'error'.
No lower levels are used.
output_file_name names the CSV file to be written.
Note: This name is only used when calling "run(%args)". It is of course ignored when calling "test(%args)".
If not set, nothing is written.
See data/circle.01.csv and data/utf8.01.csv, which were created by running:
    shell> perl -Ilib scripts/parse.file.pl -i data/circle.01.svg -o data/circle.01.csv
    shell> perl -Ilib scripts/parse.file.pl -i data/utf8.01.svg -o data/utf8.01.csv
Get or set the name of the attribute being processed.
This is only used in testing, in calls from scripts/test.file.pl and (indirectly) scripts/test.fileset.pl.
It is needed because the test files, data/*.dat, do not contain tag/attribute names, and hence the code needs to be told explicitly which attribute it is parsing.
Note: attribute is a parameter to new().
Here, the [] indicate an optional parameter.
Get or set the name of the file to parse.
Note: input_file_name is a parameter to new().
Get or set the counter used to populate the count key in the hashref in the array of parsed tokens.
Used internally.
Returns the instance of Set::Array which manages the array of hashrefs holding the parsed tokens.
$object -> items -> print returns an array ref.
See "Synopsis" in MarpaX::Languages::SVG::Parser for sample code.
See also "new_item($type, $name, $value)".
Calls $self -> logger -> log($level => $s) if ($self -> logger).
Get or set the log object.
$log_object must be a Log::Handler-compatible object.
To disable logging, just set logger to the empty string.
Note: logger is a parameter to new().
Get or set the value used by the logger object.
This option is only used if an object of type Log::Handler is created. See Log::Handler::Levels.
Note: maxlevel is a parameter to new().
Note: minlevel is a parameter to new().
This method is auto-generated by Moo.
Pushes another hashref onto the stack managed by $self -> items.
Get or set the name of the (optional) CSV file to write.
Note: output_file_name is a parameter to new().
Prints a nicely-formatted report of the items array via the logger.
The method which does all the work.
%args is a hash which is currently not used.
Returns 0 for a successful parse and 1 for failure.
The code dies if Marpa::R2 itself can't parse the given string.
See also "test(%args)".
Save the parsed tokens to a CSV file, but only if an output file name was provided in the call to "new()" or to "output_file_name([$string])".
This method is used by scripts/test.fileset.pl, since that calls scripts/test.file.pl, to run tests.
See also "run(%args)".
These are all shipped in the data/ directory.
The logs of running this on each *.svg file:
shell> perl -Ilib scripts/parse.file.pl -i data/ellipse.02.svg -max debug > data/ellipse.02.log
The *.log files are generated by scripts/svg2.log.pl.
Output from scripts/parse.file.pl
Test data for scripts/parse.file.pl
This is the grammar for the 'd' attribute of the 'path' tag.
Note: The module does not read this file. A copy of the grammar is stored at the end of the source code for MarpaX::Languages::SVG::Parser::SAXHandler, and read by Data::Section::Simple.
Fake data to test d.bnf.
Input for scripts/test.file.pl.
This is the graph of the grammar d.bnf.
It was generated by scripts/bnf2graph.pl.
This grammar is for both the polygon and polyline 'points' attributes.
Fake data to test points.bnf.
This grammar is for the 'preserveAspectRatio' attribute of various tags.
Fake data to test preserveAspectRatio.bnf.
This is the graph of the grammar preserveAspectRatio.bnf.
It was generated by scripts/bnf2graph.pl.
This grammar is for the 'transform' attribute of various tags.
Fake data to test transform.bnf.
The log of running:
shell> perl -Ilib scripts/parse.file.pl -i data/utf8.01.svg -max debug > data/utf8.01.log
This grammar is for the 'viewBox' attribute of various tags.
Fake data to test viewBox.bnf.
This is the graph of the grammar viewBox.bnf.
These are all shipped in the scripts/ directory.
Finds all data/*.bnf files and converts them into html/*.svg.
shell> perl -Ilib scripts/bnf2graph.pl
Requires MarpaX::Grammar::GraphViz2.
This is for use by the author. It just copies the config file out of the distro, so the script generate.demo.pl (which uses HTML template stuff) can find it.
This cross-checks the output of copy.config.pl.
This was posted by Jean-Damien Durand on the Marpa Google Group, as a demonstration of a grammar for parsing floats and hex numbers.
Run by generate.demo.sh.
Input files are data/*.bnf and html/*.svg. Output file is html/*.html.
Runs generate.demo.pl and then copies html/* to my web server's doc dir ($DR).
This also was posted by Jean-Damien Durand on the Marpa Google Group, as a demonstration of a grammar for parsing floats and integers, and binary, octal and hex numbers.
This is the script you'll probably use most frequently. Run with '-h' for help.
This lets me quickly proof-read edits to the docs.
Runs parse.file.pl on each data/*.svg file and saves the output in data/*.log.
The code as per the "Synopsis".
A test script. It parses data/*.dat, which are not SVG files, but just contain attribute value data.
A test script. It parses data/*.svg, which are SVG files, and compares them to the shipped files data/*.log.
This runs the code on a single test file (data/*.dat, not an svg file). Try:
shell> perl -Ilib scripts/test.file.pl -a d -i data/d.30.dat -max debug
This runs the code on a set of files (data/d.*.dat, data/points.*.dat or data/transform.*.dat). Try:
shell> perl -Ilib scripts/test.fileset.pl -a transform -max debug
A test script.
See also "FAQ" in MarpaX::Languages::SVG::Parser::Actions.
It parses SVG files (using XML::SAX), and applies special parsing (using Marpa::R2) to certain attributes of certain tags.
The output is an array of hashrefs, whose structure is described below.
This is the 'd' attribute of the 'path' tag.
This is the 'points' attribute of both the 'polygon' and 'polyline' tags.
Various tags can have the 'preserveAspectRatio' attribute.
Various tags can have the 'transform' attribute.
Various tags can have a 'viewBox' attribute.
Each of these special cases has its own Marpa-style BNF.
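The list above can be turned into a small lookup, which may help when writing code that needs to predict whether an attribute will appear as type raw or as type attribute. This is only a sketch: the %special table and the is_special() helper are hypothetical names made up for this illustration, and whether the module checks the tag name at all (rather than just the attribute name) is an assumption.

```perl
use strict;
use warnings;

# Attribute name => tags which carry it. '*' stands for 'various tags'.
# This table is drawn from the list above; it is NOT the module's own data.
my(%special) =
(
    d                   => [qw(path)],
    points              => [qw(polygon polyline)],
    preserveAspectRatio => ['*'],
    transform           => ['*'],
    viewBox             => ['*'],
);

# Hypothetical helper: returns 1 if the attribute would get nested parsing.
sub is_special
{
    my($tag, $attribute) = @_;
    my($tags)            = $special{$attribute} || [];

    return scalar(grep{$_ eq '*' || $_ eq $tag} @$tags) ? 1 : 0;
}

for my $pair (['path', 'd'], ['path', 'fill'], ['polygon', 'points'], ['rect', 'transform'])
{
    printf "%-8s %-10s %s\n", @$pair, is_special(@$pair) ? 'raw' : 'attribute';
}
```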
W3C's SVG specs. In particular, see paths and shapes.
The BNFs have been translated into the syntax used by Marpa::R2. See Marpa::R2::Scanless::DSL for details.
These BNFs are actually stored at the end of the source code of MarpaX::Languages::SVG::Parser::SAXHandler, and loaded one at a time into Marpa using that fine module Data::Section::Simple.
Also, the BNFs are shipped in data/*.bnf, and in html/*.svg.
If by 'stuff' you mean:
    <?xml version="1.0" standalone="no"?>
    <!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
Then, no.
I could not get the xml_decl etc events to fire using XML::SAX V 0.99 and XML::SAX::ParserFactory V 1.01.
Because Perl instantly segfaults if I try. Code tried in SAXHandler.pm:
    sub comment
    {
        my($self, $element) = @_;
        my($comment)        = $$element{Data};

        $self -> log(debug => "Comment: $comment");   # Prints ok.
        $self -> new_item('comment', '-', $comment);  # Segfaults.

    } # End of comment.
Hence - No comment.
The "Synopsis" contains a runnable program, which ships as scripts/synopsis.pl.
It is stored in an array of hashrefs managed by the Set::Array module.
The hashref structure is documented in the next item.
Using Set::Array is much simpler than using an arrayref. Compare:
    $self -> items -> push
    ({
        count => $self -> item_count,
        name  => $name,
        type  => $type,
        value => $value,
    });
With:
    $self -> items([]);
    ...
    my($araref) = $self -> items;
    push @$araref,
    {
        count => $self -> item_count,
        name  => $name,
        type  => $type,
        value => $value,
    };
    $self -> items($araref);
Firstly, since the following text may be confusing, the very next item in this FAQ, "Annotated output", is designed to clarify things.
Also, it may be necessary to study data/*.log to fully grasp this structure.
Each hashref has these (key => value) pairs:
This simply counts the number of the hashref within the array, starting from 1.
If the type's value matches /^(attribute|tag)$/, then this is the tag name or attribute name from the SVG.
Note: The SAX parser used, XML::SAX, outputs these names with a '{}' prefix. The code strips this prefix.
However, for other items, where the '{...}' is not empty, the specific string is left intact. See data/utf8.01.log for this sample:
    Item  Type       Name                                  Value
       1  tag        svg                                   open
       2  attribute  {http://www.w3.org/2000/xmlns/}xlink  http://www.w3.org/1999/xlink
    ...
You have been warned.
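The prefix handling described above can be pictured with a one-line substitution. This is a sketch only: strip_empty_namespace() is a hypothetical name, and the module's own code may well do the job differently.

```perl
use strict;
use warnings;

# Hypothetical helper: remove an empty '{}' namespace prefix,
# but leave any non-empty '{...}' prefix intact.
sub strip_empty_namespace
{
    my($name) = @_;

    $name =~ s/^\{\}//;

    return $name;
}

print strip_empty_namespace('{}svg'), "\n";
print strip_empty_namespace('{http://www.w3.org/2000/xmlns/}xlink'), "\n";
```

The first call prints svg, and the second prints its argument unchanged, matching item 2 in the sample above.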
In the case that this current array element has been generated by parsing the value of the attribute, the name's value depends on the value of the type field.
In all such cases, the array contains a hashref with the name 'raw', and with the value being the tag's original value.
The elements which follow the one named 'raw' are the output of Marpa parsing the value.
This key can take the following values:
This is an attribute for the most-recently opened tag.
The name and value fields are for an attribute which has not been specially parsed.
The next element in the array is necessarily another token from the SVG.
See raw for the other case (i.e. compared to attribute).
The value must be 0 or 1.
The name field in this case will be a counter of parameters for the preceding command (see next point).
The name field is the letter (Mm, ..., Zz) for the command itself. In these cases, the value is '-'.
Note: As of V 1.01, in the hashref returned by the action sub command, the value is actually an arrayref of the command's parameters. In V 1.00, the name was '-' and the value was the command letter. This change was made when I stopped pushing hashrefs onto a stack, and converted the return value of the sub from scalar to hashref.
This is the text content for the most recently opened, but still unclosed, tag. It may be the empty string. Likewise, it may contain any number of newlines, since it's copied faithfully from the input *.svg file.
It will actually be followed by an array element flagging the closing of the tag it belongs to.
Any float.
The name field in this case will be a counter of parameters for the preceding command.
Any integer, but probably always 0, because of the way Marpa handles the BNF.
The name and value fields are for an attribute which has been specially parsed.
The next element in the array is necessarily not another token from the SVG.
Rather, the array elements following this one are output from the Marpa-based parse of the value in the current hashref's value key.
What this means is that if you are scanning the array, and detect a type of raw, all elements in the array (after this one), up to the next item of type =~ /^(attribute|content|raw|tag)$/, must be parameters output from the parse of the value in the current hashref's value key.
There is one exception to the claim that 'The next element in the array is necessarily not another token from the SVG.' Consider:
<polygon points="350,75 379,161 469,161 397,215 423,301 350,250 277,301 303,215 231,161 321,161z" />
The 'z' (which itself takes no parameters) at the end of the points is the last thing output for this tag, so the close tag item will be the next array element.
See attribute for the other case (i.e. compared to raw).
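The scanning rule described above (collect everything after a raw item, up to the next item whose type is attribute, content, raw or tag) can be sketched as follows. The @items data is a hand-made sample in the documented shape, not real parser output, and group_raw() is a hypothetical helper, not part of the module.

```perl
use strict;
use warnings;

# Hand-made sample in the documented hashref shape (count omitted for brevity).
my(@items) = map{ +{type => $$_[0], name => $$_[1], value => $$_[2]} }
(
    ['tag',       'path', 'open'],
    ['raw',       'd',    'M300,200 h-150 z'],
    ['command',   'M',    '-'],
    ['float',     1,      300],
    ['float',     2,      200],
    ['command',   'h',    '-'],
    ['float',     1,      -150],
    ['command',   'z',    '-'],
    ['attribute', 'fill', 'red'],
    ['tag',       'path', 'close'],
);

# Hypothetical helper: pair each raw item with the parsed items which follow it.
sub group_raw
{
    my($items) = @_;
    my(@group);
    my($current);

    for my $item (@$items)
    {
        if ($$item{type} =~ /^(attribute|content|raw|tag)$/)
        {
            # A token from the SVG itself ends any group in progress.
            $current = undef;

            if ($$item{type} eq 'raw')
            {
                $current = {raw => $item, parsed => []};

                push @group, $current;
            }
        }
        elsif ($current)
        {
            push @{$$current{parsed} }, $item;
        }
    }

    return \@group;
}

my($group) = group_raw(\@items);

for my $g (@$group)
{
    printf "%s => %d parsed item(s)\n", $$g{raw}{name}, scalar(@{$$g{parsed} });
}
```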
The name and value fields are for a tag.
The name is the name of the tag, and the value is 'open' or 'close'.
The interpretation of this string depends on the value of the type key. Basically:
In the case of tags, this string is either 'open' or 'close'.
In the case of attributes, it is the attribute's value.
In the case of parsed attributes, it is an SVG command or one of that command's parameters.
See the next FAQ item for details.
Here is a fragment of data/ellipse.02.svg:
<path d="M300,200 h-150 a150,150 0 1,0 150,-150 z" fill="red" stroke="blue" stroke-width="5" />
And here is the output from the built-in reporting mechanism (see data/ellipse.02.log):
    Item  Type       Name          Value
       1  tag        svg           open
    ...
      27  tag        path          open
      28  raw        d             M300,200 h-150 a150,150 0 1,0 150,-150 z
      29  command    M             -
      30  float      1             300
      31  float      2             200
      32  command    h             -
      33  float      1             -150
      34  command    a             -
      35  float      1             150
      36  float      2             150
      37  integer    3             0
      38  boolean    4             1
      39  boolean    5             0
      40  float      6             150
      41  float      7             -150
      42  command    z             -
      43  attribute  fill          red
      44  attribute  stroke        blue
      45  attribute  stroke-width  5
      46  content    path
      47  tag        path          close
    ...
      66  tag        svg           close
Let's go through it:
Type: tag Name: path Value: open
Type: raw Name: d Value: M300,200 h-150 a150,150 0 1,0 150,-150 z
But since the type is raw we know both that it's an attribute, and that it must be followed by the parsed output of that value.
Note: Attributes are reported in sorted order, but the parameters after parsing the attributes' values cannot be, because drawing the coordinates of the value is naturally order-dependent.
Type: command Name: M Value: '-'
This in turn is followed by its respective parameters, if any.
Note: 'Z' and 'z' have no parameters.
Two floats. Commas are discarded in the parsing of all special values.
Also, you'll notice they are numbered for your convenience by the name key in their hashrefs.
Type: command Name: h Value: '-'
This is the float which belongs to 'h'.
Type: command Name: a Value: '-'
The 7 parameters of the 'a' command. You'll notice the parser calls 0 an integer rather than a float. SVG does not care, and neither should you. But, since the code knows it is, it might as well tell you.
The two Boolean flags are picked up explicitly, and the code tells you that, too.
Type: command Name: z Value: '-'
As stated, it has no following parameters.
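To make the command/parameter relationship concrete, here is a sketch which regroups items 29 to 42 of the report above into one entry per command. The commands() helper is a hypothetical name, and the sketch only covers the path 'd' case walked through here.

```perl
use strict;
use warnings;

# Items 29 .. 42 from the report above, as [type, name, value].
my(@parsed) = map{ +{type => $$_[0], name => $$_[1], value => $$_[2]} }
(
    ['command', 'M', '-'], ['float',   1, 300],  ['float',   2, 200],
    ['command', 'h', '-'], ['float',   1, -150],
    ['command', 'a', '-'], ['float',   1, 150],  ['float',   2, 150],
                           ['integer', 3, 0],    ['boolean', 4, 1],
                           ['boolean', 5, 0],    ['float',   6, 150],
                           ['float',   7, -150],
    ['command', 'z', '-'],
);

# Hypothetical helper: attach each parameter to the command preceding it.
sub commands
{
    my($parsed) = @_;
    my(@command);

    for my $item (@$parsed)
    {
        if ($$item{type} eq 'command')
        {
            push @command, {letter => $$item{name}, parameters => []};
        }
        elsif (@command)
        {
            push @{$command[-1]{parameters} }, $$item{value};
        }
    }

    return \@command;
}

my($command) = commands(\@parsed);

print join(' ', $$_{letter}, @{$$_{parameters} }), "\n" for @$command;
```

Each output line holds one command letter followed by its parameters, so 'z' appears on a line by itself.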
The remaining attributes of the 'path'. None of these are treated specially.
Type: tag Name: path Value: close
And, yes, this does mean self-closing tags, such as 'path', have 2 items in the array, with values of 'open' and 'close'. This allows code scanning the array to know absolutely where the data for the tag finishes.
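Because every tag yields both an 'open' and a 'close' item, a simple stack is enough to recover the nesting. The following is a sketch over hand-made sample data in the documented shape; outline() is a hypothetical helper and not part of the module.

```perl
use strict;
use warnings;

# Hand-made sample: only the tag items matter here.
my(@items) = map{ +{type => $$_[0], name => $$_[1], value => $$_[2]} }
(
    ['tag',       'svg',  'open'],
    ['tag',       'g',    'open'],
    ['tag',       'path', 'open'],
    ['attribute', 'fill', 'red'],
    ['tag',       'path', 'close'],
    ['tag',       'g',    'close'],
    ['tag',       'svg',  'close'],
);

# Hypothetical helper: return one indented line per opened tag,
# dying if the open/close items do not pair up.
sub outline
{
    my($items) = @_;
    my(@stack);
    my(@line);

    for my $item (grep{$$_{type} eq 'tag'} @$items)
    {
        if ($$item{value} eq 'open')
        {
            push @line, ('  ' x scalar(@stack) ) . $$item{name};
            push @stack, $$item{name};
        }
        else
        {
            my($open) = pop @stack;

            die "Unbalanced tag: $$item{name}\n" if (! defined($open) || $open ne $$item{name});
        }
    }

    die "Unclosed tag: $stack[-1]\n" if (@stack);

    return \@line;
}

my($line) = outline(\@items);

print "$_\n" for @$line;
```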
I find the SAX mechanism for handling XML particularly easy to work with.
I did start with XML::Rules, a great module, for the debugging of the BNFs, but the problem is that too many tags shared attributes (see 'transform' etc above), which made the code awkward.
Also, that module triggers a callback for closing a tag before triggering the call to process the attributes defined by the opening of that tag. This adds yet more complexity.
I let File::Slurper choose the encoding.
For output, scripts/parse.file.pl uses the pragma:
use open qw(:std :utf8); # Undeclared streams in UTF-8.
This is needed if reading files encoded in utf-8, such as data/utf8.01.svg, and at the same time trying to print the parsed results to the screen by calling "maxlevel([$string])" with $string set to info or debug.
Without this pragma, data/utf8.01.svg gives you the dread 'Wide character in print...' message.
The pragma is not in the module because it's global, and the end user's program may not want it at all.
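For end-user programs which prefer not to use the global pragma, one common alternative is to set an encoding layer on a single file handle. This sketch is not taken from the distro; it simply shows standard Perl I/O layers applied to STDOUT alone.

```perl
use strict;
use warnings;

# Apply a UTF-8 encoding layer to STDOUT only; other handles are untouched.
binmode(STDOUT, ':encoding(UTF-8)');

# A string holding one character above 0xFF (WHITE SMILING FACE).
my($text) = "\x{263A}";

# Without the layer above, this print would trigger 'Wide character in print'.
print "Parsed value: $text\n";
```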
In scripts/tests.real.data.t change the call to Path::Tiny.spew() to spew_utf8() so UTF8 in the log is written in raw mode. Now the test file logs created under Debian and shipped can be safely compared with the logs written when the code is tested under MS Windows.
Lastly, I have unilaterally set the utf8 attribute used by Log::Handler. This is harmless for non-utf-8 files, and is vital for data/utf8.01.svg and similar end-user files. It allows the log output (STDOUT) to be redirected. And indeed, this is what some of the tests do.
This lists some possibly nice-to-have items, but none of them are important:
This could be done by reading them once using Data::Section::Simple, in MarpaX::Languages::SVG::Parser::SAXHandler, and caching them, rather than re-reading them each time a BNF is required.
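The caching idea could look something like the sketch below. To keep it self-contained, load_bnf() is a hypothetical stand-in which fakes the text that would really come from Data::Section::Simple; bnf_for() and %cache are likewise names invented for this illustration.

```perl
use strict;
use warnings;

my(%cache);
my($load_count) = 0;

# Hypothetical stand-in for the real loader. In the module the text would
# come from Data::Section::Simple; here it is faked.
sub load_bnf
{
    my($name) = @_;

    $load_count++;

    return "BNF text for $name";
}

# Return the cached BNF, loading it only on the first request for that name.
sub bnf_for
{
    my($name) = @_;

    $cache{$name} //= load_bnf($name);

    return $cache{$name};
}

bnf_for($_) for qw(d transform d d);

print "Loads: $load_count\n";
```

Four requests for two distinct grammars trigger only two loads, which is the saving the item above is after.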
Well, Jeffrey suggested this, but I don't have the skills (yet).
The file Changes was converted into Changelog.ini by Module::Metadata::Changes.
Version numbers < 1.00 represent development versions. From 1.00 up, they are production versions.
https://github.com/ronsavage/MarpaX-Languages-SVG-Parser
Email the author, or log a bug on RT:
https://rt.cpan.org/Public/Dist/Display.html?Name=MarpaX::Languages::SVG::Parser.
The BNFs are partially based on the W3C's SVG specs, and partially (for numbers) on 2 programs posted by Jean-Damien Durand to the Marpa Google group. The thread is titled 'Space (\s) problems with my grammar'.
Note: Some posts (as of 2013-10-16) in that thread can't be displayed. This may be a temporary issue. See scripts/float.pl and scripts/number.pl for Jean-Damien's original code, which were of considerable help to me.
Specifically, I use number.pl for integers and floats, with these adjustments:
MarpaX::Languages::SVG::Parser was written by Ron Savage <ron@savage.net.au> in 2013.
Home page: http://savage.net.au/.
Australian copyright (c) 2013, Ron Savage.
All Programs of mine are 'OSI Certified Open Source Software'; you can redistribute them and/or modify them under the terms of The Artistic License 2.0, a copy of which is available at: http://www.opensource.org/licenses/index.html