XML::Dataset - Extracts XML into Perl Datasets based upon a simple text profile markup language
version 0.006
    use XML::Dataset;
    use Data::Printer;

    my $example_data = qq(<?xml version="1.0"?>
    <catalog>
       <shop number="1">
          <book id="bk101">
             <author>Gambardella, Matthew</author>
             <title>XML Developer's Guide</title>
             <genre>Computer</genre>
             <price>44.95</price>
             <publish_date>2000-10-01</publish_date>
             <description>An in-depth look at creating applications with XML.</description>
          </book>
          <book id="bk102">
             <author>Ralls, Kim</author>
             <title>Midnight Rain</title>
             <genre>Fantasy</genre>
             <price>5.95</price>
             <publish_date>2000-12-16</publish_date>
             <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description>
          </book>
       </shop>
       <shop number="2">
          <book id="bk103">
             <author>Corets, Eva</author>
             <title>Maeve Ascendant</title>
             <genre>Fantasy</genre>
             <price>5.95</price>
             <publish_date>2000-11-17</publish_date>
             <description>After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.</description>
          </book>
          <book id="bk104">
             <author>Corets, Eva</author>
             <title>Oberon's Legacy</title>
             <genre>Fantasy</genre>
             <price>5.95</price>
             <publish_date>2001-03-10</publish_date>
             <description>In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.</description>
          </book>
       </shop>
    </catalog>
    );

    my $profile = qq(
      catalog
        shop
          book
            author = dataset:title_and_author
            title = dataset:title_and_author
    );

    # Capture the output
    my $output = parse_using_profile( $example_data, $profile );

    # Print using Data::Printer
    p $output;
Provides a simple means of parsing XML to return a selection of information based on a markup profile describing the XML structure and how the structure relates to a logical grouping of information ( a dataset ).
parse_using_profile parses XML based upon a profile.
Input: XML<string>, Profile<string>
I often found myself developing, adjusting and manipulating Perl code, using a variety of packages, to extract XML sources into logical groupings that were relevant to the underlying data, as opposed to a Perl structure of an entire XML source.
Beyond the initial time spent developing an appropriate construct to parse the source data, any future change to the XML output involved additional changes to the code base.
I wanted a simplified solution, one where I could use a simple markup language to describe the context of interest, with manipulation of the data where desired.
I investigated a number of options available in the Perl community to simplify the overall process. Whilst many excellent options are available, I did not find one that provided the level of simplicity that I desired. This module is the result of the effort to fulfil this requirement.
The following example shows the extraction of the title and author information from the example XML document into a dataset called title_and_author.
The XML::Dataset profile follows a similar structure to the XML with elements indented to depict the relationship between entities.
Information that needs to be captured from within an element ( or an attribute ) is referenced using the <value> = dataset:<dataset_name> syntax.
    \ {
        title_and_author [
            [0] { author "Gambardella, Matthew", title "XML Developer's Guide" },
            [1] { author "Ralls, Kim", title "Midnight Rain" },
            [2] { author "Corets, Eva", title "Maeve Ascendant" },
            [3] { author "Corets, Eva", title "Oberon's Legacy" }
        ]
    }
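Once parsed, a dataset can be walked like any other Perl data structure. The sketch below assumes that parse_using_profile returns a hash reference keyed by dataset name, each value being an array reference of record hash references, as the Data::Printer output suggests. The data is hard-coded so that the snippet runs without the module, and @lines is a name chosen purely for illustration.

```perl
use strict;
use warnings;

# Hard-coded stand-in for the parsed result (assumed shape:
# hashref of dataset name => arrayref of record hashrefs).
my $output = {
    title_and_author => [
        { author => 'Gambardella, Matthew', title => "XML Developer's Guide" },
        { author => 'Ralls, Kim',           title => 'Midnight Rain' },
    ],
};

# Build one "title by author" line per record in the dataset.
my @lines;
for my $record ( @{ $output->{title_and_author} } ) {
    push @lines, "$record->{title} by $record->{author}";
}

print "$_\n" for @lines;
```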
This example builds upon the previous one to add a second dataset, title_and_genre. As the example profile shows, multiple datasets can be specified through a space-separated list, as with 'title', which is used for both title_and_author and title_and_genre.
    my $profile = qq(
      catalog
        shop
          book
            author = dataset:title_and_author
            title = dataset:title_and_author dataset:title_and_genre
            genre = dataset:title_and_genre
    );
    \ {
        title_and_author [
            [0] { author "Gambardella, Matthew", title "XML Developer's Guide" },
            [1] { author "Ralls, Kim", title "Midnight Rain" },
            [2] { author "Corets, Eva", title "Maeve Ascendant" },
            [3] { author "Corets, Eva", title "Oberon's Legacy" }
        ],
        title_and_genre [
            [0] { genre "Computer", title "XML Developer's Guide" },
            [1] { genre "Fantasy", title "Midnight Rain" },
            [2] { genre "Fantasy", title "Maeve Ascendant" },
            [3] { genre "Fantasy", title "Oberon's Legacy" }
        ]
    }
XML attributes are treated in the profile as a sub-level key/value of their element. The following example includes the attribute 'id' in the returned datasets. Note how id is indented under book, on the same level as author, title, genre etc.
    my $profile = qq(
      catalog
        shop
          book
            id = dataset:title_and_author dataset:title_and_genre
            author = dataset:title_and_author
            title = dataset:title_and_author dataset:title_and_genre
            genre = dataset:title_and_genre
    );
    \ {
        title_and_author [
            [0] { author "Gambardella, Matthew", id "bk101", title "XML Developer's Guide" },
            [1] { author "Ralls, Kim", id "bk102", title "Midnight Rain" },
            [2] { author "Corets, Eva", id "bk103", title "Maeve Ascendant" },
            [3] { author "Corets, Eva", id "bk104", title "Oberon's Legacy" }
        ],
        title_and_genre [
            [0] { genre "Computer", id "bk101", title "XML Developer's Guide" },
            [1] { genre "Fantasy", id "bk102", title "Midnight Rain" },
            [2] { genre "Fantasy", id "bk103", title "Maeve Ascendant" },
            [3] { genre "Fantasy", id "bk104", title "Oberon's Legacy" }
        ]
    }
Information that is available at a higher level to that of the specified dataset information can be referenced and included in datasets using a combination of the external_dataset and __EXTERNAL_VALUE__ markers.
The external_dataset marker informs the parser to store the information for later use. It follows the format of external_dataset:<target> where <target> is a reference name that identifies the external store.
The __EXTERNAL_VALUE__ marker informs the parser to reference a value that is or will be stored externally. It follows the format of __EXTERNAL_VALUE__ = <external_store>:<external_value>:<target_dataset>
Optionally the __EXTERNAL_VALUE__ marker can receive an additional parameter of :<override_name> making the full syntax <external_store>:<external_value>:<target_dataset>:<override_name>
    my $profile = qq(
      catalog
        shop
          number = external_dataset:shop_information
          book
            id = dataset:title_and_author dataset:title_and_genre
            author = dataset:title_and_author
            title = dataset:title_and_author dataset:title_and_genre
            genre = dataset:title_and_genre
            __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre
    );
    \ {
        title_and_author [
            [0] { author "Gambardella, Matthew", id "bk101", number 1, title "XML Developer's Guide" },
            [1] { author "Ralls, Kim", id "bk102", number 1, title "Midnight Rain" },
            [2] { author "Corets, Eva", id "bk103", number 2, title "Maeve Ascendant" },
            [3] { author "Corets, Eva", id "bk104", number 2, title "Oberon's Legacy" }
        ],
        title_and_genre [
            [0] { genre "Computer", id "bk101", number 1, title "XML Developer's Guide" },
            [1] { genre "Fantasy", id "bk102", number 1, title "Midnight Rain" },
            [2] { genre "Fantasy", id "bk103", number 2, title "Maeve Ascendant" },
            [3] { genre "Fantasy", id "bk104", number 2, title "Oberon's Legacy" }
        ]
    }
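Because the externally sourced number ends up in every record, the records can be regrouped by shop afterwards. The following is a minimal sketch in plain Perl over a hard-coded stand-in for part of the title_and_author dataset; %ids_by_shop is a hypothetical name and nothing here calls XML::Dataset itself.

```perl
use strict;
use warnings;

# Hard-coded stand-in for part of the title_and_author dataset.
my @records = (
    { id => 'bk101', number => 1, title => "XML Developer's Guide" },
    { id => 'bk102', number => 1, title => 'Midnight Rain' },
    { id => 'bk103', number => 2, title => 'Maeve Ascendant' },
);

# Group book ids by the shop number carried in each record.
my %ids_by_shop;
push @{ $ids_by_shop{ $_->{number} } }, $_->{id} for @records;

for my $shop ( sort { $a <=> $b } keys %ids_by_shop ) {
    print "shop $shop: @{ $ids_by_shop{$shop} }\n";
}
```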
Dataset declarations can receive additional parameters through comma-separated inclusions. In this example the XML element 'genre' is renamed to 'style' during processing using the name declaration.
    my $profile = qq(
      catalog
        shop
          number = external_dataset:shop_information
          book
            id = dataset:title_and_author dataset:title_and_genre
            author = dataset:title_and_author
            title = dataset:title_and_author dataset:title_and_genre
            genre = dataset:title_and_genre,name:style
            __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre
    );
    \ {
        title_and_author [
            [0] { author "Gambardella, Matthew", id "bk101", number 1, title "XML Developer's Guide" },
            [1] { author "Ralls, Kim", id "bk102", number 1, title "Midnight Rain" },
            [2] { author "Corets, Eva", id "bk103", number 2, title "Maeve Ascendant" },
            [3] { author "Corets, Eva", id "bk104", number 2, title "Oberon's Legacy" }
        ],
        title_and_genre [
            [0] { id "bk101", number 1, style "Computer", title "XML Developer's Guide" },
            [1] { id "bk102", number 1, style "Fantasy", title "Midnight Rain" },
            [2] { id "bk103", number 2, style "Fantasy", title "Maeve Ascendant" },
            [3] { id "bk104", number 2, style "Fantasy", title "Oberon's Legacy" }
        ]
    }
The prefix declaration assigns a prefix to the assignment name. For example, genre with a prefix of shop_information_ becomes shop_information_genre.
For consistency, in this example the external value number uses the optional :<override_name> parameter described in Example 4 (the external_dataset example) so that its name carries the same prefix.
    my $profile = qq(
      catalog
        shop
          number = external_dataset:shop_information
          book
            id = dataset:title_and_author,prefix:shop_information_ dataset:title_and_genre,prefix:shop_information_
            author = dataset:title_and_author,prefix:shop_information_
            title = dataset:title_and_author,prefix:shop_information_ dataset:title_and_genre,prefix:shop_information_
            genre = dataset:title_and_genre,prefix:shop_information_
            __EXTERNAL_VALUE__ = shop_information:number:title_and_author:shop_information_number shop_information:number:title_and_genre:shop_information_number
    );
    \ {
        title_and_author [
            [0] {
                shop_information_author "Gambardella, Matthew",
                shop_information_id "bk101",
                shop_information_number 1,
                shop_information_title "XML Developer's Guide"
            },
            [1] {
                shop_information_author "Ralls, Kim",
                shop_information_id "bk102",
                shop_information_number 1,
                shop_information_title "Midnight Rain"
            },
            [2] {
                shop_information_author "Corets, Eva",
                shop_information_id "bk103",
                shop_information_number 2,
                shop_information_title "Maeve Ascendant"
            },
            [3] {
                shop_information_author "Corets, Eva",
                shop_information_id "bk104",
                shop_information_number 2,
                shop_information_title "Oberon's Legacy"
            }
        ],
        title_and_genre [
            [0] {
                shop_information_genre "Computer",
                shop_information_id "bk101",
                shop_information_number 1,
                shop_information_title "XML Developer's Guide"
            },
            [1] {
                shop_information_genre "Fantasy",
                shop_information_id "bk102",
                shop_information_number 1,
                shop_information_title "Midnight Rain"
            },
            [2] {
                shop_information_genre "Fantasy",
                shop_information_id "bk103",
                shop_information_number 2,
                shop_information_title "Maeve Ascendant"
            },
            [3] {
                shop_information_genre "Fantasy",
                shop_information_id "bk104",
                shop_information_number 2,
                shop_information_title "Oberon's Legacy"
            }
        ]
    }
The process parameter can be used for inline manipulation of data. In this example the author is passed through a simple subroutine that returns an uppercase value.
The parser expects methods specified by the process declaration to be available to the main namespace.
    sub return_uc {
        return uc($_[0]);
    }

    my $profile = qq(
      catalog
        shop
          number = external_dataset:shop_information
          book
            id = dataset:title_and_author dataset:title_and_genre
            author = dataset:title_and_author,process:return_uc
            title = dataset:title_and_author dataset:title_and_genre
            genre = dataset:title_and_genre
            __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre
    );
    \ {
        title_and_author [
            [0] { author "GAMBARDELLA, MATTHEW", id "bk101", number 1, title "XML Developer's Guide" },
            [1] { author "RALLS, KIM", id "bk102", number 1, title "Midnight Rain" },
            [2] { author "CORETS, EVA", id "bk103", number 2, title "Maeve Ascendant" },
            [3] { author "CORETS, EVA", id "bk104", number 2, title "Oberon's Legacy" }
        ],
        title_and_genre [
            [0] { genre "Computer", id "bk101", number 1, title "XML Developer's Guide" },
            [1] { genre "Fantasy", id "bk102", number 1, title "Midnight Rain" },
            [2] { genre "Fantasy", id "bk103", number 2, title "Maeve Ascendant" },
            [3] { genre "Fantasy", id "bk104", number 2, title "Oberon's Legacy" }
        ]
    }
During processing, the parser looks for indicators that it should create a new dataset. As an example, when new data is encountered rather than overriding the existing data, a new dataset is created. Unfortunately this may lead to unexpected results when working with poorly structured input where subsets of information may be missing from the XML structure.
To mitigate this, the hint __NEW_DATASET__ = <dataset> is available to force the creation of a new dataset upon entering a block.
If there are any concerns about the consistency of the XML document then it is recommended that the __NEW_DATASET__ declaration is made within all respective blocks as part of the profile definition.
    sub return_uc {
        return uc($_[0]);
    }

    my $profile = qq(
      catalog
        shop
          number = external_dataset:shop_information
          book
            __NEW_DATASET__ = title_and_author title_and_genre
            id = dataset:title_and_author dataset:title_and_genre
            author = dataset:title_and_author,process:return_uc
            title = dataset:title_and_author dataset:title_and_genre
            genre = dataset:title_and_genre
            __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre
    );
There may be occasions where information at a parallel level is required and that information appears after the desired dataset information. To accommodate this, the __NEW_EXTERNAL_VALUE_HOLDER__ marker is available.
This can be used to create a stub store for the holder before it is actually processed by the parser. As the module uses aliases internally, the dataset is updated with a pointer which is subsequently updated to reflect the appropriate value as and when it is reached by the parser.
The XML example has been updated so that each shop includes an information section, following its book elements, whose address element details the shop location.
__NEW_EXTERNAL_VALUE_HOLDER__ is declared at the corresponding indentation with a value of shop_information:address. This tells the parser to store an externally referenceable marker with a default value of '' -
    shop
      __NEW_EXTERNAL_VALUE_HOLDER__ = shop_information:address
The shop_information:address:title_and_author entry under __EXTERNAL_VALUE__ informs the parser to look up the externally stored value and store it in the dataset; at this point that is the existing default value -
    __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre shop_information:address:title_and_author shop_information:address:title_and_genre
The indentation of information and address tells the parser to update the external_dataset entry for shop_information, subsequently updating the default value and reflecting the value where applicable across the desired datasets.
    information
      address = external_dataset:shop_information
    sub return_uc {
        return uc($_[0]);
    }

    my $profile = qq(
      catalog
        shop
          __NEW_EXTERNAL_VALUE_HOLDER__ = shop_information:address
          number = external_dataset:shop_information
          book
            __NEW_DATASET__ = title_and_author title_and_genre
            id = dataset:title_and_author dataset:title_and_genre
            author = dataset:title_and_author,process:return_uc
            title = dataset:title_and_author dataset:title_and_genre
            genre = dataset:title_and_genre
            __EXTERNAL_VALUE__ = shop_information:number:title_and_author shop_information:number:title_and_genre shop_information:address:title_and_author shop_information:address:title_and_genre
          information
            address = external_dataset:shop_information
    );
    \ {
        title_and_author [
            [0] { address "Regents Street", author "GAMBARDELLA, MATTHEW", id "bk101", number 1, title "XML Developer's Guide" },
            [1] { address "Regents Street", author "RALLS, KIM", id "bk102", number 1, title "Midnight Rain" },
            [2] { address "Oxford Street", author "CORETS, EVA", id "bk103", number 2, title "Maeve Ascendant" },
            [3] { address "Oxford Street", author "CORETS, EVA", id "bk104", number 2, title "Oberon's Legacy" }
        ],
        title_and_genre [
            [0] { address "Regents Street", genre "Computer", id "bk101", number 1, title "XML Developer's Guide" },
            [1] { address "Regents Street", genre "Fantasy", id "bk102", number 1, title "Midnight Rain" },
            [2] { address "Oxford Street", genre "Fantasy", id "bk103", number 2, title "Maeve Ascendant" },
            [3] { address "Oxford Street", genre "Fantasy", id "bk104", number 2, title "Oberon's Legacy" }
        ]
    }
I'm a long time advocate of Data::Dumper. Data::Printer is also an excellent module. In the examples, for clarity purposes Data::Printer was chosen over Data::Dumper owing to the display differences that result from the internal use of Data::Alias.
As an example, here is the output from Example 4 (the external_dataset example) depicted through Data::Dumper. Values that are shared through aliasing appear as references back into $VAR1.
It's important to understand the internal structure of the datasets if you plan on making changes to the returned information.
    $VAR1 = \{
        'title_and_genre' => [
            { 'number' => '1', 'title' => 'XML Developer\'s Guide', 'id' => 'bk101', 'genre' => 'Computer' },
            { 'number' => ${\${$VAR1}->{'title_and_genre'}->[0]->{'number'}}, 'title' => 'Midnight Rain', 'id' => 'bk102', 'genre' => 'Fantasy' },
            { 'number' => '2', 'title' => 'Maeve Ascendant', 'id' => 'bk103', 'genre' => 'Fantasy' },
            { 'number' => ${\${$VAR1}->{'title_and_genre'}->[2]->{'number'}}, 'title' => 'Oberon\'s Legacy', 'id' => 'bk104', 'genre' => 'Fantasy' },
            { 'number' => '1', 'title' => 'XML Developer\'s Guide', 'id' => 'bk101', 'genre' => 'Computer' },
            { 'number' => ${\${$VAR1}->{'title_and_genre'}->[4]->{'number'}}, 'title' => 'Midnight Rain', 'id' => 'bk102', 'genre' => 'Fantasy' },
            { 'number' => '2', 'title' => 'Maeve Ascendant', 'id' => 'bk103', 'genre' => 'Fantasy' },
            { 'number' => ${\${$VAR1}->{'title_and_genre'}->[6]->{'number'}}, 'title' => 'Oberon\'s Legacy', 'id' => 'bk104', 'genre' => 'Fantasy' }
        ],
        'title_and_author' => [
            { 'number' => ${\${$VAR1}->{'title_and_genre'}->[0]->{'number'}}, 'title' => 'XML Developer\'s Guide', 'author' => 'Gambardella, Matthew', 'id' => 'bk101' },
            { 'number' => ${\${$VAR1}->{'title_and_genre'}->[0]->{'number'}}, 'title' => 'Midnight Rain', 'author' => 'Ralls, Kim', 'id' => 'bk102' },
            { 'number' => ${\${$VAR1}->{'title_and_genre'}->[2]->{'number'}}, 'title' => 'Maeve Ascendant', 'author' => 'Corets, Eva', 'id' => 'bk103' },
            { 'number' => ${\${$VAR1}->{'title_and_genre'}->[2]->{'number'}}, 'title' => 'Oberon\'s Legacy', 'author' => 'Corets, Eva', 'id' => 'bk104' },
            { 'number' => ${\${$VAR1}->{'title_and_genre'}->[4]->{'number'}}, 'title' => 'XML Developer\'s Guide', 'author' => 'Gambardella, Matthew', 'id' => 'bk101' },
            { 'number' => ${\${$VAR1}->{'title_and_genre'}->[4]->{'number'}}, 'title' => 'Midnight Rain', 'author' => 'Ralls, Kim', 'id' => 'bk102' },
            { 'number' => ${\${$VAR1}->{'title_and_genre'}->[6]->{'number'}}, 'title' => 'Maeve Ascendant', 'author' => 'Corets, Eva', 'id' => 'bk103' },
            { 'number' => ${\${$VAR1}->{'title_and_genre'}->[6]->{'number'}}, 'title' => 'Oberon\'s Legacy', 'author' => 'Corets, Eva', 'id' => 'bk104' }
        ]
    };
Standing on the shoulders of giants, this module leverages the excellent XML::LibXML::Reader which itself is built upon the powerful libxml2 library. XML::LibXML::Reader uses an iterator approach to parsing XML documents, resulting in an approach that is easier to program than an event based parser (SAX) and much more lightweight than a tree based parser (DOM) which loads the complete tree into memory.
This low memory footprint was a particular consideration in the choice of scaffolding for this module.
Data::Alias is utilised internally for lookback operations. The module allows you to apply "aliasing semantics" to a section of code, causing aliases to be made wherever Perl would normally make copies instead. You can use this to improve efficiency and readability, when compared to using references.
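Core Perl already aliases in a few places, such as the foreach loop variable, and Data::Alias extends the same idea to ordinary assignments. The sketch below uses only core Perl, not Data::Alias itself, to contrast copy semantics with alias semantics; the variable names are chosen for illustration.

```perl
use strict;
use warnings;

my @authors = ( 'Ralls, Kim', 'Corets, Eva' );

# Copy semantics: list assignment copies each element, so changing the
# copy leaves @authors untouched.
my @copy = @authors;
$copy[0] = uc $copy[0];
my $original_after_copy = $authors[0];    # still 'Ralls, Kim'

# Alias semantics: the foreach loop variable is an alias to each element,
# so assigning to it changes @authors itself.
for my $author (@authors) {
    $author = uc $author;
}

print "copy:    @copy\n";
print "authors: @authors\n";
```

Values in the returned datasets that are shared in this way change together, which is worth bearing in mind before modifying a record in place.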
Thanks to the following for support, advice and feedback -
James Spurin <james@spurin.com>
This software is copyright (c) 2014 by James Spurin.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
To install XML::Dataset, copy and paste the appropriate command in to your terminal.
cpanm
    cpanm XML::Dataset
CPAN shell
    perl -MCPAN -e shell
    install XML::Dataset