The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

dvp_gen_parser - creates in-memory perl data validator

SYNOPSIS

   $ dvp_gen_parser -m MyParser data.yml.spec
   options:
      -m|module : parser module name
      -v|verbose: print grammar created
      -d|debug  : print debug message

DESCRIPTION

This is the command-line utility of Data::Validate::Perl. It reads in the data structure specification and creates the parser module directly. For most users this is only thing to run in order to facilitate the module behind.

DATA SPECIFICATION FORMAT

Each data item here has a symbolic prefix, pretty much like in perl:

   1. scalar: $
   2. array: @
   3. hash: %
   4. value: '
SCALAR
   # the value of scalar 'foo' can be any string
   $foo: TEXT

   # the value of scalar 'foo' must be 'bar'
   $foo: 'bar

   # the value of scalar 'foo' can be either 'bar1' or 'bar2'
   # note there is only one single-quote at begin
   $foo: 'bar1 'bar2
ARRAY
   # the array 'foo' may contain a scalar called 'bar'
   # the possible values of 'bar' can be defined like above
   @foo: $bar

   # the array 'foo' may contain another array called 'bar'
   # the structure of 'bar' needs to be defined somewhere
   @foo: @bar

   # the array 'foo' may contain another 2 arrays 'bar1' and 'bar2'
   @foo: @bar1 @bar2

   # the array 'foo' may contain 1 array called 'bar1' and 1 hash
   # called 'bar2'
   @foo: @bar1 %bar2

   # the array 'foo' can be a simple array made up by scalars too
   @foo: 'bar1 'bar2 'bar3

Generic array rules:

   1. each declared array item may or may NOT appear (optional), except for the case of single item;
   2. relationship between multiple declared array items is OR;
   3. when array is declared, item(s) not in the declaration will be rejected;
   4. array referred by other data structure might NOT be declared, it will be treated as simple array which contains any number of scalar values (anonymous array);
HASH
   # the hash 'foo' may contain a key 'bar', whose value is 'zoo'
   %foo: $bar
   $bar: 'zoo

   # the hash 'foo' may contain a key 'bar', whose value is array
   %foo: @bar
   # or the key and value symbol may be different
   %foo: @(bar)bar_list


   # the hash 'foo' may contain a key 'bar', whose value is another
   # hash
   %foo: %bar

   # the hash 'foo' may contain 2 keys: 'bar1' whose value is array,
   # 'bar2' whose value is hash
   %foo: @bar1 %bar2

   # the hash 'foo' may contain 3 keys: 'bar1' whose value is array,
   # 'bar2' whose value is hash, 'bar3' whose value is scalar
   %foo: @bar1 %bar2 $bar3

Generic hash rules:

   1. each declared hash key may or may NOT appear (optional), except for the case of single key;
   2. relationship between multiple declared keys is OR;
   3. when hash is declared, key(s) not in the declaration will be rejected;
   4. hash referred by other data structure might NOT be declared, it will be treated as simple hash which contains any number of scalar key/value pairs (anonymous hash);

USECASE

Requirement

Say you have a program which reads all the postcode and areacode of all China mainland cities from a yaml file. Before taking the information inside you want to make sure this file looks like the one with proper data.

The sample format of this yaml file is like below:

   ---
   - city: Beijing
     postcode: 100000
     areacode: 010
   - city: Shanghai
     postcode: 200000
     areacode: 020
   - city: Hangzhou
     postcode: 310000
     areacode: 0571
   - city: Kunming
     postcode: 650000
     areacode: 0871
   ...

STEP 1: Write the Specification

The structure is not complicated, top container is an array, each item inside is hash with 3 keys. The corresponding specification can be written as:

   @root: %record
   %record: $city $postcode $areacode

Which is good enough to walk through the data. If you have the full list of city names that may appear, one more line can be inserted to list them:

   $city: 'Beijing 'Shanghai 'Hangzhou 'Kunming ...

Save these lines into a file called city_yaml.spec

STEP 2: Create the Parser Module

Simply call the command-line tool dvp_gen_parser like this:

   $ dvp_gen_parser -m City::Record::Validator city_yaml.spec

This will create Validator.pm under current working directory, move it to the right folder your main script will use later.

STEP 3: Put Together

Final step, write a simple script to load yaml and call the module method to validate data inside:

   # demonstration purpose only
   use FindBin;
   use YAML::XS qw/Load/;
   use lib "$FindBin::RealBin/../lib/perl5";
   use City::Record::Validator;
   open my $F, '<', $ARGV[0] or croak "cannot open file to read: $!";
   my $cont = do { local $/; <$F> };
   close $F;
   my $data = Load($cont);
   my $validator = City::Record::Validator::->new();
   $validator->parse($data);
   print STDERR '#error = ', $parser->YYNBerr(), "\n";

That is it.

Conclusion

The advantage of this module is to process data in memory, this will not bound it with any specific file format. As long as perl understands the target format and is capable of loading it, the created module will check the data in memory.

TROUBLESHOOTING

Parse::Yapp provides a debug flag to control the degree of running information of state machine. This flag is controlled by an environment variable named YYDEBUG and defaults off.

   $ YYDEBUG=31 validate.pl data.yml

BUGS

   1. no support of required hash key(s);

LICENSE AND COPYRIGHT

Copyright 2014 Dongxu Ma.

This program is free software; you can redistribute it and/or modify it under the terms of the the Artistic License (2.0). You may obtain a copy of the full license at:

http://www.perlfoundation.org/artistic_license_2_0