NAME

Data::Domain - Data description and validation

SYNOPSIS

  use Data::Domain qw/:all/;

  my $domain = Struct(
    anInt      => Int(-min => 3, -max => 18),
    aNum       => Num(-min => 3.33, -max => 18.5),
    aDate      => Date(-max => 'today'),
    aLaterDate => sub {my $context = shift;
                       Date(-min => $context->{flat}{aDate})},
    aString    => String(-min_length => 2, -optional => 1),
    anEnum     => Enum(qw/foo bar buz/),
    anIntList  => List(-min_size => 1, -all => Int),
    aMixedList => List(Int, String, Int(-min => 0), Date),
    aStruct    => Struct(foo => String, bar => Int(-optional => 1))
  );

  my $messages = $domain->inspect($some_data);
  my_display_error($messages) if $messages;

DESCRIPTION

A data domain is a description of a set of values, either scalar or structured (arrays or hashes). The description can include many constraints, like minimal or maximal values, regular expressions, required fields, forbidden fields, and also contextual dependencies. From that description, one can then invoke the domain's inspect method to check if a given value belongs to it or not. In case of mismatch, a structured set of error messages is returned.

The motivation for writing this package was to be able to express in a compact way some possibly complex constraints about structured data. Typically the data is a Perl tree (nested hashrefs or arrayrefs) that may come from XML, JSON, from a database through DBIx::DataModel, or from postprocessing an HTML form through CGI::Expand. Data::Domain is a kind of tree parser on that structure, with some facilities for dealing with dependencies within the structure, and with several options to finely tune the error messages returned to the user.

There are several other packages in CPAN doing data validation; these are briefly listed in the "SEE ALSO" section.

GLOBAL API

Shortcut functions for domain constructors

Internally, domains are represented as Perl objects; however, it would be tedious to write

  my $domain = Data::Domain::Struct->new(
    anInt      => Data::Domain::Int->new(-min => 3, -max => 18),
    aDate      => Data::Domain::Date->new(-max => 'today'),
    ...
  );

so for each of its builtin domain constructors, Data::Domain exports a plain function that just calls new on the appropriate subclass. If you import those functions (use Data::Domain qw/:all/, or use Data::Domain qw/Struct Int Date .../), then you can write more conveniently :

  my $domain = Struct(
    anInt      => Int(-min => 3, -max => 18),
    aDate      => Date(-max => 'today'),
    ...
  );

Short function names like Int or String are convenient, but may cause name clashes with other modules. If conflicts happen, don't import the function names, and explicitly call the new method on domain constructors -- or write your own wrappers around them.

Methods

new

Creates a new domain object, from one of the domain constructors listed below (Num, Int, Date, etc.). The Data::Domain class itself has no new method, because it is an abstract class.

Arguments to the new method specify various constraints for the domain (minimal/maximal values, regular expressions, etc.); most often they are specific to a given domain constructor, so see the details below. However, there are also some generic options :

-optional

if true, an undef value will be accepted, without generating an error message

-name

defines a name for the domain, that will be printed in error messages instead of the subclass name.

-messages

defines ad hoc messages for that domain, instead of the builtin messages. The argument can be a string, a hashref or a coderef, as explained in the "ERROR MESSAGES" section.

Option names always start with a dash. If no option name is given, parameters to the new method are passed to the default option, as defined in each constructor subclass. For example the default option in Data::Domain::List is -items, so

   my $domain = List(Int, String, Int);

is equivalent to

   my $domain = List(-items => [Int, String, Int]);

inspect

  my $messages = $domain->inspect($some_data);

Inspects the supplied data, and returns an error message (or a structured collection of messages) if anything is wrong. If the data successfully passed all domain tests, then nothing is returned.

For scalar domains (Num, String, etc.), the error message is just a string. For structured domains (List, Struct), the return value is an arrayref or hashref of the same structure, like for example

  {anInt => "smaller than minimum 3",
   aDate => "not a valid date",
   aList => ["message for item 0", undef, undef, "message for item 3"]}

The client code can then exploit this structure to dispatch error messages to appropriate locations (typically these will be the form fields that gathered the data).
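
As an illustration of how client code might walk such a structure, the following sketch flattens a nested message tree into path/message pairs. The helper name flatten_messages and the dotted path notation are assumptions made for this example, not part of Data::Domain; the sample $messages hash simply mirrors the shape shown above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Data::Domain): walks a structure shaped
# like the result of inspect() and collects [path, message] pairs.
sub flatten_messages {
  my ($node, $path) = @_;
  $path = '' unless defined $path;

  return () unless defined $node;            # no error at this position
  if (ref $node eq 'HASH') {
    return map { flatten_messages($node->{$_},
                                  length($path) ? "$path.$_" : $_) }
           sort keys %$node;
  }
  if (ref $node eq 'ARRAY') {
    return map { flatten_messages($node->[$_], $path . "[$_]") }
           0 .. $#$node;
  }
  return [$path, $node];                     # scalar leaf: an error string
}

# Sample data, hand-written to look like the messages of a Struct domain
my $messages = {
  anInt => "smaller than minimum 3",
  aDate => "not a valid date",
  aList => ["message for item 0", undef, undef, "message for item 3"],
};

my @errors = flatten_messages($messages);
print "$_->[0]: $_->[1]\n" for @errors;
```

Each pair could then be attached to the form field whose name matches the path.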

BUILTIN DOMAIN CONSTRUCTORS

Whatever

  my $domain = Struct(
    just_anything => Whatever,
    is_defined    => Whatever(-defined => 1),
    is_undef      => Whatever(-defined => 0),
    is_true       => Whatever(-true => 1),
    is_false      => Whatever(-true => 0),
    is_object     => Whatever(-isa => 'My::Funny::Object'),
    has_methods   => Whatever(-can => [qw/jump swim dance sing/]),
  );

Encapsulates just any kind of Perl value (including undef). Options are :

-defined

If true, the data must be defined. If false, the data must be undef.

-true

If true, the data must be true. If false, the data must be false.

-isa

The data must be an object of the specified class.

-can

The data must implement the listed methods, supplied either as an arrayref (several methods) or as a scalar (just one method).

Empty

Empty domain, that always fails when inspecting any data. This is sometimes useful within lazy constructors (see below), like in this example :

  Struct(
    foo => String,
    bar => sub {
      my $context = shift;
      if (some_condition($context)) { 
        return Empty(-messages => 'your data is wrong')
      }
      else {
        ...
      }
    }
  )

Num

  my $domain = Num(-range =>[-3.33, 999], -not_in => [2, 3, 5, 7, 11]);

Domain for numbers (including floats). Options are :

-min

The data must be greater or equal to the supplied value.

-max

The data must be smaller or equal to the supplied value.

-range

-range => [$min, $max] is equivalent to -min => $min, -max => $max.

-not_in

The data must be different from all values in the exclusion set, supplied as an arrayref.

Int

  my $domain = Int(-min => 0, -max => 999, -not_in => [2, 3, 5, 7, 11]);

Domain for integers. Accepts the same options as Num and returns the same error messages.

Date

  Data::Domain::Date->parser('EU'); # default    
  my $domain = Date(-min => '01.01.2001', 
                    -max => 'today',
                    -not_in => ['02.02.2002', '03.03.2003', 'yesterday']);

Domain for dates, implemented via the Date::Calc module. By default, dates are parsed according to the European format, i.e. through the Decode_Date_EU function; this can be changed by setting

  Data::Domain::Date->parser('US'); # will use Decode_Date_US

or

  Data::Domain::Date->parser(\&your_own_date_parsing_function);
  # that func. should return an array ($year, $month, $day)
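
A custom parser might look like the following sketch, which accepts ISO-style YYYY-MM-DD strings. The function name parse_iso_date is hypothetical, and returning an empty list on failure is an assumption modeled on how Date::Calc's Decode_Date_EU behaves; the text above only states that the function should return ($year, $month, $day).

```perl
use strict;
use warnings;

# Hypothetical parser for 'YYYY-MM-DD' strings.
sub parse_iso_date {
  my ($string) = @_;
  return () unless defined $string;
  my ($year, $month, $day)
    = $string =~ /^\s*(\d{4})-(\d{1,2})-(\d{1,2})\s*$/
    or return ();                          # not in the expected format
  # Coarse range check only; it knows nothing about month lengths.
  return () if $month < 1 || $month > 12 || $day < 1 || $day > 31;
  return ($year + 0, $month + 0, $day + 0);
}

# Registration, as described above (needs Data::Domain to be loaded):
#   Data::Domain::Date->parser(\&parse_iso_date);

my @ymd = parse_iso_date('2007-03-15');
print "@ymd\n";                            # prints "2007 3 15"
```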

When outputting error messages, dates will be printed according to Date::Calc's current language (English by default); see that module's documentation for changing the language.

In the options below, the special keywords today, yesterday or tomorrow may be used instead of a date constant, and will be replaced by the appropriate date when performing comparisons.

-min

The data must be greater or equal to the supplied value.

-max

The data must be smaller or equal to the supplied value.

-range

-range => [$min, $max] is equivalent to -min => $min, -max => $max.

-not_in

The data must be different from all values in the exclusion set, supplied as an arrayref.

Time

  my $domain = Time(-min => '08:00', -max => 'now');

Domain for times in format hh:mm:ss (minutes and seconds are optional).

In the options below, the special keyword now may be used instead of a time, and will be replaced by the current local time when performing comparisons.

-min

The data must be greater or equal to the supplied value.

-max

The data must be smaller or equal to the supplied value.

-range

-range => [$min, $max] is equivalent to -min => $min, -max => $max.

String

  my $domain = String(qr/^[A-Za-z0-9_\s]+$/);

  my $domain = String(-regex     => qr/^[A-Za-z0-9_\s]+$/,
                      -antiregex => qr/$RE{profanity}/,    # see Regexp::Common
                      -range     => ['AA', 'zz'],
                      -length    => [1, 20],
                      -not_in    => [qw/foo bar/]);

Domain for strings. Options are:

-regex

The data must match the supplied compiled regular expression. Don't forget to put ^ and $ anchors if you want your regex to check the whole string.

-regex is the default option, so you may just pass the regex as a single unnamed argument to String().

-antiregex

The data must not match the supplied regex.

-min

The data must be greater or equal to the supplied value.

-max

The data must be smaller or equal to the supplied value.

-range

-range => [$min, $max] is equivalent to -min => $min, -max => $max.

-min_length

The string length must be greater or equal to the supplied value.

-max_length

The string length must be smaller or equal to the supplied value.

-length

-length => [$min, $max] is equivalent to -min_length => $min, -max_length => $max.

-not_in

The data must be different from all values in the exclusion set, supplied as an arrayref.

Enum

  my $domain = Enum(qw/foo bar buz/);

Domain for a finite set of scalar values. Options are:

-values

Ref to an array of values admitted in the domain. This would be called as Enum(-values => [qw/foo bar buz/]), but since this is the default option, it can be simply written as Enum(qw/foo bar buz/).

Undefined values are not allowed in the list (use the -optional argument instead).

List

  my $domain = List(String, Int, String, Num);

  my $domain = List(-items => [String, Int, String, Num]); # same as above

  my $domain = List(-all  => String(qr/^[A-Z]+$/),
                    -any  => String(-min_length => 3),
                    -size => [3, 10]);

Domain for lists of values (stored as Perl arrayrefs). Options are:

-items

Ref to an array of domains; then the first n items in the data must match those domains, in the same order.

This is the default option, so item domains may be passed directly to the new method, without the -items keyword.

-min_size

The data must be a ref to an array with at least that number of entries.

-max_size

The data must be a ref to an array with at most that number of entries.

-size

-size => [$min, $max] is equivalent to -min_size => $min, -max_size => $max.

-all

All remaining entries in the array, after the first n entries as specified by the -items option (if any), must satisfy that domain specification.

-any

At least one remaining entry in the array, after the first n entries as specified by the -items option (if any), must satisfy that domain specification. A list domain can have both an -all and an -any constraint.

The argument to -any can also be an arrayref of domains, as in

   List(-any => [String(qr/^foo/), Num(-range => [1, 10]) ])

This means that one member of the list must be a string starting with foo, and one member of the list (in this case, necessarily another one) must be a number between 1 and 10. Note that this is different from

   List(-any => One_of(String(qr/^foo/), Num(-range => [1, 10])))

which says that one member of the list must be either a string starting with foo or a number between 1 and 10.

Struct

  my $domain = Struct(foo => Int, bar => String);

  my $domain = Struct(-fields  => [foo => Int, bar => String],
                      -exclude => '*');

Domain for associative structures (stored as Perl hashrefs). Options are:

-fields

Supplies a list of keys with their associated domains. The list might be given either as a hashref or as an arrayref. Specifying it as an arrayref is useful for controlling the order in which field checks will be performed; this may make a difference when there are context dependencies (see "LAZY CONSTRUCTORS" below).

-exclude

Specifies which keys are not allowed in the structure. The exclusion may be specified as an arrayref of key names, as a compiled regular expression, or as the string constant '*' or 'all' (meaning that no key will be allowed except those explicitly listed in the -fields option).

One_of

  my $domain = One_of($domain1, $domain2, ...);

Union of domains : successively checks the member domains, until one of them succeeds. Options are:

-options

List of domains to be checked. This is the default option, so the keyword may be omitted.

LAZY CONSTRUCTORS (CONTEXT DEPENDENCIES)

Principle

If an element of a structured domain (List or Struct) depends on another element, then we need to lazily construct the domain. Consider for example a struct in which the value of field date_end must be greater than date_begin : the subdomain for date_end can only be constructed when the argument to -min is known, namely when the domain inspects an actual data structure.

Lazy domain construction is achieved by supplying a function reference instead of a domain object. That function will be called with some context information, and should return the domain object. So our example becomes :

  my $domain = Struct(
       date_begin => Date,
       date_end   => sub {my $context = shift;
                          Date(-min => $context->{flat}{date_begin})}
     );

Structure of context

The supplied context is a hashref containing the following information:

root

the overall root of the inspected data

path

the sequence of keys or array indices that led to the current data node. With that information, the subdomain is able to jump to other ancestor or sibling data nodes within the tree, with the help of the node_from_path function.

flat

a flat hash containing an entry for any hash key met so far while traversing the tree. In case of name clashes, most recent keys (down in the tree) override previous keys.

list

a reference to the last list (arrayref) encountered while traversing the tree.

Here is an example :

  my $data   = {foo => [undef, 99, {bar => "hello, world"}]};
  my $domain = Struct(
     foo => List(Whatever, 
                 Whatever, 
                 Struct(bar => sub {my $context = shift;
                                    print Dumper($context);
                                    String;})
                )
     );
  $domain->inspect($data);

This code will print something like

  $VAR1 = {
    'root' => {'foo' => [undef, 99, {'bar' => 'hello, world'}]},
    'path' => ['foo', 2, 'bar'],
    'list' => $VAR1->{'root'}{'foo'},
    'flat' => {
      'bar' => 'hello, world',
      'foo' => $VAR1->{'root'}{'foo'}
    }
  };
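
To make the flat and path entries more concrete, here is a plain-Perl sketch that recomputes a comparable flat hash by following path from root and merging every hash met on the way. This is only an approximation written for illustration: the helper name flat_view is hypothetical, and the module's actual traversal may differ in details.

```perl
use strict;
use warnings;

# Hypothetical helper: follow @$path from $root; each hash met on the way
# contributes its keys, deeper hashes overriding shallower ones.
sub flat_view {
  my ($root, $path) = @_;
  my %flat;
  my $node = $root;
  for my $step (@$path) {
    if (ref $node eq 'HASH') {
      %flat = (%flat, %$node);     # later (deeper) keys win on name clashes
      $node = $node->{$step};
    }
    elsif (ref $node eq 'ARRAY') {
      $node = $node->[$step];
    }
    else {
      last;                        # the path goes deeper than the data
    }
  }
  return \%flat;
}

my $root = {foo => [undef, 99, {bar => "hello, world"}]};
my $flat = flat_view($root, ['foo', 2, 'bar']);
print join(", ", sort keys %$flat), "\n";   # prints "bar, foo"
```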

Usage examples

Contextual sets

  my $some_cities = {
     Switzerland => [qw/Genève Lausanne Bern Zurich Bellinzona/],
     France      => [qw/Paris Lyon Marseille Lille Strasbourg/],
     Italy       => [qw/Milano Genova Livorno Roma Venezia/],
  };
  my $domain = Struct(
     country => Enum(keys %$some_cities),
     city    => sub {
        my $context = shift;
        Enum(-values => $some_cities->{$context->{flat}{country}});
      });

Ordered lists

Here is an example of a domain for ordered lists of integers:

  my $domain = List(-all => sub {
      my $context = shift;
      my $index = $context->{path}[-1];
      return Int if $index == 0; # first item has no constraint
      return Int(-min => $context->{list}[$index-1] + 1);
    });

Recursive domains

A domain for expression trees, where leaves are numbers, and intermediate nodes are binary operators on subtrees :

  my $expr_domain;   # declared first, so that the closures below can refer to it
  $expr_domain = One_of(Num, Struct(operator => String(qr(^[-+*/]$)),
                                    left     => sub {$expr_domain},
                                    right    => sub {$expr_domain}));

WRITING NEW DOMAIN CONSTRUCTORS

Implementing new domain constructors is fairly simple : create a subclass of Data::Domain and implement a new method and an _inspect method. See the source code of Data::Domain::Num or Data::Domain::String for short examples.

However, before writing such a class, consider whether the existing mechanisms already cover your needs. For example, many domains could be expressed as a String constrained by a regular expression; therefore it is just a matter of writing a wrapper that supplies that regular expression, and passes other arguments (like -optional) to the String constructor :

  sub Phone   { String(-regex    => qr/^\+?[0-9() ]+$/, 
                       -messages => "Invalid phone number", @_) }
  sub Email   { String(-regex    => qr/^[-.\w]+\@[\w.]+$/,
                       -messages => "Invalid email", @_) }
  sub Contact { Struct(-fields => [name   => String,
                                   phone  => Phone,
                                   mobile => Phone(-optional => 1),
                                   emails => List(-all => Email)   ], @_) }

ERROR MESSAGES

Messages returned by validation rules have default values, but can be customized in several ways.

Each error message has an internal string identifier, like TOO_SHORT, NOT_A_HASH, etc. The documentation for each builtin domain tells which message identifiers may be generated in that domain. Message identifiers are then associated with user-friendly strings, either within the domain itself, or via a global table. Such strings are actually sprintf format strings, with placeholders for printing some specific details about the validation rule : for example the String domain defines default messages such as

      TOO_SHORT        => "less than %d characters",
      SHOULD_MATCH     => "should match %s",
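
The role of the placeholders can be illustrated with plain sprintf. In this sketch the %formats hash and the format_message helper are made up for the example; only the two format strings are taken from the text above.

```perl
use strict;
use warnings;

# Format strings quoted above; the surrounding code is illustrative only.
my %formats = (
  TOO_SHORT    => "less than %d characters",
  SHOULD_MATCH => "should match %s",
);

# Hypothetical helper: turn a message identifier plus details into a string.
# Unknown identifiers are returned unchanged.
sub format_message {
  my ($msg_id, @args) = @_;
  my $format = $formats{$msg_id};
  return defined($format) ? sprintf($format, @args) : $msg_id;
}

print format_message(TOO_SHORT => 7), "\n";              # less than 7 characters
print format_message(SHOULD_MATCH => '^[A-Z]+$'), "\n";  # should match ^[A-Z]+$
```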

The -messages option to domain constructors

Any domain constructor may receive a -messages option to locally override the messages for that domain. The argument may be

  • a plain string : that string will be returned for any kind of validation error within the domain

  • a hashref : keys of the hash should be message identifiers, and values should be the associated error strings.

  • a coderef : the referenced function is called, and the return value becomes the error string. The called function receives the message identifier as argument.

Here is an example :

 sub Phone { 
   String(-regex      => qr/^\+?[0-9() ]+$/, 
          -min_length => 7,
          -messages   => {
            TOO_SHORT    => "phone number should have at least %d digits",
            SHOULD_MATCH => "invalid chars in phone number"
           }, @_) 
 }

The messages class method

Default strings associated with message identifiers are stored in a global table. The distribution contains builtin tables for English (the default) and for French : these can be chosen through the messages class method :

  Data::Domain->messages('english');  # the default
  Data::Domain->messages('français');

The same method can also receive a custom table.

  my $custom_table = {...};
  Data::Domain->messages($custom_table);

This should be a two-level hashref : first-level entries in the hash correspond to Data::Domain subclasses (e.g. Num => {...}, String => {...}), or to the constant Generic; for each of those, the second-level entries should correspond to message identifiers as specified in the doc for each subclass (for example TOO_SHORT, NOT_A_HASH, etc.). Values should be strings suitable to be fed to sprintf. Look at $builtin_msgs in the source code to see an example.

Finally, it is also possible to write your own message generation handler :

  Data::Domain->messages(sub {my ($msg_id, @args) = @_;
                              return "you just got it wrong ($msg_id)"});

What is received in @args depends on which validation rule is involved; it can be for example the minimal or maximal bounds, or the regular expression being checked.
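
A slightly richer handler could combine a lookup table with a fallback. The sketch below only builds and exercises the coderef on its own; the table contents and the fallback wording are invented for the example, and the registration call is shown as a comment because it needs Data::Domain to be loaded.

```perl
use strict;
use warnings;

# Invented table: message identifiers mapped to sprintf formats.
my %my_formats = (
  TOO_SMALL => "value must be at least %s",
  TOO_BIG   => "value must be at most %s",
);

my $handler = sub {
  my ($msg_id, @args) = @_;
  my $format = $my_formats{$msg_id};
  return sprintf($format, @args) if defined $format;
  return "invalid data ($msg_id)";          # fallback for other identifiers
};

# Data::Domain->messages($handler);         # registration, as described above

print $handler->(TOO_SMALL => 3), "\n";     # value must be at least 3
print $handler->('NOT_A_HASH'), "\n";       # invalid data (NOT_A_HASH)
```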

The -name option to domain constructors

The name of the domain is prepended in front of error messages. The default name is the subclass of Data::Domain, so a typical error message for a string would be

  String: less than 7 characters

However, if a -name is supplied to the domain constructor, that name will be printed instead :

  my $dom = String(-min_length => 7, -name => 'Phone');
  # now error would be: "Phone: less than 7 characters"

Message identifiers

This section lists all possible message identifiers generated by the builtin constructors.

Whatever

MATCH_DEFINED, MATCH_TRUE, MATCH_ISA, MATCH_CAN.

Num

INVALID, TOO_SMALL, TOO_BIG, EXCLUSION_SET.

Date

INVALID, TOO_SMALL, TOO_BIG, EXCLUSION_SET.

Time

INVALID, TOO_SMALL, TOO_BIG.

String

TOO_SHORT, TOO_LONG, TOO_SMALL, TOO_BIG, EXCLUSION_SET, SHOULD_MATCH, SHOULD_NOT_MATCH.

Enum

NOT_IN_LIST.

List

The domain will first check if the supplied array is of appropriate shape; in case of failure, it will return one of the following scalar messages : NOT_A_LIST, TOO_SHORT, TOO_LONG.

Then it will check all items in the supplied array according to the -items and -all specifications; in case of failure, an arrayref of messages is returned, where message positions correspond to the positions of offending data items.

Finally, the domain will check the -any constraint; in case of failure, it returns an ANY scalar message. Since that message contains the name of the missing domain, it is a good idea to use the -name option so that the message is easily comprehensible, as for example in

  List(-any => String(-name => "uppercase word", 
                      -regex => qr/^[A-Z]+$/))

Here the error message would be : should have at least one uppercase word.

Struct

The domain will first check if the supplied hash is of appropriate shape; in case of failure, it will return one of the following scalar messages : NOT_A_HASH, FORBIDDEN_FIELD.

Then it will check all entries in the supplied hash according to the -fields specification, and return a hashref of messages, where keys correspond to the keys of offending data items.

One_of

If all member domains failed to accept the data, an arrayref of error messages is returned, where the order of messages corresponds to the order of the checked domains.

INTERNALS

node_from_path

  my $node = node_from_path($root, @path);

Convenience function to find a given node in a data tree, starting from the root and following a path (a sequence of hash keys or array indices). Returns undef if no such path exists in the tree. Mainly useful for contextual constraints in lazy constructors.
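
The behaviour described here can be approximated in a few lines of plain Perl. The function below is a sketch written for this illustration under the hypothetical name my_node_from_path; it is not the module's source code, and it only handles non-negative array indices.

```perl
use strict;
use warnings;

# Illustrative re-implementation: follow @path from $root, returning undef
# as soon as a step cannot be taken.
sub my_node_from_path {
  my ($root, @path) = @_;
  my $node = $root;
  for my $step (@path) {
    if (ref $node eq 'HASH') {
      return undef unless exists $node->{$step};
      $node = $node->{$step};
    }
    elsif (ref $node eq 'ARRAY') {
      return undef unless $step =~ /^\d+\z/ && $step < @$node;
      $node = $node->[$step];
    }
    else {
      return undef;              # scalar reached before the path ended
    }
  }
  return $node;
}

my $data  = {foo => [undef, 99, {bar => "hello, world"}]};
my $found = my_node_from_path($data, 'foo', 2, 'bar');
print "$found\n";                                          # hello, world

my $missing = my_node_from_path($data, 'foo', 7);
print(defined($missing) ? "found\n" : "no such path\n");   # no such path
```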

msg

Internal utility method for generating an error message.

subclass

Method that returns the short name of the subclass of Data::Domain (i.e. returns 'Int' for Data::Domain::Int).

_expand_range

Internal utility method for converting a "range" parameter into "min" and "max" parameters.

_call_lazy_domain

Internal utility method for dynamically converting lazy domains (coderefs) into domains.

SEE ALSO

Doc and tutorials on complex Perl data structures: perlref, perldsc, perllol.

Other CPAN modules doing data validation : Data::FormValidator, CGI::FormBuilder, HTML::Widget::Constraint, Jifty::DBI, Data::Constraint, Declare::Constraints::Simple. Among those, Declare::Constraints::Simple is the closest to Data::Domain, because it is also designed to deal with substructures; yet it has a different approach to combinations of constraints and scope dependencies.

Some inspiration for Data::Domain came from the wonderful Parse::RecDescent module, especially the idea of passing a context where individual rules can grab information about neighbour nodes.

TODO

  - generate javascript validation code
  - generate XML schema
  - normalization / conversions (-filter option)
  - msg callbacks (-filter_msg option)
  - default values within domains ? (good idea ?)

AUTHOR

Laurent Dami, <laurent.d...@etat.geneve.ch>

COPYRIGHT AND LICENSE

Copyright 2006, 2007 by Laurent Dami.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
