NAME

Sah - Schema for data structures (specification)

VERSION

version 0.9.3

OVERVIEW

This document specifies Sah, a schema language for validating data structures. Some features of this schema language:

  • Written as data structure

    A Sah schema is just a normal data structure. Using data structures as schemas simplifies parsing and enables easier manipulation (composition, merging, etc.) of schemas as well as validation of the schemas themselves.

  • Emphasis on reusability

    A schema can be defined in terms of (based on) other schemas. An example:

     # schema: even (even numbers)
     [int => {div_by=>2}]
    
     # schema: pos_even (positive even numbers)
     [even => {min=>0}]

    In the above example, pos_even is defined in terms of even with an additional clause (min=>0). You can also override and remove clauses from your base schema, for even more flexibility.

     # schema: pos_even_or_odd (positive even or odd numbers)
     [pos_even => {"[merge!]div_by"=>2}] # remove the div_by clause

    The above example makes pos_even_or_odd effectively equivalent to a positive integer schema.

    For schema-local definition, you can also define schemas within schemas:

     # dice_throws: array of dice throw results
     ["array*" => {of => 'dice_throw*'},
      {def => {
          dice_throw => [int => {between=>[1, 6]}],
      }},
     ]

    The dice_throw schema will only be visible from within the dice_throws schema.

SPECIFICATION VERSION

0.9

STATUS

In the 0.9.0 series, there will probably still be incompatible syntax changes between revisions before the spec stabilizes into the 1.0 series.

TERMINOLOGY

Although it can contain extra information, a schema is essentially a type definition, stating a set of valid values for data. Sah defines several basic types like "bool", "int", "str", "array", "hash", and a few others.

A type can have clauses, which mostly declare constraints. When validating data, each clause will be tested and must succeed for the whole validation to succeed (there are exceptions, but they are not important right now). Aside from declaring constraints, clauses can also declare other things such as a default value (the default clause) or metadata (the summary, description, tags clauses).

A clause set is just a set of clauses, written in a defhash (see DefHash). Defhash properties map to Sah clauses, while defhash property attributes map to Sah clause attributes. A Sah schema essentially consists of a type name and a clause set.

Base schema. You can define a schema, declare it as a new type, and then write subsequent schemas against that type, along with additional clauses. This is very much like subtyping. See "BASE SCHEMA" for more information.

A type can also have type properties (not to be confused with properties in the DefHash terminology, which map to clauses in Sah). For example, the type str can have the property length (an integer). The type email_address might have the properties local (the local part, the part before the last @, a string) and domain (the part after the last @, a string). Type properties can also have parameters; for example, the type email might have a property called header(header=>'subject').

Types are usually implemented in the target language using classes, and type properties using (class instance) methods. To retrieve a property's value, the method with the same name as the property is called.

GENERAL STRUCTURE

A Sah schema is an array:

 [TYPE_NAME, CLAUSE_SET, EXTRAS]

TYPE_NAME is a string that must start with a letter or underscore and contain only letters, numbers, and underscores. CLAUSE_SET and EXTRAS are optional.

Examples:

 ['int']
 ['int', {min=>1, max=>10}]

If you don't have EXTRAS, you can also write it like this (saves a couple of characters):

 ["int", min=>1, max=>10]

If you don't have any clauses, you can use the scalar/string form:

 "int"

EXTRAS is a hashref. Currently the only known key is def, for defining base schemas locally (see "BASE SCHEMA" for more information). The other keys are reserved for future use.

Text/strings should be in Unicode (UTF-8).

String form shortcuts

For convenience, the string form can be used to specify not only the type but also some common clauses, via shortcut syntax:

  • The * suffix (req=>1)

     "X*"

    is equivalent to:

     [X => {req=>1}]
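The equivalent forms described above can be brought into one canonical shape before further processing. The following is a minimal sketch in Python, with Perl hashes and arrays modeled as dicts and lists. The function name normalize_schema, the ASCII-only reading of "letters", and the choice to let the * suffix set req unconditionally are assumptions of this sketch, not part of the specification.

```python
import re

# Assumption: "letters" in the TYPE_NAME rule means ASCII letters.
TYPE_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def normalize_schema(schema):
    """Return [type_name, clause_set, extras] for any accepted schema form."""
    if isinstance(schema, str):
        # Scalar/string form: "int" or "int*"
        type_name, clause_set, extras = schema, {}, {}
    elif isinstance(schema, list) and schema:
        type_name, rest = schema[0], schema[1:]
        if rest and not isinstance(rest[0], dict):
            # Flattened form without EXTRAS: [TYPE_NAME, key, value, ...]
            if len(rest) % 2:
                raise ValueError("flattened clauses must be key/value pairs")
            clause_set, extras = dict(zip(rest[0::2], rest[1::2])), {}
        else:
            # [TYPE_NAME], [TYPE_NAME, CLAUSE_SET] or
            # [TYPE_NAME, CLAUSE_SET, EXTRAS]
            if len(rest) > 2:
                raise ValueError("too many elements in array schema")
            clause_set = dict(rest[0]) if len(rest) > 0 else {}
            extras = dict(rest[1]) if len(rest) > 1 else {}
    else:
        raise ValueError("schema must be a string or a non-empty array")

    if not isinstance(type_name, str):
        raise ValueError("type name must be a string")

    # The * suffix is a shortcut for req=>1.
    # Assumption: the suffix wins over any req already in the clause set.
    if type_name.endswith("*"):
        type_name = type_name[:-1]
        clause_set["req"] = 1

    if not TYPE_NAME_RE.fullmatch(type_name):
        raise ValueError(f"invalid type name: {type_name!r}")
    return [type_name, clause_set, extras]
```

With this sketch, "X*", ["X", {"req": 1}] and ["X", "req", 1] all normalize to the same three-element list.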

BASE SCHEMA

As mentioned before, you can define a schema as a type and then write other schemas against that type. For example:

 # defined as pos_int type
 [int => {min=>0}]

and later:

 # a positive integer, divisible by 5
 [pos_int => {div_by=>5}]

During data validation, each base schema will be replaced by its original definition, and all the clause sets will be evaluated. This is illustrated below, with a plus sign joining the clause sets:

 [int => {min=>0} + {div_by=>5}]
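The replacement step can be sketched as a loop that follows base schemas until a basic type is reached, collecting clause sets along the way. This Python sketch is illustrative only: the names resolve, BUILTIN_TYPES and the defined registry are hypothetical, schemas are assumed to be stored as (type name, clause set) pairs, and only a few basic types are listed.

```python
# Assumption: a small, incomplete set of basic type names for illustration.
BUILTIN_TYPES = {"bool", "int", "str", "array", "hash"}


def resolve(type_name, clause_set, defined):
    """Expand base schemas until a basic type is reached.

    `defined` maps a schema name to its (type_name, clause_set) pair.
    Returns (basic_type, [clause_set, ...]) with the base-most set first.
    """
    clause_sets = [clause_set]
    seen = set()
    while type_name not in BUILTIN_TYPES:
        if type_name in seen:
            raise ValueError(f"circular definition involving {type_name!r}")
        seen.add(type_name)
        if type_name not in defined:
            raise KeyError(f"unknown type {type_name!r}")
        type_name, base_clause_set = defined[type_name]
        # The base schema's clauses are evaluated before the child's.
        clause_sets.insert(0, base_clause_set)
    return type_name, clause_sets


defined = {"pos_int": ("int", {"min": 0})}
resolved = resolve("pos_int", {"div_by": 5}, defined)
```

Here resolved is the pair ("int", [{"min": 0}, {"div_by": 5}]), which corresponds to the plus-sign illustration above.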

You can also declare base schemas/types locally using the def key in EXTRAS (the third element of the array schema), for example:

 [throws => {},
  {
      def => {
          single_dice_throw  => [int => {in => [1,2,3,4,5,6]}],
          sdt                => "single_dice_throw", # short notation
          dice_pair_throw    => [array => {len=>2, elems=>["sdt", "sdt"]}],
          dpt                => "dice_pair_throw",   # short notation
          throw              => [any => {of => ["sdt", "dpt"]}],
          throws             => [array => {of => 'throw'}],
      },
  }
 ]

The above schema describes a list of dice throws ("throws"). Each throw can be a single dice throw ("sdt"), which is a number between 1 and 6, or a throw of two dice ("dpt"), which is a 2-element array (where each element is a number between 1 and 6).

Examples of valid data for this schema:

 [1, [1,3], 6, 4, 2, [3,5]]

Examples of invalid data:

 1                  # not an array
 [1, [2, 3], 0]     # the third throw is invalid
 [1, [2, 0, 4], 4]  # the second throw is invalid

All the base schema names ("throw", "throws", "sdt", etc.) are declared only locally and are unknown outside the schema. You can even nest such definitions.
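To make the meaning of the dice schema concrete, here is a hand-written Python equivalent of what a validator generated from it would check. It is a sketch rather than compiler output: the function names are hypothetical, and it does not model undefined values, which in the schema are governed by the req clause.

```python
def is_single_dice_throw(value):
    # int with in => [1,2,3,4,5,6]; bool is excluded because Python
    # treats True/False as integers.
    return isinstance(value, int) and not isinstance(value, bool) and 1 <= value <= 6


def is_dice_pair_throw(value):
    # array with len=>2 whose elements are both single dice throws
    return (
        isinstance(value, list)
        and len(value) == 2
        and all(is_single_dice_throw(v) for v in value)
    )


def is_throw(value):
    # any => {of => ["sdt", "dpt"]}
    return is_single_dice_throw(value) or is_dice_pair_throw(value)


def is_throws(data):
    # array => {of => 'throw'}
    return isinstance(data, list) and all(is_throw(v) for v in data)
```

Applied to the examples above, is_throws accepts [1, [1,3], 6, 4, 2, [3,5]] and rejects 1, [1, [2, 3], 0] and [1, [2, 0, 4], 4].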

Optional/conditional definition

If you put a ? suffix after the definition name then it means that the definition is optional and can be skipped if the type is already defined, e.g.:

  def         => {
      "email?"   => [str => {req=>1, match=>".+\@.+"}],
      "username" => [str => {req=>1, match=>'^[a-z0-9_]+$'}],
  },

In the above example, if there is already an "email" type defined at that time, the definition will be skipped instead of a "cannot redefine type" error being generated.

Optional definition is useful if you want to provide some defaults (e.g. a rudimentary validation for email) but don't mind if the validator already has something probably better (a stricter or more precise definition of email).
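The skip-if-defined behavior can be sketched as follows in Python. The registry dict and the function name define_types are hypothetical; a real implementation may report the redefinition error differently.

```python
def define_types(registry, definitions):
    """Add definitions to registry, honoring the ? (optional) suffix."""
    for name, schema in definitions.items():
        optional = name.endswith("?")
        if optional:
            name = name[:-1]
        if name in registry:
            if optional:
                # Keep the existing, possibly stricter, definition.
                continue
            raise ValueError(f"cannot redefine type {name!r}")
        registry[name] = schema
    return registry


# An "email" type already exists, so the optional definition is skipped.
registry = {"email": ["str", {"req": 1, "match": r"^[^@\s]+@[^@\s]+$"}]}
define_types(registry, {
    "email?": ["str", {"req": 1, "match": r".+@.+"}],
    "username": ["str", {"req": 1, "match": r"^[a-z0-9_]+$"}],
})
```

After the call, registry still holds the original email schema, and username has been added.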

TYPE

CLAUSE

A clause set is a defhash containing clause names and values (as hash keys and values) as well as clause attribute names and values (also as hash keys and values):

 {
     'CLAUSENAME1' => CLAUSEVALUE,
     'CLAUSENAME1.ATTRNAME1' => ATTRVALUE1,
     'CLAUSENAME1.ATTRNAME2' => ATTRVALUE2,
     'CLAUSENAME1.ATTRNAME1.SUBATTR1' => ...,
     ...
     _IGNORED => ...,
     'CLAUSENAME1._IGNORED' => ...,
 }

For convenience, there are also some shortcuts:

  • & suffix (multiple clause values, all must succeed)

     "CLAUSENAME&" => [VAL, ...]

    is equivalent to:

     "CLAUSENAME.vals" => [VAL, ...],
     "CLAUSENAME.max_nok" => 0,

  • | suffix (multiple clause values, at least one must succeed)

     "CLAUSENAME|" => [VAL, ...]

    is equivalent to:

     "CLAUSENAME.vals" => [VAL, ...],
     "CLAUSENAME.min_ok" => 1,

  • ! prefix (negation)

     "!CLAUSENAME" => VAL

    is a shortcut for this:

     CLAUSENAME => VAL,
     "CLAUSENAME.max_ok" => 0,

  • = suffix (expression)

     "CLAUSENAME=" => EXPR
     "CLAUSENAME.ATTRNAME1=" => EXPR

    are equivalent to:

     "CLAUSENAME.expr" => EXPR
     "CLAUSENAME.ATTRNAME1.expr" => EXPR

When doing validation, all clauses will be evaluated and must succeed if the validation is to succeed. The order of evaluation usually does not matter, but some clauses are early (like "default" and "prefilters") and some are late (like "postfilters").
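The shortcut expansions listed above are mechanical key rewrites. A minimal Python sketch follows; the function name expand_shortcuts is hypothetical, and combinations of several shortcuts on one key are not handled.

```python
def expand_shortcuts(clause_set):
    """Rewrite &, |, ! and = shortcuts into explicit clause attributes."""
    expanded = {}
    for key, value in clause_set.items():
        if key.endswith("&"):
            # Multiple values, none may fail.
            name = key[:-1]
            expanded[name + ".vals"] = value
            expanded[name + ".max_nok"] = 0
        elif key.endswith("|"):
            # Multiple values, at least one must pass.
            name = key[:-1]
            expanded[name + ".vals"] = value
            expanded[name + ".min_ok"] = 1
        elif key.startswith("!"):
            # Negation: the clause value must not pass.
            name = key[1:]
            expanded[name] = value
            expanded[name + ".max_ok"] = 0
        elif key.endswith("="):
            # The value is an expression, for a clause or an attribute.
            expanded[key[:-1] + ".expr"] = value
        else:
            expanded[key] = value
    return expanded
```

For example, {"!in": [1, 2]} expands to {"in": [1, 2], "in.max_ok": 0}, and {"min=": "2+2"} expands to {"min.expr": "2+2"}.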

Clause name

This specification comes from DefHash: Clause names must begin with a letter or underscore and contain only letters, numbers, and underscores. All clauses which begin with an _ (underscore) are ignored. You can use this to embed extra data for other purposes.

Clause attribute

This specification comes from DefHash: Attribute names must also contain only letters, numbers, and underscores, but can be a dot-separated series of parts, e.g. alt.lang.id_ID. As with clauses, clause attributes which begin with _ (underscore) are ignored. You can use this to embed extra data.
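The naming rules for clause set keys can be sketched as a small parser that splits a key into its clause name and attribute, and flags ignored keys. This Python sketch assumes shortcut prefixes and suffixes have already been expanded, reads "letters" as ASCII letters, and allows an empty clause name for keys such as '.min_ok' that apply to the clause set as a whole. The name parse_key is hypothetical.

```python
import re

# Clause name (optional), followed by zero or more dot-separated parts.
KEY_RE = re.compile(r"([A-Za-z_][A-Za-z0-9_]*)?((?:\.[A-Za-z0-9_]+)*)")


def parse_key(key):
    """Return (clause_name, attribute, ignored) for a clause set key."""
    match = KEY_RE.fullmatch(key)
    if not key or not match:
        raise ValueError(f"invalid clause set key: {key!r}")
    clause_name = match.group(1) or ""
    attribute = match.group(2)[1:]  # drop the leading dot, if any
    # Names starting with an underscore are ignored by the validator.
    ignored = clause_name.startswith("_") or attribute.startswith("_")
    return clause_name, attribute, ignored
```

For instance, parse_key("match.err_msg.id_ID") gives ("match", "err_msg.id_ID", False), while parse_key("min_len._note") gives ("min_len", "_note", True).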

Currently known attributes:

  • vals => ARRAY

    This attribute can be used to store more than one value for a clause. Example:

     ['int*' => {"div_by.vals"=>[2, 3, 5]}]

    The above schema requires an integer which is divisible by 2, 3, and 5.

    If this attribute is set and the clause itself is also given a value, then all values from both must pass the clause. For example, the previous schema can also be written as:

     ['int*' => {div_by=>2, "div_by.vals"=>[3, 5]}]

  • min_ok, max_ok, min_nok, max_nok => INT

    In a clause when using multiple clause values, these attributes regulate the {minimum, maximum} number of values that must {pass, fail} the check for the whole clause to pass.

    Analogously, in a clause set, these attributes regulate the {minimum, maximum} number of clauses that must {pass, fail} the check for the whole clause set to pass.

    min_ok's default is undef. You can use this attribute to require only a certain number of passing checks (instead of all).

    Example:

     [str => {cset => {min_len=>8, match=>qr/\W/}, 'cset.min_ok'=>1}]

    The above schema requires a string to be at least 8 characters long or to contain a non-word character. Strings that would validate include: abcdefgh or $ or $abcdefg. Strings that would not validate include: abcd (fails both the min_len and match clauses). Without the .min_ok attribute, by default all checks in the cset clause must pass.

    Another example:

     [str => {'match.vals'=>[RE1, RE2, RE3], 'match.min_ok'=>2}]

    The above schema specifies that a string must match at least two of RE1/RE2/RE3.

    max_ok's default is undef. You can use this attribute to limit the number of checks that may pass, in effect requiring some of them to fail.

    Example:

     ['str', min_len=>8, match=>qr/\W/, '.min_ok'=>1, '.max_ok'=>1]

    The above schema states that a string must either be at least 8 characters long or contain a non-word character, but not both. Strings that would validate include: abcdefgh or $. Strings that would not validate include: $abcdefg (passes both clauses, so max_ok is not satisfied).

    Another example:

     [str => {'match.vals'=>[RE1, RE2, RE3], 'match.max_ok'=>1}]

    The above schema specifies that string must not match more than one of RE1/RE2/RE3.

    min_nok's default is undef. You can use this attribute to require a certain number of failures.

    Example:

     [str => {cset=>{min_len=>8, match=>qr/\W/}, 'cset.min_nok'=>1}]

    The above schema requires a string to be shorter than 8 characters or devoid of non-word characters. Strings that would validate include: abcdefghi (fails the match clause), $abcd (fails min_len clause), or a (fails both clauses). Strings that would not validate include: $abcdefg.

    Another example:

     [str => {'match.vals'=>[RE1, RE2, RE3], 'match.min_nok'=>1}]

    The above schema specifies that string must fail at least one regex match.

    max_nok's default is undef, but when none of the {min,max}_{ok,nok} attributes is defined, the default behavior is to require all clauses to succeed, in other words, as if max_nok were 0. You can use this attribute to tolerate a certain number of failures in the checks.

    Example:

     [str => {cset=>{min_len=>8, match=>qr/\W/}, 'cset.max_nok'=>1}]

    The above schema states that a string must be at least 8 characters long or contain a non-word character, or both. Strings that would validate include: abcdefgh, $$, $abcdefgh. Strings that would not validate include: abcd (fails both the min_len and match clauses).

    Another example:

     [str => {'match.vals'=>[RE1, RE2, RE3], 'match.max_nok'=>1}]

    The above schema specifies that string can fail at most one regex match.

  • expr => STR

    Can be set on a clause or on another attribute to indicate that the value is not a literal but an expression. Example:

     # a string, minimum 4 characters
     [str => {min_len => 4}]
    
     # same thing, albeit a bit fancier
     [str => {'min_len=' => '2*2'}]
    
     # for default, we pick a random number between 1 and 10
     [int => {'default=' => 'int(10*rand())+1'}],

    Expressions are useful for more complex schemas, when a clause/attribute value needs to be calculated in terms of other values, and/or using functions.

    Note that not all clauses or attributes support expressions.

  • err_level => STR (default: error)

    Valid value: error, warn. Normally, when clause checking fails, an error is generated and it causes validation of the whole schema to fail. If err_level is set to warn, however, this only generates a warning and does not cause the validation to fail.

     # password
     ['str*' => {'cset&' => [
       {min_len             => 4},
       {min_len             => 8,
        "min_len.err_level" => "warn",
        "min_len.err_msg"   => "Although a password shorter than 8 characters is ".
                               "valid, it is highly recommended that a password be ".
                               "at least 8 characters long, for security reasons"},
     ]}],

    In the above example, err_level and err_msg are attributes of the min_len clause. The second clause set adds an optional restriction for the password: when its min_len clause is not satisfied, only a warning is issued instead of the data failing validation.

  • err_msg[.LANGCODE]

    This tells the compiler that instead of the default error message from the type handler, a custom error message is supplied. You can add translations by adding more attributes. For example:

     [str=>{match                  => qr/\A[A-Za-z0-9_-]*\z/,
            'match.err_msg'        => 'Must not contain naughty characters',
            'match.err_msg.id_ID'  => 'Tidak boleh mengandung karakter aneh-aneh',
     }]

  • human[.LANGCODE]

    This attribute is ignored when validating data, but is used by the human compiler to supply a description. You can add translations by adding more attributes.

     [str=>{match               => qr/\A[A-Za-z0-9_-]*\z/,
            'match.human'       => 'Must not contain naughty characters',
            'match.human.id_ID' => 'Tidak boleh mengandung karakter aneh-aneh',
     }]

  • alt

    This comes from DefHash, mainly used to store translations for name, summary, description.

  • result_var => VARNAME (EXPERIMENTAL)

    Specify variable name to store results in.

    Aside from pass/failure, a clause or clause set can also produce some value. This attribute specifies where to put that result. The value can then be used by referring to the variable in an expression. Example:

     [any => {
         of => [
             ['str*'   => {min_len=>1, max_len=>10}], # 0
             ['str*'   => {min_len=>11}],             # 1
             ['array*' => {}],                        # 2
             ['hash*'  => {}],                        # 3
         ],
        'of.result_var' => 'a',
     }]

    Aside from passing/failing the validation, the of clause above also produces the index of the schema in the list which matches. So if you validate an array, $a in the schema will be set to 2. If you validate a string with length 12, $a will be set to 1. If you pass an empty string (which does not pass the of clause), $a will not be set.

    Refer to each clause's documentation to find out what value the clause returns.
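The min_ok, max_ok, min_nok and max_nok attributes described in the list above reduce to a counting rule over individual pass/fail results. The following Python sketch shows that rule applied to the min_len=>8 and match=>qr/\W/ examples. The names outcome and checks are hypothetical, and None stands in for undef.

```python
import re


def outcome(results, min_ok=None, max_ok=None, min_nok=None, max_nok=None):
    """Decide whether a clause (or clause set) passes, given check results."""
    ok = sum(1 for passed in results if passed)
    nok = len(results) - ok
    if min_ok is None and max_ok is None and min_nok is None and max_nok is None:
        # Default behavior: every check must succeed.
        max_nok = 0
    return (
        (min_ok is None or ok >= min_ok)
        and (max_ok is None or ok <= max_ok)
        and (min_nok is None or nok >= min_nok)
        and (max_nok is None or nok <= max_nok)
    )


def checks(string):
    """Results for the clauses min_len=>8 and match=>qr/\\W/."""
    return [len(string) >= 8, re.search(r"\W", string) is not None]


at_least_one = outcome(checks("abcdefgh"), min_ok=1)           # long enough
exactly_one = outcome(checks("$abcdefg"), min_ok=1, max_ok=1)  # passes both
some_failure = outcome(checks("$abcd"), min_nok=1)             # too short
one_tolerated = outcome(checks("abcd"), max_nok=1)             # fails both
```

In this sketch at_least_one and some_failure are true, while exactly_one and one_tolerated are false, matching the examples given for each attribute.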

Clause set merging

Clause set merging happens when a schema is based on another schema and the child schema's clause set contains merge prefixes (explained later) in its keys. For example:

 # schema1
 [TYPE1 => CSET1]

 # schema2, based on schema1
 [schema1 => CSET2]

 # schema3, based on schema2
 [schema2 => CSET3]

When compiling/evaluating schema2, Sah will check against TYPE1 and CSET1 and then CSET2. However, when CSET2 contains a merge prefix (marked with an asterisk here for illustration), then Sah will check against TYPE1 and merge(CSET1, *CSET2).

When compiling/evaluating schema3, Sah will check against TYPE1 and CSET1 and then CSET2 and then CSET3. However, when CSET2 contains a merge prefix, then Sah will check against TYPE1, merge(CSET1, *CSET2), and then CSET3. When CSET2 and CSET3 both contain merge prefixes, Sah will check against TYPE1 and merge(CSET1, *CSET2, *CSET3). So merging will be done from left to right.

The base schema's clause set must not contain any merge prefixes.

Merging is done using Data::ModeMerge, with merge prefixes changed to '[merge+]', '[merge!]' and so on. In merging, Data::ModeMerge allows keys on the right side hash not only to replace but also add, subtract, remove keys from the left side. This is powerful because it allows schema definition to not only add clauses (restrict types even more), but also replace clauses (change type restriction) as well as delete clauses (relax type restriction). For more information, refer to the Data::ModeMerge documentation.

Illustration:

 int + {div_by=>2} + {  div_by =>3}            # must be divisible by 2 & 3

 int + {div_by=>2} + {'[merge]div_by'=>3}      # will be merged and become:
 int + {div_by=>3}                             # must be divisible by 3 ONLY

 int + {div_by=>2} + {'[merge!]div_by'=>0}     # will be merged and become:
 int + {}                                      # need not be divisible by any

 int + {in=>[1,2,3,4,5]} + {  in =>[6]}        # impossible to satisfy

 int + {in=>[1,2,3,4,5]} + {'[merge+]in'=>[6]} # will be merged and become:
 int + {in=>[1,2,3,4,5,6]}

 int + {in=>[1,2,3,4,5]} + {'[merge-]in'=>[4]} # will be merged and become:
 int + {in=>[1,2,3,  5]}

Merging is performed before schema is normalized.

Merging is not recursive.
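The merging rules above can be approximated by the following Python sketch. It is a simplified stand-in and not the Data::ModeMerge algorithm: it handles only the replace, delete, add and subtract modes on top-level keys, supports add and subtract for lists only, and assumes the subtract prefix is spelled '[merge-]' by analogy with '[merge+]' and '[merge!]'. It also assumes that a clause set with a merge prefix is merged into the clause set immediately before it. The function names are hypothetical.

```python
import re

PREFIX_RE = re.compile(r"\[merge([+!-]?)\](.+)")


def has_merge_prefix(clause_set):
    return any(PREFIX_RE.fullmatch(key) for key in clause_set)


def merge_pair(left, right):
    """Merge the right clause set into a copy of the left one."""
    merged = dict(left)
    for key, value in right.items():
        match = PREFIX_RE.fullmatch(key)
        mode, name = (match.group(1), match.group(2)) if match else ("", key)
        if mode == "":    # '[merge]' or unprefixed key: replace
            merged[name] = value
        elif mode == "!":  # delete the clause
            merged.pop(name, None)
        elif mode == "+":  # add (lists only in this sketch)
            merged[name] = list(merged.get(name, [])) + list(value)
        elif mode == "-":  # subtract (lists only in this sketch)
            merged[name] = [v for v in merged.get(name, []) if v not in value]
    return merged


def collapse(clause_sets):
    """Merge prefixed clause sets into their predecessor, left to right."""
    result = []
    for clause_set in clause_sets:
        if has_merge_prefix(clause_set):
            if not result:
                raise ValueError("base clause set must not use merge prefixes")
            result[-1] = merge_pair(result[-1], clause_set)
        else:
            result.append(dict(clause_set))
    return result
```

For example, collapse([{"div_by": 2}, {"[merge]div_by": 3}]) yields [{"div_by": 3}], whereas collapse([{"div_by": 2}, {"div_by": 3}]) keeps both clause sets.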

EXPRESSION

XXX: Syntax of variables not yet fixed.

Sah supports expressions, using the Language::Expr minilanguage. See Language::Expr::Manual::Syntax for details on the syntax. You can specify an expression in the check clause, e.g.:

 [int => {check => '$_ >= 4'}]

Alternatively, expression can also be specified in any clause's attribute:

 [int => {'min='     => '2+2'}]
 [int => {'min.val=' => 'floor(4.9)'}]

The above three schemas are equivalent to:

 [int => {min => 4}]

Expressions can refer to elements of the data and the (normalized) schema, and can call functions, enabling more complex schemas to be defined, for example:

 ['array*' => {len=>2, elems => [
   ['str*', {match => '^\w+$'}],
   ['str*', {'match=' => '${../../0/clause_sets/0/match}',
             'min_len=' => '2*length(${data:../0})'}]
 ]}]

The above schema requires data to be a two-element array containing strings, where the length of the second string has to be at least twice the length of the first. Both strings have to comply with the same regex, qr/^\w+$/ (which is declared in the first string's clause and referred to in the second string's clause).
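Written by hand, the checks that the two-element array schema describes look roughly like the Python sketch below. It does not evaluate Language::Expr expressions; it only spells out the resulting constraints, and the function name valid_pair is hypothetical.

```python
import re

# The regex declared on the first element and reused by the second.
WORD_RE = re.compile(r"^\w+$")


def valid_pair(data):
    # array* with len=>2
    if not (isinstance(data, list) and len(data) == 2):
        return False
    first, second = data
    # Both elements are required strings (str*).
    if not (isinstance(first, str) and isinstance(second, str)):
        return False
    # Both must match the shared regex.
    if not (WORD_RE.search(first) and WORD_RE.search(second)):
        return False
    # min_len of the second string is 2*length(first string).
    return len(second) >= 2 * len(first)
```

Under this sketch ["ab", "abcd"] is valid, while ["ab", "abc"] (second string too short) and ["ab", "ab cd"] (second string fails the regex) are not.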

FUNCTION

Functions can be used in expressions. The syntax for calling a function is:

 func()
 func(ARG, ...)

Functions in Sah can sometimes accept several types of arguments, e.g. length(ARRAY) will return the number of elements in the ARRAY, while length(STR) will return the number of characters in the string. However, when an inappropriate argument is given, a Perl exception will be thrown.
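The argument-dependent behavior of length can be illustrated with a small Python analogue. This is only a sketch of the dispatch idea: the real function is provided by the expression language, and TypeError here stands in for the exception mentioned above.

```python
def length(value):
    """Number of elements for an array, number of characters for a string."""
    if isinstance(value, (list, tuple)):
        return len(value)
    if isinstance(value, str):
        return len(value)
    # Inappropriate argument types are rejected rather than coerced.
    raise TypeError(f"length() does not accept {type(value).__name__}")
```

Calling length([1, 2, 3]) and length("abc") both return 3, while length(42) raises an exception.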

HISTORY

2012-07-21 split specification to Sah

2011-11-23 Data::Sah

2009-03-30 Data::Schema (first CPAN release)

Previous incarnation as Schema-Nested (internal)

SEE ALSO

DefHash

Sah::Type, Sah::FAQ

AUTHOR

Steven Haryanto <stevenharyanto@gmail.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2012 by Steven Haryanto.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.