Steven Haryanto

NAME

Sah - Specification of Sah schema language

VERSION

version 0.9.1

SPECIFICATION VERSION

0.9

STATUS

In the 0.9.0 series, there will probably still be incompatible syntax changes between revisions before the spec stabilizes in the 1.0 series.

OVERVIEW AND TERMINOLOGY

Although it can contain extra information, a schema is essentially a type definition: it states the set of valid values for a piece of data. Sah defines several basic types such as "bool", "int", "str", "array", "hash", and a few others.

A type can have clauses, which mostly declare constraints. Each clause has a clause name and a hash of clause attributes. The most commonly used attribute is probably val; it is so common that specifying a clause in a schema sets the val attribute by default.

 min => 10

 # is actually equivalent to:
 'min.val' => 10

Aside from the val attribute, there are others, serving various functions. See "Clause attributes" for more information.

When validating data, each clause is tested and must succeed for the whole validation to succeed (there are exceptions, but they are not important right now). Aside from declaring constraints, clauses can also declare other things, such as a default value (the default clause) or metadata (the summary, description, and tags clauses). See each type's documentation for the list of its known clauses.

A clause set is simply a hash of clauses. A Sah schema essentially consists of a type name and a clause set.

Base schema. You can define a schema, declare it as a new type, and then write subsequent schemas against that type, adding further clauses. This is very much like subtyping. See "BASE SCHEMA" for more information.

GENERAL STRUCTURE

A Sah schema is an array:

 [TYPE_NAME, CLAUSE_SET, EXTRAS]

TYPE_NAME is a string; it must start with a letter or underscore and contain only letters, numbers, and underscores. CLAUSE_SET and EXTRAS are optional.

Examples:

 ['int']
 ['int', {min=>1, max=>10}]

If you don't have EXTRAS, you can also write it like this (saves a couple of characters):

 ["int", min=>1, max=>10]

If you don't have any clauses, you can use the scalar/string form:

 "int"

EXTRAS is a hashref. Currently the only known key is def, for defining base schemas locally (see "BASE SCHEMA" for more information). The other keys are reserved for future use.

Text/strings should be in Unicode (UTF-8).

String form shortcuts

For convenience, the string form can specify not only the type but also some common clauses via shortcut syntax:

  • The * suffix (req=>1)

     "X*"

    is equivalent to:

     [X => {req=>1}]
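The accepted schema forms can be reduced to the single three-element form [TYPE_NAME, CLAUSE_SET, EXTRAS]. The following is a minimal sketch in Python (used here only for illustration; Python lists and dicts stand in for the arrays and hashes above). The function name normalize_schema is hypothetical and not part of Sah, "letters" is assumed to mean ASCII letters, and the "*" suffix is assumed to be accepted on the type name in the array form as well, as in later examples such as ['int*' => {...}].

```python
import re

# Type names start with a letter/underscore, then letters/numbers/underscores.
TYPE_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def normalize_schema(schema):
    """Return [type_name, clause_set, extras] for any accepted schema form."""
    if isinstance(schema, str):
        # Scalar/string form: just the type name (possibly with a shortcut).
        schema = [schema]
    name, rest = schema[0], list(schema[1:])
    clause_set, extras = {}, {}
    if rest and isinstance(rest[0], dict):
        # [TYPE_NAME, CLAUSE_SET] or [TYPE_NAME, CLAUSE_SET, EXTRAS]
        clause_set = dict(rest[0])
        if len(rest) > 1:
            extras = dict(rest[1])
    elif rest:
        # Flattened form: [TYPE_NAME, key, value, key, value, ...]
        if len(rest) % 2:
            raise ValueError("flattened clauses must come in key/value pairs")
        clause_set = dict(zip(rest[0::2], rest[1::2]))
    if name.endswith("*"):
        # The "*" suffix is a shortcut for req=>1.
        name = name[:-1]
        clause_set["req"] = 1
    if not TYPE_NAME_RE.fullmatch(name):
        raise ValueError(f"invalid type name: {name!r}")
    return [name, clause_set, extras]
```

With this sketch, "int*" and ["int", {"req": 1}] normalize to the same structure, and ["int", "min", 1, "max", 10] normalizes to the same structure as ["int", {"min": 1, "max": 10}].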

BASIC TYPES

To be written.

BASE SCHEMA

As mentioned before, you can define a schema as a type and then write other schemas against that type. For example:

 # defined as pos_int type
 [int => {min=>0}]

and later:

 # a positive integer, divisible by 5
 [pos_int => {div_by=>5}]

During data validation, each base schema is replaced by its original definition, and all the clause sets are evaluated. This is illustrated here with the plus sign:

 [int => {min=>0} + {div_by=>5}]
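The expansion illustrated by the plus sign can be sketched as follows, in Python for illustration only. The names resolve, BUILTIN_TYPES, and defs are hypothetical, and the set of built-in types is an assumption made for the example; the result is the underlying basic type plus the ordered list of clause sets to evaluate.

```python
# Assumed set of basic types for this sketch only.
BUILTIN_TYPES = {"bool", "int", "str", "array", "hash", "any"}


def resolve(schema, defs):
    """Expand [type, clause_set] into (basic_type, [clause_set, ...])."""
    type_name, clause_set = schema
    clause_sets = [clause_set]
    seen = set()
    while type_name not in BUILTIN_TYPES:
        if type_name in seen or type_name not in defs:
            raise ValueError(f"unknown or circular type: {type_name!r}")
        seen.add(type_name)
        # Replace the base schema by its definition; its clause set goes first.
        type_name, base_clause_set = defs[type_name]
        clause_sets.insert(0, base_clause_set)
    return type_name, clause_sets


# pos_int is defined as [int => {min=>0}], as in the example above.
defs = {"pos_int": ["int", {"min": 0}]}
expanded = resolve(["pos_int", {"div_by": 5}], defs)
```

Here expanded holds the type "int" together with the two clause sets {min=>0} and {div_by=>5}, in that order.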

You can also declare base schemas/types locally using the def key in EXTRAS (the third element of the array schema), for example:

 [throws => {},
  {
      def => {
          single_dice_throw  => [int => {in => [1,2,3,4,5,6]}],
          sdt                => "single_dice_throw", # short notation
          dice_pair_throw    => [array => {len=>2, elems=>["sdt", "sdt"]}],
          dpt                => "dice_pair_throw",   # short notation
          throw              => [any => {of => ["sdt", "dpt"]}],
          throws             => [array => {of => 'throw'}],
      },
  }
 ]

The above schema describes a list of dice throws ("throws"). Each throw can be a single die throw ("sdt"), which is a number between 1 and 6, or a throw of two dice ("dpt"), which is a 2-element array where each element is a number between 1 and 6.

Examples of valid data for this schema:

 [1, [1,3], 6, 4, 2, [3,5]]

Examples of invalid data:

 1                  # not an array
 [1, [2, 3], 0]     # the third throw is invalid
 [1, [2, 0, 4], 4]  # the second throw is invalid

All the base schema names ("throw", "throws", "sdt", etc.) are declared only locally and are unknown outside the schema. Definitions can even be nested.
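To make the dice example concrete, here is a small Python sketch (for illustration only) that checks data against the local definitions. The function is_valid is hypothetical, and the meanings given to the in, len, elems, and of clauses are assumptions inferred from the description above, since the basic types are not yet specified in this document.

```python
DEFS = {
    "single_dice_throw": ["int", {"in": [1, 2, 3, 4, 5, 6]}],
    "sdt": "single_dice_throw",
    "dice_pair_throw": ["array", {"len": 2, "elems": ["sdt", "sdt"]}],
    "dpt": "dice_pair_throw",
    "throw": ["any", {"of": ["sdt", "dpt"]}],
    "throws": ["array", {"of": "throw"}],
}


def is_valid(data, schema, defs):
    """Check data against a schema, expanding locally defined types."""
    if isinstance(schema, str):
        schema = [schema]
    type_name = schema[0]
    clauses = schema[1] if len(schema) > 1 else {}
    if type_name in defs:
        # Base schema: data must satisfy the definition, then any extra clauses.
        base = defs[type_name]
        base_type = base if isinstance(base, str) else base[0]
        return is_valid(data, base, defs) and (
            not clauses or is_valid(data, [base_type, clauses], defs))
    if type_name == "int":
        ok = isinstance(data, int) and not isinstance(data, bool)
        return ok and ("in" not in clauses or data in clauses["in"])
    if type_name == "any":
        # Assumed meaning: data must match at least one listed schema.
        return "of" not in clauses or any(
            is_valid(data, s, defs) for s in clauses["of"])
    if type_name == "array":
        if not isinstance(data, list):
            return False
        if "len" in clauses and len(data) != clauses["len"]:
            return False
        # Assumed meaning: "of" applies one schema to every element,
        # "elems" applies one schema per position.
        if "of" in clauses and not all(
                is_valid(x, clauses["of"], defs) for x in data):
            return False
        return all(is_valid(x, s, defs)
                   for x, s in zip(data, clauses.get("elems", [])))
    raise ValueError(f"unsupported type: {type_name!r}")
```

Under these assumptions, is_valid([1, [1, 3], 6, 4, 2, [3, 5]], "throws", DEFS) accepts the valid example, while the three invalid examples are rejected.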

Optional/conditional definition

If you put a ? suffix after the definition name then it means that the definition is optional and can be skipped if the type is already defined, e.g.:

  def         => {
      "email?"   => [str => {req=>1, match=>".+\@.+"}],
      "username" => [str => {req=>1, match=>'^[a-z0-9_]+$'}],
  },

In the above example, if there is already an "email" type defined at that time, the definition will be skipped instead of a "cannot redefine type" error being generated.

Optional definition is useful if you want to provide some defaults (e.g. a rudimentary validation for email) but don't mind if the validator already has something probably better (a stricter or more precise definition of email).
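The effect of the ? suffix can be sketched as follows, in Python for illustration only. The function register_defs and the known_types dictionary are hypothetical names; the sketch only shows that an optional definition is skipped when the type already exists, while a non-optional one raises an error.

```python
def register_defs(known_types, defs):
    """Add local definitions to known_types, honouring the "?" suffix."""
    for raw_name, schema in defs.items():
        optional = raw_name.endswith("?")
        name = raw_name[:-1] if optional else raw_name
        if name in known_types:
            if optional:
                # Keep the existing (presumably better) definition.
                continue
            raise ValueError(f"cannot redefine type {name!r}")
        known_types[name] = schema
    return known_types


# A validator that already knows a stricter "email" type (made up here).
known = {"email": ["str", {"req": 1, "match": r"^\S+@\S+\.\S+$"}]}
register_defs(known, {
    "email?": ["str", {"req": 1, "match": ".+@.+"}],
    "username": ["str", {"req": 1, "match": "^[a-z0-9_]+$"}],
})
```

After the call, known still holds the original "email" definition and has gained "username".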

CLAUSE

A clause set is a mapping of clause attributes to their values:

 {
     'CLAUSENAME1.ATTR1' => ...,
     'CLAUSENAME1.ATTR2' => ...,
     ...
     'CLAUSENAME2.ATTR1' => ...,
     'CLAUSENAME2.ATTR2' => ...,
     ...
 }

For convenience, there are also some shortcuts. First, the previously mentioned:

 CLAUSENAME => VAL

as a shortcut for:

 'CLAUSENAME.val' => VAL

(This is the Huffman encoding principle at work: since the val attribute is the most common, it has the shortest syntax.)

Another shortcut is:

 "CLAUSENAME&" => [VAL, ...]

for:

 "CLAUSENAME.vals" => [VAL, ...],
 "CLAUSENAME.max_nok" => 0,

and:

 "CLAUSENAME|" => [VAL, ...]

for:

 "CLAUSENAME.vals" => [VAL, ...],
 "CLAUSENAME.min_ok" => 1,

Also this:

 "!CLAUSENAME" => VAL

is a shortcut for this:

 CLAUSENAME => VAL,
 "CLAUSENAME.max_ok" => 0,

An attribute can contain a literal value or an expression. To specify an expression, add a = suffix:

 "CLAUSENAME=" => EXPR
 "CLAUSENAME.ATTRNAME1=" => EXPR

When doing validation, all clauses will be evaluated and must succeed if the validation is to succeed. The order of evaluation usually does not matter, but some clauses are early (like "default" and "prefilters") and some are late (like "postfilters").
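The shortcut forms described in this section can be expanded mechanically into explicit 'CLAUSENAME.ATTR' keys. The following Python sketch (for illustration only) does so; expand_shortcuts is a hypothetical name, and keeping the trailing = on the expanded key to mark an expression is an assumption, since the normalized representation of expressions is not specified here.

```python
def expand_shortcuts(clause_set):
    """Rewrite shortcut keys into explicit 'CLAUSE.ATTR' keys."""
    out = {}
    for key, value in clause_set.items():
        if key.startswith("!"):
            # "!CLAUSE" => VAL  means  CLAUSE.val plus CLAUSE.max_ok => 0
            out[key[1:] + ".val"] = value
            out[key[1:] + ".max_ok"] = 0
        elif key.endswith("&"):
            # "CLAUSE&" => [...]  means  CLAUSE.vals plus CLAUSE.max_nok => 0
            out[key[:-1] + ".vals"] = value
            out[key[:-1] + ".max_nok"] = 0
        elif key.endswith("|"):
            # "CLAUSE|" => [...]  means  CLAUSE.vals plus CLAUSE.min_ok => 1
            out[key[:-1] + ".vals"] = value
            out[key[:-1] + ".min_ok"] = 1
        else:
            # A trailing "=" marks an expression; keep the marker (assumption).
            expr = "=" if key.endswith("=") else ""
            name = key[:-1] if expr else key
            if "." not in name:
                name += ".val"  # a bare clause name sets the val attribute
            out[name + expr] = value
    return out
```

For example, expand_shortcuts({"min": 10}) gives {"min.val": 10}, and expand_shortcuts({"!div_by": 2}) gives {"div_by.val": 2, "div_by.max_ok": 0}.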

Clause name

Clause names must begin with letter/underscore and contain letters/numbers/underscores only. All clauses which begin with an "_" (underscore) will be ignored by Sah (you can use this to embed extra data for other purposes).

Clause attribute

An attribute name must also contain only letters, numbers, and underscores, but it can be a dot-separated series of parts, e.g. alt.lang.id_ID. As with clauses, clause attributes that begin with "_" (underscore) are ignored by Sah.
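The naming rules for clauses and attributes can be sketched as a small key parser, written in Python for illustration only. The names split_key and is_ignored are hypothetical, "letters" is assumed to mean ASCII letters, and an empty clause name is allowed because the empty clause is used for general attributes.

```python
import re

CLAUSE_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
PART_RE = re.compile(r"[A-Za-z0-9_]+")


def split_key(key):
    """Split 'CLAUSE.ATTR.PARTS' into (clause, attr); attr defaults to 'val'."""
    clause, _, attr = key.partition(".")
    attr = attr or "val"
    if clause and not CLAUSE_RE.fullmatch(clause):
        raise ValueError(f"invalid clause name: {clause!r}")
    if not all(PART_RE.fullmatch(part) for part in attr.split(".")):
        raise ValueError(f"invalid attribute name: {attr!r}")
    return clause, attr


def is_ignored(key):
    """Clauses and attributes starting with an underscore are ignored."""
    clause, attr = split_key(key)
    return clause.startswith("_") or attr.startswith("_")
```

For example, split_key("name.alt.lang.id_ID") returns ("name", "alt.lang.id_ID"), and is_ignored("_note") is true.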

Currently known attributes:

  • val

    This is the most commonly used attribute.

  • vals

    This attribute can be used to give a clause more than one value. Each value will be evaluated. Example:

     ['int*' => {"div_by.vals"=>[2, 3, 5]}]

    The above schema requires an integer which is divisible by 2, 3, and 5.

  • err_level

    Valid values: error, warn. The default is error. Normally, when a clause check fails, an error is generated and the whole validation fails. If err_level is set to warn, however, the failure only generates a warning and does not cause the validation to fail.

     [str=>{min_len=>4},
           {min_len=>8, 'min_len.err_level' => 'warn'},]

    Example:

     # password
     [str =>
       {req                 => 1,
        min_len             => 4},
       {min_len             => 8,
        "min_len.err_level" => "warn",
        "min_len.err_msg"   => "Although a password shorter than 8 letters is ".
                               "valid, it's highly recommended that a password is ".
                               "at least 8 letters long, for security reasons"}],

    In the above example, "err_level" and "err_msg" are attributes of the "min_len" clause. The second clause set adds an optional restriction for the password: when its "min_len" clause is not satisfied, only a warning is issued instead of making the data fail validation.

  • err_msg[.LANGCODE]

    This tells the compiler that instead of the default error message from the type handler, a custom error message is supplied. You can add translations by adding more attributes. For example:

     [str=>{match                  => qr/\A[A-Za-z0-9_-]*\z/,
            'match.err_msg'        => 'Must not contain naughty characters',
            'match.err_msg.id_ID'  => 'Tidak boleh mengandung karakter aneh-aneh',
     }]

  • comment

    This is ignored during validation.

  • human[.LANGCODE]

    This is also ignored when validating data, but will be used by the human compiler to supply description. You can add translations by adding more attributes.

     [str=>{match               => qr/\A[A-Za-z0-9_-]*\z/,
            'match.human'       => 'Must not contain naughty characters',
            'match.human.id_ID' => 'Tidak boleh mengandung karakter aneh-aneh',
     }]

  • alt

    This attribute is used to store an alternative clause value. For example:

     alt.lang.<LANGCODE>

    stores the alternative value for a different language. This applies to clauses that contain translatable text, such as name, summary, and description.

  • min_ok => N

    This attribute specifies the minimum number of checks that must succeed for the clause to be considered a success. By default it is not defined. You can use this attribute to require only a certain number of checks (instead of all of them) to pass.

    Example:

     [str => {cset => {min_len=>8, match=>qr/\W/}, 'cset.min_ok'=>1}]

    The above schema requires a string to be at least 8 characters long or to contain a non-word character. Strings that would validate include: abcdefgh or $ or $abcdefg. Strings that would not validate include: abcd (fails both the min_len and match clauses). Without the .min_ok attribute, all checks in the cset clause must pass by default.

    Another example:

     [str => {'match.vals'=>[RE1, RE2, RE3], 'match.min_ok'=>2}]

    The above schema specifies that the string must match at least two of RE1/RE2/RE3.

    See also: .max_ok, .min_nok, .max_nok.

  • max_ok => N

    This attribute specifies the maximum number of checks that may succeed for the clause to be considered a success. By default it is not defined. You can use this attribute to require a number of failures in the checks.

    Example:

     [str => {cset=>{min_len=>8, match=>qr/\W/}, 'cset.min_ok'=>1, 'cset.max_ok'=>1}]

    The above schema states that the string must either be at least 8 characters long or contain a non-word character, but not both. Strings that would validate include: abcdefgh or $. Strings that would not validate include: $abcdefg (passes both checks, so max_ok is not satisfied).

    Another example:

     [str => {'match.vals'=>[RE1, RE2, RE3], 'match.max_ok'=>1}]

    The above schema specifies that string must not match more than one of RE1/RE2/RE3.

    See also: .min_ok, .min_nok, .max_nok.

  • min_nok => N

    This attribute specifies the required minimum number of checks that fail in order for the clause to be considered a success. By default this is not defined. You can use this attribute to require a certain number of failures.

    Example:

     [str => {cset=>{min_len=>8, match=>qr/\W/}, 'cset.min_nok'=>1}]

    The above schema requires a string to be shorter than 8 characters or devoid of non-word characters. Strings that would validate include: abcdefghi (fails the match clause), $abcd (fails min_len clause), or a (fails both clauses). Strings that would not validate include: $abcdefg.

    Another example:

     [str => {'match.vals'=>[RE1, RE2, RE3], 'match.min_nok'=>1}]

    The above schema specifies that string must fail at least one regex match.

    See also: .min_ok, .max_ok, .max_nok.

  • max_nok => N

    This attribute specifies the maximum number of checks that may fail for the clause to be considered a success. By default it is not defined (but when none of the {min,max}_{ok,nok} attributes is defined, the default behavior is to require all checks to succeed, in other words, as if max_nok were 0). You can use this attribute to allow a certain number of failures in the checks.

    Example:

     [str => {cset=>{min_len=>8, match=>qr/\W/}, 'cset.max_nok'=>1}]

    The above schema states that the string must either be at least 8 characters long or contain a non-word character, or both. Strings that would validate include: abcdefgh, $$, $abcdefgh. Strings that would not validate include: abcd (fails both the min_len and match clauses).

    Another example:

     [str => {'match.vals'=>[RE1, RE2, RE3], 'match.max_nok'=>1}]

    The above schema specifies that string can fail at most one regex match.

    See also: .min_ok, .max_ok, .min_nok.
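The interaction of min_ok, max_ok, min_nok, and max_nok can be summarized in a short Python sketch (for illustration only). The function clause_succeeds and the helper cset_checks are hypothetical names; the helper reproduces the two checks used in the cset examples above (min_len=>8 and a match against a non-word character).

```python
import re


def clause_succeeds(results, min_ok=None, max_ok=None,
                    min_nok=None, max_nok=None):
    """Decide clause success from a list of per-check booleans."""
    ok = sum(1 for r in results if r)
    nok = len(results) - ok
    if min_ok is None and max_ok is None and min_nok is None and max_nok is None:
        # Default: every check must pass, as if max_nok were 0.
        return nok == 0
    return ((min_ok is None or ok >= min_ok) and
            (max_ok is None or ok <= max_ok) and
            (min_nok is None or nok >= min_nok) and
            (max_nok is None or nok <= max_nok))


def cset_checks(s):
    # The two checks from the examples: at least 8 characters long,
    # and contains a non-word character.
    return [len(s) >= 8, re.search(r"\W", s) is not None]
```

For instance, clause_succeeds(cset_checks("$abcdefg"), min_ok=1) is true, while clause_succeeds(cset_checks("$abcdefg"), min_ok=1, max_ok=1) is false because both checks pass.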

A clause attribute can store a literal value, or an expression. To store an expression, append the = character after the attribute name. See "EXPRESSION" for more details on expression.

Example:

 default => "blah"
 "default=" => 'int(10*rand())+1'

Special-purpose clauses

Normally clauses serve as a type constraint (e.g. for type string, the "min_len" and "max_len" clauses restrict how short/long the string can be). However there are also some clauses that are special.

The '' (empty clause)

This can be used to store attributes that can be used generally by other clauses ("general attributes"). For example:

 [str => {not_match => qr/(password|abcd)$/,
          min_len => 4,
          ".err_msg" => "Password not good enough!"}]

When validation fails for one or more of the clauses, the custom error message will be used instead.
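How a general attribute might be looked up can be sketched as follows, in Python for illustration only. The function error_message is a hypothetical name, and the precedence shown (a clause-specific err_msg first, then the general ".err_msg" of the empty clause, then a built-in default) is an assumption rather than something this specification states.

```python
def error_message(clause_set, clause, default):
    """Pick the message to report when the given clause fails."""
    # Assumed precedence: clause-specific message, then the general one.
    for key in (clause + ".err_msg", ".err_msg"):
        if key in clause_set:
            return clause_set[key]
    return default


clause_set = {
    "not_match": r"(password|abcd)$",
    "min_len": 4,
    ".err_msg": "Password not good enough!",
}
message = error_message(clause_set, "min_len", "Too short")
```

Here message is the general "Password not good enough!" text, because no min_len.err_msg attribute is present.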

Clause set merging

Given several clause sets in the schema like:

 [TYPE, CS1 + CS2 + CS3]

all CS1, CS2, and CS3 will be evaluated in that order:

 eval(CS1)
 eval(CS2)
 eval(CS3)

However, if a clause set hash contains one or more keys with merge prefix (explained later), the clause set will first be merged with the previous clause set(s) prior to evaluation. For example, if CS2 keys contain merge prefixes (notation: "*" indicates the presence of merge prefix):

 [TYPE, CS1 + *CS2 + CS3]

then CS1 will be merged with CS2 before being evaluated (notation: "~" signifies merging):

 eval(CS1 ~ *CS2)
 eval(CS3)

If CS3 instead contains merge prefixes:

 [TYPE, CS1 + CS2 + *CS3]

then CS1 will be evaluated, and CS2 will be merged with CS3 before being evaluated:

 eval(CS1)
 eval(CS2 ~ *CS3)

If CS2 as well as CS3 contains merge prefixes:

 [TYPE, CS1 + *CS2 + *CS3]

then the three will be merged first before evaluating:

 eval(CS1 ~ *CS2 ~ *CS3)

So in short, unless the right hand side is devoid of merge prefixes, merging will be done first from left to right.

The first clause set should not contain any merge prefixes.
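The grouping rule above (a clause set whose keys carry a merge prefix is merged into the clause set before it) can be sketched in Python, for illustration only. The names group_clause_sets and has_merge_prefix are hypothetical; the prefix pattern covers the '[merge]', '[merge+]', '[merge-]', '[merge.]', '[merge!]' and '[merge^]' forms used in this document.

```python
import re

MERGE_PREFIX_RE = re.compile(r"^\[merge[+\-.!^]?\]")


def has_merge_prefix(clause_set):
    """True if any key of the clause set starts with a merge prefix."""
    return any(MERGE_PREFIX_RE.match(key) for key in clause_set)


def group_clause_sets(clause_sets):
    """Group clause sets; each group is merged first, then evaluated."""
    groups = []
    for index, clause_set in enumerate(clause_sets):
        if has_merge_prefix(clause_set):
            if index == 0:
                raise ValueError(
                    "the first clause set must not contain merge prefixes")
            groups[-1].append(clause_set)  # merge into the previous group
        else:
            groups.append([clause_set])    # start a new group
    return groups


cs1 = {"min": 1}
cs2 = {"[merge]min": 2}
cs3 = {"max": 9}
```

With these values, group_clause_sets([cs1, cs2, cs3]) yields two groups, [cs1, cs2] and [cs3], matching eval(CS1 ~ *CS2) followed by eval(CS3).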

Sah uses Data::ModeMerge to do the merging, with merge prefixes changed to '[merge+]', '[merge!]' and so on. In merging, Data::ModeMerge allows keys on the right side hash not only to replace but also add, subtract, remove keys from the left side. This is powerful because it allows schema definition to not only add clauses (restrict types even more), but also replace clauses (change type restriction) as well as delete clauses (relax type restriction). For more information, refer to the Data::ModeMerge documentation.

Examples:

 [int => {div_by=>2} + {  div_by =>3}] # must be divisible by 2 & 3

 [int => {div_by=>2} + {'[merge]div_by'=>3}] # will be merged and become:
 [int => {div_by=>3}                       ] # must be divisible by 3 ONLY

 [int => {div_by=>2} + {'[merge!]div_by'=>0}] # will be merged and become:
 [int => {}                                 ] # need not be divisible by any

 [int => {in=>[1,2,3,4,5]} + {  in =>[6]}] # impossible to satisfy

 [int => {in=>[1,2,3,4,5]} + {'[merge+]in'=>[6]}] # will be merged and become:
 [int => {in=>[1,2,3,4,5,6]}                    ]

 [int => {in=>[1,2,3,4,5]} + {'[merge-]in'=>[4]}] # will be merged and become:
 [int => {in=>[1,2,3,  5]}                      ]
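The merge results in these examples can be reproduced with a deliberately simplified stand-in for the merging step, written in Python for illustration only. This is not the Data::ModeMerge API: merge_clause_sets is a hypothetical function that handles only top-level keys and only the '[merge]' (replace), '[merge!]' (delete), '[merge+]' (add) and '[merge-]' (subtract) prefixes, with add and subtract limited to list values.

```python
import re

PREFIX_RE = re.compile(r"^\[merge([+\-!]?)\](.*)$")


def merge_clause_sets(left, right):
    """Merge right into left according to the merge prefix of each key."""
    result = dict(left)
    for key, value in right.items():
        match = PREFIX_RE.match(key)
        mode, name = (match.group(1), match.group(2)) if match else ("", key)
        if mode == "!":
            result.pop(name, None)                       # delete the clause
        elif mode == "+":
            result[name] = result.get(name, []) + value  # append list items
        elif mode == "-":
            result[name] = [x for x in result.get(name, [])
                            if x not in value]           # remove list items
        else:
            result[name] = value                         # replace the clause
    return result
```

For example, merge_clause_sets({"in": [1, 2, 3, 4, 5]}, {"[merge+]in": [6]}) gives {"in": [1, 2, 3, 4, 5, 6]}, and merging {"[merge!]div_by": 0} into {"div_by": 2} gives an empty clause set.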

Note that before performing merging, schemas will be normalized first. To avoid confusion, shortcut syntaxes are not allowed to have merge prefixes, e.g.:

 # error, not allowed
 '[merge]!div_by' => 2,

 # ok using long form
 '[merge]div_by' => 2,
 'div_by.max_ok' => 0,

Merging and hash keys

XXX: Does merging of clause sets need to be done recursively?

Because Data::ModeMerge merges clause sets recursively, if your clause set keys contain a merge prefix you need to check all hashes (not just the top-level clause set hash, but also nested hashes) for accidental merge prefixes:

 [merge]
 [merge+]
 [merge-]
 [merge.]
 [merge!]
 [merge^]

because they will also be merged and removed after merging.

Example:

 [hash => {       "keys_regex" => { foo=>[int=>{max=>7}],
                                    "[merge^]bar"=>"str" }}
       +  {"[merge]keys_regex" => { foo=>[int=>{max=>3}]
                                                          }}]

will be merged to:

 [hash => {       "keys_regex" => { foo => [int=>{max=>7}],
                                    bar => "str"          }}]

that is, foo will also get merged. Sometimes this is what you want, but sometimes it is not. In the latter case, you can turn off prefix parsing:

 [hash => {       "keys_regex" => { foo => [int=>{max=>7}],
                                   "[merge^]bar" => "str",
                 ""=>{parse_prefix=>0} }}
        + {"[merge]keys_regex" => { "[merge-]foo" => [int=>{max=>4}]}}]

it will become:

 [hash => { "keys_regex" => { foo => [int=>{max=>7}],
                              "[merge-]foo" => [int=>{max=>4}],
                              "[merge^]bar" => "str" } }]

Please also note that the empty string key ("") is treated specially by Data::ModeMerge: it is the options key, which regulates how merging is done. Be careful not to use an empty string as one of your own keys.

EXPRESSION

XXX: Syntax of variables not yet fixed.

Sah supports expressions, using the Language::Expr minilanguage. See Language::Expr::Manual::Syntax for details on the syntax. You can specify an expression in the check clause, e.g.:

 [int => {check => '$_ >= 4'}]

Alternatively, an expression can be specified in any clause attribute:

 [int => {'min='     => '2+2'}]
 [int => {'min.val=' => 'floor(4.9)'}]

The above three schemas are equivalent to:

 [int => {min => 4}]

An expression can refer to elements of the data and of the (normalized) schema, and can call functions, enabling more complex schemas to be defined. For example:

 ['array*' => {len=>2, elems => [
   ['str*', {match => '^\w+$'}],
   ['str*', {'match=' => '${../../0/clause_sets/0/match}',
             'min_len=' => '2*length(${data:../0})'}]
 ]}]

The above schema requires the data to be a two-element array of strings, where the second string is at least twice as long as the first. Both strings have to comply with the same regex, qr/^\w+$/ (which is declared in the first string's clause set and referred to in the second string's).

FUNCTION

Functions can be used in expressions. The syntax for calling a function is:

 func()
 func(ARG, ...)

Functions in Sah can sometimes accept several types of arguments, e.g. length(ARRAY) will return the number of elements in the ARRAY, while length(STR) will return the number of characters in the string. However, when an inappropriate argument is given, a Perl exception will be thrown.

HISTORY

To be written.

2012-07-21 split specification

2011-11-23 Data::Sah

2009-03-30 - Data::Schema (first public release)

previous: Schema-nested (internal)

SEE ALSO

Sah::Type

AUTHOR

Steven Haryanto <stevenharyanto@gmail.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2012 by Steven Haryanto.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.