NAME

YATT::Lite::XHF::Syntax - Extended Header Fields (XHF) format.

SYNOPSIS

  require YATT::Lite::XHF;
  use Data::Dumper;

  my $parser = YATT::Lite::XHF->new(string => <<'END');
  # Taken from http://docs.ansible.com/YAMLSyntax.html#yaml-basics
  name: Example Developer
  job: Developer
  skill: Elite
  employed: 1
  foods[
  - Apple
  - Orange
  - Strawberry
  - Mango
  ]
  languages{
  ruby: Elite
  python: Elite
  dotnet: Lame
  }

  name: hkoba
  languages{
  yatt: Elite?
  }
  END
  
  # read() returns the parsed result of one paragraph per call.
  # Paragraphs are separated by \n\n+.
  # In list context, you get a flattened list of the items in the paragraph
  # (usually key-value pairs, but other item types can appear too).
  # In scalar context, you get a hash reference.
  while (my %hash = $parser->read) {
    print Dumper(\%hash), "\n";
  }

DESCRIPTION

Extended Header Fields (XHF) format, which I'm defining here, is a data format based on Email headers (and HTTP headers), extended to carry nested data structures. To load XHF files or strings, use YATT::Lite::XHF.

Note: Although there is a serializer for XHF (YATT::Lite::XHF::Dumper), XHF is specifically designed to help programmers write test data for unit tests. It is not meant to be a perfect serializer for Perl (e.g. XHF doesn't support self-referencing data structures, and supporting them is not a design goal). If you need such a full-featured serializer, use the YAML family, Storable or similar instead.

Minimum escaping for value of name-value pairs

For the simplest cases, YAML and XHF look fairly similar. For example, the hash structure {foo => 1, bar => 2} can be written the same way in both YAML and XHF:

  foo: 1
  bar: 2

However, if you serialize the structure {x => [1, 2, "3, 4"], y => 5}, you will notice significant differences.

In XHF, it is written as:

  {
  x[
  - 1
  - 2
  - 3, 4
  ]
  y: 5
  }

In YAML, in contrast, the same structure is written as:

  ---
  x:
    - 1
    - 2
    - '3, 4'
  y: 5

The differences are:

  • XHF uses brackets ({} and []). YAML uses indentation.

  • XHF can represent 3, 4 as is. YAML needs to escape it like '3, 4'.

Multi-line text value and verbatim text

In XHF, the only things you need to escape in a value-part are \n and, if you need to preserve them, leading/trailing SPACE and TAB. In other words, the value-part has no syntax of its own, so you don't need to worry about which characters must be escaped.

How to escape newlines in the middle.

Just replace every "\n" with "\n " (a newline followed by a space), as in s/\n/\n /g.

e.g. { foo => "1\n2\n\n3", bar => 4 } can be written as:

   foo: 1
    2
    
    3
   bar: 4
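
The substitution above can be tried in plain Perl. The following is a minimal sketch that does not use YATT::Lite::XHF; escape_newlines and unescape_newlines are hypothetical helper names introduced only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers, for illustration only (not the YATT::Lite::XHF API).

# Put one space after every newline so the next line reads as a continuation.
sub escape_newlines {
  my ($value) = @_;
  (my $escaped = $value) =~ s/\n/\n /g;
  return $escaped;
}

# Inverse: drop the single space or tab that follows each newline.
sub unescape_newlines {
  my ($text) = @_;
  (my $value = $text) =~ s/\n[ \t]/\n/g;
  return $value;
}

my $value = "1\n2\n\n3";
my $item  = "foo: " . escape_newlines($value) . "\n";
print $item;
```

Printing $item reproduces the foo entry of the example above, and unescape_newlines() turns the escaped text back into the original value.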

How to escape leading/trailing spaces/tabs/newlines.

Just start the value with ":\n" and follow the same escaping rule for "\n".

e.g. { foo => " x ", bar => "\n\ny\n\n" } can be written as:

  foo:
    x  
  bar:
   
   
   y
   
   

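One way to read the two rules together is that the verbatim form escapes "\n" followed by the value, so that the newline right after ":" is treated like any other newline. The sketch below assumes that reading, which is inferred from the examples above and not taken from the YATT::Lite::XHF source. format_field is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper (not the YATT::Lite::XHF API): format one name-value
# pair, choosing the verbatim form only when trimming would lose data.
sub format_field {
  my ($name, $value) = @_;
  # A leading or trailing space, tab or newline requires the ":\n" form.
  my $verbatim = $value =~ /\A[ \t\n]|[ \t\n]\z/;
  my $payload  = $verbatim ? "\n$value" : " $value";
  $payload =~ s/\n/\n /g;   # same rule as for newlines in the middle
  return "$name:$payload\n";
}

print format_field(foo => " x ");
print format_field(bar => "\n\ny\n\n");
```

Values without leading or trailing whitespace fall back to the ordinary "name: value" form, so hand-written files stay readable.
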
Name-value pair can be written as two items separately

In contrast to the value-part, the name-part has syntax restrictions. The name-part of XHF can contain only [[:alnum:]], "-", ".", "/" and a few additional characters (see the field-name definition in "BNF"). However, a name-value pair can always be written as two consecutive - items instead. So whenever you are not sure whether a character is allowed in a name, you can use the - notation and escape only \n.

   # For example, following block:

   foo: 1
   bar: 2

   # can be written as following:

   - foo
   - 1
   - bar
   - 2

e.g. { "foo bar" => "baz" } can be written as:

  {
  - foo bar
  - baz
  }

And { "\n foo\nbar \n" => "baz" } can be written as:

  {
  -
  
     foo
   bar
  
  - baz
  }
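
The choice between the two notations can be automated. Here is a minimal sketch, assuming the NAME character class from the "BNF" section and ignoring field-subscript and leading/trailing whitespace. format_pair is a hypothetical helper, not part of YATT::Lite::XHF.

```perl
use strict;
use warnings;

# Characters allowed in a name, following NAME in the BNF (an assumption
# of this sketch; "-" is escaped so that it is taken literally).
my $NAME = qr{[0-9A-Za-z_.\-/~!]};

# Hypothetical helper: emit "name: value" when the name is safe,
# otherwise fall back to two "- " items.
sub format_pair {
  my ($name, $value) = @_;
  s/\n/\n /g for $name, $value;        # escape newlines in the local copies
  return "$name: $value\n" if $name =~ /\A$NAME+\z/;
  return "- $name\n- $value\n";
}

print format_pair(foo => 1);
print format_pair("foo bar" => "baz");
```

The first call prints the foo: 1 form, while the second falls back to the two-item form because of the space in the key.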

The same applies to nested elements.

   foo{
   x: 1
   y: 2
   }
   baz[
   - z
   ]

   # can be written instead as following:

   - foo
   {
   x: 1
   y: 2
   }
   - baz
   [
   - z
   ]

   # or even like following:

   - foo
   {
   - x
   - 1
   - y
   - 2
   }
   - baz
   [
   - z
   ]

You can also use the key: value notation inside arrays, like the following:

  [
  foo: 1
  bar: 2
  ]

  # above is equal to following

  [
  - foo
  - 1
  - bar
  - 2
  ]

Container Agnostic List

Another important difference (which you might have noticed in the previous examples) is how the container type (array or dict) is selected. In XHF, the name-value separator determines the type of the value, not the type of the surrounding container.

In XHF, the following block

  foo: 1
  bar: 2

just represents ( foo => 1, bar => 2 ), which is a flattened list of 4 items. By itself, this does not determine the type of the surrounding container. You choose the outermost container type yourself, like

  my %dict = $parser->read;

or

  my @array = $parser->read;

When you call read() in scalar context, you get a dictionary (or an error if the block has an odd number of items).

  my $dict = $parser->read;

In YAML, in contrast, : always means a map (dictionary), so the above is always +{ foo => 1, bar => 2 }.
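
The same idea can be shown in plain Perl, without the parser. In this sketch @items stands in for the list returned by read() in list context, and items_to_dict is a hypothetical helper that mirrors the scalar-context rule described above.

```perl
use strict;
use warnings;

# The flat list that the two-line block stands for.
my @items = (foo => 1, bar => 2);

my @array = @items;      # keep it as a 4-element list
my %dict  = @items;      # or pair the items up into a hash

# Hypothetical helper mirroring the scalar-context rule:
# an odd number of items cannot form a dictionary.
sub items_to_dict {
  my @list = @_;
  die "odd number of items\n" if @list % 2;
  return +{ @list };
}

my $dict = items_to_dict(@items);
printf "%d items, %d keys\n", scalar(@array), scalar(keys %$dict);
```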

Ordered kv-pair list with key duplicates (limited)

Since the outermost xhf-block is a flattened list, you can use XHF to write an ordered list of key-value pairs with duplicate keys, like the following:

  foo: 1
  foo: 2
  foo: 3
  bar: x
  bar: y

If you read the above with

  my @array = $parser->read;

you get exactly @array == (foo => 1, foo => 2, foo => 3, bar => 'x', bar => 'y').

This is important for some kinds of test data (e.g. HTTP query parameters and some Email header fields such as "Received"). For example, the above is equivalent to valid output from the following HTML form:

  <input type="checkbox" name="foo" value="1">
  <input type="checkbox" name="foo" value="2">
  <input type="checkbox" name="foo" value="3">
  <input type="checkbox" name="bar" value="x">
  <input type="checkbox" name="bar" value="y">

Note: currently, nested elements are deserialized as ordinary Perl hashes and arrays, so order and duplicate keys are preserved only in the outermost list.
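
The following plain-Perl sketch shows why the list form matters. @pairs stands in for the list that read() returns in list context. The query-string building is only an illustration and does no URL-escaping.

```perl
use strict;
use warnings;

my @pairs = (foo => 1, foo => 2, foo => 3, bar => 'x', bar => 'y');

# Walking the list two items at a time keeps order and duplicates.
my @params;
for (my $i = 0; $i < @pairs; $i += 2) {
  push @params, "$pairs[$i]=$pairs[$i + 1]";
}
my $query = join '&', @params;   # no URL-escaping in this sketch
print "$query\n";

# A hash keeps only the last value seen for each key.
my %dict = @pairs;
print join(',', map { "$_=$dict{$_}" } sort keys %dict), "\n";
```

The first line printed keeps all five pairs in order, while the hash collapses them to one value per key.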

Paragraph based block stream (with comment skipping)

An XHF input stream is delimited by one or more consecutive empty lines "\n\n+" (like Email and HTTP headers), so it works well with the traditional "paragraph mode" multi-line record format. For more about paragraph mode, see perl -00 in perlrun and the description of setting $RS ($/) to "" in perlvar.

Note: in XHF, "comment-only" blocks are skipped silently. For example:

  foo: 1
  bar: 2

  # Hey, here is a comment only block!


  baz: 3
  qux: 4

Then this script:

  my @records;
  push @records, $_ while $_ = $parser->read;

results in @records == ({foo => 1, bar => 2}, {baz => 3, qux => 4}).
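
The paragraph splitting itself is ordinary Perl paragraph mode. The sketch below does not use YATT::Lite::XHF and does not parse the items. It only shows, under the assumption that a comment line is any line starting with "#", how the stream above divides into paragraphs and how a comment-only paragraph can be dropped.

```perl
use strict;
use warnings;

my $text = <<'END';
foo: 1
bar: 2

# Hey, here is a comment only block!


baz: 3
qux: 4
END

open my $fh, '<', \$text or die "cannot open in-memory file: $!";

my @paragraphs;
{
  local $/ = "";                 # paragraph mode: split on blank line(s)
  while (my $para = <$fh>) {
    chomp $para;                 # drops all trailing newlines in this mode
    # Skip paragraphs in which every line is a comment.
    next unless grep { !/^#/ } split /\n/, $para;
    push @paragraphs, $para;
  }
}
printf "%d paragraphs\n", scalar @paragraphs;
```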

How to put metainfo as optional (comment-only) record

In rare cases, you may want to prepend an optional meta record to a single stream. If you really want to do this, you can use a "comment only" block to represent an empty record and read it with read(skip_comment => 0), like the following:

  # This is metainfo. To put test => 1, please remove leading "# " below:
  # test: 1


  # This is body1
  foo: 1
  bar: 2

  # This is body2
  foo: 3
  bar: 4

Then

  if (my @meta = $parser->read(skip_comment => 0)) {
    # process metainfo. You may get (test => 1).
  }
  while (my @content = $parser->read) {
    # process body1, body2, ...
  }

COMPLEX EXAMPLE, compared with YAML

Here is a denser example in XHF:

  name: hkoba
  # (1) You can write a comment line here, starting with '#'.
  job: Programming Language Designer (self-described;-)
  skill: Random
  employed: 0
  foods[
  - Sushi
  #(2) here too. You don't need space after '#'. This will be good for '#!'
  - Tonkatsu
  - Curry and Rice
  [
  - More nested element
  ]
  ]
  favorites[
  # (3) here also.
  {
  title: Chaika - The Coffin Princess
  # (4) ditto.
  heroine: Chaika Trabant
  }
  {
  title: Witch Craft Works
  heroine: Ayaka Kagari
  # (5) You can use leading "-" for hash key/value too (so that include any chars)
  - Witch, Witch!
  - Tower and Workshop!
  }
  # (6) You can put NULL(undef) like below. (equal space sharp+keyword)
  = #null
  ]

The above is loaded as the following structure:

  $VAR1 = {
          'foods' => [
                     'Sushi',
                     'Tonkatsu',
                     'Curry and Rice',
                     [
                       'More nested element'
                     ]
                   ],
          'job' => 'Programming Language Designer (self-described;-)',
          'name' => 'hkoba',
          'employed' => '0',
          'skill' => 'Random',
          'favorites' => [
                         {
                           'heroine' => 'Chaika Trabant',
                           'title' => 'Chaika - The Coffin Princess'
                         },
                         {
                           'title' => 'Witch Craft Works',
                           'heroine' => 'Ayaka Kagari',
                           'Witch, Witch!' => 'Tower and Workshop!'
                         },
                         undef
                       ]
        };

The above is written in YAML as below (note: inline comments are omitted):

  ---
  employed: 0
  favorites:
    - heroine: Chaika Trabant
      title: 'Chaika - The Coffin Princess'
    - 'Witch, Witch!': Tower and Workshop!
      heroine: Ayaka Kagari
      title: Witch Craft Works
    - ~
  foods:
    - Sushi
    - Tonkatsu
    - Curry and Rice
    -
      - More nested element
  job: Programming Language Designer (self-described;-)
  name: hkoba
  skill: Random

This YAML example clearly shows how unpredictably you need to quote strings; e.g. see the value of $VAR1->{favorites}[0]{title} above. The key of $VAR1->{favorites}[1]{'Witch, Witch!'} is a nightmare, too.

I don't want to be bothered by this kind of escaping. That's why I made XHF.

FORMAT SPECIFICATION

XHF is parsed one paragraph at a time. Each paragraph can contain a set of xhf-items. Every xhf-item starts on a fresh line, ends with a newline, and basically takes one of the following forms:

  <name> <type-sigil> <sep> <value>         (name-value pair)

  <type-sigil> <sep> <value>                (standalone value)

type-sigil defines the type of the value. sep is usually one of the whitespace characters space, tab or newline (newline is used for verbatim text). For block items (dict/array), however, only newline is allowed.

Here are all the kinds of type-sigil:

"name:" then " " or "\n"

":" is for ordinary text with a name. MUST be prefixed by a name. sep can be any of WS.

"-" then " " or "\n"

"-" is for ordinary text without a name. CANNOT be prefixed by a name.

(Note: Currently, "," works the same as "-". This feature is debatable.)

"{" then "\n"
"name{" then "\n"

"{" is for dictionary block ( { %HASH } container). Can be prefixed by name.

MUST be closed by "}\n". Number of elements MUST be even.

"[" then "\n"
"name[" then "\n"

"[" is for array block. ( [ @ARRAY ] container). Can be prefixed by name.

MUST be closed by "]\n"

"=" then " " or "\n"
"name=" then " " or "\n"

"=" is for special values. Can be prefixed by name.

Currently only #undef and its synonym #null are defined.

"#"

"#" is for embedded comment line. CANNOT be prefixed by name.

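The sigil rules listed above can be summarized as a small classifier. This is only a sketch of the rules as described in this section, not the parser used by YATT::Lite::XHF. classify_item is a hypothetical helper, and closing lines such as "}" and "]" as well as continuation lines are not handled.

```perl
use strict;
use warnings;

# NAME follows the BNF of this document, with "-" escaped so that it is
# taken literally (an assumption of this sketch).
my $NAME = qr{[0-9A-Za-z_.\-/~!]};
my %KIND = (':' => 'text', '-' => 'text', ',' => 'text',
            '{' => 'dict', '[' => 'array', '=' => 'special');

# Hypothetical helper: classify the first line of an item (given without
# its trailing newline). Returns (kind, name), or an empty list when the
# line does not start a valid item.
sub classify_item {
  my ($line) = @_;
  return ('comment', '') if $line =~ /\A#/;
  my ($name, $sigil, $sep) =
    $line =~ /\A((?:${NAME}+(?:\[${NAME}*\])*)?)([:\-,\{\[=])([ \t]|\z)/
    or return;
  return if $sigil eq ':' and $name eq '';          # ":" requires a name
  return if $sigil =~ /[-,]/ and $name ne '';       # "-" and "," forbid one
  return if $sigil =~ /[\{\[]/ and $sep ne '';      # blocks: newline only
  return ($KIND{$sigil}, $name);
}

for my $line ('foo: 1', '- x', 'foo{', '[', '= #null', '# note') {
  my ($kind) = classify_item($line);
  print "$line => $kind\n";
}
```

An empty $sep means the sigil was the last character of the line, which corresponds to the "\n" separator in the list above.
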
XHF Syntax definition in extended BNF

Here is a syntax definition of XHF in extended BNF (roughly following ABNF.)

  xhf-block       = 1*xhf-item

  xhf-item        = field-pair / single-text
                   / dict-block / array-block / special-expr
                   / comment

  field-pair      = field-name  field-value

  field-name      = 1*NAME *field-subscript

  field-subscript = "[" *NAME "]"

  field-value     = ":" text-payload / dict-block / array-block / special-expr

  text-payload    = ( trimmed-text / verbatim-text ) NL

  trimmed-text    = SPTAB *( 1*NON-NL / NL SPTAB )

  verbatim-text   = NL    *( 1*NON-NL / NL SPTAB )

  single-text     = "-" text-payload

  dict-block      = "{" NL *xhf-item "}" NL

  array-block     = "[" NL *xhf-item "]" NL

  special-expr    = "=" SPTAB known-specials NL

  known-specials  = "#" ("null" / "undef")

  comment         = "#" *NON-NL NL

  NL              = [\n]
  NON-NL          = [^\n]
  SPTAB           = [\ \t]
  NAME            = [0-9A-Za-z_.\-/~!]

Some notes on current definition

field-name, field-subscript

field-name can contain /, ., ~ and !. The former two are for file names (path separator and extension separator). The latter two (and field-subscript) are incorporated just to help write test input/output data for YATT::Lite, so they are debatable for general use.

trimmed-text vs verbatim-text

If field-name is followed by ": " (a colon and then a space or tab), the leading and trailing spaces/tabs of its field-value are trimmed. This is useful for hand-written configuration files.

But for some software-testing purposes (e.g. a templating engine!), this space-trimming makes it impossible to write exact input/output data.

So, when the separator (sep) is NL, the field-value is not trimmed.
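
The two cases can be modelled as follows. This is a sketch of the rule as described in this manual, not the parser's actual code. decode_text is a hypothetical helper that receives everything after the ":" or "-" sigil, up to but excluding the terminating newline, and it assumes that the newline right after the sigil is unfolded like any other newline.

```perl
use strict;
use warnings;

# Hypothetical helper (not the YATT::Lite::XHF API).
sub decode_text {
  my ($payload) = @_;
  (my $value = $payload) =~ s/\n[ \t]/\n/g;   # undo line continuation
  if ($value =~ s/\A\n//) {
    return $value;                            # verbatim: keep as written
  }
  $value =~ s/\A[ \t]+|[ \t]+\z//g;           # trimmed: strip spaces/tabs
  return $value;
}

print "[", decode_text(" hello  "), "]\n";
print "[", decode_text("\n  x "), "]\n";
```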

LF vs CRLF

Currently, I'm not strict enough to reject the use of CRLF. This ambiguity may harm the use of XHF as a serialization format, however.

"," can be used in place of "-".

This feature may also be debatable for general use.

":" without name was valid, but is now deprecated.

Previously valid

  : bar

which represents ( "" => "bar" ), is now invalid. Please use two "- " items instead, like the following:

  - 
  - bar

XXX: Hmm, should I provide a deprecation cycle? Has anyone already used XHF to serialize important data, even before this manual existed? If so, please contact me. I will add an option to allow this.

line-continuation is valid.

Although line-continuation is obsolete in HTTP headers, it will be kept valid in the XHF spec. This is my preference.

AUTHOR

"KOBAYASI, Hiroaki" <hkoba@cpan.org>

LICENSE

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.