
NAME

Data::Tubes::Plugin::Parser

DESCRIPTION

This module contains factory functions to generate tubes that ease parsing of input records.

Each of the generated tubes has the following contract:

  • the input record MUST be a hash reference;

  • one field in the hash (according to factory argument input, set to raw by default) points to the input text that has to be parsed;

  • one field in the hash (according to factory argument output, set to structured by default) is set to the output of the parsing operation.
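
To make the contract concrete, here is a minimal sketch in plain Perl. The make_parser_tube factory and the toy comma parser are hypothetical stand-ins and not part of this module; only the default field names raw and structured come from the contract above.

```perl
use strict;
use warnings;

# Hypothetical factory honoring the contract described above: the tube
# reads the text from $record->{$input} and stores the parsing result
# in $record->{$output}.
sub make_parser_tube {
   my ($parser, %args) = @_;
   my $input  = defined $args{input}  ? $args{input}  : 'raw';
   my $output = defined $args{output} ? $args{output} : 'structured';
   return sub {
      my $record = shift;    # MUST be a hash reference
      $record->{$output} = $parser->($record->{$input});
      return $record;
   };
}

# a toy "parser" that splits on commas
my $tube   = make_parser_tube(sub { [split /,/, $_[0]] });
my $record = $tube->({raw => 'a,b,c'});
print join('|', @{$record->{structured}}), "\n";
```

The input text is left untouched in raw, so later tubes can still access it.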

The factory functions below have two names, one starting with parse_ and the other without this prefix. The two are perfectly equivalent; the short version can be handier, e.g. when using tube or pipeline from Data::Tubes.

FUNCTIONS

by_format

   my $tube = by_format($format, %args); # OR
   my $tube = by_format(%args); # OR
   my $tube = by_format(\%args);

parse the input text according to a template format string (passed via factory argument format or through first unnamed parameter $format). This string is supposed to be composed of word and non-word sequences, where each word sequence is assumed to be the name of a field, and each non-word sequence is a separator. Example:

   $format = 'foo;bar;baz';

is interpreted as follows:

   @field_names = ('foo', 'bar', 'baz');
   @separators  = (';', ';');

Example:

   $format = 'foo;bar~~~baz';

is interpreted as follows:

   @field_names = ('foo', 'bar', 'baz');
   @separators  = (';', '~~~');

In the first case, i.e. when all separators are equal to each other, "by_split" will be called, as it is (arguably) slightly more efficient. Otherwise, "by_separators" will be called. Whatever these two factories return will be returned back.
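
The decomposition described above can be sketched in plain Perl as follows. This is only an illustration of the idea, not the module's code; the analyze_format name is hypothetical, and the sketch assumes that the format starts and ends with a field name.

```perl
use strict;
use warnings;

# Sketch only: decompose a format string into field names (word
# sequences) and separators (non-word sequences).
sub analyze_format {
   my ($format) = @_;
   my @field_names = $format =~ m{(\w+)}g;
   my @separators  = $format =~ m{(\W+)}g;
   my %seen = map { $_ => 1 } @separators;
   my $all_equal = keys(%seen) <= 1;    # one distinct separator at most
   return (\@field_names, \@separators, $all_equal);
}

for my $format ('foo;bar;baz', 'foo;bar~~~baz') {
   my ($names, $seps, $all_equal) = analyze_format($format);
   my $factory = $all_equal ? 'by_split' : 'by_separators';
   print "$format: fields (@$names), separators (@$seps) -> $factory\n";
}
```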

All @field_names MUST be different from one another.

The following arguments are supported:

allow_missing

set to the number of missing trailing elements that you are fine to lose, in case the format uses a single separator throughout and "by_split" is used behind the scenes. This allows you to set an optional catchall trailing parameter to collect whatever you are not really interested in, also allowing for its absence.

As an example, consider the following input lines:

   FOO0,BAR0,BAZ0,WHATEVER
   FOO1,BAR1,BAZ1
   FOO2,BAR2,BAZ2,WHAT2,EVER2,

Assuming that you're really interested in the first three parameters, disregarding whatever comes after, you can set the following format:

   foo,bar,baz,rest

and also set allow_missing to 1, indicating that you can sustain the lack of rest (which you really don't care about);
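
A rough sketch of how the three input lines above could be handled, in plain Perl. This is not the module's code: split_with_keys is a hypothetical name, and the sketch assumes that the split is limited to the number of keys, so that the last key collects whatever remains.

```perl
use strict;
use warnings;

# Sketch only, not the module's code. Assumption: the split is limited
# to the number of keys, so the last key collects any remainder.
sub split_with_keys {
   my ($line, $keys, $allow_missing) = @_;
   my @values = split /,/, $line, scalar @$keys;
   die "too few fields in '$line'\n"
     if @values + $allow_missing < @$keys;
   my %record;
   @record{@$keys} = @values;    # missing trailing values become undef
   return \%record;
}

my @keys  = qw< foo bar baz rest >;
my @lines = (
   'FOO0,BAR0,BAZ0,WHATEVER',
   'FOO1,BAR1,BAZ1',
   'FOO2,BAR2,BAZ2,WHAT2,EVER2,',
);
for my $line (@lines) {
   my $r    = split_with_keys($line, \@keys, 1);
   my $rest = defined $r->{rest} ? $r->{rest} : '(undef)';
   print "$r->{foo} $r->{bar} $r->{baz} rest=$rest\n";
}
```

In this sketch a line with only two fields would still be rejected, because two values are missing and only one is allowed.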

format

the format to use for splitting the inputs. This parameter is the main one, so it can also be passed as the first, unnamed parameter (see the first calling convention);

input

name of the input field, defaults to raw;

name

name of the tube, useful for debugging;

output

name of the output field, defaults to structured;

trim

remove leading and trailing whitespace from the extracted values;

value

set how you are going to accept input values, e.g. escaped or quoted. See "by_separators" for details.

by_regex

   my $tube = by_regex($regex, %args); # OR
   my $tube = by_regex(%args); # OR
   my $tube = by_regex(\%args);

parse the input text based on a regular expression, passed as argument regex or $regex as unnamed first parameter. The regular expression is supposed to have named captures, that will eventually be used to populate the rendered output.
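
As an illustration of the named-captures mechanism, the following plain Perl sketch populates the output field from the special hash %+. The regular expression and the field names name and value are hypothetical examples; the sketch does not cover what the generated tube does when the input fails to match.

```perl
use strict;
use warnings;

# Sketch only: named captures end up in %+, which is copied into a
# fresh hash for the output field.
my $regex  = qr{\A(?<name>\w+)\s*=\s*(?<value>.*)\z};
my $record = {raw => 'answer = 42'};

if ($record->{raw} =~ $regex) {
   my %captures = %+;
   $record->{structured} = \%captures;
}
print "$record->{structured}{name} -> $record->{structured}{value}\n";
```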

The following arguments are supported:

input

name of the input field, defaults to raw;

name

name of the tube, useful for debugging;

output

name of the output field, defaults to structured;

regex

the regular expression to use for splitting the inputs. This is the main argument, and can be passed also as the first unnamed one in the argument list.

by_separators

   my $tube = by_separators($separators, %args); # OR
   my $tube = by_separators(%args); # OR
   my $tube = by_separators(\%args);

parse the input according to a series of separators, that will be applied in sequence. For example, if the list of separators is the following:

   @separators = (';', '~~');

the following input:

   $text = 'foo;bar~~/baz/';

will be split as:

   @split = ('foo', 'bar', '/baz/');
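
One possible way to apply separators in sequence is to build a single regular expression out of them, as in the plain Perl sketch below. This is an assumption about the general technique, not a description of how the module does it.

```perl
use strict;
use warnings;

# Sketch only: build one regular expression that captures a lazy value
# before each separator, and whatever is left after the last one.
my @separators = (';', '~~');
my $pattern    = join '', map { '(.*?)' . quotemeta($_) } @separators;
my $regex      = qr{\A$pattern(.*)\z}s;

my $text  = 'foo;bar~~/baz/';
my @split = $text =~ $regex;
print join(', ', @split), "\n";
```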

The following arguments are supported:

input

name of the input field, defaults to raw;

keys

a reference to an array containing the list of keys to be associated to the values from the split;

name

name of the tube, useful for debugging;

output

name of the output field, defaults to structured;

separators

a reference to an array containing the list of separators to be used for splitting the input. This parameter can also be passed as the first, unnamed argument.

Each separator can be:

  • a sub reference, that is invoked once with a reference to the arguments, and must return either of the following forms;

  • a regular expression reference, that will be used as-is at the right place;

  • a plain string, that will be matched verbatim (through a regular expression matching the string after passing it through CORE::quotemeta);
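
The three accepted forms can be reduced to a regular expression with a few lines of plain Perl, as sketched below. The normalize_separator name and the %args content are hypothetical, used here only for illustration.

```perl
use strict;
use warnings;

# Sketch only: turn any of the three accepted forms into a regex.
sub normalize_separator {
   my ($separator, $args) = @_;
   $separator = $separator->($args) if ref($separator) eq 'CODE';
   return $separator if ref($separator) eq 'Regexp';
   my $quoted = quotemeta $separator;    # plain string, matched verbatim
   return qr{$quoted};
}

my %args    = (trim => 1);    # hypothetical factory arguments
my @regexes = map { normalize_separator($_, \%args) }
  ('.', qr{\s*,\s*}, sub { return '|' });

print "dot is literal\n" if 'a.b' =~ $regexes[0] && 'axb' !~ $regexes[0];
```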

trim

remove leading and trailing whitespace from the extracted values. Example:

   @seps  = qw< : ; , >;
   $input = ' what : ever    ;you,do  ';
   @elements = ('what', 'ever', 'you', 'do');

value

this is how you provide a description of what you consider a valid value. It can be multiple things:

  • a sub reference, that is called and MUST provide back one of the following alternatives;

  • a regular expression reference, that is used directly;

  • a plain string, that is turned into an array reference by creating an anonymous array with the string as its only element, then processed as in the following bullet;

  • an array reference with elements inside, that will be described in the following list.

If you end up with an array reference, each element will be put in a big regular expression that is the OR of all elements. Each can be:

  • a regular expression reference, that is fit as-is in the big regular expression;

  • the string specials, that is the same as having put the three strings escaped, single-quoted and double-quoted;

  • the string quoted, that is the same as having put the two strings single-quoted and double-quoted;

  • the string single-quoted (or single_quoted), that allows you to match a string that is delimited by single quotes, with no escaping inside. This is always put at the beginning of the big regular expression (although double-quoted strings may actually be placed before it);

  • the string double-quoted (or double_quoted), that allows you to match a string that is delimited by double quotes, also allowing escaped elements inside (via backslashes). This is always put at the beginning of the big regular expression;

  • the string escaped, that allows you to match a non-greedy sequence of escaped characters (via backslash). If single-quoted is also specified, single quotes need to be escaped too. If double-quoted is also specified, double quotes need to be escaped too. This is always set at the end of the big regular expression (except for whatever, that might appear after it);

  • the string whatever, that allows you to match a non-greedy sequence of characters, i.e. it is a synonym of regular expression (?ms:.*?). If present, it is always set at the end of the big regular expression.

For example, if you want to accept single quoted, double quoted and unquoted strings, you might provide the following:

   [qw< single-quoted double-quoted whatever >]

by_split

   my $tube = by_split(%args); # OR
   my $tube = by_split(\%args); # OR
   my $tube = by_split($separator, %args);

split the input according to a separator string, passed either as the first unnamed parameter $separator or as argument separator.

The following arguments are supported:

allow_missing

set to the number of missing trailing elements that you are fine to lose, in case you also provide keys (see below). This is particularly important when this function is called behind the scenes by "parse_by_format", because that sets keys.

In practice, suppose that you set the following keys:

   [qw< foo bar baz whatever >]

Normal parsing will expect to find at least four elements, so the following input would fail:

   FOO,BAR,BAZ

On the other hand, if you set allow_missing to 1, you are accepting that there might be a missing value for whatever, that will be filled with the undefined value.

input

name of the input field, defaults to raw;

keys

optional reference to an array containing a list of keys to be associated to the split data. If present, it will be used as such; if absent, a reference to an array will be set as output.

name

name of the tube, useful for debugging;

output

name of the output field, defaults to structured;

separator

the separator to be used for CORE::split. If it is a code reference, it is invoked once with the provided arguments to get the separator back. After this, it can be either a regular expression, used as-is, or a string that is passed through CORE::quotemeta before being used;

trim

remove leading and trailing whitespace from the extracted values. As you might expect, if the separator is a colon, the following input:

   $input = ' what : ever    :you:do  ';

would be split into the following elements:

   @elements = ('what', 'ever', 'you', 'do');
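
The effect of trim on the input above can be reproduced with core Perl only, as in this small sketch (an illustration, not the module's code):

```perl
use strict;
use warnings;

# Sketch only: split on a literal separator, then trim each element.
my $separator = quotemeta ':';
my $input     = ' what : ever    :you:do  ';
my @elements  = split /$separator/, $input;
s{\A\s+|\s+\z}{}g for @elements;    # strip leading/trailing whitespace
print join('|', @elements), "\n";
```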

by_value_separator

   $tube = by_value_separator($separator, %args); # OR
   $tube = by_value_separator(%args); # OR
   $tube = by_value_separator(\%args);

parse a sequence of value-and-separator. This is a generalization of "by_split", where you can provide a way to specify what you consider valid values, e.g. to allow for escaping or quoting (hence also allowing having the separator inside your values).

CAVEAT: this function uses the regular expression construct (?{...}) internally. While it is supported as of perl 5.10, this has evolved in time, up to perl 5.18 where it was stabilized. In particular, before perl 5.18 it was not possible to use lexical variables in the construct, so for older perls by_value_separator uses a package variable for collecting values. This should not be a problem, but might be.

Just to make an example, suppose that you are using semicolons as separators. by_value_separator would allow you to take this:

   'some;thing';  what\;ever ; "this;\"goes\";fine"

and turn it into this:

   ['some;thing', 'what;ever', 'this;"goes";fine']
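
The following plain Perl sketch shows one way to obtain that result. It is a deliberately simplified technique that consumes the text from the left, one value at a time; it does not use (?{...}) and is not the module's implementation. The split_values name is hypothetical, and trimming and decoding are hard-wired here instead of being configurable.

```perl
use strict;
use warnings;

# Sketch only: accept single-quoted, double-quoted or backslash-escaped
# values, separated by a literal separator, trimming around each value.
sub split_values {
   my ($text, $separator) = @_;
   my $sep   = quotemeta $separator;
   my $value = qr{'([^']*)'|"((?:[^\\"]|\\.)*)"|((?:[^\\]|\\.)*?)}s;
   my @values;
   while (1) {
      # consume one value and what follows it (a separator or the end)
      $text =~ s{\A\s*$value\s*($sep|\z)}{}
        or die "cannot parse: $text\n";
      my ($single, $double, $plain, $end) = ($1, $2, $3, $4);
      my $v = defined $single ? $single
        : defined $double     ? $double
        :                       $plain;
      # single-quoted values have no escaping, the others are unescaped
      $v =~ s{\\(.)}{$1}gs unless defined $single;
      push @values, $v;
      last unless length $end;    # no separator means end of input
   }
   return \@values;
}

my $input  = q{'some;thing';  what\;ever ; "this;\"goes\";fine"};
my $values = split_values($input, ';');
print "<$_>\n" for @$values;
```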

As noted, it is similar to "by_split"; as a matter of fact, "by_split" might be re-implemented (less efficiently) through by_value_separator. Unless there are bugs, of course. Like "by_split", you can provide a separator parameter (also via the first, unnamed parameter) that can be either a sub reference, a string or a regular expression.

Additionally, you can provide a value parameter that tells what is considered an acceptable input value. A value can be different things (see below), but it boils down to providing regular expressions, indication of pre-canned matching expressions, or a combination.

When you match values, you can then decode them. For example, if you specify that you want to accept double-quoted strings, it makes sense to remove the quotes and un-escape the remaining sequence before using it. Depending on what you pass as a definition for a valid value, your decoding approach might vary. Decoding can happen in two ways: either you provide a decode function that will be applied to each value, or a decode_values that is applied to the whole values array. You might want to choose the latter for improving performance (1 sub call against N).

Normally, an input would be split and an array reference would populate the output field (that is, the field indicated by the output argument). If you would rather get a hash, you can pass keys to use, in order. If this is the case, you can also accept getting more values than you have keys for with allow_surplus, or fewer of them with allow_missing.

Last, you might want to take advantage of trim if your values shouldn't have leading/trailing spaces. Be sure to read the fine print about trimming quoted strings, though.

Accepted arguments are:

allow_missing
allow_surplus

these are integer values that set how many fewer/more values you are willing to admit with respect to the provided keys (see below). Hence, they only work when keys is set.

By default they are set to 0, meaning that you expect to have exactly the same number of values as there are keys. Allowing missing means that you accept getting fewer values than there are keys; the keys left without a value will be associated to undef. Allowing surplus means that you're willing to ditch that number of excess values;
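
The interplay between keys and the two tolerances can be sketched in plain Perl as follows. The associate function is a hypothetical helper written for this example, not something exported by the module.

```perl
use strict;
use warnings;

# Sketch only: pair keys and values, tolerating a bounded mismatch.
sub associate {
   my ($keys, $values, %args) = @_;
   my $allow_missing = $args{allow_missing} || 0;
   my $allow_surplus = $args{allow_surplus} || 0;
   my ($n_keys, $n_values) = (scalar(@$keys), scalar(@$values));
   die "too few values\n"  if $n_values + $allow_missing < $n_keys;
   die "too many values\n" if $n_values - $allow_surplus > $n_keys;
   my %hash;
   @hash{@$keys} = @$values;    # extra values dropped, missing ones undef
   return \%hash;
}

my $h = associate([qw< foo bar >], [1, 2, 3], allow_surplus => 1);
print join(',', map { "$_=$h->{$_}" } sort keys %$h), "\n";
```

With both tolerances left at 0, the same call would die because there is one value too many.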

input

name of the input field, defaults to raw;

keys

an array reference with the keys to be associated (one-by-one, in order) to the extracted values;

name

name of the tube, useful for debugging. Defaults to parse by value and separator;

output

name of the output field, defaults to structured;

separator

the separator to be used between two consecutive valid values. It can be one of the following:

  • a sub reference, that is called with whatever arguments provided (as a hash reference) and MUST return one of the following two alternatives;

  • a regular expression reference, that will be matched for the separator;

  • a plain string, that will be matched verbatim.

There is no default, you MUST provide one either as the first, unnamed parameter or as argument separator;

trim

remove leading and trailing whitespace from the extracted values. This is applied before decoding, which means that leading/trailing whitespace inside quoted strings will be kept. Defaults to a false value, meaning that no trimming is performed;

value

this is how you provide a description of what you consider a valid value. It can be multiple things:

  • a sub reference, that is called and MUST provide back one of the following alternatives;

  • a regular expression reference, that is used directly;

  • a plain string, that is turned into an array reference by creating an anonymous array with the string as its only element, then processed as in the following bullet;

  • an array reference with elements inside, that will be described in the following list.

If you end up with an array reference, each element will be put in a big regular expression that is the OR of all elements. Each can be:

  • a regular expression reference, that is fit as-is in the big regular expression;

  • the string specials, that is the same as having put the three strings escaped, single-quoted and double-quoted;

  • the string quoted, that is the same as having put the two strings single-quoted and double-quoted;

  • the string single-quoted (or single_quoted), that allows you to match a string that is delimited by single quotes, with no escaping inside. This is always put at the beginning of the big regular expression (although double-quoted strings may actually be placed before it);

  • the string double-quoted (or double_quoted), that allows you to match a string that is delimited by double quotes, also allowing escaped elements inside (via backslashes). This is always put at the beginning of the big regular expression;

  • the string escaped, that allows you to match a non-greedy sequence of escaped characters (via backslash). If single-quoted is also specified, single quotes need to be escaped too. If double-quoted is also specified, double quotes need to be escaped too. This is always set at the end of the big regular expression (except for whatever, that might appear after it);

  • the string whatever, that allows you to match a non-greedy sequence of characters, i.e. it is a synonym of regular expression (?ms:.*?). If present, it is always set at the end of the big regular expression.

For example, if you want to accept single quoted, double quoted and unquoted strings, you might provide the following:

   [qw< single-quoted double-quoted whatever >]

ghashy

   my $tube = ghashy(%args); # OR
   my $tube = ghashy(\%args);

parse the input text as a hash, generalized. The algorithm used is the same as "generalized_hashy" in Data::Tubes::Util. It is a generalization of "hashy" below.

Accepts all arguments as "generalized_hashy" in Data::Tubes::Util, with the same default values except for default_key that is set to the empty string (as opposed to not being defined). This means that stand-alone values will always be accepted. This setting is in line with "hashy" and has been set for backwards/mutual compatibility.

The following arguments are recognised too:

defaults

a hash reference with default values for the output;

input

name of the input field, defaults to raw;

name

name of the tube, useful for debugging. Defaults to parse ghashy;

output

name of the output field, defaults to structured;

hashy

   my $tube = hashy(%args); # OR
   my $tube = hashy(\%args);

parse the input text as a hash. The algorithm used is the same as "metadata" in Data::Tubes::Util.

The following arguments are supported:

chunks_separator

character used to divide chunks in the input, defaults to a space character (ASCII 0x20);

default_key

the default key to be used when a key is not present in a chunk, defaults to the empty string;

defaults

a hash reference with default values for the output;

input

name of the input field, defaults to raw;

key_value_separator

character used to divide the key from the value in a chunk, defaults to the equal sign =;

name

name of the tube, useful for debugging. Defaults to parse hashy;

output

name of the output field, defaults to structured;

This tube factory is strict in what it accepts as input, in that the separators MUST be single characters and there is no escaping mechanism. If you need something more flexible, see "ghashy" above.
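
The arguments above suggest an algorithm along the lines of the plain Perl sketch below. It is only an approximation written for illustration: the parse_hashy function is hypothetical, and details such as splitting each chunk at the first key/value separator and skipping empty chunks are assumptions, not documented behavior of "metadata" in Data::Tubes::Util.

```perl
use strict;
use warnings;

# Sketch only, with assumptions: chunks are split on a single character,
# each chunk is split at the first key/value separator, and a chunk
# without it is stored under the default key.
sub parse_hashy {
   my ($text, %args) = @_;
   my $cs  = defined $args{chunks_separator} ? $args{chunks_separator} : ' ';
   my $kvs = defined $args{key_value_separator}
     ? $args{key_value_separator} : '=';
   my $dk   = defined $args{default_key} ? $args{default_key} : '';
   my %hash = %{$args{defaults} || {}};
   for my $chunk (split /\Q$cs\E/, $text) {
      next unless length $chunk;    # assumption: ignore empty chunks
      my ($key, $value) = split /\Q$kvs\E/, $chunk, 2;
      ($key, $value) = ($dk, $key) unless defined $value;
      $hash{$key} = $value;
   }
   return \%hash;
}

my $parsed = parse_hashy(
   'name=tubes 42 lang=perl',
   defaults => {lang => 'c', os => 'any'},
);
print join(', ', map { "'$_' => '$parsed->{$_}'" } sort keys %$parsed), "\n";
```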

parse_by_format

Alias for "by_format".

parse_by_regex

Alias for "by_regex".

parse_by_separators

Alias for "by_separators".

parse_by_split

Alias for "by_split".

parse_by_value_separator

Alias for "by_value_separator".

parse_ghashy

Alias for "ghashy".

parse_hashy

Alias for "hashy".

parse_single

Alias for "single".

single

   my $tube = single(%args); # OR
   my $tube = single(\%args);

consider the input text as already parsed, and generate as output a hash reference where the text is associated to a key.

The following arguments are supported:

input

name of the input field, defaults to raw;

key

key to use for associating the input text;

name

name of the tube, useful for debugging;

output

name of the output field, defaults to structured;
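
In plain Perl terms, the effect described above amounts to something like the following sketch, where the key name line is a hypothetical choice made for this example:

```perl
use strict;
use warnings;

# Sketch only: wrap the raw text in a one-key hash. The key name
# 'line' is a hypothetical choice for this example.
my $key    = 'line';
my $record = {raw => 'some text, taken as-is'};
$record->{structured} = {$key => $record->{raw}};
print "$record->{structured}{line}\n";
```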

BUGS AND LIMITATIONS

Report bugs either through RT or GitHub (patches welcome).

AUTHOR

Flavio Poletti <polettix@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2016 by Flavio Poletti <polettix@cpan.org>

This module is free software. You can redistribute it and/or modify it under the terms of the Artistic License 2.0.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.