NAME

Data::Tubes::Util

DESCRIPTION

Helper functions for automatic management of argument lists, plus other utilities.

FUNCTIONS

args_array_with_options

    my ($aref, $args) = args_array_with_options(@list, \%defaults); # OR
    my ($aref, $args) = args_array_with_options(@list, \%args, \%defaults);

helper function to ease parsing of input parameters. This is mostly useful when your function usually takes a list as input, but you want to be able to provide an optional hash of arguments.

The function returns an array reference with the list of parameters, and a hash reference of arguments for less common things.

When calling this function, you are always supposed to pass a hash reference of options as the last argument, which will act as a default. If the element immediately before it is itself a hash reference, it will be considered the input for overriding arguments. Their combination (a simple overriding at the highest hash level) is then returned as $args.

The typical way to invoke this function is like this:

   sub foo {
      my ($list, $args) = args_array_with_options(@_, {bar => 'baz'});
      ...
   }

so that the function foo can be called with an optional trailing hash reference containing the arguments, like this:

   foo(qw< this and that >, {bar => 'galook!'});

In case your list might actually contain hash references, you will have to take this into consideration.
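
The splitting described above can be sketched in plain Perl. This is only an illustration under the stated rules, not the module's actual code; sketch_args_array_with_options is a hypothetical stand-in name.

```perl
use strict;
use warnings;

# Hypothetical stand-in, not the module's code: the last argument holds the
# defaults; a hash reference right before it overrides them.
sub sketch_args_array_with_options {
   my @list     = @_;
   my $defaults = pop @list;
   my %args     = %$defaults;
   if (@list && ref($list[-1]) eq 'HASH') {
      my $overrides = pop @list;
      %args = (%args, %$overrides);    # shallow merge, top level only
   }
   return (\@list, \%args);
}

sub foo {
   my ($list, $args) = sketch_args_array_with_options(@_, {bar => 'baz'});
   return join ' ', @$list, "[bar=$args->{bar}]";
}

print foo(qw< this and that >), "\n";                        # defaults apply
print foo(qw< this and that >, {bar => 'galook!'}), "\n";    # override applies
```

Note that in this sketch a trailing hash reference is always consumed as options, which is exactly why a list that may legitimately end with a hash reference needs extra care.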

assert_all_different

   $bool = assert_all_different(@strings);

checks that all strings in @strings are different. Returns 1 if the check is successful, throws an exception otherwise. The exception is a hash reference with a key message set to the first string that is found repeated.
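
A minimal sketch of the contract described above, assuming string comparison; sketch_assert_all_different is a hypothetical name and the real function may differ in details.

```perl
use strict;
use warnings;

# Hypothetical stand-in: die with a hash reference on the first repetition
sub sketch_assert_all_different {
   my %seen;
   for my $string (@_) {
      die {message => $string} if $seen{$string}++;
   }
   return 1;
}

print "all different\n" if sketch_assert_all_different(qw< a b c >);

my $error;
eval { sketch_assert_all_different(qw< a b a c b >); 1 } or $error = $@;
print "first repeated string: $error->{message}\n";
```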

generalized_hashy

   $outcome = generalized_hashy($text, %args); # OR
   $outcome = generalized_hashy(%args);        # OR
   $outcome = generalized_hashy(\%args);

very generic parsing function that tries to figure out a hash out of an input text.

The default settings are optimized for whipuptitude and DWIMmery. This means that a lot of strings that you would hardly consider sane are parsed anyway, just to give you something fast. If you need to be precise instead, you can either customize the different %args, use a different parsing function or... roll your own.

The returned value is a hash reference with the following keys:

failpos

in case of failure, it reports the position in the input text where the parsing was unsuccessful. It is absent when the parsing succeeds;

failure

in case of failure, it reports an error message. It is absent when the parsing succeeds;

hash

the parsed hash. It is absent when the parsing fails;

pos

the position at which the parsing ended, because the "close" sequence was found;

res

the number of characters in the input text that were not parsed;

The model is the following:

  • the string is considered a sequence of chunks, optionally marked at the beginning by an open sequence, and at the end by a close sequence. Chunks are separated by a chunk separator;

  • each chunk can be either a stand-alone value or a key/value pair. In the latter case, key and value are separated by a key-value separator;

  • there is something that defines what a valid key and value looks like.

This gives you the following options via %args:

capture

the regular expression that dominates all the other ones. You normally don't want to set it directly, but you can if you look at how the code uses it.

You can set this argument to a regular expression that was compiled in a previous invocation of generalized_hashy, because it is returned at every invocation. So, the typical idiom for avoiding the recompilation of this regular expression every time is:

   # get the capture, set text to undef to avoid any parsing
   $args{capture} = generalized_hashy(undef, %args)->{capture};

From now on, $args{capture} contains the regular expression and generalized_hashy will not need to compute it again when called with this %args list.

It has no default value.

chunks_separator

a regular expression for telling chunks apart. Defaults to:

   chunks_separator => qr{(?mxs: \s* [\s,;\|/] \s*)}

i.e. it eats up surrounding spaces, and can be a space, comma, semicolon, pipe or slash character;

close

a regular expression for stating that the hash ends. Defaults to:

   close => qr{(?mxs: \s*\z)}

i.e. it eats up optional trailing whitespace and expects to find the end of the string;

key

a regular expression for valid keys. This allows you to be quite precise as to what you admit for keys, but be sure to take a look at "key_admitted" below for a quicker way to set this parameter.

It does not have a default value as it relies upon "key_admitted"'s one.

key_admitted

a specification for valid, unquoted keys. When specifying this parameter and not setting a "key", the key is computed according to the algorithm explained below for admitted sequences.

This parameter can be either a regular expression, or a plain string containing the admitted characters. Defaults to:

   key_admitted => qr{[^\\'":=\s,;\|/]};

i.e. whatever cannot fit in either separator.

key_decoder

a decoding function for a parsed key. You might want to set it when you allow quoting and/or escape sequences in your keys.

By default, it removes quotes and escaping characters related to "key_admitted";

key_default
default_key

a default key to use when there is a stand-alone value. The default_key variant is provided for compatibility with "metadata" and "hashy" in Data::Tubes::Plugin::Parser.

When not set and a stand-alone value is found, the parsing fails and an error is returned.

There is no default. Note that this is different from the default setting/behaviour of "ghashy" in Data::Tubes::Plugin::Parser, although that function uses generalized_hashy behind the scenes. Again, this is for similarity with hashy and backwards compatibility.

key_duplicated

a sub reference that will be called whenever a key is already present in the output hash. This allows you to e.g. complain loudly in case your input has a duplicated key.

By default, when a duplicate key is found for the first time the current value is transformed into an array reference whose first element is the old value and the second one is the new value. Any following value for that key is appended to the array;

key_value_separator

a regular expression for telling a key from a value. Defaults to:

   key_value_separator => qr{(?mxs: \s* [:=] \s*)}

i.e. it eats up surrounding spaces, and can be a colon or an equal sign;

open

a regular expression for the hash beginning. Defaults to:

   open => qr{(?mxs: \s* )}

i.e. it eats up optional leading whitespace;

pos

an integer value to set the initial position for parsing the input string. Defaults to 0, i.e. the start of the string;

text

the text to parse. This can also appear as the first unnamed parameter in the argument list;

value

a regular expression for valid values. This allows you to be quite precise as to what you admit for values, but be sure to take a look at "value_admitted" below for a quicker way to set this parameter.

It does not have a default value as it relies upon "value_admitted"'s one.

value_admitted

a specification for valid, unquoted values. When specifying this parameter and not setting a "value", the value is computed according to the algorithm explained below for admitted sequences.

This parameter can be either a regular expression, or a plain string containing the admitted characters. Defaults to:

   value_admitted => qr{[^\\'":=\s,;\|/]};

i.e. whatever cannot fit in either separator.

value_decoder

a decoding function for a parsed value. You might want to set it when you allow quoting and/or escape sequences in your values.

By default, it removes quotes and escaping characters related to "value_admitted";

When using either "key_admitted" or "value_admitted", the "key" and "value" regular expressions will be computed automatically allowing for single and double quoted strings. This is what we refer to as admitted sequences. In this case, the admitted regular expression (we will call it $admitted) is used as follows:

   allowed_sequence => qr{(?mxs:
      (?mxs:
         (?: "(?: [^\\"]+ | \\. )*") # double quotes
         | (?: '[^']*')              # single quotes
      )
      | (?: (?: $admitted | \\.)+? ) # unquoted sequence, with escapes
   )}

In case $admitted is not a regular expression, it is transformed into one like this:

   $admitted = qr{[\Q$admitted\E]}

i.e. it is considered a set of valid characters and transformed into a character class.

One admitted sequence can then be either of the following:

double-quoted

in this case, it is bound by double quotes characters, and can contain any character, including the double quotes themselves, by escaping using the backslash. As a matter of fact, every sequence of a backslash and a character is accepted whatever the second character is (including the backslash itself and the quoting character);

single-quoted

in this case, it is bound by single quote characters, and can contain any character except the single quote itself. This differs from what Perl accepts in single-quoted strings, and is more in line with what happens in other languages (e.g. the shell);

unquoted

in this case, no quotation character is considered, and the $admitted characters are used, with a twist: you can still escape otherwise invalid characters with the backslash.

If you don't like all this DWIMmery you can set "key" and "value" independently, of course.
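
To see the three kinds of admitted sequence at work, the fragments above can be assembled into a small stand-alone sketch. The helper name sketch_sequence_regex is hypothetical, and the way the expression is anchored for testing is an assumption; generalized_hashy embeds it in a larger expression.

```perl
use strict;
use warnings;

# Hypothetical helper, assembled from the fragments shown above
sub sketch_sequence_regex {
   my ($admitted) = @_;

   # a plain string is taken as a set of valid characters
   $admitted = qr{[\Q$admitted\E]} unless ref($admitted) eq 'Regexp';

   return qr{(?mxs:
      (?mxs:
         (?: "(?: [^\\"]+ | \\. )*") # double quotes
         | (?: '[^']*')              # single quotes
      )
      | (?: (?: $admitted | \\.)+? ) # unquoted sequence, with escapes
   )};
}

# same character class as the documented default for key_admitted
my $sequence = sketch_sequence_regex(qr{[^\\'":=\s,;\|/]});

for my $candidate (
   q{ever},                       # unquoted
   q{'ever "," you= do'},         # single-quoted
   q{"ever \",\" you= do"},       # double-quoted, escaped quotes inside
   q{ever\ \"\,\"\ you\=\ do},    # unquoted, with escapes
   q{ever you},                   # unescaped space, not a single sequence
) {
   my $outcome = $candidate =~ m{\A $sequence \z}mxs ? 'valid' : 'invalid';
   print "$outcome: $candidate\n";
}
```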

Some examples are due. The following inputs all produce the same output in the default settings, ranging from mostly OK to definitely weird:

    input text -> q< what:ever you:do >
    input text -> q< what: ever you: do >
    input text -> q< what: ever you= do | wow: yay >
    input text -> q< what: ever , you= do | wow: yay >
   output hash -> {what => 'ever', you => 'do', wow => 'yay'}

This shows you that you can do some escaping in the keys and values:

    input text -> q< what: ever\ \"\,\"\ you\=\ do | wow: yay >
    input text -> q< what: 'ever "," you= do'      | wow: yay >
    input text -> q< what: "ever \",\" you= do"    | wow: yay >
   output hash -> {what => 'ever "," you= do', wow => 'yay'}

load_module

   my $module = load_module($locator); # OR
   my $module = load_module($locator, $prefix);

loads a module automatically. There are a lot of modules on CPAN that do this, probably much better, but this should do for this module's needs.

The $locator is resolved into a full module name through "resolve_module"; the resulting module is then required and its resolved name is returned.

Example:

   my $module = load_module('Reader');

loads module Data::Tubes::Plugin::Reader and returns the string Data::Tubes::Plugin::Reader, while:

   my $other_module = load_module('Foo::Bar');

loads module Foo::Bar and returns string Foo::Bar.

You can optionally pass a $prefix that will be passed to "resolve_module", see there for further information.

load_sub

   my $sub = load_sub($locator); # OR
   my $sub = load_sub($locator, $prefix);

loads a sub automatically. There are a lot of modules on CPAN that do this, probably much better, but this should do for this module's needs.

The $locator is split into a pair of module and subroutine name. The module is loaded through "load_module"; the subroutine reference is then returned from that module.

Example:

   my $sub = load_sub('Reader::by_line');

loads subroutine Data::Tubes::Plugin::Reader::by_line and returns a reference to it, while:

   my $other_sub = load_sub('Foo::Bar::baz');

returns a reference to subroutine Foo::Bar::baz after loading module Foo::Bar.

You can optionally pass a $prefix that will be passed to "resolve_module", see there for further information.
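
The split-then-load mechanism can be sketched with core Perl only. The sketch below assumes an already fully qualified locator (no prefix resolution) and uses the core module List::Util purely as a convenient target; sketch_load_sub is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical stand-in: split a locator into module and sub name, require
# the module, then hand back a reference to the sub.
sub sketch_load_sub {
   my ($locator) = @_;
   my ($module, $name) = $locator =~ m{\A (.*) :: (\w+) \z}mxs
     or die "invalid locator '$locator'\n";
   (my $path = "$module.pm") =~ s{::}{/}g;    # Foo::Bar -> Foo/Bar.pm
   require $path;
   my $sub = $module->can($name)
     or die "no sub '$name' in module '$module'\n";
   return $sub;
}

my $max = sketch_load_sub('List::Util::max');
print $max->(3, 10, 7), "\n";
```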

metadata

   my $href = metadata($input, %args); # OR
   my $href = metadata($input, \%args);

parse input string $input according to the rules described below, which can be controlled through %args.

The string is split on the base of two separators, a chunks separator and a key/value separator. The first one isolates what should be key/value pairs, the second allows separating the key from the value in each of these chunks. Whenever a chunk is not actually a key/value pair, it is considered a value and associated to a default key.

The following items can be set in %args:

chunks_separator

what allows separating chunks, it MUST be a single character;

default_key

a string used as the key when a chunk cannot be split into a pair;

key_value_separator

what allows separating the key from the value in a chunk, it MUST be a single character.

Examples:

   # use defaults
   my $input = 'foo=bar baz=galook booom!';
   my $href = metadata($input);
   # $href = {
   #    foo => 'bar',
   #    baz => 'galook',
   #    ''  => 'booom!'
   # }

   # set a default key
   my $input = 'foo=bar baz=galook booom!';
   my $href = metadata($input, default_key => 'name');
   # $href = {
   #    foo  => 'bar',
   #    baz  => 'galook',
   #    name => 'booom!'
   # }

   # use alternative separators
   my $input = 'foo:bar & bar|baz:galook booom!|whatever';
   my $href = metadata($input,
      default_key => 'name',
      chunks_separator => '|',
      key_value_separator => ':'
   );
   # $href = {
   #    foo  => 'bar & bar',
   #    baz  => 'galook booom!',
   #    name => 'whatever'
   # }
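
The two-level split can be approximated in a few lines. This is a simplified sketch, not the module's implementation: the defaults (a space, an equal sign and an empty default key) are inferred from the first example above, and sketch_metadata is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical stand-in; default values are inferred from the examples
sub sketch_metadata {
   my ($input, %args) = @_;
   %args = (
      chunks_separator    => ' ',
      key_value_separator => '=',
      default_key         => '',
      %args,
   );
   my %retval;
   for my $chunk (split /\Q$args{chunks_separator}\E/, $input) {
      my ($key, $value) = split /\Q$args{key_value_separator}\E/, $chunk, 2;

      # no separator in the chunk: it is a value for the default key
      ($key, $value) = ($args{default_key}, $key) unless defined $value;
      $retval{$key} = $value;
   }
   return \%retval;
}

my $href = sketch_metadata(
   'foo:bar & bar|baz:galook booom!|whatever',
   default_key         => 'name',
   chunks_separator    => '|',
   key_value_separator => ':',
);
print "$_ => $href->{$_}\n" for sort keys %$href;
```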

normalize_args

   my $args = normalize_args( %args, \%defaults); # OR
   my $args = normalize_args(\%args, \%defaults); # OR
   my $args = normalize_args($value, %args, [\%defaults, $key]);

helper function to handle input parameters, with some defaults. Allows accepting both a series of key/value pairs, or a hash reference with these pairs, while at the same time providing default values.

A typical usage is as follows:

   sub foo {
      my $args = normalize_args(@_, {bar => 'baz'});
      ...
   }

The last version allows you to accept an initial $value without a key in your functions, because you pass the default $key during the call to normalize_args. A typical usage is as follows:

   sub foo {
      my $args = normalize_args(@_, [{bar => 'baz'}, 'aargh']);
      ...
   }

In this case, you can accept calling foo like this:

   foo('some value', salutation => 'aloha');

and $args will be populated as follows:

   $args = {
      aargh => 'some value', # thanks to the default $key
      salutation => 'aloha', # passed as %args
      bar => 'baz',          # from defaults
   };
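
The merging logic can be sketched as follows. How the real function tells a leading stand-alone value from a lone hash reference is not stated here, so the odd-count test below is an assumption; sketch_normalize_args is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the three calling conventions shown above
sub sketch_normalize_args {
   my @input    = @_;
   my $defaults = pop @input;
   my %leading;

   # [\%defaults, $key] form: ASSUMPTION, an odd number of remaining items
   # means that the first one is a stand-alone value to store under $key
   if (ref($defaults) eq 'ARRAY') {
      my $key;
      ($defaults, $key) = @$defaults;
      %leading = ($key => shift @input) if @input % 2;
   }

   # what is left is either a single hash reference or key/value pairs
   my %args =
     (@input == 1 && ref($input[0]) eq 'HASH') ? %{$input[0]} : @input;

   my %retval = (%$defaults, %leading, %args);
   return \%retval;
}

sub foo {
   return sketch_normalize_args(@_, [{bar => 'baz'}, 'aargh']);
}

my $args = foo('some value', salutation => 'aloha');
print "$_ => $args->{$_}\n" for sort keys %$args;
```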

normalize_filename

   my $name_or_handle = normalize_filename($filename, $default_handle);

helper function to normalize a file name according to some rules. In particular, depending on $filename:

  • if it is a filehandle, it is returned directly;

  • if it is the string -, the $default_handle is returned. This allows you to use STDIN or STDOUT as input/output handles in case the filename is - (like many applications support);

  • if it starts with the string file:, this prefix is stripped away and the rest is used as a filename. This allows you to actually use - as a real file name, avoiding the automatic handle management described in the bullet above. If your filename may start with the string file:, then you should always put this prefix, e.g.:

       file:whatever   -- should be passed as -->  file:file:whatever

  • if it starts with the string handle:, this prefix is stripped and the rest is used to get one of the standard filehandles. The allowed remaining parts are (case-insensitive):

    in
    stdin
    out
    stdout
    err
    stderr

    Any other remaining part causes an exception to be thrown.

    Again, if you actually need to create a file whose name is e.g. handle:whatever, you have to prefix it with file:, like this:

       handle:whatever   -- should be passed as -->  file:handle:whatever

  • otherwise, the provided $filename will be returned as-is.
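
The rules in the list above translate into a short decision chain. The sketch below is an illustration only: it recognizes filehandles just by checking for a GLOB reference, which is a simplification, and sketch_normalize_filename is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical stand-in following the bullets above, in the same order
sub sketch_normalize_filename {
   my ($filename, $default_handle) = @_;

   return $filename       if ref($filename) eq 'GLOB';    # simplification
   return $default_handle if $filename eq '-';
   return $filename       if $filename =~ s{\Afile:}{}mxs;

   if (my ($name) = $filename =~ m{\Ahandle:(.*)\z}mxs) {
      my %handle_for = (
         in  => \*STDIN,  stdin  => \*STDIN,
         out => \*STDOUT, stdout => \*STDOUT,
         err => \*STDERR, stderr => \*STDERR,
      );
      my $handle = $handle_for{lc $name}
        or die "invalid handle name '$name'\n";
      return $handle;
   }

   return $filename;
}

print sketch_normalize_filename('file:-'), "\n";                  # a file named "-"
print sketch_normalize_filename('file:handle:whatever'), "\n";    # a real file name
print {sketch_normalize_filename('handle:STDOUT')} "to standard output\n";
```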

pump

   pump($iterator);
   my $records = pump($iterator);
   my @records = pump($iterator);
   pump($iterator, $sink);

exhaust an $iterator, depending on the conditions:

  • if a $sink is present, it MUST be a sub reference. For each item extracted from the iterator, this sub reference will be called with that item as its argument;

  • otherwise, if called in void context, the iterator is simply exhausted, without any kind of accumulation of the records generated;

  • otherwise, depending on scalar context or list context, an array reference or a list of generated records is returned.
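
The context-dependent behaviour can be sketched with wantarray. The sketch assumes that an iterator is a code reference returning an empty list once exhausted, which is an assumption of this example; sketch_pump and make_iterator are hypothetical names.

```perl
use strict;
use warnings;

# Hypothetical iterator factory: hands out one item per call, then nothing
sub make_iterator {
   my @queue = @_;
   return sub { return @queue ? shift @queue : () };
}

# Hypothetical stand-in for the three behaviours listed above
sub sketch_pump {
   my ($iterator, $sink) = @_;
   my $wantarray = wantarray;    # undef in void context

   if ($sink) {
      while (my @items = $iterator->()) { $sink->(@items) }
      return;
   }
   if (!defined $wantarray) {    # void context: just drain the iterator
      while (my @items = $iterator->()) { }
      return;
   }

   my @records;
   while (my @items = $iterator->()) { push @records, @items }
   return $wantarray ? @records : \@records;
}

my @records = sketch_pump(make_iterator(qw< a b c >));
print "list context: @records\n";

my $records = sketch_pump(make_iterator(qw< a b c >));
print "scalar context: ", scalar(@$records), " records\n";

sketch_pump(make_iterator(qw< a b c >), sub { print "sink got: $_[0]\n" });
```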

read_file

   my $contents = read_file($filename, %args); # OR
   my $contents = read_file(%args); # OR
   my $contents = read_file(\%args);

a slurping facility. The following options are available:

binmode

parameter for CORE::binmode, defaults to :encoding(UTF-8);

filename

the filename (or reference to a string, if you really need it) to slurp data from.

You can optionally pass the filename standalone as the first argument without pre-pending it with the string filename. In this case, it MUST appear as the first item in the argument list.

read_file_maybe

   my $text = read_file_maybe(\@aref);
   my $x    = read_file_maybe($x); # where ref($x) ne 'ARRAY'

helper function that expands the input argument with "read_file" if it is an array reference, while returning the input argument unchanged otherwise.

This can be useful if you want to overload an input parameter with either a straight text or something that should be loaded from a file, like a template:

   my $template = read_file_maybe($args{template});

In this case, if $args{template} is plain text, it will be returned unchanged. Otherwise, if it is an array reference, it will be expanded into a list passed to "read_file", and the contents of the file will be returned.

Examples:

   $text = read_file_maybe('this goes straight');  # direct text
   # $text contains 'this goes straight' now

   $text = read_file_maybe(['/path/to/text.txt']);
   # $text has the contents of file /path/to/text.txt now

   $text = read_file_maybe(['/path/to/text.txt', binmode => ':raw']);
   # ditto, but read as raw text instead of default utf-8
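
The dispatch on the argument type can be sketched as below. To stay self-contained, the example reads from an in-memory string through a scalar reference (which "read_file" is documented to accept as filename) instead of a real path; sketch_read_file and sketch_read_file_maybe are hypothetical names covering only the positional filename form.

```perl
use strict;
use warnings;

# Hypothetical, reduced slurper: first argument is the filename (or a
# reference to a string), followed by options
sub sketch_read_file {
   my ($filename, %args) = @_;
   my $binmode = defined $args{binmode} ? $args{binmode} : ':encoding(UTF-8)';
   open my $fh, '<', $filename or die "open(): $!\n";
   binmode $fh, $binmode;
   local $/;    # slurp mode
   my $contents = <$fh>;
   close $fh;
   return $contents;
}

# Hypothetical stand-in: array references are expanded, the rest is kept
sub sketch_read_file_maybe {
   my ($x) = @_;
   return ref($x) eq 'ARRAY' ? sketch_read_file(@$x) : $x;
}

my $stored = "Hello, [% name %]!\n";    # plays the role of a file on disk
print sketch_read_file_maybe('this goes straight'), "\n";
print sketch_read_file_maybe([\$stored]);
print sketch_read_file_maybe([\$stored, binmode => ':raw']);
```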

resolve_module

   my $full_module_name = resolve_module($module_name); # OR
   my $full_module_name = resolve_module($module_name, $prefix);

possibly expand a module's name according to a prefix. These are the rules as of release 0.736:

  • if $module_name starts with either a plus sign character + or a caret character ^, this initial character will be stripped away and the rest will be used as the package name. $prefix will be ignored in this case;

  • otherwise, ${prefix}::${module_name} will be returned (where $prefix defaults to the string Data::Tubes::Plugin).

This change (see the API Versioning Note below) simplifies the interface and conforms better to what other modules do in similar situations (principle of least surprise).

Examples:

   resolve_module('^SimplePack'); # SimplePack
   resolve_module('+Some::Pack'); # Some::Pack
   resolve_module('SimplePack');  # Data::Tubes::Plugin::SimplePack
   resolve_module('Some::Pack');  # Data::Tubes::Plugin::Some::Pack
   resolve_module('Pack', 'Some::Thing'); # Some::Thing::Pack
   resolve_module('Some::Pack', 'Some::Thing'); # Some::Thing::Some::Pack
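
The two rules listed for release 0.736 fit in a few lines. This is a sketch of the documented behaviour only, not the module's source; sketch_resolve_module is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the 0.736 rules
sub sketch_resolve_module {
   my ($module, $prefix) = @_;

   # a leading + or ^ marks a full package name, the prefix is ignored
   return $module if $module =~ s{\A[+^]}{}mxs;

   $prefix = 'Data::Tubes::Plugin' unless defined $prefix;
   return $prefix . '::' . $module;
}

print sketch_resolve_module('^SimplePack'), "\n";
print sketch_resolve_module('Some::Pack'), "\n";
print sketch_resolve_module('Pack', 'Some::Thing'), "\n";
```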

API Versioning Note: behaviour of this function changed between version 0.734 and 0.736. The previous behaviour, described below, is still available when $Data::Tubes::API_VERSION (see "API Versioning" in Data::Tubes) is (lexicographically) less than, or equal to, 0.734. Here's what the function does with the older interface:

    • if $module_name starts with an exclamation point !, this initial character will be stripped away and the rest will be used as the package name. $prefix will be ignored in this case;

    • otherwise, if $module_name starts with a plus sign +, this first character will be stripped away and the $prefix will be used (defaulting to Data::Tubes::Plugin);

    • otherwise, if $module_name does not contain sub-packages (i.e. the sequence ::), then the $prefix will be used as in the previous bullet;

    • otherwise, the provided name is used.

    Examples (in the same order as the bullets above):

       resolve_module('!SimplePack'); # SimplePack
       resolve_module('+Some::Pack'); # Data::Tubes::Plugin::Some::Pack
       resolve_module('SimplePack');  # Data::Tubes::Plugin::SimplePack
       resolve_module('Some::Pack');  # Some::Pack
       resolve_module('Pack', 'Some::Thing'); # Some::Thing::Pack
       resolve_module('Some::Pack', 'Some::Thing'); # Some::Pack

shorter_sub_names

   shorter_sub_names($package_name);

this helper is used in plugins to generate alternative versions of the implemented functions, with shorter names.

The basic rationale is that functions are usually named after the area they cover, e.g. the function in Data::Tubes::Plugin::Reader that reads a filehandle line-by-line is called read_by_line. In this way, when you use e.g. summon from Data::Tubes, you end up with a function read_by_line that is much clearer than simply by_line.

On the other hand, when you rely upon automatic running of factory functions like in tube or pipeline (again, in Data::Tubes), some parts are redundant. In the example, you would end up using Reader::read_by_line, where read_ is actually redundant as you already have the last part of the plugin package name to tell you what this by_line thing is about.

shorter_sub_names comes to the rescue to generate alternative names by analysing the current namespace for a package and generating new functions by removing a prefix. In the Data::Tubes::Plugin::Reader case, for example, it is called like this at the end of the module:

   shorter_sub_names(__PACKAGE__);

and it generates, among others, by_line and by_paragraph.

Consider using this if you generate new plugins.

sprintffy

   my $string = sprintffy($template, \@substitutions);

expand a $template string à la sprintf, based on a list of @substitutions.

The template targets are sprintf-like, i.e. sequences that start with a percent sign followed by... something.

Each substitution is supposed to be an array reference with two items inside: a regular expression and a value specifier. The regular expression is used to match what comes after the percent sign, while the value part can be either a straight value, or a subroutine reference that will be run to get the real value for the substitution.

There is always an implicit, high priority substitution that matches a single percent sign and expands to a percent sign, so that the string %% will be unescaped to % as you would expect in something that is sprintf-like.
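
A rough sketch of such an expansion follows. Several details are assumptions made for the example and are not stated above: substitutions are tried in order, a sub reference is called without arguments, and a percent sign that matches no substitution is copied through. sketch_sprintffy is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the expansion described above
sub sketch_sprintffy {
   my ($template, $substitutions) = @_;

   # the implicit rule for %% goes first, so it has the highest priority
   my @rules  = ([qr{%} => '%'], @$substitutions);
   my $output = '';
   my $rest   = $template;

   while (length $rest) {
      if ($rest =~ s{\A([^%]+)}{}mxs) {    # plain text is copied verbatim
         $output .= $1;
         next;
      }
      $rest =~ s{\A%}{}mxs;                # consume the percent sign
      my $expanded;
      for my $rule (@rules) {
         my ($regex, $value) = @$rule;
         next unless $rest =~ s{\A(?:$regex)}{}mxs;
         $expanded = ref($value) eq 'CODE' ? $value->() : $value;
         last;
      }

      # ASSUMPTION: an unmatched percent sign is kept as-is
      $output .= defined($expanded) ? $expanded : '%';
   }
   return $output;
}

print sketch_sprintffy(
   '100%% of %n is %t',
   [[qr{n} => 'foo'], [qr{t} => sub { return 'bar' }]],
), "\n";
```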

test_all_equal

   my $bool = test_all_equal(@list);

test whether all elements in @list are equal to one another or not, and return test output as a boolean value (i.e. something that Perl considers true or false).

trim

   trim(@strings);

remove leading/trailing whitespace from input @strings, in-place.
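
In-place modification works because the elements of @_ are aliases to the caller's variables. A minimal sketch of the idea, where sketch_trim is a hypothetical name:

```perl
use strict;
use warnings;

# Hypothetical stand-in: @_ aliases the caller's variables, so changing
# its elements changes the originals
sub sketch_trim {
   s{\A\s+|\s+\z}{}gmxs for @_;
   return;
}

my @strings = ('  foo ', "\tbar baz\n");
sketch_trim(@strings);
print "[$_]\n" for @strings;
```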

traverse

   my $item = traverse($data, @keys);

Assuming that $data is an array or hash reference, traverse it using items in @keys at each step in the descent.
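
A sketch of such a descent is shown below. What the real function does when a step cannot be taken is not stated here; returning undef in that case is an assumption of this example, and sketch_traverse is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical stand-in: each key selects a hash value or an array item
sub sketch_traverse {
   my ($data, @keys) = @_;
   for my $key (@keys) {
      if    (ref($data) eq 'HASH')  { $data = $data->{$key} }
      elsif (ref($data) eq 'ARRAY') { $data = $data->[$key] }
      else                          { return }    # ASSUMPTION: give up
   }
   return $data;
}

my $data = {foo => [10, {bar => 'baz'}]};
print sketch_traverse($data, qw< foo 1 bar >), "\n";
print sketch_traverse($data, 'foo', 0), "\n";
```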

tube

see tube in Data::Tubes, this is the same function.

unzip

   my ($even, $odds) = unzip(@list); # OR
   my ($even, $odds) = unzip(\@list);

separates even and odd items in the input @list and returns them as two references to arrays.
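
A small sketch of the separation, assuming that "even" and "odd" refer to zero-based positions in the list (so the first item is an even one); sketch_unzip is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical stand-in accepting either a list or an array reference
sub sketch_unzip {
   my $items = (@_ == 1 && ref($_[0]) eq 'ARRAY') ? $_[0] : \@_;
   my (@evens, @odds);
   for my $i (0 .. $#$items) {
      if   ($i % 2) { push @odds,  $items->[$i] }
      else          { push @evens, $items->[$i] }
   }
   return (\@evens, \@odds);
}

# handy for splitting a flat list of pairs into keys and values
my ($keys, $values) = sketch_unzip(foo => 1, bar => 2, baz => 3);
print "keys: @$keys\n";
print "values: @$values\n";
```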

SEE ALSO

Data::Tubes is a valid entry point for all of this.

AUTHOR

Flavio Poletti <polettix@cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2016 by Flavio Poletti <polettix@cpan.org>

This module is free software. You can redistribute it and/or modify it under the terms of the Artistic License 2.0.

This program is distributed in the hope that it will be useful, but without any warranty; without even the implied warranty of merchantability or fitness for a particular purpose.