NAME

Array::GroupBy - Group equal elements of an ordered array, or list.

VERSION

This document describes Array::GroupBy version 0.0.1

SYNOPSIS

  use Array::GroupBy;
  # or
  use Array::GroupBy qw(igroup_by str_row_equal num_row_equal);

Array::Groupby exports function igroup_by() by default, and convenience functions str_row_equal() and num_row_equal() if requested.

DESCRIPTION

igroup_by

igroup_by() returns an iterator which when called, returns sub-arrays of the given array whose elements are "equal" as determined by a user-supplied boolean function. The iterator does this by stepping through the given data array, comparing adjacent elements, and without sorting the array. The name is inspired by the SQL GROUP BY clause.

str_row_equal
num_row_equal

str_row_equal() and num_row_equal() are convenience row-comparison routines which might be of use for database-derived two-dimensional arrays (i.e. arrays of arrays, of the kind returned, for example, by DBI module's fetchall_arrayref()). They compare, respectively, rows of strings or numbers possibly containing undef values. (See below)

The general usage for igroup_by is:

    use Array::GroupBy;

    $iter = igroup_by(
                data    => \@data,
                compare => \&compare,
              [ args    => \@args, ]    # optional arguments to compare()
                     );

    while ($a = $iter->()) {
      # do something with array @{ $a } ...
    }

The user-supplied boolean function compare() should return 1 (true) if the two array elements passed as arguments are "equal", otherwise return 0 (false).

Example 1: Simple one-dimensional lists:

    use Array::GroupBy;

    # the data to be "grouped"
    my @a = qw(alpha alpha alpha beta beta charlie alpha alpha);

    my $iter = igroup_by(
                data    => \@a,
                compare => sub { $_[0] eq $_[1] },
    );

On repeated calls of:

    while ( my $b = $iter->() ) {
      ...
    }

Array @{ $b } would contain, in order:

    qw(alpha alpha alpha)
    qw(beta beta)
    qw(charlie)
    qw(alpha alpha)

In Example 1 above, where the data was a list of strings, the comparison subroutine was:

    compare => sub { $_[0] eq $_[1] }

If the data consisted of a list of numbers, the comparison subroutine would, of course, become:

    compare => sub { $_[0] == $_[1] }

Example 2: Two-dimensional arrays:

    use Array::GroupBy;

    # people's favourite colour(s)
    # (John and David each have two favourite colours, Alice only one)
    my $l1 = [ qw( Smith John  red    ) ];
    my $l2 = [ qw( Smith John  blue   ) ];
    my $l3 = [ qw( Smith Alice orange ) ];
    my $l4 = [ qw( Black David green  ) ];
    my $l5 = [ qw( Black David red    ) ];

    my $a = [ $l1, $l2, $l3, $l4, $l5 ]; # array to be grouped

    my $iter = igroup_by(
                data    => $a,
                # the data contains no '|' characters
                compare => sub { my ($row1, $row2, $slice) = @_;
                               join('|', @{ $row1 }[ @{ $slice } ] )
                               eq
                               join('|', @{ $row2 }[ @{ $slice } ] )
                             },
                args    => [ 0, 1 ],  # slice: compare first two columns only
    );
  

On repeated calls of:

    while ( my $b = $iter->() ) {
      ...
    }

Array @{ $b } would contain, in order,

    ( $l1, $l2 ),
    ( $l3      ),
    ( $l4, $l5 )

Note that the comparison function used in Example 2 is for illustration only. A much better routine for this example would be str_row_equal() included with the module.

Routines str_row_equal() and num_row_equal()

str_row_equal() and num_row_equal() are row-comparison routines which are useful when grouping two-dimensional arrays as in Example 2 above.

The subroutines are called with 2 or 3 arguments:

    $bool = str_row_equal($row1, $row2)         # for text data
    $bool = num_row_equal($row1, $row2)         # for numeric data
    # or
    $bool = str_row_equal($row1, $row2, $slice) # for text data
    $bool = num_row_equal($row1, $row2, $slice) # for numeric data

where the third argument, $slice, derives from the "args => ..." argument in group_by()

str_row_equal() compares arrays of string data possibly containing undef values, typically returned from database SQL queries in which DBI maps NULL values to undef.

Similarly, num_row_equal() compares arrays of numeric data possibly containing undef values.

Both routines return 1 (true) if the rows are "equal", 0 (false) if they are "unequal"

When comparing rows, if str_row_equal() and num_row_equal() encounter undef elements in corresponding column positions, they will consider the elements equal. When corresponding column elements are defined and undef respectively, the elements are considered unequal.

This truth table demonstrates the various combinations (in this case for numeric comparisons):

   --------+-----------+---------------+---------------+--------------
    row 1  | (1, 2, 3) | (1, undef, 3) | (1, undef, 3) | (1,     2, 3)
    row 2  | (1, 2, 3) | (1, undef, 3) | (1,     2, 3) | (1, undef, 3)
   --------+-----------+---------------+---------------+--------------
    equal? |   yes     |     yes       |      no       |      no

Also note that neither str_row_equal() nor num_row_equal() generate diagnostics if called with rows of unequal lengths, or for args => [...] slice arguments which are out of bounds for the rows being compared: in both cases a value of 0 (false) will be returned.

Example 3: Simulating SQL "GROUP BY" clause

Given a hypothetical annual salary dataset containing Person Name, Year, and Salary, in k$ (ordered by Person Name), print out the max annual salary for each Person and the year(s) during which each Person received that maximum salary.

  use Array::GroupBy;
  use List::Util qw( max );

  # salary dataset
  my @amounts = (
      [ "Smith, J", 2009, 65 ],
      [ "Smith, J", 2010, 63 ],
        ...
      [ "Brown, F", 2006, 45 ],
      [ "Brown, F", 2007, 47 ],
        ...
  );

  my $iter = igroup_by(
                data    => \@amounts,
                compare => sub { $_[0]->[0] eq $_[1]->[0] },
                      );

  while (my $subset = $iter->()) {
    my $max_sal = max map { $_->[2] } @$subset; # max salary

    print "Name: $subset->[0]->[0], ",
           "Max Salary: $max_sal, Year(s) max salary reached: ",
           join(', ',
               map  { $_->[1] }
               grep { $_->[2] == $max_sal } @$subset
           ),
           "\n";
  }

See t/5.t for code.

Example 4: Building objects

This is the real, "scratch-my-itch" reason for this module: to be able to take multi-level data generated by SQL, and build objects from the returned data, in this example Moose objects.

The hypothetical situation being modelled in the database is that patients make multiple visits to a doctor on several occasions and on each visit receive a diagnosis of their condition.

So object Visit has three attributes, the date the visit took place, the name of the doctor, and the diagnosis. Object Patient has a first and last name and a list of Visits. To keep it simple, all scalar attributes are strings. We assume all patients have unique (First, Last) name pairs.

  package Visit;
  use Moose;
  has  date      => (is => 'ro', isa => 'Str');
  has  doctor    => (is => 'ro', isa => 'Str');
  has  diagnosis => (is => 'ro', isa => 'Str');

  package Patient;
  use Moose;
  has last      => (is => 'ro', isa => 'Str'); 
  has first     => (is => 'ro', isa => 'Str'); 
  has Visits    => (is => 'ro', isa => 'ArrayRef[Visit]');

  no Moose;

  use DBI;

  ...

  my @result;     # this will contain a list of Patient objects

  my $sql = q{
    SELECT
       P.Last, P.First
      ,V.Date, V.Doctor, V.Diagnosis
    FROM
      Patient P
      ,Visit  V
    WHERE
      V.Patient_key = P.Patient_key   -- join clause
      ...
    ORDER BY
       P.Last, P.First
  };

  my $dbh = DBI->connect(...);

  my $data = $dbh->selectall_arrayref($sql);

  # rows of @$data contain: Last, First, Date, Doctor, Diagnosis
  #           at positions: [0]   [1]    [2]   [3]     [4]

  my $iter = igroup_by(
                data    => $data,
                compare => \&str_row_equal,
                args    => [ 0, 1 ],
                      );

  while (my $subset = $iter->()) {

    my @visits = map { Visit->new(
                        date        => $_[2],
                        doctor      => $_[3],
                        diagnosis   => $_[4],
                                 )
                     } @$subset;

    push @result, Patient->new(
                        last  => $subset->[0]->[0],
                        first => $subset->[0]->[1],
                        Visit => \@visits,
                              );
  }
  

See t/6.t for code.

DIAGNOSTICS

Most error diagnostics are generated by the Params::Validate module which igroup_by() uses for argument validation.

The data => ... and compare => ... parameters are mandatory. Omitting one will generate error message:

  Mandatory parameter '<data or compare>' missing in call to
    Array::GroupBy::igroup_by

Similarly, using a parameter not in the list ( "data", "compare", "args" ), e.g., typo compaer => ..., will generate error:

  The following parameter was passed in the call to Array::GroupBy::igroup_by
  but was not listed in the validation options: compaer

If the argument to the compare => ... parameter is not a subroutine reference, e.g., compare => 'my_sub', this will generate error:

  The 'compare' parameter ("my_sub") to Array::GroupBy::igroup_by was a
  'scalar', which is not one of the allowed types: coderef

If any of values of the parameters are undefined, this will generate error:

  The '<data|compare|args>' parameter (undef) to Array::GroupBy::igroup_by
  was an 'undef', which is not one of the allowed types: ...

Passing an empty data array, e.g., data => [], will result in error:

  The array passed to igroup_by( data => ... ) is empty,
  called at <program name> line <nnn>.

DEPENDENCIES

    Carp
    Params::Validate
    List::Util

BUGS AND LIMITATIONS

No bugs have been reported (yet).

Please report any bugs or feature requests to bug-array-groupby@rt.cpan.org, or through the web interface at http://rt.cpan.org.

AUTHOR

Sam Brain <samb@stanford.edu>

LICENCE AND COPYRIGHT

Copyright (c) 2012, Sam Brain <samb@stanford.edu>. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

DISCLAIMER OF WARRANTY

BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.