The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

DBIx::SQLCrosstab - creates a server-side cross tabulation from a database

SYNOPSIS

    use DBIx::SQLCrosstab;
    my $dbh=DBI->connect("dbi:driver:database"
        "user","password", {RaiseError=>1})
            or die "error in connection $DBI::errstr\n";

    my $params = {
        dbh    => $dbh,
        op     => [ ['SUM', 'salary'] ], 
        from   => 'person INNER JOIN departments USING (dept_id)',
        rows   => [
                    { col => 'country'},
                  ],
        cols   => [
                    {
                       id => 'dept',
                       value =>'department',
                       from =>'departments'
                    },
                    {
                        id => 'gender', from => 'person'
                    }
                  ]
    };
    my $xtab = DBIx::SQLCrosstab->new($params)
        or die "error in creation ($DBIx::SQLCrosstab::errstr)\n";

    my $query = $xtab->get_query("#")
        or die "error in query building $DBIx::SQLCrosstab::errstr\n";

    # use the query or let the module do the dirty job for you
    my $recs = $xtab->get_recs
        or die "error in execution $DBIx::SQLCrosstab::errstr\n";

    # do something with records, or use the child class 
    # DBIx::SQLCrosstab::Format to produce well 
    # formatted HTML or XML output
    #

    my $xtab = DBIx::SQLCrosstab::Format->new($params)
        or die "error in creation ($DBIx::SQLCrosstab::errstr)\n";
    if ($xtab->get_query and $xtab->get_recs) { 
        print $xtab->as_html;
        my $xml_data = $xtab->as_xml;
    }

DESCRIPTION

DBIx::SQLCrosstab produces a SQL query to interrogate a database and generate a cross-tabulation report. The amount of parameters needed to achieve the result is kept to a minimum. You need to indicate which columns and rows to cross and from which table(s) they should be taken. Acting on your info, DBIx::SQLCrosstab creates an appropriate query to get the desired result. Compared to spreadsheet based cross-tabulations, DBIx::SQLCrosstab has two distinct advantages, i.e. it keeps the query in the database work space, fully exploiting the engine capabilities, and does not limit the data extraction to one table.

See http://gmax.oltrelinux.com/cgi-bin/xtab.cgi for an interactive example.

Cross tabulation basics

Cross tabulations are statistical reports where the values from one or more given columns are used as column headers, and GROUP functions are applied to retrieve totals that apply to such values.

     SELECT
        id, name, gender, dept
     FROM
        person
        INNER JOIN depts ON (depts.dept_id=perspn.dept_id)

     +----+--------+--------+-------+
     | id | name   | gender | dept  |
     +----+--------+--------+-------+
     |  1 | John   | m      | pers  |
     |  2 | Mario  | m      | pers  |
     |  7 | Mary   | f      | pers  |
     |  8 | Bill   | m      | pers  |
     |  3 | Frank  | m      | sales |
     |  5 | Susan  | f      | sales |
     |  6 | Martin | m      | sales |
     |  4 | Otto   | m      | dev   |
     |  9 | June   | f      | dev   |
     +----+--------+--------+-------+

A simple example will clarify the concept. Given the above raw data, a count of employees by dept and gender would look something like this:

     +-------+----+----+-------+
     | dept  | m  | f  | total |
     +-------+----+----+-------+
     | dev   |  1 |  1 |     2 |
     | pers  |  3 |  1 |     4 |
     | sales |  2 |  1 |     3 |
     +-------+----+----+-------+

The query to create this result is

    SELECT
        dept,
        COUNT(CASE WHEN gender = 'm' THEN id ELSE NULL END) as m,
        COUNT(CASE WHEN gender = 'f' THEN id ELSE NULL END) as f,
        COUNT(*) as total
    FROM
        person
        INNER JOIN depts ON (person.dept_id = depts.dept_id)
    GROUP BY
        dept

Although this query doesn't look easy, it is actually quite easy to create and the resulting data is straightforward. Creating the query requires advance knowledge of the values for the "gender" column, which can be as easy as m/f or as complex as male/female/unknown/undeclared/ former male/former female/pending (don't blame me. This is a "real" case!). Give the uncertainity, the method to get the column values id to issue a preparatory query

    SELECT DISTINCT gender FROM person

Then we can use the resulting values to build the final query

    my $query = "SELECT dept \n";
    $query .=
            ",COUNT(CASE WHEN gender = '$_' THEN id ELSE NULL END) AS $_ \n"
              for   @$columns;
    $query .= ",COUNT(*) as total \n"
              . "FROM person INNER JOIN depts \n"
              . "ON (person.dept_id=depts.dept_id) \n"
              . "GROUP BY dept\n";

If you have to do it once, you can just use the above idiom and you are done. But if you have several cases, and your cross-tab has more than one level, then you could put this module to good use. Notice that, to create this query, you needed also the knowledge of which column to test (gender) and to which column apply the GROUP function (id)

Multi-level cross tabulations

If single-level cross tables haven't worried you, multiple level tables should give you something to think. In addition to everything said before, multiple level crosstabs have:

    - query composition complexity. Each column is the combination
      of several conditions, one for each level;
    - column subtotals, to be inserted after the appropriate section;
    - row subtotals, to be inserted after the relevant rows;
    - explosive increase of column number. For a three-level crosstab
      where each level has three values you get 27 columns. If you
      include sub-totals, your number rises to 36. If you have just a few
      levels with five or six values , you may be counting rows by the 
      hundreds;    
    - visualization problems. While the result set from the DBMS
      is a simple matrix, the conceptual table has a visualization tree
      at the top (for columns) and a visualization tree at the left
      side (for rows).

  +----+----+--------------------+--------------------+--+
  | A  | B  |        C1          |       C2           |  |  1
  |    |    +--------------------+--------------------+  |  
  |    |    |   D1   |   D2   |  |   D1   |   D2   |  |  |  2
  |    |    +--------+--------|  +--------+--------+  |  |  
  |    |    |E1 E2 T |E1 E2 T |T |E1 E2 T |E1 E2 T |T |T |  3
  +----+----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  | A1 | B1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  4
  |    |----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  |    | B2 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  5
  |    |----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  |    |  T |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  6
  +----+----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  | A2 | B1 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  7
  |    |----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  |    | B2 |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  8
  |    |----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  |    |  T |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  9
  +----+----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
  | T  | -- |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  | 10
  +----+----+--+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
   a    b    c  d  e  f  g  h  i  g  k  l  m  n  o  p  q

  Some definitions

  columns headers               : 1-3
  column headers 1st level      : 1
  column headers 2nd level      : 2
  column headers 3rd level      : 3

  row headers                   : a, b
  row header 1st level          : a
  row header 2nd level          : b

  row sub totals                : 6, 9
  row total                     : 10
  column sub totals             : e, h, i, l, o, p
  column total                  : q

Column headers choice strategies

The easiest way of choosing columns is to tell DBIx::SQLCrosstab to get the values from a given column and let it extract them and combine the values from the various levels to make the appropriate conditions. Sometimes this is not desirable. For instance, if your column values come from the main dataset, the one that will be used for the crosstab, you are querying the database twice with two possibly expensive queries. Sometimes you can't help it, but there are a few workarounds.

If uou know the values in advance, you can pass them to the SQLCrosstab object, saving one query. This is when the values come from a constraint, for example. You may have a "grade" column and you know that the values can only be "A" to "F", or a "game result" column with values "1", "2", "x", "suspended", "camncelled". Or you can run the full query once, and when you are satisfied that the column values are the ones you know they should be, you pass the values for the subsequent calls.

The list option is also useful when you only want to make a crosstab for a given set of values, rather than having a mammoth result table with values you don't need.

Hidden JOINs

The normal case, in a well normalized database, should be to get the column values from a lookup table. If such table was created dynamically, so that it only contains values referred from the main table, then there is no problem. If, on the contrary, the lookup table was pre-loaded with all possible values for a given column, you may come out with a huge number of values, of which only a few were used. In this case, you need to run the query with a JOIN to the main table, to exclude the unused values.

When yoo use a lookup table, though, you can optimize the main query, by removing a JOIN that you have already used in the preparatory query. Let's see an example.

You have the table "person", which includes "dept_id", a foreign key referencing the table "depts". If you pass SQLCrosstab a column description including a {value} key, it will get from the lookup table both {id} and {value}, so that, instead of creating a main query with columns like

   ,COUNT(CASE WHEN dept = 'pers' THEN id ELSE NULL END) AS 'pers' 

It will create this:

   ,COUNT(CASE WHEN dept_id = '1' THEN id ELSE NULL END) AS 'pers'

    or (depending on the parameters you passed) this:

   ,COUNT(CASE WHEN dept_id = '1' THEN id ELSE NULL END) AS fld001 -- pers 

The difference is that in the first case your final query needs a JOIN to depts, while in the second case it won't need it. Therefore the final query, the expensive one, will be much faster. The reasoning is, once you went through a lookup table to get the distinct values, you should not use that table again in the main query.

Class methods

new

Create a new DBIx::SQLCrosstab object.

    my $xtab = DBIx::SQLCrosstab->new($params)
        or die "creation error $DBIx::SQLCrosstab::errstr\n"

$params is a hash reference containing at least the following parameters:

    dbh     
            either a valid database handler (DBI::db) or a hash reference
            with the appropriate parameters to create one

            dbh =>  {
                        dsn      => "dbi:driver:database",
                        user     => "username",
                        password => "secretpwd",
                        params   => {RaiseError => 1}
                    }

    op      
            the operation to perform (SUM, COUNT, AVG, MIN, MAX, STD,
            VAR, provided that your DBMS supports them)
            and the column to summarize. It must be an array reference,
            with each item a pair of operation/column.
            E.g.: 
            op => [ [ 'COUNT', 'id'], ['SUM', 'salary'] ],
        
            *** WARNING ***
            Use of this parameter as a scalar is still supported
            but it is DEPRECATED.
    

    op_col  
            The column on which to perform the operation
            *** DEPRECATED ***
            Use {op} as an array reference instead.

    from    
            Where to get the data. It could be as short as
            a table name or as long as a comlex FROM statement
            including INNER and OUTER JOINs. The syntax is not 
            checked. It must be accepted by the DBMS you are 
            using underneath.

    rows    
            a reference to an array of hashes, each
            defining one header for a crosstab row level.
            The only mandatory key is  
                {col}  identifying the column name
            Optionally, you can add an {alias} key, to be used
            with the AS keyword. 

    cols    
            a reference to an array of hashes, each defining
            one header for a cross tab column level.
            Two keys are mandatory
                {id}      the column name
                {from}    where to get it from.
                          If the {group} option is
                          used, the other columns can have
                          a value of "1" instead.

            Optionally, the following keys can be added

                {group}   If this option is set, then all the
                          columns are queried at once, with the
                          {from} statement of the first column
                          definition. 

                {alias}   an alias for the column name. Useful
                          for calculated fields.
                {value}   an additional column, related to {id}
                          whose values you want to use instead of
                          {id} as column headers. See below "The
                          hidden join" for more explanation.
                {col_list}
                         Is a referenece to an array of values
                         that will be used as column headers, instead
                         of querying the database. If you know the
                         values in advance, or if you want to use only
                         a few known ones, then you can specify them
                         in this list. Each element in col_list must be
                         a hash reference with at least an {id} key. A 
                         {value} key is optional.
                {exclude_value}
                         Is a reference to an array of values to exclude
                         from the headers. Unlike the general option
                         "col_exclude", this option will remove all the
                         combinations containing the given values.

                {where}   to limit the column to get
                {orderby} to order the columns in a different
                          way.


    The following parameters are optional.

    where   
            a WHERE clause that will be added to the resulting query
            to limit the amount of data to fetch.

    having  
            Same as WHERE, but applies to the grouped values

    add_op  
            either a scalar or an array reference containing one or
            more functions to be used in addition to the main 'op'.
            For example, if 'op' is 'COUNT', you may set add_op to
            ['SUM', 'AVG'] and the crosstab engine will produce 
            a table having the count, sum and average of the 
            value in 'op_col'.
            *** DEPRECATED *** Use {op} as an array reference instead.

    title   
            A title for the crosstab. Will be used for HTML and XML
            creation

    remove_if_null
    remove_if_zero
            Remove from the record set all the columns where all
            values are respectively NULL or zero. Notice that there 
            is a difference between a column with all zeroes and a 
            column with all NULLs.
            All zeroes with a SUM means that all the values were 0,
            while all NULLs means that no records matching the given 
            criteria were met.
            However, it also depends on the DBMS. According to ANSI
            specifications, SUM/AVG(NULL) should return NULL, while
            COUNT(NULL) should return 0. Rather than assuming
            strict compliance, I leave you the choice.

    col_exclude
            Is a reference to an array of columns to be excluded from
            the query. The values must be complete column names.
            To know the column names, you can use the "add_real_names"
            option and then the get_query method.        

    add_real_names
            Add the real column names as comments to the query text.
            In order to avoid conflicts with the database, the default
            behavior is to create fake column names (fld001, fld002, etc)
            that will  be later replaced. 
            This feature may cause problems with the database engines
            that don't react well to embedded comments.

    use_real_names
            use the real column values as column names. This may be a
            problem if the column value contains characters that are not
            allowed in column names. Even though the names are properly
            quoted, it id not 100% safe.

    row_total
    row_sub_total
            If activated, adds a total row at the end of the result set
            or the total rows at the end of each row level. Your DBMS
            must support the SQL UNION keyword for this option to work.

            ********
            CAVEAT!
            ********

            Be aware that these two options will double the server load
            for each row level beyond 1, plus one additional query for
            the grand total.
            The meaning of this warning is that the query generated
            by DBIx::SQLCrosstab will contain one UNION query with a
            different GROUP BY clause for each row level. The grand
            total is a UNION query without GROUP BY clause. If your
            dataset is several million records large, you may consider
            skipping these options and perform subtotals and grand 
            total in the client.
            For less than one million records, any decent database
            engine should be able to execute the query in an acceptable
            timeframe.

    col_total
    col_sub_total
            If activated, add a total column at the end of the result set
            or the total columns at the end of each column level. 

    RaiseError
            If actviated, makes all errors fatal. Normally, errors are 
            trapped and recorded in $DBIx::SQLCrosstab. RaiseError will 
            raise an exception instead.

    PrintError
            If activated, will issue a warning with the message that 
            caused the exception, but won't die.

    ************************************************
    The following options only apply when creating a 
    DBIx::SQLCrosstab::Format object.
    ************************************************

    commify
            Used for HTML and XML output. If true, will insert commas
            as thousand separators in all recordset numbers.

    complete_html_page
            Returns HTML header and end tags, so that the resulting
            text is a complete HTML page.

    only_html_header
            Returns only the header part of the table, without records.
            Useful to create templates.

    add_colors
            If true, default colors are applied to the resulting
            table.
            text    => "#009900", # green
            number  => "#FF0000", # red
            header  => "#0000FF", # blue
            footer  => "#000099", # darkblue

    text_color
    number_color
    header_color
    footer_color
            Change the default colors to custom ones  

     table_border       
     table_cellspacing 
     table_cellpadding 
            Change the settings for HTML table borders. Defaults are:
            border      => 1
            cellspacing => 0
            cellpadding => 2
set_param

Allows to set one or more parameters that you couldn't pass with the constructor.

       $xtab->set_param( cols => [ { id => 'dept', from => 'departments' } ]  )
            or die "error setting parameter: DBIx::SQLCrosstab::errstr\n";

       $xtab->set_param( 
                            remove_if_null => 1,
                            remove_if_zero => 1,
                            title          => 'Some nice number crunching'
                        )
            or die "error setting parameter: DBIx::SQLCrosstab::errstr\n";

You can use this method together with a dummy constructor call:

        my $xtab = DBIx::SQLCrosstab->new ('STUB')
            or die "can't create ($DBIx::SQLCrosstab::errstr)\n";

        $xtab->set_param( 
                          dbh    => $dbh,
                          op     => 'SUM',
                          op_col => 'amount',
                          cols   => $mycolumns,
                          rows   => $myrows,
                          from   => 'mytable'
                          )
            or die "error setting parameter: DBIx::SQLCrosstab::errstr\n";
get_params

Returns a string containing te parameters to replicate the current DBIx::SQLCrosstab object. The data is represented as Perl code, and it can be evaluated as such. The variable's name is 'params'. It does not include the 'dbh' parameter.

    my $params = $xtab->get_params
        or warn "can't get params ($DBIx::SQLCrosstab::errstr)";
save_params

Saves the parameters necessary to rebuild the current object to a given file. This function stores what is returned by get_params into a text file. Notice that the 'dbh' option is not saved.

    unless ($xtab->save_params('myparams.pl')
        die "can't save current params ($DBIx::SQLCrosstab::errstr)";
load_params

Loads previously saved parameters into the current object. Remember that 'dbh' is not restored, and must be set separately with set_param().

    my $xtab = DBIx::SQLCrosstab->new('stub')
        or die "$DBIx::SQLCrosstab::errstr";
    $xtab->load_params('myparams.pl')
        or die "$DBIx::SQLCrosstab::errstr";
    $xtab->set_param( dbh => $dbh )
        or die "$DBIx::SQLCrosstab::errstr";
get_query

Returns the query to get the final cross-tabulation, or undef in case of errors. Check $DBIx::SQLCrosstab::errstr for the reason. You may optionally pass a parameter for the character to be used as separator between column names. The default is a pound sign ('#'). If the separator character is present in any of the column values (i.e. the values from a candidate column header), the engine will try in sequence '#', '/', '-', '=', doubling them if necessary, and eventually giving up only if all these characters are present in any column values. If this happens, then you need to pass an appropriate character, or group of charecters that you are reasonably sure doesn't recur in column values.

get_recs

Executes the query and returns the recordset as an arrayref of array references, or undef on failure. After this method is called, several attributes become available: - recs the whole recordset - NAME an arrayref with the list of column names - LENGTH an arrayref with the maximum size of each column - NUM_OF_FIELDS an integer with the number of felds in the result set

Class attributes

There are attributes that are available for external consumption. Like the DBI, these attributes become available after a given event.

{NAME}

This attribute returns the raw column names for the recordset as an array reference. Even if {use_real_names} was not defined, this attribute returns the real names rather than fld001, fld002, and so on. It is available after get_recs() was called.

    my $fnames = $xtab->{NAMES};
    print "$_\n" for @{$fnames};
{recs}

This attribute contains the raw recordset as returned from the database. Available after get_recs().

{NUM_OF_FIELDS}

The number of fields in the recordset. Available after get_recs().

{LENGTH}

Contains an array reference to the maximum lengths of each column. The length is calculated taking into account the length of the column name and the length of all values in that column. Available after get_recs().

DEPENDENCIES

    DBI
    a DBD driver
    Tree::DAG_Node

EXPORT

None by default.

AUTHOR

Giuseppe Maxia (<gmax_at_cpan.org>)

SEE ALSO

DBI

An article at OnLamp, "Generating Database Server-Side Cross Tabulations" (http://www.onlamp.com/pub/a/onlamp/2003/12/04/crosstabs.html) and one at PerlMonks, "SQL Crosstab, a hell of a DBI idiom" (http://www.perlmonks.org/index.pl?node_id=313934).

COPYRIGHT

Copyright 2003 by Giuseppe Maxia (<gmax_at_cpan.org>)

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.