
NAME

Graphics::Skullplot::ClassifyColumns - simple type inference of columns of tabular data

VERSION

Version 0.01

SYNOPSIS

  use Graphics::Skullplot::ClassifyColumns;

  my $cc = Graphics::Skullplot::ClassifyColumns->new( data => $data );  
  my $plot_cols = 
    $cc->classify_columns_simple( { indie_count => $indie_count, } );

DESCRIPTION

Graphics::Skullplot::ClassifyColumns is a stripped-down version of Data::Classify, an old experimental module of mine. I expect to go back to that project and develop a more elaborate system of plug-ins to target different kinds of databases and so on, most likely under the name Table::TypeInference.

This particular module just needs a "classify_columns_simple" routine that works well enough to figure out how to plot some data via ggplot2 in R (i.e. the "Graphics::Skullplot" project).

new

Creates a new Graphics::Skullplot::ClassifyColumns object.

Takes a hashref as an argument, with named fields identical to the names of the object attributes. These attributes are:

data

A required field: the tabular data as an array of array references, with a header in the first row.
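As an illustration of that layout, here is a small sample. The column names and values are made up for this sketch; only the array-of-arrays shape with a header row comes from the description above.

```perl
use strict;
use warnings;

# Hypothetical sample data: the first row is the header, and each
# following row is one record.
my $data = [
    [ 'date',       'region', 'sales' ],
    [ '2018-05-01', 'east',   120     ],
    [ '2018-05-01', 'west',   95      ],
    [ '2018-05-02', 'east',   130     ],
];

my @header = @{ $data->[0] };                 # column names
my @rows   = @{$data}[ 1 .. $#{$data} ];      # data rows only

printf "%d columns, %d rows\n", scalar @header, scalar @rows;
```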

classify_columns_simple

Note: here "simple" might be thought of as "stub". This does the simplest possible categorization, using only a single numeric hint: the number of independent fields.

The presumption here is that the incoming data is organized like the output of a typical SQL GROUP BY select: the x-axis in the first column, a number of columns of dependent data at the end, and (possibly) a certain number of categorical variables (ones with a small number of allowed values) in between.

This returns a hash reference indicating how different columns should be handled in the plotting stage. The keys are:

  x            (rename: indie_x)
  y            only set when there is a single dependent column
  gb_cats
  dep_fields   (rename: dependents_y)
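The column layout described above can be sketched roughly as follows. This is not the module's code: the function name `sketch_classify_columns` is hypothetical, and it assumes that the first `indie_count` columns are independent (x first, then any group-by categories) while every remaining column is dependent.

```perl
use strict;
use warnings;

# Hypothetical sketch, not the module's code. Assumes the first
# $indie_count columns are independent (x first, then any group-by
# categories) and every remaining column is dependent.
sub sketch_classify_columns {
    my ( $header, $indie_count ) = @_;
    $indie_count = 1 unless defined $indie_count;   # assumed default

    my @cols    = @{$header};
    my @indies  = splice @cols, 0, $indie_count;    # leading independent columns
    my $x       = shift @indies;                    # first column is the x-axis
    my @gb_cats = @indies;                          # anything in between
    my @deps    = @cols;                            # the rest is dependent data

    my %plot_cols = (
        x          => $x,
        gb_cats    => \@gb_cats,
        dep_fields => \@deps,
    );
    # y is only set when there is exactly one dependent column
    $plot_cols{y} = $deps[0] if @deps == 1;

    return \%plot_cols;
}

my $plot_cols = sketch_classify_columns( [qw( date region sales )], 2 );
print "x=$plot_cols->{x} y=$plot_cols->{y} gb_cats=@{ $plot_cols->{gb_cats} }\n";
```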

Example usage:

  my $cc = Graphics::Skullplot::ClassifyColumns->new( data => $data );  
  my $opt = { indie_count => 1, };
  my $plot_cols_href = 
    $cc->classify_columns_simple( $opt );

column_types

Given a reference to tabular data in array-of-arrays format (with a header expected in the first row), tries to infer the rough data type of each column.

Returns a list (or aref) of the type codes, in sequence.
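One way such per-column inference could work is a majority vote over the cell types in each column. The sketch below is an assumption about the approach, not the module's implementation: `sketch_classify` is a hypothetical stand-in that only separates numbers from strings, and `sketch_column_types` is likewise a made-up name.

```perl
use strict;
use warnings;
use Scalar::Util qw( looks_like_number );

# Hypothetical stand-in for the module's "classify": only tells
# numbers from strings.
sub sketch_classify {
    my ($value) = @_;
    return ':NUMBER:' if defined $value && looks_like_number($value);
    return ':STRING:';
}

# Tally the type of each cell per column (skipping the header row),
# then report the most frequent type for each column. The majority
# vote is an assumption about how per-column types might be chosen.
sub sketch_column_types {
    my ($data) = @_;
    my @rows = @{$data}[ 1 .. $#{$data} ];

    my @tallies;    # one hash of type => count per column
    for my $row (@rows) {
        for my $i ( 0 .. $#{$row} ) {
            $tallies[$i]{ sketch_classify( $row->[$i] ) }++;
        }
    }

    my @types;
    for my $tally (@tallies) {
        my ($winner) = sort { $tally->{$b} <=> $tally->{$a} } keys %{$tally};
        push @types, $winner;
    }
    return \@types;
}

my $types = sketch_column_types( [
    [ 'name', 'count' ],
    [ 'a',    1       ],
    [ 'b',    2       ],
    [ 'c',    'n/a'   ],
] );
print join( ' ', @{$types} ), "\n";
```

In the sample, the second column holds one non-numeric cell, yet the vote still reports it as numeric because two of its three cells are numbers.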

classify

A wrapper around Scalar::Classify's "classify", which also subdivides the string category, looking for datetime types.

The type is most often (but not limited to) one of the following:

   ARRAY
   HASH
   :NUMBER:
   :STRING:

This code examines any string values to see if a date/time code is more appropriate:

   :DATE: 
   :DATETIME: 
   :TIME:

most_common

Given a hash of numeric counts, returns the key of the maximum count.

In the case of a tie, one of the tied keys is returned; which one is undefined.
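A minimal sketch of this behavior, using the hypothetical name `sketch_most_common`. Because Perl does not fix the order in which hash keys are visited, a tie may resolve to either of the tied keys, which matches the caveat above.

```perl
use strict;
use warnings;

# Hypothetical sketch: return the key with the largest count.
# Hash key order is not fixed in Perl, so a tie may resolve to
# either of the tied keys.
sub sketch_most_common {
    my ($counts) = @_;
    my ( $best_key, $best_count );
    for my $key ( keys %{$counts} ) {
        if ( !defined $best_count || $counts->{$key} > $best_count ) {
            ( $best_key, $best_count ) = ( $key, $counts->{$key} );
        }
    }
    return $best_key;
}

my %counts = ( ':NUMBER:' => 7, ':STRING:' => 2, ':DATE:' => 1 );
print sketch_most_common( \%counts ), "\n";
```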

define_regxeps

Generates a hashref of locally useful regexps.

These are mostly intended to identify dates and times. TODO just look up existing solutions, e.g. Regexp::Common.
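As a rough illustration of what such a hashref might contain, the sketch below assumes ISO 8601-style values such as "2018-05-22" and "14:30:00". The patterns, the key names, and both function names are hypothetical; the module's own regexps may well differ.

```perl
use strict;
use warnings;

# Hypothetical patterns, assuming ISO 8601-style values such as
# "2018-05-22" and "14:30:00"; the module's own regexps may differ.
sub sketch_define_regexps {
    my $date = qr{ \d{4} - \d{2} - \d{2} }x;
    my $time = qr{ \d{2} : \d{2} (?: : \d{2} )? }x;
    return {
        date     => qr{ \A ${date} \z }x,
        time     => qr{ \A ${time} \z }x,
        datetime => qr{ \A ${date} [T ] ${time} \z }x,   # "T" or a space between
    };
}

# Refine a string value into a date/time code where a pattern matches.
sub sketch_string_type {
    my ( $value, $re ) = @_;
    return ':DATETIME:' if $value =~ $re->{datetime};
    return ':DATE:'     if $value =~ $re->{date};
    return ':TIME:'     if $value =~ $re->{time};
    return ':STRING:';
}

my $re = sketch_define_regexps();
print sketch_string_type( $_, $re ), "\n"
    for '2018-05-22', '14:30', '2018-05-22 14:30:00', 'hello';
```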

AUTHOR

Joseph Brenner, <doom@kzsu.stanford.edu>, 22 May 2018

COPYRIGHT AND LICENSE

Copyright (C) 2018 by Joseph Brenner

This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.

No warranty is provided with this code.

See http://dev.perl.org/licenses/ for more information.