NAME
DataCube::Schema - An Object Oriented Perl Module for creating Snowflake Schemas.
SYNOPSIS
use strict;
use warnings;
use DataCube::Schema;
# the new constructor
my $schema = DataCube::Schema->new;
# basic: adding dimensions, hierarchies and measures
$schema->add_dimension('country');
$schema->add_dimension('product');
$schema->add_dimension('salesperson');
$schema->add_hierarchy('year','quarter','month','day');
$schema->add_measure('sum','units_sold');
$schema->add_measure('sum','dollar_volume');
$schema->add_measure('average','price_per_unit');
# advanced: adding strict dimensions / hierarchies
$schema->add_strict_dimension('country');
$schema->add_strict_hierarchy('year','quarter','month','day');
# advanced: suppressing lattice points
$schema->suppress_lattice_point('country','salesperson');
DESCRIPTION
This module provides a pure perl, object oriented, embeddable Star and Snowflake Schema engine. It is self contained and ready to use in data mining and data warehousing applications.
All schemas created by this module are Snowflake Schemas.
BACKGROUND
Star and Snowflake Schemas are used to organize dimensions and measurements in Data Warehouses. Snowflake Schemas encompass all the functionality of Star Schemas, and provide direct support for hierarchies.
STAR SCHEMAS
In a Star Schema, many peripheral tables of data are joined to one central table called the "Fact Table". Each peripheral table represents a single dimension.
SNOWFLAKE SCHEMAS
The Snowflake Schema is an extension of the Star Schema, in which each peripheral "Dimension Table" holding hierarchical data is replaced by a group of tables representing that hierarchy.
To illustrate the difference, consider a single table in a Star Schema which contains a date field of the form "year/month/day". In a Snowflake Schema, this table would be replaced by 3 tables: one containg the year, one containing the month, and one containing the day, all linked together by a primary key / foreign key relationship.
FACT TABLES
A special table called the "Fact Table" resides in the middle of both Star and Snowflake Schemas. Fact Tables contain a single row of data for each factual event logged during the course of business.
The Fact Table contains redundant and repetitious data (usually in Second Normal Form) and is therefore subject to update anamolies. For this reason, Snowflake Schemas should rarely be used for high performance Relational Databases. They should be used, however, to design "Dimensional Databases" and Data Warehouses, where such redundancy allows for extreme performance gains for complex sql queries, especially those containing aggregation functions on hierarchical relationships.
BASIC OPERATIONS
This module provides several methods to design Snowflake Schemas.
add_dimension
This method adds a single dimension to a schema.
add_measure
This method adds a single measures to the cubes measure table.
Supported measures inlcude:
count
min [field name]
max [field name]
sum [field name]
count [field name]
average [field name]
product [field name]
multi_count [field name]
Here is a description of each measure:
count
init_value 0
update_rule ++
report_format integer
additivity additive
declaration $schema->add_measure('count')
description the number of times a dimensional tuple has been inserted into the cube
min
init_value undef
update_rule = if < or undefined
report_format decimal
additivity additive
declaration $schema->add_measure('min','field')
description the minimal value of inserted numbers from 'field'
max
init_value undef
update_rule = if > or undefined
report_format decimal
additivity additive
declaration $schema->add_measure('max','field')
description the maximal value of inserted numbers from 'field'
sum
init_value 0
update_rule +=
report_format decimal
additivity additive
declaration $schema->add_measure('sum','field')
description the sum of inserted numbers from 'field'
product
init_value 1
update_rule *=
report_format decimal
additivity additive (ie separable)
declaration $schema->add_measure('product','field')
description the multiplication of inserted numbers from 'field'
average
init_value 0
update_rule {average}->{$field}->{sum_total} += $field_value;
{average}->{$field}->{observations}++
report_format decimal (sum_total / observations)
additivity non-additive
declaration $schema->add_measure('average','field')
description the average of inserted values from 'field'
count (distinct)
init_value {} (empty hashref)
update_rule {count}->{$field}->{$field_value} = undef
report_format integer (ie scalar(keys(%{{count}->{$field}->{$field_value}})))
additivity non-additive
declaration $schema->add_measure('count','field')
description the count distinct of inserted values from 'field'
multi_count (distinct with multiplicity)
init_value {} (empty hashref)
update_rule {count}->{$field}->{$field_value}++
report_format integer (same as count)
additivity non-additive
declaration $schema->add_measure('multi_count','field')
description the count distinct of inserted values from 'field', also stores the *number of times* that $field_value was 'uniquefied'
add_hierarchy
This method adds a single hierarchy to a schema. Hierarchies are like Dimensions, except that aggregate measures will be computed on complete Parent - Child chains.
For example, consider the hierarchy "year", "month", "day" and the measure "sum" of "dollars".
The following code:
$schema->add_measure('sum','dollars');
$schema->add_hierarchy('year','month','day');
$cube = DataCube->new($schema);
# a bunch of data is fed to the cube
#
# [...]
#
# and some time later:
$cube->rollup;
$cube->report;
will create the following reports:
1. sum_of_dollars
2. sum_of_dollars by year
3. sum_of_dollars by year, month
4. sum_of_dollars by year, month, day
as it probably should.
ADVANCED OPERATIONS
add_strict_dimension
This method adds a single dimension to a schema, over which no aggregation will be performed.
For example, consider the following code:
$schema->add_dimension('product');
$schema->add_strict_dimension('country');
$schema->add_measure('sum','dollars');
$cube = DataCube->new($schema);
# a bunch of data is fed to the cube
#
# [...]
#
# and some time later:
$cube->rollup;
$cube->report;
will create the following reports:
1. sum_of_dollars by country
2. sum_of_dollars by country, product
Notice that the datacube did not produce the sum_of_dollars irrespective of country.
add_strict_hierarchy
This method adds a single hierarchy to a schema. No aggregation will be performed over the top-most dimension.
suppress_lattice_point
This method suppresses specific rollups / reports from being created during a call to rollup, which may lead to a saving of both time and space.
For example, consider the following code:
$schema->add_measure('sum','dollars');
$schema->add_hierarchy('year','month','day');
$schema->suppress_lattice_point('year','month');
$cube = DataCube->new($schema);
# a bunch of data is fed to the cube
#
# [...]
#
# and some time later:
$cube->rollup;
$cube->report;
will create the following reports:
1. sum_of_dollars
2. sum_of_dollars by year
3. sum_of_dollars by year, month, day
assert_lattice_point
This method restricts a datacube to only the specified list of dimensions during rollup.
This method superscedes all others except for add_strict_dimension and add_strict_hierarchy, and may lead to a saving of both time and space.
For example, consider the following code:
$schema->add_measure('sum','dollars');
$schema->add_hierarchy('year','month','day');
$schema->assert_lattice_point('overall');
$schema->assert_lattice_point('year','month');
$cube = DataCube->new($schema);
# a bunch of data is fed to the cube
#
# [...]
#
# and some time later:
$cube->rollup;
$cube->report;
will create the following reports:
1. sum_of_dollars
2. sum_of_dollars by year, month
If you do this:
$schema->add_measure('sum','dollars');
$schema->add_strict_hierarchy('year','month','day');
$schema->assert_lattice_point('overall');
$schema->assert_lattice_point('year','month');
you will not get the report
1. sum_of_dollars
because the method call
$schema->add_strict_hierarchy('year','month','day');
confines 'year' to always be present.
When in doubt, do not use 'assert_lattice_point' in the presence of the other lattice assertions (such as 'strict' and 'suppress').
confine_to
This method restricts a datacube to only the specified list of dimensions and superscedes all other methods.
The base table becomes fixed to the confined point and no rollup occurs even if called.
For example, consider the following code:
my $schema = DataCube::Schema->new;
$schema->add_dimension('country');
$schema->add_dimension('product');
$schema->add_dimension('salesperson');
$schema->add_hierarchy('year','quarter','month','day');
$schema->add_measure('sum','units_sold');
$schema->add_measure('sum','dollar_volume');
$schema->add_measure('average','price_per_unit');
$schema->confine_to('country','product','year');
my $cube = DataCube->new($schema);
will create a cube with only one table (the base table: 'country','product','year') and only one report:
1. sum_of_dollars etc. by country, product, year
EXPORT
None
SEE ALSO
Wikipedia on Snowflake Schema:
http://en.wikipedia.org/wiki/Snowflake_schema
AUTHOR
David Williams, <david@namimedia.com>
COPYRIGHT AND LICENSE
Copyright (C) 2009 by David Williams
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.8 or, at your option, any later version of Perl 5 you may have available.