The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Parse::Taxonomy - Validate hierarchical data stored in CSV format

VERSION

This document refers to version 0.16 of Parse::Taxonomy. This version was released October 18 2015.

SYNOPSIS

    use Parse::Taxonomy;

DESCRIPTION

This module is the base class for the Parse-Taxonomy extension to the Perl 5 programming language. You will not instantiate objects of this class; rather, you will instantiate objects of subclasses, of which Parse::Taxonomy::MaterializedPath and Parse::Taxonomy::AdjacentList are the first.

This is an BETA release. The documented interfaces are expected to remain stable but are not guaranteed to remain so.

Taxonomy: definition

For the purpose of this library, a taxonomy is defined as a tree-like data structure with a root node, zero or more branch (child) nodes, and one or more leaf nodes. The root node and each branch node must have at least one child node, but leaf nodes have no child nodes. The number of branches between a leaf node and the root node is variable.

Diagram 1:

                               Root
                                |
                  ----------------------------------------------------
                  |                            |            |        |
               Branch                       Branch       Branch     Leaf
                  |                            |            |
       -------------------------         ------------       |
       |                       |         |          |       |
    Branch                  Branch     Leaf       Leaf   Branch
       |                       |                            |
       |                 ------------                       |
       |                 |          |                       |
     Leaf              Leaf       Leaf                    Leaf

Taxonomy File: definition

For the purpose of this module, a taxonomy file is a CSV file in which (a) certain columns hold data from which the position of each record within the taxonomy can be derived; and (b) each node in the tree (with the possible exception of the root node) is uniquely represented by a record within the file.

CSV

"CSV", strictly speaking, refers to comma-separated values:

    path,nationality,gender,age,income,id_no

For the purpose of this module, however, the column separators in a taxonomy file may be any user-specified character handled by the Text-CSV_XS library on CPAN. Formats frequently observed are tab-separated values:

    path        nationality     gender  age     income  id_no

and pipe-separated values:

    path|nationality|gender|age|income|id_no

The documentation for Text-CSV comments that the CSV format could "... perhaps better [be] called ASV (anything separated values)", but we shall for convenience use "CSV" herein regardless of the specific delimiter.

Since it is often the case that the characters used as column separators may occur within the data recorded in the columns as well, it is customary to quote either all columns:

    "path","nationality","gender","age","income","id_no"

... or, at the very least, all columns which can hold data other than pure integers or floating-point numbers:

    "path","nationality","gender",age,income,id_no

Tree structure

To qualify as a taxonomy file, it is not sufficient for a file to be in CSV format. In each non-header record in that file, there must be one or more columns which hold data capable of exactly specifying the record's position in the taxonomy, i.e., the route from the root node to the node being represented by that record.

The precise way in which certain columns are used to determine the path from the root node to a given node is what differentiates various types of taxonomy files from one another. In Parse-Taxonomy we identify two different flavors of taxonomy files and provide a class for the construction of each.

Taxonomy-by-materialized-path

A taxonomy-by-materialized-path is one in which a single column -- which we will refer to as the path column -- serves as a materialized path. A materialized path represents the route from the root to the given record as a series of strings joined by separator characters. Within that path column the value corresponding to the root node need not be specified, i.e., may be represented by an empty string.

Let's rewrite Diagram 1 with values to make this clear.

Diagram 2:

                               ""
                                |
                  ----------------------------------------------------
                  |                            |            |        |
                Alpha                        Beta         Gamma    Delta
                  |                            |            |
       -------------------------         ------------       |
       |                       |         |          |       |
    Epsilon                  Zeta       Eta       Theta   Iota
       |                       |                            |
       |                 ------------                       |
       |                 |          |                       |
     Kappa            Lambda        Mu                      Nu

Let us suppose that our taxonomy file held comma-separated, quoted records. Let us further supposed that the column holding taxonomy paths was, not surprisingly, called path and that the separator within the path column was a pipe (|) character. Let us further suppose that for now we are not concerned with the data in any columns other than path so that, for purpose of illustration, they will hold empty (albeit quoted) strings.

Then the taxonomy file describing the tree in Diagram 2 would look like this:

    "path","nationality","gender","age","income","id_no"
    "|Alpha","","","","",""
    "|Alpha|Epsilon","","","","",""
    "|Alpha|Epsilon|Kappa","","","","",""
    "|Alpha|Zeta","","","","",""
    "|Alpha|Zeta|Lambda","","","","",""
    "|Alpha|Zeta|Mu","","","","",""
    "|Beta","","","","",""
    "|Beta|Eta","","","","",""
    "|Beta|Theta","","","","",""
    "|Gamma","","","","",""
    "|Gamma|Iota","","","","",""
    "|Gamma|Iota|Nu","","","","",""
    "|Delta","","","","",""

Note that while in the |Gamma branch we ultimately have only one leaf node, |Gamma|Iota|Nu, we require separate records in the taxonomy file for |Gamma and |Gamma|Iota. To put this another way, the existence of a Gamma|Iota|Nu leaf must not be assumed to "auto-vivify" |Gamma and |Gamma|Iota nodes. Each non-root node must be explicitly represented in the taxonomy file for the file to be considered valid.

Note further that there is no restriction on the values of the components of the path across records. It only the full path that must be unique. Let us illustrate that by modifying the data in Diagram 2:

Diagram 3:

                               ""
                                |
                  ----------------------------------------------------
                  |                            |            |        |
                Alpha                        Beta         Gamma    Delta
                  |                            |            |
       -------------------------         ------------       |
       |                       |         |          |       |
    Epsilon                  Zeta       Eta       Theta   Iota
       |                       |                            |
       |                 ------------                       |
       |                 |          |                       |
     Kappa            Lambda        Mu                    Delta

Here we have two leaf nodes each named Delta. However, we follow different paths from the root node to get to each of them. The taxonomy file representing this tree would look like this:

    "path","nationality","gender","age","income","id_no"
    "|Alpha","","","","",""
    "|Alpha|Epsilon","","","","",""
    "|Alpha|Epsilon|Kappa","","","","",""
    "|Alpha|Zeta","","","","",""
    "|Alpha|Zeta|Lambda","","","","",""
    "|Alpha|Zeta|Mu","","","","",""
    "|Beta","","","","",""
    "|Beta|Eta","","","","",""
    "|Beta|Theta","","","","",""
    "|Gamma","","","","",""
    "|Gamma|Iota","","","","",""
    "|Gamma|Iota|Delta","","","","",""
    "|Delta","","","","",""

Taxonomy-by-adjacent-list

A taxonomy-by-adjacent-list is one in which each record has a column with a unique identifier (id) and another column holding the unique identifier of the record representing the next higher node in the hierarchy (parent_id). The record must also have a column which holds a datum that is unique among all records having the same parent node.

Let's make this clearer by rewriting the taxonomy-by-materialized-path above for Example 3 as a taxonomy-by-adjacent-list.

    "id","parent_id","name","nationality","gender","age","income","id_no"
    1,,"Alpha","","","","",""
    2,1,"Epsilon","","","","",""
    3,2,"Kappa","","","","",""
    4,1,"Zeta","","","","",""
    5,4,"Lambda","","","","",""
    6,4,"Mu","","","","",""
    7,,"Beta","","","","",""
    8,7,"Eta","","","","",""
    9,7,"Theta","","","","",""
    10,,"Gamma","","","","",""
    11,10,"Iota","","","","",""
    12,11,"Delta","","","","",""
    13,,"Delta","","","","",""

In the above taxonomy-by-adjacent-list, the records with ids 1, 7, 10, and 13 are top-level nodes. They have no parents, so the value of their parent_id column is null or, in Perl terms, an empty string. The records with ids 2 and 4 are children of the record with id of 1. The record with id 3 is, in turn, a child of the record with id 2.

In the above taxonomy-by-adjacent-list, close inspection will show that no two records with the same parent_id share the same name. The property of uniqueness of sibling names means that we can construct a non-indexed version of the path from the root to a given node by using the parent_id column in a given record to look up the name of the record with the id value identical to the child's parent_id.

    Via index: 3        2       1

    Via name:  Kappa    Epsilon Alpha

We go from id 3 to its parent_id, <2>, then to 2's parent_id, <1>. Putting names to this, we go from Kappa to Epsilon to Alpha.

Now, reverse the order of those names, throw a pipe delimiter before each of them and join them into a single string, and you get:

    |Alpha|Epsilon|Kappa

... which is the value of the path column in the third record in the taxonomy-by-materialized-path displayed previously.

With correct data, a given hierarchy of data can therefore be represented either by a taxonomy-by-materialized-path or by a taxonomy-by-adjacent-list. This permits us to describe these two taxonomies as equivalent to each other.

Taxonomy Validation

Each Parse::Taxonomy subclass will have a constructor, new(), whose principal interface will take the name of a taxonomy file as an argument. We will call this interface the file interface to the constructor. The purpose of the constructor will be to determine whether the taxonomy file holds a valid taxonomy according to the description provided above. The arguments needed for such a constructor will be found in the documentation of the subclass.

The constructor of a Parse::Taxonomy subclass may, if desired, accept a different set of arguments. Suppose you have already read a CSV file and parsed it into one array reference holding its header row -- a list of its columns -- and a second array reference, this one being an array of arrays where each element holds the data in one record in the CSV file. You have the same components needed to validate the taxonomy that you would get by parsing the CSV file, so your subclass may implement a components interface as well as a file interface.

You should now proceed to read the documentation for Parse::Taxonomy::MaterializedPath and Parse::Taxonomy::AdjacentList.

BUGS

There are no bug reports outstanding on Parse::Taxonomy as of the most recent CPAN upload date of this distribution.

SUPPORT

Please report any bugs by mail to bug-Parse-Taxonomy@rt.cpan.org or through the web interface at http://rt.cpan.org.

AUTHOR

James E. Keenan (jkeenan@cpan.org). When sending correspondence, please include 'Parse::Taxonomy' or 'Parse-Taxonomy' in your subject line.

Creation date: May 24 2015. Last modification date: October 18 2015.

Development repository: https://github.com/jkeenan/parse-taxonomy

REFERENCES

DBIx-Class-MaterializedPath by Arthur Axel "fREW" Schmidt

DBIx-Tree-MaterializedPath by Larry Leszczynski

Trees in SQL: Nested Sets and Materialized Path by Vadim Tropashko

DBIx-Tree, now maintained by Ron Savage.

COPYRIGHT

Copyright (c) 2002-15 James E. Keenan. United States. All rights reserved. This is free software and may be distributed under the same terms as Perl itself.

DISCLAIMER OF WARRANTY

BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE ''AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.