Parse::Taxonomy::AdjacentList - Extract a taxonomy from a hierarchy inside a CSV file
    use Parse::Taxonomy::AdjacentList;

    $source = "./t/data/alpha.csv";
    $obj = Parse::Taxonomy::AdjacentList->new( {
        file => $source,
    } );
new()
Purpose
Parse::Taxonomy::AdjacentList constructor.
Arguments
Single hash reference. There are two possible interfaces: file and components.
    $source = "./t/data/delta.csv";
    $obj = Parse::Taxonomy::AdjacentList->new( {
        file => $source,
    } );
Elements in the hash reference are keyed on:
file
Absolute or relative path to the incoming taxonomy file. Required for this interface.
id_col
The name of the column in the header row under which each data record's unique ID can be found. Defaults to id.
parent_id_col
The name of the column in the header row under which each data record's parent ID can be found. (Will be empty in the case of top-level nodes, as they have no parent.) Defaults to parent_id.
leaf_col
The name of the column in the header row under which, in each data record, there is found a string which differentiates that record from all other records with the same parent ID. Defaults to name.
Text::CSV_XS options
Any other options which could normally be passed to Text::CSV_XS->new() will be passed through to that module's constructor. On the recommendation of the Text::CSV documentation, binary is always set to a true value.
    $obj = Parse::Taxonomy::AdjacentList->new( {
        components => {
            fields       => $fields,
            data_records => $data_records,
        }
    } );
Elements in this hash are keyed on:
components
This element is required for the components interface. Its value is a hash reference with two keys, fields and data_records. fields is a reference to an array holding the field or column names for the data set. data_records is a reference to an array of array references, each of which holds one record or row from the data set.
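As a rough illustration of the shape the components interface expects, the sketch below builds fields and data_records by hand and checks their structure. The check_components helper is hypothetical and not part of Parse::Taxonomy; it only mirrors the structural requirements described above.

```perl
use strict;
use warnings;

# Column names, in header-row order.
my $fields = [ qw( id parent_id name is_actionable ) ];

# One array reference per record; top-level nodes have an empty parent_id.
my $data_records = [
    [ 1, '', 'Alpha',   0 ],
    [ 2, '', 'Beta',    0 ],
    [ 3, 1,  'Epsilon', 0 ],
    [ 4, 3,  'Kappa',   1 ],
];

# Hypothetical helper (not part of Parse::Taxonomy): returns 1 when the
# structure has the shape described for the components interface, else 0.
sub check_components {
    my ($components) = @_;
    return 0 unless ref($components) eq 'HASH';
    my ( $f, $d ) = @{$components}{ qw( fields data_records ) };
    return 0 unless ref($f) eq 'ARRAY' and ref($d) eq 'ARRAY';
    for my $rec (@$d) {
        # Each record must be an arrayref with one value per field.
        return 0 unless ref($rec) eq 'ARRAY' and @$rec == @$f;
    }
    return 1;
}

my $components = { fields => $fields, data_records => $data_records };
print check_components($components) ? "ok\n" : "not ok\n";

# With the module installed, the same structure would be passed as:
#   Parse::Taxonomy::AdjacentList->new( { components => $components } );
```

The module performs its own, more thorough validation inside new(); this sketch is only meant to make the nesting of the two array references concrete.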
Return Value
Parse::Taxonomy::AdjacentList object.
Exceptions
new() will throw an exception under any of the following conditions:
Argument to new() is not a reference.
Argument to new() is not a hash reference.
Argument to new() must have either 'file' or 'components' element but not both.
Header row lacks one or more of the required columns (those named by id_col, parent_id_col and leaf_col).
Non-numeric entry in id or parent_id column.
Duplicate entries in id column.
Number of fields in a data record does not match number in header row.
Empty string in a component column of a record.
Unable to locate a record whose id is the parent_id of a different record.
No records with same parent_id may share value of component column.
file interface
In the file interface, unable to locate the file which is the value of the file element.
The same field is found more than once in the header row of the incoming taxonomy file.
Unable to open or close the incoming taxonomy file for reading.
components interface
In the components interface, components element must be a hash reference with fields and data_records elements.
fields element must be array reference.
data_records element must be reference to array of array references.
No duplicate fields in fields element's array reference.
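Several of the data-level conditions listed above can be sketched in plain Perl. The find_problems function below is a hypothetical illustration, not the module's implementation, and it assumes records laid out as [ id, parent_id, name ]. Where new() throws an exception on the first failure, this sketch collects descriptions of every problem it finds.

```perl
use strict;
use warnings;

# Hypothetical validator: returns a list of problem descriptions for records
# laid out as [ id, parent_id, name ]. An empty list means no problems found.
sub find_problems {
    my ($records) = @_;
    my ( @problems, %seen_id, %seen_sibling );

    for my $rec (@$records) {
        my ( $id, $parent_id, $name ) = @$rec;
        push @problems, "non-numeric id '$id'"
            unless $id =~ /^\d+$/;
        push @problems, "non-numeric parent_id '$parent_id'"
            unless $parent_id eq '' or $parent_id =~ /^\d+$/;
        push @problems, "duplicate id '$id'"
            if $seen_id{$id}++;
        push @problems, "empty name for id '$id'"
            if $name eq '';
        push @problems, "duplicate name '$name' under parent '$parent_id'"
            if $seen_sibling{$parent_id}{$name}++;
    }

    # Every non-empty parent_id must be the id of some record.
    for my $rec (@$records) {
        my ( $id, $parent_id ) = @$rec;
        push @problems, "id '$id' has unknown parent '$parent_id'"
            if $parent_id ne '' and not $seen_id{$parent_id};
    }
    return @problems;
}

my @good = ( [ 1, '', 'Alpha' ], [ 2, 1, 'Epsilon' ] );
my @bad  = ( [ 1, '', 'Alpha' ], [ 1, 9, 'Alpha' ] );

my $good_count = find_problems( \@good );
print "$good_count problems in good data\n";
print "$_\n" for find_problems( \@bad );
```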
fields()
Purpose
Identify the names of the columns in the taxonomy.
    my $fields = $self->fields();
Arguments
No arguments; the information is already inside the object.
Return Value
Reference to an array holding a list of the columns as they appear in the header row of the incoming taxonomy file.
Comment
Read-only.
data_records()
Purpose
Once the taxonomy has been validated, get a list of its data rows as a Perl data structure.
    $data_records = $self->data_records;
Arguments
None.
Return Value
Reference to an array of array references. The array holds the data records found in the incoming taxonomy file, in the order in which they appear in that file.
Comment
Does not contain any information about the fields in the taxonomy, so you should probably either (a) use it in conjunction with the fields() method above; or (b) use fields_and_data_records().
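One way to use the two return values together is to turn each positional record into a hash keyed on field name. The sketch below uses hand-built stand-ins for the values that fields() and data_records() would return, so that it runs without the module installed.

```perl
use strict;
use warnings;

# Stand-ins for the return values of fields() and data_records().
my $fields       = [ qw( id parent_id name is_actionable ) ];
my $data_records = [
    [ 1, '', 'Alpha', 0 ],
    [ 4, 3,  'Kappa', 1 ],
];

# Turn each positional record into a hash keyed on field name.
my @rows;
for my $rec (@$data_records) {
    my %row;
    @row{@$fields} = @$rec;    # hash slice pairs names with values
    push @rows, \%row;
}

print "$rows[1]{name} has parent $rows[1]{parent_id}\n";
```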
get_field_position()
Purpose
Identify the index position of a given field within the header row.
    $index = $obj->get_field_position('income');
Arguments
Takes a single string holding the name of one of the fields (column names).
Return Value
Integer representing the index position (counting from 0) of the field provided as argument. Throws an exception if the argument is not actually a field.
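The lookup described here can be sketched as a name-to-index map. The field_position function is a hypothetical stand-in, not the module's code, and the wording of its error message is an assumption.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the lookup: map each field name to its index,
# then die if the requested name is not among the fields.
sub field_position {
    my ( $fields, $wanted ) = @_;
    my %index_of;
    @index_of{@$fields} = ( 0 .. $#{$fields} );
    die "'$wanted' is not a field\n" unless exists $index_of{$wanted};
    return $index_of{$wanted};
}

my $fields = [ qw( id parent_id name is_actionable ) ];
my $pos    = field_position( $fields, 'name' );
print "name is at index $pos\n";

# An unknown field raises an exception, which eval can trap.
my $idx = eval { field_position( $fields, 'income' ) };
print "error: $@" unless defined $idx;
```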
The following methods provide information about key columns in a Parse::Taxonomy::AdjacentList object. The key columns are those which hold the ID, parent ID and component information. These methods take no arguments. The methods whose names end in _idx return integers, as they return the index position of the column in the header row. The other methods return strings.
    $index_of_id_column = $self->id_col_idx;
    $name_of_id_column = $self->id_col;

    $index_of_parent_id_column = $self->parent_id_col_idx;
    $name_of_parent_id_column = $self->parent_id_col;

    $index_of_leaf_column = $self->leaf_col_idx;
    $name_of_leaf_column = $self->leaf_col;
pathify()
Generate a new Perl data structure which holds the same information as a Parse::Taxonomy::AdjacentList object but which expresses the route from the root node to a given branch or leaf node as either a separator-delimited string (as in the path column of a Parse::Taxonomy::MaterializedPath object) or as an array reference holding the list of names which delineate that route.
Another way of expressing this: Transform a taxonomy-by-adjacent-list to a taxonomy-by-materialized-path.
Example: Suppose we have a CSV file which serves as a taxonomy-by-adjacent-list for this data:
    "id","parent_id","name","is_actionable"
    "1","","Alpha","0"
    "2","","Beta","0"
    "3","1","Epsilon","0"
    "4","3","Kappa","1"
    "5","1","Zeta","0"
    "6","5","Lambda","1"
    "7","5","Mu","0"
    "8","2","Eta","1"
    "9","2","Theta","1"
Instead of having the route from the root node to a given node be represented implicitly by following parent_ids up the tree, suppose we want that route to be represented explicitly. Assuming that we work with default column names, that would mean replacing the information currently spread out among the id, parent_id and name columns with a single path column which, by default, would hold an array reference.
    $source = "./t/data/theta.csv";
    $obj = Parse::Taxonomy::AdjacentList->new( {
        file => $source,
    } );

    $taxonomy_with_path_as_array = $obj->pathify;
Yielding:
    [
      ["path", "is_actionable"],
      [["", "Alpha"], 0],
      [["", "Beta"], 0],
      [["", "Alpha", "Epsilon"], 0],
      [["", "Alpha", "Epsilon", "Kappa"], 1],
      [["", "Alpha", "Zeta"], 0],
      [["", "Alpha", "Zeta", "Lambda"], 1],
      [["", "Alpha", "Zeta", "Mu"], 0],
      [["", "Beta", "Eta"], 1],
      [["", "Beta", "Theta"], 1],
    ]
If we wanted the path information represented as a string rather than an array reference, we would say:
$taxonomy_with_path_as_string = $obj->pathify( { as_string => 1 } );
Yielding:
    [
      ["path", "is_actionable"],
      ["|Alpha", 0],
      ["|Beta", 0],
      ["|Alpha|Epsilon", 0],
      ["|Alpha|Epsilon|Kappa", 1],
      ["|Alpha|Zeta", 0],
      ["|Alpha|Zeta|Lambda", 1],
      ["|Alpha|Zeta|Mu", 0],
      ["|Beta|Eta", 1],
      ["|Beta|Theta", 1],
    ]
If we provide a true value for the as_string key, we also get to choose the string used as the separator in the path column, via the path_col_sep key.
    $taxonomy_with_path_as_string_different_path_col_sep =
        $obj->pathify( {
            as_string    => 1,
            path_col_sep => '~~',
        } );
Yields:
    [
      ["path", "is_actionable"],
      ["~~Alpha", 0],
      ["~~Beta", 0],
      ["~~Alpha~~Epsilon", 0],
      ["~~Alpha~~Epsilon~~Kappa", 1],
      ["~~Alpha~~Zeta", 0],
      ["~~Alpha~~Zeta~~Lambda", 1],
      ["~~Alpha~~Zeta~~Mu", 0],
      ["~~Beta~~Eta", 1],
      ["~~Beta~~Theta", 1],
    ]
Finally, should we want the path column in the returned arrayref to be named something other than path, we can provide a value to the path_col key.
    [
      ["foo", "is_actionable"],
      [["", "Alpha"], 0],
      [["", "Beta"], 0],
      [["", "Alpha", "Epsilon"], 0],
      [["", "Alpha", "Epsilon", "Kappa"], 1],
      [["", "Alpha", "Zeta"], 0],
      [["", "Alpha", "Zeta", "Lambda"], 1],
      [["", "Alpha", "Zeta", "Mu"], 0],
      [["", "Beta", "Eta"], 1],
      [["", "Beta", "Theta"], 1],
    ]
Arguments
Optional single hash reference. If provided, the following keys may be used:
path_col
User-supplied name for the column holding path information in the returned array reference. Defaults to path.
as_string
Boolean. If supplied with a true value, path information will be represented as a separator-delimited string rather than an array reference.
path_col_sep
User-supplied string used to separate the parts of the route when as_string is set to a true value. Not meaningful unless as_string is true.
Return Value
Reference to an array of array references. The first element in the array will be a reference to an array of field names. Each succeeding element will be a reference to an array holding data for one record in the original taxonomy. The path data will be represented, by default, as an array reference built up from the component (name) column in the original taxonomy, but if as_string is selected, the path data in all non-header elements will be a separator-delimited string.
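The transformation can be sketched in plain Perl by climbing from each record to its top-level ancestor through parent_id. The sketch_pathify function below is a hypothetical illustration, not the module's implementation. It assumes records laid out as [ id, parent_id, name, is_actionable ], and it assumes the data has already been validated, so every parent_id resolves and there are no cycles.

```perl
use strict;
use warnings;

# Records laid out as [ id, parent_id, name, is_actionable ].
my @records = (
    [ 1, '', 'Alpha',   0 ],
    [ 2, '', 'Beta',    0 ],
    [ 3, 1,  'Epsilon', 0 ],
    [ 4, 3,  'Kappa',   1 ],
);

# Hypothetical helper: build [ path, is_actionable ] rows, with the path as
# an array reference or, if a separator is given, as a delimited string.
sub sketch_pathify {
    my ( $records, $sep ) = @_;
    my %by_id;
    $by_id{ $_->[0] } = $_ for @$records;

    my @rows = ( [ 'path', 'is_actionable' ] );
    for my $rec (@$records) {
        # Climb from this node to its top-level ancestor, collecting names.
        my @names;
        for ( my $node = $rec; $node; $node = $by_id{ $node->[1] } ) {
            unshift @names, $node->[2];
        }
        unshift @names, '';    # leading empty element, as in the output above
        my $path = defined $sep ? join( $sep, @names ) : \@names;
        push @rows, [ $path, $rec->[3] ];
    }
    return \@rows;
}

my $as_string = sketch_pathify( \@records, '|' );
print "$_->[0]\n" for @{$as_string}[ 1 .. $#{$as_string} ];
```

Calling sketch_pathify( \@records ) without a separator returns the array-reference form of each path instead.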
write_pathified_to_csv()
Purpose
Create a CSV-formatted file holding the data returned by pathify().
    $csv_file = $obj->write_pathified_to_csv( {
       pathified => $pathified,                   # output of pathify()
       csvfile => './t/data/taxonomy_out5.csv',
    } );
Arguments
Single hash reference. That hash is keyed on:
pathified
Required. Its value must be the reference to an array of array references returned by the pathify() method.
csvfile
Optional. Path to the location where a CSV-formatted text file holding the taxonomy-by-materialized-path will be written. Defaults to a file called taxonomy_out.csv in the current working directory.
You can also pass through any key-value pairs normally accepted by Text::CSV_XS.
Return Value
Returns path to CSV-formatted text file just created.
Example
Suppose we have a CSV-formatted file holding the taxonomy-by-adjacent-list shown in the example for pathify() above.
After running this file through new(), pathify() and write_pathified_to_csv() we will have a new CSV-formatted file holding this taxonomy-by-materialized-path:
    path,is_actionable
    |Alpha,0
    |Beta,0
    |Alpha|Epsilon,0
    |Alpha|Epsilon|Kappa,1
    |Alpha|Zeta,0
    |Alpha|Zeta|Lambda,1
    |Alpha|Zeta|Mu,0
    |Beta|Eta,1
    |Beta|Theta,1
Note that the id, parent_id and name columns have been replaced by the path column.
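The writing step can be approximated with core Perl alone. The write_rows function below is a hypothetical, deliberately simplified stand-in: it joins fields with commas and does no quoting or escaping, whereas the module passes its options through to Text::CSV_XS. It also assumes the paths have already been rendered as strings.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Pathified rows with the path already rendered as a string.
my $pathified = [
    [ 'path',           'is_actionable' ],
    [ '|Alpha',         0 ],
    [ '|Alpha|Epsilon', 0 ],
];

# Hypothetical, simplified writer: joins fields with commas and does no
# quoting or escaping, so it is only safe for values without commas,
# quotes or newlines.
sub write_rows {
    my ( $rows, $csvfile ) = @_;
    open my $out, '>', $csvfile or die "Unable to open $csvfile: $!";
    print {$out} join( ',', @$_ ), "\n" for @$rows;
    close $out or die "Unable to close $csvfile: $!";
    return $csvfile;
}

# Write to a temporary file that is removed when the program exits.
my ( undef, $csvfile ) = tempfile( SUFFIX => '.csv', UNLINK => 1 );
write_rows( $pathified, $csvfile );

# Read the file back to show what was written.
open my $in, '<', $csvfile or die "Unable to open $csvfile: $!";
print while <$in>;
close $in;
```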