NAME

Parse::Taxonomy::AdjacentList - Extract a taxonomy from a hierarchy inside a CSV file

SYNOPSIS

    use Parse::Taxonomy::AdjacentList;

    $source = "./t/data/alpha.csv";
    $obj = Parse::Taxonomy::AdjacentList->new( {
        file    => $source,
    } );

METHODS

new()

  • Purpose

    Parse::Taxonomy::AdjacentList constructor.

  • Arguments

    Single hash reference. There are two possible interfaces: file and components.

    1 file interface
        $source = "./t/data/delta.csv";
        $obj = Parse::Taxonomy::AdjacentList->new( {
            file    => $source,
        } );

    Elements in the hash reference are keyed on:

    • file

      Absolute or relative path to the incoming taxonomy file. Required for this interface.

    • id_col

      The name of the column in the header row under which each data record's unique ID can be found. Defaults to id.

    • parent_id_col

      The name of the column in the header row under which each data record's parent ID can be found. (Will be empty in the case of top-level nodes, as they have no parent.) Defaults to parent_id.

    • leaf_col

      The name of the column in the header row under which, in each data record, there is found a string which differentiates that record from all other records with the same parent ID. Defaults to name.

    • Text::CSV_XS options

      Any other options which could normally be passed to Text::CSV_XS->new() will be passed through to that module's constructor. On the recommendation of the Text::CSV documentation, binary is always set to a true value.

    2 components interface
        $obj = Parse::Taxonomy::AdjacentList->new( {
            components  => {
                fields          => $fields,
                data_records    => $data_records,
            }
        } );

    Elements in this hash are keyed on:

    • components

      This element is required for the components interface. The value of this element is a hash reference with two keys, fields and data_records. fields is a reference to an array holding the field or column names for the data set. data_records is a reference to an array of array references, each of the latter arrayrefs holding one record or row from the data set.
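    As an illustration of the shape those two structures take, here is a minimal sketch using a few rows in the default column layout. The sample rows are chosen for illustration, and the require guard is an assumption added only so that the snippet still runs where the module is not installed.

```perl
use strict;
use warnings;

# Column names, in the order they would appear in a header row.
my $fields = [ qw( id parent_id name is_actionable ) ];

# One array reference per record; top-level nodes have an empty parent_id.
my $data_records = [
    [ 1, '', 'Alpha',   0 ],
    [ 2, '', 'Beta',    0 ],
    [ 3, 1,  'Epsilon', 0 ],
    [ 4, 3,  'Kappa',   1 ],
];

# Guarded (an assumption of this sketch) so it runs without the module.
my $obj;
if ( eval { require Parse::Taxonomy::AdjacentList; 1 } ) {
    $obj = Parse::Taxonomy::AdjacentList->new( {
        components => {
            fields       => $fields,
            data_records => $data_records,
        },
    } );
}
```

    Each inner array must hold exactly as many elements as fields does, which mirrors the header-row check described for the file interface.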

  • Return Value

    Parse::Taxonomy::AdjacentList object.

  • Exceptions

    new() will throw an exception under any of the following conditions:

    • Argument to new() is not a reference.

    • Argument to new() is not a hash reference.

    • Argument to new() has both a 'file' and a 'components' element, or has neither.

    • Header row lacks the columns needed to match requirements.

    • Non-numeric entry in id or parent_id column.

    • Duplicate entries in id column.

    • Number of fields in a data record does not match number in header row.

    • Empty string in a component column of a record.

    • Unable to locate a record whose id is the parent_id of a different record.

    • Two or more records with the same parent_id share a value in the component column.

    • file interface

      • In the file interface, unable to locate the file which is the value of the file element.

      • The same field is found more than once in the header row of the incoming taxonomy file.

      • Unable to open or close the incoming taxonomy file for reading.

    • components interface

      • components element is not a hash reference with fields and data_records elements.

      • fields element is not an array reference.

      • data_records element is not a reference to an array of array references.

      • The same field is found more than once in the fields element's array reference.
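    Several of the data-level conditions above (duplicate IDs, missing parents, siblings sharing a name) can be pictured with a short pure-Perl sketch. This is not the module's own validation code; find_problems() is a hypothetical helper, and it assumes records of the form [ id, parent_id, name ].

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of problems found in records of the
# form [ id, parent_id, name ]. Not the module's own validation code.
sub find_problems {
    my ($records) = @_;
    my ( @problems, %seen_id, %seen_sibling );

    for my $rec (@$records) {
        my ( $id, $parent_id, $name ) = @$rec;
        push @problems, "non-numeric id '$id'"
            unless $id =~ /^\d+$/;
        push @problems, "duplicate id '$id'"
            if $seen_id{$id}++;
        push @problems, "empty name for id '$id'"
            unless length $name;
        push @problems, "duplicate name '$name' under parent '$parent_id'"
            if $seen_sibling{$parent_id}{$name}++;
    }

    # Every non-empty parent_id must be the id of some record.
    for my $rec (@$records) {
        my ( $id, $parent_id ) = @$rec;
        next unless length $parent_id;
        push @problems, "no record with id '$parent_id' (parent of '$id')"
            unless $seen_id{$parent_id};
    }
    return @problems;
}

my @problems = find_problems( [
    [ 1, '', 'Alpha' ],
    [ 2, '', 'Alpha' ],    # same name as record 1 under the same parent
    [ 3, 9,  'Gamma' ],    # parent 9 does not exist
] );
print "$_\n" for @problems;
```

    The sketch collects every problem it finds, whereas new() is documented as throwing an exception.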

fields()

  • Purpose

    Identify the names of the columns in the taxonomy.

  • Arguments

        my $fields = $self->fields();

    No arguments; the information is already inside the object.

  • Return Value

    Reference to an array holding a list of the columns as they appear in the header row of the incoming taxonomy file.

  • Comment

    Read-only.

data_records()

  • Purpose

    Once the taxonomy has been validated, get a list of its data rows as a Perl data structure.

  • Arguments

        $data_records = $self->data_records;

    None.

  • Return Value

    Reference to array of array references. The array will hold the data records found in the incoming taxonomy file in their order in that file.

  • Comment

    Does not contain any information about the fields in the taxonomy, so you should probably either (a) use in conjunction with fields() method above; or (b) use fields_and_data_records().
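    Option (a) might look like the following sketch, which pairs each record with the field names by means of a hash slice. The two array references are hard-coded stand-ins for what fields() and data_records() would return, so the snippet runs on its own.

```perl
use strict;
use warnings;

# Stand-ins for the values returned by $obj->fields and $obj->data_records.
my $fields       = [ qw( id parent_id name is_actionable ) ];
my $data_records = [
    [ 1, '', 'Alpha',   0 ],
    [ 3, 1,  'Epsilon', 0 ],
];

# Pair each record's values with the field names via a hash slice.
my @rows_as_hashes;
for my $record (@$data_records) {
    my %row;
    @row{ @$fields } = @$record;
    push @rows_as_hashes, \%row;
}

print "$rows_as_hashes[1]{name} has parent $rows_as_hashes[1]{parent_id}\n";
```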

get_field_position()

  • Purpose

    Identify the index position of a given field within the header row.

  • Arguments

        $index = $obj->get_field_position('income');

    Takes a single string holding the name of one of the fields (column names).

  • Return Value

    Integer representing the index position (counting from 0) of the field provided as argument. Throws exception if the argument is not actually a field.
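    The lookup described here amounts to a name-to-index map over the header row. A minimal sketch, in which field_position() is a hypothetical stand-in and not the module's code:

```perl
use strict;
use warnings;

my @fields = qw( id parent_id name income );

# Map each field name to its zero-based position in the header row.
my %position_of;
@position_of{@fields} = 0 .. $#fields;

# Hypothetical stand-in for get_field_position().
sub field_position {
    my ($field) = @_;
    die "'$field' is not a field in this taxonomy\n"
        unless exists $position_of{$field};
    return $position_of{$field};
}

print field_position('income'), "\n";

# An unknown field name raises an exception, which eval can trap.
my $ok = eval { field_position('salary'); 1 };
print "lookup failed: $@" unless $ok;
```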

Accessors

The following methods provide information about key columns in a Parse::Taxonomy::AdjacentList object. The key columns are those which hold the ID, parent ID and component information. They take no arguments. The methods whose names end in _idx return integers, as they return the index position of the column in the header row. The other methods return strings.

    $index_of_id_column = $self->id_col_idx;

    $name_of_id_column = $self->id_col;

    $index_of_parent_id_column = $self->parent_id_col_idx;

    $name_of_parent_id_column = $self->parent_id_col;

    $index_of_leaf_column = $self->leaf_col_idx;

    $name_of_leaf_column = $self->leaf_col;

pathify()

  • Purpose

    Generate a new Perl data structure which holds the same information as a Parse::Taxonomy::AdjacentList object but which expresses the route from the root node to a given branch or leaf node as either a separator-delimited string (as in the path column of a Parse::Taxonomy::MaterializedPath object) or as an array reference holding the list of names which delineate that route.

    Another way of expressing this: Transform a taxonomy-by-adjacent-list to a taxonomy-by-materialized-path.

    Example: Suppose we have a CSV file which serves as a taxonomy-by-adjacent-list for this data:

        "id","parent_id","name","is_actionable"
        "1","","Alpha","0"
        "2","","Beta","0"
        "3","1","Epsilon","0"
        "4","3","Kappa","1"
        "5","1","Zeta","0"
        "6","5","Lambda","1"
        "7","5","Mu","0"
        "8","2","Eta","1"
        "9","2","Theta","1"

    Instead of having the route from the root node to a given node be represented implicitly by following parent_ids up the tree, suppose we want that route to be represented explicitly within each record. Assuming that we work with default column names, that would mean replacing the information currently spread out among the id, parent_id and name columns with a single path column which, by default, would hold an array reference.

        $source = "./t/data/theta.csv";
        $obj = Parse::Taxonomy::AdjacentList->new( {
            file    => $source,
        } );
    
        $taxonomy_with_path_as_array = $obj->pathify;

    Yielding:

        [
          ["path", "is_actionable"],
          [["", "Alpha"], 0],
          [["", "Beta"], 0],
          [["", "Alpha", "Epsilon"], 0],
          [["", "Alpha", "Epsilon", "Kappa"], 1],
          [["", "Alpha", "Zeta"], 0],
          [["", "Alpha", "Zeta", "Lambda"], 1],
          [["", "Alpha", "Zeta", "Mu"], 0],
          [["", "Beta", "Eta"], 1],
          [["", "Beta", "Theta"], 1],
        ]

    If we wanted the path information represented as a string rather than an array reference, we would say:

        $taxonomy_with_path_as_string = $obj->pathify( { as_string => 1 } );

    Yielding:

        [
          ["path", "is_actionable"],
          ["|Alpha", 0],
          ["|Beta", 0],
          ["|Alpha|Epsilon", 0],
          ["|Alpha|Epsilon|Kappa", 1],
          ["|Alpha|Zeta", 0],
          ["|Alpha|Zeta|Lambda", 1],
          ["|Alpha|Zeta|Mu", 0],
          ["|Beta|Eta", 1],
          ["|Beta|Theta", 1],
        ]

    If we are providing a true value to the as_string key, we also get to choose, via the path_col_sep key, what string to use as the separator in the path column.

        $taxonomy_with_path_as_string_different_path_col_sep =
            $obj->pathify( {
                as_string       => 1,
                path_col_sep    => '~~',
             } );

    Yielding:

        [
          ["path", "is_actionable"],
          ["~~Alpha", 0],
          ["~~Beta", 0],
          ["~~Alpha~~Epsilon", 0],
          ["~~Alpha~~Epsilon~~Kappa", 1],
          ["~~Alpha~~Zeta", 0],
          ["~~Alpha~~Zeta~~Lambda", 1],
          ["~~Alpha~~Zeta~~Mu", 0],
          ["~~Beta~~Eta", 1],
          ["~~Beta~~Theta", 1],
        ]

    Finally, should we want the path column in the returned arrayref to be named something other than path, we can provide a value to the path_col key. With path_col set to foo, pathify() yields:

        [
          ["foo", "is_actionable"],
          [["", "Alpha"], 0],
          [["", "Beta"], 0],
          [["", "Alpha", "Epsilon"], 0],
          [["", "Alpha", "Epsilon", "Kappa"], 1],
          [["", "Alpha", "Zeta"], 0],
          [["", "Alpha", "Zeta", "Lambda"], 1],
          [["", "Alpha", "Zeta", "Mu"], 0],
          [["", "Beta", "Eta"], 1],
          [["", "Beta", "Theta"], 1],
        ]

  • Arguments

    Optional single hash reference. If provided, the following keys may be used:

    • path_col

      User-supplied name for column holding path information in the returned array reference. Defaults to path.

    • as_string

      Boolean. If supplied with a true value, path information will be represented as a separator-delimited string rather than an array reference.

    • path_col_sep

      User-supplied string to be used to separate the parts of the route when as_string is set to a true value. Not meaningful unless as_string is true. When omitted, the separator is | (pipe), as in the example above.

  • Return Value

    Reference to an array of array references. The first element in the array will be a reference to an array of field names. Each succeeding element will be a reference to an array holding data for one record in the original taxonomy. The path data will be represented, by default, as an array reference built up from the component (name) column in the original taxonomy, but if as_string is selected, the path data in all non-header elements will be a separator-delimited string.
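    The transformation can be pictured with a pure-Perl sketch that walks each record's parent_id chain up to the root. It is not the module's implementation: pathify_sketch() is a hypothetical helper, it assumes records in the default column order (id, parent_id, name, is_actionable), and it assumes the data has already been validated so that no parent is missing and there are no cycles.

```perl
use strict;
use warnings;

# Records as [ id, parent_id, name, is_actionable ], default column order.
my @records = (
    [ 1, '', 'Alpha',   0 ],
    [ 2, '', 'Beta',    0 ],
    [ 3, 1,  'Epsilon', 0 ],
    [ 4, 3,  'Kappa',   1 ],
);

# Hypothetical helper, not the module's implementation.
sub pathify_sketch {
    my ( $records, $args ) = @_;
    $args ||= {};
    my $path_col = defined $args->{path_col}     ? $args->{path_col}     : 'path';
    my $sep      = defined $args->{path_col_sep} ? $args->{path_col_sep} : '|';

    # Index records by id so that parents can be looked up directly.
    my %by_id;
    $by_id{ $_->[0] } = $_ for @$records;

    my @out = ( [ $path_col, 'is_actionable' ] );
    for my $rec (@$records) {
        # Walk up the parent_id chain, prepending each ancestor's name.
        my @names;
        for ( my $node = $rec; $node; $node = $by_id{ $node->[1] } ) {
            unshift @names, $node->[2];
        }
        unshift @names, '';    # leading empty element stands for the root
        my $path = $args->{as_string} ? join( $sep, @names ) : \@names;
        push @out, [ $path, $rec->[3] ];
    }
    return \@out;
}

my $pathified = pathify_sketch( \@records, { as_string => 1 } );
print "$_->[0],$_->[1]\n" for @$pathified;
```

    Indexing the records by id first keeps each parent lookup a single hash access, so the cost per record is proportional to its depth in the tree.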

write_pathified_to_csv()

  • Purpose

    Create a CSV-formatted file holding the data returned by pathify().

  • Arguments

        $csv_file = $obj->write_pathified_to_csv( {
           pathified => $pathified,                   # output of pathify()
           csvfile => './t/data/taxonomy_out5.csv',
        } );

    Single hash reference. That hash is keyed on:

    • pathified

      Required. Its value must be the arrayref of array references returned by the pathify() method.

    • csvfile

      Optional. Path to location where a CSV-formatted text file holding the taxonomy-by-materialized-path will be written. Defaults to a file called taxonomy_out.csv in the current working directory.

    • Text::CSV_XS options

      You can also pass through any key-value pairs normally accepted by Text::CSV_XS.

  • Return Value

    Returns path to CSV-formatted text file just created.

  • Example

    Suppose we have a CSV-formatted file holding the following taxonomy-by-adjacent-list:

        "id","parent_id","name","is_actionable"
        "1","","Alpha","0"
        "2","","Beta","0"
        "3","1","Epsilon","0"
        "4","3","Kappa","1"
        "5","1","Zeta","0"
        "6","5","Lambda","1"
        "7","5","Mu","0"
        "8","2","Eta","1"
        "9","2","Theta","1"

    After running this file through new(), pathify() and write_pathified_to_csv() we will have a new CSV-formatted file holding this taxonomy-by-materialized-path:

        path,is_actionable
        |Alpha,0
        |Beta,0
        |Alpha|Epsilon,0
        |Alpha|Epsilon|Kappa,1
        |Alpha|Zeta,0
        |Alpha|Zeta|Lambda,1
        |Alpha|Zeta|Mu,0
        |Beta|Eta,1
        |Beta|Theta,1

    Note that the id, parent_id and name columns have been replaced by the path column.