Aran Clary Deltac

NAME

Text::CSV::Auto - Comprehensive and automatic loading, processing, and analysis of CSV files.

SYNOPSIS

    use Text::CSV::Auto;
    
    my $auto = Text::CSV::Auto->new( 'path/to/file.csv' );
    
    $auto->process(sub{
        my ($row) = @_;
        ...
    });
    
    $rows = $auto->slurp();
    
    my $info = $auto->analyze();

If you need to set some attributes:

    my $auto = Text::CSV::Auto->new(
        file     => 'path/to/file.csv',
        max_rows => 100,
    );

There is also a non-OO interface:

    use Text::CSV::Auto qw( slurp_csv process_csv );
    
    process_csv('path/to/file.csv', sub{
        my ($row) = @_;
        ...
    });
    
    my $rows = slurp_csv('path/to/file.csv');

DESCRIPTION

This module provides utilities to quickly process and analyze CSV files with as little hassle as possible.

The reliable and robust Text::CSV module is used for the actual CSV parsing. This module provides a simpler and smarter interface. In most situations all you need to do is specify the filename of the file and this module will automatically figure out what kind of separator is used and set some good default options for processing the file.

The name CSV is misleading as any variable-width delimited file should be fine including TSV files and pipe "|" delimited files to name a few.

Install Text::CSV_XS to get the best possible performance.

ATTRIBUTES

file

This is the only required attribute and specifies the file name of the CSV file.

separator

If you do not set this the separator will be automatically detected using Text::CSV::Separator.

csv_options

Set this to a hashref of extra options that you'd like to have passed down to the underlying Text::CSV parser.

Read the Text::CSV docs to see the many options that it supports.

csv

This contains an instance of the Text::CSV object that is used to parse the CSV file. You may pass in your own parser object. If you don't then one will be instantiated for you with the csv_options().

If not set already in csv_options, the following defaults will be used:

    binary    => 1 # Assume there is binary data.
    auto_diag => 1 # die() if there are any errors.
    sep_char  => $self->separator()

headers

The headers as pulled from the first line of the CSV file, taking in to account skip_rows(). The format_headers() option may modifying the format of the headers.

In some cases a CSV file does not have headers. In these cases you should specify an arrayref of header names that you would like to use.

    headers => ['foo', 'bar']

format_headers

When the first row is pulled from the CSV to determine the headers this option will cause them to be formatted to be more consistent and remove duplications. For example, if this were the headers:

    Parents Name,Parent Age,Child Name,Child Age,Child Name,Child Age

The headers would be transformed too:

    parent_name,parent_age,child_name,child_age,child_name_2,child_age_2

This defaults to on and does not affect custom headers set via the headers attribute.

skip_rows

An arrayref of row numbers to skip can be specified. This is useful for CSV files that contain ancillary rows that you don't want to be processed. For example, you could ignore the 2nd row and the 5th through the 10th rows:

    skip_rows => [2, 5..10]

Do not that the headers are pulled *after* taking in to account skip_rows. So, for example, doing skip_row=>[1] will cause the headers to be pulled from the second row.

max_rows

By default all rows will be processed. In some cases you only want to run a sample set of rows. This option will limit the number of rows processed. This is most useful for when you are using analyze() on a very large file where you don't need every row to be analyzed.

METHODS

process

    $auto->process(sub{
        my ($row) = @_;
        ...
    });

Given a code reference, this will iterate over each row in the CSV and call the code with the $row hashref as the only argument.

slurp

    my $rows = $auto->slurp();

Slurps up all of the rows in to an arrayref of row hashrefs and returns it.

analyze

    my $info = $auto->analyze();

Returns an array of hashes where each hash represents a header in the CSV file. The hash will contain a lot of different meta data about the data that was found in the rows for that header.

It is possible that within the same header that multiple data types are found, such as finding a integer value on one row then a string value on another row within the same header. In a case like this both the integer=>1 and string=>1 flags would be set.

The possible data types are:

    empty    - The field was blank.
    integer  - Looked like a non-fractional number.
    decimal  - Looked like a fractional number.
    mdy_date - A date in the format of MM/DD/YYYY.
    ymd_date - A date in the format of YYYY-MM-DD.
    string   - Anything else.

There will also be a "data_type" key which will contain the most generalized data type from above. For example, if string was found on one row and decimal was found on another data_type will contain string.

Additionally the following attributes may be set:

    string_length     - The length of the largest string value.
    integer_length    - The number of integer digits in the largest number.
    fractional_length - The number of decimal digits in the value with the most decimal places.
    max               - The maximum number value found.
    min               - The minimum number value found.
    signed            - A negative number was found.

Each hash will also contain a 'header' key wich will contain the name of the header that is represents.

This method is implemented as an attribute so that calls beyond the first will not re-scan the CSV file.

FUNCTIONS

A non-OO interface is provided for simple cases.

process_csv

    use Text::CSV::Auto qw( process_csv );
    
    process_csv(
        'file.csv',
        { max_rows => 20 },
        sub{
            my ($row) = @_;
            ...
        },
    );

The first argument is the filename of the CSV file, the second argument is a hashref of options that can be any of the ATTRIBUTES in the OO interface. The third argument is a code reference which will be executed for each row.

The second argument is optional. You can just leave it out, like this:

    process_csv( 'file.csv', sub{
        my ($row) = @_;
        ...
    });

slurp_csv

    use Text::CSV::Auto qw( slurp_csv );
    
    my $rows = slurp_csv(
        'file.csv',
        { csv_options => {binary => 0} },
    );
    foreach my $row (@$rows) {
        ...
    }

Just like process_csv, the first option is required and the second is optional.

TODO

  • The date (not to mention time) handling in analyze is primitive and needs to be improved. Possibly by providing a CLDR pattern that then can be used with DateTime::Format::CLDR.

  • The original reason for creating analyze was to then take that meta data and produce table DDLs for relational databases. This would then allow for an extremely simple way to take a csv file and pump it in to a DB to then run queries on.

  • Not sure the best way to do this, but it would be really nice if the quote character could be automatically detected, or detect that no quote character is used.

AUTHOR

Aran Clary Deltac <bluefeet@gmail.com>

LICENSE

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.




Hosting generously
sponsored by Bytemark