NAME

Data::Validate::CSV - read and validate CSV

SYNOPSIS

CSV Schema (JSON):

  {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "countries.csv",
    "tableSchema": {
      "columns": [{
        "name": "country",
        "datatype": { "base": "string", "length": 2 }
      },{
        "name": "country group",
        "datatype": "string"
      },{
        "name": "name (en)",
        "datatype": "string"
      },{
        "name": "name (fr)",
        "datatype": "string"
      },{
        "name": "name (de)",
        "datatype": "string"
      },{
        "name": "latitude",
        "datatype": { "base": "number", "maximum": 90, "minimum": -90 }
      },{
        "name": "longitude",
        "datatype": { "base": "number", "maximum": 180, "minimum": -180 }
      }]
    }
  }

CSV Data:

  "at","eu","Austria","Autriche","Österreich","47.6965545","13.34598005"
  "be","eu","Belgium","Belgique","Belgien","50.501045","4.47667405"
  "bg","eu","Bulgaria","Bulgarie","Bulgarien","42.72567375","25.4823218"

Perl:

  use Path::Tiny qw(path);
  use Data::Validate::CSV;
  
  my $table = Data::Validate::CSV::Table->new(
    schema     => path('countries.csv-metadata.json'),
    input      => path('countries.csv'),
    has_header => !!0,
  );
  
  while (my $row = $table->get_row) {
    for my $e (@{$row->errors}) {
      warn $e;
    }
    printf(
      "%s is at latitude %f, longitude %f.\n",
      $row->get("name (en)")->value,
      $row->get("latitude")->value,
      $row->get("longitude")->value,
    );
  }

DESCRIPTION

There's not really a lot of documentation right now.

Mostly there's three interfaces you need to know about: tables, rows, and cells. (There are also columns, schemas, and notes, but for most day-to-day usage, those can be considered internal implementation details.)

Table interface

The table is constructed with the following attributes:

schema

A schema for the table. Can be a hashref, a JSON string, a scalar ref to a JSON string, or a Path::Tiny path to a file containing the schema.

input

The CSV data for the table. Can be a filehandle, a scalar ref to a string of data, or a Path::Tiny path to a file.

has_header

A boolean indicating whether the CSV contains a header row. This will be used to supply any column names missing from the schema, and will be skipped from being returned by get_row.

reader

A coderef which, if given a filehandle, will return a parsed line of CSV. The default is basically something like:

  sub { Text::CSV_XS->new->getline($_[0]) }

That's probably sufficient for most cases, but you may need to supply your own reader for handling tab-delimited files.

skip_rows

An integer, number of additional rows to skip before the header. Some CSV files contain a title or credit line. Defaults to 0.

skip_rows_after_header

An integer, number of additional rows to skip after the header. Defaults to 0.

The table provides the following methods:

get_row

Returns a row object for the next row of the table.

all_rows

Gets all the rows as a list.

row_count

The number of non-skipped, non-header lines read so far.

Row interface

The rows returned by get_row and all_rows are blessed objects. They provide the following methods:

raw_values

The values returned by Text::CSV_XS without any further processing.

values

The values returned by Text::CSV_XS, processed by datatype. Date and time datatypes will be reformatted from any CLDR-based format to ISO 8601. Booleans using non-standard representations will be changed to "1" and "0". Fields that have a separator defined will be split into an arrayref. Numbers given as percentages will be divided by 100. And so forth.

cells

Returns the same values as values but wrapped in cell objects. The following are equivalent:

  $row->values->[0];
  $row->cells->[0]->value;
  $row->[0];  # $row overloads @{}

Why fetch a cell instead of directly fetching the value? The cell object offers a few other useful methods.

get($name)

Gets a single cell from the row by its name. Names are defined in the schema, or the header row if missing from the schema.

  $row->get("country")->value;
row_number

The row number for this row in the table. Rows are numbered starting at 1. Headers and skipped rows are not counted.

key_string

For tables that has a primary key, this returns a string formed by joining together the primary key columns. It ought to be a unique identifier for this row within the table, and if it is not, this will be raised as an error.

errors

An arrayref of strings of errors associated with this row. This includes data validation problems.

Cell interface

It is possible to bypass using the cell interface and access cell values directly from the rows, but if accessing cells, these are the methods they provide:

raw_value

The value returned by Text::CSV_XS without any further processing.

value

The value returned by Text::CSV_XS, processed by datatype.

inflated_value

Like value but inflates some values to blessed objects. Date and time related datatypes will be returned as DateTime, DateTime::Incomplete, or DateTime::Duration objects. Booleans will be returned as JSON::PP::Boolean objects.

row_number

The row number for the cell's parent row in the table. Rows are numbered starting at 1. Headers and skipped rows are not counted.

col_number

The column number of this cell within the parent row. Columns are numbered starting at 1.

datatype

The datatype for this cell as a hashref.

BUGS

Please report any bugs to http://rt.cpan.org/Dist/Display.html?Queue=Data-Validate-CSV.

SEE ALSO

https://www.w3.org/TR/2016/NOTE-tabular-data-primer-20160225/.

AUTHOR

Toby Inkster <tobyink@cpan.org>.

COPYRIGHT AND LICENCE

This software is copyright (c) 2019 by Toby Inkster.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.

DISCLAIMER OF WARRANTIES

THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.