The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

App::CSVUtils - CLI utilities related to CSV

VERSION

This document describes version 1.034 of App::CSVUtils (from Perl distribution App-CSVUtils), released on 2024-02-02.

DESCRIPTION

This distribution contains the following CLI utilities:

1. csv-add-fields
2. csv-avg
3. csv-check-cell-values
4. csv-check-field-names
5. csv-check-field-values
6. csv-check-rows
7. csv-cmp
8. csv-concat
9. csv-convert-to-hash
10. csv-csv
11. csv-delete-fields
12. csv-dump
13. csv-each-row
14. csv-fill-cells
15. csv-fill-template
16. csv-find-values
17. csv-freqtable
18. csv-gen
19. csv-get-cells
20. csv-grep
21. csv-info
22. csv-intrange
23. csv-list-field-names
24. csv-lookup-fields
25. csv-ltrim
26. csv-map
27. csv-munge-field
28. csv-munge-rows
29. csv-pick
30. csv-pick-cells
31. csv-pick-fields
32. csv-pick-rows
33. csv-quote
34. csv-replace-newline
35. csv-rtrim
36. csv-select-fields
37. csv-select-rows
38. csv-setop
39. csv-shuf
40. csv-shuf-fields
41. csv-shuf-rows
42. csv-sort
43. csv-sort-fields
44. csv-sort-fields-by-example
45. csv-sort-fields-by-spec
46. csv-sort-rows
47. csv-sorted
48. csv-sorted-fields
49. csv-sorted-rows
50. csv-split
51. csv-sum
52. csv-transpose
53. csv-trim
54. csv-uniq
55. csv-unquote
56. csv2csv
57. csv2ltsv
58. csv2paras
59. csv2td
60. csv2tsv
61. csv2vcf
62. list-csvutils
63. paras2csv
64. tsv2csv

FUNCTIONS

gen_csv_util

Usage:

gen_csv_util(%args) -> bool

Generate a CSV utility.

This routine is used to generate a CSV utility in the form of a Rinci function (code and metadata). You can then produce a CLI from the Rinci function simply using Perinci::CmdLine::Gen or, if you use Dist::Zilla, Dist::Zilla::Plugin::GenPericmdScript or, if on the command-line, gen-pericmd-script.

Using this routine, by providing just one or a few hooks and setting some parameters like a couple of extra arguments, you will get a complete CLI with decent POD/manpage, ability to read one or multiple CSV's and write one or multiple CSV's, some command-line options to customize how the input CSV's should be parsed and how the output CSV's should be formatted and named. Your CLI also has tab completion, usage and help message, and other features.

To create a CSV utility, you specify a name (e.g. csv_dump; must be a valid unqualified Perl identifier/function name) and optionally summary, description, and other metadata like links or even add_meta_props. Then you specify one or more of on_* or before_* or after_* arguments to supply handlers (coderefs) for your CSV utility at various hook points.

THE HOOKS

All code for hooks should accept a single argument r. r is a stash (hashref) of various data, the keys of which will depend on which hook point being called. You can also add more keys to store data or for flow control (see hook documentation below for more details).

The order of the hooks, in processing chronological order:

  • on_begin

    Called when utility begins, before reading CSV. You can use this hook e.g. to process arguments, set output filenames (if you allow custom output filenames).

  • before_read_input

    Called before opening any input CSV file. This hook is still called even if your utility sets reads_csv to false.

    At this point, the input_filenames stash key (as well as other keys like input_filename, input_filenum, etc) has not been set. You can use this hook e.g. to set a custom input_filenames.

  • before_open_input_files

    Called before an input CSV file is about to be opened, including for stdin (-). You can use this hook e.g. to check/preprocess input file. Flow control is available by setting $r->{wants_skip_files} to skip reading all the input files and go directly to the after_read_input hook.

  • before_open_input_file

    Called before each input CSV file is about to be opened, including for stdin (-). For the first file, called after before_open_input_file hook. You can use this hook e.g. to check/preprocess input file. Flow control is available by setting $r->{wants_skip_file} to skip reading a single input file and go to the next file, or $r->{wants_skip_files} to skip reading the rest of the files and go directly to the after_read_input hook.

  • on_input_header_row

    Called when receiving header row. Will be called for every input file, and called even when user specify --no-input-header, in which case the header row will be the generated ["field1", "field2", ...]. You can use this hook e.g. to add/remove/rearrange fields.

    You can set $r->{wants_fill_rows} to a defined false if you do not want $r->{input_rows} to be filled with empty string elements when it contains less than the number of fields (in case of sparse values at the end). Normally you only want to do this when you want to do checking, e.g. in csv-check-rows.

  • on_input_data_row

    Called when receiving each data row. You can use this hook e.g. to modify the row or print output (for line-by-line transformation or filtering).

  • after_close_input_file

    Called after each input file is closed, including for stdin (-) (although for stdin, the handle is not actually closed). Flow control is possible by setting $r->{wants_skip_files} to skip reading the rest of the files and go straight to the after_close_input_files hook.

  • after_close_input_files

    Called after the last input file is closed, after the last after_close_input_file hook, including for stdin (-) (although for stdin, the handle is not actually closed).

  • after_read_input

    Called after the last row of the last CSV file is read and the last file is closed. This hook is still called, if you set reads_csv option to false. At this point the stash keys related to CSV reading have all been cleared, including input_filenames, input_filename, input_fh, etc.

    You can use this hook e.g. to print output if you buffer the output.

  • on_end

    Called when utility is about to exit. You can use this hook e.g. to return the final result.

THE STASH

The common keys that r will contain:

  • gen_args, hash. The arguments used to generate the CSV utility.

  • util_args, hash. The arguments that your CSV utility accepts. Parsed from command-line arguments (or configuration files, or environment variables).

  • name, str. The name of the CSV utility. Which can also be retrieved via gen_args.

  • code_print, coderef. Routine provided for you to print something. Accepts a string. Takes care of opening the output files for you.

  • code_print_row, coderef. Routine provided for you to print a data row. You pass the row (either arrayref or hashref). Takes care of opening the output files for you, as well as printing header row the first time, if needed.

  • code_print_header_row, coderef. Routine provided for you to print header row. You don't need to pass any arguments. Will only print the header row once per output file if output header is enabled, even if called multiple times.

If you are accepting CSV data (reads_csv gen argument set to true), the following keys will also be available (in on_input_header_row and on_input_data_row hooks):

  • input_parser, a Text::CSV_XS instance for input parsing.

  • input_fields, array of str. Input CSV's field names.

  • input_fields_idx, hash with field name as keys and field index (0-based integer) as values.

  • input_filenames, array of str.

  • input_filename, str. The name of the current input file being read (- if reading from stdin).

  • input_filenum, uint. The number of the current input file, 1 being the first file, 2 for the second, and so on.

  • input_fh, the handle to the current file being read.

  • input_rownum, uint. The number of rows that have been read (reset after each input file). In on_input_header_row phase, this will be 1 since header row (including the generated one) is the first row. Then in on_input_data_row phase (called the first time for a file), it will be 2 for the first data row, even if physically it is the first row for CSV file that does not have a header.

  • input_data_rownum, uint. The number of data rows that have been read (reset after each input file). This will be equal to input_rownum less 1 if input file has header.

  • input_row, aos (array of str). The current input CSV row as an arrayref.

  • input_row_as_hashref, hos (hash of str). The current input CSV row as a hashref, with field names as hash keys and field values as hash values. This will only be calculated if utility wants it. Utility can express so by setting $r->{wants_input_row_as_hashref} to true, e.g. in the on_begin hook.

  • input_header_row_count, uint. Contains the number of actual header rows that have been read. If CLI user specifies --no-input-header, this will stay at zero. Will be reset for each CSV file.

  • input_data_row_count, int. Contains the number of actual data rows that have read. Will be reset for each CSV file.

If you are outputting CSV (writes_csv gen argument set to true), the following keys will be available:

  • output_emitter, a Text::CSV_XS instance for output.

  • output_fields, array of str. Should be set to list of output field names. If unset, will be set to be the same as input_fields.

  • output_fields_idx, hash with field names as keys and field indexes (0-based integer) as values. Normally you do not need to set this manually; you just need to set output_fields and this hash will be computed automatically for you just before the first output row is outputted.

  • output_filenames, array of str.

  • output_filename, str, name of current output file.

  • output_filenum, uint, the number of the current output file, 1 being the first file, 2 for the second, and so on.

  • output_fh, handle to the current output file.

  • output_rownum, uint. The number of rows that have been outputted (reset after each output file).

  • output_data_rownum, uint. The number of data rows that have been outputted (reset after each output file). This will be equal to input_rownum less 1 if input file has header.

For other hook-specific keys, see the documentation for associated hook point.

ACCEPTING ADDITIONAL COMMAND-LINE OPTIONS/ARGUMENTS

As mentioned above, you will get additional command-line options/arguments in $r->{util_args} hashref. Some options/arguments are already added by gen_csv_util, e.g. input_filename or input_filenames along with input_sep_char, etc (when your utility declares reads_csv), output_filename or output_filenames along with overwrite, output_sep_char, etc (when your utility declares writes_csv).

If you want to accept additional arguments/options, you specify them in add_args (hashref, with key being Each option/argument has to be specified first via add_args (as hashref, with key being argument name and value the argument specification as defined in Rinci::function)). Some argument specifications have been defined in App::CSVUtils and can be used. See existing utilities for examples.

READING CSV DATA

To read CSV data, normally your utility would provide handler for the on_input_data_row hook and sometimes additionally on_input_header_row.

OUTPUTTING STRING OR RETURNING RESULT

To output string, usually you call the provided routine $r->{code_print}. This routine will open the output files for you.

You can also return enveloped result directly by setting $r->{result}.

OUTPUTTING CSV DATA

To output CSV data, usually you call the provided routine $r->{code_print_row}. This routine accepts a row (arrayref or hashref). This routine will open the output files for you when needed, as well as print header row automatically.

You can also buffer rows from input to e.g. $r->{output_rows}, then call $r->{code_print_row} repeatedly in the after_read_input hook to print all the rows.

READING MULTIPLE CSV FILES

To read multiple CSV files, you first specify reads_multiple_csv. Then, you can supply handler for on_input_header_row and on_input_data_row as usual. If you want to do something before/after each input file, you can also supply handler for before_open_input_file or after_close_input_file.

WRITING TO MULTIPLE CSV FILES

Similarly, to write to many CSv files, you first specify writes_multiple_csv. Then, you can supply handler for on_input_header_row and on_input_data_row as usual. To switch to the next file, set $r->{wants_switch_to_next_output_file} to true, in which case the next call to $r->{code_print_row} will close the current file and open the next file.

CHANGING THE OUTPUT FIELDS

When calling $r->{code_print_row}, you can output whatever fields you want. By convention, you can set $r->{output_fields} and $r->{output_fields_idx} to let other handlers know about the output fields. For example, see the implementation of csv-concat.

This function is not exported by default, but exportable.

Arguments ('*' denotes required arguments):

  • add_args => hash

    (No description)

  • add_args_rels => hash

    (No description)

  • add_meta_props => hash

    Add additional Rinci function metadata properties.

  • after_close_input_file => code

    (No description)

  • after_close_input_files => code

    (No description)

  • after_read_input => code

    (No description)

  • before_open_input_file => code

    (No description)

  • before_open_input_files => code

    (No description)

  • before_read_input => code

    (No description)

  • description => str

    (No description)

  • examples => array

    (No description)

  • links => array[hash]

    (No description)

  • name* => perl::identifier::unqualified_ascii

    (No description)

  • on_begin => code

    (No description)

  • on_end => code

    (No description)

  • on_input_data_row => code

    (No description)

  • on_input_header_row => code

    (No description)

  • reads_csv => bool (default: 1)

    Whether utility reads CSV data.

  • reads_multiple_csv => bool

    Whether utility accepts CSV data.

    Setting this option to true will implicitly set the reads_csv option to true, obviously.

  • summary => str

    (No description)

  • writes_csv => bool (default: 1)

    Whether utility writes CSV data.

  • writes_multiple_csv => bool

    Whether utility outputs CSV data.

    Setting this option to true will implicitly set the writes_csv option to true, obviously.

Return value: (bool)

compile_eval_code

Usage:

$coderef = compile_eval_code($str, $label);

Compile string code $str to coderef in 'main' package, without use strict or use warnings. Die on compile error.

eval_code

Usage:

$res = eval_code($coderef, $r, $topic_var_value, $return_topic_var);

FAQ

My CSV does not have a header?

Use the --no-header option. Fields will be named field1, field2, and so on.

My data is TSV, not CSV?

Use the --tsv option.

I have a big CSV and the utilities are too slow or eat too much RAM!

These utilities are not (yet) optimized, patches welcome. If your CSV is very big, perhaps a C-based solution is what you need.

HOMEPAGE

Please visit the project's homepage at https://metacpan.org/release/App-CSVUtils.

SOURCE

Source repository is at https://github.com/perlancar/perl-App-CSVUtils.

SEE ALSO

Similar CLI bundles for other format

App::TSVUtils, App::LTSVUtils, App::SerializeUtils.

xls2csv and xlsx2csv from Spreadsheet::Read

import-csv-to-sqlite from App::SQLiteUtils

Query CSV with SQL using fsql from App::fsql

csvgrep from csvgrep

Other non-Perl-based CSV utilities

Python

csvkit, https://csvkit.readthedocs.io/en/latest/

AUTHOR

perlancar <perlancar@cpan.org>

CONTRIBUTOR

Adam Hopkins <violapiratejunky@gmail.com>

CONTRIBUTING

To contribute, you can send patches by email/via RT, or send pull requests on GitHub.

Most of the time, you don't need to build the distribution yourself. You can simply modify the code, then test via:

% prove -l

If you want to build the distribution (e.g. to try to install it locally on your system), you can install Dist::Zilla, Dist::Zilla::PluginBundle::Author::PERLANCAR, Pod::Weaver::PluginBundle::Author::PERLANCAR, and sometimes one or two other Dist::Zilla- and/or Pod::Weaver plugins. Any additional steps required beyond that are considered a bug and can be reported to me.

COPYRIGHT AND LICENSE

This software is copyright (c) 2024, 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016 by perlancar <perlancar@cpan.org>.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.

BUGS

Please report any bugs or feature requests on the bugtracker website https://rt.cpan.org/Public/Dist/Display.html?Name=App-CSVUtils

When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.