The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

diff_spreadsheets - show differences between spreadsheets/csvs readably

SYNOPSIS

diff_spreadsheets [OPTIONS] file1.csv file2.xlsx!Sheet1

diff_spreadsheets [OPTIONS] file1.xls file2.ods # all coresponding sheets

DESCRIPTION

CSV files may be always used. Spreadsheets may be used if Libre Office 7.2 or later is installed.

If each input has only one sheet then they are compared regardless of sheet names. Each file could be a .CSV, a spreadsheet workbook containing a single sheet, or a multi-sheet workbook with a sheet name specified (using the syntax shown).

Otherwise, every sheet contained in each workbook is comapred with the same-named sheet in the other file, warning about any un-paired sheets.

Tabs, newlines, etc. and non-printable characters are replaced with escapes like "\t" or "\n" for human consumption.

OPTIONS

-m --method [native | diff | tkdiff | git]

-m native (the default) shows only the changed cells in rows which differ. Corresponding columns are identified by title, and need not be in the same order (see also --id-columns).

-m diff, -m tkdiff, or -m git run an external tool on temporary text (CSV) files created from the inputs, in which ignored columns have been removed and non-graphic characters changed to escapes.

Most diff(1) options are accepted and passed through, but are not documented here.

-m git uses git diff to, by default, color-code changed words; most git(1) diff options are passed through.

Following is an sample output using the *native* output format. Note that only changed columns are shown except that columns named with --id-columns are always shown for context:

 $ diff_spreadsheets --id-columns 'LAST NAME,FIRST NAME' File1.csv File2.csv

 -------- ADDED row 5 ------------------------------
   FIRST NAME: 'Frederick'
    LAST NAME: 'Douglass'
     Address1: 'Mount Hope Cemetary'
     Address2: '1133 Mt Hope Ave'
         CITY: 'Rochester'
        STATE: 'NY'
          ZIP: '14620'

 -------- CHANGED row 3 -----------------------------
   FIRST NAME: 'Lucretia' (unchanged)
    LAST NAME: 'Mott' (unchanged)
         CITY: '' →  'Philadelphia'

--title-row [ROWNUMBER]

The row number containing column titles (first==1). Auto-detected by default if the choice is obvious. Specify zero if there are no titles.

--columns COLSPEC[,COLSPEC ...]

--columns -COLSPEC[,-COLSPEC ...]

Ignore differences in certain columns.

In the first form (not negated), the specified columns are used for comparisons and others are ignored when deciding if a row changed. When negated ("-" prefix) the specified columns are ignored.

COLSPEC may be a column title, a /regex/ or m{regex} matching one or more titles, a column letter code (A, B, etc.), or an identifer as described in Spreadsheet::Edit.

--id-columns COLSPEC[,COLSPEC ...]

Specify columns which together uniquely identify records (i.e. rows).

Before running any "diff" operation, the data rows in each file are sorted on the specified columns (comparison is alphabetic), so that corresponding rows are in the same relative vertical position in each file. This makes it more likely that a change is detected as a CHANGE and not as confusing separate DELETE and INSERT.

--no-sort-rows disables this sorting so you can pre-sort the data yourself (for example, if "id columns" should be sorted numerically).

When using the default "native" diff algorithm, the original row numbers in each file are displayed in the results, hiding the effects of sorting.

The "native" algorithm always displays the values of the "id" columns in changed rows.

The following options apply only to the 'native' method:

--always-show COLSPEC[,COLSPEC ...]

When a changed row is displayed the changed cells are shown plus all cells implied by --id-columns whether changed or not.

The --always-show option supplies an alternative set of columns to always show instead of those given by --id-columns. The input data is still sorted using --id-columns.

Gory internals of the native diff method

The Diff algorithm is first applied using only the id-columns; matching rows are assumed to correspond, and the other columns are then compared and a "change" is reported if there are differences. If --id-columns is not used, the Diff algorithm is applied using all columns, and can get temporarily out of sync; if a row was inserted or deleted adjacent to other rows which were merely changed, the result can be a string of "changed" reports which actually describe pairs of unrelated records.

For even more control, see the --hashid-func option.

--hashid-func PERLCODE

--hash-func PERLCODE

These options allow arbitrary filtering of row data before use. PERLCODE must be an anonymous sub declaration which is called with two arguments for each row:

  ([cells], $row_index)

and must return a string representing those values, or undef to ignore the row.

Both default to

  sub { join ",", @{$_[0]} }  # concatenate all cells separated by commas

During the Diff algorithm, hashid-func is used to identify pairs of rows which represent the same data record, and then hash-func is used to determine if a pair has reportable changes.

Specifically: hashid-func is called for each row passing only values from --id-column (or if not specified then all columns). If undef is returned then that row is ignored.

If the same string is returned for a pair of rows from the two files then any reportable differences are shown as a "change" to the record. hash-func is later called to determine if there are reportable changes.

If a unique string is returned by hashid_func, i.e. there is no corresponding row in the other file, then that row will be reported as "deleted" or "inserted" and hash-func is not used for that row.

When hash-func is called it is passed all cells in a row (or those specified with --columns). If the result is different for corresponding rows then a "change" is reported.

Argument order: When called with a row from the first file, cell values are passed in their natural order; when called with a row from the second file, columns which also exist in the first file are passed first, in the order they appear in the first file, followed by any columns which exist only in the second file.

If there are no titles then all columns are passed in their natural order.

The user-defined PERLCODE subs are compiled into the same package.

--quote-char CHARACTER (default is ")

--sep-char CHARACTER (default is ,)

--encoding ENCODING (default is UTF-8)

Used when reading CSV files (see Text::CSV). The same options are applied to both input files.

--quiet

--verbose

--debug

--keep-temps

-h --help

Probably what you expect.

SEE ALSO

Spreadsheet::Edit, Sreadsheet::Edit::IO

AUTHOR

Jim Avera (jim.avera gmail)

LICENSE

CC0 1.0 / Public Domain