The London Perl and Raku Workshop takes place on 26th Oct 2024. If your company depends on Perl, please consider sponsoring and/or attending.

NAME

Spreadsheet::Compare - Module for comparing spreadsheet-like datasets

SYNOPSIS

  use Spreadsheet::Compare;
  my $cfg = {
      type     => 'CSV',
      title    => 'Test 01',
      files    => ['left.csv', 'right.csv'],
      identity => ['RowId'],
  }
  Spreadsheet::Compare->new(config => $cfg)->run();

  or

  Spreadsheet::Compare->new->config($cfg)->run;

DESCRIPTION

Spreadsheet::Compare analyses differences between two similar record sets. It is designed to be used for regression testing of a large number of files, databases or spreadsheets.

The record sets can be read from a variety of different sources like CSV files, fixed column input, databases, Excel or Open/Libre Office Spreadsheets.

The differences can be saved in XLSX or HTML format and are visually highlighted to allow fast access to the relevant parts.

Configuration of a single comparison, sets of comparisons or whole suites can be defined as YAML configuration files in order to persist a setup that can be run multiple times.

Spreadsheet::Compare is the central component normally used by executing the commandline utility spreadcomp with a configuration file. It feeds the relevant parts of the configuration data to the reader, compare and reporting classes.

Reading is done by subclasses of Spreadsheet::Compare::Reader, The actual comparison is done by Spreadsheet::Compare::Single while the reporting output is handled by subclasses of Spreadsheet::Compare::Reporter.

ATTRIBUTES

    $sc = Spreadsheet::Compare->new;

All attributes will return the Spredsheet::Compare object when called as setter to allow method chaining.

config

    $sc->config($cfg);
    my $conf = $sc-config;

Get/Set the configuration for subsequent calls to run(). It hast to be either a hash reference with a single comparison configuration, a reference to an array with multiple configurations or the path to a YAML file.

errors (readonly)

    say "found error: $_" for $sc->run->errors->@*;

Returns a reference to an array of error messages. These are errors that prevented a single comparison from being executed.

exit_code (readonly);

    my $ec = $sc->run->exit_code;

log_level

    $sc->log_level('INFO');
    my $lev = $sc->log_level;

The global log level (valid are the strings TRACE, DEBUG, INFO, WARN, ERROR or FATAL). Default is no logging at all. The level can also be set with the environment variable SPREADSHEET_COMPARE_DEBUG. Using the attribute has precedence.

For a single comparison the log level can also be set with the log_level option.

quiet

    $sc->quiet(1);
    my $is_quiet = $sc->quiet;

Suppress the line counter when using "stdout".

result (readonly)

    my $res = $sc->run->result;
    say "$_ found $res->{$_}{diff} differences"  for sort keys %$res;

The result is a reference to a hash with the test titles as keys and the comparison counters as result.

    {
        <title1> => {
            add   => <number of additional records on the right>,
            diff  => <number of found differences>,
            dup   => <number of duplicate rows (maximum of left and right)>,
            left  => <number of records on the left>,
            limit => <number of record with differences below set ste limits>,
            miss  => <number of records missing on the right>,
            right => <number of records on the right>,
            same  => <number of identical records>,
        },
        <title2> => {
        ...
    }

stdout

    $sc->stdout(1);
    my $use_stdout = $sc->stdout;

Report progress and results to stdout.

METHODS

Spreadsheet::Compare is a Mojo::EventEmitter and additionally implements the following methods.

run

Run all defined comparisons. Returns the object itself. Throws exceptions on global errors. Exceptions during a single comparison are trapped and will be accessible via the "errors" attribute.

CONFIGURATION

A configuration can contain one or more comparison definitions. They have to be defined as a single hashref or a reference to an array of hashes, each hash defining one comparison. The keys of the hash specify the names of the options. Options that are common to all specified comparisons can be given in a special hash with title set to __GLOBAL__.

An example for a very basic configuration with 2 CSV comparisons:

    [
      {
        title           => '__GLOBAL__',
        type            => 'CSV',
        rootdir         => 'my/data/dir',
        reporter        => 'XSLX',
        report_filename => '%{title}.xlsx',
      },
      {
        title   => 'all defaults',
        files   => [
          'left/simple01.csv',
          'right/simple01.csv'
        ],
        identity => '[A]',
      },
      {
        title => 'semicolon separator',
        files => [
          'left/simple02.csv',
          'right/simple02.csv',
        ],
        identity => '[A]',
        limit_abs => {
          D => '0.1',
          B => '1',
        },
        ignore => [
          'Z',
        ],
        csv_options => {
          sep_char => ';',
        },
      },
    ];

or as YAML config file

    ---
    - title: __GLOBAL__
      type: CSV
      rootdir : my/data/dir
      reporter: XLSX
      report_filename: %{title}.xlsx,
    - title : all defaults
      files :
        - left/simple01.csv
        - right/simple01.csv
      identity: '[A]'
    - title: semicolon separator
      files:
        - left/simple02.csv
        - right/simple02.csv
      identity: '[A]'
      ignore:
        - Z
      limit_abs:
        D: 0.1
        B: 1
      csv_options:
        sep_char: ';'

To define a suite of compare batches, a central config using the "suite" option can be used.

To save unneccesary typing, configuration values can contain references to either environment variables or other single configuration values. To use environment variables refer to them as ${<VARNAME>}, for configuration references use %{<OPTION>}. This is a simple search and replace mechanism, so don't expect fancy things like referencing non scalar options to work.

OPTION REFERENCE

The following configuration options can be used to define a comparison. Options for Readers are documented in Spreadsheet::Compare::Reader or it's specific modules:

The the reporting options are documented in Spreadsheet::Compare::Reporter or it's specific modules:

left

  possible values: <string>
  default: 'left'

A descriptive name for the left side of the comparison

log_file

The file to write logs to.

log_level

  possible values: TRACE|DEBUG|INFO|WARN|ERROR|FATAL
  default: undef

The log level for the current comparison. To set this globally, use the -d option when using spreadcomp or set the environment variable SPREADSHEET_COMPARE_DEBUG. This option takes precedence over the global options.

reporter

  possible values: <string>
  default: 'None'

Choose a reporter module (e.g. XLSX or HTML)

  possible values: <string>
  default: 'right'

A descriptive name for the right side of the comparison

rootdir

  possible values: <string>
  default: 0

Root directory for configuration, report or input files. Options containing relative file names will be interpreted as relative to this directory.

suite

  possible values: <list of filenames>
  default: undef

Use a list of config files. The configurations will inherit all settings defined in the master config file. For an example please have a look at the tests from the t directory of the distribution.

summary

  possible values: <string>
  default: undef

Reporter engine for the summary. This will mostly be set in a global section, but can be used to pick only selected comparisons for inclusion into the summary. You could even specify a different summary engine for each comparison.

summary_filename

  possible values: <string>
  default: undef

Specify a summary filename. If not specified the filename will be derived from the filename of the configuration file or the name of the calling program.

title

  possible values: <string>
  default: ''

The title of the current comparison. If not set, the title will be the title of the current suite file or 'Untitled', followed by the current comparison count.

EVENTS

Spreadsheet::Compare is a Mojo::EventEmitter and emits the same events as Spreadsheet::Compare::Single (see "EVENTS" in Spreadsheet::Compare::Single).

WARNING The events from Spreadsheet::Compare will have the title of the single comparison as an additional second parameter. So instead of:

    $single->on(final_counters => sub ($obj, $counters) {
        # code
    });

it will be:

    $sc->on(final_counters => sub ($obj, $title, $counters) {
        # code
    });

In addition it emits the following events:

report_finished

    $sc->on(report_finished => sub ($obj, $title, $reporter_class, $report_filename) {
        say "finished writing $report_filename";
    });

Emitted after a single report is finished by the reporter

summary

    require Data::Dumper;
    $sc->on(summary => sub ($obj, $title, $counters) {
        say "next fetch for $title:", Dumper($counters);
    });

Emitted after a single comparison

LIMITING MEMORY USAGE

Per default Spreadsheet::Compare will load all records into memory before comparing them. This maybe not the best approach in cases where the number of records is very large and may not even fit into memory (which will terminate the perl interpreter).

There are two ways to handle these situations:

Sorted Data with option "fetch_size" in Spreadsheet::Compare::Reader

The option "fetch_size" in Spreadsheet::Compare::Single can be used to limit the number of records that are read into memory at a time (for each side, so the read number is twice the number given). For that to work the records have to sorted by the configured "identity" in Spreadsheet::Compare::Reader and "is_sorted" in Spreadsheet::Compare::Single have to be set to a true value. Possibly overlapping identity values at the borders of a fetch will be handled correctly.

Chunking

When sorting is not an option, memory usage can be reduced by sacrificing speed using the "chunk" option with Readers that support this (e.g. CSV and FIX, it is not implemented for the DB reader because sorting by identity can be done in SQL).

Chunking means that the the data will be read twice. First to determine the chunk name for the record and keeping only that and the record location for reference. In the second pass data will be read one chunk at a time. Since the chunk naming can be freely configured (e.g with regular expressions) the possibilities for limiting the chunk size are numerous.

See "chunk" in Spreadsheet::Compare::Reader for examples.

Limiting the number of records for testing configuration options

While testing the configuration of a comparison (for example finding the correct identity, the columns you want to ignore or fine tune the limits), always comparing whole data sets can be tedious. For sorted data this can be achieved with the option "fetch_limit" in Spreadsheet::Compare::Single limiting the number of fetches.

For unsorted data it is best to simply use selected subsets of the data.