NAME

Weather::GHCN::StationTable - collect station objects and weather data

VERSION

version v0.0.011

SYNOPSIS

  use feature 'say';
  use Weather::GHCN::StationTable;

  my $ghcn = Weather::GHCN::StationTable->new;

  my ($opt, @errors) = $ghcn->set_options(
    country     => 'US',
    state       => 'NY',
    location    => 'New York',
    report      => 'yearly',
  );
  die @errors if @errors;

  $ghcn->load_stations;

  # generate a list of the stations that were selected
  say $ghcn->get_stations( kept => 1 );

  if ($opt->report) {
      say $ghcn->get_header;

      $ghcn->load_data();
      $ghcn->summarize_data;

      say $ghcn->get_summary_data;
      say $ghcn->get_footer;
  }

DESCRIPTION

The Weather::GHCN::StationTable module provides a class that is used to fetch station information from the NOAA Global Historical Climatology Network database, along with temperature and/or precipitation records from the daily historical records.

For a more comprehensive example than the above Synopsis, see the section EXAMPLE PROGRAM.

Caveat emptor: incompatible interface changes may occur on releases prior to v1.00.000. (See VERSIONING and COMPATIBILITY.)

The module is primarily for use by the Weather::GHCN::Fetch module.

FIELDS (read-only)

opt_obj

Returns a reference to the Options object created by set_options.

opt_href

Returns a reference to a hash of the Options created by set_options.

profile_file

Returns the name of the profile file, if one was passed to set_options.

profile_href

Returns a reference to a hash containing the profile options set by set_options (if any).

stn_count

Returns a count of the total number of stations found in the station list.

stn_selected_count

Returns a count of the number of stations that were selected for processing.

stn_filtered_count

Returns a count of the number of stations that were selected for processing, excluding those rejected due to errors or other criteria.

missing_href

Returns a hash of the missing months and days for the selected data.

FIELDS (read or write)

return_list(<bool>)

For API use. By default, get methods return a tab-separated string of results. If return_list is set to a perl true value, then these methods will return a list (or list of lists). If no argument is given, the current value of return_list is returned.
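As a rough illustration of the two return shapes, the sketch below converts tab-separated text into a list of lists in plain Perl. The sample text and the helper name text_to_rows are made up for illustration and are not part of the module. When return_list is true the get methods hand back lists directly, so a conversion like this is only relevant to the default string form.

```perl
use strict;
use warnings;

# Hypothetical sample of what a get method might return by default:
# lines of text with tab-separated columns.
my $text = join "\n",
    join( "\t", qw(Year TMAX TMIN) ),
    join( "\t", 2020, 14.2, 5.1 ),
    join( "\t", 2021, 14.6, 5.4 );

# Split the text into lines, then each line into its columns.
# (text_to_rows is an illustrative helper, not a module method.)
sub text_to_rows {
    my ($tsv) = @_;
    return map { [ split /\t/, $_ ] } split /\n/, $tsv;
}

my @rows = text_to_rows($text);
printf "%d rows, first heading is %s\n", scalar @rows, $rows[0][0];
```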

stnid_filter_href(\%stnid_filter)

For API use. With no argument, the current value is returned. If an argument is given, it must be a hash reference the keys of which are the specific station id's you want to fetch and process. When this is used, many filtering options set via set_options will be overridden; e.g. country, state, location etc.

METHODS

new ()

Create a new StationTable object.

flag_counts ()

The load_stations() and load_data() methods may reject a station or a particular data entry due to quality or other issues. These decisions are kept in a hash field, and a reference to that hash is returned by this method. The caller can then report the values.

get_flag_statistics ( list => 0, no_header => 0 )

Gets a header row and summary table of data points that were kept and rejected, along with counts of QFLAGS (quality flags). Returns tab-separated text, or a list if the list argument is true. A heading line is provided unless no_header is true.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

argument: no_header => <bool>

If the arguments include the 'no_header' keyword and a true value, then the return value will not include a header line. Default is false.

get_footer( list => 0 )

Get a footing section with explanatory notes about the output data produced by detail and summary reports.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

get_hash_stats ( list => 0, no_header => 0 )

Gets the hash sizes collected during the execution of StationTable methods, notably load_stations and load_data, as tab-separated lines of text.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

argument: no_header => <bool>

If the arguments include the 'no_header' keyword and a true value, then the return value will not include a header line. Default is false.

get_header ( list => 0 )

The weather data obtained by the load_data() method is essentially a table. Which columns are returned depends on various options. For example, if report => 'monthly' is given, then the key columns will be year and month -- no day. If the precip option is given, then extra columns are included for precipitation values.

This variability makes it difficult for a consumer of these modules to emit a heading that matches the underlying columns. The purpose of this method is to return a set of column headings that will match the data. The value returned is a tab-separated string.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

get_missing_data_ranges( list => 0, no_header => 0 )

Gets a list, by station id and year, of any months or day ranges when data was found to be missing. Missing data can lead to incorrect interpretation and can cause a station to be rejected if the percent of found data does not meet the -quality threshold (normally 90%).

Returns a heading line followed by lines of tab-separated strings.
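The quality threshold mentioned above can be pictured with a small calculation. This is only a sketch: the module's exact formula is not described here, so the code assumes the percentage is simply days found divided by days expected. The function names percent_found and meets_quality are hypothetical.

```perl
use strict;
use warnings;

# Percentage of expected daily observations that were actually found.
# ASSUMPTION: found / expected; the module may compute this differently.
sub percent_found {
    my ($found_days, $expected_days) = @_;
    return 0 if !$expected_days;
    return 100 * $found_days / $expected_days;
}

# True when the station-year meets the threshold (90 by default).
sub meets_quality {
    my ($found_days, $expected_days, $threshold) = @_;
    $threshold //= 90;
    return percent_found( $found_days, $expected_days ) >= $threshold;
}

# 340 of 365 days is about 93%; 300 of 365 days is about 82%.
print meets_quality( 340, 365 ) ? "kept\n" : "rejected\n";
print meets_quality( 300, 365 ) ? "kept\n" : "rejected\n";
```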

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list of lists (stations containing years) is returned rather than tab-separated lines of text. Defaults to false.

argument: no_header => <bool>

If the arguments include the 'no_header' keyword and a true value, then the return value will not include a header line. Default is false.

option: report <daily|monthly|yearly|id>

Determines the number and content of heading values.

datarow_as_hash ( $row_aref )

This is a convenience method that may be used to convert table rows returned by the row_sub callback subroutine of load_data from a perl list into a hash. It automatically calls get_header to get the headers for the table data. When you pass it a reference to a data row (obtained via the row_sub callback routine given to load_data) it combines the elements of the data row list with the column headings and returns a hash.
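The pairing of headings with row values can be sketched in plain Perl with a hash slice. This is not the module's implementation. The headings and data row below are invented, and row_to_hash is a hypothetical stand-in that takes the headings explicitly rather than calling get_header.

```perl
use strict;
use warnings;

# Pair each column heading with the value in the same position.
sub row_to_hash {
    my ($headings_aref, $row_aref) = @_;
    my %row;
    @row{ @$headings_aref } = @$row_aref;    # hash slice assignment
    return %row;
}

# Hypothetical headings and data row, for illustration only.
my @headings = qw(Year Month TMAX TMIN);
my @datarow  = ( 2021, 7, 27.3, 16.8 );

my %row = row_to_hash( \@headings, \@datarow );
print "TMAX for $row{Year}-$row{Month}: $row{TMAX}\n";
```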

get_missing_rows( list => 0 )

In support of a -nogaps option, to generate detail output that does not have any gaps due to missing data, this method gets a list of rows for the months and days that had missing data for a given station id in a given year.

Returns lines of tab-separated strings.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

option: nogaps

Emits extra rows after the detail data rows to make up for missing months or days. This is primarily so that if the data is charted by date, then the x-axis will have all the dates from start to finish. Otherwise, the chart and any trends that are projected on it will be distorted by the missing data.

get_options ( list => 0, no_header => 0 )

Get text which shows the options that were in effect for this processing run, in a Getopt style. Includes a heading and a footing with explanatory notes. If argument 'list' is true, returns the lines as a list. Line [1] contains the options string.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

argument: no_header => <bool>

If the arguments include the 'no_header' keyword and a true value, then the return value will not include a header line or the explanatory footing notes. Default is false.

get_stations ( list => 0, kept => 1, no_header => 0 )

Return lines of text with tab-separated columns describing each of the stations that were found to meet the filtering criteria specified in the user options.

argument: kept => <bool>

If the argument kept => 0 is specified, and load_data has already been invoked, then the stations which were rejected due to quality flags or missing data will be returned. If kept => 1 is specified, then the stations that were kept will be returned.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

argument: no_header => <bool>

If the arguments include the 'no_header' keyword and a true value, then the return value will not include a header line. Default is false.

get_station_note_list ()

Return a list consisting of tab-separated code/description pairs that rejected stations were flagged with; i.e. the reasons for their rejection.

get_summary_data ( list => 0 )

Gets a list of the summarized temperature or precipitation data by day, month or year depending on the report option.

Returns undef if the report option is 'detail'.

The actual columns that are returned are dictated by the report option and by the tavg and precip options provided when the object was instantiated by new().

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

option: report <daily|monthly|yearly>

Determines the level of summarization.

option: range <rangelist>

If the range option is provided, the output rows are restricted to those years that are within the specified range(s).
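To make the effect of a range restriction concrete, here is a minimal sketch of filtering years against a range list. The exact syntax of <rangelist> is not spelled out in this section, so the code assumes comma-separated entries that are either a single year or a start-end pair such as 1950-1960. The function names parse_rangelist and year_in_ranges are hypothetical.

```perl
use strict;
use warnings;

# Parse a range list such as '1950-1960,2000' into [start, end] pairs.
# ASSUMPTION: entries are 4-digit years or year-year pairs.
sub parse_rangelist {
    my ($rangelist) = @_;
    my @ranges;
    for my $part ( split /\s*,\s*/, $rangelist ) {
        my ( $start, $end ) = $part =~ /\A(\d{4})(?:-(\d{4}))?\z/
            or die "invalid range '$part'\n";
        push @ranges, [ $start, $end // $start ];
    }
    return @ranges;
}

# True if the year falls inside any of the ranges.
sub year_in_ranges {
    my ( $year, @ranges ) = @_;
    return scalar grep { $year >= $_->[0] && $year <= $_->[1] } @ranges;
}

my @ranges = parse_rangelist('1950-1960,2000');
my @kept   = grep { year_in_ranges( $_, @ranges ) } 1945 .. 2005;
print "@kept\n";
```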

get_timing_stats ( list => 0 )

Get a list of the timers, with durations and notes, in alphabetical order by timer label.

argument: list => <bool>

If the arguments include the 'list' keyword and a true value, then a list is returned rather than tab-separated lines of text. Defaults to false.

has_missing_data ()

Returns true if any missing data was detected amongst the stations that were processed. The calling script can use this to decide whether to issue a warning to the user. A list of missing data specifics can be sent to the output by calling method get_missing_data_ranges.

load_data ( progress_sub => undef, row_sub => sub { say @_ } )

Load the daily weather data for each of the stations that were loaded into the collection. Print the data if option report detail is given. Otherwise cache the data so it can be aggregated at a later step.

argument: progress_sub => undef

As fetching and parsing each daily data page can take some time, an optional callback hook is provided so the caller can emit a progress message before each station's data is loaded; e.g. progress_sub => sub { say {*STDERR} @_ }.

argument: row_sub => sub { say @_ }

Optional callback hook to allow the caller to provide their own subroutine for printing (or collecting in a list, or both) the row-level station data that is fetched when the report option is 'detail'. Defaults to printing via the 'say' operator.

option: report <id|daily|monthly|yearly>

When report detail is specified, the weather data for each station is printed immediately (via the row_sub callback hook).

For all other report options, the data is fetched from each station and kept in a cache so that it can be aggregated by invoking summarize_data(). The row_sub hook is not invoked.
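The callback convention used by load_data can be tried out without fetching any data. The sketch below uses a stand-in function, fake_load_data, which is not part of the module and only mimics the two hooks: an optional progress callback invoked before each station, and a row callback that receives a reference to each data row. The station ids are borrowed from the alias examples in this document and the data values are invented.

```perl
use strict;
use warnings;

# Stand-in for a loader that accepts the two callback hooks.
# This is NOT the module's load_data; it only mimics the calling convention.
sub fake_load_data {
    my (%args) = @_;
    my $progress_sub = $args{progress_sub};    # optional
    my $row_sub      = $args{row_sub}
        // sub { print join( "\t", @{ $_[0] } ), "\n" };    # default: print

    # Made-up data: one row per station id.
    my %data = (
        CA006106000 => [ 2021, 7, 1, 27.3 ],
        CA006105976 => [ 2021, 7, 1, 26.9 ],
    );

    for my $stnid ( sort keys %data ) {
        $progress_sub->("loading $stnid") if $progress_sub;
        $row_sub->( [ $stnid, @{ $data{$stnid} } ] );
    }
    return;
}

# Collect rows and progress messages instead of printing them.
my ( @messages, @rows );
fake_load_data(
    progress_sub => sub { push @messages, @_ },
    row_sub      => sub { push @rows, $_[0] },
);
print scalar(@rows), " rows collected, ", scalar(@messages), " progress messages\n";
```

Collecting rows in an array, as here, keeps output under the caller's control. The rows can then be sorted, filtered or formatted before anything is printed.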

load_stations ( content => undef )

Read the GHCN stations list and the stations inventory list and create a hash of Station objects, keyed on station id, filtered according to the options provided in set_options().

Returns a hash of Weather::GHCN::Station objects, keyed on station id.

argument: content => undef

Normally, load_stations will fetch the stations list from the cache, or if not in the cache (or it's stale) then from the GHCN web repository. However, an API caller might want to obtain the station text by some other means, in which case it can use the optional 'content' keyword argument to just pass in a scalar containing the station text.

option: country <str>

Selects only those stations that match the 2-character GEC (formerly FIPS) country code or that uniquely match the name or partial name given in <str>.

option: state <code>

Selects only those stations that match a US state or Canadian province code.

option: location <str>

Selects only those stations with a name that matches the specified pattern, which can be either a station id, or a comma-separated list of station id's, or a regex. If a regex, then it is anchored on the left and whitespace is NOT ignored.
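The three accepted forms of the location pattern can be sketched as follows. This is an approximation under stated assumptions, not the module's code: the station id shape (2 letters followed by 9 letters or digits) is inferred from ids such as CA006106000, the case-insensitive match is inferred from the Synopsis using 'New York', and location_matches is a hypothetical helper. The sample station id and name are for illustration only.

```perl
use strict;
use warnings;

# ASSUMED shape of a GHCN station id: 2 letters, then 9 letters or digits.
my $STNID_RE = qr/\A[A-Z]{2}[A-Z0-9]{9}\z/;

# Decide whether a station matches the location pattern.
sub location_matches {
    my ( $pattern, $stnid, $name ) = @_;

    my @parts = split /,/, $pattern;
    if ( @parts && !grep { $_ !~ $STNID_RE } @parts ) {
        # every part looks like a station id, so match on the id
        return scalar grep { $_ eq $stnid } @parts;
    }

    # otherwise treat the pattern as a regex anchored on the left;
    # whitespace in the pattern is significant
    return $name =~ /\A$pattern/i ? 1 : 0;
}

# Illustrative station id and name.
my @station = ( 'USW00094728', 'NEW YORK CNTRL PK TWR' );
print location_matches( 'New York', @station ), "\n";    # anchored match
print location_matches( 'York',     @station ), "\n";    # not at the start
```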

option: gps <latitude,longitude>

This option selects stations within a certain radius of the designated latitude and longitude, expressed as positive and negative numbers (not using N, S, W, E designators).

option: radius <int>

In conjunction with the gps option, determines the radius in kilometers for the search area. Defaults to 25 km.
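A radius search of this kind can be sketched with the haversine formula. The module's actual distance calculation is not described here, so haversine is a substituted technique. The function names to_rad, distance_km and within_radius, and the sample stations, are made up for illustration.

```perl
use strict;
use warnings;

my $PI              = 4 * atan2( 1, 1 );
my $EARTH_RADIUS_KM = 6371;                # mean Earth radius

sub to_rad { return $_[0] * $PI / 180 }

# Great-circle distance in kilometers between two points given in
# signed decimal degrees (south and west are negative).
sub distance_km {
    my ( $lat1, $lon1, $lat2, $lon2 ) = @_;
    my ( $p1, $p2 ) = ( to_rad($lat1), to_rad($lat2) );
    my $dlat = $p2 - $p1;
    my $dlon = to_rad( $lon2 - $lon1 );

    my $h = sin( $dlat / 2 )**2 + cos($p1) * cos($p2) * sin( $dlon / 2 )**2;
    return 2 * $EARTH_RADIUS_KM * atan2( sqrt($h), sqrt( 1 - $h ) );
}

# Keep the stations that lie within $radius_km of the given point.
sub within_radius {
    my ( $lat, $lon, $radius_km, @stations ) = @_;
    $radius_km //= 25;                     # same default as the radius option
    return grep { distance_km( $lat, $lon, $_->{lat}, $_->{lon} ) <= $radius_km } @stations;
}

# Hypothetical stations around the point 45.42, -75.70
my @stations = (
    { id => 'NEAR', lat => 45.50, lon => -75.70 },    # roughly 9 km north
    { id => 'FAR',  lat => 45.50, lon => -73.57 },    # well over 100 km east
);
my @nearby = within_radius( 45.42, -75.70, 25, @stations );
print join( ',', map { $_->{id} } @nearby ), "\n";
```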

option: gsn

Select only stations in the GCOS Surface Network, a baseline network comprising a subset of about 1000 stations chosen mainly to give a fairly uniform spatial coverage from places where there is a good length and quality of data record. See https://www.ncdc.noaa.gov/gosic/global-climate-observing-system-gcos/gcos-surface-network-gsn-program-overview

report_kml( list => 0 )

Output the coordinates of the station collection in KML format, for import into Google Earth as placemarks. The active range of each station will be included as timespans so that you can view the placemarks across time.

argument: list

If the argument list contains the 'list' keyword and a true value, then a perl list is returned. Otherwise, a string consisting of lines of text is returned.

option: kml

Print KML on stdout.

option: kmlcolor <str>

A color name, one of blue, green, azure, purple, red, white or yellow. Only the first character is recognized, so 'b' and 'bob' both result in blue. All colors are given an opacity of 50 (the range is 00 to ff).
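The first-character rule can be illustrated with a small lookup table. This is a sketch rather than the module's code. The helper name resolve_kmlcolor is hypothetical, and treating upper and lower case alike is an assumption.

```perl
use strict;
use warnings;

# The color names listed above, keyed by their (unique) first letter.
my %COLOR_BY_INITIAL = map { ( substr( $_, 0, 1 ) => $_ ) }
    qw(blue green azure purple red white yellow);

# Resolve a user-supplied value to a color name using only its first character.
# ASSUMPTION: the comparison ignores case.
sub resolve_kmlcolor {
    my ($value) = @_;
    return undef if !defined $value || $value eq '';
    return $COLOR_BY_INITIAL{ lc substr( $value, 0, 1 ) };
}

print resolve_kmlcolor('b'),   "\n";
print resolve_kmlcolor('bob'), "\n";
```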

report_urls( list => 0, curl => 0 )

Output the URL of the .dly (daily weather data) file for each of the stations that meet the selection criteria.

argument: list

If the argument list contains the 'list' keyword and a true value, then a perl list is returned. Otherwise, a string consisting of lines of text is returned.

argument: curl

If the argument list contains the 'curl' keyword and a true value, then the output will be a set of lines that can be saved in a file for subsequent input to the curl program using the -K option. This facilitates bulk fetching of .dly files into the cache.
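For readers unfamiliar with curl's -K option, the sketch below shows the general shape of a curl config file, which accepts long option names such as url and output written as name = "value". The exact lines report_urls emits when curl is true are not shown in this document, so this is only a plausible illustration. The base URL is a placeholder and curl_config_lines is a hypothetical helper.

```perl
use strict;
use warnings;

# Placeholder base URL; substitute the real location of the .dly files.
my $BASE_URL = 'https://example.org/ghcn/daily/all';

# Build curl config-file lines (for curl -K) that fetch one .dly file
# per station id into the given directory.
sub curl_config_lines {
    my ( $cachedir, @stnids ) = @_;
    return map {
        ( qq{url = "$BASE_URL/$_.dly"}, qq{output = "$cachedir/$_.dly"} )
    } @stnids;
}

print "$_\n" for curl_config_lines( 'c:/ghcn_cache', qw(CA006106000 CA006106001) );
```

The printed lines could be saved to a file, say stations.cfg (a made-up name), and passed to curl with curl -K stations.cfg.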

($opt, @errors) = set_options ( %args )

Set various options for this StationTable instance. These options will affect the processing and output by subsequent method calls.

Returns an Options object and a list of errors. You should check @errors after calling set_options, report any errors and cease processing if there are any; e.g. die @errors if @errors.

You may want to set up a file-scoped lexical variable to hold the options object. That way it is accessible throughout your code. The typical calling pattern would look like this:

    my $Opt;  # a file-scope lexical

    sub run (@args) {
        my $ghcn = Weather::GHCN::StationTable->new;

        my @errors;
        ($Opt, @errors) = $ghcn->set_options(...);
        die @errors if @errors;
        ...
    }

timing_stats => $TimingStats_obj

This optional argument should point to a TimingStats object that was created by the caller and will be used to collect timing statistics.

hash_stats => \%hash_stats

This optional argument should be a reference to a hash that was created by the caller and will be used to collect performance and memory statistics.

summarize_data ()

Aggregate the daily weather data for the stations that were loaded, according to the report option.

option: report => 'daily|monthly|yearly'

When the report option is 'detail', no summarization is needed and the method immediately returns undef.

tstats ()

Provides access to the TimingStats object so the caller can start and stop script-level timers.

DOES

Defined by Object::Pad. Included for Pod::Coverage.

META

Defined by Object::Pad. Included for Pod::Coverage.

EXAMPLE PROGRAM

  use feature 'say';
  use Weather::GHCN::StationTable;

  my $ghcn = Weather::GHCN::StationTable->new;

  my ($opt, @errors) = $ghcn->set_options(
    cachedir    => 'c:/ghcn_cache',
    refresh     => 'always',
    country     => 'US',
    state       => 'NY',
    location    => 'New York',
    active      => '2000-2022',
    report      => 'yearly',
  );

  die @errors if @errors;

  $ghcn->load_stations;

  my @rows;
  if ($opt->report) {
      say $ghcn->get_header;

      # this also prints detailed station data if $opt->report eq 'detail'
      $ghcn->load_data(
        # set a callback routine for printing progress messages
        progress_sub => sub { say {*STDERR} @_ },
        # set a callback routine for capturing data rows when report => 'detail'
        row_sub      => sub { push @rows, $_[0] },
      );

      # these only do something when $opt->report ne 'detail'
      $ghcn->summarize_data;
      say $ghcn->get_summary_data;

      say '';
      say $ghcn->get_footer;

      say '';
      say $ghcn->get_flag_statistics;
  }

  # print data rows collected by row_sub callback (when report => 'detail')
  foreach my $row_aref (@rows) {
      say join "\t", $row_aref->@*;
  }

  say '';
  say $ghcn->get_stations( kept => 1 );

  say '';
  say 'Stations that failed to meet range or quality criteria:';
  say $ghcn->get_stations( kept => 0, no_header => 1 );

  if ( $ghcn->has_missing_data ) {
      warn '*W* some data was missing for the stations and date range processed' . "\n";
      say '';
      say $ghcn->get_missing_data_ranges;
  }

  say $ghcn->get_options;

  say $ghcn->get_timing_stats;

  say $ghcn->get_hash_stats;

  say $ghcn->report_kml if $opt->kml;

OPTIONS

StationTable supports almost all the options documented in ghcn_fetch. The only options not supported are ones that are listed in the Command-Line Only Options section of the POD, namely: -help, -usage, -readme, -gui, -optfile, and -outclip.

Options can be passed directly to the API via set_options().

Options can also be defined in a file (called a profile) that will be loaded at runtime and merged with the options passed to set_options.

Options passed to set_options() must be defined as a perl hash structure. See ghcn_fetch -help for a list of all options in Getopt::Long format. Simply translate the option to a hash key/value pair. For example, -report detail becomes report => 'detail'.

Options defined in a profile file must be defined using YAML (see below).

Aliases

Aliases are a convenience feature that allows you to define mnemonic shortcuts for specific stations. GHCN station id's (like CA006106000) are difficult to remember and type, as are GHCN station names. Frequently-used station id's can be given easier alias names that can be used in the -location option for precise and reliable data retrieval.

The entries within the aliases hash are simply keyword/value pairs that represent the mnemonic alias name and the station id (or id's) that are to be retrieved when that alias is used in -location.
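Conceptually, using an alias in -location amounts to a hash lookup followed by a split on commas. The sketch below illustrates that idea. It is not the module's implementation, and expand_location is a hypothetical helper.

```perl
use strict;
use warnings;

my %aliases = (
    yow => 'CA006106000,CA006106001',    # Ottawa airport
    cda => 'CA006105976,CA006105978',    # Ottawa (CDA and CDA RCS)
);

# Return the station ids for an alias, or the location unchanged
# when it is not a known alias.
sub expand_location {
    my ( $location, $aliases_href ) = @_;
    my $ids = $aliases_href->{$location};
    return defined $ids ? split( /,/, $ids ) : $location;
}

print join( ' ', expand_location( 'yow', \%aliases ) ), "\n";
```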

Aliases can only be defined via the set_options API or in a profile file. There is no command-line option for defining them.

YAML Example

This is what the YAML content for a typical profile file would look like:

    ---
    cachedir: C:/ghcn_cache_new

    aliases:
        yow: CA006106000,CA006106001    # Ottawa airport
        cda: CA006105976,CA006105978    # Ottawa (CDA and CDA RCS)

Hash Example

Here's what the options would look like as a hash passed to the set_options API call:

    my %options = (
        cachedir => 'C:/ghcn_cache',
        aliases => {
            yow => 'CA006106000,CA006106001',    # Ottawa airport
            cda => 'CA006105976,CA006105978',    # Ottawa (CDA and CDA RCS)
        }
    );

    my $ghcn = Weather::GHCN::StationTable->new();
    $ghcn->set_options(%options);

VERSIONING and COMPATIBILITY

The version number scheme used for this module consists of a 3-part dot-delimited string such as v0.0.003. This format was chosen for compatibility with Dist::Zilla version support, so that all modules in GHCN will get the same version number upon release. See also https://metacpan.org/pod/version.

The first digit of the string is the major release number, and the second is the minor release number. With the exception of v0.0 releases, which should be considered experimental pre-production versions, the interface is intended to be upward compatible within a set of releases sharing the same major release number. If an incompatible change becomes necessary, the major release number will be incremented.

An increment to the minor release number indicates significant new functionality, which usually means new APIs and options. However, it should be upward compatible with the prior release.
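A caller that wants to act on this scheme could inspect the version with the core version module. The sketch below is one way to do that and is not something the module provides. The version string is hard-coded for illustration, and the variable names are arbitrary.

```perl
use strict;
use warnings;
use version;

# Hard-coded for illustration; a real caller would read the module's $VERSION.
my $installed = version->parse('v0.0.011');

# Split the normalized form (e.g. v0.0.11) into its three parts.
my ( $major, $minor, $patch ) = $installed->normal =~ /\Av(\d+)\.(\d+)\.(\d+)\z/;

# v0.0 releases are described above as experimental.
my $experimental = ( $major == 0 && $minor == 0 ) ? 1 : 0;

# Version objects compare part by part.
my $is_pre_v1 = ( $installed < version->parse('v1.0.0') ) ? 1 : 0;

print "major=$major minor=$minor patch=$patch experimental=$experimental\n";
```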

AUTHOR

Gary Puckering (jgpuckering@rogers.com)

LICENSE AND COPYRIGHT

Copyright 2022, Gary Puckering