NAME

ghcn_fetch.pl - Fetch station and weather data from the NOAA GHCN repository

VERSION

version v0.0.004

SYNOPSIS

    ghcn_fetch.pl [-gui] [-savegui <filespec>]

    ghcn_fetch.pl [<report_type>]
            [-country <str>] [-state <str>] [-location <str>] [-gsn]
            [-gps "<lat> <long>" [-radius <n>] ]
            [-range <str>] [-active <str> [-partial]] [-quality <pct>]
            [-fmonth <str>] [-fday <str>]
            [-anomalies] [-baseline <str>] [-precip] [-tavg] [-nogaps]
            [-kml <filespec> [-color <str>] ]
            [-report <report_type>]
            [-dataonly] [-performance] [-verbose] [-outclip]
            [-cachedir <directory>] [-refresh <str>]
            [-profile <filespec>]

        <report_type> ::= detail | daily | monthly | yearly | ""

    ghcn_fetch.pl -readme

    ghcn_fetch.pl -help

    ghcn_fetch.pl -usage | -?

DESCRIPTION

Fetch data from the NOAA GHCN database and output as tab-separated lines. Various options are provided to allow filtering of the NOAA stations by country, state, location name, year range, station active year range, etc. When no report type is provided, or -report is an empty string, the output is simply a list of the selected stations.

If report type 'daily', 'monthly' or 'yearly' is given, then the pages for the selected stations are scanned and the data from them aggregated and output as one row per designated period. This is followed by the station list.

If report type 'detail' is given, then the daily data for each selected station id is reported, followed by the station list.

The report type can be abbreviated so long as it is unambiguous; e.g. da or dai for daily; de for detail.
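
The abbreviation rule can be illustrated with a minimal Python sketch. It is not the script's own code; the function name resolve_report_type is hypothetical, and the list of report types is taken from this documentation.

```python
# Report types as documented above; the empty string means "station list only".
REPORT_TYPES = ("detail", "daily", "monthly", "yearly")


def resolve_report_type(abbrev):
    """Expand an unambiguous prefix to a full report type (hypothetical helper)."""
    if abbrev == "":
        return ""
    matches = [t for t in REPORT_TYPES if t.startswith(abbrev.lower())]
    if len(matches) != 1:
        # Zero matches means unknown; more than one means ambiguous (e.g. "d").
        raise ValueError(f"ambiguous or unknown report type: {abbrev!r}")
    return matches[0]
```

Under these assumptions "da" resolves to daily and "de" to detail, while "d" is rejected because it matches both.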

The report type can be provided as the first argument, or it can be provided via the -report option anywhere within the argument list.

In general it's best to narrow your filter criteria as much as possible; otherwise it will take a very long time to load and process the station pages. A good strategy is to omit the -report option so you can see how many stations will be queried before asking for any detailed data. Then you can adjust the number of stations using other filters.

If no options are given, and stdin isn't receiving from a pipe or a file, then -gui is assumed. This launches a dialog that provides a user-friendly way to set options, and to save and reload them (if -savegui is provided).

PARAMETERS

Getopt::Long is used, so option names may be prefixed with either - or --. Parameter names may be abbreviated, so long as they remain unambiguous.

Report Types

Data obtained from the GHCN database can be reported at various levels of aggregation using the report option. The string value for report specifies the type and level of aggregation. Abbreviations are permitted. This value can be given as the very first argument on the command line, or anywhere on the command line if preceded by -report.

-report "" (or omitted)

This is the default option when no report option is provided, or when the option is an empty string. It generates a list of the stations which match the criteria provided (location, geo coordinates, ranges, etc.). No actual weather data is accessed; only station data.

-report daily

Scan the NOAA station pages that meet all the selection criteria and aggregate the data from them by year, month and day. Output the results as a tab-separated table suitable for import into Excel for analysis.

TMAX (temperature maximum) is aggregated by maximum; TMIN by minimum; TAVG values are averaged. Note that while most stations track TMAX and TMIN, a lot fewer track TAVG. When TAVG is missing, a proxy is calculated by averaging TMAX and TMIN.
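
The aggregation rules above can be sketched in Python as follows. This is an illustration only, not the script's implementation. The function name aggregate_day and the dictionary keys are hypothetical, and applying the TAVG proxy per reading (rather than after aggregation) is an assumption.

```python
from statistics import mean


def aggregate_day(readings):
    """Combine one day's readings from several stations (hypothetical helper).

    Each reading is a dict with 'tmax' and 'tmin', and optionally 'tavg'.
    """
    tavgs = []
    for r in readings:
        if r.get("tavg") is not None:
            tavgs.append(r["tavg"])
        else:
            # Assumption: the proxy is computed per reading when TAVG is missing.
            tavgs.append((r["tmax"] + r["tmin"]) / 2)
    return {
        "tmax": max(r["tmax"] for r in readings),  # TMAX aggregated by maximum
        "tmin": min(r["tmin"] for r in readings),  # TMIN aggregated by minimum
        "tavg": mean(tavgs),                       # TAVG values averaged
    }
```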

-report monthly

Same as -report daily except the output is summarized to the month level. Note that with this option, TAVG is averaged across the days of the month and may be of limited usefulness. Avg will be calculated as the average of the max and min for the month, which is what is typically used as the measure for monthly average temperature.

-report yearly

Same as -report daily except the output is summarized to the year level. See the explanation of TAVG vs Avg under -report monthly.

-report detail

Break the selected aggregation level down by station id and include the station id in the output. This is like -report daily, but with a separate set of rows for each station id.

Station Filter

A list of station id's can be provided via stdin, and will be used in lieu of other filtering criteria. Each line of input will be searched for one or more station id's.
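
A minimal Python sketch of scanning input lines for station ids is shown below. It is not the script's own code. The pattern (two uppercase letters followed by nine uppercase letters or digits) is an assumption based on the example ids in this document, and the function name is hypothetical.

```python
import re

# Assumed id shape: 11 characters, e.g. CA006105978 or USC00336346.
STATION_ID = re.compile(r"\b[A-Z]{2}[A-Z0-9]{9}\b")


def station_ids_from_lines(lines):
    """Collect station ids found anywhere in the given lines, in first-seen order."""
    ids = []
    for line in lines:
        for sid in STATION_ID.findall(line):
            if sid not in ids:
                ids.append(sid)
    return ids
```

Because every line is searched rather than parsed, the station list printed by an earlier run could be fed back in without trimming it to a bare id column.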

Geographic Filters

-country <str>

Filter the station list to include only those from a specific country. The string can be a 2-character GEC (formerly FIPS) country code, a 3-character UN country code, or a 3-character internet country code (including the dot). Longer strings are treated as a pattern and matched (unanchored) against country names.

NOAA uses GEC codes in their database. For a full list of country codes and names see https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-countries.txt and https://www.cia.gov/library/publications/the-world-factbook/appendix/appendix-d.html

-state <str> (or -province)

Filter the station list to include only those within the specified 2-character US state or Canadian province code.

-location <str>

Filter the station list to include only those whose name matches the specified pattern. For a starts-with match, prefix the pattern with ^ (or \A). For an ends-with match, suffix the pattern with $ (or \Z).

You can also specify a station id (e.g. CA006105978) or a comma-delimited list of station id's (e.g. CA006105978,USC00336346).

As a handy shortcut, mappings between user-defined names and a station id or id list can be defined in the aliases section of .ghcn_fetch.yaml.

-gsn

Select only GCOS Surface Network stations. The GSN is a baseline network comprising a subset of about 1000 stations chosen mainly to give fairly uniform spatial coverage from places where there is a good length and quality of data record. See https://www.ncdc.noaa.gov/gosic/global-climate-observing-system-gcos/gcos-surface-network-gsn-program-overview

-gps <latitude>,<longitude>

Filter the station list to include only those stations that are within -radius kilometers (default 25) of the specified decimal latitude and longitude values; e.g. 45.3822 -75.7167. The two values can be delimited by spaces, or by any punctuation character (e.g. comma). If a space is used, the string must be enclosed in quotes.

-radius <int>

Specify the radius, in kilometers, to be used for the -gps option.
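
A radius filter of this kind can be sketched in Python using the haversine great-circle formula. This is an illustration only; the formula actually used by the script is not stated here, and the function names, the dictionary keys and the Earth radius constant are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumed constant)


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within_radius(stations, lat, lon, radius_km=25):
    """Keep stations (dicts with 'lat' and 'lon') within radius_km of the point."""
    return [s for s in stations
            if distance_km(lat, lon, s["lat"], s["lon"]) <= radius_km]
```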

Date Filters

-range <str>

Only include data from the specified range of years. The range is given as a string such as 1990-2018. Any punctuation character can be used to separate the two years. A single year may also be given. Alternatively, two discontiguous years can be given by separating the years with a comma (e.g. -range 1919,2019), although this feature cannot be combined with -active or -anomalies.

Note that if -active is specified, then -range must be a subset of -active since there's no point in asking for data that lies outside the active range of data collection for a station.
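
The range syntax and the subset rule can be sketched in Python as below. This is not the script's own parser. The function names are hypothetical, and treating the comma as the only separator that means "two separate years" follows the description above.

```python
import re


def parse_range(text):
    """Return the list of years selected by a -range style string (hypothetical)."""
    years = [int(y) for y in re.findall(r"\d{4}", text)]
    if len(years) == 1:
        return years                      # a single year
    if len(years) != 2:
        raise ValueError(f"expected one or two years: {text!r}")
    first, last = years
    if "," in text:
        return [first, last]              # two discontiguous years
    if first > last:
        raise ValueError(f"range is reversed: {text!r}")
    return list(range(first, last + 1))   # inclusive span of years


def range_within_active(range_text, active_text):
    """True when every year in -range also lies within -active."""
    return set(parse_range(range_text)) <= set(parse_range(active_text))
```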

-active <str>

Only include data from stations which have been fully active within the specified range. The range is given as a string such as 1990-2018. Any punctuation character can be used to separate the two years. A single year may also be given.

Instead of a year range, you can use an empty string to set the active range to match the range specified by -range.

-partial

The -partial option can be used in conjunction with -active to include stations that were only active during part of the active range.
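
The difference between a fully active and a partially active station can be sketched in Python. This is an illustration only; the function name and the representation of a station's activity as a first and last year are assumptions.

```python
def station_matches_active(first_year, last_year, active, partial=False):
    """Check a station's years of activity against an (start, end) active range."""
    start, end = active
    if partial:
        # Partial: the station's active years overlap the range at all.
        return first_year <= end and last_year >= start
    # Full: the station was active for the whole of the range.
    return first_year <= start and last_year >= end
```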

-quality <int>

Only include stations for which at least <int>% of the days within -range have unflagged data. If -anomalies is given, the number of days within the -baseline range is also checked against <int>%. The default value for -quality is 90, meaning that 90% of the days found within -range (and -baseline) must be present and unflagged in order for the station's data to be included in the output.
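
As a rough illustration of the quality threshold, the following Python sketch compares the count of present, unflagged days against the number of calendar days in a year range. It is not the script's own calculation; the function names are hypothetical, and the use of "greater than or equal" at the boundary is an assumption.

```python
import calendar


def days_in_years(first_year, last_year):
    """Number of calendar days in an inclusive range of years."""
    return sum(366 if calendar.isleap(y) else 365
               for y in range(first_year, last_year + 1))


def meets_quality(unflagged_days, expected_days, quality_pct=90):
    """True when the share of present, unflagged days reaches quality_pct."""
    if expected_days <= 0:
        return False
    return 100.0 * unflagged_days / expected_days >= quality_pct
```

For the range 2000-2001 there are 731 days, so under these assumptions a station needs at least 658 unflagged days to pass the default threshold.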

-fday <str>

Filter the data so that it includes only the days of the month which match the specified range list; e.g. 5-10,20.

-fmonth <str>

Filter the data so that it includes only the months of the year which match the specified range list; e.g. 1-3,7-9 would select Jan-Mar and Jul-Sep.
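
A range list such as those accepted by -fday and -fmonth can be expanded with a short Python sketch. This is not the script's own parser; the function name is hypothetical and no validation of day or month bounds is attempted.

```python
def parse_range_list(text):
    """Expand a range list such as '1-3,7-9' into a set of integers (hypothetical)."""
    values = set()
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = (int(x) for x in part.split("-", 1))
            values.update(range(lo, hi + 1))   # inclusive range
        else:
            values.add(int(part))              # single value
    return values
```

A row would then be kept only when its month (or day) is a member of the resulting set.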

Analysis Options

-anomalies

Calculate the mean temperature anomalies for each day at each station relative to a baseline year range (see -baseline). Include these in the output.

-baseline <str>

Use the date range <str> to compute anomalies. Default 1971-2000.
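
One common way to compute daily anomalies is to subtract, for each calendar day, the mean temperature of that day across the baseline years. The Python sketch below shows that approach for a single station. It is an assumption about the method, not a description of what the script does, and the function name and record layout are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def daily_anomalies(records, baseline=(1971, 2000)):
    """records: iterable of (year, month, day, temp_c) for one station."""
    records = list(records)
    first, last = baseline
    by_day = defaultdict(list)
    for year, month, day, temp in records:
        if first <= year <= last:
            by_day[(month, day)].append(temp)
    # Baseline "normal" for each calendar day.
    normals = {key: mean(temps) for key, temps in by_day.items()}
    # Anomaly = observed value minus the baseline normal for that calendar day.
    return [(y, m, d, t - normals[(m, d)])
            for y, m, d, t in records if (m, d) in normals]
```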

-precip

Include precipitation measures in the output, specifically SNOW, SNWD (snow depth), and PRCP (all precipitation). Values are in cm. Like TMAX, SNWD is the maximum depth recorded across stations and across time. The others are averaged across stations and then summed across time. In other words, if -report yearly is used you get the maximum snow depth for the year, and the total accumulation of snow and precipitation for the year.

-tavg

Include TAVG (average daily temperature) in the output. TAVG will be averaged across stations and also across months or years if -report monthly or -report yearly is given.

-nogaps

For report 'detail', generate rows for those months and days where data is missing. This enables charting with a complete time x-axis. Without it, large gaps result in horizontal compression of the chart and a distorted picture across time.

KML Options

-kml <filespec>

Output the coordinates of the selected stations as a KML file, for import into Google Earth as placemarks. The active range of each station will be included as timespans so that you can view the placemarks across time.

-color <color> (or -colour)

Color of the KML placemark pushpins. Acceptable values are red, green, blue, azure, purple, yellow and white. May be abbreviated down to one letter. Default is red.
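
For readers unfamiliar with KML, the Python sketch below builds a single placemark with a time span, which is the general shape of what -kml describes. It is not the script's own output format; the function name is hypothetical, and pushpin styling (the -color option) is left out.

```python
import xml.etree.ElementTree as ET


def station_placemark(name, lat, lon, first_year, last_year):
    """Return a KML Placemark element as text (hypothetical helper)."""
    pm = ET.Element("Placemark")
    ET.SubElement(pm, "name").text = name
    span = ET.SubElement(pm, "TimeSpan")
    ET.SubElement(span, "begin").text = str(first_year)
    ET.SubElement(span, "end").text = str(last_year)
    point = ET.SubElement(pm, "Point")
    # KML coordinates are written longitude first, then latitude.
    ET.SubElement(point, "coordinates").text = f"{lon},{lat}"
    return ET.tostring(pm, encoding="unicode")
```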

Misc Options

-cachedir <filespec>

This option defines the location of the cache directory where pages fetched from the NOAA GHCN repository will be saved, in accordance with your -refresh option. Using a cache vastly improves the performance of subsequent invocations of ghcn_fetch, especially when using the same station filtering criteria.

-dataonly

Print only the data table. Other information, including notes, lists of stations kept and rejected, and statistics are suppressed.

-performance

Include performance statistics in the output. This includes some extra timing information (labelled "(internal)" in the Time Statistics list because they are internal to the other timing metrics) as well as statistics for the memory consumption of the Data hash table. Memory statistics are also added to some Timing subjects.

-refresh <str>

This option determines whether and when cached files are refreshed from the network source. Default is yearly. Possible values are:

yearly

The origin HTTP server is contacted and the page refreshed if the cached file has not been changed within the current year. The rationale for this, and for this being the default, is that the GHCN data for the current year will always be incomplete, and that will skew any statistical analysis and so should normally be truncated. If the user needs the data for the current year, they should use a refresh value of 'always' or a number.

always

If a page is in the cache, the origin HTTP server is always checked for a fresher copy.

never

The origin HTTP server is never contacted, regardless of the page being in cache or not. If the page is missing from cache, the fetch method will return undef. If the page is in cache, that page will be returned, no matter how old it is.

<number>

The origin HTTP server is not contacted if the page is in cache and the cached page was inserted within the last <number> days. Otherwise the server is checked for a fresher page.
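
The four refresh policies can be summarized as a decision function. The Python sketch below is an illustration of the rules as described, not the script's caching code. The function name is hypothetical, and the exact handling of the boundary day for a numeric value is an assumption.

```python
from datetime import date


def should_contact_server(refresh, cached_on, today):
    """Decide whether to contact the origin server (hypothetical helper).

    refresh   -- 'yearly', 'always', 'never', or a number of days as a string
    cached_on -- date the cached page was last written, or None if not cached
    """
    if refresh == "never":
        return False                      # use the cache or nothing at all
    if cached_on is None:
        return True                       # nothing cached yet
    if refresh == "always":
        return True                       # always check for a fresher copy
    if refresh == "yearly":
        return cached_on.year < today.year
    # Numeric value: refresh only when the cached page is older than N days.
    return (today - cached_on).days > int(refresh)
```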

-verbose

When given, warning messages about missing data are displayed to stderr.

Command-Line Only Options

-gui

Launch a graphical user interface that can be used to set options. Not available unless modules Tk and Tk::Getopt are installed.

-savegui <filespec>

Designate a file in which to save options set in the GUI, or from which to load options that were previously saved from the GUI.

-outclip

Send output to the Windows clipboard. (Windows only)

-readme

Launch the default web browser and display the NOAA Daily Readme.txt file, providing a description of the Daily data files and station data.

-h | -help

Display this documentation.

-usage | -?

Display the Synopsis section of this documentation.

PROFILE FILE

At startup, ghcn_fetch will look for the file .ghcn_fetch.yaml in the user home directory (~ on Unix, %UserProfile% on Windows) in order to capture some additional options. The file content should contain something like this:

    ---
    cachedir: C:/ghcn_cache

    aliases:
        yow: CA006106000,CA006106001    # Ottawa airport
        cda: CA006105976,CA006105978    # Ottawa CDA and CDA RCS
        center: USC00326365             # geographic center of North America

Any option (except those listed in section Command-Line Only Options) can be included and will be preloaded as a default. Command-line options will override them. Anything left undefined will be filled in by built-in defaults.

One extra option not available via the command line but which can be specified in the profile file is aliases. This optional section provides a list of shortcut names that are mapped to station id's or id-lists and which can be used in the -location option. If a -location value matches a key defined in this section, the station id or id-list is substituted. Note that keys must consist of lowercase letters only, and may have a leading underscore.
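
Alias substitution can be sketched in Python as a simple lookup. This is not the script's own code. The function names are hypothetical, the aliases are shown as an already-loaded dictionary rather than parsed from YAML, and the key pattern reflects the rule stated above.

```python
import re

# Keys: lowercase letters only, optionally with a leading underscore.
ALIAS_KEY = re.compile(r"_?[a-z]+")


def valid_alias_key(key):
    """True when the key satisfies the naming rule for aliases."""
    return ALIAS_KEY.fullmatch(key) is not None


def expand_location(value, aliases):
    """Substitute the id or id-list when a -location value names an alias."""
    if valid_alias_key(value) and value in aliases:
        return aliases[value]
    return value  # otherwise treated as a name pattern or station id


# Aliases as in the sample profile file above.
ALIASES = {
    "yow": "CA006106000,CA006106001",
    "cda": "CA006105976,CA006105978",
}
```

With this table, a -location value of yow would be replaced by the two-station id list, while a value such as Ottawa would be passed through unchanged.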

RELATED SCRIPTS

Additional scripts are provided for data analysis. These scripts are designed to take the output of ghcn_fetch as their input.

For Windows users, a -outclip option directs the tab-separated output to the Windows clipboard, so it can be pasted into Excel for analysis using PivotTable and PivotChart. Alternatively you can use the usual '>' method to direct the output to a file.

ghcn_extremes.pl

Report patterns of temperature extremes (heatwaves or coldwaves) by analyzing daily temperature records and looking for consecutive days of extreme temperatures; e.g.

    ghcn_fetch -country CA -report daily | ghcn_extremes > extremes.tsv

ghcn_station_counts.pl

Report the station counts per year for a list of stations generated by ghcn_fetch with no report type (the default, equivalent to -report ""); e.g.

    ghcn_fetch -country CA | ghcn_station_counts > stn_counts.tsv

AUTHOR

Gary Puckering (jgpuckering@rogers.com)

LICENSE AND COPYRIGHT

Copyright 2022, Gary Puckering