
NAME

ghcn_fetch.pl - Fetch station and weather data from the NOAA GHCN repository

VERSION

version v0.0.011

SYNOPSIS

    ghcn_fetch.pl [-gui] [-savegui <filespec>]

    ghcn_fetch.pl [<report_type>]
            [-country <str>] [-state <str>] [-location <str>] [-gsn]
            [-gps "<lat> <long>" [-radius <n>] ]
            [-range <str>] [-active <str> [-partial]] [-quality <pct>]
            [-fmonth <str>] [-fday <str>]
            [-anomalies] [-baseline <str>] [-precip] [-tavg] [-nogaps]
            [-report <report_type>]
            [-dataonly] [-kmlcolor <str>] [-performance] [-verbose] [-outclip]
            [-cachedir <directory>] [-refresh <str>] 
            [-profile <filespec>] 
            

        <report_type> ::= <station> | <weather>
        
        <station> ::= kml | url | curl | id | stn | ""
        
        <weather> ::= detail | daily | monthly | yearly
                          

    ghcn_fetch.pl -readme

    ghcn_fetch.pl -help

    ghcn_fetch.pl -usage | -?

DESCRIPTION

Fetch data from the NOAA GHCN database and output as tab-separated lines. Various options are provided to allow filtering of the NOAA stations by country, state, location name, year range, station active year range, etc. When no report type is provided, or -report is an empty string, the output is simply a list of the selected stations.

There are two broad types of reports: station reports and weather data reports. The former provide information about the selected stations. The latter provide actual daily weather data for a range of time for the selected stations, along with station information (unless option -dataonly is provided).

The report type can be abbreviated so long as it is unambiguous; e.g. da or dai for daily; de for detail.
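
As a rough illustration of the abbreviation rule, the sketch below resolves a prefix against the report types named in this document. It is written in Python purely for illustration; the function name and the error messages are hypothetical and are not taken from ghcn_fetch itself.

```python
# Report types as listed in this document (the empty string is handled elsewhere).
REPORT_TYPES = ["kml", "url", "curl", "id", "stn",
                "daily", "monthly", "yearly", "detail"]

def resolve_report_type(abbrev, choices=REPORT_TYPES):
    """Return the one report type matching abbrev, or raise ValueError."""
    if abbrev in choices:
        # An exact match always wins, even if it is a prefix of another name.
        return abbrev
    matches = [c for c in choices if c.startswith(abbrev)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown report type: {abbrev!r}")
    raise ValueError(f"ambiguous report type {abbrev!r}: {', '.join(matches)}")
```

Under these assumptions, da and dai both resolve to daily and de resolves to detail, while a bare d is rejected as ambiguous.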

The report type can be provided as the first argument, or it can be provided via the -report option anywhere within the argument list.

In general it's best to narrow your filter criteria as much as possible; otherwise it will take a very long time to load and process the station pages. A good strategy is to omit the -report option so you can see how many stations will be queried before asking for any detailed data. Then you can adjust the number of stations using other filters.

If no options are given, and stdin isn't receiving from a pipe or a file, then -gui is assumed. This launches a dialog to provide a user-friendly way to set options, and to save and reload them (if -savegui is provided).

PARAMETERS

Getopt::Long is used, so either - or -- may be used. Parameter names may be abbreviated, so long as they remain unambiguous.

Station Report Types

-report "" (or omitted)

This is the default option when no report option is provided, or when the option is an empty string. It generates a list of the stations which match the criteria provided (location, geo coordinates, ranges, etc.). No actual weather data is accessed; only station data.

-report curl

For the selected stations, print to stdout the commands necessary to fetch the daily page file from the NOAA repository using curl -K filename. You'll need to redirect stdout to save the output in filename.

This option is convenient in cases where you might want to prefetch station daily weather pages into the cache you've designated with the -cachedir option. You'll need to cd into that directory so the files downloaded by curl end up in the cache.

-report kml

For the selected stations, print to stdout the KML specifications that can be imported into Google Maps (or similar software) as pushpins. The -kmlcolor option can be used to designate a different color.

-report id

For the selected stations, print to stdout the id's of the stations. If saved to a file, this list can be used as an input filter to ghcn_fetch.pl using stdin.

-report stn

For the selected stations, print to stdout the station information as a tab-separated table (including header). This form is suitable for importing into a spreadsheet.

-report url

For the selected stations, print to stdout the URL's for the corresponding daily weather pages.

Weather Report Types

-report daily

Scan the NOAA station pages that meet all the selection criteria and aggregate the data from them by year, month and day. Output the results as a tab-separated table suitable for import into Excel for analysis.

TMAX (temperature maximum) is aggregated by maximum; TMIN by minimum; TAVG values are averaged. Note that while most stations track TMAX and TMIN, a lot fewer track TAVG. When TAVG is missing, a proxy is calculated by averaging TMAX and TMIN.
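
The aggregation rules above can be sketched as follows. This is an illustrative Python sketch, not the script's own code: the function name and the input shape (one dict per station observation) are assumptions, as is the choice to compute the TAVG proxy per station before averaging.

```python
from statistics import mean

def aggregate_day(observations):
    """Aggregate one day's observations from several stations.

    observations: list of dicts with optional 'TMAX', 'TMIN', 'TAVG' keys.
    """
    tmax_vals = [o["TMAX"] for o in observations if o.get("TMAX") is not None]
    tmin_vals = [o["TMIN"] for o in observations if o.get("TMIN") is not None]

    tavg_vals = []
    for o in observations:
        if o.get("TAVG") is not None:
            tavg_vals.append(o["TAVG"])
        elif o.get("TMAX") is not None and o.get("TMIN") is not None:
            # Proxy for a missing TAVG: midpoint of TMAX and TMIN (assumed per station).
            tavg_vals.append((o["TMAX"] + o["TMIN"]) / 2)

    return {
        "TMAX": max(tmax_vals) if tmax_vals else None,   # aggregated by maximum
        "TMIN": min(tmin_vals) if tmin_vals else None,   # aggregated by minimum
        "TAVG": mean(tavg_vals) if tavg_vals else None,  # averaged
    }
```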

-report monthly

Same as -report daily except the output is summarized to the month level. Note that with this option, TAVG is averaged across the days of the month and may be of limited usefulness. Avg is calculated as the average of the max and min for the month, which is what is typically used as the measure of monthly average temperature.

-report yearly

Same as -report daily except the output is summarized to the year level. See the explanation of TAVG vs Avg under -report monthly.

-report detail

Break the selected aggregation level down by station id and include the station id in the output. This is like -daily, but with a separate set of rows for each station id.

Station Filter

A list of station id's can be provided via stdin, and will be used in lieu of other filtering criteria. Each line of input will be searched for one or more station id's.
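
A minimal Python sketch of scanning input lines for station id's is shown below. The pattern is an assumption based on the example id's in this document (two uppercase letters followed by nine uppercase letters or digits); the pattern actually used by ghcn_fetch may differ.

```python
import re

# Assumed shape of a station id, e.g. CA006105978 or USC00336346.
STATION_ID = re.compile(r"\b[A-Z]{2}[0-9A-Z]{9}\b")

def station_ids_from_lines(lines):
    """Return the distinct station id's found in lines, in first-seen order."""
    seen = set()
    ids = []
    for line in lines:
        for station_id in STATION_ID.findall(line):
            if station_id not in seen:
                seen.add(station_id)
                ids.append(station_id)
    return ids
```

Because each line is searched rather than parsed, the tab-separated output of an earlier station report could be fed back in unchanged.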

Geographic Filters

-country <str>

Filter the station list to include only those from a specific country. The string can be a 2-character GEC (formerly FIPS) country code, a 3-character UN country code, or a 3-character internet country code (including the dot). Longer strings are treated as a pattern and matched (unanchored) against country names.

NOAA uses GEC codes in their database. For a full list of country codes and names see https://www1.ncdc.noaa.gov/pub/data/ghcn/daily/ghcnd-countries.txt and https://www.cia.gov/library/publications/the-world-factbook/appendix/appendix-d.html
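
The length-based interpretation of the -country value can be sketched like this. It is an illustrative Python sketch only; the function name and the returned labels are hypothetical.

```python
def classify_country_arg(value):
    """Classify a -country value by the rules described above."""
    if len(value) == 2:
        return "gec"            # 2-character GEC (formerly FIPS) code
    if len(value) == 3 and value.startswith("."):
        return "internet"       # 3 characters including the leading dot
    if len(value) == 3:
        return "un"             # 3-character UN country code
    return "name_pattern"       # anything else: unanchored match on country names
```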

-state <str> (or -province)

Filter the station list to include only those within the specified 2-character US state or Canadian province code.

-location <str>

Filter the station list to include only those whose name matches the specified pattern. For a starts-with match, prefix the pattern with ^ (or \A). For an ends-with match, suffix the pattern with $ (or \Z).

You can also specify a station id (e.g. CA006105978) or a comma-delimited list of station id's (e.g. CA006105978,USC00336346).

As a handy shortcut, mappings between user-defined names and a station id or id list can be defined in the aliases section of .ghcn_fetch.yaml.

-gsn

Select only GCOS Surface Network stations, which is a baseline network comprising a subset of about 1000 stations chosen mainly to give a fairly uniform spatial coverage from places where there is a good length and quality of data record. See https://www.ncdc.noaa.gov/gosic/global-climate-observing-system-gcos/gcos-surface-network-gsn-program-overview

-gps <latitude>,<longitude>

Filter the station list to include only those stations that are within -radius kilometers (default 25) of the specified decimal latitude and longitude values; e.g. 45.3822 -75.7167. The two values can be delimited by spaces, or by any punctuation character (e.g. comma). If a space is used, the string must be enclosed in quotes.

-radius <int>

Specify the radius, in kilometers, to be used for the -gps option.
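
One common way to implement a radius filter like this is the haversine great-circle distance. The Python sketch below is an assumption about the approach, not a description of what ghcn_fetch does internally; the function names and the (id, lat, lon) tuple layout are hypothetical.

```python
import math
import re

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def parse_gps(text):
    """Split a -gps string such as '45.3822 -75.7167' into (lat, lon) floats."""
    numbers = re.findall(r"[-+]?\d+(?:\.\d+)?", text)
    if len(numbers) != 2:
        raise ValueError(f"expected two coordinates, got: {text!r}")
    return float(numbers[0]), float(numbers[1])

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def stations_within(stations, lat, lon, radius_km=25):
    """Keep (id, lat, lon) tuples no farther than radius_km from (lat, lon)."""
    return [s for s in stations if haversine_km(lat, lon, s[1], s[2]) <= radius_km]
```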

Date Filters

-range <str>

Only include data from the specified range of years. The range is given as a string such as 1990-2018. Any punctuation character can be used to separate the two years. A single year may also be given. Alternatively, two discontiguous years can be given by separating the years with a comma (e.g. -range 1919,2019), although this feature cannot be combined with -active or -anomalies.

Note that if -active is specified, then -range must be a subset of -active since there's no point in asking for data that lies outside the active range of data collection for a station.
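
The year-range forms described above, and the requirement that -range fall within -active, can be sketched as follows. This is a Python illustration under stated assumptions: years are taken to be four digits, and the function name is hypothetical.

```python
import re

def parse_years(text):
    """Expand a -range style string into a list of years.

    '1990-2018' gives every year from 1990 to 2018, '2005' gives one year,
    and '1919,2019' gives just those two discontiguous years.
    """
    years = [int(y) for y in re.findall(r"\d{4}", text)]
    if len(years) == 1:
        return years
    if len(years) != 2:
        raise ValueError(f"expected one or two years, got: {text!r}")
    first, last = years
    if "," in text:
        return [first, last]          # comma means two separate years
    if first > last:
        raise ValueError(f"range start is after range end: {text!r}")
    return list(range(first, last + 1))

# Subset check corresponding to the -range / -active rule:
range_ok = set(parse_years("1995-2000")) <= set(parse_years("1990-2018"))
```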

-active <str>

Only include data from stations which have been fully active within the specified range. The range is given as a string such as 1990-2018. Any punctuation character can be used to separate the two years. A single year may also be given.

Instead of a year range, you can use an empty string to set the active range to match the range specified by -range.

-partial

The -partial option can be used in conjunction with -active to include stations that were only active during part of the active range.

-quality <int>

Only include stations which have at least <int>% of days with unflagged data within -range. If -anomalies is given, the number of days within the -baseline range is also checked against <int>%. The default value for -quality is 90, meaning that 90% of the days found within -range (and -baseline) must be present and unflagged in order for the station's data to be included in the output.
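
The arithmetic behind the quality threshold can be sketched as below. This Python sketch assumes the denominator is every calendar day in the year range; how ghcn_fetch actually counts expected days (for example when -fmonth or -fday is in effect) is not stated here, and the function name is hypothetical.

```python
from datetime import date

def meets_quality(unflagged_days, start_year, end_year, quality_pct=90):
    """True if unflagged days make up at least quality_pct of the year range."""
    # Assumption: every calendar day from Jan 1 of start_year to Dec 31 of end_year counts.
    total_days = (date(end_year + 1, 1, 1) - date(start_year, 1, 1)).days
    return 100.0 * unflagged_days / total_days >= quality_pct
```

For a single non-leap year, 329 unflagged days out of 365 is just over 90%, so the station would pass at the default threshold, while 328 days would not.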

-fday <str>

Filter the data so that it includes only the days of the month which match the specified range list; e.g. 5-10,20.

-fmonth <str>

Filter the data so that it includes only the months of the year which match the specified range list; e.g. 1-3,7-9 would select Jan-Mar and Jul-Sep.
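
The range lists accepted by -fday and -fmonth can be expanded along these lines. This is a minimal Python sketch; the function name is hypothetical and no validation of day or month bounds is attempted.

```python
def parse_range_list(text):
    """Expand a range list such as '1-3,7-9' into a sorted list of integers."""
    values = set()
    for part in text.split(","):
        part = part.strip()
        if "-" in part:
            low, high = part.split("-", 1)
            values.update(range(int(low), int(high) + 1))
        else:
            values.add(int(part))
    return sorted(values)
```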

Analysis Options

-anomalies

Calculate the mean temperature anomalies for each day at each station relative to a baseline year range (see -baseline). Include these in the output.

-baseline <str>

Use the date range <str> to compute anomalies. Default 1971-2000.
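
The exact anomaly formula is not spelled out here. A common definition, assumed in the Python sketch below, is the observed value minus the mean of the same calendar day over the baseline years. The function name and the input layout are hypothetical.

```python
from statistics import mean

def daily_anomalies(temps, baseline=(1971, 2000)):
    """Compute anomalies for one station.

    temps: dict mapping (year, month, day) to a temperature.
    Returns a dict with the same keys mapping to value minus baseline mean.
    """
    # Collect baseline-period values for each calendar day.
    by_day = {}
    for (year, month, day), value in temps.items():
        if baseline[0] <= year <= baseline[1]:
            by_day.setdefault((month, day), []).append(value)
    baseline_mean = {key: mean(vals) for key, vals in by_day.items()}

    # Days with no baseline data are skipped rather than guessed.
    return {
        key: value - baseline_mean[(key[1], key[2])]
        for key, value in temps.items()
        if (key[1], key[2]) in baseline_mean
    }
```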

-precip

Include precipitation measures in the output, specifically SNOW, SNWD (snow depth), and PRCP (all precipitation). Values are in cm. Like TMAX, SNWD is the maximum depth recorded across stations and across time. The others are averaged across stations and then summed across time. In other words, if -report yearly is used you get the maximum snow depth for the year, and the total accumulation of snow and precipitation for the year.

-tavg

Include TAVG (average daily temperature) in the output. TAVG will be averaged across stations and also across months or years if -report monthly or -report yearly is given.

-nogaps

For report 'detail', generate rows for those months and days where data is missing. This enables charting with a complete time x-axis. Without it, large gaps result in horizontal compression of the chart and a distorted picture across time.
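
The effect of -nogaps can be pictured with the Python sketch below, which emits one row per calendar day and leaves missing days empty. The function name and the dict-of-dates input are assumptions made for illustration.

```python
from datetime import date, timedelta

def fill_gaps(rows, start, end):
    """Return (day, value) pairs for every day from start to end inclusive.

    rows: dict mapping datetime.date to a value; missing days yield None.
    """
    filled = []
    day = start
    while day <= end:
        filled.append((day, rows.get(day)))
        day += timedelta(days=1)
    return filled
```

With every day present, a chart's time axis stays evenly spaced even where the station reported nothing.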

Misc Options

-cachedir <filespec>

This option defines the location of the cache directory where pages fetched from the NOAA GHCN repository will be saved, in accordance with your -refresh option. Using a cache vastly improves the performance of subsequent invocations of ghcn_fetch, especially when using the same station filtering criteria.

-dataonly

Print only the data table. Other information, including notes, lists of stations kept and rejected, and statistics are suppressed.

-kmlcolor <color>

Color of the KML placemark pushpins. Acceptable values are red, green, blue, azure, purple, yellow and white. May be abbreviated down to one letter. Default is red.

-performance

Include performance statistics in the output. This includes some extra timing information (labelled "(internal)" in the Time Statistics list because they are internal to the other timing metrics) as well as statistics for the memory consumption of the Data hash table. Also some memory statistics are added to some Timing subjects.

-profile <filespec>

Location of the optional user profile YAML file, which can be used to define location aliases and set commonly used options such as -cachedir. Defaults to ~/.ghcn_fetch.yaml.

-refresh <str>

This option determines whether and when cached files are refreshed from the network source. Default is yearly. Possible values are:

yearly

The origin HTTP server is contacted and the page refreshed if the cached file has not been changed within the current year. The rationale for this, and for this being the default, is that the GHCN data for the current year will always be incomplete, and that will skew any statistical analysis and so should normally be truncated. If the user needs the data for the current year, they should use a refresh value of 'always' or a number.

always

If a page is in the cache, the origin HTTP server is always checked for a fresher copy.

never

The origin HTTP server is never contacted, regardless of the page being in cache or not. If the page is missing from cache, the fetch method will return undef. If the page is in cache, that page will be returned, no matter how old it is.

<number>

The origin HTTP server is not contacted if the page is in cache and the cached page was inserted within the last <number> days. Otherwise the server is checked for a fresher page.
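
The four refresh policies can be summarized as a single decision, sketched here in Python. This is an interpretation of the descriptions above, not the script's code: the function name is hypothetical, and a page missing from the cache is assumed to trigger a fetch under every policy except never.

```python
from datetime import datetime, timedelta

def should_contact_server(refresh, cached_at, now):
    """Decide whether the origin server should be contacted.

    refresh:   'yearly', 'always', 'never', or a number of days (as a string)
    cached_at: datetime the cached file was last changed, or None if not cached
    now:       current datetime
    """
    if refresh == "never":
        return False                       # never contact, cached or not
    if cached_at is None:
        return True                        # nothing cached, so fetch
    if refresh == "always":
        return True                        # always check for a fresher copy
    if refresh == "yearly":
        return cached_at.year < now.year   # stale if not changed this year
    # Numeric policy: contact only if the cached copy is older than N days.
    return now - cached_at > timedelta(days=float(refresh))
```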

-verbose

When given, warning messages about missing data are displayed to stderr.

Command-Line Only Options

Options documented in this section can be used on the command line, but cannot be specified within a profile file.

-gui

Launch a graphical user interface that can be used to set options. Not available unless modules Tk and Tk::Getopt are installed.

-savegui <filespec>

Designate a file in which to save options set in the GUI, or from which to load options that were previously saved from the GUI.

-outclip

Send output to the Windows clipboard. (Windows only)

-readme

Launch the default web browser and display the NOAA Daily Readme.txt file, providing a description of the Daily data files and station data.

-h | -help

Display this documentation.

-usage | -?

Display the Synopsis section of this documentation.

PROFILE FILE

At startup, ghcn_fetch will look for the file .ghcn_fetch.yaml in the user home directory (~ on Unix, %UserProfile% on Windows) in order to capture some additional options. The file content should contain something like this:

    ---
    cachedir: C:/ghcn_cache

    aliases:
        yow: CA006106000,CA006106001    # Ottawa airport
        cda: CA006105976,CA006105978    # Ottawa CDA and CDA RCS
        center: USC00326365             # geographic center of North America

Any option (except those listed in section Command-Line Only Options) can be included and will be preloaded as a default. Command-line options will override them. Anything left undefined will be filled in by built-in defaults.

One extra option not available via the command line but which can be specified in the profile file is aliases. This optional section provides a list of shortcut names that are mapped to station id's or id-lists and which can be used in the -location option. If a -location value matches a key defined in this section, the station id or id-list is substituted. Note that keys must consist of lowercase letters only, and may have a leading underscore.
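
Alias substitution and the key rule can be sketched as follows. The Python below hard-codes the aliases from the example profile rather than reading YAML, and both function names are hypothetical; whether ghcn_fetch matches alias keys case-sensitively is an assumption.

```python
import re

# Aliases copied from the example profile above.
PROFILE_ALIASES = {
    "yow": "CA006106000,CA006106001",
    "cda": "CA006105976,CA006105978",
    "center": "USC00326365",
}

# Lowercase letters only, with an optional leading underscore.
ALIAS_KEY = re.compile(r"_?[a-z]+")

def resolve_location(location, aliases):
    """Substitute the id or id-list if location is an alias key; else pass through."""
    return aliases.get(location, location)

def invalid_alias_keys(aliases):
    """Return the keys that break the naming rule."""
    return [key for key in aliases if not ALIAS_KEY.fullmatch(key)]
```

Under these assumptions, -location yow would be treated as -location CA006106000,CA006106001, while a value such as "New York" would be left alone and matched as a name pattern.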

RELATED SCRIPTS

Additional scripts are provided for data analysis. These scripts are designed to take the output of ghcn_fetch as their input.

For Windows users, a -outclip option directs the tab-separated output to the Windows clipboard, so it can be pasted into Excel for analysis using PivotTable and PivotChart. Alternatively you can use the usual '>' method to direct the output to a file.

ghcn_extremes.pl

Report patterns of temperature extremes (heatwaves or coldwaves) by analyzing daily temperature records and looking for consecutive days of extreme temperatures; e.g.

    ghcn_fetch -country CA -report daily | ghcn_extremes > extremes.tsv

ghcn_station_counts.pl

Report the station counts per year for a list of stations generated by this script using -report "" (which is the default -report option); e.g.

    ghcn_fetch -country CA | ghcn_station_counts > stn_counts.tsv

EXAMPLES

Here are some examples of the kinds of reports that can be generated:

List the weather stations in NY state with "New York" in the name:

    ghcn_fetch -country US -state NY -location "New York"

List the New York weather stations active between 1900 and 1920:

    ghcn_fetch -cou US -st NY -location "New York" -active 1900-1920

Report the yearly max, min and average temperatures at JFK airport:

    ghcn_fetch yearly -cou US -st NY -location "New York JFK"

Report the monthly max, min and average temperatures at JFK airport:

    ghcn_fetch monthly -cou US -st NY -location "New York JFK"

Report the daily max, min and average temperatures at JFK airport:

    ghcn_fetch daily -cou US -st NY -location "New York JFK"

Launch the GUI for an options dialog (requires Tk and Tk::Getopt to be installed):

    ghcn_fetch -gui

Get documentation on all the options:

    ghcn_fetch -help

Find the 5-day heatwaves at the JFK airport station:

    ghcn_fetch detail -cou US -st NY -loc "New York JFK" | ghcn_extremes

Find the 3-day coldwaves (<= -15C) at the JFK airport station:

    ghcn_fetch detail -cou US -st NY -loc "New York JFK" | 
      ghcn_extremes -cold -ndays 3 -limit -15

For each year between 1900 and 1950, count the number of active weather stations in NY state:

    ghcn_fetch detail -cou US -st NY -active 1900-1950 | ghcn_station_counts

AUTHOR

Gary Puckering (jgpuckering@rogers.com)

LICENSE AND COPYRIGHT

Copyright 2022, Gary Puckering