NAME
WWW::Scraper::Wikipedia::ISO3166 - Gently scrape Wikipedia for ISO3166-2 data
Synopsis
Wikipedia has been scraped. You do not need to run the scripts which download pages from there.
Just use the SQLite database shipped with this module, as discussed next.
Methods which return hashrefs
use WWW::Scraper::Wikipedia::ISO3166::Database;
my($database) = WWW::Scraper::Wikipedia::ISO3166::Database -> new;
my($countries) = $database -> read_countries_table;
my($subcountries) = $database -> read_subcountries_table;
...
Each key in %$countries and %$subcountries points to a hashref of all columns for the given key.
So, $$countries{13} points to this hashref:
{
id => 13,
code2 => 'AU',
code3 => '',
fc_name => 'australia',
has_subcountries => 'Yes',
name => 'Australia',
timestamp => '2012-05-08 04:04:43',
}
One element of %$subcountries is $$subcountries{4276}:
{
id => 4276,
country_id => 13,
code => 'AU-VIC',
fc_name => 'victoria',
name => 'Victoria',
sequence => 5,
timestamp => '2012-05-08 04:05:27',
}
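Given those shapes, typical read-only use is to index into the hashrefs by the values you already have. Here is a minimal sketch using the two sample rows above as stand-ins for what read_countries_table() and read_subcountries_table() return:

```perl
use strict;
use warnings;

# Sample rows copied from the Synopsis above; the real module returns
# hashrefs like these from read_countries_table() and read_subcountries_table().
my $countries = {
	13 => {id => 13, code2 => 'AU', fc_name => 'australia', name => 'Australia'},
};
my $subcountries = {
	4276 => {id => 4276, country_id => 13, code => 'AU-VIC',
			fc_name => 'victoria', name => 'Victoria', sequence => 5},
};

# Find a country by its case-folded name.
my($au) = grep {$$_{fc_name} eq 'australia'} values %$countries;

# Collect that country's subcountries via subcountries.country_id.
my @victoria = grep {$$_{country_id} == $$au{id}} values %$subcountries;

print "$$au{name}: $victoria[0]{code}\n"; # Australia: AU-VIC
```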
Warnings
# 1: These hashrefs use the table's primary key as the hashref's key. For the countries table, the primary key is the country's id, which also appears as subcountries.country_id. For the subcountries table, though, the id has no meaning apart from being a db primary key. See "What is the database schema?" for details.
# 2: Do not assume subcountry names are unique within a country.
See 'Taichung' etc in Taiwan for example.
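One way to guard against that is to key by the subcountry code (which is unique) and only group by name when you deliberately want duplicates. A sketch, using hypothetical rows shaped like the hashrefs above:

```perl
use strict;
use warnings;

# Hypothetical subcountry rows for one country, to show the hazard.
my @rows = (
	{code => 'TW-ABC', name => 'Taichung'},
	{code => 'TW-XYZ', name => 'Taichung'},
	{code => 'TW-DEF', name => 'Taipei'},
);

# Group codes under each name; any name with > 1 code is ambiguous.
my %by_name;
push @{$by_name{$$_{name} } }, $$_{code} for @rows;

my @dups = grep {@{$by_name{$_} } > 1} sort keys %by_name;

print "Duplicated name(s): @dups\n"; # Duplicated name(s): Taichung
```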
Scripts which output to a file
All scripts respond to the -h option.
Some examples:
shell>perl scripts/export.as.csv.pl -c countries.csv -s subcountries.csv
shell>perl scripts/export.as.html.pl -w iso.3166-2.html
This file is on-line at: http://savage.net.au/Perl-modules/html/WWW/Scraper/Wikipedia/ISO3166/iso.3166-2.html.
shell>perl scripts/report.statistics.pl
Output statistics:
countries_in_db => 249.
has_subcountries => 199.
subcountries_in_db => 4593.
subcountry_files_downloaded => 249.
Description
WWW::Scraper::Wikipedia::ISO3166
is a pure Perl module.
It is used to download various ISO3166-related pages from Wikipedia, and to then import data (scraped from those pages) into an SQLite database.
The pages have already been downloaded, so that phase only needs to be run when pages are updated.
Likewise, the data has been imported.
This means you would normally only ever use the database in read-only mode.
Its components are:
- o scripts/get.country.page.pl
1: Downloads the ISO3166-1_alpha-3 page from Wikipedia.
Input: http://en.wikipedia.org/wiki/ISO_3166-1_alpha-3.
Output: data/en.wikipedia.org.wiki.ISO_3166-2.3.html.
2: Downloads the ISO3166-2 page from Wikipedia.
Input: http://en.wikipedia.org/wiki/ISO_3166-2.
Output: data/en.wikipedia.org.wiki.ISO_3166-2.html.
- o scripts/populate.countries.pl
Imports country data into an SQLite database.
Inputs: data/en.wikipedia.org.wiki.ISO_3166-2.html, data/en.wikipedia.org.wiki.ISO_3166-2.3.html.
Output: share/www.scraper.wikipedia.iso3166.sqlite.
- o scripts/get.subcountry.page.pl and scripts/get.subcountry.pages.pl
Downloads each country's corresponding subcountry page.
Source: http://en.wikipedia.org/wiki/ISO_3166:$code2.html.
Output: data/en.wikipedia.org.wiki.ISO_3166-2.$code2.html.
- o scripts/populate.subcountry.pl and scripts/populate.subcountries.pl
Imports subcountry data into the database.
Source: data/en.wikipedia.org.wiki.ISO_3166-2.$code2.html.
Output: share/www.scraper.wikipedia.iso3166.sqlite.
Note: When the distro is installed, this SQLite file is installed too. See "Where is the database?" for details.
- o scripts/export.as.csv.pl -c c.csv -s s.csv
Exports the country and subcountry data as CSV.
Input: share/www.scraper.wikipedia.iso3166.sqlite.
Output: data/countries.csv and data/subcountries.csv.
- o scripts/export.as.html.pl -w c.html
Exports the country and subcountry data as HTML.
Input: share/www.scraper.wikipedia.iso3166.sqlite.
Output: data/iso.3166-2.html.
On-line: http://savage.net.au/Perl-modules/html/WWW/Scraper/Wikipedia/ISO3166/iso.3166-2.html.
- o scripts/get.statoids.pl
Downloads some pages from http://statoids.com in case one day we need to convert from FIPS to ISO 3166-2.
See data/List_of_FIPS_region_codes_*.html.
- o scripts/populate.fips.codes.pl
This reads the files output by scripts/get.statoids.pl and produces 2 reports, data/wikipedia.fips.codes.txt and data/wikipedia.fips.mismatch.log. These are discussed in "What FIPS data is included?".
- o scripts/test.nfc.pl
See "Why did you use Unicode::Normalize's NFC() for sorting?" for a discussion of this script.
Constructor and initialization
new(...) returns an object of type WWW::Scraper::Wikipedia::ISO3166.
This is the class's constructor.
Usage: WWW::Scraper::Wikipedia::ISO3166 -> new().
This method takes a hash of options.
Call new() as new(option_1 => value_1, option_2 => value_2, ...).
Available options (these are also methods):
- o config_file => $file_name
The name of the file containing config info, such as css_url and template_path. These are used by "as_html()" in WWW::Scraper::Wikipedia::ISO3166::Database::Export.
The code prefixes this name with the directory returned by "dist_dir()" in File::ShareDir.
Default: .htwww.scraper.wikipedia.iso3166.conf.
- o sqlite_file => $file_name
The name of the SQLite database of country and subcountry data.
The code prefixes this name with the directory returned by "dist_dir()" in File::ShareDir.
Default: www.scraper.wikipedia.iso3166.sqlite.
- o verbose => $integer
Print more or less information.
Default: 0 (print nothing).
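The options-as-hash pattern above can be sketched as follows. The defaults are copied from this list; the merge logic is an illustration of how such options typically override defaults, not the module's actual code:

```perl
use strict;
use warnings;

# Defaults as documented above.
my %defaults = (
	config_file => '.htwww.scraper.wikipedia.iso3166.conf',
	sqlite_file => 'www.scraper.wikipedia.iso3166.sqlite',
	verbose     => 0,
);

# Caller's options override the defaults, as in new(verbose => 1).
my %arg = (%defaults, verbose => 1);

print "verbose = $arg{verbose}\n"; # verbose = 1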
Distributions
This module is available as a Unix-style distro (*.tgz).
Install WWW::Scraper::Wikipedia::ISO3166 as you would for any Perl
module:
Run:
cpanm WWW::Scraper::Wikipedia::ISO3166
or run:
sudo cpan WWW::Scraper::Wikipedia::ISO3166
or unpack the distro, and then run:
perl Makefile.PL
make (or dmake)
make test
make install
See http://savage.net.au/Perl-modules.html for details.
See http://savage.net.au/Perl-modules/html/installing-a-module.html for help on unpacking and installing.
Methods
config_file($file_name)
Get or set the name of the config file.
The code prefixes this name with the directory returned by "dist_dir()" in File::ShareDir.
Also, config_file is an option to "new()".
log($level => $s)
Print $s at log level $level, but only if $self -> verbose is true.
Since $self -> verbose defaults to 0, nothing is printed by default.
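The effect can be sketched with a stand-in class (this is an illustration of the pattern, not the module's implementation; the return value here is added just so the behaviour is observable):

```perl
use strict;
use warnings;

package Demo;

sub new { my($class, %arg) = @_; return bless {verbose => $arg{verbose} // 0}, $class }

# Print only when verbose is true; return 1 if printed, 0 if not.
sub log
{
	my($self, $level, $s) = @_;
	print "$level: $s\n" if $self -> {verbose};
	return $self -> {verbose} ? 1 : 0;
}

package main;

my $quiet = Demo -> new;               # verbose defaults to 0.
my $noisy = Demo -> new(verbose => 1);

$quiet -> log(info => 'not printed');
$noisy -> log(info => 'printed');      # info: printed
```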
new()
See "Constructor and initialization".
sqlite_file($file_name)
Get or set the name of the database file.
The code prefixes this name with the directory returned by "dist_dir()" in File::ShareDir.
Also, sqlite_file is an option to "new()".
verbose($integer)
Get or set the verbosity level.
Also, verbose is an option to "new()".
FAQ
Design faults in ISO3166
Where ISO3166 uses Country Name, I would have used Long Name and Short Name.
Then we'd have:
Long Name: Bolivia, Plurinational State of
Short Name: Bolivia
This distro uses the value directly from Wikipedia, which is what I have called 'Long Name', for all country and subcountry names.
Where is the database?
It is shipped in share/www.scraper.wikipedia.iso3166.sqlite.
It is installed into the distro's shared dir, as returned by "dist_dir()" in File::ShareDir. On my machine that's:
/home/ron/perl5/perlbrew/perls/perl-5.14.2/lib/site_perl/5.14.2/auto/share/dist/WWW-Scraper-Wikipedia-ISO3166/www.scraper.wikipedia.iso3166.sqlite.
What is the database schema?
A single SQLite file holds 2 tables, countries and subcountries:
countries subcountries
--------- ------------
id id
code2 country_id
code3 code
fc_name fc_name
has_subcountries name
name sequence
timestamp timestamp
code3 has a couple of special cases. 2 countries have no value for code3: Libyan Arab Jamahiriya and Sint Maarten. There are, however, 3-letter codes which almost match: LBY => Libya and MAF => Saint Martin (French part).
subcountries.country_id points to countries.id.
fc_name is output from calling fc(decode('utf8', $name)).
For decode(), see "THE PERL ENCODING API" in Encode.
For fc(), see "fc($str)" in Unicode::CaseFold.
$name is from a Wikipedia page.
has_subcountries is 'Yes' or 'No'.
name is output from calling decode('utf8', $name).
sequence is a number (1 .. N) indicating the order in which subcountry names appear in the list on that subcountry's Wikipedia page.
See the source code of WWW::Scraper::Wikipedia::ISO3166::Database::Create for details of the SQL used to create the tables.
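The fc_name derivation can be reproduced with Perl's built-in fc() (core since Perl 5.16; the module itself uses Unicode::CaseFold, which gives the same result for this purpose):

```perl
use strict;
use warnings;
use utf8;
use feature 'fc';

# Case-fold a decoded name the way fc_name is derived from name.
my $name    = 'Åland Islands'; # As decoded from a Wikipedia page.
my $fc_name = fc $name;

binmode STDOUT, ':encoding(UTF-8)';
print "$fc_name\n"; # åland islands
```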
What do I do if I find a mistake in the data?
What data? What mistake? How do you know it's wrong?
Also, you must decide what exactly you were expecting the data to be.
If the problem is the ISO data, report it to them.
If the problem is the Wikipedia data, get agreement from everyone concerned and update Wikipedia.
If the problem is the output from my code, try to identify the bug in the code and report it via the usual mechanism. See "Support".
If the problem is with your computer's display of the data, consider (in alphabetical order):
- o CSV
Does the file display correctly in 'Emacs'? On the screen using 'less'?
scripts/export.as.csv.pl uses: use open ':utf8';
Is that not working?
- o DBD::SQLite
Did you set the sqlite_unicode attribute? Use something like:
my($dsn)        = 'dbi:SQLite:dbname=www.scraper.wikipedia.iso3166.sqlite'; # Sample only.
my($attributes) = {AutoCommit => 1, RaiseError => 1, sqlite_unicode => 1};
my($dbh)        = DBI -> connect($dsn, '', '', $attributes);
The SQLite file ships in the share/ directory of the distro, and must be found by File::ShareDir at run time.
Did you set the foreign_keys pragma (if needed)? Use:
$dbh -> do('PRAGMA foreign_keys = ON');
- o HTML
The template htdocs/assets/templates/www/scraper/wikipedia/iso3166/iso3166.report.tx which ships with this distro contains this line:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Is that not working?
- o Locale
Here's my setup:
shell>locale
LANG=en_AU.utf8
LANGUAGE=
LC_CTYPE="en_AU.utf8"
LC_NUMERIC="en_AU.utf8"
LC_TIME="en_AU.utf8"
LC_COLLATE="en_AU.utf8"
LC_MONETARY="en_AU.utf8"
LC_MESSAGES="en_AU.utf8"
LC_PAPER="en_AU.utf8"
LC_NAME="en_AU.utf8"
LC_ADDRESS="en_AU.utf8"
LC_TELEPHONE="en_AU.utf8"
LC_MEASUREMENT="en_AU.utf8"
LC_IDENTIFICATION="en_AU.utf8"
LC_ALL=
- o OS
Unicode is a moving target. Perhaps your OS's installed version of the Unicode files needs updating.
- o SQLite
Both Oracle and SQLite.org ship a program called sqlite3. They are not compatible. Which one are you using? I use the one from SQLite.org.
AFAICT, sqlite3 does not have command line options, or options while running, to set unicode or pragmas.
Why did you use Unicode::Normalize's NFC() for sorting?
This question really asks: why not use NFD() instead?
Run scripts/test.nfc.pl, and the output is:
code2 => AX
code3 => ALA
fc_name => åland islands
has_subcountries => No
id => 15
name => Åland Islands
timestamp => 2012-05-13 23:37:20
And this (Åland Islands) is what Wikipedia displays. So, NFC() it is.
See http://www.perl.com/pub/2012/04, and specifically prescription # 1.
See also section 1.2 Normalization Forms in http://www.unicode.org/reports/tr15/.
See also http://www.unicode.org/faq/normalization.html.
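The difference between the two forms can be seen directly with the core module Unicode::Normalize, using 'Åland' from the output above:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $composed = "\x{C5}land";   # Å as one code point (the NFC form).
my $nfd      = NFD($composed); # A followed by U+030A COMBINING RING ABOVE.

printf "NFC length %d, NFD length %d\n", length($composed), length($nfd);
# NFC length 5, NFD length 6

# NFC() restores the single-code-point form, which is what Wikipedia displays.
print NFC($nfd) eq $composed ? "round-trips\n" : "differs\n"; # round-trips
```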
What is $ENV{AUTHOR_TESTING} used for?
When this env var is 1, scripts output to share/*.sqlite within the distro's dir. That's how I populate the database tables. After installation, the database is elsewhere, and read-only, so you don't want the scripts writing to that copy anyway.
At run-time, File::ShareDir is used to find the installed version of *.sqlite.
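That path selection can be sketched like this. The installed path below is a placeholder; the real one comes from File::ShareDir's dist_dir():

```perl
use strict;
use warnings;

# Choose the database path: the distro's share/ dir while authoring,
# the installed (read-only) copy otherwise.
sub sqlite_path
{
	my $file = 'www.scraper.wikipedia.iso3166.sqlite';

	return $ENV{AUTHOR_TESTING}
		? "share/$file"
		: "/installed/share/dir/$file"; # Really dist_dir(...) . "/$file".
}

print sqlite_path(), "\n";
```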
What FIPS data is included?
Firstly, scripts/get.fips.pages.pl downloads some Wikipedia data, into data/List_of_FIPS_region_codes_*.html.
Secondly, the latter files are parsed by scripts/populate.fips.codes.pl and the 2 reports are in data/wikipedia.fips.codes.txt, and data/wikipedia.fips.mismatch.log.
This data is not written into the SQLite database yet, but it's available in case it's included one day.
Wikipedia's Terms of Use
See http://wikimediafoundation.org/wiki/Terms_of_use.
Also, since I'm distributing copies of Wikipedia-sourced material, reformatted but not changed by editing, I hereby give notice that their material is released under CC-BY-SA. See http://creativecommons.org/licenses/by-sa/3.0/ for that licence.
References
In no particular order:
1: http://en.wikipedia.org/wiki/ISO_3166-2.
2: http://en.wikipedia.org/wiki/ISO_3166-1_alpha-3.
3: http://savage.net.au/Perl-modules/html/WWW/Scraper/Wikipedia/ISO3166/iso.3166-2.html.
5: http://unicode.org/Public/cldr/latest/core.zip.
This is a complex set of XML files concerning currency, postal, etc, formats and other details for various countries and/or languages.
6: For Debian etc users: /usr/share/xml/iso-codes/iso_3166_2.xml, as installed from the iso-codes package, with:
sudo apt-get install iso-codes
8: http://www.geonames.de/index.html.
9: http://www.perl.com/pub/2012/04.
Check the Monthly Archives at Perl.com, starting in April 2012, for a series of Unicode-specific articles by Tom Christiansen.
10: http://www.unicode.org/reports/tr15/.
11: http://www.unicode.org/faq/normalization.html.
Support
Email the author, or log a bug on RT:
https://rt.cpan.org/Public/Dist/Display.html?Name=WWW::Scraper::Wikipedia::ISO3166.
Author
WWW::Scraper::Wikipedia::ISO3166
was written by Ron Savage <ron@savage.net.au> in 2012.
Home page: http://savage.net.au/index.html.
Copyright
Australian copyright (c) 2012 Ron Savage.
All Programs of mine are 'OSI Certified Open Source Software';
you can redistribute them and/or modify them under the terms of
The Artistic License, a copy of which is available at:
http://www.opensource.org/licenses/index.html