
NAME

Geo::PostalAddress - Country-specific postal address parsing/formatting

DESCRIPTION

This module converts postal (snail mail) addresses between an unstructured, country-neutral format (an array of character strings) and a country-specific format for postal address entry that's hopefully meaningful to the postal authorities, courier/delivery services, residents, ... of that country. It should handle most countries out of the box with only minor or technical divergences from approved bulk-mailing formats; if needed, country-specific code can be added to make it fully conformant to those formats.

The intended audience for this module is anyone needing to handle most addresses in a recognizable country-specific format, without going into the full generality and complexity that UPU standards would appear to require.

SYNOPSIS

  use Carp;
  use Geo::PostalAddress;

  my $AU_parser = Geo::PostalAddress->new('AU');
  my $format = $AU_parser->format();
  # $format now contains:
  # [['Addr1', 40], ['Addr2', 40], ['Addr3', 40], ['Addr4', 40], 3,
  #  ['City', 40],
  #  ['State', {NSW => "New South Wales", TAS => "Tasmania",
  #             QLD => "Queensland", SA => "South Australia",
  #             WA  => "Western Australia", VIC => "Victoria",
  #             ACT => "Australian Capital Territory",
  #             NT  => "Northern Territory"}], ['Postcode', 4, qr/^\d\d\d\d$/]]
  # 40 in ['Addr1', 40] is the suggested displayed field width (not the maximum
  # length). 3 means that the next 3 fields should/could be on the same row.
  # ['State', {...}] means an enumerated list is used for this field, with keys
  # being the stored values and values being the labels used for display or
  # selection.
  my $display = $AU_parser->display(["4360 DUKES RD", "KALGOORLIE WA 6430"]);
  # $display now contains:
  # {Addr1 => "4360 DUKES RD", City => "KALGOORLIE",
  #  State => "WA", Postcode => "6430"}

  my $US_parser = Geo::PostalAddress->new('US');
  my $address = {Addr1 => "123 MAGNOLIA ST", City => "HEMPSTEAD",
                 State => "NY", ZIP => "11550-1234"};
  my $result = $US_parser->storage($address);
  unless (ref $result) { carp "Bad postal address: $result.\n"; }

  my $AU_to_US_address_label = $US_parser->label("AU", "MR JOHN DOE", $result);
  # What to print on an address label or on an envelope, if mailing from
  # Australia to the United States.

METHODS

new

Geo::PostalAddress->new($country) returns undef, or a blessed reference to a parser suitable for handling the most common postal address formats for that country. Depending on the country, this reference may be blessed into Geo::PostalAddress or into a country-specific subclass.

format

$parser->format() returns a reference to an array describing the (display/input) fields that make a postal address, and gives some hints about on-screen layout. Each element of the array can be an integer n > 0, meaning the next n fields should be on the same line if window/screen width allows it, or a reference to an array describing a field. Each field description contains the field name and either a maximum length for a text field or a hash of {stored => display} values for an enumerated field. An optional regex can also be specified. If present, it should be compatible with both perl and javascript, so it can be used in both client-side and server-side programs or modules.

An example for Australia may be:

  [["Addr1", 40], ["Addr2", 40], ["Addr3", 40], ["Addr4", 40], 3, ["City", 40],
   ["State", {NSW => "New South Wales", TAS => "Tasmania", QLD => "Queensland",
              SA => "South Australia", WA => "Western Australia",
              VIC => "Victoria", ACT => "Australian Capital Territory",
              NT => "Northern Territory"}], ["Postcode", 4, qr/^\d\d\d\d$/]]
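As a rough illustration of how a caller might consume this structure, the sketch below groups field names into on-screen rows by following the integer layout hints. The helper name layout_rows is hypothetical and not part of this module, and the State list is abridged.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Geo::PostalAddress): group the field
# names of a format array into rows, honouring the integer layout hints.
sub layout_rows {
    my ($format) = @_;
    my @rows;
    my $pending = 0;    # how many upcoming fields still belong to the current row
    for my $element (@$format) {
        if (!ref $element) {
            # A plain integer n: the next n fields share one row.
            $pending = $element;
            push @rows, [];
            next;
        }
        if ($pending > 0) {
            push @{ $rows[-1] }, $element->[0];
            $pending--;
        }
        else {
            push @rows, [ $element->[0] ];
        }
    }
    return \@rows;
}

my $format = [
    ["Addr1", 40], ["Addr2", 40], ["Addr3", 40], ["Addr4", 40], 3,
    ["City", 40],
    ["State", {NSW => "New South Wales", WA => "Western Australia"}],  # abridged
    ["Postcode", 4, qr/^\d\d\d\d$/],
];

# Prints one line per row; the last line is "City | State | Postcode".
print join(" | ", @$_), "\n" for @{ layout_rows($format) };
```

A real form generator would also look at the second element of each field description to choose between a text input and a selection list.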

display

$parser->display($stored) converts the postal address in @$stored to a format suitable for data input and returns a reference to a hash. The keys of the hash appear as fieldnames in the return value of $parser->format().

If @$stored doesn't contain an address in the country $parser is an instance of, weird results are nearly certain.

storage

$parser->storage($display) makes country-dependent checks against the postal address in %$display. If it passes all the checks, $parser->storage($display) converts it to a format suitable for storage and returns a reference to an array. Otherwise, $parser->storage($display) returns a string representing an error message.

If %$display doesn't contain an address in the country $parser is an instance of, weird results are nearly certain.

label

$parser->label($origin_country, $recipient, $address) returns a reference to an array containing an address label suitable for correspondence from a sender in $origin_country (2-letter ISO 3166 code) to $recipient (can be a string or an array reference, e.g. ["Aby's Auto Repair", "Kell Dewclaw"]) at $address (as returned from $parser->storage()) in the country for $parser.

The default version just tacks on the name of the destination country, if not the same as the origin country.
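The default behaviour described above could look roughly like the following sketch. It is not the module's code: default_label and %country_name are hypothetical, the destination country is passed explicitly instead of coming from the parser, and placing the recipient lines first is an assumption.

```perl
use strict;
use warnings;

# Assumed lookup table for illustration only; how the module obtains
# country names is not described here.
my %country_name = (AU => "AUSTRALIA", US => "UNITED STATES OF AMERICA");

# Hypothetical stand-in for the default label(): recipient line(s), then
# the stored address lines, then the destination country name if the mail
# crosses a border.
sub default_label {
    my ($origin, $destination, $recipient, $address) = @_;
    my @label = ref $recipient ? @$recipient : ($recipient);
    push @label, @$address;
    push @label, $country_name{$destination} if $origin ne $destination;
    return \@label;
}

my $label = default_label("AU", "US",
                          ["Aby's Auto Repair", "Kell Dewclaw"],
                          ["123 MAGNOLIA ST", "HEMPSTEAD NY 11550-1234"]);
print "$_\n" for @$label;
```

For domestic mail (origin and destination equal), the sketch returns the recipient and address lines only.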

option

$parser->option($name [ , $value] ) returns the setting of option $name for parser $parser, after changing it to $value if specified.

Available options and meaningful values for each option depend on the country $parser is for.

normalize

$parser->normalize($display) normalizes the address in %$display by tweaking unambiguous but technically incorrect elements. It can also, if needed, check it for validity and return an error message. If no problems were found, it should return "".

This method is called from within storage() and display(), and users of this module shouldn't normally need to call it directly. It exists so it can be overridden in subclasses. The default version does nothing.
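To give an idea of what an overriding version might do, here is a minimal sketch written as a plain sub so that it runs without the module. The name normalize_au, the specific tweaks, and the error text are all assumptions; a real override would be a method in a country-specific subclass.

```perl
use strict;
use warnings;

# Hypothetical normalizer for an Australian-style address hash. It fixes
# unambiguous slips in place and returns "" on success or an error message.
sub normalize_au {
    my ($display) = @_;

    # Trim leading and trailing whitespace in every field.
    for my $key (keys %$display) {
        next unless defined $display->{$key};
        $display->{$key} =~ s/^\s+|\s+$//g;
    }

    # State abbreviations are stored in upper case.
    $display->{State} = uc $display->{State} if defined $display->{State};

    return "Postcode must be 4 digits"
        unless defined $display->{Postcode}
            && $display->{Postcode} =~ /^\d{4}$/;
    return "";
}

my %address = (Addr1 => "4360 DUKES RD", City => "KALGOORLIE",
               State => " wa ", Postcode => "6430");
my $error = normalize_au(\%address);
print $error eq "" ? "State is now $address{State}\n" : "Error: $error\n";
```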

INTERNALS

Unless you plan to add a country or change the format information for a country, either directly in the base class (this) or as a subclass, you can safely skip this. (But if you're curious, feel free to read on.)

%per_country_data is a hash using the 2-letter ISO 3166-1 country code as the key. The value is a hash reference ($hr in the following description) with the following fields:

_format

This array reference is actually what $parser->format() returns.

Each element can be a number n > 0, hinting that the next n fields should be on the same line, if the terminal or window width allows it, but otherwise ignored. Otherwise, it is an array reference describing a single field of the address, and has the following elements:

0

The name of a field. For maximum compatibility with form description languages (including the forms part of HTML), this should match /^\w+$/ in the C locale, but this module only requires that it not contain {}. The name should be present in map { $_->{DisplayName} } @{$hr->{_s2d_map}} (see _s2d_map below).

1

Can be a number > 0, indicating the maximum length of a text field, or a hash of { stored => displayed } mappings, indicating an enumerated field. (Note that in the latter case, the order and layout of the values are left to the discretion of the user of this module.)

2

An optional validation regex can also be specified. If present, it should be compatible with both perl and javascript, so it can be used in both client-side and server-side programs or modules. Note that although most regexes would be anchored at both ends, this isn't required or enforced.
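The three elements of a field description can be used together to check user input. The sketch below is one possible reading, not the module's own validation: check_fields is a hypothetical name, empty fields are skipped (whether a field is mandatory is not expressed in this structure), and the numeric element is ignored.

```perl
use strict;
use warnings;

# Hypothetical checker: returns "" if every non-empty field in %$display
# passes its enumeration and regex checks, else an error message.
sub check_fields {
    my ($format, $display) = @_;
    for my $field (grep { ref } @$format) {    # skip integer layout hints
        my ($name, $spec, $regex) = @$field;
        my $value = $display->{$name};
        next unless defined $value && length $value;

        # Element 1 as a hash: the value must be one of the stored keys.
        if (ref $spec eq 'HASH' && !exists $spec->{$value}) {
            return "$name: unknown value '$value'";
        }
        # Element 2, if present: the value must match the regex.
        if (defined $regex && $value !~ $regex) {
            return "$name: '$value' does not match the expected pattern";
        }
    }
    return "";
}

my $format = [
    ["Addr1", 40], 3, ["City", 40],
    ["State", {WA => "Western Australia", VIC => "Victoria"}],  # abridged
    ["Postcode", 4, qr/^\d\d\d\d$/],
];

my $ok  = check_fields($format, {City => "KALGOORLIE", State => "WA",
                                 Postcode => "6430"});
my $bad = check_fields($format, {City => "KALGOORLIE", State => "ZZ",
                                 Postcode => "6430"});
print "ok: '$ok'\nbad: '$bad'\n";
```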

_s2d_map

(storage-to-display map) This is an array of hash references, each describing how to retrieve the value of one display field from the stored unstructured text strings. Each element has the following fields:

StoredRownum

(stored row number) The row in the array of text lines where the field is. That number is used as a perl-style array index (>= 0 from the start, < 0 back from the end), with one exception. If a given unstructured address doesn't have enough rows for the positive and negative indices to map to distinct rows, the negative indices take precedence: any positive index that would fall in the region running from the row selected by the negative index with the largest absolute value to the end of the array returns "" instead of the actual row. For example, using the array of lines qw(eenie meenie minie moe), indexes -2, 0, 1, 2, 3 would return "minie", "eenie", "meenie", "", "" (even though there is no index -1 that would return "moe").
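The overlap rule is easier to follow in code. The sketch below is an illustration of the rule as described here, not the module's implementation, and fetch_rows is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: resolve each row index in @$indices against @$lines,
# letting negative indices win when positive and negative ones overlap.
sub fetch_rows {
    my ($lines, $indices) = @_;
    my $n = @$lines;

    # Largest absolute value among the negative indices (0 if none).
    my $tail = 0;
    for my $i (@$indices) {
        $tail = -$i if $i < 0 && -$i > $tail;
    }
    # First row of the region reserved for the negative indices.
    my $tail_start = $n - $tail;

    return map {
        my $i = $_;
        $i < 0
            ? ($n + $i >= 0     ? $lines->[$i] : "")   # too far back: ""
            : ($i < $tail_start ? $lines->[$i] : "");  # inside reserved tail: ""
    } @$indices;
}

my @rows = fetch_rows([qw(eenie meenie minie moe)], [-2, 0, 1, 2, 3]);
print join(", ", map { qq("$_") } @rows), "\n";
# prints: "minie", "eenie", "meenie", "", ""
```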

StoredColnum

(stored column number) The optional column in the line where the field (or regex input) starts, from 0 for the first column. If absent, the field (or regex input) is the whole line, even if StoredCollen is present. Note that StoredColnum can be negative (with the expected result for the second argument to substr), but if so, there's no special handling, unlike for StoredRownum.

StoredCollen

(stored column length) The optional length of the field (or regex input). If absent or if StoredColnum is absent, the field (or regex input) extends to the end of the line. Note that StoredCollen can be negative (with the expected result for the third argument to substr), but if so, there's no special handling, unlike for StoredRownum.

StoredRegexnum

(stored regex number) The optional index of a regular expression in @{$hr->{_regexes}} to be matched against the line (or the substring selected by StoredColnum and StoredCollen if applicable) to extract the field value from it. See the description of _regexes below for important restrictions on regex use.

StoredFieldnum

(stored field number) The optional index into the array returned by the regex matching mentioned above of the data to be returned as the field value. Note that if StoredRegexnum is present, StoredFieldnum must be present too.

DisplayName

(display (field) name) The name of a field in @{$hr->{_format}}. This is also the key used in the record hash returned by $parser->display().

Note that although StoredColnum, StoredCollen, StoredRegexnum, and StoredFieldnum are all optional, not all combinations make sense. Specifically:

  • At least one of StoredColnum and StoredRegexnum must be present; if both are, StoredColnum (and StoredCollen if also present) are used before StoredRegexnum and StoredFieldnum.

  • If StoredCollen is present without StoredColnum, it is ignored.

  • If StoredRegexnum is present, StoredFieldnum must be present too; if StoredFieldnum is present without StoredRegexnum, it is ignored.
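The order of operations for a single map entry (substring first, then regex) can be sketched as follows. This is an illustration under assumptions, not the module's code: extract_field is a hypothetical name, the regex and map entries are made up for an Australian last line, and StoredFieldnum is assumed to index the list of captures with 0 for the first one.

```perl
use strict;
use warnings;

# Hypothetical helper: pull one display field out of one stored line.
sub extract_field {
    my ($line, $entry, $regexes) = @_;
    my $text = $line;

    # Step 1: optional substring selection. StoredCollen only counts when
    # StoredColnum is present.
    if (defined $entry->{StoredColnum}) {
        $text = defined $entry->{StoredCollen}
            ? substr($line, $entry->{StoredColnum}, $entry->{StoredCollen})
            : substr($line, $entry->{StoredColnum});
    }

    # Step 2: optional regex match against whatever step 1 left.
    if (defined $entry->{StoredRegexnum}) {
        my @captures = $text =~ $regexes->[ $entry->{StoredRegexnum} ];
        my $value = $captures[ $entry->{StoredFieldnum} ];
        return defined $value ? $value : "";
    }
    return $text;
}

my @regexes = ('^(.*?)\s+([A-Z]{2,3})\s+(\d{4})$');
my $line    = "KALGOORLIE WA 6430";
my @map     = (
    {DisplayName => 'City',     StoredRegexnum => 0, StoredFieldnum => 0},
    {DisplayName => 'State',    StoredRegexnum => 0, StoredFieldnum => 1},
    {DisplayName => 'Postcode', StoredColnum   => -4},
);

my %display = map { $_->{DisplayName} => extract_field($line, $_, \@regexes) } @map;
print "$_ => $display{$_}\n" for sort keys %display;
```

The sketch matches the regex afresh on each call; the result caching described under _regexes is left out to keep it short.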

_d2s_map

(display-to-storage map) This is an array of hash references, each describing how to generate one line of the unstructured string array used for storage from the parsed fields used for display. Each element has the following fields:

StoredTemplate

(stored template) A string containing boilerplate text and field references of the form ${foo} for field foo (using the field names in _format and _s2d_map). Currently, there is no way to escape $, {, or } if they're part of a sequence that could be interpreted as a field reference.

StoredRownum

(stored row number) A number that indicates in which row of the unstructured storage string array this should go. This can be positive, 0, or negative, with the same intended meaning as for _s2d_map, except that while putting the array together, it grows in the middle as necessary to accommodate positive indexes.
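Expanding a StoredTemplate amounts to substituting each ${foo} reference with the matching display field. The sketch below shows one way to do that; expand_template is a hypothetical name, and replacing a missing field with an empty string is an assumption, not documented behaviour.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every ${name} in $template with the value of
# field "name" from %$fields (or "" if the field is absent).
sub expand_template {
    my ($template, $fields) = @_;
    (my $line = $template) =~
        s/\$\{([^{}]+)\}/defined $fields->{$1} ? $fields->{$1} : ""/ge;
    return $line;
}

my %display = (Addr1 => "4360 DUKES RD", City => "KALGOORLIE",
               State => "WA", Postcode => "6430");

# Single quotes keep Perl from interpolating ${City} itself.
my $last_line = expand_template('${City} ${State} ${Postcode}', \%display);
print "$last_line\n";    # prints: KALGOORLIE WA 6430
```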

_regexes

(regular expressions) A reference to an array of strings representing regexes, in any form perl will accept (single-quoted, double-quoted, qr//, etc.) for use in parsing unstructured storage strings into structured display fields. Note that each regex is matched at most once in the course of a single invocation of $parser->display(), and its results are cached for reuse. This is true even if a subsequent match would use another string than the first. In practice, this isn't a problem, as a given regex would normally be applied to one storage line only. However, if this isn't the case, that regex must be repeated, each line pointing (through StoredRegexnum) to its own copy.
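The caching caveat can be demonstrated with a small sketch. It imitates the behaviour described above and is not the module's code; cached_match and the sample regex are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: match regex number $regexnum against $text, but only
# the first time that regex number is seen; later calls reuse the result.
sub cached_match {
    my ($cache, $regexes, $regexnum, $text) = @_;
    $cache->{$regexnum} = [ $text =~ $regexes->[$regexnum] ]
        unless exists $cache->{$regexnum};
    return $cache->{$regexnum};
}

# The same regex stored twice, so two different lines can each use a copy.
my @regexes = ('^(\d+)\s+(.*)$') x 2;
my %cache;    # would be reset at the start of each display() call

my $first = cached_match(\%cache, \@regexes, 0, "4360 DUKES RD");
my $stale = cached_match(\%cache, \@regexes, 0, "12 HIGH ST");  # cached result
my $fresh = cached_match(\%cache, \@regexes, 1, "12 HIGH ST");  # own copy

print "$first->[0] $stale->[0] $fresh->[0]\n";    # prints: 4360 4360 12
```

The second call shows why a regex shared between lines has to be duplicated: without its own copy, the second line silently gets the first line's captures.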

%default_per_country_data is similar, but for countries with unspecified address formats. It's a single hash with the same structure as %$hr above.

Geo::PostalAddress->new() initializes the object hash with those fields, and adds a _country_code field that holds the 2-letter code, in case we need to retrieve other info later.

Note that the above applies to the base class only. Subclasses may use other or different data, instead of or in addition to this.

BUGS

Only 2-letter country codes are supported.

A knob to carp on some errors would be nice.

Objects returned by the new method may actually be blessed into a country-specific subclass. This makes it impossible to have derived classes other than the country-specific ones.

40 is used as the suggested length for all text fields. This is probably too long for some and too short for others.

Support for most countries ranges from non-existent to sketchy.

The method name "display" is arguably a poor choice.

Some messages should go through a translation table.

Data validation should probably be a method of its own.

This module doesn't yet deal well with countries that want the recipient name in another position than 1st line, or the country name in another position than last line. Examples of such countries are: Ukraine (wants country, city+postcode, street address, recipient name from top down instead of the more widespread bottom up), Turkmenistan (wants city+postcode, country, recipient name, street address, from top down), Grenada (wants a supranational line - West Indies - below the country name). The interface to do that exists, but is do-nothing until I figure out how to deal with address formats for use between countries with conflicting requirements.

This module doesn't deal well with countries where the address format depends on the script used, such as Saudi Arabia.

This module doesn't yet support entities with their own ISO 3166-1 code that use another country's address format, including the country name.

This module assumes "no locale", and blissfully mixes character classes that could conceivably match in the locale with classes that have to match according to the Roman alphabet (e.g., US ZIP codes and Canadian postal codes). This is probably nearly impossible to fix, as the relevant locale isn't well-defined anyway. (The locale for the machine running the application? The locale for the user? Or the locale for the country the address is in?)

This module assumes that the privileged order for entering address components is top-down, left-to-right, according to the standard or most common address format. This may not be true of countries where the dominant language is written right-to-left.

This module doesn't use the PATDL (http://xml.coverpages.org/Lubenow-PATDL200204.pdf) in the address parsing rules.

HISTORY

SEE ALSO

Locale::Country(3)

Locale::Subcountry(3)

http://www.upu.int/post_code/en/postal_addressing_systems_member_countries.shtml

http://www.bitboost.com/ref/international-address-formats.html

http://www.indiapost.org/Netscape/Pincode.html

http://www.sterlingdata.com/colombia.htm

http://mailservices.berkeley.edu/intladdr.pdf (previous version of the first URL, incorrect in spots, and to be used only if no other info is available)

http://www.kevinandkell.com/

CONTRIBUTORS

Ailbhe, DamienPS, LeiaCat, Renée, and Martin DeMello clarified, corrected, or explained standards or usage for specific countries. See acknowledgements in comments throughout the source code.

Bill Holbrook draws (and holds the copyright to) comic strip Kevin and Kell, from which I got the names used in the description for $parser->label().

AUTHOR AND LICENSE

Copyright (c) 2004, Michel Lavondès. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  • Neither the name of the Copyright holder nor the names of any contributors may be used to endorse or promote products derived from this software without specific prior written permission.

This software is provided by the copyright holder and contributors "as is" and any express or implied warranties, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose are disclaimed. In no event shall the copyright holder or contributors be liable for any direct, indirect, incidental, special, exemplary, or consequential damages (including, but not limited to, procurement of substitute goods or services; loss of use, data, or profits; or business interruption) however caused and on any theory of liability, whether in contract, strict liability, or tort (including negligence or otherwise) arising in any way out of the use of this software, even if advised of the possibility of such damage.