NAME

fix_latin - filters a data stream that is predominantly UTF8 and 'fixes' any Latin (i.e. non-ASCII 8-bit) characters

SYNOPSIS

  fix_latin options <input_file >output_file

  Options:

   -?     detailed help message

DESCRIPTION

The script acts as a filter, taking source data which may contain a mix of ASCII, UTF8, ISO8859-1 and CP1252 characters, and producing output that is all ASCII/UTF8.

Multi-byte UTF8 characters will be passed through unchanged. Single-byte characters will be converted as follows:

  0x00 - 0x7F   ASCII - passed through unchanged
  0x80 - 0x9F   Converted to UTF8 using CP1252 mappings
  0xA0 - 0xFF   Converted to UTF8 using Latin-1 mappings
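
The mapping above can be illustrated with a short Python sketch. This is not the script's implementation (that lives in the Encoding::FixLatin Perl module); the function name fix_latin_sketch is hypothetical, and two behaviours are assumptions made for the illustration: a byte sequence counts as multi-byte UTF8 only if Python's strict decoder accepts it, and bytes in the 0x80 - 0x9F range that CP1252 leaves undefined fall back to their Latin-1 code points.

```python
def fix_latin_sketch(data: bytes) -> bytes:
    """Illustrative only: convert mixed ASCII/UTF8/Latin-1/CP1252 bytes to UTF8."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]

        # 0x00 - 0x7F: ASCII, passed through unchanged.
        if b < 0x80:
            out.append(b)
            i += 1
            continue

        # Work out how long a UTF8 sequence would be if b were a lead byte.
        if 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            length = 0

        # A well-formed multi-byte UTF8 character is passed through unchanged.
        if length:
            chunk = data[i:i + length]
            try:
                chunk.decode("utf-8")  # strict: rejects truncated/invalid sequences
                out += chunk
                i += length
                continue
            except UnicodeDecodeError:
                pass  # not valid UTF8, so treat b as a single-byte character

        if b <= 0x9F:
            # 0x80 - 0x9F: CP1252 mappings.
            try:
                char = bytes([b]).decode("cp1252")
            except UnicodeDecodeError:
                # Assumption: bytes undefined in CP1252 fall back to Latin-1.
                char = bytes([b]).decode("latin-1")
        else:
            # 0xA0 - 0xFF: Latin-1 mappings.
            char = bytes([b]).decode("latin-1")

        out += char.encode("utf-8")
        i += 1
    return bytes(out)
```

For example, fix_latin_sketch(b"caf\xe9") yields the UTF8 bytes for "café", while input that is already valid UTF8 comes back unchanged.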

OPTIONS

-?

Display this documentation.

EXAMPLES

This script was originally written to assist in converting a Postgres database from SQL_ASCII encoding to UTF8 (UNICODE) encoding. The following examples illustrate its use in that context.

If you have an SQL-format dump file that you would normally restore by piping into 'psql', you can simply filter the dump file through this script:

  fix_latin < dump_file | psql -d database

If you have a compressed dump file that you would normally restore using 'pg_restore', you can omit the '-d' option on pg_restore and pipe the resulting SQL through this script and into psql:

  pg_restore -O dump_file | fix_latin | psql -d database

To take a look at non-ASCII lines in the dump file:

  perl -ne '/^COPY (\S+)/ and $t = $1; print "$t:$_" if /[^\x00-\x7F]/' dump_file

SEE ALSO

This script is implemented using the Encoding::FixLatin Perl module. For more details see the module documentation with the command:

  perldoc Encoding::FixLatin

In particular, you should read the 'LIMITATIONS' section to understand the circumstances under which data corruption might occur.

COPYRIGHT & LICENSE

Copyright 2009 Grant McLean <grantm at cpan.org>

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.