Name

Encode::ZapCP1252 - Zap Windows Western Gremlins

Synopsis

  use Encode::ZapCP1252;

  zap_cp1252 $latin1_text;
  fix_cp1252 $utf8_text;

Description

Have you ever been processing a Web form submission, assuming that the incoming text was encoded in ISO-8859-1 (Latin-1), only to end up with a bunch of junk because someone pasted in content from Microsoft Word? Well, this is because Microsoft uses a superset of the Latin-1 encoding called "Windows Western" or "CP1252". So mostly things will come out right, but a few things--like curly quotes, em dashes, ellipses, and the like--will not. The differences are well known; you can see a nice chart documenting them on Wikipedia: http://en.wikipedia.org/wiki/Windows-1252.

Of course, that won't really help you. What will help you is to quit using Latin-1 and switch to UTF-8. Then you can just convert from CP1252 to UTF-8 without losing a thing, just like this:

  use Encode;
  $text = decode 'cp1252', $text, 1;
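
As a slightly fuller sketch of that conversion, the following uses only the core Encode module. The sample bytes and variable names are made up for illustration. FB_CROAK is the named form of the 1 passed above and makes decode() die on malformed input.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Sample CP1252 bytes: curly-quoted "Hi" followed by an ellipsis.
my $bytes = "\x93Hi\x94\x85";

# decode() may modify its source argument when a CHECK value is given,
# so hand it a copy and keep the original bytes intact.
my $copy  = $bytes;
my $chars = decode('cp1252', $copy, FB_CROAK);

# Encode the character string as UTF-8 bytes for output or storage.
my $utf8 = encode('UTF-8', $chars);

printf "%d characters, %d UTF-8 bytes\n", length($chars), length($utf8);
# prints "5 characters, 11 UTF-8 bytes"
```

Each of the three CP1252 gremlins becomes a three-byte UTF-8 sequence, which is why the byte count grows from 5 to 11.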

But I know that there are those of you out there who are stuck with Latin-1 and don't want any junk characters from Word users, and that's where this module comes in. Its zap_cp1252 function will zap those CP1252 gremlins for you, turning them into their appropriate ASCII approximations.

Another case that can occasionally come up is when you are using UTF-8 and reading in text that claims to be UTF-8, but that still ends up with some CP1252 gremlins mixed in with true UTF-8 characters. I've seen examples of just this sort of thing when processing GMail messages and attempting to insert them into a UTF-8 database. Doesn't work so well. So this module also offers fix_cp1252, which converts those CP1252 gremlins into their UTF-8 equivalents.
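
The idea behind this second case can be sketched with the core Encode module alone. This is not the module's implementation, only an illustration that assumes the text has already been decoded into a Perl character string. The sub name repair_gremlins is hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper, not part of Encode::ZapCP1252: replace stray
# CP1252 code points in a decoded character string with the Unicode
# characters they were meant to be. Only the 27 code points that
# CP1252 defines in the 0x80-0x9f range are touched.
sub repair_gremlins {
    my ($text) = @_;
    $text =~ s{([\x80\x82-\x8c\x8e\x91-\x9c\x9e\x9f])}
              {decode('cp1252', pack('C', ord $1))}ge;
    return $text;
}

# A real right single quote (U+2019) next to a stray 0x93/0x94 pair.
my $mixed = "It\x{2019}s \x{93}mixed\x{94}";
my $fixed = repair_gremlins($mixed);
```

The genuine U+2019 is left alone, while the stray 0x93 and 0x94 become U+201C and U+201D. Unlike fix_cp1252, this sketch returns a new string rather than converting in place.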

Usage

This module exports two subroutines: zap_cp1252() and fix_cp1252(). You use these subroutines like so:

  zap_cp1252 $text;
  fix_cp1252 $text;

The zap_cp1252() subroutine performs in-place conversions of any CP1252 gremlins into their appropriate ASCII approximations, while fix_cp1252() converts them, in place, into their UTF-8 equivalents.

Note that because the conversion happens in place, the data to be converted cannot be a string constant; it must be a scalar variable.
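
To illustrate why a constant will not do, here is a minimal sketch of in-place modification through Perl's @_ aliasing. The sub zap_quotes and its two-entry table are hypothetical stand-ins, not the module's code.

```perl
use strict;
use warnings;

# Hypothetical two-entry table; the real module covers many more bytes.
my %ascii = ("\x93" => '"', "\x94" => '"');

# Elements of @_ are aliases to the caller's arguments, so a
# substitution on $_[0] changes the caller's variable directly.
sub zap_quotes {
    $_[0] =~ s/([\x93\x94])/$ascii{$1}/g;
    return;
}

my $text = "\x93quoted\x94";
zap_quotes($text);    # $text now holds "quoted" in plain ASCII quotes

# A string literal is read-only, so the same call is expected to die
# with a "Modification of a read-only value" error.
my $ok = eval { zap_quotes("\x93oops\x94"); 1 };
print "constant rejected: $@" unless $ok;
```

Copying the literal into a scalar first, as with $text above, avoids the error.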

Conversion Table

Here's how the characters are converted to ASCII and UTF-8. The ASCII conversions are not perfect, but they should be good enough for general cleanup. If you want perfect, switch to UTF-8 and be done with it!

   Hex | Char  | ASCII | Unicode Name
  -----+-------+-------+-------------------------------------------
  0x80 |   €   |   e   | EURO SIGN
  0x82 |   ‚   |   ,   | SINGLE LOW-9 QUOTATION MARK
  0x83 |   ƒ   |   f   | LATIN SMALL LETTER F WITH HOOK
  0x84 |   „   |   ,,  | DOUBLE LOW-9 QUOTATION MARK
  0x85 |   …   |  ...  | HORIZONTAL ELLIPSIS
  0x86 |   †   |   +   | DAGGER
  0x87 |   ‡   |   ++  | DOUBLE DAGGER
  0x88 |   ˆ   |   ^   | MODIFIER LETTER CIRCUMFLEX ACCENT
  0x89 |   ‰   |   %   | PER MILLE SIGN
  0x8a |   Š   |   S   | LATIN CAPITAL LETTER S WITH CARON
  0x8b |   ‹   |   <   | SINGLE LEFT-POINTING ANGLE QUOTATION MARK
  0x8c |   Œ   |   OE  | LATIN CAPITAL LIGATURE OE
  0x8e |   Ž   |   Z   | LATIN CAPITAL LETTER Z WITH CARON
  0x91 |   ‘   |   '   | LEFT SINGLE QUOTATION MARK
  0x92 |   ’   |   '   | RIGHT SINGLE QUOTATION MARK
  0x93 |   “   |   "   | LEFT DOUBLE QUOTATION MARK
  0x94 |   ”   |   "   | RIGHT DOUBLE QUOTATION MARK
  0x95 |   •   |   *   | BULLET
  0x96 |   –   |   -   | EN DASH
  0x97 |   —   |   --  | EM DASH
  0x98 |   ˜   |   ~   | SMALL TILDE
  0x99 |   ™   |  (tm) | TRADE MARK SIGN
  0x9a |   š   |   s   | LATIN SMALL LETTER S WITH CARON
  0x9b |   ›   |   >   | SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
  0x9c |   œ   |   oe  | LATIN SMALL LIGATURE OE
  0x9e |   ž   |   z   | LATIN SMALL LETTER Z WITH CARON
  0x9f |   Ÿ   |   Y   | LATIN CAPITAL LETTER Y WITH DIAERESIS

Changing the Table

Don't like these conversions? You can modify them to your heart's content by accessing this module's internal conversion tables. For example, if you wanted zap_cp1252() to use an uppercase E for the euro sign, just do this:

  $Encode::ZapCP1252::ascii_for{"\x80"} = 'E';

Or if, for some bizarre reason, you wanted the UTF-8 equivalent for a bullet converted by fix_cp1252() to really be an asterisk (why would you? Just use zap_cp1252 for that!), you can do this:

  $Encode::ZapCP1252::utf8_for{"\x95"} = '*';

Just remember, this is a global change, so be careful if your code uses this module elsewhere. Of course, it shouldn't really be doing that. These functions are just for cleaning up messes in one spot in your code, not for making a fundamental part of your text handling. For that, use Encode.
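
One way to limit the reach of such a change is Perl's local, which restores the previous value when the enclosing block exits. The sketch below uses a stand-in package, My::Zap, with a package hash shaped like the ones shown above, so that it runs without the module installed. Whether the same approach suits your code depends on how you call the real functions.

```perl
use strict;
use warnings;

# Stand-in for a module that keeps its conversion table in a package
# hash. My::Zap and zap() are hypothetical names for this sketch.
{
    package My::Zap;
    our %ascii_for = ("\x80" => 'e');
    sub zap { $_[0] =~ s/(\x80)/$ascii_for{$1}/g; return }
}

my $price = "\x80 5";
my $other = "\x80 9";

{
    # The override lasts only until the end of this block.
    local $My::Zap::ascii_for{"\x80"} = 'E';
    My::Zap::zap($price);    # converted with 'E'
}

My::Zap::zap($other);        # converted with the original 'e'
print "$price / $other\n";   # prints "E 5 / e 9"
```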

Support

This module is stored in an open repository at the following address:

https://svn.kineticode.com/Encode-ZapCP1252/trunk/

Patches against Encode::ZapCP1252 are welcome. Please send bug reports to <bug-encode-zapcp1252@rt.cpan.org>.

Author

David Wheeler <david@kineticode.com>

Acknowledgements

My thanks to Sean Burke for sending me his original method for converting CP1252 gremlins to more-or-less appropriate ASCII characters.

Copyright and License

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
