NAME

String::UnicodeUTF8 - non-collation related unicode/utf-8 bytes string-type-agnostic utils that work as far back as perl 5.6

VERSION

This document describes String::UnicodeUTF8 version 0.23

SYNOPSIS

    use String::UnicodeUTF8 qw(char_count bytes_size is_unicode);

    say '$string type is: ' . ( is_unicode($string) ? 'Unicode' : 'bytes' );

    say '$string has this many characters: ' . char_count($string);

    say '$string takes up this many bytes: ' . bytes_size($string);

DESCRIPTION

Unicode is awesome. utf-8 is also awesome. They are related but different. That difference, and all the little twiggles in between, make it appear to be too hard, but it's really not, honest!

The unicode problem is a solved one. The easiest way to manage day to day is have a couple of simple items in mind:

“Unicode” is a set of characters.

Example: ♥ is Unicode character number 2665 (hexadecimal numbers those be)

“utf-8” is an encoding of Unicode characters

Example: ♥ (i.e. Unicode character number 2665) is made up of 3 octets, or “characters” semantically, numbered: e2, 99, and a5 (hexadecimal numbers those be)
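To see the character/encoding split for yourself, here is a minimal sketch using the core Encode module (so it assumes a perl that ships Encode, i.e. 5.7.3 or later). It only prints hex, so nothing depends on your terminal's settings:

```perl
use strict;
use warnings;
use Encode ();

# U+2665 as one character (a character string).
my $heart = "\x{2665}";

# Encode it as utf-8; encode() returns a new string of octets.
my $octets = Encode::encode( 'UTF-8', $heart );

# Render the code point and each octet as hex for inspection.
my $code_point = sprintf '%x', ord($heart);
my $octet_hex  = join ' ', map { sprintf '%02x', ord($_) } split //, $octets;

print "code point:   $code_point\n";    # 2665
print "utf-8 octets: $octet_hex\n";     # e2 99 a5
```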

You (almost) always want to input/output bytes in utf-8

By this I mean all of the files, database connections/schema, HTTP requests/responses, etc. You may very well need to encode to/from utf-8 when dealing with third-party/external stuff you have little control over.

I say almost because it is possible to use any number of encodings and I suppose you might encounter a situation when you have no other choice. But if you have a choice and an IQ in the double digits just do utf-8, it's not that hard to do and you’ll exponentially* make your life and others' easier.

If you do have a situation (and it's not an ignorant boss/client forcing his moron–induced–FUD on you) please drop me a line w/ details. Who knows, I may recant!

* no actual math has been harmed in this statement, patches welcome!

perl basically has 2 types of strings: “Unicode” and “bytes”

The former has the UTF-8 SV flag set which tells perl to treat a Unicode character as one item (i.e. as opposed to 3 in our ♥ example).

The latter are just bytes that could be anything (hopefully explicitly utf-8 in our case!).
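A quick way to see the two types side by side. This is a sketch that assumes perl 5.8.1 or later, where utf8::is_utf8() is available without loading anything:

```perl
use strict;
use warnings;

my $unicode = "I \x{2665} perl";        # the heart is one item
my $bytes   = "I \xe2\x99\xa5 perl";    # the heart is three items

# Same text conceptually, but perl counts the items differently.
my $unicode_len = length($unicode);     # 8
my $bytes_len   = length($bytes);       # 10

# The UTF-8 flag is what tells the two apart.
my $unicode_flag = utf8::is_utf8($unicode) ? 1 : 0;    # 1
my $bytes_flag   = utf8::is_utf8($bytes)   ? 1 : 0;    # 0

printf "Unicode: length %d, flag %d\n", $unicode_len, $unicode_flag;
printf "bytes:   length %d, flag %d\n", $bytes_len,   $bytes_flag;
```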

What this module is not meant for

Collation.

Use something like Unicode::Collate for that.

Unicode problem stuff.

See perlunicode for more info.

Anything not explicitly stated in the POD.

What this module is meant for

Consistent terminology.

The terms “utf-8” and “Unicode” (akin to “encoding” and “charset”) are typically used ambiguously, and perl docs are not immune.

Either term could mean a Unicode string or a bytes string depending on the “thing” in question. Ick, just ick. That is where this module comes in.

It defines those concepts strictly as “Unicode string” and “utf-8 bytes string” (the latter is shortened by removing the first or second word because they are essentially synonymous conceptually).

Based on that it gives functions that operate consistently regardless of the type (or regardful if you intend one or the other, your needs; your call).

Availability

The functions necessary to do all of this are not available on older perls.

e.g. utf8::is_utf8 is not available before 5.8.1. Encode is not available before 5.7.3.

The steps to do the things this does are better wrapped up for sanity/reusability.

Do I need to encode, decode, upgrade, or downgrade?

Do I use the return value or does it modify the SV in place?
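To make those two questions concrete, here is a small sketch of what the core tools do on a perl that has them (5.8.1 or later is assumed). It is only an illustration of why wrapping is nice; it is not how this module is implemented:

```perl
use strict;
use warnings;
use Encode ();

my $bytes = "I \xe2\x99\xa5 perl";    # utf-8 bytes string, 10 bytes

# Encode::decode() (called without a CHECK argument) leaves its
# argument alone and returns the decoded string.
my $decoded = Encode::decode( 'UTF-8', $bytes );

# utf8::decode() rewrites its argument in place and returns true
# on success.
my $in_place = $bytes;
my $ok       = utf8::decode($in_place);

# utf8::upgrade() only changes perl's internal representation. The
# characters (and so the length) stay the same, which is why it is
# not a substitute for decoding.
my $latin1 = "caf\xe9";
utf8::upgrade($latin1);

printf "%d %d %d %d\n",
  length($bytes), length($decoded), length($in_place), length($latin1);
# prints "10 8 8 4"
```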

Glossary

This glossary holds true when doing the stuff this module does only with this module. If you fiddle with the guts then it's more likely you can end up in a wonky pseudo state.

UTF-8 Bytes String

A string of bytes whose Unicode characters are made up of utf-8 byte sequences (e.g. \xe2\x99\xa5 in our heart example). Each Unicode character is handled internally by perl as the bytes that make it up (and not as a single Unicode character).

Unicode String

A "UTF-8 Bytes String" that additionally has its UTF-8 flag set so that perl treats each utf-8 byte sequence as the individual Unicode character it makes up (e.g. \x{2665} in our heart example).

A word on unicode and utf-8 representation in source code

Another point of confusion can be how unicode and utf-8 are represented in source code and the default or pragma set treatment of utf-8.

The character itself:

    perl -e 'print utf8::is_utf8("I ♥ perl") . "\n";'          # could be a "UTF-8 Bytes String" or a "Unicode String" depending on perl’s “mode”.
    perl -e 'use utf8;print utf8::is_utf8("I ♥ perl") . "\n";' # a "Unicode String" because of perl’s “mode”.
    perl -e 'no utf8;print utf8::is_utf8("I ♥ perl") . "\n";'  # a "UTF-8 Bytes String" because of perl’s “mode”.

\x octet notation:

    perl -e 'print utf8::is_utf8("I \xe2\x99\xa5 perl") . "\n";'          # a "UTF-8 Bytes String" regardless of perl’s “mode”.
    perl -e 'use utf8;print utf8::is_utf8("I \xe2\x99\xa5 perl") . "\n";' # a "UTF-8 Bytes String" regardless of perl’s “mode”.
    perl -e 'no utf8;print utf8::is_utf8("I \xe2\x99\xa5 perl") . "\n";'  # a "UTF-8 Bytes String" regardless of perl’s “mode”.

\x unicode notation:

    perl -e 'print utf8::is_utf8("I \x{2665} perl") . "\n";'          # a "Unicode String" regardless of perl’s “mode”.
    perl -e 'use utf8;print utf8::is_utf8("I \x{2665} perl") . "\n";' # a "Unicode String" regardless of perl’s “mode”.
    perl -e 'no utf8;print utf8::is_utf8("I \x{2665} perl") . "\n";'  # a "Unicode String" regardless of perl’s “mode”.

bracketed \x octet:

This one I don’t like. It is ambiguous (it is octets but it looks like unicode). I almost always only see it when data is in the process of being corrupted.

    perl -e 'print utf8::is_utf8("I \x{e2}\x{99}\x{a5} perl") . "\n";' # a "UTF-8 Bytes String": each bracketed value is below 0x100, so these are three octets.

A good rule of thumb is to be explicit about your intent: use the bracketed form with 4+ digits (zero-padded if necessary) and the non-bracketed form with 2 digits.

Tips on troubleshooting Unicode/utf-8 problems

I’ll maintain some more detailed Unicode resources at my Unicode page but for this doc there are 3 things that will help you:

1 check the bytes

Don’t look so much at the seemingly corrupt display; examine the bytes at the source. Once you verify they are legit you can move on to finding out what it is that is mishandling them along the route.

For example, you might do a SELECT on a column and also include the column in HEX and the character and bytes lengths of the column in the query. If the bytes are correct but the character length is wrong then that is a great hint as to where to look next.

For perl, make sure you do so on bytes strings:

    multivac:~ dmuey$ perl -le 'no utf8;print unpack("H*", "I ♥ Perl");'
    4920e299a5205065726c
    multivac:~ dmuey$ perl -le 'use utf8;print unpack("H*", "I ♥ Perl");'
    492065205065726c
    multivac:~ dmuey$ perl -le 'no utf8;print pack("H*", "4920e299a5205065726c");'
    I ♥ Perl
    multivac:~ dmuey$ perl -le 'use utf8;print pack("H*", "4920e299a5205065726c");'
    I ♥ Perl
    multivac:~ dmuey$ perl -le 'use utf8;print pack("H*", "492065205065726c");'
    I e Perl
    multivac:~ dmuey$ perl -le 'no utf8;print pack("H*", "492065205065726c");'
    I e Perl
    multivac:~ dmuey$

Even better, use a tool that does what you mean regardless of the type of string:

e.g. Devel::Kit does what you mean regardless of the type (via this module as it happens ;p):

    [dmuey@multivac ~]$ perl -MDevel::Kit -e 'no utf8;xe("I ♥ Perl",1);'
    debug(): Hex:       [
          'I : 49',
          '  : 20',
          '♥ : e299a5',
          '  : 20',
          'P : 50',
          'e : 65',
          'r : 72',
          'l : 6c'
        ]
    [dmuey@multivac ~]$ perl -MDevel::Kit -e 'use utf8;xe("I ♥ Perl",1);'
    debug(): Hex:       [
          'I : 49',
          '  : 20',
          '♥ : e299a5',
          '  : 20',
          'P : 50',
          'e : 65',
          'r : 72',
          'l : 6c'
        ]
    [dmuey@multivac ~]$

2 use the simplest scenario

If you can rule out as many factors as possible (HTTP request/response, database settings, perl -E enabling optional features that could affect Unicode/utf8-bytes, etc.) it will help you home in on where your good bytes went bad.

3 use the simplest string

I tend to use 'I ♥ Unicode' so that there is one multi-byte Unicode character to examine. Also, it is a visible character that most fonts support, which helps.

INTERFACE

All of these functions are exportable.

is_unicode()

Like utf8::is_utf8() but is less ambiguously named* and works on perls before utf8::is_utf8() and Encode::is_utf8() as far back as, at least, 5.6.2.

There is one rare caveat: if you have an old perl, you have a string that contains no Unicode characters, you are in compiled perl w/ B optimized away, and you've upgraded a string outside of the functions in this module (or use the same text in different scalars), then you *may* get erroneous results.

* is_utf8() does not mean “are these bytes in utf-8 encoding (as opposed to, say, utf-16, latin1, etc.)”, it means “are these bytes in utf-8 encoding and is the UTF-8 flag set on this string” (i.e. is this a Unicode string).

Don’t take my word for it, try it yourself:

    perl -e 'print utf8::is_utf8("I \xe2\x99\xa5 perl") . "\n";print utf8::is_utf8("I \x{2665} perl") . "\n";' # this is the same on 5.6.2 as 5.16.0

char_count()

Get the number of characters, conceptually, of the given string regardless of the argument’s type.

e.g. "I \x{2665} perl" and "I \xe2\x99\xa5 perl" both have 8 characters. The latter just happens to be encoded in utf-8 which uses a sequence of three smaller “characters” to represent the one conceptual unicode character “♥”.

bytes_size()

Get the number of bytes of the given string regardless of the argument’s type.
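For the curious, the idea behind both counts can be sketched with core functions on perl 5.8.1 or later. The sketch_* names are hypothetical helpers for illustration; they are not this module's code, which also has to cope with older perls:

```perl
use strict;
use warnings;

# Hypothetical helper: count characters whichever type comes in.
sub sketch_char_count {
    my ($string) = @_;    # a copy, so the caller's string is untouched
    utf8::decode($string) unless utf8::is_utf8($string);
    return length($string);
}

# Hypothetical helper: count bytes whichever type comes in.
sub sketch_bytes_size {
    my ($string) = @_;
    utf8::encode($string) if utf8::is_utf8($string);
    return length($string);
}

for my $string ( "I \x{2665} perl", "I \xe2\x99\xa5 perl" ) {
    printf "%d characters, %d bytes\n",
      sketch_char_count($string), sketch_bytes_size($string);
}
# prints "8 characters, 10 bytes" twice
```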

get_unicode()

Get a "Unicode String" version of the given string regardless of the argument’s type.

get_utf8()

Get a "UTF-8 Bytes String" version of the given string regardless of the argument’s type.
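Roughly speaking, on a perl with the utf8:: functions (5.8.1 or later is assumed) the two conversions boil down to something like the sketch below. The sketch_* names are hypothetical and this is not the module's implementation:

```perl
use strict;
use warnings;

# Hypothetical helper: always hand back a string with the flag set.
sub sketch_get_unicode {
    my ($string) = @_;
    utf8::decode($string) unless utf8::is_utf8($string);
    utf8::upgrade($string);    # sets the flag even for plain ASCII
    return $string;
}

# Hypothetical helper: always hand back utf-8 bytes.
sub sketch_get_utf8 {
    my ($string) = @_;
    utf8::encode($string) if utf8::is_utf8($string);
    return $string;
}

my $unicode = sketch_get_unicode("I \xe2\x99\xa5 perl");
my $bytes   = sketch_get_utf8("I \x{2665} perl");

# Feeding a result back in changes nothing, which is the point of
# being type-agnostic.
my $again = sketch_get_utf8($bytes);
```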

escape_utf8_or_unicode()

Serialize unicode characters as slash-x notation: \x{2665} style if the argument was a "Unicode String", \xe2\x99\xa5 style if the argument was a "UTF-8 Bytes String".

Returns a "UTF-8 Bytes String" since it should contain no unicode characters at this point.

escape_utf8()

Like escape_utf8_or_unicode() but force it to be in "UTF-8 Bytes String" style \xe2\x99\xa5 notation.

escape_unicode()

Like escape_utf8_or_unicode() but force it to be in "Unicode String" style \x{2665} notation.
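To illustrate the two notations, here is a sketch of what such escaping could look like with core functions on perl 5.8.1 or later. The sketch_* helpers are hypothetical, and details such as zero padding are my assumptions rather than a statement of this module's exact output:

```perl
use strict;
use warnings;

# Hypothetical helper: non-ASCII characters become \x{....} escapes.
sub sketch_escape_unicode {
    my ($string) = @_;
    utf8::decode($string) unless utf8::is_utf8($string);
    $string =~ s/([^\x00-\x7f])/sprintf('\\x{%04x}', ord($1))/ge;
    return $string;
}

# Hypothetical helper: non-ASCII octets become \x.. escapes.
sub sketch_escape_utf8 {
    my ($string) = @_;
    utf8::encode($string) if utf8::is_utf8($string);
    $string =~ s/([\x80-\xff])/sprintf('\\x%02x', ord($1))/ge;
    return $string;
}

# The results are pure ASCII, so they are safe to print anywhere.
print sketch_escape_unicode("I \xe2\x99\xa5 perl"), "\n";    # I \x{2665} perl
print sketch_escape_utf8("I \x{2665} perl"),        "\n";    # I \xe2\x99\xa5 perl
```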

unescape_utf8_or_unicode()

Turn slash-x notation back into the character.

If there was a "Unicode String" \x{2665} style escape it returns a "Unicode String".

Otherwise it returns a "UTF-8 Bytes String".

unescape_utf8()

Like unescape_utf8_or_unicode() but force it to return a "UTF-8 Bytes String" regardless of slash-x type.

unescape_unicode()

Like unescape_utf8_or_unicode() but force it to return a "Unicode String" regardless of slash-x type.
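Going the other way can be sketched like this. sketch_unescape_unicode() is a hypothetical helper that accepts either escape style and returns characters; it assumes perl 5.8.1 or later and is not the module's implementation:

```perl
use strict;
use warnings;

sub sketch_unescape_unicode {
    my ($string) = @_;

    # Octet escapes first: \xe2\x99\xa5 becomes three raw bytes.
    $string =~ s/\\x([0-9a-fA-F]{2})/chr(hex($1))/ge;

    # Decode so that those three bytes collapse into one character.
    utf8::decode($string) unless utf8::is_utf8($string);

    # Then code point escapes: \x{2665} becomes one character.
    $string =~ s/\\x\{([0-9a-fA-F]+)\}/chr(hex($1))/ge;

    return $string;
}

my $from_octets     = sketch_unescape_unicode('I \xe2\x99\xa5 perl');
my $from_code_point = sketch_unescape_unicode('I \x{2665} perl');

printf "%d %d\n", length($from_octets), length($from_code_point);
# prints "8 8"
```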

quotemeta_bytes()

Unicode aware version of quotemeta() that returns a "UTF-8 Bytes String" that has unicode characters represented as their characters.

quotemeta_utf8()

Unicode aware version of quotemeta() that returns a "UTF-8 Bytes String" that has unicode characters represented in \xe2\x99\xa5 notation.

quotemeta_unicode()

Unicode aware version of quotemeta() that returns a "Unicode String" that has unicode characters represented in \x{2665} notation.
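Why bother with a Unicode aware version? The builtin quotemeta(), given a bytes string, also backslashes the high bytes, which chops up the utf-8 sequences. A sketch of the “leave the characters as their characters” idea, using a hypothetical helper (not the module's implementation) and assuming perl 5.8.1 or later:

```perl
use strict;
use warnings;

# Hypothetical helper: quote only the ASCII runs and leave the utf-8
# byte sequences of non-ASCII characters untouched.
sub sketch_quotemeta_bytes {
    my ($string) = @_;
    utf8::encode($string) if utf8::is_utf8($string);
    $string =~ s/([\x00-\x7f]+)/quotemeta($1)/ge;
    return $string;
}

my $plain  = quotemeta("I \xe2\x99\xa5 perl.");
my $sketch = sketch_quotemeta_bytes("I \x{2665} perl.");

# $sketch keeps \xe2\x99\xa5 together; $plain puts a backslash in
# front of each of those three bytes.
print $plain eq $sketch ? "same\n" : "different\n";    # different
```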

unquotemeta_bytes()

Alias of unquotemeta_utf8(). Exists to semantically correspond to quotemeta_bytes().

unquotemeta_utf8()

Unicode aware version of "unquotemeta()" in String::Unquotemeta that returns a "UTF-8 Bytes String".

unquotemeta_unicode()

Unicode aware version of "unquotemeta()" in String::Unquotemeta that returns a "Unicode String".

contains_nonhuman_characters()

Returns true if the given string contains invisible, Control, or WhiteSpace (other than a normal space) characters regardless of the argument’s type. Returns false otherwise.

After the string you can pass in a hash of certain “special” characters you may want to allow.

e.g. this is the same as `contains_nonhuman_characters($string)` except it will allow non breaking space character also:

    contains_nonhuman_characters($string, 'NO-BREAK SPACE' => 1);

The valid keys are:

'NO-BREAK SPACE'

U+00A0

'LINE FEED (LF)'

U+000A

'CARRIAGE RETURN (CR)'

U+000D

'CHARACTER TABULATION'

U+0009
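As a rough illustration of the allow list, here is a sketch with a hypothetical helper. The Unicode categories it checks (Cc, Cf, Zl, Zp, and Zs other than a normal space) are my assumption of what “invisible, Control, or WhiteSpace” covers; the module's real rules may differ. It assumes perl 5.8.1 or later:

```perl
use strict;
use warnings;

# Code points for the documented allowance keys.
my %SKETCH_ALLOWABLE = (
    'NO-BREAK SPACE'       => 0x00A0,
    'LINE FEED (LF)'       => 0x000A,
    'CARRIAGE RETURN (CR)' => 0x000D,
    'CHARACTER TABULATION' => 0x0009,
);

sub sketch_contains_nonhuman {
    my ( $string, %allow ) = @_;

    # Look at characters, not bytes, whichever type we were handed.
    utf8::decode($string) unless utf8::is_utf8($string);

    # Collect the code points the caller asked us to let through.
    my %allowed_cp =
      map  { ( $SKETCH_ALLOWABLE{$_} => 1 ) }
      grep { $allow{$_} && exists $SKETCH_ALLOWABLE{$_} } keys %allow;

    for my $char ( split //, $string ) {
        next if $allowed_cp{ ord($char) };
        return 1 if $char =~ /\p{Cc}|\p{Cf}|\p{Zl}|\p{Zp}|(?! )\p{Zs}/;
    }
    return 0;
}

print sketch_contains_nonhuman("I \x{2665} perl"), "\n";                         # 0
print sketch_contains_nonhuman("tab\there"), "\n";                               # 1
print sketch_contains_nonhuman( "tab\there", 'CHARACTER TABULATION' => 1 ), "\n"; # 0
```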

DIAGNOSTICS

Throws no warnings or errors of its own, except:

pack() did not result in unicode string and there is no way to emulate utf8::upgrade

This essentially should never happen and mainly exists for completeness. It is only possible on pre 5.8.1 perls. If you are ever able to get get_unicode() to carp() this, please send the details!

CONFIGURATION AND ENVIRONMENT

String::UnicodeUTF8 requires no configuration files or environment variables.

DEPENDENCIES

String::Unquotemeta

is_unicode(), when given a string with no unicode characters, lazy loads Encode for perl versions from 5.7.3 to 5.8.1, B::Flags for < 5.7.3

Module::Want is used for the lazy loading since there are advantages over straight eval.

INCOMPATIBILITIES

None reported.

BUGS AND LIMITATIONS

No bugs have been reported.

Please report any bugs or feature requests to bug-string-unicodeutf8@rt.cpan.org, or through the web interface at http://rt.cpan.org.

TODO

\N notation escaping/unescaping: Seems like YAGNI but if there is enough demand we can add it (lazy/separate since it’d be heavy).

AUTHOR

Daniel Muey <http://drmuey.com/cpan_contact.pl>

LICENCE AND COPYRIGHT

Copyright (c) 2012, Daniel Muey <http://drmuey.com/cpan_contact.pl>. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

DISCLAIMER OF WARRANTY

BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.