NAME

unichars - list characters for one or more properties

SYNOPSIS

unichars [options] criterion ...

Each criterion is either a square-bracketed character class, a regex starting with a backslash, or an arbitrary Perl expression. See the EXAMPLES section below.

OPTIONS:

 Selection Options:

    --bmp           include the Basic Multilingual Plane (plane 0) [DEFAULT]
    --smp           include the Supplementary Multilingual Plane (plane 1)
    --astral    -a  include planes above the BMP (planes 1-16)
    --unnamed   -u  include various unnamed characters (see DESCRIPTION)
    --locale    -l  specify the locale used for UCA functions

 Display Options:

    --category  -c  include the general category (GC=) 
    --script    -s  include the script name (SC=) 
    --block     -b  include the block name (BLK=) 
    --bidi      -B  include the bidi class (BC=) 
    --combining -C  include the canonical combining class (CCC=)
    --numeric   -n  include the numeric value (NV=) 
    --casefold  -f  include the casefold status
    --decimal   -d  include the decimal representation of the code point

 Miscellaneous Options:

    --version   -v  print version information and exit
    --help      -h  this message
    --man       -m  full manpage
    --debug     -d  show debugging of criteria and examined code point span

 Special Functions:

     $_    is the current code point
     ord   is the current code point's ordinal

     NAME is charnames::viacode(ord)
     NUM is Unicode::UCD::num(ord), not the code point number
     CF is casefold->{status}
     NFD, NFC, NFKD, NFKC, FCD, FCC  (normalization)
     UCA, UCA1, UCA2, UCA3, UCA4 (binary sort keys)

     Singleton, Exclusion, NonStDecomp, Comp_Ex 
     checkNFD, checkNFC, checkNFKD, checkNFKC, checkFCD, checkFCC 
     NFD_NO, NFC_NO, NFC_MAYBE, NFKD_NO, NFKC_NO, NFKC_MAYBE 

DESCRIPTION

The unichars program reports the characters that match every selection criterion; the criteria are ANDed together.

A criterion beginning with a square bracket or a backslash is assumed to be a regular expression. Anything else is a Perl expression such as you might pass to the Perl grep function. The $_ variable is set to each successive Unicode character, and if all criteria match, that character is displayed.

The numeric code point is therefore accessible as ord.
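
The filtering described above can be sketched in a few lines. This is a Python analogue for illustration only, not the unichars implementation: the names make_criterion and matching_chars are hypothetical, non-regex criteria are modelled as Python callables instead of Perl expressions, and Python's \d stands in for Perl's.

```python
import re

def make_criterion(criterion):
    """Turn one criterion into a predicate over a single character.

    Assumption: a string starting with '[' or a backslash is a regex;
    anything else is expected to already be a callable predicate
    (standing in for an arbitrary Perl expression).
    """
    if isinstance(criterion, str) and criterion[:1] in ("[", "\\"):
        pattern = re.compile(criterion)
        return lambda ch: pattern.search(ch) is not None
    return criterion

def matching_chars(criteria, first=0x0000, last=0xFFFF):
    """Yield each character in [first, last] that satisfies every criterion."""
    predicates = [make_criterion(c) for c in criteria]
    for cp in range(first, last + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not characters; skip them
        ch = chr(cp)
        if all(pred(ch) for pred in predicates):  # criteria are ANDed
            yield ch

if __name__ == "__main__":
    # Non-ASCII decimal digits up to U+06FF, one per line.
    for ch in matching_chars([r"\d", lambda ch: ord(ch) > 0x7F], last=0x06FF):
        print(f"{ch}  {ord(ch):6d}  {ord(ch):05X}")
```

Each character is tested against every predicate, and all() stops at the first one that fails.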

The special token NAME is set to the full name of the current code point. Also, the tokens NFD, NFKD, NFC, and NFKC are set to the corresponding normalization form.
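
As a rough illustration of what NAME and the normalization tokens expose, the following Python sketch uses the standard unicodedata module. The helper names name_of and forms are hypothetical. Python's name lookup is only assumed to behave like charnames::viacode for ordinary named characters; the two can differ for unnamed or algorithmically named code points.

```python
import unicodedata

def name_of(ch):
    """Rough analogue of NAME: the character name, or '' if it has none."""
    return unicodedata.name(ch, "")

def forms(ch):
    """Return the four normalization forms of ch, keyed by form name."""
    return {form: unicodedata.normalize(form, ch)
            for form in ("NFD", "NFC", "NFKD", "NFKC")}

if __name__ == "__main__":
    ch = "\u03D0"  # GREEK BETA SYMBOL
    f = forms(ch)
    # Analogues of the criteria 'NFD ne NFKD' and 'NAME =~ /SYMBOL/'.
    print(name_of(ch), f["NFD"] != f["NFKD"], "SYMBOL" in name_of(ch))
```

A character whose canonical and compatibility decompositions differ, such as U+03D0, is what the criterion 'NFD ne NFKD' selects.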

By default only plane 0, the Basic Multilingual Plane, is examined. For plane 1, the Supplementary Multilingual Plane, use --smp. To examine both, specify both the --bmp and --smp options, or -bs. To include all valid code points, use the -a or --astral option.
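
The plane options come down to choosing code point spans. The Python sketch below is one reading of the paragraph above, not the program's actual logic. The function names spans and plane_of are hypothetical, and treating -a as "everything from U+0000 to U+10FFFF" is an assumption based on the sentence about all valid code points.

```python
BMP = (0x0000, 0xFFFF)      # plane 0
SMP = (0x10000, 0x1FFFF)    # plane 1
ALL = (0x0000, 0x10FFFF)    # planes 0 through 16

def plane_of(cp):
    """Return the plane number (0-16) of a code point."""
    return cp >> 16

def spans(bmp=False, smp=False, astral=False):
    """Return the (first, last) code point spans to examine.

    Assumptions: astral covers every plane; with no flag at all,
    only the BMP is examined, as the text describes.
    """
    if astral:
        return [ALL]
    if not (bmp or smp):
        bmp = True  # default selection
    selected = []
    if bmp:
        selected.append(BMP)
    if smp:
        selected.append(SMP)
    return selected

if __name__ == "__main__":
    for first, last in spans(bmp=True, smp=True):
        print(f"{first:05X}..{last:05X}  plane {plane_of(first)}")
```

Each plane holds 0x10000 code points, so shifting a code point right by 16 bits gives its plane number.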

Unless the --unnamed option is given, characters with any of the properties Unassigned, PrivateUse, Han, or InHangulSyllables will be excluded.

EXAMPLES

Count all non-ASCII digits:

    $ unichars -a '\d' '\P{ASCII}' | wc -l
     401

Find all line terminators:

    $ unichars '\R'
     --       10  0000A  LINE FEED (LF)
     --       11  0000B  LINE TABULATION
     --       12  0000C  FORM FEED (FF)
     --       13  0000D  CARRIAGE RETURN (CR)
     --      133  00085  NEXT LINE (NEL)
     --     8232  02028  LINE SEPARATOR
     --     8233  02029  PARAGRAPH SEPARATOR

Find what is not \s but is [\h\v]:

    $ unichars '\S' '[\h\v]'
     --       11  0000B  LINE TABULATION

Count how many code points in the Basic Multilingual Plane are not marks but are diacritics:

    $ unichars '\PM' '\p{Diacritic}' | wc -l
         209

Count how many code points in the Basic Multilingual Plane are marks but are not diacritics:

    $ unichars '\pM' '\P{Diacritic}' | wc -l
         750

Find all code points that are Letters, are in the Greek script, have differing canonical and compatibility decompositions, and whose name contains "SYMBOL":

    $ unichars -a '\pL' '\p{Greek}' 'NFD ne NFKD' 'NAME =~ /SYMBOL/'
     ϐ       976  003D0  GREEK BETA SYMBOL
     ϑ       977  003D1  GREEK THETA SYMBOL
     ϒ       978  003D2  GREEK UPSILON WITH HOOK SYMBOL
     ϓ       979  003D3  GREEK UPSILON WITH ACUTE AND HOOK SYMBOL
     ϔ       980  003D4  GREEK UPSILON WITH DIAERESIS AND HOOK SYMBOL
     ϕ       981  003D5  GREEK PHI SYMBOL
     ϖ       982  003D6  GREEK PI SYMBOL
     ϰ      1008  003F0  GREEK KAPPA SYMBOL
     ϱ      1009  003F1  GREEK RHO SYMBOL
     ϲ      1010  003F2  GREEK LUNATE SIGMA SYMBOL
     ϴ      1012  003F4  GREEK CAPITAL THETA SYMBOL
     ϵ      1013  003F5  GREEK LUNATE EPSILON SYMBOL
     Ϲ      1017  003F9  GREEK CAPITAL LUNATE SIGMA SYMBOL

Find all numeric nondigits in the Latin script (within the BMP):

    $ unichars '\pN' '\D' '\p{Latin}'
     Ⅰ      8544  02160  ROMAN NUMERAL ONE
     Ⅱ      8545  02161  ROMAN NUMERAL TWO
     Ⅲ      8546  02162  ROMAN NUMERAL THREE
     Ⅳ      8547  02163  ROMAN NUMERAL FOUR
     Ⅴ      8548  02164  ROMAN NUMERAL FIVE
     Ⅵ      8549  02165  ROMAN NUMERAL SIX
     Ⅶ      8550  02166  ROMAN NUMERAL SEVEN
     Ⅷ      8551  02167  ROMAN NUMERAL EIGHT
     (etc)

Find the first three alphanumunderish code points with no assigned name:

    $ unichars -au '\w' '!length NAME' | head -3
     㐀   13312 003400 <unnamed codepoint>
     㐁   13313 003401 <unnamed codepoint>
     㐂   13314 003402 <unnamed codepoint>

Count the combining characters in the Supplementary Multilingual Plane:

    $ unichars -s '\pM' | wc -l
          61

ENVIRONMENT

If your environment smells like it's in a Unicode encoding, program arguments are treated as UTF-8.

BUGS

The --man option does not correctly process the page for UTF-8, because it does not pass the necessary --utf8 option to pod2man.

SEE ALSO

uniprops, uninames, perluniprops, perlunicode, perlrecharclass, perlre

AUTHOR

Tom Christiansen <tchrist@perl.com>

COPYRIGHT AND LICENCE

Copyright 2010 Tom Christiansen.

This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself.