NAME

uniprops - list Unicode properties for one or more characters

SYNOPSIS

uniprops [options] character | U+codepoint | "name" ...

 Options:

    --version   print version information 
    --help      this message
    --man       full manpage

    --unicode   list simple Unicode properties (DEFAULT)
    --general   include even the long form of general properties

    --perl      list lowercase Perl short-cuts, plus \R (DEFAULT)
    --negated   list uppercase Perl short-cuts

    --all       list all Unicode categories, not just one-parters
    --list      list all known Unicode properties, then exit

    --reorder   sort Unicode property lists shortest first
    --single    output each property one per line

    --verbose   wrap Unicode properties in \p{xxx}
    --width N   set column width

    --debug     noisy internal processing

  options may be bundled if used in the short form; e.g., -va

DESCRIPTION

Each argument to uniprops specifies a character in one of three forms:

  1. a one-character literal, such as "#" or "A".

  2. a code point number in hex, optionally prefixed by "0x", "U+", "\x", or "\u". After either backslash prefix, the digits may be enclosed in curly braces, but need not be. Examples: "0x23", "U+394", "\x{0394}", "0394".

  3. a case-sensitive character name, such as "COMMA" or "GREEK CAPITAL LETTER DELTA". Names may be given in full or in the short form of the charnames pragma, such as "greek:delta". A name with no script prefix is looked up first as Latin, then as Greek. See the EXAMPLES.
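As a rough illustration of the second form, the following sketch converts a hex code point argument to a number. The helper name parse_codepoint_arg is hypothetical and is not taken from uniprops itself; the sketch also assumes that one-character literals such as "A" have already been recognized before it is called, since "A" is otherwise valid hex.

```perl
use strict;
use warnings;

# Hypothetical helper (not uniprops' own code): convert a hex code point
# argument to a number, or return nothing if the argument is not in that form.
sub parse_codepoint_arg {
    my ($arg) = @_;
    return unless $arg =~ m/
        \A
        (?:
            \\ [xu] \{ ( [0-9A-Fa-f]+ ) \}              # braced digits after a backslash prefix
          |
            (?: 0x | U\+ | \\ [xu] )? ( [0-9A-Fa-f]+ )  # optional prefix, bare digits
        )
        \z
    /x;
    return hex( defined $1 ? $1 : $2 );
}

for my $arg ('0x23', 'U+394', '\x{0394}', '0394', 'COMMA') {
    my $cp   = parse_codepoint_arg($arg);
    my $text = defined $cp ? sprintf('U+%04X', $cp) : 'not a hex code point';
    printf "%-10s => %s\n", $arg, $text;
}
```

An argument that is rejected here, such as "COMMA", would then be tried as a character name.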

The uniprops program reports the properties that apply to a given character for use in regular expressions. By default, the Perl character class short-cuts and the one-part Unicode properties are listed, which are mostly those from the general category.

The --all option adds all the two-part Unicode properties from the non-general categories.

Long, two-part forms of general category properties are not listed unless the --general option is given.

The --negated option adds the Perl shortcuts that are in capitals. The --verbose option encloses Unicode properties with \p{PROPNAME}.
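The properties reported are meant to be used inside \p{...} in a pattern. The sketch below shows one way to check a character against a list of candidate property names; the helper name matching_properties is hypothetical, and this is not necessarily how uniprops performs its own tests.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names from @names for which $char
# matches \p{NAME}. Unknown property names are silently skipped.
sub matching_properties {
    my ($char, @names) = @_;
    my @hits;
    for my $name (@names) {
        my $pattern = '\A\p{' . $name . '}\z';
        # An unknown property name is a fatal error, so trap it with eval.
        my $ok = eval { $char =~ /$pattern/ ? 1 : 0 };
        push @hits, $name if $ok;
    }
    return @hits;
}

# U+03C0 is GREEK SMALL LETTER PI.
my @hits = matching_properties("\x{3C0}", qw(Greek Latin Ll Lu Alpha NoSuchProperty));
print "@hits\n";    # prints "Greek Ll Alpha"
```

The uppercase \P{...} form matches the complement, so a name that is absent from the result would match as \P{NAME} instead, provided the name is a valid property.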

To simply list out all available Unicode properties, use the --list option, which then exits without processing further arguments.

Lines are wrapped before the edge of your screen. You can override the detected window width with the --width N option. To get only one property per line without any indentation, use the --single or -1 option.
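A wrapped, indented property list of this kind can be approximated with the core Text::Wrap module. This is only a sketch: the helper name wrap_properties is hypothetical, the indents of 8 and 11 spaces are read off the EXAMPLES output, and uniprops may wrap its lines by other means.

```perl
use strict;
use warnings;
use Text::Wrap qw(wrap);

# Hypothetical helper: wrap a property list to $width columns, with a
# deeper indent on continuation lines.
sub wrap_properties {
    my ($width, @props) = @_;
    local $Text::Wrap::columns  = $width;
    local $Text::Wrap::unexpand = 0;    # keep spaces; do not convert to tabs
    return wrap(' ' x 8, ' ' x 11, join ' ', @props);
}

my @props = qw(All Any Assigned InMathematicalOperators Common Zyyy Sm S
               Gr_Base Grapheme_Base Graph GrBase Math Math_Symbol
               Pat_Syn Pattern_Syntax PatSyn Print Symbol);
print wrap_properties(50, @props), "\n";
```

Text::Wrap keeps every output line shorter than the configured column count, so the text stops before the right edge rather than on it.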

By default, Unicode properties are listed in the order in which they occur in perluniprops(1), but the --reorder option sorts them by name length, shortest first.
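A shortest-first ordering can be sketched with a two-key sort. The alphabetical tie-break between names of equal length is an assumption made for this example; the order uniprops uses among equal-length names is not documented here.

```perl
use strict;
use warnings;

# Hypothetical helper: order property names by length, shortest first,
# breaking ties alphabetically (the tie-break is an assumption).
sub shortest_first {
    my @props = @_;
    return sort { length($a) <=> length($b) || $a cmp $b } @props;
}

my @sorted = shortest_first(qw(All Any Assigned Common Zyyy Sm S Gr_Base));
print "@sorted\n";    # prints "S Sm All Any Zyyy Common Gr_Base Assigned"
```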

Unicode properties designated as deprecated, obsolete, or discouraged, or which begin with an underscore, are ignored.

It takes quite some time to load and test all the Unicode properties. If you only need to confirm which character an argument refers to, ask for just the Perl properties with --perl rather than the Unicode ones, and the program will run at least six times faster.

EXAMPLES

Count known Unicode properties:

    $ uniprops -l | wc -l
    2478

List all known Unicode properties, sorted by length:

    $ uniprops -lr

List all known Unicode properties, sorted by name:

    $ uniprops -l | sort -df | more

List Greek-related Unicode properties:

    $ uniprops -l | grep Greek | sort -dfu
    Blk=Greek
    Block:Ancient_Greek_Musical_Notation
    Block:Ancient_Greek_Numbers
    Block:Greek
    Block=Greek_And_Coptic
    Block:Greek_Extended
    Greek
    Greek_And_Coptic
    InAncientGreekMusicalNotation
    InAncientGreekNumbers
    InGreek
    InGreekExtended
    Is_Greek
    Script=Greek

List just Perl properties for three named characters:

    $ uniprops -p delta greek:delta Greek:Delta
    U+1E9F ‹ẟ› \N{ LATIN SMALL LETTER DELTA }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    U+03B4 ‹δ› \N{ GREEK SMALL LETTER DELTA }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    U+0394 ‹Δ› \N{ GREEK CAPITAL LETTER DELTA }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}

List just Perl properties for four named characters:

    $ uniprops -p Thorn pi hebrew:alef cyrillic:be
    U+00DE ‹Þ› \N{ LATIN CAPITAL LETTER THORN }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Lu}
    U+03C0 ‹π› \N{ GREEK SMALL LETTER PI }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
    U+05D0 ‹א› \N{ HEBREW LETTER ALEF }:
        \w \pL \p{L_} \p{Lo}
    U+0431 ‹б› \N{ CYRILLIC SMALL LETTER BE }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}

List Perl and Unicode properties for three different literal characters:

    $ uniprops \# ç π
    U+0023 ‹#› \N{ NUMBER SIGN }:
        \pP \p{Po}
        All Any ASCII Assigned Common Zyyy Po P Gr_Base
           Grapheme_Base Graph GrBase Other_Punctuation Punct Pat_Syn
           Pattern_Syntax PatSyn PosixGraph PosixPrint PosixPunct
           Print Punctuation
    U+00E7 ‹ç› \N{ LATIN SMALL LETTER C WITH CEDILLA }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased
           Cased_Letter LC Changes_When_Casemapped CWCM
           Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll
           L Gr_Base Grapheme_Base Graph GrBase ID_Continue IDC
           ID_Start IDS Letter L_ Latin Latn Lowercase_Letter Lower
           Lowercase Print Word XID_Continue XIDC XID_Start XIDS
    U+03C0 ‹π› \N{ GREEK SMALL LETTER PI }:
        \w \pL \p{LC} \p{L_} \p{L&} \p{Ll}
        All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek
           InGreek Cased Cased_Letter LC Changes_When_Casemapped CWCM
           Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll
           L Gr_Base Grapheme_Base Graph GrBase Grek Greek_And_Coptic
           ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter
           Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS

Just list Perl shortcuts, including negated ones, for a named character:

    $ uniprops -pn LF
    U+000A ‹U+000A› \N{ LINE FEED (LF) }:
        \s \v \R \pC \p{Cc}
        \W \D \H

For the Greek final sigma character, list Unicode properties that are either one-parters or else two-part general categories:

    $ uniprops -ug "greek:final sigma"
    U+03C2 ‹ς› \N{ GREEK SMALL LETTER FINAL SIGMA }:
        All Any Alnum Alpha Alphabetic Assigned Greek Is_Greek InGreek
           Cased Cased_Letter LC Changes_When_Casefolded CWCF
           Changes_When_Casemapped CWCM Changes_When_NFKC_Casefolded CWKCF
           Changes_When_Titlecased CWT Changes_When_Uppercased CWU Ll L
           Gr_Base Grapheme_Base Graph GrBase Grek Greek_And_Coptic
           ID_Continue IDC ID_Start IDS Letter L_ Lowercase_Letter Lower
           Lowercase Print Word XID_Continue XIDC XID_Start XIDS
        General_Category=Cased_Letter General_Category:Cased_Letter Gc=LC
           General_Category:L General_Category=Letter General_Category:LC
           General_Category:Letter Gc=L General_Category:Ll
           General_Category=Lowercase_Letter
           General_Category:Lowercase_Letter Gc=Ll

List just Unicode properties for a code point, given in hex:

    $ uniprops -u 0xDF
    U+00DF ‹ß› \N{ LATIN SMALL LETTER SHARP S }:
        All Any Alnum Alpha Alphabetic Assigned InLatin1 Cased
           Cased_Letter LC Changes_When_Casefolded CWCF
           Changes_When_Casemapped CWCM Changes_When_NFKC_Casefolded
           CWKCF Changes_When_Titlecased CWT Changes_When_Uppercased
           CWU Ll L Gr_Base Grapheme_Base Graph GrBase ID_Continue
           IDC ID_Start IDS Letter L_ Latin Latn Lowercase_Letter
           Lower Lowercase Print Word XID_Continue XIDC XID_Start XIDS

List Perl and Unicode properties for a named character, verbosely:

    $ uniprops -v "ALEF SYMBOL"
    U+2135 ‹ℵ› \N{ ALEF SYMBOL }:
        \w \pL \p{L_} \p{Lo}
        \p{All} \p{Any} \p{Alnum} \p{Alpha} \p{Alphabetic} \p{Assigned}
           \p{InLetterlikeSymbols} \p{Changes_When_NFKC_Casefolded}
           \p{CWKCF} \p{Common} \p{Zyyy} \p{L} \p{Lo} \p{Gr_Base}
           \p{Grapheme_Base} \p{Graph} \p{GrBase} \p{ID_Continue} \p{IDC}
           \p{ID_Start} \p{IDS} \p{Letter} \p{L_} \p{Other_Letter}
           \p{Math} \p{Print} \p{Word} \p{XID_Continue} \p{XIDC}
           \p{XID_Start} \p{XIDS}

List Unicode properties in all categories except for two-part general categories:

    $ uniprops -au INFINITY
    U+221E ‹∞› \N{ INFINITY }:
        All Any Assigned InMathematicalOperators Common Zyyy Sm S
           Gr_Base Grapheme_Base Graph GrBase Math Math_Symbol
           Pat_Syn Pattern_Syntax PatSyn Print Symbol
        Age:1.1 Bidi_Class:ON Bidi_Class=Other_Neutral
           Bidi_Class:Other_Neutral Bc=ON Block:Mathematical_Operators
           Canonical_Combining_Class:0
           Canonical_Combining_Class=Not_Reordered
           Canonical_Combining_Class:Not_Reordered Ccc=NR
           Canonical_Combining_Class:NR Script=Common
           Decomposition_Type:None Dt=None East_Asian_Width:A
           East_Asian_Width=Ambiguous East_Asian_Width:Ambiguous Ea=A
           Grapheme_Cluster_Break:Other GCB=XX Grapheme_Cluster_Break:XX
           Grapheme_Cluster_Break=Other Hangul_Syllable_Type:NA
           Hangul_Syllable_Type=Not_Applicable
           Hangul_Syllable_Type:Not_Applicable Hst=NA
           Joining_Group:No_Joining_Group Jg=NoJoiningGroup
           Joining_Type:Non_Joining Jt=U Joining_Type:U
           Joining_Type=Non_Joining Line_Break:AI Line_Break=Ambiguous
           Line_Break:Ambiguous Lb=AI Numeric_Type:None Nt=None
           Numeric_Value:NaN Nv=NaN Present_In:1.1 Age=1.1 In=1.1
           Present_In:2.0 In=2.0 Present_In:2.1 In=2.1 Present_In:3.0
           In=3.0 Present_In:3.1 In=3.1 Present_In:3.2 In=3.2
           Present_In:4.0 In=4.0 Present_In:4.1 In=4.1 Present_In:5.0
           In=5.0 Present_In:5.1 In=5.1 Present_In:5.2 In=5.2
           Script:Common Sc=Zyyy Script:Zyyy Sentence_Break:Other SB=XX
           Sentence_Break:XX Sentence_Break=Other Word_Break:Other WB=XX
           Word_Break:XX Word_Break=Other

For the HYPHEN character, verbosely list all Unicode properties including the two-part general categories, one per line, and sort them:

    $ uniprops -1vgau HYPHEN | sort

List Perl and Unicode properties for code point U+2212, reordered by length and with width set to 50:

    $ uniprops -r -w 50 U+2212
    U+2212 ‹−› \N{ MINUS SIGN }:
        \pS \p{Sm}
        S Sm All Any Dash Math Zyyy Graph Print
           Common GrBase PatSyn Symbol Gr_Base Pat_Syn
           Assigned Math_Symbol Grapheme_Base
           Pattern_Syntax InMathematicalOperators

Ask for a (currently) unassigned code point:

    $ uniprops 1F12F
    U+1F12F ‹U+1F12F› \N{ U+1F12F }:
        \pC \p{Cn}
        All Any InEnclosedAlphanumericSupplement C Other Cn
            Unassigned Zzzz Unknown

ERRORS

It is an error to ask for properties of code points representing a UTF-16 surrogate.

Characters not legal for interchange are flagged as errors.
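The checks involved can be sketched as follows. The helper name interchange_problem is hypothetical, and the sketch assumes that "not legal for interchange" refers to the Unicode noncharacters and to values above U+10FFFF; the exact set uniprops rejects may differ.

```perl
use strict;
use warnings;

# Hypothetical helper: return a short reason if the code point should be
# rejected, or nothing if it is acceptable.
sub interchange_problem {
    my ($cp) = @_;
    return 'above U+10FFFF' if $cp > 0x10FFFF;
    return 'surrogate'      if $cp >= 0xD800 && $cp <= 0xDFFF;
    return 'noncharacter'   if $cp >= 0xFDD0 && $cp <= 0xFDEF;
    return 'noncharacter'   if ($cp & 0xFFFE) == 0xFFFE;   # U+nFFFE and U+nFFFF
    return;
}

for my $cp (0x41, 0xD800, 0xFDD0, 0xFFFF, 0x1FFFE, 0x110000) {
    my $why = interchange_problem($cp);
    printf "U+%04X: %s\n", $cp, defined $why ? $why : 'ok';
}
```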

ENVIRONMENT

If your environment smells like it's in a Unicode encoding, program arguments and output will be in UTF-8. This allows you to enter a single, literal UTF-8 character as a program argument.
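One plausible reading of that test is a look at the usual locale variables. The sketch below is an assumption about what such a check might involve, not a description of what uniprops actually inspects; the helper name env_looks_utf8 is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: guess whether the locale settings in %env name a
# UTF-8 encoding. The first variable that is set decides, in the usual
# POSIX precedence order.
sub env_looks_utf8 {
    my (%env) = @_;
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        my $value = $env{$var};
        next unless defined $value && length $value;
        return $value =~ /UTF-?8/i ? 1 : 0;
    }
    return 0;
}

# Encode output as UTF-8 only when the environment suggests it.
binmode(STDOUT, ':encoding(UTF-8)') if env_looks_utf8(%ENV);
```

Program arguments would need the matching treatment, decoding each element of @ARGV from UTF-8 before it is examined.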

The PAGER environment variable is used for the --list option.

FILES

The pod source for the perluniprops(1) manpage is parsed to determine Unicode properties. This is expected to be found in the Config module's $installprivlib/pods directory.
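The expected location can be computed from the Config module. This sketch follows the directory named above and assumes the file is called perluniprops.pod; the helper name perluniprops_pod_path is hypothetical, and some installations keep their pods elsewhere.

```perl
use strict;
use warnings;
use Config;
use File::Spec;

# Hypothetical helper: build the path where the pod source is expected,
# following the description above (the file name is an assumption).
sub perluniprops_pod_path {
    return File::Spec->catfile($Config{installprivlib}, 'pods', 'perluniprops.pod');
}

my $path   = perluniprops_pod_path();
my $status = -e $path ? 'exists' : 'not found on this system';
print "$path: $status\n";
```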

PROGRAMS

The stty(1) program is called on Unix systems to determine the window size.
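A minimal sketch of such a lookup follows. It assumes the "stty size" invocation, which prints the row and column counts on common Unix systems; the exact stty arguments uniprops uses are not stated here, and the helper name columns_from_stty is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the column count from "stty size" output,
# which has the form "ROWS COLUMNS"; fall back to $default otherwise.
sub columns_from_stty {
    my ($output, $default) = @_;
    $default = 80 unless defined $default;
    if (defined $output && $output =~ /\A\s*(\d+)\s+(\d+)\s*\z/ && $2 > 0) {
        return $2;
    }
    return $default;
}

# stty inspects the terminal on standard input, so ask only if that is a tty.
my $output = -t STDIN ? `stty size 2>/dev/null` : undef;
my $cols   = columns_from_stty($output);
print "$cols\n";
```

Falling back to a fixed width keeps the program usable when standard input is not a terminal or stty is unavailable.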

If the standard output is to a tty when the --list option is requested, the user's pager is used, defaulting to more(1).

BUGS

The --man option does not correctly process the page for UTF-8; pod2text(1) works fine, though.

SEE ALSO

unichars, uninames, perluniprops, perlunicode, perlrecharclass, perlre

AUTHOR

Tom Christiansen <tchrist@perl.com>

COPYRIGHT AND LICENCE

Copyright 2011 Tom Christiansen.

This program is free software; you may redistribute it and/or modify it under the same terms as Perl itself.