NAME

Unicode::Towctrans - Generate small case mapping tables

SYNOPSIS

gen_wctrans
gen_wctrans --safec
gen_wctrans --musl
gen_wctrans -n     # no network for default -v
gen_wctrans -v 10
gen_wctrans -v 10 --ud UnicodeData.txt.10 --out towctrans-10.h
gen_wctrans --lower16
gen_wctrans --fn __towcase
gen_wctrans --min-excl 10000
gen_wctrans --unroll 6
gen_wctrans --bits 18:14:10
gen_wctrans --bsearch
gen_wctrans --bsearch-both
gen_wctrans --if-tree --bsearch
gen_wctrans --if-tree --bsearch-both
gen_wctrans --table

DESCRIPTION

gen_wctrans generates a towctrans.h header file with small and efficient case mapping tables. musl and safeclib use it to build the libc towupper() and towlower() functions and their secure variants towupper_s() and towlower_s().
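From the caller's side the generated tables sit behind the standard <wctype.h> interface. A minimal sketch, assuming a libc whose towupper() is built from such a header; upcase_in_place is a hypothetical helper name and not part of this distribution:

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: uppercase a wide string one code point at a time
   through the standard towupper(). */
static void upcase_in_place(wchar_t *s)
{
    for (; *s != L'\0'; s++)
        *s = (wchar_t)towupper((wint_t)*s);
}
```

Only ASCII input is guaranteed to map the same way in every locale and libc, which is exactly the gap the generated tables are meant to close.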

If the code may run on a system with a Turkish or Azeri locale, you need to define -DHAVE_LOCALE_TR to check for the special Turkish i locale and mappings at run-time.
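The Turkish and Azeri special case concerns the letter i: in those locales i uppercases to U+0130 and I lowercases to U+0131. The following is only a sketch of such a run-time check, not the code that -DHAVE_LOCALE_TR enables; all three function names are hypothetical, and the base mapping is reduced to ASCII to keep the example short:

```c
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: true for locale names such as "tr_TR.UTF-8" or "az". */
static int is_turkic_locale_name(const char *name)
{
    if (name == NULL)
        return 0;
    if (strncmp(name, "tr", 2) != 0 && strncmp(name, "az", 2) != 0)
        return 0;
    /* Require the language code to end here, so that "trv_TW" does not match. */
    return name[2] == '\0' || name[2] == '_' || name[2] == '.' || name[2] == '@';
}

/* Sketch: the Turkic special case layered over an ASCII-only mapping. */
static wint_t towupper_tr_sketch(wint_t wc, int turkic)
{
    if (turkic && wc == 0x0069)          /* i -> U+0130, capital I with dot above */
        return 0x0130;
    if (wc >= 0x0061 && wc <= 0x007A)    /* a..z */
        return wc - 0x20;
    return wc;
}

static wint_t towlower_tr_sketch(wint_t wc, int turkic)
{
    if (turkic && wc == 0x0049)          /* I -> U+0131, dotless i */
        return 0x0131;
    if (wc >= 0x0041 && wc <= 0x005A)    /* A..Z */
        return wc + 0x20;
    return wc;
}
```

At run-time the flag could be derived from is_turkic_locale_name(setlocale(LC_CTYPE, NULL)), with setlocale() coming from <locale.h>.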

If you know that your iswalpha() works correctly (only with musl), use --with_iswalpha to get a slightly faster function, e.g. for benchmarking.

With --lower16 it creates more and larger casemaps tables, and fewer of the long casemapl tables. It thus finds those ranges earlier, at the cost of more cache misses. For --bits the fastest settings are 18:14:10 and 12:12:8; the smallest is the default 16:8:8.

With --bsearch the tolower check is done with a binary search, while the toupper check does a linear search without early exit. It needs more space, and its performance is not as good as with --lower16.
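To illustrate the binary-search idea, here is a minimal sketch over a sorted table of ranges. The struct layout, the table contents and the function name are assumptions made for this example and do not reflect the generated towctrans.h; the three ranges themselves (a..z, Cyrillic U+0430..U+044F, circled letters U+24D0..U+24E9) are real mappings from UnicodeData.txt:

```c
#include <stddef.h>

/* Hypothetical range entry: `len` consecutive code points starting at
   `first`, all mapped by adding `delta`. */
struct case_range {
    unsigned int first;
    unsigned int len;
    int delta;
};

/* Must stay sorted by `first` for the binary search to work. */
static const struct case_range upper_ranges[] = {
    { 0x0061, 26, -32 },   /* a..z                       */
    { 0x0430, 32, -32 },   /* Cyrillic small a..ya       */
    { 0x24D0, 26, -26 },   /* circled Latin small a..z   */
};

static unsigned int toupper_bsearch_sketch(unsigned int wc)
{
    size_t lo = 0;
    size_t hi = sizeof upper_ranges / sizeof upper_ranges[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const struct case_range *r = &upper_ranges[mid];

        if (wc < r->first)
            hi = mid;                    /* continue in the lower half */
        else if (wc >= r->first + r->len)
            lo = mid + 1;                /* continue in the upper half */
        else
            return (unsigned int)((int)wc + r->delta);
    }
    return wc;                           /* no mapping: return unchanged */
}
```

The extra space mentioned above comes from keeping the table in search order; a linear scan has no such requirement.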

With --bsearch-both the lookup is faster and the size is even bigger, because the order of the upper maps and pairs must also be stored to be able to binary search them.

With --table, the musl-new style, the size is much bigger, as mappings must be stored for all blocks. The lookup is much faster, though.
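The size/speed trade-off of a block table can be seen in a toy two-stage lookup. This is a sketch under simplifying assumptions (32-entry blocks, ASCII only, hypothetical names) and not the layout used by musl or by the generated header:

```c
#define BLOCK_BITS 5
#define BLOCK_SIZE (1u << BLOCK_BITS)
#define NUM_BLOCKS 4u                     /* toy table covers 0x00..0x7F */

/* Stage 2: one row of per-code-point deltas for each distinct block. */
static const signed char delta_rows[2][BLOCK_SIZE] = {
    { 0 },                                /* row 0: blocks without lowercase */
    {   0,                                                  /* 0x60       */
      -32, -32, -32, -32, -32, -32, -32, -32, -32, -32, -32, -32, -32, /* a..m */
      -32, -32, -32, -32, -32, -32, -32, -32, -32, -32, -32, -32, -32, /* n..z */
        0,   0,   0,   0,   0 }                             /* 0x7B..0x7F */
};

/* Stage 1: block number -> row index. Every block needs an entry here,
   which is where the size of this approach comes from. */
static const unsigned char block_row[NUM_BLOCKS] = { 0, 0, 0, 1 };

static unsigned int toupper_table_sketch(unsigned int wc)
{
    if (wc >= NUM_BLOCKS * BLOCK_SIZE)
        return wc;                        /* outside the toy table */
    /* Two array reads and one addition, no searching. */
    return (unsigned int)((int)wc
        + delta_rows[block_row[wc >> BLOCK_BITS]][wc & (BLOCK_SIZE - 1)]);
}
```

Blocks with identical content share a row, so only the stage 1 index grows with the covered code point range.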

With --if-tree and --bsearch the tolower check is done with an inlined binary search as a ternary tree, while the toupper check does a binary search.

With --if-tree and --bsearch-both both the lower and the upper checks are done with an inlined binary search as a ternary tree. It trades data for more code. It is the fastest of the non-table variants, but also the biggest.
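An inlined binary search replaces the table walk with nested comparisons, each node branching three ways: below the range, inside it, or above it. A hand-written sketch for three real ranges follows; the function name is hypothetical and the generated code will differ in shape and size:

```c
/* Sketch: binary search over three ranges, unrolled into an if-tree.
   Root node is the middle range U+0430..U+044F (Cyrillic). */
static unsigned int toupper_iftree_sketch(unsigned int wc)
{
    if (wc < 0x0430) {
        /* left subtree: a..z */
        if (wc >= 0x0061 && wc <= 0x007A)
            return wc - 32;
    } else if (wc <= 0x044F) {
        /* inside the root range */
        return wc - 32;
    } else {
        /* right subtree: circled Latin small letters */
        if (wc >= 0x24D0 && wc <= 0x24E9)
            return wc - 26;
    }
    return wc;                            /* no mapping: return unchanged */
}
```

The range bounds and deltas become immediate operands in the code, so no data table is read, but every range costs instructions instead of a table row.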

Further tuning options are --min-excl and --unroll. --min-excl sets the size threshold for the very first exclusion checks: a range must be larger than the given threshold to be checked there. Default is 2500. --unroll sets the maximum array size up to which loops are unrolled and inlined. Default is 5.

-v sets the UnicodeData version to use or download. -n avoids the network lookup by setting the default UnicodeData version to the UCD version from perl (which is usually older than the version from https://www.unicode.org/versions/latest/). --ud sets the name of the UnicodeData.txt file to use, default: UnicodeData.txt. --out sets the output filename, default: towctrans.h.

Support is also planned for the multi-byte folding tables for wcsfc_s() in safeclib. The single-character towupper and towlower conversions are meaningless for many multi-character Unicode mappings, those with status F (full folding). Use a full string case-folding function instead, such as safeclib wcsfc_s, ICU u_strToUpper or libunistring uc_toupper.
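The reason is that a function returning a single code point cannot express a mapping that expands. CaseFolding.txt, for example, folds U+00DF (sharp s) to the two code points "ss" with status F. The sketch below handles only that one full folding plus ASCII; fold_full_sketch is a hypothetical name and no substitute for wcsfc_s:

```c
#include <stddef.h>
#include <wchar.h>

/* Sketch: case-fold src into dst (capacity dstlen, including the NUL).
   Returns the folded length, or (size_t)-1 if dst is too small. */
static size_t fold_full_sketch(const wchar_t *src, wchar_t *dst, size_t dstlen)
{
    size_t n = 0;

    if (dstlen == 0)
        return (size_t)-1;
    for (; *src != L'\0'; src++) {
        if (*src == 0x00DF) {             /* full folding: one in, two out */
            if (n + 2 >= dstlen)
                return (size_t)-1;
            dst[n++] = L's';
            dst[n++] = L's';
        } else {
            if (n + 1 >= dstlen)
                return (size_t)-1;
            dst[n++] = (*src >= L'A' && *src <= L'Z')
                     ? (wchar_t)(*src + 32) : *src;
        }
    }
    dst[n] = L'\0';
    return n;
}
```

The output can be longer than the input, which is why string folding functions take a destination buffer and length rather than mapping in place.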

PERFORMANCE

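Currently it is small and fast enough compared to the other implementations, and in particular correct compared to glibc, which ignores characters from other locales. In the listings below the percentage column is the time of the first row divided by the time of the given row, so higher is faster.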

The bench uses Unicode 10.0 data (-v 10) so that our tables match the Unicode version compiled into musl-old. Benchmark errors fall into the following categories, none of which are bugs in our code:

Circled letters 0x24B6-0x24E9 (affects musl-old, 52 diffs): Our code correctly maps these per UnicodeData.txt (e.g. towupper(0x24D0)=0x24B6). musl-old does not map them at all.
Georgian Mtavruli 0x1C90-0x1CBF (affects musl-new, 96 diffs): These uppercase Georgian letters were added in Unicode 11.0. musl-new includes them, but our Unicode 10.0 bench tables do not, so musl-new reports differences for every Mtavruli codepoint.
Post-Unicode-10.0 additions (affects musl-new, 16+ diffs): Additional cased characters introduced after Unicode 10.0 (Osage, Adlam, etc.) are present in musl-new but absent from our Unicode 10.0 tables.
glibc errors: these are caused by glibc ignoring cased characters from non-Latin locales entirely.

make -C examples
./bench
            my:        738 [us]  100,00 %
       my_excl:        818 [us]   90,22 %
      my_low16:        733 [us]  100,68 %
       my_bits:        681 [us]  108,37 %
    my_bsearch:        576 [us]  128,12 %
   my_bsearchb:        575 [us]  128,35 %
     my_unroll:        554 [us]  133,21 %
     my_iftree:        479 [us]  154,07 %	42 errors
    my_iftreeb:        489 [us]  150,92 %	86 errors
      my_table:        131 [us]  563,36 %
      musl-new:        149 [us]  495,30 %	9 errors
      musl-old:        970 [us]   76,08 %	3 errors
         glibc:        135 [us]  546,67 %	14 errors

wc -c towctrans-*.o
  3336 towctrans-my.o
  3384 towctrans-myexcl.o
  3424 towctrans-mylow16.o
  3648 towctrans-mybits.o
  3704 towctrans-mybsearch.o
  4496 towctrans-mybsearch-both.o
  3808 towctrans-myunroll.o
  8264 towctrans-myiftree.o
 11008 towctrans-myiftree-both.o
  6784 towctrans-mytable.o
  6808 towctrans-musl-new.o
  3416 towctrans-musl-old.o
 97432 towctrans-glibc.o

Results for more --bits size combinations follow. The variants reporting 5 errors just need some logic fixups.

--bits 16:10:8, --bits 12:12:8 and others look promising; the best are about twice as fast as the default.

./bench-bits.sh
                                           C  CL P  PL EX
     16:8:8:        316 [us] 100.0 % 	66 12 120 0 6
    16:16:8:        252 [us] 125.4 % 	72 6 120 0 6
    16:10:8:        190 [us] 166.3 % 	66 12 120 0 6
   18:14:10:        167 [us] 189.2 % 	76 2 120 0 6	5 errors
    18:14:8:        157 [us] 201.3 % 	76 2 120 0 6	5 errors
   18:12:10:        154 [us] 205.2 % 	75 3 120 0 6	5 errors
    18:12:8:        155 [us] 203.9 % 	75 3 120 0 6	5 errors
    16:12:6:        207 [us] 152.7 % 	66 12 120 0 6	5 errors
    16:10:6:        327 [us] 96.6 % 	66 12 120 0 6	5 errors
    14:10:8:        242 [us] 130.6 % 	60 18 120 0 6	5 errors
    14:12:6:        157 [us] 201.3 % 	56 22 120 0 6	5 errors
    12:12:8:        157 [us] 201.3 % 	33 45 120 0 6	5 errors

5248 towctrans-bmy.o (16:8:8)
5320 towctrans-bmylow16.o (16:16:8)
5656 towctrans-bmybits.o (16:10:8)
5832 bits-12_12_8.o
5760 bits-14_12_6.o
5728 bits-14_10_8.o
5680 bits-16_10_6.o
5680 bits-16_12_6.o
5440 bits-18_12_8.o
5456 bits-18_12_10.o
5352 bits-18_14_8.o
5368 bits-18_14_10.o

INSTALLATION

Perl 5.12 or later is required, as well as the LWP::UserAgent CPAN module.

This module does not need to be installed; running gen_wctrans is enough. For full testing and a global installation, however, run:

perl Makefile.PL
make
make test
make test-all
sudo make install

sudo apt install wget / sudo dnf install wget / ...
sudo cp bin/gen_wctrans /usr/local/bin/
cpan LWP::UserAgent / sudo apt install libwww-perl / ...

DEPENDENCIES

This module requires a UnicodeData.txt file from the Unicode Character Database, which is automatically downloaded if missing.

AUTHOR

Reini Urban <rurban@cpan.org>

COPYRIGHT AND LICENSE

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The generated files are MIT licensed. See the headers of the generated files.
