Sys::Binmode - A fix for Perl’s system call character encoding
    use Sys::Binmode;

    my $foo = "\xff";
    $foo .= "\x{100}";
    chop $foo;

    # Prints a single octet (0xFF) and a newline:
    print $foo, $/;

    # In Perl 5.32 this may print the same single octet, or it may
    # print UTF-8-encoded U+00FF. With Sys::Binmode, though, it always
    # gives the single octet, just like print:
    exec 'echo', $foo;
tl;dr: Use this module in all new code.
Ideally, a Perl application doesn’t need to know how the interpreter stores a given string internally. Perl can thus store any Unicode code point while still optimizing for size and speed when storing “bytes-compatible” strings—i.e., strings whose code points all lie below 256. Perl’s “optimized” string storage format is faster and less memory-hungry, but it can only store code points 0-255. The “unoptimized” format, on the other hand, can store any Unicode code point.
Of course, Perl doesn’t always optimize “bytes-compatible” strings; if it wants, Perl can also store such strings “unoptimized” (i.e., in Perl’s internal “loose UTF-8” format). For code points 0-127 there’s no difference between the two forms, but for 128-255 the formats differ (cf. “The ‘Unicode Bug’” in perlunicode). This means that anything that reads Perl’s internals MUST differentiate between the two forms in order to use the string correctly.
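The difference between the two storage forms can be seen with core functions alone. The following is a minimal sketch, not part of Sys::Binmode; internal_hex is a hypothetical helper written for this illustration, and it assumes an ASCII-based (non-EBCDIC) platform.

```perl
use strict;
use warnings;

# Hypothetical helper: hex dump of the bytes Perl holds for a string.
sub internal_hex {
    my ($str) = @_;    # a copy; the caller's string is untouched

    # An upgraded string's buffer is the UTF-8 form of its code points,
    # so encoding the copy reproduces those bytes.
    utf8::encode($str) if utf8::is_utf8($str);

    return unpack 'H*', $str;
}

my $downgraded = "\xe9";        # U+00E9 in the "optimized" format
my $upgraded   = $downgraded;
utf8::upgrade($upgraded);       # same code point, "unoptimized" format

# To Perl code these are the same one-character string ...
print "strings are equal\n" if $downgraded eq $upgraded;
printf "lengths: %d and %d\n", length($downgraded), length($upgraded);

# ... but the stored bytes differ for code points 128-255.
printf "downgraded: %s\n", internal_hex($downgraded);
printf "upgraded:   %s\n", internal_hex($upgraded);
```

Code that reads the buffer without checking which form it holds sees one byte in the first case and two in the second, even though both strings hold the same single code point.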
Alas, that differentiation doesn’t always happen. Thus, Perl can output a string that stores one or more 128-255 code points differently depending on whether Perl has “optimized” that string or not.
Remember, though: Perl applications should not care about Perl’s string storage internals. (This is why, for example, the bytes pragma is discouraged.) The catch is that without that knowledge, the application can’t know what it actually says to the outside world!
Thus, applications must either monitor Perl’s string-storage internals or accept unpredictable behaviour, both of which are categorically bad.
This module provides predictable behaviour for Perl’s built-in functions by downgrading all strings before giving them to the operating system. It’s equivalent to, but faster than, calling utf8::downgrade() (cf. utf8) on every argument before each system call.
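For comparison, the manual approach might look roughly like the sketch below. It uses only the core utf8:: functions; downgraded_copies is a hypothetical helper invented here, not something Sys::Binmode exports, and the command itself is left commented out.

```perl
use strict;
use warnings;

# Hypothetical helper: return downgraded copies of the given strings.
sub downgraded_copies {
    my @copies = @_;

    # utf8::downgrade() works in place and throws if a string
    # contains a code point above 255.
    utf8::downgrade($_) for @copies;

    return @copies;
}

# Same setup as the synopsis: one code point (0xFF), stored upgraded.
my $foo = "\xff";
$foo .= "\x{100}";
chop $foo;

my @argv = downgraded_copies( 'echo', $foo );

printf "before: %s\n", utf8::is_utf8($foo)     ? 'upgraded' : 'downgraded';
printf "after:  %s\n", utf8::is_utf8($argv[1]) ? 'upgraded' : 'downgraded';

# system { $argv[0] } @argv;    # would hand the OS the single octet 0xFF
```

Doing this by hand at every call site is easy to forget, which is the gap the module is meant to close.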
Predictable behaviour is always a good thing; ergo, you should use this module in all new code.
If you apply this module injudiciously to existing code you may see exceptions thrown where previously things worked just fine. This can happen if you’ve neglected to encode one or more strings before sending them to the OS; if Perl has such a string stored upgraded then Perl will, under default behaviour, send a UTF-8-encoded version of that string to the OS. In essence, it’s an implicit UTF-8 auto-encode.
The fix is to apply an explicit UTF-8 encode prior to the system call that throws the error. This is what we should do anyway; Sys::Binmode just enforces that better.
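A minimal sketch of that explicit encode follows, using the core Encode module. The filename is made up for illustration, and the open() call is left commented out so the snippet touches nothing on disk.

```perl
use strict;
use warnings;
use Encode ();

# A character string (hypothetical filename); U+263A is above 255.
my $name = "na\x{ef}ve-\x{263a}.txt";

# Such a string cannot be downgraded. With a true second argument,
# utf8::downgrade() returns false instead of throwing.
my $probe = $name;
my $can_downgrade = utf8::downgrade( $probe, 1 ) ? 1 : 0;
print "can downgrade as-is: $can_downgrade\n";

# Encoding explicitly yields a byte string, which is what the OS expects.
my $bytes = Encode::encode( 'UTF-8', $name );

printf "characters: %d, bytes: %d\n", length($name), length($bytes);

# open my $fh, '>', $bytes or die "open: $!";    # safe under Sys::Binmode
```

The encoded string contains only code points 0-255, so it downgrades cleanly and reaches the OS as the same octets whether or not Perl had it stored upgraded.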
In a POSIX operating system, an application’s communication with the OS happens entirely through byte strings. Thus, treating all OS-destined strings as byte strings is good and natural.
In Windows, though, things are weirder. For example, Windows exposes multiple APIs for creating a directory, and the one Perl uses (as of 5.32, anyway) only accepts code points 0-255. In this context Sys::Binmode doesn’t break anything, but it does reinforce one of Perl’s unfortunate limitations on Windows.
Sys::Binmode is a good idea anywhere that Perl sends byte strings to the OS. As far as I know, that’s everywhere that Perl runs. If that’s not true, please file a bug.
The unpredictable-behaviour problem that this module fixes in core Perl is also common in XS modules due to rampant use of the SvPV macro and variants. SvPV is like the bytes pragma in C: it gives you the string’s internal bytes with no regard for what those bytes represent. XS authors generally should prefer SvPVbyte or SvPVutf8 in lieu of SvPV unless the C code in question deals with Perl’s encoding abstraction.
Note in particular that, as of Perl 5.32, the default XS typemap converts scalars to C char * and const char * via an SvPV variant. This means that any module that uses that conversion logic also has this problem. So XS authors should also avoid the default typemap for such conversions.
If, for some reason, you want Perl’s unpredictable default behaviour, you can disable this module for a given block via no Sys::Binmode, thus:
    use Sys::Binmode;

    system 'echo', $foo;        # predictable/sane/happy

    {
        # You should probably explain here why you’re doing this.
        no Sys::Binmode;

        system 'echo', $foo;    # nasal demons
    }
exec and system
do and require
File tests (e.g., -e) and the following: chdir, chmod, chown, chroot, link, lstat, mkdir, open, opendir, readlink, rename, rmdir, stat, symlink, sysopen, truncate, unlink, utime
bind, connect, and setsockopt
syscall
dbmopen and the System V IPC functions aren’t covered here. If you’d like them, ask.
There’s room for optimization, if that’s gainful.
Ideally this behaviour should be in Perl’s core distribution.
Even more ideally, Perl should adopt this behaviour as default. Maybe someday!
Thanks to Leon Timmermans (LEONT) and Paul Evans (PEVANS) for some debugging and design help.
Copyright 2021 Gasper Software Consulting. All rights reserved.
This library is licensed under the same license as Perl.