tconv - iconv-like interface with automatic charset detection
    #include <tconv.h>

    size_t tconv(tconv_t cd, char **inbuf, size_t *inbytesleft, char **outbuf, size_t *outbytesleft);
tconv is like iconv, but without the need to know the input charset. A caller that already uses iconv may want to redirect the iconv names to tconv with macros, e.g.:
    #define iconv_t tconv_t
    #define iconv_open(tocode, fromcode) tconv_open(tocode, fromcode)
    #define iconv(cd, ipp, ilp, opp, olp) tconv(cd, ipp, ilp, opp, olp)
    #define iconv_close(cd) tconv_close(cd)
When calling tconv_open:
tconv_open(const char *tocode, const char *fromcode)
it is legal to pass NULL for fromcode. In that case the first chunk of input is used for charset detection, so it is recommended to supply enough bytes at the very beginning. If fromcode is not NULL, no charset detection occurs and tconv behaves like iconv(3), modulo the engine being used (see below). If tocode is NULL, it defaults to fromcode.
Testing whether the return value equals (size_t) -1, together with the value of errno when it does, as documented for iconv, is the only reliable check. In theory tconv should return the number of non-reversible conversions, and this is what happens when iconv is running behind it. With another conversion engine, the return value depends on that engine's capabilities, or on how the corresponding plugin is implemented.
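The check described above can be sketched as a small wrapper. This is a minimal illustration written against the standard iconv(3) call, whose signature tconv mirrors; the names conv_status and conv_step are hypothetical and not part of tconv. Using it with tconv assumes the iconv names are redirected with macros as shown earlier.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical classification of one conversion call (not a tconv type). */
typedef enum {
    CONV_OK,          /* not (size_t)-1; the exact count is engine-dependent */
    CONV_OUTPUT_FULL, /* E2BIG: the output buffer has no more room */
    CONV_INCOMPLETE,  /* EINVAL: input ends inside a multibyte sequence */
    CONV_ILLEGAL,     /* EILSEQ: invalid multibyte sequence in the input */
    CONV_OTHER        /* any other errno value */
} conv_status;

/* Run one conversion step and look only at (size_t)-1 and errno. */
static conv_status conv_step(iconv_t cd, char **in, size_t *inleft,
                             char **out, size_t *outleft)
{
    size_t rc;

    assert(cd != (iconv_t)-1);
    rc = iconv(cd, in, inleft, out, outleft);
    if (rc != (size_t)-1) {
        /* Success: do not interpret rc any further. */
        return CONV_OK;
    }
    switch (errno) {
    case E2BIG:  return CONV_OUTPUT_FULL;
    case EINVAL: return CONV_INCOMPLETE;
    case EILSEQ: return CONV_ILLEGAL;
    default:     return CONV_OTHER;
    }
}
```

The wrapper deliberately discards the success count, since only the comparison with (size_t) -1 is portable across engines.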
When the number of bytes left in the input is 0, the return value is (size_t) -1, and errno is E2BIG, you should not rely on the *ilp position: the conversion engine may have an internal staging array that has consumed all the input bytes but is still waiting for more space to produce the output bytes. This happens, for instance, in the following case:
Whether or not you use the //TRANSLIT option, the ICU conversion engine always performs two conversions internally: one from the input encoding to UTF-16, then one from UTF-16 to the output encoding. This means that it always consumes the input bytes entirely into an internal staging area.
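One way to cope with this is to let the return value and errno alone drive the loop, and to keep calling with a larger output buffer on E2BIG even when no input bytes remain. The sketch below uses the standard iconv(3) call, whose signature tconv mirrors; convert_all is a hypothetical helper, and the closing call with a NULL input is plain iconv usage that is only assumed to carry over to tconv.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: convert a whole buffer, doubling the output on E2BIG.
 * Returns a malloc'ed buffer (the caller frees it) and stores its length in
 * *outlen, or returns NULL on any error other than E2BIG.
 */
static char *convert_all(iconv_t cd, const char *src, size_t srclen,
                         size_t *outlen)
{
    size_t cap = 16;              /* small on purpose, to exercise growth */
    size_t used = 0;
    char *buf = malloc(cap);
    char *in = (char *)src;       /* iconv never writes through this pointer */
    size_t inleft = srclen;
    int flushing = 0;

    assert(cd != (iconv_t)-1 && outlen != NULL);
    if (buf == NULL) {
        return NULL;
    }
    for (;;) {
        char *out = buf + used;
        size_t outleft = cap - used;
        size_t rc = flushing
            ? iconv(cd, NULL, NULL, &out, &outleft)   /* emit pending output */
            : iconv(cd, &in, &inleft, &out, &outleft);

        used = (size_t)(out - buf);
        if (rc == (size_t)-1) {
            char *tmp;

            if (errno != E2BIG) {
                free(buf);
                return NULL;
            }
            /* E2BIG: grow and call again, whatever inleft says. */
            tmp = realloc(buf, cap * 2);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
            cap *= 2;
            continue;
        }
        if (flushing) {
            break;
        }
        flushing = 1;             /* input accepted; now flush the state */
    }
    *outlen = used;
    return buf;
}
```

Note that the loop never tests inleft to decide whether it is done: an engine with an internal staging area may report zero input bytes left while output is still pending.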
tconv supports two engine types: one for charset detection and one for character conversion; please refer to the tconv_open_ext documentation for technical details. Engines, whatever their type, are expected to have three entry points: new, run and free. They can be:
The application already has the new, run and free entry points.
The application gives the path of a shared library, and tconv will look into it.
Python's cchardet charset detection engine, bundled with tconv, is always available. If tconv is compiled with ICU support, then ICU charset and conversion engines will be available. The ICONV conversion engine is always available.
The default charset detection engine is cchardet, bundled statically with tconv.
The default character conversion engine is ICU, if tconv has been compiled with ICU support, else iconv.
The iconv plugin is always available and is based on libiconv, bundled within tconv.
tconv() only guarantees that its plug-ins support the //TRANSLIT and //IGNORE iconv notation.
In some cases, an internal buffer is used. This means that, after a failure with errno set to EINVAL or EILSEQ, the input and output pointers are left in a state such that calling tconv again will correctly continue the conversion, but they may not be exactly at the point of failure as the original iconv specification requires.
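A caller that feeds input in chunks can therefore simply trust whatever the input pointer and byte count report after EINVAL, without assuming they mark the exact failure point. The following is a minimal sketch against the standard iconv(3) call, whose signature tconv mirrors; carry_t, feed_chunk, CARRY_MAX and CHUNK_MAX are hypothetical names and limits chosen for illustration, and the output buffer is assumed to be large enough.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <string.h>

#define CARRY_MAX 16    /* assumed bound on bytes left unconsumed after EINVAL */
#define CHUNK_MAX 256   /* assumed bound on the size of one input chunk */

/* Hypothetical carry-over state kept between two chunks. */
typedef struct {
    char   pending[CARRY_MAX];
    size_t npending;
} carry_t;

/*
 * Feed one chunk. Bytes reported as unconsumed after EINVAL are stashed and
 * prepended to the next chunk. Returns 0 on success, -1 on any other error.
 */
static int feed_chunk(iconv_t cd, carry_t *c, const char *chunk, size_t len,
                      char **out, size_t *outleft)
{
    char work[CARRY_MAX + CHUNK_MAX];
    char *in = work;
    size_t inleft;
    size_t rc;

    assert(cd != (iconv_t)-1);
    assert(len <= CHUNK_MAX && c->npending <= CARRY_MAX);

    /* Rebuild the input: leftover bytes first, then the new chunk. */
    memcpy(work, c->pending, c->npending);
    memcpy(work + c->npending, chunk, len);
    inleft = c->npending + len;

    rc = iconv(cd, &in, &inleft, out, outleft);
    if (rc == (size_t)-1) {
        if (errno != EINVAL || inleft > CARRY_MAX) {
            return -1;
        }
        /* Keep exactly what the call reports as not yet consumed. */
        memcpy(c->pending, in, inleft);
        c->npending = inleft;
        return 0;
    }
    c->npending = 0;
    return 0;
}
```

If the engine has already absorbed the trailing bytes into its own buffer, the reported count is simply 0 and nothing is carried over, so the same loop works in both situations.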
tconv_ext(3), libiconv, cchardet