NAME

jacode.pl - Perl library for Japanese character code conversion

SYNOPSIS

    require 'jacode.pl';

    # note: file name is 'jacode.pl', but package name is 'jcode'

    # Perl4 interface:

    &jcode'getcode(*line)
    &jcode'convert(*line, $ocode [, $icode [, $option]])
    &jcode'xxx2yyy(*line [, $option])
    &jcode'to($ocode, $line [, $icode [, $option]])
    &jcode'jis($line [, $icode [, $option]])
    &jcode'euc($line [, $icode [, $option]])
    &jcode'sjis($line [, $icode [, $option]])
    &jcode'utf8($line [, $icode [, $option]])
    &jcode'jis_inout($in, $out)
    &jcode'get_inout($string)
    &jcode'cache()
    &jcode'nocache()
    &jcode'flushcache()
    &jcode'flush()
    &jcode'h2z_xxx(*line)
    &jcode'z2h_xxx(*line)
    &jcode'tr(*line, $from, $to [, $option])
    &jcode'trans($line, $from, $to [, $option])
    &jcode'init()

    $jcode'convf{'xxx', 'yyy'}
    $jcode'z2hf{'xxx'}
    $jcode'h2zf{'xxx'}

    # Perl5 interface:

    jcode::getcode(\$line)
    jcode::convert(\$line, $ocode [, $icode [, $option]])
    jcode::xxx2yyy(\$line [, $option])
    jcode::to($ocode, $line [, $icode [, $option]])
    jcode::jis($line [, $icode [, $option]])
    jcode::euc($line [, $icode [, $option]])
    jcode::sjis($line [, $icode [, $option]])
    jcode::utf8($line [, $icode [, $option]])
    jcode::jis_inout($in, $out)
    jcode::get_inout($string)
    jcode::cache()
    jcode::nocache()
    jcode::flushcache()
    jcode::flush()
    jcode::h2z_xxx(\$line)
    jcode::z2h_xxx(\$line)
    jcode::tr(\$line, $from, $to [, $option])
    jcode::trans($line, $from, $to [, $option])
    jcode::init()

    &{$jcode::convf{'xxx', 'yyy'}}(\$line)
    &{$jcode::z2hf{'xxx'}}(\$line)
    &{$jcode::h2zf{'xxx'}}(\$line)

ABSTRACT

This software is upward compatible with jcode.pl. 'ja' is the ISO 639-1 code for 'Japanese' and is unrelated to the 'JA Group Organization'.

Conversion from 'sjis' to 'utf8' uses the following table.

http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP932.TXT

Conversion from 'utf8' to 'sjis' uses CP932.TXT and the following table.

PRB: Conversion Problem Between Shift-JIS and Unicode

http://support.microsoft.com/kb/170559/en-us

What's this software good for ...

  • Upward compatible with jcode.pl

  • Perl4 script

  • Acts as a wrapper to Encode::from_to

  • Support HALFWIDTH KATAKANA

  • Support UTF-8

  • Hidden UTF8 flag

  • No object-oriented programming

  • Lets you reuse existing code and know-how

DEPENDENCIES

This software requires perl 4.036 or later.

PERL4 INTERFACE

&jcode'getcode(*line)
  Returns 'jis', 'sjis', 'euc', 'utf8' or undef according
  to the Japanese character code found in $line.  Returns
  'binary' if the data contains non-character codes.
  
  When evaluated in array context, it returns a list
  of two items.  The first value is the number of
  characters that matched the expected code, and the
  second value is the code name.  The list form is useful
  only when the number is not 0 and the code is undef;
  that case means 'euc' and 'sjis' could not be told
  apart because their evaluation scores were exactly the
  same.  This interface is rather tricky, though.
  
  Distinguishing euc from sjis is very difficult,
  sometimes impossible, and can even give the wrong
  result when the text includes JIS X0201 KANA characters.
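
  The two calling contexts can be sketched as follows.  This is a
  minimal illustration, assuming jacode.pl is installed where
  `require' can find it; it uses the reference-style call described
  under PERL5 INTERFACE so that it works with `my' variables, and the
  byte string is only sample data.

```perl
require 'jacode.pl';

# ESC $ B ... ESC ( B brackets JIS X0208 bytes (here: one hiragana).
my $line = "\e\$B\x24\x22\e(B";

# Scalar context: only the code name is returned.
my $code = jcode::getcode(\$line);
print defined $code ? $code : 'undef', "\n";

# List context: the match count comes first, then the code name.
my ($matched, $name) = jcode::getcode(\$line);
if ($matched && !defined $name) {
    # Non-zero count with an undefined name is the euc/sjis tie case.
    print "cannot tell euc from sjis\n";
}
```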
&jcode'convert(*line, $ocode [, $icode [, $option]])
  Converts the contents of $line to the Japanese code
  given in the second argument $ocode.
  $ocode can be any of "jis", "sjis", "euc" or "utf8"; use
  "noconv" when you don't want any code conversion.
  The input code is detected automatically from the line
  itself when $icode is not supplied.  $icode can also be
  specified, but the xxx2yyy routines are more efficient
  when both codes are known.
  
  It returns the code of the input string in scalar context,
  and a list of a pointer to the conversion subroutine and
  the input code in array context.
  
  The Japanese character codes JIS X0201, X0208, X0212 and
  ASCII are supported.  JIS X0212 characters cannot
  be represented in sjis or utf8, and they are replaced
  by the "geta" character when converted to sjis.
  JIS X0213 characters cannot be represented at all.
  
  On perl 5.8.1 or later, &jcode'convert acts as a wrapper
  to Encode::from_to.  When $ocode or $icode is none of "jis",
  "sjis", "euc" or "utf8", and the Encode module is available,
 
  Encode::from_to( $line, $icode, $ocode )
 
  is executed instead of
 
  &jcode'convert(*line, $ocode, $icode, $option).
 
  In this case, the pointer to the conversion subroutine
  returned in array context is not meaningful.
 
  See the next entry, &jcode'xxx2yyy, for the $option parameter.
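
  A minimal sketch of an in-place conversion with an explicit input
  code, assuming jacode.pl is installed where `require' can find it.
  It uses the reference-style call from PERL5 INTERFACE; the
  Shift_JIS bytes are sample data chosen for illustration.

```perl
require 'jacode.pl';

# Shift_JIS bytes for two hiragana characters.
my $line = "\x82\xA0\x82\xA2";

# Convert $line in place from sjis to euc.  In scalar context the
# return value is the input code that was used.
my $icode = jcode::convert(\$line, 'euc', 'sjis');

printf "input code: %s, converted bytes: %s\n",
    $icode, unpack('H*', $line);
```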
&jcode'xxx2yyy(*line [, $option])
  Converts the Japanese code from xxx to yyy.  Strings xxx
  and yyy are any combination of "jis", "euc", "sjis"
  and "utf8".  These routines return the *approximate*
  number of converted bytes, so a return value of 0 means
  the line was not converted at all.
  
  The optional parameter $option selects an additional
  conversion.  String "z" converts JIS X0201 KANA
  to JIS X0208 KANA, and "h" does the reverse.
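
  The effect of the "z" option can be sketched like this, assuming
  jacode.pl is installed where `require' can find it and using the
  reference-style call from PERL5 INTERFACE.  The single halfwidth
  katakana byte is sample data.

```perl
require 'jacode.pl';

# One JIS X0201 (halfwidth) katakana character in Shift_JIS.
my $plain = "\xB1";
my $wide  = "\xB1";

# Without an option the character stays halfwidth in EUC.
jcode::sjis2euc(\$plain);

# With "z" it is converted to the JIS X0208 (fullwidth) katakana.
my $n = jcode::sjis2euc(\$wide, 'z');

printf "plain: %s  with z: %s\n",
    unpack('H*', $plain), unpack('H*', $wide);
print "line was converted\n" if $n;   # 0 would mean nothing changed
```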
$jcode'convf{'xxx', 'yyy'}
  Each value of this associative array is a pointer to the
  corresponding subroutine jcode'xxx2yyy().
&jcode'to($ocode, $line [, $icode [, $option]])
&jcode'jis($line [, $icode [, $option]])
&jcode'euc($line [, $icode [, $option]])
&jcode'sjis($line [, $icode [, $option]])
&jcode'utf8($line [, $icode [, $option]])
  These functions provide an easy-to-use
  call/return-by-value interface.  You can use them
  in an s///e operation or anywhere else a returned
  value is convenient.
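
  A short sketch of the by-value functions, including use inside
  s///e.  It assumes jacode.pl is installed where `require' can find
  it; the regular expression for Shift_JIS double-byte characters and
  the sample strings are illustrative assumptions, not part of the
  library.

```perl
require 'jacode.pl';

my $sjis = "\x82\xA0";                       # one hiragana in Shift_JIS

# Both calls return a converted copy and leave $sjis untouched.
my $utf8 = jcode::utf8($sjis, 'sjis');
my $same = jcode::to('utf8', $sjis, 'sjis');

# Convert only the double-byte runs of a Shift_JIS line to EUC.
my $text = "name=\x82\xA0\x82\xA2;";
$text =~ s/((?:[\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC])+)/jcode::euc($1, 'sjis')/ge;

printf "%s %s\n", unpack('H*', $utf8), unpack('H*', $text);
```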
&jcode'jis_inout($in, $out)
  Sets or queries the JIS start and end sequences.  The
  defaults are "ESC-$-B" and "ESC-(-B".  If you supply only
  one character, "ESC-$" or "ESC-(" is prepended to it
  respectively.  Actually "ESC-(-B" is not a sequence
  that ends JIS code but one that starts the ASCII code
  set, so the names `in' and `out' are somewhat misleading.
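
  A minimal sketch of switching the escape sequences, assuming
  jacode.pl is installed where `require' can find it.  The choice of
  '@' and 'J' is only an example, and the defaults are restored at
  the end because the setting is global to the package.

```perl
require 'jacode.pl';

# One-character arguments are expanded to ESC-$-@ and ESC-(-J.
jcode::jis_inout('@', 'J');

# Called without arguments, it reports the current sequences.
my ($in, $out) = jcode::jis_inout();

# Subsequent conversions to JIS use the new sequences.
my $jis = jcode::jis("\xA4\xA2", 'euc');
print unpack('H*', $jis), "\n";

# Put the documented defaults back.
jcode::jis_inout('B', 'B');
```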
&jcode'get_inout($string)
  Get JIS start and end sequences from $string.
&jcode'cache()
&jcode'nocache()
&jcode'flushcache()
&jcode'flush()
  Normally, converted characters are cached in memory so
  that the same calculation is not repeated many times.
  To disable this caching, call &jcode'nocache().  It
  can be re-enabled with &jcode'cache(), and the cache is
  flushed by calling &jcode'flushcache().  The cache() and
  nocache() functions return the previous caching state.
  &jcode'flush() is an alias of &jcode'flushcache(), kept
  for compatibility with older documentation.
&jcode'h2z_xxx(*line)
  JIS X0201 KANA (so-called Hankaku-KANA) to JIS X0208 KANA
  (Zenkaku-KANA) code conversion routine.  String xxx is
  any of "jis", "sjis", "euc" and "utf8".  Because it is
  difficult to recognize the code set of a 1-byte KATAKANA
  string, automatic code recognition is not supported.
&jcode'z2h_xxx(*line)
  JIS X0208 to JIS X0201 KANA code conversion routine.
  String xxx is any of "jis", "sjis", "euc" and "utf8".
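
  A round trip between the two KANA forms can be sketched as
  follows, assuming jacode.pl is installed where `require' can find
  it and using the reference-style call from PERL5 INTERFACE.  The
  Shift_JIS bytes are sample data.

```perl
require 'jacode.pl';

# Two halfwidth (JIS X0201) katakana characters in Shift_JIS.
my $line = "\xB1\xB2";

# Halfwidth to fullwidth; the code must be named in the function
# because it is not detected automatically.
jcode::h2z_sjis(\$line);
my $fullwidth = $line;

# Fullwidth back to halfwidth.
jcode::z2h_sjis(\$line);

printf "%s -> %s\n", unpack('H*', $fullwidth), unpack('H*', $line);
```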
$jcode'z2hf{'xxx'}
$jcode'h2zf{'xxx'}
  These are pointers to the corresponding functions, just
  like $jcode'convf.
&jcode'tr(*line, $from, $to [, $option])
  &jcode'tr emulates the tr operator for 2-byte codes.  Only
  'd' is interpreted as an option.

  A range operator like `A-Z' is partially supported for
  2-byte codes.  The code must be JIS or EUC, and the first
  and last characters must have the same first byte.

  CAUTION: The handling of the range operator is a kind of
  trick and is not perfect.  If you need to translate the
  `-' character itself, be sure to put it at the beginning
  or the end of the $from and $to strings.
&jcode'trans($line, $from, $to [, $option])
  Same as &jcode'tr, but accepts a string and returns the
  string after translation.
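
  A sketch of a 2-byte range translation, assuming jacode.pl is
  installed where `require' can find it.  The example maps EUC
  fullwidth digits to ASCII digits; the byte strings are sample data
  and both range endpoints share the same first byte, as required.

```perl
require 'jacode.pl';

my $from = "\xA3\xB0-\xA3\xB9";    # fullwidth 0 through 9 in EUC
my $to   = '0-9';

# By value: returns the translated string.
my $digits = jcode::trans("\xA3\xB1\xA3\xB2\xA3\xB3", $from, $to);
print "$digits\n";

# In place: pass a reference to the line.
my $line = "tel \xA3\xB0\xA3\xB3";
jcode::tr(\$line, $from, $to);
print "$line\n";
```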
&jcode'init()
  Initializes the variables used in this package.  You
  don't have to call this when loading jacode.pl with `do'
  or `require'.  Call it first if you have embedded
  jacode.pl at the end of your script.

PERL5 INTERFACE

The current jacode.pl is written in Perl 4, but it can be used from Perl 5 by passing `references'. A fully Perl 5 capable version is left for the future.

Since a lexical variable is not accessible through a typeglob, a *string style call doesn't work if the variable is declared with `my'. The same thing happens to the special variable $_ if perl is compiled with thread support. Using a reference is therefore generally recommended to avoid these mysterious errors.

jcode::getcode(\$line)
jcode::convert(\$line, $ocode [, $icode [, $option]])
jcode::xxx2yyy(\$line [, $option])
&{$jcode::convf{'xxx', 'yyy'}}(\$line)
jcode::to($ocode, $line [, $icode [, $option]])
jcode::jis($line [, $icode [, $option]])
jcode::euc($line [, $icode [, $option]])
jcode::sjis($line [, $icode [, $option]])
jcode::utf8($line [, $icode [, $option]])
jcode::jis_inout($in, $out)
jcode::get_inout($string)
jcode::cache()
jcode::nocache()
jcode::flushcache()
jcode::flush()
jcode::h2z_xxx(\$line)
jcode::z2h_xxx(\$line)
&{$jcode::z2hf{'xxx'}}(\$line)
&{$jcode::h2zf{'xxx'}}(\$line)
jcode::tr(\$line, $from, $to [, $option])
jcode::trans($line, $from, $to [, $option])
jcode::init()

SAMPLES

Convert any Kanji code to JIS and print each line with code name.

  #require 'jcode.pl';
  require 'jacode.pl';
  while (defined($s = <>)) {
      $code = &jcode'convert(*s, 'jis');
      print $code, "\t", $s;
  }

Convert all lines to JIS according to the first recognized line.

  #require 'jcode.pl';
  require 'jacode.pl';
  while (defined($s = <>)) {
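      # Pass lines without Japanese characters through unchanged.
      print($s), next unless $s =~ /[\033\200-\377]/;
      (*f, $icode) = &jcode'convert(*s, 'jis');
      print $s;
      defined(&f) || next;
      # Reuse the detected conversion routine for the remaining lines.
      while (defined($s = <>)) { &f(*s); print $s; }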
      last;
  }

The safest way of JIS conversion.

  #require 'jcode.pl';
  require 'jacode.pl';
  while (defined($s = <>)) {
      ($matched, $icode) = &jcode'getcode(*s);
      if (@buf == 0 && $matched == 0) {
          print $s;
          next;
      }
      push(@buf, $s);
      next unless $icode;
      while (defined($s = shift(@buf))) {
          &jcode'convert(*s, 'jis', $icode);
          print $s;
      }
      while (defined($s = <>)) {
          &jcode'convert(*s, 'jis', $icode);
          print $s;
      }
      last;
  }
  print @buf if @buf;

Convert SJIS to UTF-8 and print each line with perl 4.036 or later.

  #require 'jcode.pl';
  require 'jacode.pl';
  while (defined($s = <>)) {
      &jcode'convert(*s, 'utf8', 'sjis');
      print $s;
  }

Convert SJIS to UTF16-BE and print each line with perl 5.8.1 or later.

  require 'jacode.pl';
  use 5.8.1;
  while (defined($s = <>)) {
      jcode::convert(\$s, 'UTF16-BE', 'sjis');
      print $s;
  }

Convert SJIS to MIME-Header-ISO_2022_JP and print each line with perl 5.8.1 or later.

  require 'jacode.pl';
  use 5.8.1;
  while (defined($s = <>)) {
      jcode::convert(\$s, 'MIME-Header-ISO_2022_JP', 'sjis');
      print $s;
  }

BUGS AND LIMITATIONS

You must use the -Llatin switch when running under JPerl.

AUTHOR

This project was originated by INABA Hitoshi <ina@cpan.org>.

LICENSE AND COPYRIGHT

This software is free software.

Use and redistribution for ANY PURPOSE are granted as long as all copyright notices are retained. Redistribution with modification is allowed provided that you make your modified version obviously distinguishable from the original one. THIS SOFTWARE IS PROVIDED BY THE AUTHOR ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES ARE DISCLAIMED.

This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

SEE ALSO

 PERL PUROGURAMINGU
 Larry Wall, Randal L.Schwartz, Yoshiyuki Kondo
 December 1997
 ISBN 4-89052-384-7
 http://www.context.co.jp/~cond/books/old-books.html

 Understanding Japanese Information Processing
 By Ken Lunde
 September 1993
 Pages: 470
 ISBN 10: 1-56592-043-0 | ISBN 13: 9781565920439
 http://oreilly.com/catalog/9781565920439/

 CJKV Information Processing
 Chinese, Japanese, Korean & Vietnamese Computing
 By Ken Lunde
 First Edition  January 1999
 Pages: 1128
 ISBN 10: 1-56592-224-7 | ISBN 13: 9781565922242
 http://www.oreilly.com/catalog/cjkvinfo/index.html
 ISBN 4-87311-108-0
 http://www.oreilly.co.jp/books/4873111080/

 JIS KANJI JITEN
 Kouji Shibano
 Pages: 1456
 ISBN 4-542-20129-5
 http://www.webstore.jsa.or.jp/lib/lib.asp?fn=/manual/mnl01_12.htm

 Unicode NI YORU JIS X 0213 JISSOU NYUMON
 Kenzaburo Tamaru
 Pages: 200
 ISBN 978-4-89100-608-2
 http://ec.nikkeibp.co.jp/item/books/A04500.html

ACKNOWLEDGEMENTS

This software was made with reference to the software and documents created by the following hackers and other people. I am thankful to all of them.

 Larry Wall, Perl
 http://www.perl.org/

 Kazumasa Utashiro, jcode.pl
 ftp://ftp.iij.ad.jp/pub/IIJ/dist/utashiro/perl/
 http://mail.pm.org/pipermail/tokyo-pm/2002-March/001319.html

 gama, getcode.pl
 http://www2d.biglobe.ne.jp/~gama/cgi/jcode/jcode.htm

 Gappai, jcodeg.diff
 http://www.vector.co.jp/soft/win95/prog/se347514.html

 OHZAKI Hiroki, Perl memo
 http://www.din.or.jp/~ohzaki/perl.htm#JP_Code

 NAKATA Yoshinori, Ad hoc patch to reduce warnings in h2z_euc
 http://white.niu.ne.jp/yapw/yapw.cgi/jcode.pl%A4%CE%A5%A8%A5%E9%A1%BC%CD%DE%C0%A9

 Dan Kogai, Jcode module and Encode module
 http://search.cpan.org/dist/Jcode/
 http://search.cpan.org/dist/Encode/
 http://blog.livedoor.jp/dankogai/archives/50116398.html
 http://blog.livedoor.jp/dankogai/archives/51004472.html

 Donzoko CGI+--, Jcode like Encode Wrapper
 http://www.donzoko.net/cgi/jencode/

 Yusuke Kawasaki, Encode561 module
 http://www.kawa.net/works/perl/i18n-emoji/i18n-emoji.html#Encode561

 Tokyo-pm archive
 http://mail.pm.org/pipermail/tokyo-pm/