
NAME

Lingua::JA::FindDates - scan text to find dates in a Japanese format

SYNOPSIS

  # Find and replace Japanese dates:

  use Lingua::JA::FindDates 'subsjdate';

  # Given a string, find and substitute all the Japanese dates in it.

  my $dates = '昭和41年三月16日';
  print subsjdate ($dates);

  # prints "March 16, 1966"

  # Find and substitute Japanese dates within a string:

  my $dates = 'blah blah blah 三月16日';
  print subsjdate ($dates);

  # prints "blah blah blah March 16"

  # subsjdate can also call back a user-supplied routine each time a
  # date is found:

  sub replace_callback
  {
    my ($data, $before, $after) = @_;
    print "'$before' was replaced by '$after'.\n";
  }
  my $dates = '三月16日';
  my $data = 'xyz'; # something to send to replace_callback
  subsjdate ($dates, {replace => \&replace_callback, data => $data});

  # prints "'三月16日' was replaced by 'March 16'."

  # A user-supplied routine can format the date in any way, and subsjdate
  # substitutes the result:

  sub my_date
  {
    my ($data, $original, $date) = @_;
    return join '/', $date->{month}, $date->{date};
  }
  my $dates = '三月16日';
  print subsjdate ($dates, {make_date => \&my_date});

  # This prints "3/16"

  # Convert Western to Japanese dates
  use Lingua::JA::FindDates 'seireki_to_nengo';
  print seireki_to_nengo ('1989年1月1日');
  # This prints "昭和64年1月1日".

VERSION

This documents version 0.021 of Lingua::JA::FindDates corresponding to git commit ff55fff51d7e0be76b7462a848cde2c6cad64d15 released on Mon Aug 29 10:43:27 2016 +0900.

DESCRIPTION

This module offers pattern matching of dates in the Japanese language. Its main routine, "subsjdate", scans a text and finds things which appear to be Japanese dates.

The module recognizes the typical format of dates with the year first, followed by the month, then the day, such as 平成20年七月十日 (Heisei nijūnen shichigatsu tōka). It also recognizes combinations such as years alone, years and months, a month and day without a year, fiscal years (年度, "nendo"), parts of the month, like 中旬 (chūjun, the middle of the month), and periods between two dates.

It recognizes both the Japanese-style era-based year format, such as "平成24年" (Heisei) for the current era, and European-style Christian era year format, such as 2012年. It recognizes several forms of numerals, including the ordinary ASCII numerals, 1, 2, 3; the "wide" or "double width" numerals sometimes used in Japan, 1, 2, 3; and the kanji-based numeral system, 一, 二, 三. It recognizes some special date formats such as 元年 for the first year of an era. It recognizes era names identified by their initial letters, such as S41年 for Shōwa 41 (1966). It recognizes dates regardless of any spacing which might be inserted between individual Japanese characters, such as "平 成 二 十 年 八 月".

The input text must be marked as Unicode, in other words character data, not byte data.
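Text read from a file or a socket usually arrives as bytes and needs decoding first. The following is a minimal sketch using the core Encode module; the hard-coded byte string is only a stand-in for real input.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for 平成20年, as they might arrive from a file.
my $bytes = "\xE5\xB9\xB3\xE6\x88\x9020\xE5\xB9\xB4";

# Decode the bytes into character data before searching for dates.
my $chars = decode ('UTF-8', $bytes);

printf "%d bytes, %d characters\n", length ($bytes), length ($chars);
# prints "11 bytes, 5 characters"
```

The decoded string in $chars is the form to pass to subsjdate. String literals in source code need "use utf8" to get the same result.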

The module has been tested on several hundred documents, and it should cope with all common Japanese dates. If you find that it cannot identify some kind of date within Japanese text, please report that as a bug.

FUNCTIONS

subsjdate

    my $translation = subsjdate ($text);

Translate Japanese dates into American dates. The first argument to subsjdate is a string like "平成20年7月3日(木)". The routine looks through the string to see if there is anything which appears to be a Japanese date. If it finds one, it calls "make_date" to make the equivalent date in English (American-style), and then substitutes it into $text, as if performing the following type of operation:

    $text =~ s/平成20年7月3日\(木\)/Thursday, July 3, 2008/g;

Users can supply a different date-making function using the second argument. The second argument is a hash reference which may have the following members:

replace

    subsjdate ($text, {replace => \&my_replace, data => $my_data});
    # Now "my_replace" is called as
    # my_replace ($my_data, $before, $after);

If the hash reference contains a replace value, subsjdate calls it as a subroutine each time it finds a date, passing the value of data, then the matched date text, then the string which replaces it.

data

Any data you want to pass to "replace", above.
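As an illustration of how "replace" and "data" work together, the following sketch records each substitution in an array passed as "data". The name record_replacement is hypothetical, and the callback is called directly here, with sample strings taken from this document, so that the sketch runs without the module installed.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical "replace" callback: store each substitution in the
# array reference supplied as "data".
sub record_replacement
{
    my ($data, $before, $after) = @_;
    push @$data, {before => $before, after => $after};
}

my @log;

# With the module loaded, the callback would be registered like this:
# subsjdate ($text, {replace => \&record_replacement, data => \@log});

# Direct calls standing in for the calls subsjdate would make.
record_replacement (\@log, '三月16日', 'March 16');
record_replacement (\@log, '平成10年11月12日', 'November 12, 1998');

printf "%d replacements recorded\n", scalar @log;
# prints "2 replacements recorded"
```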

make_date

    subsjdate ($text, {make_date => \&mymakedate});

This is a replacement for the default "make_date" function. The default function turns the Japanese dates into American-style dates, so, for example, "平成10年11月12日" is turned into "November 12, 1998". If you don't need to replace the default (if you want American-style dates), you can leave this blank. If, for example, you want dates in the form "Th 2008/7/3", you could write a routine like the following:

   sub mymakedate
   {
       my ($data, $original, $date) = @_;
       # Index 0 ("Bad") is a placeholder for a missing weekday.
       my @days = qw{Bad Mo Tu We Th Fr Sa Su};
       return $days[$date->{wday}].' '.
           $date->{year}.'/'.$date->{month}.'/'.$date->{date};
   }

Your routine will be called in the same way as the default routine, "make_date". It must check whether the hash values for the fields year, month, date, and wday are missing (zero or undefined), since "subsjdate" also matches dates which consist only of a month and day, or only of a year and month.

$data is any data which is passed in to "subsjdate". $original is the original text.
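A callback which tolerates missing fields might look like the following sketch. The name iso_make_date and the output format are illustrative assumptions, and the routine is called directly with hand-made hashes rather than through subsjdate.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical make_date callback: build an ISO-like string from
# whichever of year, month, and date are present.
sub iso_make_date
{
    my ($data, $original, $date) = @_;
    my @parts;
    # A truth test skips fields which are zero or undefined.
    push @parts, sprintf ('%04d', $date->{year})  if $date->{year};
    push @parts, sprintf ('%02d', $date->{month}) if $date->{month};
    push @parts, sprintf ('%02d', $date->{date})  if $date->{date};
    # Keep the matched text unchanged if nothing usable was supplied.
    return @parts ? join ('-', @parts) : $original;
}

print iso_make_date (undef, '平成20年7月3日',
                     {year => 2008, month => 7, date => 3}), "\n";
# prints "2008-07-03"
print iso_make_date (undef, '7月3日', {month => 7, date => 3}), "\n";
# prints "07-03"
```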

make_date_interval

This is a replacement for the make_date_interval function.

    subsjdate ($text, {make_date_interval => \&mymakedateinterval});

Your routine is called in the same way as the default routine, "make_date_interval". Its arguments are $data and $original as for make_date, and the two dates in the form of hash references with the same keys as for make_date.
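The argument order described above can be sketched as follows. Both short_date and my_make_date_interval are hypothetical names, and the sample hashes are written by hand, so this shows only the shape of such a callback.

```perl
use strict;
use warnings;

# Hypothetical helper: join whichever of year, month, and date are
# present, skipping fields which are zero or undefined.
sub short_date
{
    my ($date) = @_;
    return join '/', grep { $_ } @{$date}{qw/year month date/};
}

# Hypothetical make_date_interval callback.
sub my_make_date_interval
{
    my ($data, $original, $date1, $date2) = @_;
    return short_date ($date1) . ' - ' . short_date ($date2);
}

# Called directly with hand-made date hashes.
print my_make_date_interval (
    undef, '',
    {year => 2010, month => 2, date => 19},
    {year => 2012, month => 3, date => 19, wday => 1},
), "\n";
# prints "2010/2/19 - 2012/3/19"
```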

kanji2number

    kanji2number ($knum)

kanji2number is a simple kanji number converter for use with dates. Its input is one string of kanji numbers only, like '三十一'. It can deal with kanji numbers with or without the kanji for ten, hundred, and thousand. The return value is the numerical value of the kanji number, like 31, or zero if it can't read the number.

kanji2number only goes up to thousands, because usually dates only go that far. For a more comprehensive Japanese number converter, see Lingua::JA::Numbers.
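The behaviour described above can be approximated by the following sketch. It is not the module's own code; kanji_to_number_sketch is a hypothetical routine written for illustration, handling both the positional form (二〇〇八) and the form with 十, 百, and 千.

```perl
use strict;
use warnings;
use utf8;

my %digit;
@digit{qw/〇 一 二 三 四 五 六 七 八 九/} = (0 .. 9);
my %unit = ('十' => 10, '百' => 100, '千' => 1000);

# Hypothetical converter: returns 0 if the string cannot be read.
sub kanji_to_number_sketch
{
    my ($kanji) = @_;
    return 0 unless defined $kanji && length $kanji;
    my @chars = split //, $kanji;
    if (! grep { exists $unit{$_} } @chars) {
        # Positional form: each kanji is one decimal digit.
        my $n = 0;
        for my $c (@chars) {
            return 0 unless exists $digit{$c};
            $n = $n * 10 + $digit{$c};
        }
        return $n;
    }
    # Unit form: a digit multiplies the unit which follows it, and a
    # unit with no digit before it counts once.
    my ($total, $pending) = (0, undef);
    for my $c (@chars) {
        if (exists $digit{$c}) {
            $pending = $digit{$c};
        }
        elsif (exists $unit{$c}) {
            $total += (defined $pending ? $pending : 1) * $unit{$c};
            $pending = undef;
        }
        else {
            return 0;
        }
    }
    $total += $pending if defined $pending;
    return $total;
}

print kanji_to_number_sketch ('三十一'), "\n";   # prints "31"
print kanji_to_number_sketch ('二〇〇八'), "\n"; # prints "2008"
```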

seireki_to_nengo

  use Lingua::JA::FindDates 'seireki_to_nengo';
  print seireki_to_nengo ('1989年1月1日');
  # This prints "昭和64年1月1日".

This function substitutes Western-style dates with Japanese-style "nengo" dates (年号). The "nengo" dates go back to the Meiji period (1868).

nengo_to_seireki

  use Lingua::JA::FindDates 'nengo_to_seireki';
  print nengo_to_seireki ('昭和64年1月1日');
  # This prints "1989年1月1日".

This function substitutes Japanese-style "nengo" dates (年号) with Western-style dates. The "nengo" dates go back to the Meiji period (1868).
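The arithmetic behind both conversions is that year 1 of an era is the Western year in which the era began. The sketch below illustrates this for the four eras since Meiji; era_to_western is a hypothetical helper, not part of the module, and it takes the era name and year as separate arguments rather than parsing text.

```perl
use strict;
use warnings;
use utf8;

# First Western year of each era since Meiji.
my %era_start = (
    '明治' => 1868,
    '大正' => 1912,
    '昭和' => 1926,
    '平成' => 1989,
);

# Hypothetical helper: returns undef for an unknown era name.
sub era_to_western
{
    my ($era, $year) = @_;
    return undef unless exists $era_start{$era};
    return $era_start{$era} + $year - 1;
}

print era_to_western ('昭和', 64), "\n"; # prints "1989"
print era_to_western ('平成', 20), "\n"; # prints "2008"
```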

DEFAULT CALLBACKS

make_date

   # Monday 19th March 2012.
   make_date ({
       year => 2012,
       month => 3,
       date => 19,
       wday => 1,
   })

make_date is the default date-string-making routine. It turns the date information supplied to it into a string representing the date. make_date is not exported.

subsjdate, given a date like 平成20年7月3日(木) (Heisei year 20, month 7, day 3, in other words "Thursday the third of July, 2008"), passes make_date a hash reference with values year => 2008, month => 7, date => 3, wday => 4 for the year, month, date and day of the week. make_date returns a string, 'Thursday, July 3, 2008'. If some fields of the date aren't defined, for example in the case of a date like 7月3日 (3rd July), the hash values for the keys of the unknown parts of the date, such as year or weekday, are undefined.

To replace the default routine make_date with a different format, supply a make_date callback to subsjdate:

  sub my_date
  {
    my ($data, $original, $date) = @_;
    return join '/', $date->{month}, $date->{date};
  }
  my $dates = '三月16日';
  print subsjdate ($dates, {make_date => \&my_date});

This prints

  3/16

make_date_interval

   make_date_interval (
   {
   # 19 February 2010
       year => 2010,
       month => 2,
       date => 19,
   },
   # Monday 19th March 2012.
   {
       year => 2012,
       month => 3,
       date => 19,
       wday => 1,
   },);

This function is called when an interval of two dates, such as 平成3年 7月2日〜9日, is detected. It makes a string to represent that interval in English. It takes two arguments, hash references to the first and second date. The hash references are in the same format as "make_date".

This function is not exported. It is the default used by "subsjdate". You can use another function instead of this default by supplying a value make_date_interval as a callback in "subsjdate".

BUGS

The following special cases are not covered.

Doesn't do 元日 (ganjitsu)

This date (another way to write "1st January") is a little difficult, since the characters which make it up could also occur in other contexts, like 元日本軍 gennihongun, "the former Japanese military". Correctly parsing it requires a linguistic analysis of the text, which this module isn't able to do.

10月第4月曜日

"10月第4月曜日", which means "the fourth Monday of October", comes out as "October第April曜日".

今年6月

The module does not handle things like 今年 (this year), 去年 (last year), or 来年 (next year).

末日

The module does not handle "末日" (matsujitsu) "the last day" (of a month).

土日祝日

The module does not handle "土日祝日" (weekends and holidays).

年末年始

The module does not handle "年末年始" (the new year period).

Please also note the following:

Minimal sanity check of Japanese era dates

It does not detect that dates like 昭和99年 (Showa 99, an impossible year, since the Showa era ended with Showa 64 (1989), which was succeeded by Heisei 1 in the same year) are invalid. It does, however, only allow two digits for these named-era dates.
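A caller who needs a stricter check could apply one separately. The following is a sketch under the assumption that the era name and year have already been extracted; era_year_plausible is a hypothetical helper, and it covers only the three eras which had ended when this was written.

```perl
use strict;
use warnings;
use utf8;

# Final year of each completed era.
my %last_year = (
    '明治' => 45,
    '大正' => 15,
    '昭和' => 64,
);

# Hypothetical check: false only when the era is known and the year
# falls outside it. Eras not in the table are not judged.
sub era_year_plausible
{
    my ($era, $year) = @_;
    return 0 if $year < 1;
    return 1 unless exists $last_year{$era};
    return $year <= $last_year{$era} ? 1 : 0;
}

print era_year_plausible ('昭和', 64), "\n"; # prints "1"
print era_year_plausible ('昭和', 99), "\n"; # prints "0"
```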

Only goes back to Meiji

The date matching only goes back to the Meiji era. There is DateTime::Calendar::Japanese::Era if you need to go back further.

Doesn't find dates in order

For those supplying their own callback routines, note that the dates returned won't be in the order that they are in the text, but in the order that they are found by the regular expressions, which means that in a string with two dates, the callbacks might be called for the second date before they are called for the first one. Basically the longer forms of dates are searched for before the shorter ones.

UTF-8 version only

This module only understands Japanese encoded in Perl's internal form (UTF-8).

Trips a bug in Perl 5.10

If you send subsjdate a string which is pure ASCII, you'll get a stream of warning messages about "uninitialized value". The error messages are wrong - this is actually a bug in Perl, reported as bug number 56902 (http://rt.perl.org/rt3/Public/Bug/Display.html?id=56902). But sending this routine a string which is pure ASCII doesn't make sense anyway, so don't worry too much about it.
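One way to avoid the warnings is to skip strings which contain no non-ASCII characters. The following sketch uses a hypothetical helper, has_non_ascii, and leaves the call to subsjdate as a comment so that it runs without the module.

```perl
use strict;
use warnings;

# Hypothetical guard: true if the string contains any character
# outside the ASCII range.
sub has_non_ascii
{
    my ($text) = @_;
    return $text =~ /[^\x00-\x7F]/ ? 1 : 0;
}

my $japanese = "\x{5E73}\x{6210}20\x{5E74}"; # 平成20年
my $ascii = 'No dates here';

# With the module loaded, the guard would be used like this:
# my $out = has_non_ascii ($text) ? subsjdate ($text) : $text;

print has_non_ascii ($japanese), "\n"; # prints "1"
print has_non_ascii ($ascii), "\n";    # prints "0"
```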

EXPORTS

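This module exports nothing by default. The functions "subsjdate", "seireki_to_nengo", and "nengo_to_seireki" are exported on request, as shown in the examples above.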

SEE ALSO

DateTime::Locale::JA

Minimal selection of Japanese date functions. It's not complete enough to deal with the full range of dates in actual documents.

DateTime::Format::Japanese

This parses Japanese dates. Unlike the present module it claims to also format them, so it can turn a DateTime object into a Japanese date, and it also does times.

Lingua::JA::Numbers

Kanji / numeral converters. It converts numbers including decimal points and numbers into the billions and trillions.

DateTime::Calendar::Japanese::Era

A full set of Japanese eras.

AUTHOR

Ben Bullock, <bkb@cpan.org>

COPYRIGHT AND LICENCE

Copyright (C) 2008-2013 Ben Bullock.

You may use, copy, distribute, and modify this module under the same terms as the Perl programming language.