Ben Bullock

NAME

Lingua::JA::FindDates - scan text to find dates in a Japanese format

SYNOPSIS

    use utf8;
    use Lingua::JA::FindDates 'subsjdate';
    
    # Given a string, find and substitute all the Japanese dates in it.
    
    my $dates = '昭和41年三月16日';
    print subsjdate ($dates), "\n";
    $dates = 'blah blah blah 三月16日';
    print subsjdate ($dates), "\n";

produces output

    March 16, 1966
    blah blah blah March 16

(This example is included as synopsis.pl in the distribution.)

    use utf8;
    use Lingua::JA::FindDates 'subsjdate';
    
    # subsjdate can call back when a date is found:
    
    sub replace_callback
    {
        my ($data, $before, $after) = @_;
        print "'$before' was replaced by '$after'.\n";
    }
    my $date = '三月16日';
    subsjdate ($date, {replace => \&replace_callback});
    

produces output

    '三月16日' was replaced by 'March 16'.

(This example is included as synopsis-replace.pl in the distribution.)

    use utf8;
    use Lingua::JA::FindDates 'subsjdate';
    
    # A routine can be used to format the date any way, letting C<subsjdate>
    # print it:
    
    sub my_date
    {
        my ($data, $original, $date) = @_;
        return join '/', $date->{month}."/".$date->{date};
    }
    my $dates = '三月16日';
    print subsjdate ($dates, {make_date => \&my_date}), "\n";

produces output

    3/16

(This example is included as synopsis-make-date.pl in the distribution.)

    use utf8;
    # Convert Western to Japanese dates
    use Lingua::JA::FindDates 'seireki_to_nengo';
    print seireki_to_nengo ('1989年1月1日'), "\n";

produces output

    昭和64年1月1日

(This example is included as synopsis-nengo.pl in the distribution.)

VERSION

This documents Lingua::JA::FindDates version 0.029 corresponding to git commit 2b26b4468108dd021c541a311ab964d8079b1af3 released on Wed May 1 19:18:01 2019 +0900.

DESCRIPTION

This is an aid for translating documents in Japanese. This module's main routine, "subsjdate", scans a text and finds things which appear to be Japanese dates.

The module recognizes a variety of date formats. It recognizes the typical format of dates with the year first, followed by the month, then the day, such as 平成20年七月十日 (Heisei nijūnen shichigatsu tōka). It also recognizes combinations such as years alone, years and months, a month and day without a year, fiscal years (年度, nendo), parts of the month, like 中旬 (chūjun, the middle of the month), and periods between two dates.

The module recognizes both Japanese years, such as "平成24年" (Heisei), and European years, such as 2012年. It recognizes ASCII numerals, 1, 2, 3; the "wide" or "double width" numerals sometimes used in Japan, 1, 2, 3 (see What is "wide ASCII"?); and the kanji-based numeral system, 一, 二,三. It recognizes some special date formats such as 元年 for the first year of an era. It recognizes era names identified by their initial letters, such as S41 年 for Shōwa 41 (1966). It recognizes dates regardless of spacing between characters, such as "平 成 二 十 年 八 月".

The input text must be marked as Unicode, in other words character data, not byte data.

The module has been tested on several hundreds of documents, and it should cope with all common Japanese dates. If you find that it cannot identify some kind of date within Japanese text, please report a bug.

FUNCTIONS

subsjdate

    my $translation = subsjdate ($text);

Translate Japanese dates into American dates. The first argument to subsjdate is a string like "平成20年7月3日(木)". The routine looks through the string to see if there is anything which appears to be a Japanese date. If it finds one, it makes an equivalent date in English, and then substitutes it into $text, as if performing the following type of operation:

    $text =~ s/平成20年7月3日(木)/Thursday, July 3, 2008/g;

If the text contains the interval between two dates, subsjdate attempts to convert that into an English-language interval.

The default dates are American-style, with the month first. Users can supply a different date-making function using the second argument:

    my $translation = subsjdate ($text, {make_date => \&mymakedate,
                                 make_date_interval => \&myinterval});

The second argument is a hash reference containing callbacks. It may have the following members:

data

Any data you want to pass to your callback functions. This is passed as the first argument to your functions. If you do not supply a value, subsjdate passes the undefined value as the first argument.

make_date
    subsjdate ($text, {make_date => \& mymakedate});

A replacement for the default "default_make_date" function. The default function is American-style and turns "平成10年11月12日" into "November 12, 1998". For example, to change this to dates in the form "Th 1998/11/10", use a routine like the following:

    use utf8;
    use Lingua::JA::FindDates 'subsjdate';
    sub mymakedate
    {
        my ($data, $original, $date) = @_;
        return qw{Bad Mo Tu We Th Fr Sa Su}[$date->{wday}]. " " .
        $date->{year}.'/'.$date->{month}.'/'.$date->{date};
    } 
    my $input = '昭和34年1月17日(土)。平成9年12月20日(土)。';
    my $output = subsjdate ($input, {make_date => \& mymakedate});
    print "$output\n";

produces output

    Sa 1959/1/17。Sa 1997/12/20。

(This example is included as subsjdate-make-date.pl in the distribution.)

The first two arguments passed to your routine will be whatever you supplied in "data", and $original, the original Japanese-language date as a string, and the third argument is the parsed date in the form of a hash reference, with the fields year (Western-style year), month (1-12), date (1-31), and wday (1-7 for Monday to Sunday). Your routine must check whether the fields year, month, date, and wday are defined, since "subsjdate" matches all kinds of dates, including year only, month/day only, and year/month only dates.

make_date_interval

A replacement for the "default_make_date_interval" function.

    subsjdate ($text, {make_date_interval => \&mymakedateinterval});

Your routine is called in the same way as the default routine, "default_make_date_interval". Its arguments are $data and $original, as for "make_date", and the two dates in the form of hash references, with the same keys as for make_date.

    use utf8;
    use Lingua::JA::FindDates 'subsjdate';
    sub crazy_date
    {
        my ($date) = @_;
        my $out = "$date->{month}/$date->{date}";
        if ($date->{year}) {
            $out = "$date->{year}/$out";
        }
        return $out;
    }
    sub myinterval
    {
        my ($data, $original, $date1, $date2) = @_;
        # Ignore C<$data> and C<$original>.
        return crazy_date ($date1) . " until " . crazy_date ($date2);
    } 
    my $input = '昭和34年1月17日〜12月20日。';
    #$Lingua::JA::FindDates::verbose = 1;
    my $output = subsjdate ($input, {make_date_interval => \& myinterval});
    print "$output\n";

produces output

    1959/1/17 until 12/20。

(This example is included as subsjdate-make-interval.pl in the distribution.)

replace

This enables you to supply a function which is called back to notify you when text is replaced.

    subsjdate ($text, {replace => \&my_replace, data => $my_data});
    # Now "my_replace" is called as
    # my_replace ($my_data, $before, $after);

The arguments to your function are whatever you gave in "data", the "before" date text, and the "after" date text.

kanji2number

    kanji2number ($knum)

kanji2number is a simple kanji number convertor for use with dates. Its input is one string of kanji numbers only, like '三十一'. It can deal with kanji numbers with or without ten/hundred/thousand kanjis. The return value is the numerical value of the kanji number, like 31, or zero if it can't read the number.

kanji2number only goes up to thousands, because usually dates only go that far. Other modules on CPAN may be able to convert larger numbers.

seireki_to_nengo

    use utf8;
    use Lingua::JA::FindDates 'seireki_to_nengo';
    print seireki_to_nengo ('1989年1月1日');

produces output

    昭和64年1月1日

(This example is included as seireki-to-nengo.pl in the distribution.)

This function substitutes Western-style dates with Japanese-style "nengo" dates (年号), leaving the rest of the text unchanged. The "nengo" dates go back to the Meiji period (1868). See "BUGS".

nengo_to_seireki

    use utf8;
    use Lingua::JA::FindDates 'nengo_to_seireki';
    print nengo_to_seireki ('昭和64年1月1日');

produces output

    1989年1月1日

(This example is included as nengo-to-seireki.pl in the distribution.)

This function substitutes Japanese-style "nengo" dates (年号) with Western-style dates, leaving the rest of the text unchanged. The "nengo" dates go back to the Meiji period (1868). See "BUGS".

regjnums

    my $number = regjnums ($number);

This is a simplistic number regularizer which utilizes "kanji2number" to convert kanji digits or wide ascii numerals into Arabic numeral equivalents. Like "kanji2number", it can only cope with kanji numbers up to 9,999.

DEFAULT CALLBACKS

This section discusses the default subroutines which are called as dates are found to convert the Japanese dates into another string format. These callbacks are not exported. In versions of this module prior to 0.022, these functions were called make_date and make_date_interval respectively. The previous names still work.

default_make_date

"subsjdate", given a date like 平成20年7月3日(木) (Heisei year 20, month 7, day 3 (Th), "Thursday 3rd July, 2008"), passes make_date a hash reference with values year => 2008, month => 7, date => 3, wday => 4 for the year, month, date and day of the week. default_make_date turns the date information supplied to it into a string representing the date. default_make_date is not exported.

Here is an example of how it operates:

    use Lingua::JA::FindDates;
    my $outdate = Lingua::JA::FindDates::default_make_date ({
        year => 2012,
        month => 3,
        date => 19,
        wday => 1,
    });
    print "$outdate\n";

produces output

    Monday, March 19, 2012

(This example is included as make-date.pl in the distribution.)

To replace the default routine default_make_date with a different format, supply a make_date callback to "subsjdate":

    use utf8;
    use Lingua::JA::FindDates 'subsjdate';
    sub my_date
    {
        my ($data, $original, $date) = @_;
        return join '/', $date->{month}."/".$date->{date};
    }
    my $dates = '三月16日';
    print subsjdate ($dates, {make_date => \&my_date});

produces output

    3/16

(This example is included as my-date.pl in the distribution.)

Note that, depending on what dates are in your document, some of the hash values may not be available, so the user routine needs to handle the cases when the year or the month or the day of the month are missing.

default_make_date_interval

    use Lingua::JA::FindDates;
    print Lingua::JA::FindDates::default_make_date_interval (
    {
        # 19 February 2010
        year => 2010,
        month => 2,
        date => 19,
        wday => 5,
    },
    # Monday 19th March 2012.
    {
        year => 2012,
        month => 3,
        date => 19,
        wday => 1,
    },), "\n";
    

produces output

    Friday 19 February, 2010-Monday 19 March, 2012

(This example is included as default-make-date-interval.pl in the distribution.)

This function is called when an interval of two dates, such as 平成3年 7月2日〜9日, is detected. It makes a string to represent that interval in English. It takes two arguments, hash references to the first and second date. The hash references are in the same format as "default_make_date".

This function is not exported. It is the default used by "subsjdate". You can use another function instead of this default by supplying a value make_date_interval as a callback in "subsjdate".

BUGS

The following special cases are not covered.

Doesn't do 元日 (ganjitsu)

This date (another way to write "1st January") is a little difficult, since the characters which make it up could also occur in other contexts, like 元日本軍 gennihongun, "the former Japanese military". Correctly parsing it requires a linguistic analysis of the text, which this module isn't able to do.

10月第4月曜日

"10月第4月曜日", which means "the fourth Monday of October", comes out as "October第April曜日".

今年6月

The module does not handle things like 今年 (this year), 去年 (last year), or 来年 (next year).

末日

The module does not handle "末日" (matsujitsu) "the last day" (of a month).

土日祝日

The module does not handle "土日祝日" (weekends and holidays).

年末年始

The module does not handle "年末年始" (the new year period).

Please also note the following:

Minimal sanity check of Japanese era dates

It does not detect that dates like 昭和99年 (Showa 99, an impossible year, since Showa 63 (1988) was succeeded by Heisei 1 (1989)) are invalid.

Only goes back to Meiji

The date matching only goes back to the Meiji era. There is DateTime::Calendar::Japanese::Era if you need to go back further.

Doesn't find dates in order

For those supplying their own callback routines, note that the dates returned won't be in the order that they are in the text, but in the order that they are found by the regular expressions, which means that in a string with two dates, the callbacks might be called for the second date before they are called for the first one. Dates which contain more information, such as a month, day and year, are searched for before shorter ones, such as a year only or a month and day only.

UTF-8 version only

This module only understands Japanese encoded in Perl's internal form (UTF-8).

Trips a bug in Perl 5.10

If you send subsjdate a string which is pure ASCII, you'll get a stream of warning messages about "uninitialized value". The error messages are wrong - this is actually a bug in Perl, reported as bug number 56902 (http://rt.perl.org/rt3/Public/Bug/Display.html?id=56902). But sending this routine a string which is pure ASCII doesn't make sense anyway, so don't worry too much about it.

EXPORTS

This module exports "subsjdate", "kanji2number", "seireki_to_nengo", "nengo_to_seireki" and "regjnums" on request. The variable "@jdatere" is also exportable on request. "$verbose" is not exportable, since it controls the module's behaviour.

All the exportable functions and variables can be exported using the tag :all as in

    use Lingua::JA::FindDates ':all';

SEE ALSO

Other modules

DateTime::Locale::JA

Minimal selection of Japanese date functions. It's not complete enough to deal with the full range of dates in actual documents.

DateTime::Format::Japanese

This parses Japanese dates. Unlike the present module it claims to also format them, so it can turn a DateTime object into a Japanese date, and it also does times.

DateTime::Calendar::Japanese::Era

A full set of Japanese eras.

Lingua::JA::Moji

Japanese character transliterations, including wide ascii numerals.

Online date converter

An online date converter which uses this module may be found at https://www.lemoda.net/japanese/findjdates/index.cgi.

DEPENDENCIES

Carp is used to report errors. This module has no dependencies on non-core modules.

VARIABLES

@jdatere

The Japanese date regular expressions are stored in an array @jdatere containing a pair of a regular expression to match a kind of date, and a string like "ymdw" which contains letters saying what to do with captured subexpressions.

The array is ordered from the regular expression with the longest match (like "year / month / day / weekday") to the shortest (like "year" only).

The string explains what to do with the captured elements like $1, $2, etc. from the regular expression. For example, if the first letter is "y", then $1 is a year in Western format like 2008, or if the third letter is "w", then $3 is the day of the week, from 1 to 7. The following elements exist:

e

Japanese era (string).

j

Japanese year (string representing small number)

x

Empty month and day

m

The month number (from 1 to 12, 13 for a blank month, 0 for an invalid month)

d

The day of month (from 1 to 31, 32 for a blank day, 0 for an invalid day)

w

A weekday (from Monday = 1 to Sunday = 7, zero or undefined for an invalid weekday)

z

A jun (旬), a period of one third of a month, or ten days.

x

This indicates a date like a "fill in the blanks" date on a form.

1

After another code, indicates the first of a pair of two things. For example, the matching code for

    平成9年10月17日〜20日

is

    ejmd1d2

$verbose

If you want to see what the module is doing, set

  $Lingua::JA::FindDates::verbose = 1;

This makes "subsjdate" print out each regular expression and reports whether it matched, which looks like this:

  Looking for y in ([0-90-9]{4}|[十六七九五四千百二一八三]?千[十六七九五四千百二一八三]*)\h*年
  Found '千九百六十六年': Arg 0: 1966 -> '1966'

AUTHOR

Ben Bullock, <bkb@cpan.org>

COPYRIGHT & LICENCE

This package and associated files are copyright (C) 2008-2019 Ben Bullock.

You can use, copy, modify and redistribute this package and associated files under the Perl Artistic Licence or the GNU General Public Licence.