NAME

Text::Extract::MaketextCallPhrases - Extract phrases from maketext-call-looking text

VERSION

This document describes Text::Extract::MaketextCallPhrases version 0.93

SYNOPSIS

    use Text::Extract::MaketextCallPhrases;
    my $results_ar = get_phrases_in_text($text);

    use Text::Extract::MaketextCallPhrases;
    my $results_ar = get_phrases_in_file($file);

DESCRIPTION

Well-designed systems use consistent calls for localization. If you're really smart you've also used Locale::Maketext!

You will probably have a collection of data that contains things like this:

    $locale->maketext( ... ); (perl)

    [% locale.maketext( ..., arg1 ) %] (TT)

    !!* locale%greetings+programs | ... , arg1 | *!! (some bizarre thing you've invented)

This module looks for the first argument to things that look like maketext() calls (See "SEE ALSO") so that you can process as needed (lint check, add to lexicon management system, etc).

By default it looks for calls to maketext(), maketext_*_context(), lextext(), and translatable() (a la Locale::Maketext::Utils::MarkPhrase). If you use a shortcut (e.g. _()) or an unperlish format, it can do that too (you might also want to look at "SEE ALSO" for an alternative to this module).

EXPORTS

get_phrases_in_text() and get_phrases_in_file() are exported by default unless you bring the module in with require() or a no-import use():

    require Text::Extract::MaketextCallPhrases;

    use Text::Extract::MaketextCallPhrases ();

INTERFACE

These functions return an array ref containing a "result hash" (described below) for each phrase found, in the order they appear in the original text.

get_phrases_in_text()

The first argument is the text you want to parse for phrases.

The second, optional argument is a hashref of options. Its keys can be as follows:

'regexp_conf'

This should be an array reference. Each item in it should be an array reference with at least the following 2 items:

First

A regex object (i.e. qr()) that matches the beginning of the thing you are looking for.

The regex should simply match and remain simple as it gets used by the parser where and as needed. Do not anchor or capture in it!

    qr/\<cptext/

Second

A regex object (i.e. qr()) that matches the end of the thing you are looking for.

It can also be a coderef that gets passed the string matched by item 1 and returns the appropriate regex object (i.e. qr()) that matches the end of the thing you are looking for.

The regex should simply match and remain simple as it gets used by the parser where and as needed. Do not anchor or capture in it! If it is possible that there is space before the closing "whatever" you should include that too.

    qr/\s*\>/

Third (Optional)

A hashref to configure this particular token’s behavior.

Keys are:

'optional'

Default is false. When set to true, tokens not followed by a string are not included in the results (e.g. no_arg).

    blah("I am howdy", [ …], {…}); # 'I am howdy'
    blah([…],{…}); # usually included in the results w/ a type of 'perlish' but under optional => 1 it will not be included in the results

'arg_position'

Default is not to use it but conceptually it is '1' as in “first”.

After the token match, the next thing (per Text::Balanced) is typically the phrase. If that is not the case w/ a given token you can use arg_position to specify what position it takes in a list of arguments after the token as found by Text::Balanced.

For example:

    mythingy('Merp', 'I am the phrase we want to parse.', 'foo')

The list is 3 things: 'Merp', 'I am the phrase we want to parse.', and 'foo', positioned at 1, 2, and 3 respectively.

In that case you want to specify arg_position => 2 in order to find 'I am the phrase we want to parse.' instead of 'Merp'.
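
As a rough sketch of arg_position in use, the snippet below registers a token for the mythingy() call above and asks for its second argument. The opening and closing regexes, and the decision to set 'no_default_regex', are illustrative assumptions rather than anything this document prescribes.

```perl
use strict;
use warnings;
use Text::Extract::MaketextCallPhrases;

my $text = q{mythingy('Merp', 'I am the phrase we want to parse.', 'foo');};

my $results = get_phrases_in_text(
    $text,
    {
        'no_default_regex' => 1,    # only look for mythingy(), not maketext()
        'regexp_conf'      => [
            # start regex, end regex, per-token options (assumed regexes)
            [ qr/mythingy\(/, qr/\s*\)/, { 'arg_position' => 2 } ],
        ],
    }
);

for my $rh ( @{$results} ) {
    next if !defined $rh->{'phrase'};
    print "Found phrase: $rh->{'phrase'}\n";
}
```

Without arg_position the first argument after the token would be treated as the phrase.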

Example:

    'regexp_conf' => [
        [ qr/greetings\+programs \|/, qr/\s*\|/ ],
        [ qr/\_\(?/, sub { return substr( $_[0], -1, 1 ) eq '(' ? qr/\s*\)/ : qr/\s*\;/ } ],
    ],
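
The coderef form of the second item can be exercised on its own to see which closing regex it picks. This is a small sketch using the same coderef as the example above; the variable names are hypothetical and the module itself is not involved.

```perl
use strict;
use warnings;

# Same coderef as in the example above: choose the closing regex from the
# last character of whatever the opening regex matched.
my $end_for = sub {
    return substr( $_[0], -1, 1 ) eq '(' ? qr/\s*\)/ : qr/\s*\;/;
};

my $paren_end = $end_for->('_(');    # call written as _( ... )
my $plain_end = $end_for->('_');     # call written as _ ... ;

print "paren form closes on ')'\n" if ' )' =~ $paren_end;
print "plain form closes on ';'\n" if ' ;' =~ $plain_end;
```

Both closing regexes allow optional whitespace before the closing character, as recommended for the second item.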

    'regexp_conf' => [
        [ qr/greetings\+programs \|/, qr/\s*\|/ ],
        [ qr/\_\(?/, sub { return substr( $_[0], -1, 1 ) eq '(' ? qr/\s*\)/ : qr/\s*\;/ }, { 'optional' => 1 } ],
    ],

'no_default_regex'

If you are using 'regexp_conf' then setting this to true will avoid using the default maketext() lookup. (i.e. only use 'regexp_conf')

'cpanel_mode'

Boolean (default is false). When true, it enables cPanel-specific checks (e.g. cptext call syntax).

'encode_unicode_slash_x'

Boolean (default is false) that when true will turn Unicode string notation \x{....} into a non-grapheme byte string. This will cause Encode to be loaded if needed.

Otherwise \x{....} are left in the phrase as-is.

'debug_ignored_matches'

This is an array that gets aggregate debug info on matches that did not look like something that should have a phrase associated with it.

Some examples of things that might match but would not have a phrase associated with them:

    sub i_heart_maketext { 1 }

    *i_heart_maketext = "foo";

    goto &xyz::maketext;

    print $locale->Maketext("Hello World"); # maketext() is cool

'ignore_perlish_statement'

Boolean (default is false) that when true will cause matches that look like a statement to be put in 'debug_ignored_matches' instead of a result with a 'type' of 'no_arg'.

'ignore_perlish_comment'

Boolean (default is false) that when true will cause matches that look like a perl comment to be put in 'debug_ignored_matches' instead of a result.

Since this is parsing arbitrary text and thus there is no real context, interpreting what is a comment or not becomes very complex and context sensitive.

If you do not want to grab phrases from commented out data and this check does not work with this text's commenting scheme then you could instead strip comments out of the text before parsing.

get_phrases_in_file()

Same as get_phrases_in_text() except it takes a path whose contents you want to process instead of text you want to process.

If the file can't be opened, it returns false:

    my $results = get_phrases_in_file($file) || die "Could not read '$file': $!";
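
When aggregating over several files you may prefer to skip unreadable files instead of dying. A minimal sketch, assuming the file list comes from the command line; collect_phrases() is a hypothetical helper, not part of this module.

```perl
use strict;
use warnings;
use Text::Extract::MaketextCallPhrases;

# Hypothetical helper: gather results from several files, skipping unreadable ones.
sub collect_phrases {
    my (@files) = @_;
    my @all;
    for my $file (@files) {
        my $results = get_phrases_in_file($file);
        if ( !$results ) {
            warn "Could not read '$file': $!\n";
            next;
        }
        push @all, @{$results};
    }
    return \@all;
}

# 'file' and 'line' are only set by get_phrases_in_file().
for my $rh ( @{ collect_phrases(@ARGV) } ) {
    next if !defined $rh->{'phrase'};
    printf "%s:%s: %s\n", $rh->{'file'}, $rh->{'line'}, $rh->{'phrase'};
}
```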

The "result hash"

This hash contains the following keys that describe the phrase that was parsed.

'phrase'

The phrase in question.

'offset'

The offset in the text where the phrase started.

'line'

Available via get_phrases_in_file() only, not get_phrases_in_text().

The line number the offset applies to. If a phrase spans more than one line it should be the line it starts on - but you're too smart to let the phrase dictate output format right ;p?

'file'

Available via get_phrases_in_file() only, not get_phrases_in_text().

The file the result is from. Useful when aggregating results from multiple files.

'original_text'

This is 'phrase' before any final normalizations happen.

You should be able to match the result's exact instance of the phrase if you find qr/\Q$rh->{'original_text'}\E/ right around $rh->{'file'} -> $rh->{'line'} -> $rh->{'offset'}.
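
One way to do that lookup is a literal search that starts a little before the reported offset. The sketch below is only an illustration: locate_original_text() and its slack parameter are hypothetical, and the result hash is hand-built rather than produced by the module.

```perl
use strict;
use warnings;

# Hypothetical helper: find where 'original_text' sits near the reported offset.
# $slack is how many characters either side of 'offset' we are willing to look.
sub locate_original_text {
    my ( $text, $rh, $slack ) = @_;
    my $start = $rh->{'offset'} - $slack;
    $start = 0 if $start < 0;

    # index() does a literal search, like matching qr/\Q...\E/.
    my $pos = index( $text, $rh->{'original_text'}, $start );
    return if $pos < 0 || $pos > $rh->{'offset'} + $slack;
    return $pos;
}

# Hand-built stand-in for a real result hash, for illustration only.
my $text   = q{print $lh->maketext('Hello World');};
my %result = ( 'original_text' => 'Hello World', 'offset' => 20 );

my $pos = locate_original_text( $text, \%result, 5 );
print defined $pos ? "original_text found at $pos\n" : "not found nearby\n";
```

For results from get_phrases_in_file() you would read the contents of $rh->{'file'} into $text first.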

'matched'

Chunk that matched the "maketext call" regex.

'regexp'

The array reference used to match this call/phrase. It is the same thing as each array ref passed in the regexp_conf list.

'quotetype'

If the match was in double quote context it will be 'double'. Specials like \t and \n are interpolated.

If the match was in single quote context it will be 'single'. Specials like \t and \n remain literal.

Otherwise it won't exist.

'quote_before' and 'quote_after'

If 'quotetype' is set these will be set also. They are the quote strings before and after the phrase. For example, with 'foo' they'd both be '. For q{foo} they'd be q{ and } respectively.

If 'heredoc' is set then keep the following caveat in mind: Due to how Text::Balanced has to handle here docs 'quote_before' will not contain anything after '<<TOKEN'. i.e. it is not exactly the string that was before it in the source code.

'heredoc'

If the match was a here-doc, it will contain the opening token/the left delimiter, including any quotes.

'is_warning'

The phrase we found wasn't a string, which is odd.

'is_error'

The phrase we found looks like a mistake was made.

'type'

If the phrase is a warning or error this is a keyword that highlights why the parser wants you to look at it further.

The value can be:

undef/non-existent

Was a normal string, all is well.

'command'

The phrase was a backtick or qx() expression.

'pattern'

The phrase was a regex or transliteration expression.

'empty'

The phrase was a hardcoded empty value.

'bareword'

The phrase was a bare word expression.

'perlish'

The phrase was a perl-like expression (e.g. a variable).

'no_arg'

The call had no arguments

'multiline'

The call’s argument did not contain a full entity. Probably due to a multiline phrase that is cut off at the end of the text being parsed.

This should only happen in the last item and means that some data needs to be prepended to the next chunk you will be parsing in an effort to get a complete, parsable argument.

    my $string_1 = "maketext('I am the very model ";
    my $string_2 = "of a modern major general.')";

    my $results = get_phrases_in_text($string_1);

    # Guard against an empty result list and an undefined 'type' before comparing.
    if ( @{$results} && ( $results->[-1]->{'type'} || '' ) eq 'multiline' ) {
        my $trailing_partial = pop @{$results};
        # Rebuild the cut-off call and prepend it to the next chunk.
        $string_2 = $trailing_partial->{'matched'} . substr( $string_1, $trailing_partial->{'offset'} ) . $string_2;
    }
    push @{$results}, @{ get_phrases_in_text($string_2) };
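
The 'is_warning', 'is_error' and 'type' keys lend themselves to a small lint pass over the results. A minimal sketch follows; lint_label() is a hypothetical helper and the result hashes are hand-built stand-ins, so the snippet runs without the module.

```perl
use strict;
use warnings;

# Hypothetical helper: summarise one result hash using only documented keys.
sub lint_label {
    my ($rh) = @_;
    my $type = defined $rh->{'type'} ? $rh->{'type'} : 'normal';
    return "error ($type)"   if $rh->{'is_error'};
    return "warning ($type)" if $rh->{'is_warning'};
    return 'ok';
}

# Hand-built stand-ins for real results, for illustration only.
my @results = (
    { 'phrase' => 'Hello World' },
    { 'phrase' => '$name', 'is_warning' => 1, 'type' => 'perlish' },
);

print lint_label($_), "\n" for @results;
```

Checking the flags first and using 'type' only as the explanation avoids hard-coding which types count as warnings and which as errors.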

“## no extract maketext” notation

If you have a token in the text being parsed that is not actually a maketext call (or is a maketext call that you want to ignore for some reason) you can mark it as such (i.e. so that it is not included in the results) by putting the string “## no extract maketext” after said token on the same line.

    print $lh->maketext('I am a localized string!');
    print $lh->maketext('I am not to be parsed for various undisclosed business reasons.'); ## no extract maketext
    print $lh->maketext( ## no extract maketext
        'I am not to be parsed for various undisclosed business reasons.'
    …
    # parse API format string:
    if ($str =~ m/maketext\(/) { ## no extract maketext
    …
    # mock maketext for testing:
    sub maketext { ## no extract maketext

Even if you are not parsing perl code you can use it (i.e. the #s are part of the notation and happen to work as comments in perl).

    $('#msg').text( LOCALE.maketext('I am a localized string!') );
    $('#msg').text( LOCALE.maketext('I am not to be parsed for various undisclosed business reasons.') ); // ## no extract maketext
    $('#msg').text( LOCALE.maketext( // ## no extract maketext
        'I am not to be parsed for various undisclosed business reasons.'
    …
    // parse API format string:
    if (str.match(/maketext\(/)) { // ## no extract maketext
    …
    // mock maketext for testing:
    function maketext(…) { // ## no extract maketext

Any token-looking things after the notation on that line are also ignored.

    maketext('I am ignored.') ## no extract maketext: could also be maketext('I am also ignored')

DIAGNOSTICS

This module throws no warnings or errors of its own.

CONFIGURATION AND ENVIRONMENT

Text::Extract::MaketextCallPhrases requires no configuration files or environment variables.

DEPENDENCIES

Text::Balanced

String::Unquotemeta

Module::Want

INCOMPATIBILITIES

None reported.

CAVEATS

If the first thing following the "call" is a comment, the phrase will not be found.

This is because these are maketext-looking calls, not necessarily perl code. Thus interpreting what is a comment or not becomes very complex and context sensitive.

See "SEE ALSO" if you really need to support that convention (said convention seems rather silly but hey, it's your code).

The result hash's values for that call are unknown (probably 'multiline' type and undef phrase). If that holds true then detecting one in the middle of your results stack is a sign of that condition.

BUGS AND LIMITATIONS

No bugs have been reported.

Please report any bugs or feature requests to bug-text-extract-maketextcallphrases@rt.cpan.org, or through the web interface at http://rt.cpan.org.

SEE ALSO

Locale::Maketext::Extract is a driver-based OO parser with a more complex and extensible interface that may serve your needs better.

AUTHOR

Daniel Muey <http://drmuey.com/cpan_contact.pl>

LICENCE AND COPYRIGHT

Copyright (c) 2011, Daniel Muey <http://drmuey.com/cpan_contact.pl>. All rights reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

DISCLAIMER OF WARRANTY

BECAUSE THIS SOFTWARE IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE SOFTWARE, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE SOFTWARE "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE SOFTWARE IS WITH YOU. SHOULD THE SOFTWARE PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR, OR CORRECTION.

IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE SOFTWARE AS PERMITTED BY THE ABOVE LICENCE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE SOFTWARE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE SOFTWARE TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.