NAME

Lingua::Awkwords - randomly generates outputs from a given pattern

SYNOPSIS

  use feature qw(say);
  use Lingua::Awkwords;
  use Lingua::Awkwords::Subpattern;

  # V is a pre-defined subpattern, ^ filters out aa from the list
  # of two vowels that the two VV generate
  my $la = Lingua::Awkwords->new( pattern => q{ [VV]^aa } );

  say $la->render for 1..10;

  # define our own C, V
  Lingua::Awkwords::Subpattern->set_patterns(
      C => [qw/j k l m n p s t w/],
      V => [qw/a e i o u/],
  );
  # and a pattern somewhat suitable for Toki Pona...
  $la->pattern(q{
      [a/*2]
      (CV*5)^ji^ti^wo^wu
      (CV*2)^ji^ti^wo^wu
      [CV/*2]^ji^ti^wo^wu
      [n/*5]
  });

  say $la->render for 1..10;

DESCRIPTION

This is a Perl implementation of

http://akana.conlang.org/tools/awkwords/

though it is not an exact replica of that parser;

http://akana.conlang.org/tools/awkwords/help.html

details the format that this code is based on. Briefly,

SYNTAX

[] or ()

Denote a unit or group; they are identical except that (a) is equivalent to [a/]--that is, it represents the possibility of generating the empty string in addition to any other terms supplied.

Units can be nested recursively. There is an implicit unit at the top level of the pattern.

/

Introduces a choice within a unit; without this [Vx] would generate whatever V represents (a list of vowels by default) followed by the letter x while [V/x] by contrast generates only a vowel or the letter x.

*

The asterisk followed by an integer in the range 1..INT_MAX weights the current term of the alternation, if any. That is, while [a/] generates each term with equal probability, [a/*2] would generate the empty string at twice the probability of the letter a.
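
As a rough illustration of how weights bias an alternation, the following sketch picks a term with probability proportional to its weight. It does not use this module's internals; weighted_pick is a hypothetical helper written only for this example, and the random roll is passed in so that the result can be checked.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lingua::Awkwords: pick one term from a
# list of [ term, weight ] pairs with probability proportional to weight.
# $roll is a number in [0,1); pass rand() for a random pick.
sub weighted_pick {
    my ( $choices, $roll ) = @_;
    my $total = 0;
    $total += $_->[1] for @$choices;
    my $target = $roll * $total;
    for my $choice (@$choices) {
        $target -= $choice->[1];
        return $choice->[0] if $target < 0;
    }
    return $choices->[-1][0];    # guard against rounding at the upper edge
}

# [a/*2]: the letter a with weight 1, the empty string with weight 2
my @terms = ( [ 'a', 1 ], [ '', 2 ] );
my %seen;
$seen{ weighted_pick( \@terms, rand() ) }++ for 1 .. 3000;
printf "a: %d, empty: %d\n", $seen{'a'} // 0, $seen{''} // 0;
```

Over many picks the empty string should turn up about twice as often as the letter a.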

^

The caret introduces a filter that must follow a unit (there is an implicit unit at the top level of a pattern). An example would be [VV]^aa or the equivalent VV^aa that (by default) generates two vowels, but replaces aa with the empty string. More than one filter may be specified.
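
The effect of a filter can be pictured as a literal substitution on the generated text. The sketch below is only a model of that idea, assuming every occurrence of each filter is replaced; apply_filters_sketch is a hypothetical name and is not how the module itself is written.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: replace each literal
# occurrence of every filter in $text with $replacement (default: '').
sub apply_filters_sketch {
    my ( $text, $filters, $replacement ) = @_;
    $replacement = '' unless defined $replacement;
    for my $filter (@$filters) {
        $text =~ s/\Q$filter\E/$replacement/g;
    }
    return $text;
}

print apply_filters_sketch( 'ai', ['aa'] ), "\n";         # ai
print apply_filters_sketch( 'aa', ['aa'] ), "\n";         # (empty line)
print apply_filters_sketch( 'aa', ['aa'], 'X' ), "\n";    # X
```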

A-Z

Capital ASCII letters denote subpatterns; several of these are set by default. See Lingua::Awkwords::Subpattern for how to customize them. V for example is by default equivalent to the more verbose [a/i/u].

"

Use double quotes to denote a quoted string; this prevents other characters (besides " itself) from being interpreted as some non-string value.

anything-else

Anything else not otherwise accounted for above is treated as part of a string, so ["abc"/abc] generates either the string abc or the string abc, as these are two ways of saying the same thing.

ATTRIBUTES

pattern

Awkword pattern. Without this supplied any call to render will throw an exception.

tree

Where the parse tree is stored.

FUNCTIONS

These can be called as Lingua::Awkwords::set_filter or can be imported via

  use Lingua::Awkwords qw(weights2str weights_from);

percentize hashref

Modifies the values of the given hashref to be percentages of the sum of the values. Will croak if the sum is 0. Use this to help compare weights_from results for different corpora.
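
The arithmetic described here can be sketched in plain Perl. This is an illustration of the documented behavior, not the module's code; percentize_sketch is a hypothetical name, and scaling to 100 (rather than to 1) is an assumption.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Hypothetical stand-in for the documented behavior: rewrite each value
# in place as a percentage of the sum of all values.
sub percentize_sketch {
    my ($counts) = @_;
    my $sum = 0;
    $sum += $_ for values %$counts;
    croak "sum of values is 0" if $sum == 0;
    # values() yields aliases, so assigning to $_ updates the hash
    $_ = $_ / $sum * 100 for values %$counts;
    return;
}

my %small = ( a => 1, i => 3 );
percentize_sketch( \%small );
printf "%s => %.1f\n", $_, $small{$_} for sort keys %small;
```

With the counts above this prints 25.0 for a and 75.0 for i, which makes tallies from corpora of different sizes comparable.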

set_filter filter-value

Utility routine for use with walk. Returns a subroutine that sets the filter_with attribute to the given value.

  $la->walk( Lingua::Awkwords::set_filter('X') );

weights2str hash-reference

Constructs an awkwords choice string from a given hash-reference of values and weights, e.g.

  use Lingua::Awkwords qw(weights2str weights_from);

  weights2str( ( weights_from("toki sin li toki pona") )[-1] )

will return a weight string of

  a*1/i*4/k*2/l*1/n*2/o*3/p*1/s*1/t*2

that can then be used as a pattern for this module.
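
To make the shape of that string concrete, here is a plain Perl sketch that reproduces the example output from a hash of counts. weights2str_sketch is a hypothetical name; the sorted key order is inferred from the example above, and the module's own handling of edge cases may differ.

```perl
use strict;
use warnings;

# Hypothetical re-creation of the output format shown above: each key is
# followed by * and its weight, and the terms are joined with /.
sub weights2str_sketch {
    my ($weights) = @_;
    return join '/', map { $_ . '*' . $weights->{$_} } sort keys %$weights;
}

my %all = (
    a => 1, i => 4, k => 2, l => 1, n => 2,
    o => 3, p => 1, s => 1, t => 2,
);
print weights2str_sketch( \%all ), "\n";
```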

weights_from string-or-filehandle

Counts the characters appearing in the input string or filehandle and returns four hash references: first, mid, last, and all. These hold the counts of the characters that begin the "words" of the input, that appear in the middle, that end them, and a tally of all three of these positions together.
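
The position-based tally can be sketched in plain Perl for string input. This is a model only: weights_from_sketch is a hypothetical name, words are assumed to be runs of non-whitespace, and a one-letter word is assumed to count as a first letter only; the module may decide these cases differently.

```perl
use strict;
use warnings;

# Hypothetical sketch for string input only, under the assumptions
# stated in the text above.
sub weights_from_sketch {
    my ($text) = @_;
    my ( %first, %mid, %last, %all );
    for my $word ( split ' ', $text ) {
        my @chars = split //, $word;
        $all{$_}++ for @chars;
        $first{ shift @chars }++;
        $last{ pop @chars }++ if @chars;
        $mid{$_}++ for @chars;    # whatever remains is the middle
    }
    return \%first, \%mid, \%last, \%all;
}

my ( $first, $mid, $last, $all ) =
    weights_from_sketch("toki sin li toki pona");
print join( ' ', map { $_ . '=' . $first->{$_} } sort keys %$first ), "\n";
```

For the sample text the first-letter tally printed is l=1 p=1 s=1 t=2.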

"words" is used in scare quotes because there is "no generally accepted and completely satisfactory definition of what constitutes a word" (Philip Durkin. "The Oxford Guide to Etymology". p.37) and because instead syllables could be fed in and then patterns generated using those syllable-specific weights.

METHODS

new

Constructor. Typically this should be passed a pattern argument.

parse_string pattern

Returns the parse tree of the given pattern without setting the tree attribute. "COMPLICATIONS" shows one use for this.

render

Returns a string render of the awkword pattern. This may be the empty string if filters have removed all the text.

walk callback

Provides a means to recurse through the parse tree, where every object in the tree will call the callback with $self as the sole argument, and then if necessary iterate through all of the possibilities contained by itself calling walk on each of those.

COMPLICATIONS

More complicated structures can be built by attaching parse trees to subpatterns. For example, Toki Pona could be extended to allow optional diphthongs (mostly in the second syllable) via

  use feature qw(say);
  use Lingua::Awkwords::Subpattern;
  use Lingua::Awkwords;
  
  my $cv  = Lingua::Awkwords->parse_string(q{
      CV^ji^ti^wo^wu
  }); 
  my $cvv = Lingua::Awkwords->parse_string(q{
      CVV^ji^ti^wo^wu^aa^ee^ii^oo^uu
  });

  Lingua::Awkwords::Subpattern->set_patterns(
      A => $cv,
      B => $cvv,
      C => [qw/j k l m n p s t w/],
      V => [qw/a e i o u/],
  );

  my $tree = Lingua::Awkwords->new( pattern => q{
      [ a[B/BA/BAA/A/AA/AAA] / [AB/ABA/ABAA/A/AA/AAA] ] [n/*5]
  });

  say join ' ', map { $tree->render } 1 .. 10;

The default filter replacement of the empty string can be problematic: one may not know whether a filter has been applied to the result, or the word may be filtered into an incorrect form. Consult the eg/ directory of this module's distribution for example code that customizes the filter value.

Code that makes use of non-ASCII encodings may need appropriate settings made, e.g. to use the locale for input and output and to allow UTF-8 in the program text.

  use open IO  => ':locale';
  use utf8;

  Lingua::Awkwords::Subpattern->set_patterns(
      S => [qw/... UTF-8 data here .../],
  );

BUGS

Reporting Bugs

Please report any bugs or feature requests to bug-lingua-awkwords at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Lingua-Awkwords.

Patches might best be applied towards:

https://github.com/thrig/Lingua-Awkwords

Known Issues

There are various incompatibilities with the original version of the code; these are detailed in the parser module as they concern how e.g. weights are parsed.

See also the "Known Issues" section in all the other modules in this distribution.

SEE ALSO

Lingua::Awkwords::ListOf, Lingua::Awkwords::OneOf, Lingua::Awkwords::Parser, Lingua::Awkwords::String, Lingua::Awkwords::Subpattern

AUTHOR

thrig - Jeremy Mates (cpan:JMATES) <jmates at cpan.org>

COPYRIGHT AND LICENSE

Copyright (C) 2017 by Jeremy Mates

This program is distributed under the (Revised) BSD License: http://www.opensource.org/licenses/BSD-3-Clause