NAME

String::Markov - A Moose-based, text-oriented Markov Chain module

VERSION

version 0.005

SYNOPSIS

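  # Character-level chain with the default settings, trained on files
  my $mc = String::Markov->new();

  $mc->add_files(@ARGV);

  print $mc->generate_sample . "\n" for (1..20);

  # Or, as a separate example: a first-order, word-level chain
  my $mc = String::Markov->new(order => 1, sep => ' ');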

  for my $stanza (@The_Rime_of_the_Ancient_Mariner) {
        $mc->add_sample($stanza);
  }
  
  print $mc->generate_sample;

DESCRIPTION

String::Markov is a Moose-based Markov Chain module, designed to easily consume and produce text.

ATTRIBUTES

order

The order of the chain, i.e. how much past state is used to determine the next state. The default of 2 is reasonable for constructing new names/words when splitting into characters, or for long-ish works when splitting into words.

split_sep

How states are split. This value (or sep; see "new()") is passed directly as the first argument of "split" in perlfunc, so using ' ' has special semantics. Regular expressions will work as well, but be aware that any matched characters are discarded.
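
The difference between ' ' and a regular expression is easiest to see with plain split calls. The following is a small illustration using only core Perl; the sample strings are made up for the example and nothing here calls String::Markov itself.

```perl
use strict;
use warnings;

my $line = "  the quick  brown fox ";

# The single-space string triggers awk-style splitting: leading
# whitespace is skipped and runs of whitespace act as one separator.
my @awk = split ' ', $line;

# A regex matching one space keeps the empty leading fields and the
# empty field between adjacent spaces (trailing empty fields are dropped).
my @regex = split / /, $line;

# The empty string splits into individual characters.
my @chars = split '', 'fox';

# Characters matched by the separator are discarded...
my @no_punct = split /[,;]\s*/, 'a, b;c';

# ...unless the regex captures them, in which case they are returned
# as fields of their own.
my @with_punct = split /([,;])\s*/, 'a, b;c';
```

If punctuation should survive as states of its own, a capturing group in split_sep is one way to keep it.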

join_sep

How to re-join states. This value (or sep; see "new()") is passed directly as the first argument of "join" in perlfunc. In addition, it is used to build keys for internal hashes. This can cause problems in cases where split_sep() produces sequences like 'ae', 'io', 'a', 'ei', 'o', or 'ae', 'i', 'o', which will all turn into 'aeio' with the default of ''. If join_sep is '*' instead, then three unique keys result: 'ae*io', 'a*ei*o', and 'ae*i*o'. See "add_sample()".
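
The collision described above can be reproduced with join alone. This is a minimal sketch in core Perl; distinct_keys is a hypothetical helper written for this illustration, not part of the module.

```perl
use strict;
use warnings;

# The three state sequences from the text above.
my @sequences = (
    [ 'ae', 'io' ],
    [ 'a', 'ei', 'o' ],
    [ 'ae', 'i', 'o' ],
);

# Return the distinct keys produced by joining each sequence with $sep.
sub distinct_keys {
    my ($sep) = @_;
    my %seen;
    $seen{ join($sep, @$_) } = 1 for @sequences;
    return sort keys %seen;
}

my @collapsed = distinct_keys('');    # ('aeio')
my @unique    = distinct_keys('*');   # ('a*ei*o', 'ae*i*o', 'ae*io')
```

Any join_sep that cannot occur inside a state keeps the keys unambiguous.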

null

What is used to track the beginning and end of a sample. The default of "\0" should work for UTF-8 text, but may cause problems with UTF-16 or other encodings.

normalize

Whether to normalize Unicode strings. This value, if true, is passed as the first argument to Unicode::Normalize::normalize. The default 'C' should do what most people expect, but it may be the case that 'D' is what you want. If you're not using Unicode, set this to undef.
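
The choice between 'C' and 'D' matters most when splitting into characters, because it decides whether an accented letter is one state or two. A short illustration using Unicode::Normalize directly (the sample character is chosen for the example):

```perl
use strict;
use warnings;
use Unicode::Normalize qw(normalize);

my $composed   = "\x{E9}";     # U+00E9, precomposed e-acute
my $decomposed = "e\x{301}";   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# Without normalization the two spellings are different strings.
my $raw_equal = $composed eq $decomposed ? 1 : 0;          # 0

# After normalizing to the same form they compare equal.
my $nfc_equal = normalize('C', $composed) eq normalize('C', $decomposed) ? 1 : 0;   # 1

# Form C yields one code point, form D yields two.
my $nfc_len = length normalize('C', $decomposed);          # 1
my $nfd_len = length normalize('D', $composed);            # 2
```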

do_chomp

Whether to chomp lines when reading files (see "chomp" in perlfunc). See "add_files()".

METHODS

new()

  # Defaults
  my $mc = String::Markov->new(
        order     => 2,
        sep       => '',
        split_sep => undef,
        join_sep  => undef,
        null      => "\0",
        normalize => 'C',
        do_chomp  => 1,
  );

The sep argument does not correspond to an attribute, but is used to initialize split_sep and join_sep, whichever of them is left undefined.
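
One way to picture this fallback is the sketch below. It is an assumption about the behaviour described here, not the module's actual constructor code, and resolve_seps is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: fill in split_sep and join_sep from sep
# whenever they are not given explicitly.
sub resolve_seps {
    my (%args) = @_;
    my $sep = $args{sep} // '';
    return (
        split_sep => $args{split_sep} // $sep,
        join_sep  => $args{join_sep}  // $sep,
    );
}

# Both separators come from sep.
my %words = resolve_seps(sep => ' ');

# An explicit split_sep wins; join_sep still falls back to sep.
my %mixed = resolve_seps(sep => ' ', split_sep => qr/\s*,\s*/);
```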

See "ATTRIBUTES".

split_line()

This is the method "add_sample()" calls when it is passed a non-ref argument. It returns an array of states (usually individual characters or words) that are used to build the Markov Chain model.

The default implementation is equivalent to:

  sub split_line {
        my ($self, $sample) = @_;
        $sample = normalize($self->normalize, $sample) if $self->normalize;
        return split($self->split_sep, $sample);
  }

This method can be overridden to deal with unusual data.
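
As an example of what a replacement might do, the sketch below lowercases a sample and drops punctuation before splitting into words. It is written as a standalone function with the hypothetical name split_words so that it runs without the module; in a subclass, the same body could go into an overridden split_line() that also receives $self.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: lowercase the sample, drop punctuation,
# and split on runs of whitespace.
sub split_words {
    my ($sample) = @_;
    $sample = lc $sample;
    $sample =~ s/[^\w\s']//g;   # keep word characters, whitespace, apostrophes
    return split ' ', $sample;
}

my @states = split_words("Water, water, every where, nor any drop to drink.");
# ('water', 'water', 'every', 'where', 'nor', 'any', 'drop', 'to', 'drink')
```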

add_sample()

This method adds samples to build the Markov Chain model. It takes a single argument, which can be either a string or an array reference. If the argument is an array reference, its elements are directly used to update the Markov Chain. If it is a string, add_sample() uses the split_line() method to create an array of states, and then updates the Markov Chain.

Note that this function generates hash keys for the transition matrix. The keys are built according to the order, null, and join_sep attributes, so if an instance is created with:

  my $mc = String::Markov->new(null => '!', order => 2, join_sep => '*');
  $mc->add_sample($_) for (@sample_lines);

Then the internal transition matrix might look like:

  {
    '!*!' => { 'A' => 5, 'B' => 7, ... }, # Initial state
    '!*A' => { ... },
    '!*B' => { ... },
    ...
    'x*y' => { '!' => 4 },                # always end after 'xy'
    'y*z' => { '!' => 3, 'q' => 2 },      # sometimes end after 'yz'
    ...
  }
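
A table of this shape can be built with a few lines of core Perl. The sketch below assumes each sample is padded with order nulls at the start and one null at the end, which is consistent with the keys shown above but is not taken from the module's source; build_table and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Build a transition table from array-ref samples.
sub build_table {
    my ($order, $null, $join_sep, @samples) = @_;
    my %table;
    for my $sample (@samples) {
        # Assumed padding: $order nulls in front, a single null at the end.
        my @states = ( ($null) x $order, @$sample, $null );
        for my $i ($order .. $#states) {
            # The key is the previous $order states, joined with $join_sep.
            my $key = join $join_sep, @states[ $i - $order .. $i - 1 ];
            $table{$key}{ $states[$i] }++;
        }
    }
    return \%table;
}

my $table = build_table(2, '!', '*', [qw(A x y)], [qw(B x y)], [qw(A y z)]);

# $table->{'!*!'} is { A => 2, B => 1 }   (initial state)
# $table->{'x*y'} is { '!' => 2 }         (always end after 'xy')
```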

add_files()

This is a simple convenience method, designed to replace code like:

  while(<>) { chomp; $mc->add_sample($_) }

It takes a list of file names as arguments, and adds them line-by-line.

generate_sample()

This method returns a sequence of states, generated from the Markov Chain using the Monte Carlo method.

If called in scalar context, the states are joined with join_sep before being returned.
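
The sketch below shows one way such a generator can work: a random walk in which each next state is drawn with probability proportional to its count, with wantarray used to pick the return form. It is an illustration under those assumptions, not the module's implementation; weighted_pick, walk_chain, and the tiny table are hypothetical, and an order of at least 1 is assumed. The table is deliberately deterministic so the result is predictable.

```perl
use strict;
use warnings;

# Pick a key of %$counts with probability proportional to its count.
sub weighted_pick {
    my ($counts) = @_;
    my $total = 0;
    $total += $_ for values %$counts;
    my $roll = rand($total);
    for my $state (sort keys %$counts) {
        $roll -= $counts->{$state};
        return $state if $roll < 0;
    }
    return (sort keys %$counts)[-1];   # guard against floating-point edge cases
}

# Walk the table from the all-null start state until a null is drawn.
sub walk_chain {
    my ($table, $order, $null, $join_sep) = @_;
    my @window = ($null) x $order;     # the last $order states seen
    my @out;
    while (1) {
        my $next = weighted_pick( $table->{ join $join_sep, @window } );
        last if $next eq $null;
        push @out, $next;
        shift @window;
        push @window, $next;
    }
    # List context returns the states; scalar context joins them.
    return wantarray ? @out : join($join_sep, @out);
}

# First-order table with exactly one successor per state.
my %table = (
    '!' => { h => 1 },
    'h' => { i => 1 },
    'i' => { '!' => 1 },
);

my @states = walk_chain(\%table, 1, '!', '');   # ('h', 'i')
my $string = walk_chain(\%table, 1, '!', '');   # 'hi'
```

With a table built from real samples, most keys have several successors and repeated calls return different sequences.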

SEE ALSO

Algorithm::MarkovChain

AUTHOR

Grant Mathews <gmathews@cpan.org>

COPYRIGHT AND LICENSE

This software is Copyright (c) 2014 by Grant Mathews.

This is free software, licensed under:

  The Artistic License 2.0 (GPL Compatible)