NAME

Unicode::Wrap - Unicode Line Breaking

SYNOPSIS

  use Unicode::Wrap;

  $wrapper = Unicode::Wrap->new( line_length => 75 );
  @lines = $wrapper->break_lines($long_string);

  use Unicode::Wrap qw/ text_properties lb_class class_properties find_breaks /;

  @break_classes = map { lb_class $_ } split //, $long_string;
  @break_properties = class_properties(@break_classes);
  @break_properties = text_properties($long_string);
  @best_breaks = find_breaks($long_string);

ABSTRACT

This module implements UAX#14: Line Breaking Properties. It walks through a text string, classifies each character, and computes a length for each one. When a line grows past the configured line length, it is broken at a permitted position.

DESCRIPTION

All of the functions described here can be called procedurally or as an object method.

new(parameters)

This constructs a new wrapping object. Parameters:

line_length

Specifies the length of a line (in whatever units you want to use).

emergency_break

If set, and there are no breaking opportunities before the line_length is reached, an 'emergency' break will be inserted at this position. Generally this should be set to line_length (or 1, since it won't be used until line_length is reached anyway).

If emergency_break is not set, no emergency breaks will be inserted, which could result in some really long lines if no line-breaking opportunity exists.

break_lines($text)

This will break $text up into individual lines. Newlines are preserved but none will be added. Use this if you need an implementation of UAX#14 that just breaks lines up without re-assembling them into a text string.

LOW-LEVEL FUNCTIONS

If you need finer control over your own line-breaking, there are a few other functions that can be used to obtain character classifications and breaking properties for a set of characters.

Feel free to override some of these functions in descendant classes to fine-tune the behavior of this module. Some classifications and breaking properties require language-specific input, and presently subclassing is the only way to provide it.
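As a rough illustration of that kind of override, the sketch below tailors lb_class for English text by treating curly double quotes as opening and closing punctuation. My::BaseWrap is a hypothetical stand-in with a toy classifier, included only so the example runs without the module installed; My::EnglishWrap is likewise an invented name. With the real module the subclass would presumably inherit from Unicode::Wrap instead, and whether the module's other functions call lb_class through the object is an assumption here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the base class. Its lb_class is a toy,
# not the module's real classifier.
package My::BaseWrap;

sub new {
    my ($class, %args) = @_;
    return bless {%args}, $class;
}

sub lb_class {
    my ($self, $char) = @_;
    return 'SP' if $char eq ' ';
    return 'QU' if $char =~ /[\x{201C}\x{201D}"]/;   # ambiguous quotation
    return 'AL';
}

# Subclass carrying the language-specific cue: in English, the left curly
# quote opens and the right curly quote closes.
package My::EnglishWrap;
our @ISA = ('My::BaseWrap');   # assumption: ('Unicode::Wrap') with the real module

sub lb_class {
    my ($self, $char) = @_;
    return 'OP' if $char eq "\x{201C}";    # left double quotation mark
    return 'CL' if $char eq "\x{201D}";    # right double quotation mark
    return $self->SUPER::lb_class($char);  # defer everything else
}

package main;

my $wrapper = My::EnglishWrap->new( line_length => 75 );
my @classes = map { $wrapper->lb_class($_) } split //, "\x{201C}Hi\x{201D} x";
print "@classes\n";
```

Straight quotes are left as QU in this sketch, since they carry no direction on their own.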

lb_class($character)

Returns the Line Breaking classification of the character passed.

  print lb_class("a");          # AL
  print $self->lb_class("5");   # NU

class_properties(@character_classes)

Accepts a list of character classes (e.g. 'AL' or 'NU') and returns an identically sized array of breaking properties. Each property describes the location immediately following the character at that index; no break is permitted at the start of a line. The value of each property is a number from 0 to 3 (with constants defined in the Unicode::Wrap namespace):

  0  FORBIDDEN  No break is permitted after this position
  1   INDIRECT  A break is permitted after this position (because of a space)
  2     DIRECT  A break is permitted after this position (unconditionally)
  3   REQUIRED  A break is required after this position

For most purposes the values INDIRECT and DIRECT can be treated the same way. The subtle difference is that an indirect break is allowed only because there is a space at that position, whereas a direct break opportunity allows a break under any circumstances. At this stage you do not need to distinguish between the two.

Required breaks occur primarily after newlines.
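To make the indexing convention concrete, here is a small sketch that pairs each character with the name of the property for the position just after it. The constants are local copies of the four values in the table, describe_properties is a hypothetical helper, and the property list is written by hand for illustration rather than captured from the module.

```perl
use strict;
use warnings;

# Local copies of the four values in the table above, redefined here only
# so the sketch runs without the module installed.
use constant { FORBIDDEN => 0, INDIRECT => 1, DIRECT => 2, REQUIRED => 3 };

my @NAMES = qw( FORBIDDEN INDIRECT DIRECT REQUIRED );

# Pair each character with the name of the property for the position
# immediately after it. Returns a list of [character, name] pairs.
sub describe_properties {
    my ($text, @props) = @_;
    my @chars = split //, $text;
    return map { [ $chars[$_], $NAMES[ $props[$_] ] ] } 0 .. $#chars;
}

# Hand-written properties for "a b\n": an assumption for illustration,
# not output captured from the module.
my @props = (FORBIDDEN, INDIRECT, FORBIDDEN, REQUIRED);

for my $pair (describe_properties("a b\n", @props)) {
    my ($char, $name) = @$pair;
    $char = '\n' if $char eq "\n";    # show the newline visibly
    print "after '$char': $name\n";
}
```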

text_properties($text)

This behaves like class_properties, but instead of working with a list of pre-determined classes, it classifies your $text. It will return a list (one element for each character) representing where breaks can and cannot occur.

This might be the most useful function for someone wanting to build a more intelligent line-wrapping algorithm on top of this. You could scan through the returned list of break opportunities and figure out how you want to do your own wrapping.
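One way such a scan might look is a simple greedy wrapper over a property list, sketched below. It is not the module's own algorithm. toy_properties is a hypothetical stand-in for text_properties that only knows about spaces and newlines, greedy_wrap is an invented helper, and the sketch measures lines in characters and counts trailing spaces toward the width.

```perl
use strict;
use warnings;

use constant { FORBIDDEN => 0, INDIRECT => 1, DIRECT => 2, REQUIRED => 3 };

# Toy stand-in for text_properties(): a break is required after a newline,
# permitted after a space, and forbidden everywhere else.
sub toy_properties {
    my ($text) = @_;
    return map { $_ eq "\n" ? REQUIRED : $_ eq ' ' ? INDIRECT : FORBIDDEN }
        split //, $text;
}

# Greedy wrap: remember the last permitted break on the current line and
# cut there as soon as the line grows past $width characters.
sub greedy_wrap {
    my ($text, $width, @props) = @_;
    my @lines;
    my ($start, $last_ok) = (0, -1);
    for my $i (0 .. length($text) - 1) {
        if ($props[$i] == REQUIRED) {
            # Mandatory break: the line ends right after this character.
            push @lines, substr($text, $start, $i - $start + 1);
            ($start, $last_ok) = ($i + 1, -1);
            next;
        }
        if ($i - $start + 1 > $width && $last_ok >= $start) {
            # Too long: cut after the last permitted position seen.
            push @lines, substr($text, $start, $last_ok - $start + 1);
            ($start, $last_ok) = ($last_ok + 1, -1);
        }
        $last_ok = $i if $props[$i] != FORBIDDEN;
    }
    push @lines, substr($text, $start) if $start < length $text;
    return @lines;
}

my $text  = "one two\nthree four five";
my @lines = greedy_wrap($text, 9, toy_properties($text));
print "[$_]\n" for @lines;
```

A line with no permitted break stays overlong until the next opportunity appears, which mirrors the behavior described for an unset emergency_break.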

find_breaks($text)

This is similar to text_properties, but actually attempts to apply line lengths to find the best breaks for each line. It will return a list of indexes to the start of each new line (minus the first). Use break_lines to go the rest of the way and actually break the string up into lines.
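To show what a list of line-start indexes means in practice, the sketch below cuts a string at such indexes by hand. split_at is a hypothetical helper, the indexes are assumed to be 0-based character offsets, and the values used are hand-picked rather than output from find_breaks.

```perl
use strict;
use warnings;

# Cut $text at each index in @starts, where every index is assumed to be
# the 0-based character offset at which a new line begins.
sub split_at {
    my ($text, @starts) = @_;
    my @lines;
    my $prev = 0;
    for my $start (@starts, length $text) {
        next if $start <= $prev;    # skip empty or out-of-order segments
        push @lines, substr($text, $prev, $start - $prev);
        $prev = $start;
    }
    return @lines;
}

# Hand-picked indexes for illustration, not output captured from the module.
my @lines = split_at("The quick brown fox", 4, 10);
print "[$_]\n" for @lines;
```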

BUGS

This module can be slow. It's a pure-Perl implementation that goes through an expensive classification process per character.

Some language-specific processing is needed in some areas to better classify characters or to identify where breaking opportunities exist. A notable example arises around quotation marks. UAX#14 forbids breaks before and after quotation marks, because deciding whether a break is acceptable there requires cues from the language. There is no extensible facility to add these cues aside from subclassing.

SEE ALSO

http://www.unicode.org/reports/tr14/

Unicode Standard Annex #14: Line Breaking Properties

Text::Wrap, perlunicode

AUTHOR

David NESTING <david@fastolfe.net>

Copyright (c) 2003 David Nesting. All Rights Reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.