Text::Lossy - Lossy text compression
Version 0.03
    use Text::Lossy;

    my $lossy = Text::Lossy->new;
    $lossy->add('whitespace');
    my $short = $lossy->process($long);

    my $lossy = Text::Lossy->new->add('lower', 'punctuation');  # Chaining usage

    $lossy->process($long); # In place
    $lossy->process();      # Filters $_ in place
Text::Lossy is a collection of text filters for lossy compression. "Lossy compression" changes the data in a way which is irreversible, but results in a smaller file size after compression. One of the best known lossy compression uses is the JPEG image format.
Note that this module does not perform the actual compression itself; it merely changes the text so that it compresses better.
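As a rough illustration of this idea, the following sketch lowercases a string and collapses its whitespace with plain Perl, then compresses both versions with the core Compress::Zlib module. The helper simplify and the sample text are made up for this example; they are stand-ins for a filter chain, not this module's code.

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress uncompress);

# Hypothetical stand-in for a lossy filter chain: lowercase the text
# and collapse every run of whitespace into a single space.
sub simplify {
    my ($text) = @_;
    $text = lc $text;
    $text =~ s/\s+/ /g;
    return $text;
}

# Made-up sample text with inconsistent case and spacing.
my @variants = (
    "The  Quick\tBrown   Fox\n",
    "THE QUICK BROWN FOX\n",
    "the quick  brown fox\n",
);
my $original = join '', map { $variants[ $_ % 3 ] } 0 .. 29;

# The filtering step is lossy: $original cannot be rebuilt from $filtered.
my $filtered = simplify($original);

# The compression step itself is lossless.
my $packed_original = compress($original);
my $packed_filtered = compress($filtered);

printf "original: %d bytes, compressed: %d bytes\n",
    length($original), length($packed_original);
printf "filtered: %d bytes, compressed: %d bytes\n",
    length($filtered), length($packed_filtered);
```

How much the compressed size shrinks depends on the text and on the compressor, so the printed numbers are only indicative.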
This code is currently alpha software. Anything can and will change, most likely in a backwards-incompatible manner. You have been warned.
Text::Lossy uses an object-oriented interface. You create a new Text::Lossy object, set the filters you wish to use (described below), and call the "process" method on the object. You can call this method as often as you like. In addition, the "as_coderef" method produces a closure, an anonymous subroutine, that acts like the "process" method on the given object.
New filters can be added with the "register_filters" class method. Each filter is a subroutine which takes a single string and returns the filtered string.
Selector methods are not automatically added; this is the responsibility of the code registering the filters, if desired.
my $lossy = Text::Lossy->new();
The constructor for a new lossy text compressor. The constructor is quite light-weight; the only purpose of a compressor object is to accept and remember a sequence of filters to apply to text.
The constructor takes no arguments.
my $new_text = $lossy->process( $old_text );
This method takes a single text string, applies all the selected filters to it, and returns the filtered string. Filters are selected via "add" or equivalently via the selector methods below; see FILTERS.
The text is upgraded to character semantics via a call to utf8::upgrade, see utf8. This will not change the text you passed in, nor should it have too surprising an effect on the output.
$lossy->add( 'lower', 'whitespace' );
This method takes a list of filter names and adds them to the filter list of the filter object, in the order given. This allows a programmatic selection of filters, for example via the command line. Returns the object for method chaining.
If a filter is unknown, an exception is thrown. This may happen when you misspell the name, forget to use a module which registers the filter, or forget to register it yourself.
$lossy->clear();
Remove the filters from the filter object. The object will behave as if newly constructed. Returns the object for method chaining.
my @names = $lossy->list();
List the filters added to this object, in order. The names (not the code) are returned in a list.
my $code = $lossy->as_coderef(); $new_text = $code->( $old_text );
Returns a code reference that closes over the object. This code reference acts like a bound "process" method on the constructed object. It can be used in places like Text::Filter that expect a code reference that filters text.
The code reference is bound to the object, not a particular object state. Adding filters to the object after calling as_coderef will also change the behaviour of the code reference.
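The difference between binding to the object and binding to its state can be sketched with a miniature stand-in class. Tiny::Pipeline and everything in it are hypothetical and written only for this illustration; this is not how Text::Lossy is implemented, only the same closure idea.

```perl
use strict;
use warnings;

# Hypothetical miniature filter object, for illustration only.
package Tiny::Pipeline;

sub new {
    my ($class) = @_;
    return bless { filters => [] }, $class;
}

# Append code references to the filter list; returns the object for chaining.
sub add {
    my ($self, @code) = @_;
    push @{ $self->{filters} }, @code;
    return $self;
}

# Apply every filter, in order, to a copy of the text.
sub process {
    my ($self, $text) = @_;
    $text = $_->($text) for @{ $self->{filters} };
    return $text;
}

# The closure captures $self, not a snapshot of the filter list.
sub as_coderef {
    my ($self) = @_;
    return sub { return $self->process(@_) };
}

package main;

my $pipeline = Tiny::Pipeline->new->add( sub { lc $_[0] } );
my $code     = $pipeline->as_coderef;

my $before = $code->("Hello,  World");    # only lowercased

# Adding a filter afterwards changes what the existing closure does.
$pipeline->add( sub { my ($t) = @_; $t =~ s/\s+/ /g; return $t } );
my $after = $code->("Hello,  World");     # lowercased and collapsed
```

If a frozen filter list is wanted instead, one option is to build a separate object for the closure and leave that object untouched afterwards.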
The following filters are defined by this module. Other modules may define more filters. Each of these filters can be added to the set via the "add" method.
The "lower" filter corresponds exactly to the lc builtin in Perl, up to and including its Unicode handling.
The "whitespace" filter collapses any run of whitespace (\s in regular expressions) to a single space, U+0020. Whitespace at the beginning and end of the text is stripped; you may need to add some to account for line continuations or a new line marker at the end, or use the "whitespace_nl" filter below.
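The behaviour described above can be approximated with three substitutions. This is a sketch of the described effect, with a made-up helper name, and not necessarily the module's own code.

```perl
use strict;
use warnings;

# Sketch of the described behaviour; collapse_whitespace is a
# hypothetical helper, not part of Text::Lossy.
sub collapse_whitespace {
    my ($text) = @_;
    $text =~ s/\s+/ /g;    # any whitespace run becomes one U+0020
    $text =~ s/\A //;      # strip the (at most one) leading space
    $text =~ s/ \z//;      # strip the (at most one) trailing space
    return $text;
}

my $collapsed = collapse_whitespace("  lossy\t\ttext \n compression\n");
print "[$collapsed]\n";    # prints [lossy text compression]
```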
The "whitespace_nl" filter is a variant of the "whitespace" filter that leaves newlines at the end of the text alone. Other whitespace at the end is collapsed into a single newline. If the text does not end in whitespace that contains a newline, the trailing whitespace is removed completely, as before.
This filter is most useful if you are creating a Unix-style text filter, and do not want to buffer the entire input before writing the (only) line to stdout. The newline at the end will allow downstream processes to work on new lines, too. Otherwise, this filter is not quite as efficient as the whitespace filter.
Any newlines in the middle of text are collapsed to a space, too. This is especially useful if you are reading in "paragraph mode", e.g. $/ = '', as you will get one long line per former paragraph.
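One possible reading of the newline-preserving variant is sketched below. It assumes that any trailing whitespace run containing a newline ends up as exactly one newline; the helper name is hypothetical and the module's actual code may differ in detail.

```perl
use strict;
use warnings;

# Sketch under the stated assumption; collapse_whitespace_nl is a
# hypothetical helper, not part of Text::Lossy.
sub collapse_whitespace_nl {
    my ($text) = @_;

    # Does the trailing whitespace run contain a newline?
    my $ends_with_newline = $text =~ /\n\s*\z/;

    $text =~ s/\s+/ /g;    # interior newlines collapse to a space, too
    $text =~ s/\A //;
    $text =~ s/ \z//;

    $text .= "\n" if $ends_with_newline;
    return $text;
}

my $line     = collapse_whitespace_nl("first  line\nsecond line \n\n");
my $no_break = collapse_whitespace_nl("no  newline here  ");
```

With this reading, $line ends in a single newline and $no_break has no trailing whitespace at all.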
The "punctuation" filter strips punctuation, that is, anything matching \p{Punctuation}. It is replaced by nothing, removing it completely.
A variant of "punctuation" that replaces punctuation with a space character, U+0020, instead of removing it completely. This is usually less efficient for compression, but retains more readability, for example in the presence of URLs or email addresses.
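The readability trade-off between the two punctuation variants can be seen with a pair of plain substitutions. The helper names are made up for this sketch, which mirrors the described behaviour rather than the module's own code.

```perl
use strict;
use warnings;

# Hypothetical helpers mirroring the two described variants.
sub strip_punctuation {
    my ($text) = @_;
    $text =~ s/\p{Punctuation}//g;     # remove punctuation completely
    return $text;
}

sub space_punctuation {
    my ($text) = @_;
    $text =~ s/\p{Punctuation}/ /g;    # replace each mark with U+0020
    return $text;
}

# Single quotes keep Perl from interpolating the @ sign.
my $sample   = 'Mail me@example.com, please!';
my $stripped = strip_punctuation($sample);    # 'Mail meexamplecom please'
my $spaced   = space_punctuation($sample);    # 'Mail me example com  please '
```

The second variant leaves doubled and trailing spaces behind; a whitespace-collapsing step applied afterwards would tidy those up.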
Leaves the first and last letters of a word alone, but replaces the interior letters with the same set, sorted by the sort function. This is done on the observation (source uncertain at the time) that words can still be made out if the letters are present, but in a different order, as long as the outer ones remain the same.
This filter may not work as proposed with every language or writing system. Specifically, it uses end-of-word matches \b to determine which letters to leave alone.
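A sketch of the letter-sorting idea, using \b word boundaries as described above. The helper names are hypothetical, and the use of \p{Alpha} to pick out letters is an assumption of this example, not a statement about the module's code.

```perl
use strict;
use warnings;

# Return the letters of a string in sorted order.
sub sorted_letters {
    my ($letters) = @_;
    return join '', sort { $a cmp $b } split //, $letters;
}

# Keep the first and last letter of each word, sort the interior.
# Words shorter than three letters have no interior and are left alone.
sub sort_interior {
    my ($text) = @_;
    $text =~ s/\b(\p{Alpha})(\p{Alpha}+)(\p{Alpha})\b/$1 . sorted_letters($2) . $3/ge;
    return $text;
}

my $sorted = sort_interior("lossy text compression");
print "$sorted\n";    # prints lossy text ceimooprssn
```

Because many words map onto the same sorted form, the result tends to contain more repeated substrings than the input.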
These methods are not called on a filter object, but on the class Text::Lossy itself. They are typically concerned with the filters that can be added to filter objects.
    Text::Lossy->register_filters(
        change_stuff => \&Other::Module::change_text,
        remove_ps    => sub { my ($text) = @_; $text =~ s{[Pp]}{}g; return $text; },
    );
Adds one or more named filters to the set of available filters. Filters are passed as pairs of a name and a code reference, as in the call above. Previously defined mappings may be overwritten by this method. Specifically, passing undef as the code reference removes the filter.
my @filters = Text::Lossy->available_filters();
Lists the available filters at this point in time, specifically their names as used by "add" and "register_filters". The list is sorted alphabetically.
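The registration semantics described in this section (overwriting, removal via undef, alphabetical listing) can be modelled with a plain hash. The names %registry, register, and available are hypothetical; this is a model of the documented behaviour, not the module's internals.

```perl
use strict;
use warnings;

# Hypothetical registry standing in for the set of available filters.
my %registry;

sub register {
    my (%mapping) = @_;
    for my $name (keys %mapping) {
        if (defined $mapping{$name}) {
            $registry{$name} = $mapping{$name};    # add or overwrite
        }
        else {
            delete $registry{$name};               # undef removes the filter
        }
    }
    return;
}

# Names of all registered filters, sorted alphabetically.
sub available {
    return sort keys %registry;
}

register(
    upper  => sub { uc $_[0] },
    mirror => sub { scalar reverse $_[0] },
);
register( upper => undef );    # remove one filter again

my @names = available();       # ('mirror')
```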
A filter is a subroutine which takes a single parameter (the text to be converted) and returns the filtered text. The text may also be changed in-place, as long as it is returned again.
These filters are then made available to the rest of the system via the "register_filters" function.
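A minimal custom filter might look like the following. The filter name strip_digits is invented for this example; the registration step uses only the methods documented here and runs only if Text::Lossy is actually installed.

```perl
use strict;
use warnings;

# Hypothetical custom filter: drops every ASCII digit from the text.
sub strip_digits {
    my ($text) = @_;       # a copy of the argument
    $text =~ tr/0-9//d;    # change the copy in place ...
    return $text;          # ... and return it
}

# Register and use the filter only if Text::Lossy is available.
if ( eval { require Text::Lossy; 1 } ) {
    Text::Lossy->register_filters( strip_digits => \&strip_digits );
    my $lossy = Text::Lossy->new->add('strip_digits');
    print $lossy->process('room 101, floor 3'), "\n";
}
```

Keeping the filter as a named subroutine makes it easy to test on its own before registering it.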
The Text::Filter module provides an infrastructure for filtering text, but no actual filters. It can be used with Text::Lossy by passing the result of "as_coderef" as the filter parameter.
Nothing exported or exportable; use the OO interface instead.
This code strives to be completely Unicode compatible. All filters aim to "do the right thing" on non-ASCII strings. Any failure to handle Unicode should be considered a bug; please report it.
Ben Deutsch, <ben at bendeutsch.de>
None known so far.
Please report any bugs or feature requests to bug-text-lossy at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Lossy. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
perldoc Text::Lossy
You can also look for information at:
RT: CPAN's request tracker (report bugs here)
http://rt.cpan.org/NoAuth/Bugs.html?Dist=Text-Lossy
AnnoCPAN: Annotated CPAN documentation
http://annocpan.org/dist/Text-Lossy
CPAN Ratings
http://cpanratings.perl.org/d/Text-Lossy
Search CPAN
http://search.cpan.org/dist/Text-Lossy/
Copyright 2012 Ben Deutsch.
This program is free software; you can redistribute it and/or modify it under the terms of either: the GNU General Public License as published by the Free Software Foundation; or the Artistic License.
See http://dev.perl.org/licenses/ for more information.
To install Text::Lossy, copy and paste the appropriate command into your terminal.
cpanm
cpanm Text::Lossy
CPAN shell
perl -MCPAN -e shell
install Text::Lossy
For more information on module installation, please visit the detailed CPAN module installation guide.