The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

HTML::HTML5::ToText - convert HTML to plain text

SYNOPSIS

 my $dom = HTML::HTML5::Parser->load_html(IO => \*STDIN);
 print HTML::HTML5::ToText
     ->with_traits(qw/ShowLinks ShowImages RenderTables/)
     ->new()
     ->process($dom);

DESCRIPTION

The HTML::HTML5::ToText module itself produces a pretty boring conversion of HTML to text, but thanks to Moose and MooseX::Traits it can easily be composed with "traits" that improve the output.

Compositor

with_traits(@traits)

This class method creates a new class that composes HTML::HTML5::ToText with each trait given, returning the name of that class. That class will be a subclass of HTML::HTML5::ToText.

Traits are taken to be in the "HTML::HTML5::ToText::Trait" namespace unless overridden by prefixing the trait with "+".

Constructors

  • new(%attrs)

    Creates a new instance of the class.

  • new_with_traits(traits => \@traits, %attrs)

    Shortcut for:

     HTML::HTML5::ToText->with_traits(@traits)->new(%attrs)

Attributes

As per usual for Moose classes, accessor methods are provided for each attribute, and attributes may be set in the constructor.

HTML::HTML5::ToText does not actually provide any attributes, but some traits may.

Methods

  • process($node)

    Processes an XML::LibXML::Node and returns a string. May be called as a class or object method.

    Because process likes to perform some alterations to the DOM tree, as a first stage it makes a clone of the DOM tree (so that it can leave the original intact). If you don't care about any changes to the tree, and want to save a bit of CPU, then you can suppress the cloning by passing a true value as a second argument to process.

     HTML::HTML5::ToText->process($node, 'no_clone')
  • process_string($string)

    As per process, but first parses the string with HTML::HTML5::Parser. The second argument (for cloning) does not exist as cloning is not needed in this case.

There are also methods named (in upper-case) after every element defined in HTML5: STRONG($node), DL($node), IMG($node) and so on, which process($node) delegates to; and a textnode($node) method which is the equivalent for text nodes. These are the methods which traits tend to modify.

EXTENDING

MooseX::Traits makes it pretty easy to cleanly extend this module. Say for example, we want to add the feature where the HTML <del> element is output as the empty string. (The default behavious treats it rather like <div>.)

 {
   package Local::SkipDEL;
   use Moose::Role;
   override DEL => sub { '' };
 }
 
 print HTML::HTML5::ToText
   -> with_traits(qw/ShowLinks ShowImages +Local::SkipDEL/)
   -> process_string($html);

Or maybe we want to force <big> elements into uppercase?

 {
   package Local::Embiggen;
   use Moose::Role;
   around BIG => sub
   {
     my ($orig, $self, $elem) = @_;
     return uc $self->$orig($elem);
   };
 }
 
 print HTML::HTML5::ToText
   -> with_traits(qw/+Local::Embiggen/)
   -> process_string($html);

Share your examples of extending HTML::HTML5::ToText at https://bitbucket.org/tobyink/p5-html-html5-totext/wiki/Extending.

BUGS

Please report any bugs to http://rt.cpan.org/Dist/Display.html?Queue=HTML-HTML5-ToText.

SEE ALSO

HTML::HTML5::Parser, HTML::HTML5::Table.

HTML::HTML5::ToText::Trait::RenderTables, HTML::HTML5::ToText::Trait::ShowImages, HTML::HTML5::ToText::Trait::ShowLinks, HTML::HTML5::ToText::Trait::TextFormatting.

Similar Modules on CPAN

  • HTML::FormatText

    About 15 years old, and still maintained, this falls into the "mature" category. This module is based on HTML::Tree, so its HTML parser may not behave as closely to modern browsers as HTML::HTML5::Parser's parsing, but its conversion to text seems somewhat better than HTML::HTML5::ToText's default output (i.e. with no traits applied).

    At the time of writing, its bug queue on rt.cpan.org lists eight bugs, some quite serious. However, since being taken over by its latest maintainer, there seems to be progress being made on them.

    Fairly extensible, but not in the mix-and-match traits way allowed by HTML::HTML5::ToText.

  • HTML::FormatText::WithLinks

    An extension of HTML::FormatText.

  • HTML::FormatText::WithLinks::AndTables

    An extension of HTML::FormatText::WithLinks.

    The code that deals with tables is pretty crude compared with HTML::HTML5::ToText::Trait::RenderTables. It doesn't support colspan, rowspan, or the <th> element.

  • LEOCHARRE::HTML::Text

    Very basic conversion; basically just tag stripping using regular expressions.

  • HTML::FormatExternal

    Passes HTML through external command-line tools such as `lynx`. Obviously this has limited portability.

AUTHOR

Toby Inkster <tobyink@cpan.org>.

THANKS

Everyone behind Moose. No way I could have done all this in a few hours without Moose's strange brand of meta-programming!

COPYRIGHT AND LICENCE

This software is copyright (c) 2012-2013 by Toby Inkster.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.

DISCLAIMER OF WARRANTIES

THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE.