NAME

    html2plain.pl - HTML to plain text converter
    

SYNOPSIS

    html2plain.pl [options] [source directory ...]

  Options:

    --html-ext                filename extension identifying HTML files
    --out-ext                 output filename extension
    --out-dir                 output directory
    --N-per-out-dir           number of records per output directory
    --source-encoding         the encoding of the HTML files
    --[no]assert-html         assert that the document is HTML
    --[no]symbolic-char-entities-to-chars
                              convert symbolic character entities to UTF-8
                              characters
    --[no]numerical-char-entities-to-chars
                              convert numerical character entities to UTF-8
                              characters
    --[no]clean-whitespace    remove redundant whitespace
    --[no]assert-assumptions  assert that the document is in UTF-8 and contains
                              no '\0's before actually converting to plain text
    --help                    brief help message
    --man                     full documentation
    --[no]warnings            warnings output flag
    

OPTIONS

--html-ext
    Sets the filename extension that identifies HTML files.
    Default value: 'html'.
--out-ext
    Sets the output filename extension. 
    Default value: 'plain'.
--out-dir
    Sets the output directory. Default value: '.'.
--N-per-out-dir
    Sets the number of records per output directory. Default value: 1000.
--source-encoding
    Specifies the encoding of the HTML files. Default value: undef,
    which means that the encoding is guessed for each document.
--[no]assert-html
    Specifies whether it is asserted that the document actually looks like
    HTML before trying to convert. Default: yes.
--[no]symbolic-char-entities-to-chars
    Specifies whether symbolic character entities are converted to 
    UTF-8 characters. Default: yes.
--[no]numerical-char-entities-to-chars
    Specifies whether numerical character entities are converted to 
    UTF-8 characters. Default: yes.
--[no]clean-whitespace
    Specifies whether redundant whitespace is removed from the output.
    Default: yes.
--[no]assert-assumptions
    Specifies whether assumptions about the source are validated before
    trying to convert: that it is in UTF-8 (which it is converted to
    internally) and contains no '\0's. Default: yes.
--help
    Prints a brief help message and exits.
--man
    Prints the manual page and exits.
--[no]warnings
    Outputs (or suppresses) warnings. Default value: yes.
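
EXAMPLE

    The options above can be combined on one command line. The following
    is a minimal sketch, not taken from the script's own documentation:
    the directory names ./site-mirror and ./plain-out are hypothetical,
    the encoding name iso-8859-1 is an assumption about which names
    --source-encoding accepts, and the "--option value" spelling assumes
    conventional Getopt::Long-style parsing. The command is printed first
    and only executed if html2plain.pl is found on PATH.

```shell
# Hypothetical locations; adjust to your own layout.
src_dir="./site-mirror"
out_dir="./plain-out"

# Assemble the arguments from options documented above.
# --html-ext and --out-ext are given without a leading dot, matching the
# documented defaults 'html' and 'plain'.
# The encoding name is an assumption, not a value confirmed by the manual.
set -- --html-ext htm \
       --out-ext txt \
       --out-dir "$out_dir" \
       --source-encoding iso-8859-1 \
       --noclean-whitespace \
       "$src_dir"

# Show what would be run.
cmd_line="html2plain.pl $*"
echo "$cmd_line"

# Run the converter only if it is actually installed.
if command -v html2plain.pl >/dev/null 2>&1; then
    html2plain.pl "$@"
fi
```

    Negatable flags such as --[no]clean-whitespace are switched off by
    prefixing the option name with "no", as in --noclean-whitespace.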

DESCRIPTION

    Recursively processes the HTML files under each source directory
    and converts their textual content to plain text files.
    The output is in UTF-8.
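
    To preview which files a recursive run might pick up, the selection
    can be approximated with find. This is only a sketch under stated
    assumptions: it supposes that files are chosen purely by the
    --html-ext filename extension and that matching is case-sensitive,
    neither of which the manual confirms. The temporary tree and the
    variable names are made up for illustration.

```shell
# Build a throwaway directory tree so the sketch is self-contained.
tmp=$(mktemp -d)
mkdir -p "$tmp/a/b"
printf '<html><body>one</body></html>' > "$tmp/index.html"
printf '<html><body>two</body></html>' > "$tmp/a/b/page.html"
printf 'not html' > "$tmp/a/notes.txt"

# Mirrors the documented default value of --html-ext.
html_ext="html"

# List every regular file, at any depth, whose name ends in ".$html_ext".
matches=$(find "$tmp" -type f -name "*.$html_ext" | sort)
printf '%s\n' "$matches"

# Count the candidates (tr strips the padding some wc versions add).
count=$(printf '%s\n' "$matches" | wc -l | tr -d ' ')
echo "$count candidate file(s)"

# Clean up the temporary tree.
rm -rf "$tmp"
```

    In this tree the two .html files are listed and notes.txt is skipped.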