NAME

HTML::Latex - Creates a Latex file from an HTML file.

SYNOPSIS

 use HTML::Latex;
 use IO::File;

 my $parser = HTML::Latex->new($conffile);
 $parser->set_option(\%options);
 $parser->add_package(@packages);
 $parser->ban_tag(@banned);
 $parser->set_log($logfile);

 # Option 1:
 foreach my $uri (@ARGV) {
    my ($htmlfile,$latexfile) = $parser->html2latex($uri);
 }

 # Option 2:
 foreach my $uri (@ARGV) {
    my $in  = IO::File->new($uri, 'r')       or die "Cannot read $uri: $!";
    my $out = IO::File->new("$uri.tex", 'w') or die "Cannot write $uri.tex: $!";
    $parser->html2latex($in,$out);
 }

 # Option 3:
 my $html_string = join("",<>);   # lines read from <> already end in newlines
 my $tex_string = $parser->parse_string($html_string,1);

 # Option 4:
 my $html_string = join("",@ARGV);
 my $tex_string = $parser->parse_string($html_string);

 print $tex_string;

DESCRIPTION

This class is used to create a text file in Latex format from a file in HTML format. Use the class as follows:

1. Create a new HTML::Latex object.

2. Override any options using set_option(), add_package(), ban_tag(), or set_log().

3. Run html2latex() on a file or URL.

4. Do whatever you want with the filename that was returned.

METHODS

$p = HTML::Latex->new($conffile)

Creates a new HTML::Latex object. It parses the configuration file $conffile to set attributes. The format of that file is described in the CONFIGURATION FILE section.

Example:

    my $parser = HTML::Latex->new();

($htmlfile,$latexfile) = $p->html2latex($in,$out)

$in is any URL or filename or FileHandle. If it is a URL, it is mirrored locally. The local location is returned as $htmlfile. The method produces a Latex file $latexfile.

Locally mirrored files are all stored in the "store" directory which can be set with either set_option() or in the configuration file. See store under the OPTIONS section for more details.

A mirrored file will automatically be re-downloaded when the URL is updated. If it has not been updated, html2latex() will use the local file only.

Also, html2latex() defaults to index.html when the URL does not name a file. For instance, html2latex('http://slashdot.org') would use the URL http://slashdot.org/index.html.

Example:

    my($htmlfile,$latexfile) =
       $parser->html2latex('report01.html');

$tex_string = $p->parse_string($html_string [,$full])

$html_string is an HTML string, and $tex_string is the resulting LaTeX string. If $full is 0, any <HTML> and <BODY> tags are ignored and the result is a plain TeX fragment. If $full is 1, <HTML> and <BODY> tags are implicitly added. In other words, $full decides whether or not $tex_string has a LaTeX preamble in it.

my @old_values = $p->set_option(\%options)

Sets one or more options. For a description of the options, see the OPTIONS section below. Returns a list of the old values, based on the keys of %options.

Example:

    $parser->set_option({border => 0, debug => 1});

$p->add_package(@packages)

Adds packages to the list used by \usepackage{} in Latex. The defaults are fullpage, graphicx, and url.

Example:

    $parser->add_package('doublespace');

$p->add_head(@heads)

Adds options to the list used by \documentclass[OPTIONS]{article} in Latex. Font is automatically put there, so don't put it there yourself.

Example:

    $parser->add_head('twocolumn');

$p->ban_tag(@banned)

Adds @banned to the list of tags that html2latex() will ignore. This overrides tag definitions in the configuration file. By default, the <CODE> tag is banned, because some people were using <PRE><CODE></CODE></PRE>, which can be really bad if both are parsed.

Example:

    $parser->ban_tag('code');

my $filehandle = $p->set_log($logfile)

Have errors and messages printed to the filename or FileHandle or IO::File $logfile. By default, things are printed to STDERR. set_log() returns the FileHandle of the log file.

Example:

    my $filehandle = $parser->set_log('report01.log');

CONFIGURATION FILE

The configuration file is a very simple XML file. The root element is <conf>. Nested inside are four main tags: <tag>, <package>, <ban>, and <options>. The <head> tag, described under package below, is also recognized.

tag

<tag> has two attributes: name and type. Nested inside <tag> are zero or more <tex> tags. Each of these is described below.

name

The name attribute assigns the other values (type and tex) to an HTML tag of a certain name.

type

The type of a tag basically tells html2latex() how to handle it. Internally, this assigns the tag to a certain handler.

tex

When handling a tag, html2latex() must know what TeX string to replace the HTML tags with. This is done with <tex>tex string</tex>. Different types require 0, 1, or 2 such tags nested inside of <tag>. You can think of <tex> tags as arguments passed to a type handler. Internally, that is what they are.

Extraneous whitespace is ignored; do not rely upon it. \N is replaced with a newline. Everything else is used just as you type it.
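
The two rules above can be sketched in a few lines of Perl. This is only an illustration, not the module's own code: the helper name normalize_tex is hypothetical, and it assumes that "extraneous" whitespace means leading and trailing whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of HTML::Latex.
# Assumption: "extraneous" whitespace is leading/trailing whitespace.
sub normalize_tex {
    my ($tex) = @_;
    $tex =~ s/^\s+|\s+$//g;   # drop surrounding whitespace first
    $tex =~ s/\\N/\n/g;       # then expand \N into a real newline
    return $tex;
}

print normalize_tex('  \begin{center}\N  '), "|\n";
```

Trimming happens before \N is expanded, so a newline requested with \N at the end of a string survives.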

tag examples

For a lot of examples, just look at the default configuration file, html2latex.xml. We will go over one example in detail. This example is for the HTML <B> tag.

    <tag name="b" type="command">
        <tex>textbf</tex>
    </tag>

This text tells html2latex() to treat the <B> tag as a TeX command. It gives it the additional argument of 'textbf'. html2latex() will call the command_handler('textbf') and the output will be \textbf{NESTED DATA}.

package

For each <package>package_name</package> given, package_name is added to the list printed in the Latex file. For instance, the lines

    <package>fullpage</package>
    <package>graphicx</package>
    <package>url</package>

add the packages fullpage, graphicx, and url. The package 'fullpage' is often recommended.

For each <head>head</head> given, head is added to the list of options printed in the \documentclass command. For instance, the line

    <head>twocolumn</head>

creates the command \documentclass[10pt,twocolumn]{article}.

ban

<ban> will make html2latex ignore a tag. For instance, the line

   <ban>code</ban>

makes html2latex() ignore <code> even though it has a definition in the configuration file. This can be useful to turn on/turn off tags when trying different configurations.

options

Inside of <options> are a number of other tags. Each is described below in OPTIONS. The value inside of a given <OPTION> </OPTION> provides a default value that can be overridden with command-line options. For instance, <font>10</font> will set the default font size to 10.

TYPES

There are a number of different types of HTML tags supported by HTML::Latex. The list is: command, environment, single, other, kill, table, image, and ignore. Each is described below. TEX1 and TEX2 mean the first and second value given by <tex>. NAME is given by the name attribute. VALUE is the value nested within an HTML tag.

command

 HTML Key:       <NAME>VALUE</NAME>
 HTML Example:   <B>Foo</B>
 TeX  Key:       \TEX1{VALUE}
 TeX  Example:   \textbf{Foo}

environment

 HTML Key:       <NAME>VALUE</NAME>
 HTML Example:   <OL>Foo</OL>
 TeX  Key:       \begin{TEX1} VALUE \end{TEX1}
 TeX  Example:   \begin{enumerate} Foo \end{enumerate}

single

 HTML Key:       <NAME>VALUE
 HTML Example:   <LI>Foo
 TeX  Key:       \TEX1 VALUE
 TeX  Example:   \item Foo

other

 HTML Key:       <NAME>VALUE</NAME>
 HTML Example:   <DT>Foo
 TeX  Key:       TEX1 VALUE TEX2
 TeX  Example:   \item[Foo]

kill

 HTML Key:       <NAME>VALUE</NAME>
 HTML Example:   <SCRIPT>javascript.garbage()</SCRIPT>
 TeX  Key:       ""
 TeX  Example:   ""

This type is particularly useful because any nested HTML tags are also ignored. It is good for removing unwanted javascript.

table

This should be applied if and only if the tag is TABLE, TR, or TD.

image

This should be applied if and only if the tag is IMG.

ignore

Do nothing. Has the same effect as banning a tag.
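
The simple types above can be read as string templates. The following is a minimal sketch of that reading, not the module's internals: the %template table and the render() function are hypothetical names, and $value stands for the already converted nested text. The "other" entry joins TEX1, VALUE, and TEX2 without spaces so that it reproduces the \item[Foo] example.

```perl
use strict;
use warnings;

# Illustrative templates only; names and layout are assumptions.
my %template = (
    command     => sub { my ($v, $t1) = @_; sprintf('\\%s{%s}', $t1, $v) },
    environment => sub { my ($v, $t1) = @_;
                         sprintf('\\begin{%s} %s \\end{%s}', $t1, $v, $t1) },
    single      => sub { my ($v, $t1) = @_; sprintf('\\%s %s', $t1, $v) },
    other       => sub { my ($v, $t1, $t2) = @_; $t1 . $v . $t2 },
    kill        => sub { '' },   # drop the tag and everything inside it
);

# Look up the template for a type and apply it to the nested value.
sub render {
    my ($type, $value, @tex) = @_;
    my $code = $template{$type} or die "unknown type: $type";
    return $code->($value, @tex);
}

print render('command',     'Foo', 'textbf'),     "\n";
print render('environment', 'Foo', 'enumerate'),  "\n";
print render('single',      'Foo', 'item'),       "\n";
print render('other',       'Foo', '\item[', ']'), "\n";
```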

OPTIONS

store

"store" is the directory that mirrored files are stored in. It is ~/.html2latex by default. Inside of this directory are subdirectories representing the HOST in a URL and the path from that HOST. For instance, if you used html2latex('http://slashdot.org/path/to/file.html'), it would store the file as ~/.html2latex/slashdot.org/path/to/file.html.
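
As a rough illustration of that layout, the sketch below maps a URL to a path under the store directory. The function store_path is hypothetical and the real module may build the path differently; it also assumes that a URL ending in a slash gets index.html appended, and it does no special handling of query strings.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper illustrating the store layout described above.
sub store_path {
    my ($store, $url) = @_;
    my ($host, $path) = $url =~ m{^\w+://([^/]+)(/.*)?$}
        or die "not a URL: $url";
    $path = '/' unless defined $path;
    $path .= 'index.html' if $path =~ m{/$};   # assumed default file name
    return File::Spec->catfile($store, $host, split m{/}, substr($path, 1));
}

print store_path('/tmp/store', 'http://slashdot.org/path/to/file.html'), "\n";
print store_path('/tmp/store', 'http://slashdot.org'), "\n";
```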

cache

This will force html2latex to use cached files if possible. It always caches anyway, and uses the cached file if the network file has not changed. This just forces the use of the local file if available.

document_class

Set the documentclass to use. Any valid latex document class is valid. Examples are report, book, and article. article is the default. If an invalid document class is used, the output latex file will not compile.

paragraph

True uses HTML-style paragraphs. They leave a newline between paragraphs. False uses TeX-style paragraphs. They have no newline, but indent the first line of every paragraph. Default is true.

font

Set the font size. Can be 10, 11, or 12. Do not try anything else. html2latex will not check it, but the latex file will not compile (at least I think not). Default is 12.

image

Set the scale for images in the latex file. This is useful because some images in HTML are much too big to fit on a page. Default is 1.0. Scale can be any positive floating point number; large numbers are not recommended.

border

True means table borders are on. False means they are off. This is always overridden by HTML attributes.

mbox

html2latex() will put a tex \mbox around all of the tables it creates. I do not know why, but with a lot of tables (especially nested ones), the tex and pdf output will work better. So, if you do not like your output with tables, try this. True means on, false means off. Default is false.

debug

The bigger the number, the more debugging info is printed. 0 means things relevant to the user. 1 means things that trace some code. 2 or greater means dumping data structures.

EXTENDING

Extending HTML::Latex basically means making a new tag work. Usually, this calls for writing a new handler. If an existing handler will suffice, you can skip to the third step. It is very simple to do. There are 3 easy steps:

Write the function.

Write a function (preferably with a name ending in '_handler'). Its input is one HTML::Element and several TeX strings. The type of HTML::Element and the values of the strings are set in the XML config file. Your function's responsibility is to return a TeX string representing the HTML::Element and all of its children elements.

The children are very easy to take care of. The string representing the children elements is obtained by calling texify($html_element). So, the function really only has to worry about the current HTML::Element.

In particular, it must return what comes before and what goes after the string representing the children of the current HTML::Element. So, if you wanted a handler that prints \TAG as the TeX for any <TAG> in HTML and a special TEX value given in the config file for </TAG>, then the handler would look like this:

 sub my_handler{
     my ($html_element,$tex) = @_;
     return '\\' . $html_element->tag() . texify($html_element) . $tex;
 }

In this example, one TEX parameter was passed in by the XML config file. The handler returns what comes before the children, concatenated with the texify-ed children, concatenated with what comes after the children. See the documentation for HTML::Element for all of the things you can do with them.

Assign a tag type to a handler.

Just add an entry to %types below. It should have a type name as a key and a reference to your handler as a value. Following our example, we could add the line:

    "my_type"     =>    \&my_handler,

to %types.

Add support in the configuration file.

The format of the configuration file is in XML and can be found above under CONFIGURATION FILE. The default XML file is at the bottom of Latex.pm under __DATA__. Basically, for every tag you want to use your new handler, use <tag> as follows:

 <tag name="TAG_NAME" type="my_type">
     <tex>TEX_PARAMETER</tex>
 </tag>

TAG_NAME is, of course, the tag name. "my_type" is the name of the type you assigned your handler to. TEX_PARAMETER is the value that gets placed in $tex in the example handler.

That's it. Now HTML::Latex should obey the new handler and behave correctly.
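
To see how the three steps fit together, here is a self-contained sketch that runs without HTML::Latex or HTML::Element installed. Everything except my_handler and the %types entry is a stand-in invented for the illustration: Stub::Element mimics only the tag() method, texify() simply returns stored text, and %tags and convert() are hypothetical names for what the <tag> entry in the configuration file amounts to.

```perl
use strict;
use warnings;

# Stand-in for HTML::Element; only tag() mirrors the real class.
{
    package Stub::Element;
    sub new  { my ($class, %args) = @_; return bless {%args}, $class }
    sub tag  { $_[0]{tag} }
    sub text { $_[0]{text} }
}

# Stub for the module's texify(): just returns the stored text.
sub texify { my ($el) = @_; return $el->text }

# Step 1: the handler from the example above.
sub my_handler {
    my ($html_element, $tex) = @_;
    return '\\' . $html_element->tag() . texify($html_element) . $tex;
}

# Step 2: assign a type name to the handler.
my %types = ( my_type => \&my_handler );

# Step 3: roughly what <tag name="foo" type="my_type"> with one <tex> means.
my %tags = ( foo => { type => 'my_type', tex => ['\endfoo'] } );

# Hypothetical dispatcher: find the tag's type, call its handler.
sub convert {
    my ($el) = @_;
    my $conf = $tags{ $el->tag } or return texify($el);
    return $types{ $conf->{type} }->($el, @{ $conf->{tex} });
}

print convert(Stub::Element->new(tag => 'foo', text => ' bar ')), "\n";
```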

NOTES

If you call html2latex() on several URLs, any filename given after a URL will continue to use the latest HOST given. Also, files default to index.html, regardless of what the server thinks. So, if you use:

 html2latex(http://slashdot.org)
 html2latex(foo.html)
 html2latex(http://linuxtoday.net)
 html2latex(bar.html)

html2latex() will try to grab http://slashdot.org/index.html, http://slashdot.org/foo.html, http://linuxtoday.net/index.html, and http://linuxtoday.net/bar.html.
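
The resolution rule can be sketched as follows. This is an illustration of the behaviour described in this section, not the module's code: resolve_all is a hypothetical function, and it assumes that a bare filename seen before any URL is left alone as a local file.

```perl
use strict;
use warnings;

# Hypothetical sketch of the "latest HOST" rule described above.
sub resolve_all {
    my @args = @_;
    my $base;        # scheme://host of the most recent URL
    my @resolved;
    for my $arg (@args) {
        if ($arg =~ m{^(\w+://[^/]+)(/.*)?$}) {
            $base = $1;
            my $path = defined $2 ? $2 : '/';
            $path .= 'index.html' if $path =~ m{/$};
            push @resolved, $base . $path;
        }
        elsif (defined $base) {
            push @resolved, "$base/$arg";   # reuse the latest host
        }
        else {
            push @resolved, $arg;           # no host seen yet: local file
        }
    }
    return @resolved;
}

print "$_\n" for resolve_all(
    qw(http://slashdot.org foo.html http://linuxtoday.net bar.html)
);
```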

BUGS

* Anything between <TABLE> and <TR> and <TD> is ignored. I do not

* Anything between <OL> or <UL> and <LI> will not be ignored, but will really mess Latex up.
