html2alvis - HTML to Alvis XML converter


    html2alvis [options] [source directory ...]


    --html-ext                 filename extension identifying HTML files
    --meta-ext                 filename extension identifying meta files
    --out-dir                  output directory
    --N-per-out-dir            number of records per output directory
    --meta-encoding            the encoding of the meta files
    --html-encoding            the encoding of all HTML files
    --html-encoding-from-meta  take the encoding of the HTML files from
                               the meta files (attribute 'detected-charset')
    --[no]original             include original document?
    --help                     brief help message
    --man                      full documentation
    --[no]warnings             warnings output flag


    --html-ext
        Sets the filename extension that identifies HTML files.
        Default value: 'html'.

    --meta-ext
        Sets the filename extension that identifies meta files.
        Default value: 'meta'. The meta file syntax is

              <feature name>\t<feature value>\n

        Special features are url, title, date, detectedCharSet.

    --out-dir
        Sets the output directory. Default value: '.'.

    --N-per-out-dir
        Sets the number of records per output directory.
        Default value: 1000.

    --meta-encoding
        Specifies the encoding of all meta files.
        Default value: 'iso-8859-1'.

    --html-encoding
        Specifies the encoding of all HTML files.
        Default: undef (meaning 'guess').

    --html-encoding-from-meta
        Specifies whether the encoding of an HTML file should be read
        from the corresponding meta file. If no information is given
        there, --html-encoding is used; if that is not given either,
        the encoding is guessed. Default: no.

    --[no]original
        Specifies whether the original document is included in the
        output. Default value: yes.

    --help
        Prints a brief help message and exits.

    --man
        Prints the manual page and exits.

    --[no]warnings
        Outputs (or suppresses) warnings. Default value: yes.
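
    The meta file syntax and the encoding fallback described above can
    be sketched as follows. This is a minimal illustration in Python,
    not the code html2alvis itself uses. The function names, the
    handling of blank or tab-less lines, and the choice of
    'detectedCharSet' as the feature that carries the encoding are
    assumptions made for the example.

```python
def parse_meta(text):
    """Parse '<feature name>\\t<feature value>\\n' lines into a dict."""
    features = {}
    for line in text.splitlines():
        name, sep, value = line.partition("\t")
        if not sep or not name:
            # Assumption: lines without a tab (including blank lines)
            # are skipped rather than treated as errors.
            continue
        # Assumption: if a feature name repeats, the last value wins.
        features[name] = value
    return features


def resolve_html_encoding(features, html_encoding=None, from_meta=False,
                          charset_key="detectedCharSet"):
    """Return the encoding to use, or None meaning 'guess'.

    Order: meta file (only if from_meta is set), then the explicit
    html_encoding value, then None.
    """
    if from_meta and features.get(charset_key):
        return features[charset_key]
    return html_encoding


sample = "url\thttp://example.org/foo\ntitle\tFoo\ndetectedCharSet\tutf-8\n"
features = parse_meta(sample)
encoding = resolve_html_encoding(features, "iso-8859-1", from_meta=True)
```

    With the sample above, the meta file supplies the encoding. With
    from_meta left off, the explicit value is returned instead, and
    with neither available the result is None, which stands for
    'guess'.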


    Goes recursively through the files under the source directory
    and converts them to Alvis XML files. Meta information (such
    as the URL, the detected character set, or the title of the
    document) can be given in a separate meta file, one per document,
    recognized by the shared basename. For example, if the HTML
    document is called foo.original and the meta information is in
    foo.meta, html2alvis should be called like this:
          html2alvis --html-ext original --meta-ext meta
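
    The pairing of documents and meta files by shared basename can be
    pictured with the sketch below. It is a hypothetical Python
    illustration of the idea, not html2alvis's own implementation. The
    function name pair_documents and the decision to report a missing
    meta file as None are assumptions.

```python
import os


def pair_documents(source_dir, html_ext="html", meta_ext="meta"):
    """Walk source_dir recursively and pair each HTML file with the
    meta file that shares its basename, or None if there is none."""
    pairs = []
    for dirpath, _dirnames, filenames in os.walk(source_dir):
        names = set(filenames)
        for name in filenames:
            base, ext = os.path.splitext(name)
            if ext != "." + html_ext:
                continue  # not an HTML file by extension
            meta_name = base + "." + meta_ext
            meta_path = (os.path.join(dirpath, meta_name)
                         if meta_name in names else None)
            pairs.append((os.path.join(dirpath, name), meta_path))
    # Sort by HTML path so the result does not depend on walk order.
    return sorted(pairs, key=lambda pair: pair[0])
```

    Called as pair_documents(src, "original", "meta"), this pairs
    foo.original with foo.meta in the same directory, mirroring the
    command line shown above.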