pdf2ntvml [-E defaultinputencoding] [-b] [-n] [-t 256] [-f files]

PDF to XML converter for use with NexTrieve.

Attributes

XML <filename> attribute contains the filename of the PDF file.

XML <title> attribute contains the title of the PDF file (if any).

Text-types

XML <title> text-type contains the title of the PDF file (if any).

Example Output

 <document>
  <attributes>
   <filename>example.pdf</filename>
   <title>Example</title>
  </attributes>
  <text>
   <title>Example</title>
   Text found in the PDF-file
  </text>
 </document>

Why is <title> both an attribute and a text-type?

A query word that matches in the <title> text-type can be far more significant than the same word found in the body text. However, when displaying a hit, it is handy to have the title (and the filename) of the document available as well. That is why the title is also added as an attribute, even though it cannot be used for constraining a query (at least, not yet).

Usage

 pdf2ntvml -f file1 file2 file3 > xml

 pdf2ntvml <files.list > xml

Example

Convert all .pdf files in the "doc_root" directory to XML and store that XML in the file "xml".

 pdf2ntvml -f doc_root/*.pdf >xml

Example

Index all of the files located by the find command.

 find / -iname '*.pdf' | pdf2ntvml | docseq | ntvindex -
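When the list of files is needed more than once, it can be written to a file first and then fed to the converter on STDIN, as in the second Usage form. The following is a minimal sketch, not part of the distribution: the helper name list_pdfs is hypothetical, and it assumes a find that supports -iname (GNU and BSD find do).

```shell
#!/bin/sh
# list_pdfs DIR: print one path per line for every regular file under DIR
# whose name ends in .pdf, matched case-insensitively.
# (Hypothetical helper for illustration only.)
list_pdfs() {
    find "$1" -type f -iname '*.pdf' | sort
}

# Possible usage, assuming pdf2ntvml is installed and on the PATH:
#   list_pdfs doc_root > files.list
#   pdf2ntvml < files.list > xml
```

Since the filenames are passed one per line, paths that contain newlines would not survive this approach.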

Requirements

Requires the NexTrieve.pm module and its associated modules, as found on CPAN (http://www.cpan.org/). Also requires the "pdfinfo" and "pdftotext" programs from the xpdf package, available at http://www.foolabs.com/xpdf/ .

Parameter settings

-f file1 file2 file3: read filenames from command line

If you want to specify the filenames on the command line rather than pipe them through STDIN, specify the -f parameter followed by the list of files you want to process. These files are processed in addition to any filenames piped through STDIN.

-E defaultinputencoding

Many older PDF files do not contain the information needed to determine which character encoding they use. The -E parameter specifies which encoding should be assumed if no encoding information is found. The default is "iso-8859-1".

-n

If you are merging multiple runs of this script into the same file, you do not need the <?xml...?> processing instruction to be repeated. When this flag is specified, the processing instruction will _not_ be emitted to the XML stream.

-b

If you are merging results of multiple runs, you may also want the <ntv:docseq> container not to be emitted. Specifying the -b (for "bare XML") flag does just that.
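The two flags can be combined when several runs are appended to one file. The sketch below is only an illustration under assumptions: merge_runs is a hypothetical helper, and it assumes that each directory holds at least one .pdf file. Every run uses both -n and -b, so the output file contains only <document> elements; adding a single <?xml...?> instruction and a single <ntv:docseq> container around the merged result is left to a later step and is not shown here.

```shell
#!/bin/sh
# merge_runs OUTFILE DIR...: run the converter once per directory and append
# each run's bare output to OUTFILE.
# -n suppresses the <?xml...?> instruction, -b suppresses the <ntv:docseq>
# container, so nothing is repeated between runs.
# (Hypothetical helper for illustration only.)
merge_runs() {
    out=$1
    shift
    : > "$out"                      # start with an empty output file
    for dir in "$@"; do
        pdf2ntvml -n -b -f "$dir"/*.pdf >> "$out" || return 1
    done
}

# Possible usage, assuming pdf2ntvml is installed and on the PATH:
#   merge_runs merged.xml manuals reports
```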

-t 256: maximum length for title attribute

Some PDF files contain very long titles. Some authors do this to get higher rankings, as many search engines value text in a title more than text in the body (or search only in the title).

Experience has shown that titles of more than 10K are not uncommon. This causes all sorts of problems in the display of hitlists (where the title is one of the attributes returned) and generally degrades performance.

The -t parameter allows you to set a maximum length for the title as stored as an attribute (and therefore returned in the hitlist). It does not alter the length of the title stored as a text-type.

The default for -t is 0, which means the title is not limited.
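The effect of -t can be pictured with a small shell sketch. This is not the script's own code (pdf2ntvml is written in Perl); the helper names truncate_title and emit_titles are hypothetical, the cut is by character position for plain ASCII input, and XML escaping is left out for brevity.

```shell
#!/bin/sh
# truncate_title TITLE MAX: print TITLE cut to at most MAX characters.
# A MAX of 0 means "no limit", mirroring the default described above.
# (Hypothetical helper for illustration only.)
truncate_title() {
    title=$1
    max=$2
    if [ "$max" -gt 0 ]; then
        printf '%s\n' "$title" | cut -c "1-$max"
    else
        printf '%s\n' "$title"
    fi
}

# emit_titles TITLE MAX: show that only the attribute copy is shortened,
# while the text-type copy keeps the full title.
emit_titles() {
    printf '<attributes><title>%s</title></attributes>\n' "$(truncate_title "$1" "$2")"
    printf '<text><title>%s</title></text>\n' "$1"
}

emit_titles 'Example title' 7
# prints:
# <attributes><title>Example</title></attributes>
# <text><title>Example title</title></text>
```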

AUTHOR

Elizabeth Mattijsen, <liz@dijkmat.nl>.

Please report bugs to <perlbugs@dijkmat.nl>.
