Renard::Curie::Data::PDF - Retrieve PDF image and text data via MuPDF's mutool
version 0.001
_call_mutool( @args )
Helper function which calls mutool with the contents of the @args array.
Returns the captured STDOUT of the call.
This function dies if mutool exits unsuccessfully.
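The behaviour described here (run a command, capture its STDOUT, die on failure) can be sketched with core Perl alone. This is an illustration only, not the module's implementation: the name call_tool is hypothetical, and the demonstration runs the current perl interpreter rather than mutool so that it works without MuPDF installed.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a helper such as _call_mutool: run a command,
# return its captured STDOUT, and die if it exits unsuccessfully.
sub call_tool {
    my ( $program, @args ) = @_;

    # The list form of a pipe open bypasses the shell, so the arguments
    # reach the program verbatim.
    open( my $fh, '-|', $program, @args )
        or die "Could not start $program: $!";
    binmode $fh;    # the output may be binary (for example PNG data)

    local $/;       # slurp mode: read everything in one go
    my $stdout = <$fh>;

    # For a pipe, close() returns false when the child exits with a
    # non-zero status; the status is then available in $?.
    close $fh
        or die "$program exited unsuccessfully (status " . ( $? >> 8 ) . ")";

    return $stdout;
}

# Demonstration with the running perl interpreter ($^X) in place of mutool.
my $out = call_tool( $^X, '-e', 'print "hello"' );
my $ok  = eval { call_tool( $^X, '-e', 'exit 3' ); 1 };
my $err = $@;
```

Wrapping the call in eval, as in the last lines, is one way for a caller to recover from the exception instead of letting it terminate the program.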
get_mutool_pdf_page_as_png($pdf_filename, $pdf_page_no)
This function returns a PNG stream that renders page number $pdf_page_no of the PDF file $pdf_filename.
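Because the return value is a raw PNG stream, a caller may want to sanity-check it before handing it to an image library. The sketch below tests for the eight-byte signature fixed by the PNG specification. The helper name looks_like_png is hypothetical and the sample data is a stand-in, not real output of this function.

```perl
use strict;
use warnings;

# The first eight bytes of every PNG stream: 89 50 4E 47 0D 0A 1A 0A.
my $PNG_SIGNATURE = "\x89PNG\x0d\x0a\x1a\x0a";

# Hypothetical helper: true if $bytes begins with the PNG signature.
sub looks_like_png {
    my ($bytes) = @_;
    return defined $bytes
        && length($bytes) >= 8
        && substr( $bytes, 0, 8 ) eq $PNG_SIGNATURE;
}

# Stand-in data for the demonstration.
my $is_png = looks_like_png( $PNG_SIGNATURE . 'rest of the stream' );
my $is_not = looks_like_png('%PDF-1.7');
```

When saving such a stream to disk, open the file with the :raw layer (or call binmode) so that no newline or encoding translation alters the bytes.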
get_mutool_text_stext_raw($pdf_filename, $pdf_page_no)
This function returns an XML string that contains structured text from page number $pdf_page_no of the PDF file $pdf_filename.
The XML format is defined by the output of mutool and looks like this (for page 23 of the pdf_reference_1-7.pdf file):
  <document name="test-data/test-data/PDF/Adobe/pdf_reference_1-7.pdf">
    <page width="531" height="666">
      <block bbox="261.18 616.16394 269.77765 625.2532">
        <line bbox="261.18 616.16394 269.77765 625.2532">
          <span bbox="261.18 616.16394 269.77765 625.2532" font="MyriadPro-Semibold" size="7.98">
            <char bbox="261.18 616.16394 265.50037 625.2532" x="261.18" y="623.2582" c="2"/>
            <char bbox="265.50037 616.16394 269.77765 625.2532" x="265.50037" y="623.2582" c="3"/>
          </span>
        </line>
      </block>
      <block bbox="225.78 88.20229 305.18158 117.93829">
        <line bbox="225.78 88.20229 305.18158 117.93829">
          <span bbox="225.78 88.20229 305.18158 117.93829" font="MyriadPro-Bold" size="24">
            <char bbox="225.78 88.20229 239.5176 117.93829" x="225.78" y="111.93829" c="P"/>
            <char bbox="239.5176 88.20229 248.4552 117.93829" x="239.5176" y="111.93829" c="r"/>
            <char bbox="248.4552 88.20229 261.1128 117.93829" x="248.4552" y="111.93829" c="e"/>
            <char bbox="261.1128 88.20229 269.28238 117.93829" x="261.1128" y="111.93829" c="f"/>
            <char bbox="269.28238 88.20229 281.93997 117.93829" x="269.28238" y="111.93829" c="a"/>
            <char bbox="281.93997 88.20229 292.50958 117.93829" x="281.93997" y="111.93829" c="c"/>
            <char bbox="292.50958 88.20229 305.18158 117.93829" x="292.50958" y="111.93829" c="e"/>
          </span>
        </line>
      </block>
    </page>
  </document>
Simplified, the high-level structure looks like:
  <page> -> [list of blocks]
    <block> -> [list of blocks]
      a block is either:
        - stext
            <line> -> [list of lines] (all have same baseline)
              <span> -> [list of spans] (horizontal spaces over a line)
                <char> -> [list of chars]
        - image
            TODO
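The nesting of lines, spans and chars can be walked to recover the text of each line. The following is a minimal sketch that works on a trimmed copy of the sample output (the bbox attributes of the char elements are dropped for brevity). The helper stext_lines is hypothetical and not part of this module. It uses regular expressions only to stay free of dependencies, and it does not decode XML entities; a real XML parser would be the more robust choice.

```perl
use strict;
use warnings;

# Trimmed copy of the sample output: <char> elements keep only x, y and c.
my $stext = <<'XML';
<document name="test-data/test-data/PDF/Adobe/pdf_reference_1-7.pdf">
<page width="531" height="666">
<block bbox="261.18 616.16394 269.77765 625.2532">
<line bbox="261.18 616.16394 269.77765 625.2532">
<span bbox="261.18 616.16394 269.77765 625.2532" font="MyriadPro-Semibold" size="7.98">
<char x="261.18" y="623.2582" c="2"/>
<char x="265.50037" y="623.2582" c="3"/>
</span>
</line>
</block>
<block bbox="225.78 88.20229 305.18158 117.93829">
<line bbox="225.78 88.20229 305.18158 117.93829">
<span bbox="225.78 88.20229 305.18158 117.93829" font="MyriadPro-Bold" size="24">
<char x="225.78" y="111.93829" c="P"/>
<char x="239.5176" y="111.93829" c="r"/>
<char x="248.4552" y="111.93829" c="e"/>
<char x="261.1128" y="111.93829" c="f"/>
<char x="269.28238" y="111.93829" c="a"/>
<char x="281.93997" y="111.93829" c="c"/>
<char x="292.50958" y="111.93829" c="e"/>
</span>
</line>
</block>
</page>
</document>
XML

# Hypothetical helper: return one string per <line> element, built from
# the c="..." attribute of every <char> inside it, in document order.
sub stext_lines {
    my ($xml) = @_;
    my @lines;
    while ( $xml =~ m{<line\b[^>]*>(.*?)</line>}gs ) {
        my $line_body = $1;
        my @chars = $line_body =~ m{<char\b[^>]*?\bc="([^"]*)"}g;
        push @lines, join( '', @chars );
    }
    return @lines;
}

my @lines = stext_lines($stext);
print "$_\n" for @lines;
```

For the sample page this yields two lines: the page number and the heading text.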
get_mutool_text_stext_xml($pdf_filename, $pdf_page_no)
Returns a HashRef of the structured text from page number $pdf_page_no of the PDF file $pdf_filename.
See the function get_mutool_text_stext_raw for details on the structure of this data.
get_mutool_page_info_raw($pdf_filename)
Returns an XML string of the page bounding boxes of PDF file $pdf_filename.
The data is in the form:
  <document>
    <page pagenum="1">
      <MediaBox l="0" b="0" r="531" t="666" />
      <CropBox l="0" b="0" r="531" t="666" />
      <Rotate v="0" />
    </page>
    <page pagenum="2">
      ...
    </page>
  </document>
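As an illustration of how this data might be used, the sketch below derives each page's width and height from its MediaBox. It assumes that the l, b, r and t attributes mean left, bottom, right and top, which is consistent with the 531 by 666 page size in the structured text sample. The helper page_sizes is hypothetical, ignores CropBox and Rotate, and uses regular expressions in place of a proper XML parser to stay dependency-free.

```perl
use strict;
use warnings;

# First page of the sample data shown above.
my $page_info = <<'XML';
<document>
<page pagenum="1">
<MediaBox l="0" b="0" r="531" t="666" />
<CropBox l="0" b="0" r="531" t="666" />
<Rotate v="0" />
</page>
</document>
XML

# Hypothetical helper: map each page number to { width, height },
# computed from the MediaBox (assumed: l/b/r/t = left/bottom/right/top).
sub page_sizes {
    my ($xml) = @_;
    my %size;
    while ( $xml =~ m{<page\s+pagenum="(\d+)">(.*?)</page>}gs ) {
        my ( $pagenum, $body ) = ( $1, $2 );
        next
            unless $body =~ m{<MediaBox\s+l="([^"]+)"\s+b="([^"]+)"\s+r="([^"]+)"\s+t="([^"]+)"};
        my ( $left, $bottom, $right, $top ) = ( $1, $2, $3, $4 );
        $size{$pagenum} = {
            width  => $right - $left,
            height => $top - $bottom,
        };
    }
    return \%size;
}

my $sizes = page_sizes($page_info);
printf "page %d: %s x %s\n", $_, $sizes->{$_}{width}, $sizes->{$_}{height}
    for sort { $a <=> $b } keys %$sizes;
```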
get_mutool_page_info_xml($pdf_filename)
Returns a HashRef containing the page bounding boxes of PDF file $pdf_filename.
See function get_mutool_page_info_raw for information on the structure of the data.
get_mutool_outline_simple($pdf_filename)
Returns the outline of the PDF file $pdf_filename as an ArrayRef[HashRef], which corresponds to the items attribute of Renard::Curie::Model::Outline.
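To show how an ArrayRef[HashRef] outline might be consumed, here is a short sketch that renders it as an indented table of contents. The keys level, text and page are assumptions made for this illustration and are not confirmed by this documentation. The entries are made-up sample data, and format_outline is a hypothetical helper.

```perl
use strict;
use warnings;

# Made-up outline entries; the keys level, text and page are assumed.
my $items = [
    { level => 0, text => 'Chapter 1',   page => 1 },
    { level => 1, text => 'Section 1.1', page => 2 },
    { level => 0, text => 'Chapter 2',   page => 5 },
];

# Hypothetical helper: one formatted string per outline entry, indented
# by two spaces for each nesting level.
sub format_outline {
    my ($outline) = @_;
    my @formatted;
    for my $item (@$outline) {
        my $indent = '  ' x $item->{level};
        push @formatted, "$indent$item->{text} (page $item->{page})";
    }
    return @formatted;
}

my @toc = format_outline($items);
print "$_\n" for @toc;
```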
Project Renard
This software is copyright (c) 2016 by Project Renard.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.