                my $abs_outputname = $abs_tmp.'/'.$tmpid.'_page_%04d.pdf';
                print STDERR " abs outputname format : $abs_outputname\n" if DEBUG;
                

                my @args = ($self->_pdftk, $abs_tmp_pdf,'burst','output',$abs_outputname );
                unless( system(@args) == 0 ){
                        warn("pdftk burst fails... system @args - $?");
                        $self->{abs_pages} = [];
                        return $self->{abs_pages};
                }       

                print STDERR " pdftkburst ok for $abs_tmp_pdf\n" if DEBUG;

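                # use a lexical dirhandle and check that the temp dir can be read
                opendir(my $dh, $abs_tmp) or do {
                        warn("cannot open $abs_tmp: $!");
                        $self->{abs_pages} = [];
                        return $self->{abs_pages};
                };
                my @abs_pages = map { "$abs_tmp/$_" }
                        sort grep { m/^\Q$tmpid\E_page_\d+\.pdf$/ } readdir $dh;
                closedir $dh;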

                unless( scalar @abs_pages) {
                        warn("no pages in $abs_pdf"); # or just warn() ?
                        $self->{abs_pages} = [];
                        return $self->{abs_pages};
                }       


                if (DEBUG){
                        print STDERR "pagefiles:\n";
                        print STDERR " $_\n" for @abs_pages;
                }

NAME

PDF::OCR::Thorough - DEPRECATED extract text from a PDF document, resorting to OCR as needed

SYNOPSIS

        use PDF::OCR::Thorough;

        my $abs_pdf = '/home/myself/file.pdf';

        my $p = PDF::OCR::Thorough->new($abs_pdf);

        my $text = $p->get_text;

DEPRECATED

This module is deprecated and superseded by PDF::OCR2. Please do not use this code in new applications.

DESCRIPTION

Unlike PDF::OCR, which assumes that each page in the PDF document is a page scan, this module is more "thorough".

How it works

   1) The original.pdf is copied to tmp.pdf

   2) tmp.pdf is split into page1.pdf, page2.pdf, etc.

   3) For each pageX.pdf, we first try reading with pdftotext;
      if the result is too small, we try to read with Image::OCR::Tesseract.

   4) The output for each page is merged, separated by newpage (\f) chars.
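The steps above can be sketched roughly as follows. This is only an illustration, not the module's actual code: the function names, the $MIN_CHARS threshold and the two code references (stand-ins for pdftotext and Image::OCR::Tesseract) are assumptions made for the example.

```perl
use strict;
use warnings;

# Hypothetical threshold: pdftotext output with fewer non-whitespace
# characters than this is treated as "too small".
my $MIN_CHARS = 20;

# $extract_text and $extract_ocr are caller-supplied code references.
# They stand in for pdftotext and Image::OCR::Tesseract respectively.
sub page_text_with_fallback {
    my ($abs_page, $extract_text, $extract_ocr) = @_;

    my $text = $extract_text->($abs_page);
    $text = '' unless defined $text;

    # Judge the size on non-whitespace characters only.
    (my $stripped = $text) =~ s/\s+//g;
    return $text if length($stripped) >= $MIN_CHARS;

    # Too little text: fall back to OCR.
    my $ocr = $extract_ocr->($abs_page);
    return defined $ocr ? $ocr : '';
}

# Merge per-page text with form feed (newpage) characters.
sub merge_pages {
    my @page_texts = @_;
    return join("\f", @page_texts);
}
```

Passing the extractors as code references keeps the sketch independent of the external tools, so the decision logic can be exercised without pdftotext or tesseract installed.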

The output to STDOUT is the text of all pages, separated by newpage (form feed) characters. These can be matched with the regex \f:

   my @page = split(/\f/, $output );
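One detail worth keeping in mind when splitting: Perl's split drops trailing empty fields by default, so a document ending in blank pages would appear to have fewer pages. A minimal sketch, assuming a hypothetical helper named split_pages and made-up sample text:

```perl
use strict;
use warnings;

# Split merged output into pages. The -1 limit keeps trailing empty
# pages, which a plain split would silently drop.
sub split_pages {
    my ($output) = @_;
    return split /\f/, $output, -1;
}

# Sample text only; real input would be the module's merged output.
my @pages = split_pages("first page\fsecond page\f");
printf "page %d has %d characters\n", $_ + 1, length $pages[$_] for 0 .. $#pages;
```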

Please note that PDF::API2 is used to check that the PDF data is valid.

This is part of the PDF::OCR Package.

METHODS

new()

Argument is the abs path to the PDF you want to read text from.

        my $p = PDF::OCR::Thorough->new('/home/myself/myfile.pdf');

If the file is not there or the pdf data is corrupt, warns and returns undef.

pdf_data_ok()

Takes no argument. Checks whether the PDF is ok, that is, whether PDF::API2 can open it. This is called by the constructor.

pages()

Returns number of page files extracted.

abs_tmp()

Returns abs path to the temp dir created. This is where the copy of your file resides, together with any images extracted, and page files extracted.

get_ocr()

Argument is the abs path to an image file. Returns the OCR text. This is also cached in the object.

abs_pdf()

Abs path to your original pdf provided as argument to constructor.

filename()

Returns filename of the original pdf provided as argument to constructor.

abs_tmp_pdf()

Returns the abs path to the temp copy of the PDF.

abs_images()

Optional argument is the abs path to a page file (see abs_pages()). If no argument is provided, returns the abs paths to all images extracted from all pages.

get_page_text()

Argument is a page number (there is no page 0) or the abs path to a page file. Returns the text inside. See also get_text().

get_text()

Returns all text in all pages, separated by \f newpage chars. See also get_page_text().

abs_pages()

Returns the abs paths to the burst PDF pages.

force_ocr()

Argument is a boolean (1/0). Forces extracting images and running OCR even if pdftotext finds content. Returns the value.

You would want to set this to 1 if you expect your PDF to contain both text and large images, perhaps with text in them also, and you want both extracted.
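The effect of the flag can be pictured with a small sketch. It is not taken from the module: the function name page_content, the OCR callback, and the choice to join both results with a newline are assumptions for illustration only.

```perl
use strict;
use warnings;

# $pdftotext_out : text already obtained from pdftotext (may be empty)
# $ocr_cb        : code reference that runs OCR on the page images
# $force_ocr     : boolean, mirrors the force_ocr() setting
sub page_content {
    my ($pdftotext_out, $ocr_cb, $force_ocr) = @_;

    my $has_text = defined $pdftotext_out && $pdftotext_out =~ /\S/;

    # Without the flag, any text found by pdftotext is enough.
    return $pdftotext_out if $has_text && !$force_ocr;

    # Forced, or nothing found: run OCR and keep whatever we have.
    my $ocr = $ocr_cb->();
    my @parts = grep { defined($_) && length($_) }
        ($has_text ? $pdftotext_out : ()), $ocr;
    return join("\n", @parts);
}
```

With the flag off, the OCR callback is never invoked for pages that already have text, which is the cheaper path. With it on, both sources contribute.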

DESTROY

Calls cleanup() if DEBUG is not on and the temp dir is in tmp.

cleanup()

Removes all temp content. Pretty rough: uses File::Path::rmtree(). Returns true.
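A guarded cleanup along these lines might look like the sketch below. The function name cleanup_tmp is hypothetical, and reading "in tmp" as "under the system temp directory reported by File::Spec" is an assumption, not something the module documents.

```perl
use strict;
use warnings;
use File::Path qw(rmtree);
use File::Spec;
use File::Temp qw(tempdir);

# Remove $abs_tmp unless debugging is on or the directory lies
# outside the system temp dir. Returns 1 if removal was attempted.
sub cleanup_tmp {
    my ($abs_tmp, $debug) = @_;

    return 0 if $debug;    # keep files around for inspection

    # Assumption: "in tmp" means under File::Spec->tmpdir.
    my $tmproot = File::Spec->tmpdir;
    return 0 unless index($abs_tmp, $tmproot) == 0;

    rmtree($abs_tmp);
    return 1;
}

# Usage: create a scratch directory, then remove it again.
my $scratch = tempdir(DIR => File::Spec->tmpdir, CLEANUP => 0);
cleanup_tmp($scratch, 0);
print "removed $scratch\n" unless -d $scratch;
```

The prefix check is a cheap safeguard against calling rmtree() on a directory the caller did not intend to be disposable.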

CAVEATS

DEPRECATED.

Will not work with a corrupted PDF file. It does test for that, however, so if it doesn't work, you can tell whether the cause is a PDF document that PDF::API2 considers broken.

SEE ALSO

PDF::OCR2 - supersedes this module. PDF::OCR - parent package. PDF::API2 - excellent pdf api.

REQUIREMENTS

File::Copy, PDF::API2, PDF::GetImages, Image::OCR::Tesseract, File::Which

NON PERL REQUIREMENTS

tesseract, pdftk, xpdf, pdftotext

AUTHOR

Leo Charre leocharre at cpan dot org

COPYRIGHT

Copyright (c) 2009 Leo Charre. All rights reserved.

LICENSE

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, i.e., under the terms of the "Artistic License" or the "GNU General Public License".

DISCLAIMER

This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the "GNU General Public License" for more details.
