- CRIT AND SUGGESTIONS
- SEE ALSO
PDF::OCR2 - extract all text and all image ocr from pdf
use PDF::OCR2; my $p = PDF::OCR2->new('./path/to/file.pdf'); my $text_all = $p->text; my @text_pages = $p->text; my $page_object = $p->page(1);
This is meant to replace PDF::OCR. The backend complexity of this process has been isolated in modules:
PDF::GetImages PDF::Burst Image::OCR::Tesseract PDF::OCR2::Pages - in this distro.
Why not just modify PDF::OCR?? This is such a massive breakdown of code hierachy and interdependency, and such a different interface, that this made more sense. PDF::OCR was ok. But it was messy and really, after some discussion- this is a lot better.
Argument is path to pdf file. If there are errors opening the file, warns and returns undef.
Takes no argument. In scalar context, returns text of all pages, joined with a pagebreak \f character. In list context, returns text of pages one per element.
Argument is page number (starting with 1) or abs path to temporary page file. Returns PDF::OCR2::Page object. Croaks if you ask for an invalid number.
Returns number of pages. Number of temporary files. Calls PDF::Burst.
This only works on posix.
You call text() and you get a fatal. Loading a 'corrupt' pdf with PDF::API2 can trigger an error such as this;
Malformed xref in PDF file at /usr/lib/perl5/site_perl/5.8.8/PDF/API2/Basic/PDF/File.pm line 1198.
This happens because.. All pdfs are equal, but some pdfs are more equal than others. There's fifty kinds of pdf doc versions, etc. Sometimes the pdf is deemed to be corrupt by PDF::API2.
You can "fix" this problem with pdftk..
pdftk $in $out
But, this means modifying the original pdf, which is sketchy.
Maybe if the xref table is bad, we should run the operation on a repaired copy!
- Try using another burst method
If you have errors with PDF::API2 saying the pdf is corrupt, likely via PDF::Burst.. Then try this:
use PDF::OCR2; PDF::Burst::BURST_METHOD = 'CAM_PDF'; # and then... my $pdf = PDF::OCR2->new('./pathtofile.pdf'); print $pdf->text;
- Enable checking the pdf.
If you suspect the pdf is broken, or only want to run this on pdf docs that check ok with PDF::API2
use PDF::OCR2; $PDF::OCR2::CHECK_PDF = 1; # and then... my $pdf = PDF::OCR2->new('./pathtofile.pdf'); print $pdf->text;
This is not enabled by default because it is more expensive. Maybe it should be.
By default this flag is on. We check the pdf with an eval to PDF::API2 to make sure the pdf does not have errors. This takes a small toll on performance. I suggest to leave it on.
By default this flag is off. If the pdf checks bad, we attempt to repair the pdf *to a copy of the file*- this file is put alongside the original and is named $filename_repaired_xref_table.pdf, once this is created, we check again.
So, if both CHECK_PDF and REPAIR_XREF flags are on; 1. the pdf is checked for correctness 2. if the pdf is bad, we attempt to fix to a copy of the file 3. if we can't make a fixed copy, we don't die, but warn and return undef
Thus, if check pdf fails, and repair xref flag is on, we are doing two evals, it could be argued this is expensive, and it is- but then- ocr is expensive, period.
The AUTHOR is open to any suggestions and requests.
PDF::Burst Split a pdf into pages.
PDF::GetImages Split a page into images.
PDF::OCR2::Page Part of this distro.
pdfcheck Included is a program that may be of use. It helps to check a pdf for problems, stats. Very alpha, useful though.
PDF::OCR - deprecated by this module.
Leo Charre leocharre at cpan dot org
These people have made useful inquiries, requests, critiques, code suggestions. Ultimately, they help develop this work.
Copyright (c) 2011 Leo Charre. All rights reserved.
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, i.e., under the terms of the "Artistic License" or the "GNU General Public License".
This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the "GNU General Public License" for more details.