PDF::OCR - DEPRECATED get ocr and images out of a pdf file
    use PDF::OCR;
    my $p = new PDF::OCR('/path/to/file.pdf');
    my $text = $p->get_ocr;
    use PDF::OCR;

    my $abs_pdf = '/path/to/file.pdf';
    my $p = new PDF::OCR($abs_pdf);

    # extract images, get list of paths
    my $images = $p->abs_images;

    # get ocr content for each image
    for my $abs_image ( @{$images} ){
       my $content = $p->get_ocr($abs_image);
       print "image $abs_image had content: $content\n\n";
    }

    # get ocr content for all images as one scalar, with pagebreaks
    my $ocrs = $p->get_ocr;
    print "$abs_pdf had [$ocrs]\n";

    # get content of all images as array ref, each element is one image's text
    my @ocrs = @{ $p->get_ocr_arrayref };
    print "$abs_pdf had [@ocrs]\n";
This module is deprecated in favor of PDF::OCR2. Please do not use this code in new applications.
After much thought and discussion on perlmonks.org, it seemed the best thing was to deprecate this code and upload PDF::OCR2. PDF::OCR was offered with a development caveat. A lot of people ended up downloading and using PDF::OCR, and by the time I was ready to update, the API change was too radical. I didn't want to break anybody's code.
Thanks to perlmonks.org for discussion and resolution on the matter.
Lets you get text out of pages in pdf documents.
The whole process does not change your original pdf in any way.
Please note this only gets text out of images inside the pdf file. It does not check for genuine text inside the file, if any. For that, please see PDF::OCR::Thorough.
If you scan paper documents into PDFs, as 'modern' office environments do, then these modules are useful to you.
The argument is the pdf file you want to run ocr on.
my $o = new PDF::OCR('/path/to/file.pdf');
This will copy the file to a tmp file.
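To illustrate why working on a copy leaves the original untouched, here is a minimal sketch using only core modules. It is not PDF::OCR's actual implementation; the helper name copy_to_tmp and the placeholder file content are assumptions made for the example.

```perl
use strict;
use warnings;
use File::Copy qw(copy);
use File::Temp qw(tempfile);

# Hypothetical helper: copy a file to a fresh temp file and return the temp path.
# All later work would happen on the copy, so the source is never modified.
sub copy_to_tmp {
    my ($abs_src) = @_;
    my ( $fh, $abs_tmp ) = tempfile(
        'pdfocr_XXXXXX',
        SUFFIX => '.pdf',
        TMPDIR => 1,
        UNLINK => 1,    # remove the temp file when the program exits
    );
    close $fh;
    copy( $abs_src, $abs_tmp ) or die "copy failed: $!";
    return $abs_tmp;
}

# Stand-in for a real pdf so the sketch runs anywhere.
my ( $src_fh, $abs_src ) = tempfile( 'source_XXXXXX', TMPDIR => 1, UNLINK => 1 );
print {$src_fh} "%PDF-1.4 placeholder\n";
close $src_fh;

my $abs_tmp = copy_to_tmp($abs_src);
print "working copy: $abs_tmp\n";
```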
abs_images() returns an array ref of paths to the images extracted from the pdf.
get_ocr() takes an optional argument, the abs path of an image extracted from the pdf, and returns the ocr content of that image.
If no argument is given, the ocr contents of all images are concatenated and returned as one scalar, separated by pagebreak chars (these can be matched in a regex with \f).
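Because the combined output is delimited by pagebreak chars, it can be split back into per-image chunks. A minimal sketch follows; the helper name split_ocr_pages and the sample string are assumptions, and the sample stands in for the return value of get_ocr so that the code runs without PDF::OCR installed.

```perl
use strict;
use warnings;

# Hypothetical helper: split a form-feed-delimited OCR string into chunks.
# In real use the input would be the scalar returned by $p->get_ocr
# when called with no argument.
sub split_ocr_pages {
    my ($text) = @_;
    return [] unless defined $text && length $text;
    my @pages = split /\f/, $text;
    s/^\s+|\s+$//g for @pages;            # trim surrounding whitespace
    return [ grep { length } @pages ];    # drop chunks with no text
}

my $sample = "Invoice 1001\n\fPage two text\n\f\n";
my $pages  = split_ocr_pages($sample);

printf "%d non-empty chunks\n", scalar @$pages;
print "chunk $_: $pages->[$_]\n" for 0 .. $#$pages;
```

Dropping empty chunks is a choice made for this sketch; keep them if you need chunk positions to line up with image order.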
get_ocr_arrayref() returns the ocr content of all images as an array ref, one element per image. This is the text.
Erases the temp file and all extracted image files. Called by DESTROY, unless the DEBUG flag is on.
$PDF::OCR::DEBUG = 1;
Please notify the AUTHOR if you find any bugs.
DEPRECATED.
This module is for Unix type systems. It is not intended to run on other "systems" and no support for such will be added in the future. Attempting to install on an unsupported OS will throw an exception.
This module is in development, please notify the AUTHOR with any feedback.
Please see INSTALL help notes.
PDF::OCR2 - PDF::OCR successor.
PDF::GetImages - get images out of pdf documents.
Image::OCR::Tesseract - tesseract perl wrapper.
PDF::API2 - excellent pdf api.
http://code.google.com/p/tesseract-ocr/ - tesseract optical character recognition code.
Leo Charre leocharre at cpan dot org
Copyright (c) 2009 Leo Charre. All rights reserved.
This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, i.e., under the terms of the "Artistic License" or the "GNU General Public License".
This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
See the "GNU General Public License" for more details.