
NAME

PDF::Extract - Extracting sub PDF documents from a multi page PDF document

SYNOPSIS

 use PDF::Extract;
 $pdf=new PDF::Extract;
 $pdf->servePDFExtract( PDFDoc=>"c:/Docs/my.pdf", PDFPages=>"1-3 31-36" );

or

 use PDF::Extract;
 $pdf = new PDF::Extract( PDFDoc=>'C:/my.pdf' );
 $pdf->getPDFExtract( PDFPages=>$PDFPages );
 print "Content-Type: text/plain\n\n<xmp>", $pdf->getVars("PDFExtract");
 print $pdf->getVars("PDFError");

or

 # Extract and save, in the current directory, all the pages in a pdf document
 use PDF::Extract;
 $pdf=new PDF::Extract( PDFDoc=>"test.pdf");
 $i=1;
 $i++ while ( $pdf->savePDFExtract( PDFPages=>$i ) );

DESCRIPTION

PDF Extract is a group of methods that allow the user to quickly grab pages as a new PDF document from a pre-existing PDF document.

With PDF::Extract a new PDF document can be:-

  • assigned to a scalar variable with getPDFExtract.

  • saved to disk with savePDFExtract.

  • printed to STDOUT as a PDF web document with servePDFExtract.

  • cached and served for a faster PDF web document service with fastServePDFExtract.

These four main methods can be called with or without arguments. The methods will not work unless they know the location of the original PDF document. PDFPages defaults to "1". There are no other default values.

There are four other methods that deal with setting and getting the public variables.

  • getPDFExtractVariables can return an array of variables.

  • getVars is an alias of getPDFExtractVariables

  • setPDFExtractVariables can set the public variables.

  • setVars is an alias of setPDFExtractVariables

METHODS

new PDF::Extract

Creates a new Extract object with empty state information ready for processing data both input and output. New can be called with a hash array argument.

 new PDF::Extract( PDFDoc=>"c:/Docs/my.pdf", PDFPages=>"1-3 31-36" )

This will cause a new PDF document to be generated unless there is an error. Extract->new() simply calls getPDFExtract() if there is an argument.

getPDFExtract

This method is the main workhorse of the package. It does all the PDF processing and sets PDFError if it is unable to create a new PDF document. It requires PDFDoc and PDFPages to be set, either in this call or beforehand. It returns a PDF document as a string, or undef if there is an error.

To create an array of PDF documents, each consisting of a single page, from a multi page PDF document:

 $pdf = new PDF::Extract( PDFDoc=>'C:/my.pdf' );
 $i=1;
 # store each single page document in @pages until a page is not found
 $i++ while ( $pages[$i]=$pdf->getPDFExtract( PDFPages=>$i ) );

The lowest valid page number for PDFPages is 1. A value of undef will produce no output and raise an error. An error will be raised if the PDFPages values do not correspond to any pages.
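The page rules above can be illustrated with a small stand-alone sketch. The helper name expand_pages is hypothetical and is not part of PDF::Extract; it assumes page items are separated by spaces, as in the examples in this document, and it only approximates how the module itself reads a PDFPages string.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of PDF::Extract: expand a page
# specification such as "1-3 31-36" into individual page numbers.
sub expand_pages {
    my ($spec) = @_;
    my @pages;
    for my $item ( split ' ', ( defined $spec ? $spec : '' ) ) {
        if ( $item =~ /^(\d+)-(\d+)$/ && $1 >= 1 && $1 <= $2 ) {
            push @pages, ( $1 + 0 ) .. ( $2 + 0 );    # a range such as "3-6"
        }
        elsif ( $item =~ /^(\d+)$/ && $1 >= 1 ) {
            push @pages, $1 + 0;                      # a single page such as "1"
        }
        else {
            die "Invalid page item '$item'\n";        # page 0, "6-3", text, ...
        }
    }
    die "No pages requested\n" unless @pages;         # undef or empty string
    return @pages;
}

print join( ", ", expand_pages("1 3-6") ), "\n";      # 1, 3, 4, 5, 6
```

Validating the string this way before calling the module gives an early, specific error message rather than relying on PDFError afterwards.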

savePDFExtract

This method saves its output to the directory defined for PDFCache (see PDFCache). If PDFSaveAs is unset the new PDF's filename will be an amalgam of the original filename, the requested page numbers and the .pdf file type suffix. If more than one page is extracted into a new PDF the page numbers will be separated with an underscore "_" for individual pages and ".." for a range of pages, e.g. my6.pdf for a single page (page 6) and my1_3..6.pdf for a multi page PDF (pages 1, 3, 4, 5, 6).

 $pdf->savePDFExtract(PDFPages=>"1 3-6", PDFDoc=>'C:/my.pdf', PDFCache=>"C:/myCache" );

If there is an error then an error page will be served and savePDFExtract will return "0". Otherwise savePDFExtract will return "1" and the saved PDF location and file name will be "C:/myCache/my1_3..6.pdf".
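The default naming rule described above can be sketched in a few lines. The helper default_extract_name is hypothetical and only mirrors the documented behaviour (underscore between items, ".." for ranges); the module's own code may differ in edge cases such as stray whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's own code: derive the default
# output name from the source path and a PDFPages string.
sub default_extract_name {
    my ( $doc, $pages ) = @_;
    ( my $base = $doc ) =~ s{^.*[/\\]}{};      # drop any directory part
    $base =~ s/\.pdf$//i;                      # drop the .pdf suffix
    ( my $tag = $pages ) =~ s/^\s+|\s+$//g;    # trim surrounding spaces
    $tag =~ s/\s+/_/g;                         # "1 3-6"  becomes "1_3-6"
    $tag =~ s/-/../g;                          # "1_3-6"  becomes "1_3..6"
    return "$base$tag.pdf";
}

print default_extract_name( 'C:/my.pdf', '6' ),     "\n";    # my6.pdf
print default_extract_name( 'C:/my.pdf', '1 3-6' ), "\n";    # my1_3..6.pdf
```

In real code the authoritative name is the one reported by the PDFFilename variable after a successful call.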

servePDFExtract

This method serves its output to STDOUT with the correct header for a PDF document served on the web.

 $pdf = PDF::Extract->new(
            PDFDoc=>'C:/my.pdf', 
            PDFErrorPage=>"C:/myErrorPage.html" );
 $pdf->servePDFExtract( PDFPages=>1);

If there is an error then an error page will be served and servePDFExtract will return "0". Otherwise servePDFExtract will return "1".

fastServePDFExtract

This method serves its output to STDOUT with the correct header for a PDF document served on the web.

If PDFSaveAs is unset the new PDF's filename will be an amalgam of the original filename, the requested page numbers and the .pdf file type suffix. If more than one page is extracted into a new PDF the page numbers will be separated with an underscore "_" for individual pages and ".." for a range of pages, e.g. my6.pdf for a single page (page 6) and my1_3..6.pdf for a multi page PDF (pages 1, 3, 4, 5, 6). If there is an error then an error page will be served and fastServePDFExtract will return "0". fastServePDFExtract will return "1" on success.

 $pdf->setVars(
            PDFDoc=>'C:/my.pdf', 
            PDFCache=>"C:/myCache", 
            PDFErrorPage=>"C:/myErrorPage.html",
            PDFPages=>1);
 unless ($pdf->fastServePDFExtract ) {   
    # there was an error  
    $error=$pdf->getVars("PDFError") ;
 }
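The general idea behind serving from a cache can be sketched as follows. This is an assumption about the technique, not a description of how fastServePDFExtract is implemented: the helper cached_or_fresh and the stand-in builder are hypothetical, and the sketch simply reuses a file if it already exists and otherwise builds and stores it.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: return cached bytes when the file exists,
# otherwise call $make to build them and store the result.
sub cached_or_fresh {
    my ( $cache_dir, $name, $make ) = @_;
    my $path = "$cache_dir/$name";
    if ( -e $path ) {
        open my $in, '<:raw', $path or die "Cannot read $path: $!";
        local $/;                          # slurp the whole file
        return scalar <$in>;
    }
    my $data = $make->();
    return unless defined $data;           # nothing to cache on failure
    open my $out, '>:raw', $path or die "Cannot write $path: $!";
    print {$out} $data;
    close $out or die "Cannot close $path: $!";
    return $data;
}

my $dir   = tempdir( CLEANUP => 1 );
my $calls = 0;
my $build = sub { $calls++; return "%PDF-1.4 stand-in data" };
cached_or_fresh( $dir, "my1.pdf", $build ) for 1 .. 3;
print "builder ran $calls time(s)\n";      # builder ran 1 time(s)
```

With PDF::Extract the builder would be a call such as getPDFExtract; the cache only helps when the same pages of the same document are requested repeatedly.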

getPDFExtractVariables

Get any of the public variables by passing a list of their names.

 ($error,$found)=$pdf->getPDFExtractVariables( "PDFError", "PDFPagesFound");

This method returns an array of values corresponding to the named variables passed in as arguments. If a variable is undefined then its returned value will be undefined.

getVars

This method is an alias for getPDFExtractVariables. Get any of the public variables by passing a list of their names.

 @vars=$pdf->getVars( @varNames );

This method returns an array of values corresponding to the named variables passed in as arguments. If a variable is undefined then its returned value will be undefined.

setPDFExtractVariables

Set any of the public variables using a hash of the variables and their values.

 ($doc,$pages)=$pdf->setPDFExtractVariables(PDFDoc=>'C:/my.pdf', PDFPages=>1);

This method sets the variables specified in the argument hash. It returns an array of the new values set.

setVars

This method is an alias for setPDFExtractVariables. Set any of the public variables using a hash of the variables and their values.

 @vars=$pdf->setVars( %vars );

This method sets the variables specified in the argument hash. It returns an array of the new values set.

VARIABLES

PDFDoc (set and get)

 $file=$pdf->getVars("PDFDoc");

This variable contains the path to the last original PDF document accessed by getPDFExtract, savePDFExtract, servePDFExtract and fastServePDFExtract. PDFDoc will be an empty string if there was an error.

PDFPages (set and get)

 $pages=$pdf->setVars( PDFPages =>"1 18-23");
 or
 $pages=$pdf->getVars("PDFPages");

This variable contains a list of pages to extract from the original PDF document accessed by getPDFExtract, savePDFExtract, servePDFExtract and fastServePDFExtract. Use the join function to create a list of pages from an array, such as an array of pages sent from a multi select box on a web form. PDFPages will default to "1" if unset or if there is an error processing the pages string.

 PDFPages => join( " ", $cgi->param( "PDFPages" )),
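Because form values come from the browser, it may be worth filtering them before joining. The following is a minimal sketch under that assumption; pages_from_list is a hypothetical helper, not part of PDF::Extract, and it accepts only plain page numbers and simple ranges.

```perl
use strict;
use warnings;

# Hypothetical helper: keep only well-formed page items ("7" or "3-9")
# and join them with spaces, the separator used in PDFPages examples.
sub pages_from_list {
    my @items = @_;
    my @clean = grep { defined($_) && /^\d+(?:-\d+)?$/ } @items;
    return join ' ', @clean;
}

# Values as they might arrive from a multi select box.
my @selected = ( '1', '18-23', 'x; rm -rf', '40' );
print pages_from_list(@selected), "\n";    # 1 18-23 40
```

The result can then be passed as the PDFPages value in place of the bare join shown above.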

PDFCache (set and get)

 $cachePath=$pdf->setVars( PDFCache =>"C:/myCache");
 or
 $cachePath=$pdf->getVars("PDFCache");

This variable, if set, should contain the FULL PATH to the PDF document cache. This value is used by savePDFExtract and fastServePDFExtract method calls. PDFCache will be an empty string if there was an error in setting the value. If the PDFCache path does not exist an attempt will be made to create it recursively. Any directories that need to be created will be created with permissions of 0777 (octal). PDFCache defaults to ".", the current directory.
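If you prefer to create the cache directory yourself before handing it to the module, the core File::Path module can do the recursive creation. This is a sketch only: ensure_cache_dir is a hypothetical helper, and the octal mode 0777 is an assumption that is further restricted by the process umask.

```perl
use strict;
use warnings;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Hypothetical helper: make sure a cache directory exists, creating
# any missing parent directories along the way.
sub ensure_cache_dir {
    my ($dir) = @_;
    return 1 if -d $dir;
    # the requested mode is combined with the umask by the system
    make_path( $dir, { mode => 0777, error => \my $errors } );
    return 0 if $errors && @$errors;
    return -d $dir ? 1 : 0;
}

my $root = tempdir( CLEANUP => 1 );
print ensure_cache_dir("$root/pdf/cache")
    ? "cache ready\n"
    : "cache unavailable\n";
```

Creating the directory up front lets a script report a permissions problem in its own words before any PDF processing starts.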

PDFSaveAs (set and get)

 $filename=$pdf->setVars( PDFSaveAs =>"myFileName");
 or
 $filename=$pdf->getVars("PDFSaveAs");
 

If PDFSaveAs is unset the new PDF's filename will be an amalgam of the original filename, the requested page numbers and the .pdf file type suffix. If more than one page is extracted into a new PDF the page numbers will be separated with an underscore "_" for individual pages and ".." for a range of pages, e.g. my6.pdf for a single page (page 6) and my1_3..6.pdf for a multi page PDF (pages 1, 3, 4, 5, 6).

Setting PDFSaveAs to something other than "" or 0 will cause the output to be named with the content of PDFSaveAs. The .pdf filename extension and any path information will be stripped from the variable if set. PDFFilename will contain the actual filename used for the last extracted PDF.

PDFErrorPage (set and get)

 $errorPagePath=$pdf->setVars("PDFErrorPage"=>"C:/myError.html");
 or
 $errorPagePath=$pdf->getVars("PDFErrorPage");

PDFErrorPage is a text file that can be used as a template for the error page. If the PDFErrorPage contains [PDFError], the word PDFError surrounded by square brackets, then the error description will replace [PDFError]. Otherwise you can devise a generic error description and describe remedial actions to be taken by the viewer.

If this variable is not set then a default error page will be used. The default page has a message in red at the top, "There is system problem in processing your PDF Pages request.", and then a description of the actual error follows underneath in black.
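The placeholder substitution described above amounts to a simple string replacement. The sketch below shows the idea with a hypothetical helper, render_error_page; whether the module escapes the error text for HTML is not stated here, so the sketch inserts it unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: substitute an error description for the
# [PDFError] placeholder in an error page template.
sub render_error_page {
    my ( $template, $error ) = @_;
    ( my $page = $template ) =~ s/\[PDFError\]/$error/g;
    return $page;
}

my $template = "<html><body><p>Sorry: [PDFError]</p></body></html>";
print render_error_page( $template, "PDF document not found" ), "\n";
```

A template without the placeholder passes through unchanged, which matches the generic error page case described above.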

PDFExtract (get only)

 $out=$pdf->getVars("PDFExtract");

This variable contains the last PDF document processed by getPDFExtract, savePDFExtract, servePDFExtract and fastServePDFExtract. PDFExtract will be an empty string if there was an error.

PDFPagesFound (get only)

 $pagesFound=$pdf->getVars("PDFPagesFound");
 or
 @pages = split ", ", $pdf->getVars("PDFPagesFound");

This variable contains a comma separated list of the page numbers that were selected and found within the original PDF document. PDFPagesFound will be undefined if there was an error in finding any pages.
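One use of this list is to check which requested pages were not found. The following sketch assumes the comma separated format described above; missing_pages is a hypothetical helper, and the sample string stands in for a real PDFPagesFound value.

```perl
use strict;
use warnings;

# Hypothetical helper: given the pages that were requested and the
# comma separated PDFPagesFound string, list the pages not found.
sub missing_pages {
    my ( $requested, $found_string ) = @_;
    my %found = map { $_ => 1 }
        split /\s*,\s*/, ( defined $found_string ? $found_string : '' );
    return grep { !$found{$_} } @$requested;
}

my @missing = missing_pages( [ 2, 6, 7, 8 ], "2, 6, 7" );
print "Missing: @missing\n";    # Missing: 8
```

An undefined PDFPagesFound value is treated as an empty list, so every requested page is reported as missing in that case.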

PDFPageCount (get only)

 $pageCount=$pdf->getVars("PDFPageCount");

This variable contains the number of pages that were selected and found within the original PDF document. PDFPageCount will be an empty string if there was an error in finding any pages.

PDFFilename (get only)

 $filename=$pdf->getVars("PDFFilename");
 

This variable will contain the actual filename used for the last extracted PDF. If PDFSaveAs is unset the new PDF's filename will be an amalgam of the original filename, the requested page numbers and the .pdf file type suffix. If more than one page is extracted into a new PDF the page numbers will be separated with an underscore "_" for individual pages and ".." for a range of pages, e.g. my6.pdf for a single page (page 6) and my1_3..6.pdf for a multi page PDF (pages 1, 3, 4, 5, 6). If PDFSaveAs is set then PDFSaveAs will be used to construct PDFFilename. The full path to the extracted PDF file can be obtained with either of the following:

 $fullpath = $pdf->getVars("PDFCache") ."/". $pdf->getVars("PDFFilename");
 or
 ($path,$filename) = $pdf->getVars("PDFCache","PDFFilename");

PDFError (get only)

 $error=$pdf->getVars("PDFError");

This variable contains a string describing the errors, if any, in processing the original PDF file. PDFError is guaranteed to be set if getPDFExtract, savePDFExtract, servePDFExtract or fastServePDFExtract fail and return "0". PDFError will be an empty string if there was no error.
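Scripts that prefer exceptions to return codes can wrap the check in a small function. This is a sketch, not part of the module: extract_or_die is a hypothetical wrapper that relies only on the getPDFExtract and getVars methods documented above.

```perl
use strict;
use warnings;

# Hypothetical wrapper: call getPDFExtract on a PDF::Extract object and
# turn a failure into an exception carrying the PDFError text.
sub extract_or_die {
    my ( $pdf, %args ) = @_;
    my $out = $pdf->getPDFExtract(%args);
    return $out if defined $out && length $out;
    my ($error) = $pdf->getVars("PDFError");
    die "PDF extract failed: " . ( $error || "unknown error" ) . "\n";
}

# Usage, assuming PDF::Extract is installed and test.pdf exists:
#   my $pdf  = PDF::Extract->new( PDFDoc => "test.pdf" );
#   my $data = eval { extract_or_die( $pdf, PDFPages => "1-3" ) };
#   warn $@ unless defined $data;
```

The wrapper takes the object as an argument, so it can be exercised with any object that provides the same two methods.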

PDFDebug (set for method call duration only)

 $pdf->setVars(
            PDFDoc=>'C:\docs\pdf', 
            PDFPages=>"2 6-8 ",
            PDFDebug=>1);

This is really a directive and not a true variable. It is used to debug the setting of variables in a PDF::Extract method call. PDFDebug as used above will print:-

 These variables are to be set
        PDFDoc="C:\docs\pdf/"
        PDFPages="2 6-8 "
        PDFDebug="1"
 These variables have been set
        PDFCache="C:/myCache"
        PDFFilename="2_6..8_.pdf"
        PDFPagesFound=""
        PDFDoc=""
        PDFPages="2, 6, 7, 8"
        PDFPageCount=""
        PDFExtract=""
        PDFError="PDF document "" not found at C:/Perl/site/lib/PDF/Extract.pm line 467"

NOTES

This version of PDF::Extract has been designed to produce output to the PDF Standard as defined in the PDF Reference Seventh Edition.

However, some third party PDF applications require a non standard feature of PDF documents, namely the sequential numbering of objects starting at zero.

PDF::Extract treats a PDF file as a flat file, for speed of processing, and consequently knows nothing of PDF objects. Objects extracted remain exactly as they were in the original document. These objects are not renumbered. There will be gaps in the object number sequence. This is allowed in the specification. Only the catalog and page tree objects are altered.
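To see whether an extracted document has gaps in its object numbers, a flat scan in the same spirit can be used. This is a rough sketch: object_number_gaps is a hypothetical helper, and a plain regular expression can be fooled by binary stream data that happens to contain text resembling an object header.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the object numbers declared in raw PDF
# data ("12 0 obj") and report numbers missing from the sequence.
sub object_number_gaps {
    my ($data) = @_;
    my %seen;
    $seen{$1} = 1 while $data =~ /(?<!\d)(\d+)\s+\d+\s+obj\b/g;
    my @numbers = sort { $a <=> $b } keys %seen;
    return () unless @numbers;
    return grep { !$seen{$_} } $numbers[0] .. $numbers[-1];
}

my $sample = "1 0 obj\n<<>>\nendobj\n2 0 obj\n<<>>\nendobj\n5 0 obj\n<<>>\nendobj\n";
my @gaps = object_number_gaps($sample);
print "Gaps: @gaps\n";    # Gaps: 3 4
```

A non-empty result is not an error in itself, since gaps are permitted; it only indicates that an application insisting on sequential numbering may reject the file.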

See the web site if you need information on how to make PDF documents comply with what your third party PDF application expects.

BUGS

Jon Schaeffer reported a bug in which some font resources are not found in the extracted PDF. The source of the bug has not yet been found. If you find such a bug, please email a one page original PDF that produces an extract showing the problem.

Please report any bugs you find.

AUTHOR

Noel Sharrock <mailto:nsharrok@lgmedia.com.au>

PDF::Extract's home page http://www.lgmedia.com.au/page.aspx?ID=8

Forum for users and developers has been hacked and database no longer exists. There are some sad folk around.

SUPPORT

Much thanks to:-

 Lyman Byrd for his welcome programming suggestions and editorial comments on the POD.
 Michael Cox for his suggestion of PDFSaveAs and for the time he spent in testing the module.
 Alberto Accomazzi for sharing his time and his knowledge of Unixish PDF voodoo magick.
 Stefano Capuzzimato for correcting some stuff in the regexes he found.
 Geert Theys for finding a small bug and supplying an excellent solution.
 Jon Schaeffer for help with finding a solution to a bug in extracting Adobe 6+ pages.
 Dario Santini for reporting a bug at http://rt.cpan.org//Ticket/Display.html?id=33707
 Patrick Bourdon for suggesting several fixes for undefined string concatenation warnings.

COPYRIGHT

Copyright (c) 2005 by Noel Sharrock. All rights reserved.

LICENSE

This package is free software; you can redistribute it and/or modify it under the same terms as Perl itself, i.e., under the terms of the ``Artistic License'' or the ``GNU General Public License''.

The C library at the core of this Perl module can additionally be redistributed and/or modified under the terms of the ``GNU Library General Public License''.

DISCLAIMER

This package is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the ``GNU General Public License'' for more details.
