- IMPORTANT NOTE on INSTALLATION
- pg bulkCopy.pl
- Module Pg::BulkCopy
- Using Pg::BulkCopy
- Methods for Pg::BulkCopy
- Troubleshooting and Issues:
- LICENSE AND COPYRIGHT
Pg::BulkCopy - Deprecated use https://github.com/ossc-db/pg_bulkload/
Back in 2010 when I wrote Pg::BulkCopy I couldn't find a functional bulk load utility for Postgres. At this time I think that pg_bulkload is a much better utility for this.
I've written a short article on installing it at my personal blog techinfo.brainbuz.org/?p=429, which will help you get started.
If you are installing this through cpan please be aware that you will not be running any tests. The tests are in t2 and not run by any of the CPAN clients. To run the tests you need to download the test data http://www.cpan.org/authors/id/B/BR/BRAINBUZ/Pg-BulkCopyTest-0.16.tar.gz. Pg::BulkCopy requires postgres to be installed on the same machine although it uses dbi it requires a shared directory that postgres and it can both write to. If you do not want to setup and run the tests it is recommended that you at least perform a validation trial
The utility script pg_BulkCopy.pl was written to provide postgreSQL with a convient bulk loading utility. The script is implemented as a wrapper and a module (pg_BulkCopy.pl) so that other programmers may easily incorporate the two useful methods LOAD and DUMP directly into other perl scripts.
The DUMP Method invokes postgres' COPY TO command, and does nothing useful in addition except copying the dump from the temp directory (because postgres may not have permission on where you want the file). You can choose Tab Delimited Text or CSV with whatever delimiter you want and a Null string of your choice.
The LOAD Method is much more interesting, it breaks the load file into chunks of 10000 (configurable) records in the temp directory and tries to COPY FROM, if it fails, it parses the error message for the line number, then it removes the failed line to a rejects file and tries again. As with DUMP you can select the options supported by the postgres COPY command, you can also set a limit on bad records (default is 10).
Command Line Arguments to script:
file|filename|f table|t load|l dump|d iscsv|csv dbistring|dbistr|ds dbiuser|dbiusr|du dbipass|dp workingdir|working|w tmpdir|tmp batchsize|batch|b errorlog|error|e maxerrors|errors|max debug|dbg trunc|truncate|tr help|h|? read|r [to read additional variables out of a file]
Format of a Parameter file (specified with --read):
[options] filename : blob1.tsv load : 1 dump : iscsv : dbistring : DBI:Pg:dbname=pg_bulkcopy_test;host=127.0.0.1 dbiuser : postgres dbipass : postgres table : testing workingdir : /psql_scr/testing/tdata/ tmpdir : /psql_scr/testing/tmp/ batchsize : errorlog : maxerrors : 50 debug : 2 trunc : 1
Example command line
pg_bulkcopy.pl --filename more1.tsv --iscsv 0 --dbistring "DBI:Pg:dbname=pg_bulkcopy_test;host=127.0.0.1" --dbiuser postgres --dbipass postgres --table testing --workingdir /tempdata --tmpdir /mytempfiles --debug 1
Command Line to load all values from bulkcopy.conf
pg_bulkcopy.pl --read "bulkcopy.conf"
Command Line to load values from bulkcopy.conf but provide or override some values from the command line. Values given on the command line take precedence over conflicting values read from file.
pg_bulkcopy.pl --read "bulkcopy.conf" --dbistring "DBI:Pg:dbname=pg_bulkcopy_test;host=127.0.0.1" --dbiuser postgres --dbipass postgres
These are the parameters needed to establish a dbi connection.
filename is a tab or comma seperated values text file containing the data to be imported or exported. table indicates the table in the connected database to be used for the operation.
Boolean values of 0 or 1. An iscsv value of 1 indicates the file is csv. The default iscsv value of 0 indicates tab seperated. The default operation is load (load = 1), setting dump to 1 will set load to 0 and cause the program to dump instead. trunc causes an explicit truncation (deletion) of all data in the table prior to the requested operation (not useful with dump).
workingdir is where the file, reject and log files will be written, unless the full path/filename is specified it is also expected to find/write the file for the operation here. errorlog is the name of a file to write information about problems to, this will default to <filename>.log.
tempdir where the temporary working files will be written to. tempdir defaults to /tmp. Do not overlook tempdir, the user executing the script and the uid that postgres is running under must have rw permissions here and the default creation mask must permit access to each other's newly created files!
batchsize controls the size of the chunks used for loading, the default is 10,000. With clean data a larger batch size will spead processing, with dirty data smaller batches will improve performance. Every time an error is encountered the offending record needs to be eliminated from the batch, which is currently done inneficiently by re-writing the file.
maxerrors tells the program to abort if too many errors are found. The default is 10.
debug can disable or increase the amount of error logging done. 0 disables error logging, normal is 1.
All methods used by pg_BulkCopy.pl are provided by Pg::BulkCopy. The method names follow the convention of explicitely defined methods in caps and methods created by Moose in lowercase.
This utility is specific to postgreSQL. It is a console application for the server itself. The postgres process must be able to access the data files through the local file system on the server. The utility is targeted towards recent versions of postgres running on unix-like operating systems, if you need to run it on Windows good luck and let me know if you succeed!
my $PGBCP = Pg::BulkCopy->new( dbistring => $dbistr, dbiuser => $dbiuser, dbipass => $dbipass, table => $table, filename => $filename, workingdir => "$tdata/", iscsv => 1, maxerrors => 10, errorlog => 'myload.log', );
The above example shows the creation of a new BulkCopy object. It requires the dbi information and the name of a table. workingdir will default to /tmp, and filename is required. The default behaviour is Tab Seperated so iscsv is only required for a csv import, icsv => 0 will explicitely request tsv. errorlog is defaulted to $workingdir/pg_BulkCopy.ERR and can be safely omitted. To disable all logging set debug => 0, to log everything set it to 2. maxerrers defaults to 10, setting 0 changes it to an arbitrary large number. All properties have a getter $PGBCP->dbistring() will tell you what the string is, while this property is Read Only, most properties are also a setter to change the value without creating a new object.
The methods DUMP, TRUNC, and LOAD do most of the work. You can use errcode and errstr to findout about the results.
The tests in the distributions t2 folder can be referred to for additional examples.
Returns the dbi connection, initializing it if necessary.
If the Trunc option is specified, delete all records from table with the postgres TRUNCATE command, instead of carrying out a LOAD or DUMP operation.
The main subroutine for importing bulk data into postgres.
The main subroutine for exporting bulk data into postgres.
Write to the log file being used by Pg::BulkCopy. Takes a scalar value or an array, items in an array are written on seperate lines. Remember that if debug is 0 nothing will ever be logged.
Gets or sets the maximum errors in a job. Setting 0 actually sets an arbitrary large number instead. The default is 10
Gets or sets the batch size. If there are few errors in the source a large batch size is appropriate, if there are many errors a smaller batch will speed processing. The default is 10000.
Returns the last error in numeric or string form. Generally this is just passed back from dbi.
Toggle between Command Tab seperated input. The default is tab seperated, 0.
The workingdir is where Pg::BulkCopy will look for the data file and where it will write any reject or log files. The tempdir is a scratch directory which both the script user and the postgres user have read write access. The default of both workingdir and tempdir is /tmp. Finally a file name for input or output is needed.
Is a moose component, it is run "after new".
is used internally for outputting to stderr and the log file.
The most persistent problem in getting Pg::BulkCopy to work correctly is permissions. First one must deal with hba.conf. Then once you are able to connect as the script user to psql and through a dbi connection you must deal with the additional issue that you are probably not running the script as the account postgres runs under. The account executing the script must be able to read and execute the script directories, read and write the working directory and the temp directory. Finally the account running the Postgres server must be able to read and write in the temp directory (which is defaulted to /tmp).
To deal proactively with permissions issues I recommend the following steps. Check umask in /etc/profile, and change it to something like 002 (which gives owner and group read/write other read). Create a group containing the users of the script and the postgres user. On the directory where you are running the scripts, the temp directory and the one containing data use chmod to set the Special bit (chmod g+s). Make sure that the directory and any pre-existing files have the correct group set. Touch a file as a user in the group and confirm that the group is set to the group and not the user. Other options are to use the ACL feature to manage permissions or to try running the script as the postgres user.
There is currently an issue I haven't resolved with a quoted csv input test file. The next features I expect to work on involve supporting csv headers and field reordering, which will also make the feature available for tsv files.
CSV Headers are not supported yet, you'll need to chop them off yourself. Field reordering also isn't supported. These features will be reconsidered for later versions.
To properly test the module and script it is necessary to have an available configured database. So that the bundle can be installed silently through a cpan utility session very few tests are run during installation. Proper testing must be done manually. Due to the size of the test data it has been removed to a seperate archive, Pg-BulkCopyTest which must be downloaded seperately from cpan. Normally the contents would be restored to the tdata directory.
First make sure that the account you are using for testing has sufficient rights on the server. The sql directory contains a few useful scripts for creating a test database. On linux a command like this should be able to create the database:
psql postgres <
dbitest.pl adds a row to your new database and then deletes it, use dbitest to verify your dbi string and that it can access the database.
Edit the file t2/test.conf. You will need to provide the necessary dsn values for the dbi connection.
If necessary modify harness.sh from the distribution directory as appropriate and execute it to run the tests.
<brainbuz at brainbuz.org>
Please report any bugs or feature requests to
bug-pg-bulkcopy at rt.cpan.org, or through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Pg-BulkCopy. I will be notified, and then you'll automatically be notified of progress on your bug as I make changes.
You can find documentation for this module with the perldoc command.
You can also look for information at:
RT: CPAN's request tracker
AnnoCPAN: Annotated CPAN documentation
Copyright 2010,2012 John Karr.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; version 3 or at your option any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
A copy of the GNU General Public License is available in the source tree; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.