dbmerge - merge all inputs in sorted order based on the specified columns
dbmerge --input A.fsdb --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]
or cat A.fsdb | dbmerge --input - --input B.fsdb [-T TemporaryDirectory] [-nNrR] column [column...]
or dbmerge [-T TemporaryDirectory] [-nNrR] column [column...] --inputs A.fsdb [B.fsdb ...]
Merge all provided, pre-sorted input files, producing one sorted result. Inputs can all be specified with --input, or one can come from standard input and the others from --input. With --xargs, each line of standard input is a filename for input.
Inputs must have identical schemas (columns, column order, and field separators).
Unlike dbmerge2, dbmerge supports an arbitrary number of input files.
Because this program is intended to merge multiple sources, it does not default to reading from standard input. If you wish to read from standard input, list - as an explicit input source.
Also, because we deal with multiple input files, this module doesn't output anything until it's run.
dbmerge consumes a fixed amount of memory regardless of input size. It therefore buffers output on disk as necessary. (Merging is implemented as a series of two-way merges, so disk space is O(number of records).)
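As a rough illustration of why a two-way merge needs only fixed memory, here is a minimal Python sketch. It is not dbmerge's Perl implementation; the function name merge2 and its key parameter are hypothetical, chosen only for this example. It holds one pending record from each input at a time.

```python
from typing import Callable, Iterable, Iterator, TypeVar

T = TypeVar("T")


def merge2(left: Iterable[T], right: Iterable[T],
           key: Callable[[T], object]) -> Iterator[T]:
    """Lazily merge two pre-sorted iterables (hypothetical helper).

    Only one pending record per input is held at any time, so memory use
    does not grow with the size of the inputs.
    """
    sentinel = object()  # marks an exhausted input
    it_left, it_right = iter(left), iter(right)
    a, b = next(it_left, sentinel), next(it_right, sentinel)
    while a is not sentinel and b is not sentinel:
        # Take from the left input on ties so the merge is stable.
        if key(b) < key(a):
            yield b
            b = next(it_right, sentinel)
        else:
            yield a
            a = next(it_left, sentinel)
    # At most one of the inputs still has records; copy them through.
    while a is not sentinel:
        yield a
        a = next(it_left, sentinel)
    while b is not sentinel:
        yield b
        b = next(it_right, sentinel)
```

Because the function is a generator, a caller can stream its output to a file or to another merge without collecting it in memory first.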
dbmerge will merge data in parallel, if possible. The --parallelism option can control the degree of parallelism, if desired.
- --xargs
Expect that input filenames are given, one-per-line, on standard input. (In this case, merging can start incrementally.)
Delete the source files after they have been consumed. (Defaults off, leaving the inputs in place.)
- -T TmpDir
where to put tmp files. Also uses environment variable TMPDIR, if -T is not specified. Default is /tmp.
- --parallelism N or -j N
Allow up to N merges to happen in parallel. Default is the number of CPUs in the machine.
- --endgame (or --noendgame)
Enable endgame mode, extra parallelism when finishing up. (On by default.)
Sort specification options (can be interspersed with column names):
- -r or --descending
sort in reverse order (high to low)
- -R or --ascending
sort in normal order (low to high)
- -n or --numeric
- -N or --lexical
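One way to picture how interspersed sort options could combine with column names is the Python sketch below. It rests on two assumptions that this page does not state: that each option applies to the columns that follow it, and that the default is lexical, ascending order. The names parse_sort_spec and make_comparator are hypothetical and are not part of Fsdb.

```python
from functools import cmp_to_key


def parse_sort_spec(args):
    """Turn e.g. ['-n', 'cid', '-r', 'cname'] into (column, numeric, descending) triples.

    Assumption: an option affects every column named after it.
    """
    numeric, descending = False, False  # assumed defaults: lexical, ascending
    specs = []
    for arg in args:
        if arg in ("-n", "--numeric"):
            numeric = True
        elif arg in ("-N", "--lexical"):
            numeric = False
        elif arg in ("-r", "--descending"):
            descending = True
        elif arg in ("-R", "--ascending"):
            descending = False
        else:
            specs.append((arg, numeric, descending))
    return specs


def make_comparator(specs):
    """Build a cmp-style function over rows represented as dicts of strings."""
    def compare(row_a, row_b):
        for column, numeric, descending in specs:
            a, b = row_a[column], row_b[column]
            if numeric:
                a, b = float(a), float(b)
            result = (a > b) - (a < b)
            if result != 0:
                return -result if descending else result
        return 0  # rows are equal on every sort column
    return compare


rows = [{"cid": "10"}, {"cid": "9"}]
numeric_order = sorted(rows, key=cmp_to_key(make_comparator(parse_sort_spec(["-n", "cid"]))))
lexical_order = sorted(rows, key=cmp_to_key(make_comparator(parse_sort_spec(["cid"]))))
```

With numeric comparison 9 sorts before 10, while a lexical comparison puts the string "10" before "9".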
This module also supports the standard fsdb options:
Enable debugging output.
- -i or --input InputSource
Read from InputSource, typically a file name, or - for standard input, or (if in Perl) an IO::Handle, Fsdb::IO, or Fsdb::BoundedQueue object.
- -o or --output OutputDestination
Write to OutputDestination, typically a file name, or - for standard output, or (if in Perl) an IO::Handle, Fsdb::IO, or Fsdb::BoundedQueue object.
- --autorun or --noautorun
By default, programs process automatically, but Fsdb::Filter objects in Perl do not run until you invoke the run() method. The --(no)autorun option controls that behavior within Perl.
Show full manual.
File a.fsdb:
#fsdb cid cname
11	numanal
10	pascal
File b.fsdb:
#fsdb cid cname
12	os
13	statistics
These two files are both sorted by
cname, and they have identical schemas.
dbmerge --input a.fsdb --input b.fsdb cname
cat a.fsdb | dbmerge --input - --input b.fsdb cname
#fsdb cid cname
11	numanal
12	os
10	pascal
13	statistics
# | dbmerge --input a.fsdb --input b.fsdb cname
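The sample above can be mimicked in a few lines of Python, purely as an illustration of what merging on cname means. This sketch uses the standard library's heapq.merge rather than dbmerge itself, and the variable names are made up for the example.

```python
import heapq

# Rows from the two sample inputs as (cid, cname) pairs.
# Each list is already sorted by cname, as dbmerge requires.
a_rows = [("11", "numanal"), ("10", "pascal")]
b_rows = [("12", "os"), ("13", "statistics")]

# Merge on the second field (cname); cid plays no part in the ordering.
merged = list(heapq.merge(a_rows, b_rows, key=lambda row: row[1]))

for cid, cname in merged:
    print(f"{cid}\t{cname}")
```

The rows come out ordered by cname (numanal, os, pascal, statistics), which is why the cid column is not in numeric order in the output.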
$filter = new Fsdb::Filter::dbmerge(@arguments);
Create a new object, taking command-line arguments.
Internal: set up defaults.
Internal: parse command-line arguments.
Internal: pretty-print a filename or Fsdb::BoundedQueue.
$out = $self->segment_next_output($output_type)
Internal: return an Fsdb::IO::Writer as $OUT that points either to our output or to a temporary file, depending on how things are going.
The $OUTPUT_TYPE can be 'final' or 'ipc' or 'file'.
$out = $self->segment_cleanup($file);
Internal: Clean up a file, if necessary. (Sigh, used to be function pointers, but not clear how they would interact with threads.)
$id = $self->_unique_id()
Generate a sequence number for debugging.
$out = $self->segments_merge2_run($out_fn, $is_final_output, $in0, $in1);
Internal: do the actual merge2 work (maybe our parent put us in a thread, maybe not).
Internal: put $WORK on the queue at $DEPTH, updating the max count.
Merge queued files, if any.
Also release any queued threads.
Internal: read new filenames to process (from stdin) and send them to the work queue.
Making a separate Fred to handle xargs is a lot of work, but it guarantees it comes in on an IO::Handle that is selectable.
Internal: Merge queued files, if any. Iterates over all depths of the merge tree, and handles any forked threads.
Merging is done in a binary tree that is managed through the _work queue. It has an array of depth entries, one for each level of the tree.
Items are processed in order at each level of the tree, and only level-by-level, so the sort is stable.
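The level-by-level idea can be sketched in Python as follows. This is a simplified, in-memory stand-in: the real _work queue is filled incrementally and holds files or streams, whereas the hypothetical merge_tree function below takes all inputs up front as lists.

```python
import heapq


def merge_tree(inputs, key):
    """Merge sorted runs pairwise, one tree level at a time (illustrative only).

    Neighbours are always merged in their original order, and heapq.merge
    prefers the earlier iterable on ties, so equal keys keep input order.
    """
    level = [list(run) for run in inputs]
    while len(level) > 1:
        next_level = []
        # Pair up adjacent runs: (0, 1), (2, 3), ...
        for i in range(0, len(level) - 1, 2):
            next_level.append(list(heapq.merge(level[i], level[i + 1], key=key)))
        if len(level) % 2 == 1:
            # An unpaired last run moves up a level unchanged.
            next_level.append(level[-1])
        level = next_level
    return level[0] if level else []
```

Merging only adjacent runs, and only within a level, is what keeps records with equal keys in the order of the files they came from.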
Parallelism is also managed through the
_work queue, each element of which consists of one file or stream suitable for merging. The work queue contains both ready output (files or BoundedQueue streams) that can be immediately handled, and pairs of semaphore/pending output for work that is not yet started. All manipulation of the work queue happens in the main thread (with
We start a thread to handle each item in the work queue, and limit parallelism to the
_max_parallelism, defaulting to the number of available processors.
There are two kinds of parallelism, regular and endgame. For regular parallelism we pick two items off the work queue, merge them, and put the result back on the queue as a new file. Items in the work queue may not be ready. For in-progress items we wait until they are done. For not-yet-started items we start them, then wait until they are done.
Endgame parallelism handles the final stages of a large merge, when there are enough processors that we can start merge jobs for all remaining levels of the merge tree. At this point we switch from merging to files to merging into Fsdb::BoundedQueue pipelines that connect merge processes which start and run concurrently.
The final merge is done in the main thread so that the main thread can handle the output stream and record the merge action.
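A loose Python analogue of endgame mode is sketched below, assuming that Python threads and queue.Queue(maxsize=...) are acceptable stand-ins for dbmerge's concurrent merge processes and Fsdb::BoundedQueue. All names here (feed, drain, stage, endgame_merge, queue_size) are hypothetical. Every merge in the tree starts at once, intermediate results flow through bounded queues instead of files, and the last merge runs in the calling thread.

```python
import queue
import threading

_END = object()  # end-of-stream marker passed through each queue


def feed(records, out_q):
    """Push records into a bounded queue, then signal end of stream."""
    for record in records:
        out_q.put(record)  # blocks while the queue is full
    out_q.put(_END)


def drain(in_q):
    """Yield records from a queue until the end-of-stream marker arrives."""
    while True:
        item = in_q.get()
        if item is _END:
            return
        yield item


def merge2(a_iter, b_iter, key):
    """Stable two-way merge of two sorted iterators."""
    a, b = next(a_iter, _END), next(b_iter, _END)
    while a is not _END or b is not _END:
        take_a = b is _END or (a is not _END and not key(b) < key(a))
        if take_a:
            yield a
            a = next(a_iter, _END)
        else:
            yield b
            b = next(b_iter, _END)


def stage(in_a, in_b, out_q, key):
    """One intermediate merge job: two input queues, one output queue."""
    feed(merge2(drain(in_a), drain(in_b), key), out_q)


def endgame_merge(runs, key, queue_size=4):
    """Merge sorted runs through a pipeline of concurrently running stages."""
    threads = []

    def spawn(target, *args):
        thread = threading.Thread(target=target, args=args, daemon=True)
        thread.start()
        threads.append(thread)

    streams = []
    for run in runs:  # one feeder per input run
        q = queue.Queue(maxsize=queue_size)
        spawn(feed, run, q)
        streams.append(q)
    while len(streams) > 2:  # start every intermediate level up front
        next_streams = []
        for i in range(0, len(streams) - 1, 2):
            q = queue.Queue(maxsize=queue_size)
            spawn(stage, streams[i], streams[i + 1], q, key)
            next_streams.append(q)
        if len(streams) % 2 == 1:
            next_streams.append(streams[-1])
        streams = next_streams
    if not streams:
        return []
    if len(streams) == 1:
        result = list(drain(streams[0]))
    else:
        # The final merge runs in the calling ("main") thread.
        result = list(merge2(drain(streams[0]), drain(streams[1]), key))
    for thread in threads:
        thread.join()
    return result
```

The bounded queues give back-pressure: a fast stage blocks rather than buffering an unbounded amount of data ahead of a slow consumer.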
Internal: setup, parse headers.
Internal: run over each row.
Copyright (C) 1991-2018 by John Heidemann <email@example.com>
This program is distributed under the terms of the GNU General Public License, version 2. See the file COPYING with the distribution for details.