split_data_file.pl
A program to split a data file by rows based on common data values.
split_data_file.pl [--options] <filename>
File options:
  -i --in <filename>          (txt bed gff gtf vcf refFlat ucsc etc)
  -p --prefix <text>          output file prefix (input basename)

Splitting options:
  -x --index <column_index>   column with values to split upon
  -t --tag <text>             use VCF/GFF attribute
  -m --max <integer>          maximum number of items per output file

General options:
  -z --gz                     compress output file
  -v --version                print version and exit
  -h --help                   show extended documentation
The command line flags and descriptions:
--in <filename>
Specify the file name of a data file. It must be a tab-delimited text file. The file may be compressed with gzip.
--prefix <text>
Optionally provide a filename prefix for the output files. The default prefix is the input file's base name. To use only the values as file names, with no prefix, set the prefix to 'none'.
--index <column_index>
Provide the index number of the column or dataset containing the values used to split the file. If not specified, the index is requested from the user interactively.
--tag <text>
Provide the name of the attribute tag that contains the values used to split the file. Attributes are supported by GFF and VCF files. If splitting a VCF file, please also provide the column index. The INFO column is index 7, and sample columns begin at index 9 (counting the first column as 0).
--max <integer>
Optionally specify the maximum number of data lines to write to each file. The rows for each value are written to one or more files. Enter the limit as an integer; underscores may be used as a thousands separator, e.g. 100_000.
--gz
Indicate whether the output files should be compressed with gzip. The default behavior is to preserve the compression status of the input file.
--version
Print the version number.
--help
Display the POD documentation.
This program will split a data file into multiple files based on common values in the data table. All rows with the same value will be written into the same file. A good example is chromosome, where all data points for a given chromosome are written to a separate file, resulting in one file for each chromosome found in the original file. The column containing the values to split and group should be indicated; if the column is not specified, it may be selected interactively from a list of column headers.
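The grouping idea described above can be illustrated with a short Python sketch. This is not the program's Perl implementation; the function name split_rows_by_column and the sample rows are hypothetical, and the sketch assumes a 0-based column index and that lines beginning with '#' are metadata.

```python
from collections import defaultdict


def split_rows_by_column(lines, index):
    """Group tab-delimited data rows by the value in column `index` (0-based).

    Hypothetical helper for illustration only. Blank lines and lines
    starting with '#' (assumed to be metadata) are skipped.
    """
    groups = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        groups[fields[index]].append(line)
    return dict(groups)


# Made-up example rows: chromosome, start, stop
example = [
    "# example metadata line\n",
    "chr1\t100\t200\n",
    "chr2\t50\t80\n",
    "chr1\t300\t400\n",
]
groups = split_rows_by_column(example, 0)
```

Each key of groups corresponds to one output file, and its list holds the rows that would be written to that file.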
This program can also split files based on an attribute tag in GFF or VCF files. Attributes are often specially formatted delimited key value pairs associated with each feature in the file. Provide the name of the attribute tag to split the file. Since attributes may vary based on the feature type, an interactive list is not supplied from which to choose the attribute.
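As a rough illustration of what splitting on an attribute tag involves, the following Python sketch pulls one tag value out of the ninth column of a GFF3-style line. It assumes GFF3 syntax (semicolon-separated key=value pairs); GTF and VCF use different attribute layouts and are not handled here. The function name gff_attribute and the sample line are hypothetical, not part of the program.

```python
def gff_attribute(line, tag):
    """Return the value of `tag` from a GFF3 attributes column, or None.

    Hypothetical helper: assumes nine tab-delimited columns with
    attributes written as key=value pairs separated by semicolons.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 9:
        return None
    for pair in fields[8].split(";"):
        key, sep, value = pair.strip().partition("=")
        if sep and key == tag:
            return value
    return None


# Made-up example feature line
gff_line = "chr1\tsrc\tgene\t100\t200\t.\t+\t.\tID=gene1;biotype=protein_coding\n"
biotype = gff_attribute(gff_line, "biotype")
missing = gff_attribute(gff_line, "Parent")
```

A feature lacking the requested tag yields None here; how the program itself treats such rows is not stated above.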
If the max argument is set, then each group will be written to one or more files, with each file having no more than the indicated maximum number of data lines. This is useful to keep the file size reasonable, especially when processing the files further and free memory is constrained. A reasonable limit may be 100K or 1M lines.
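The effect of a maximum line limit can be sketched in Python as follows. This is only an illustration of the behavior described above, with a hypothetical function name chunk_rows; the program itself may well write rows as it reads them rather than holding whole groups in memory.

```python
def chunk_rows(rows, max_lines):
    """Split one group's rows into consecutive parts of at most `max_lines` rows.

    Hypothetical helper for illustration only.
    """
    if max_lines < 1:
        raise ValueError("max_lines must be a positive integer")
    return [rows[i:i + max_lines] for i in range(0, len(rows), max_lines)]


# Five made-up rows with a limit of two rows per file
rows = ["row1", "row2", "row3", "row4", "row5"]
parts = chunk_rows(rows, 2)
```

With five rows and a limit of two, the group is spread over three parts, the last of which holds the single remaining row.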
The resulting files will be named using the basename of the input file, appended with the unique group value (for example, the chromosome name) demarcated with a #. If a maximum line limit is set, then the file part number is appended to the basename, padded with zeros to three digits (to assist in sorting). Each file will have duplicated and preserved metadata. The original file is preserved.
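A small Python sketch of this naming scheme is shown below. The '#' separator and the three-digit zero padding come from the description above; the underscore before the part number, the '.txt' extension, and the function name output_name are assumptions made for illustration, so the actual file names produced by the program may differ.

```python
def output_name(basename, value, part=None, ext=".txt"):
    """Build an output file name from the base name and group value.

    Hypothetical helper. The '_' before the part number and the
    default extension are assumptions, not documented behavior.
    """
    name = f"{basename}#{value}"
    if part is not None:
        # Zero-pad the part number to three digits so names sort in order
        name += f"_{part:03d}"
    return name + ext


single = output_name("peaks", "chr1")
second_part = output_name("peaks", "chr1", part=2)
```

Zero padding keeps a plain alphabetical sort in numeric order for up to 999 parts.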
This program is intended as the complement to 'join_data_files.pl'.
Timothy J. Parnell, PhD
Howard Hughes Medical Institute
Dept of Oncological Sciences
Huntsman Cancer Institute
University of Utah
Salt Lake City, UT, 84112
This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.