NAME

Bio::ToolBox::Data::file - File functions for the Bio::ToolBox::Data family

DESCRIPTION

File methods for reading and writing data files for both Bio::ToolBox::Data and Bio::ToolBox::Data::Stream objects. This module should not be used directly. See the respective modules for more information.

These methods provide file IO for the Bio::ToolBox::Data data structure. They work with any generic tab-delimited text file of rows and columns. They also properly handle comment lines, metadata, and column-specific metadata custom to Bio::ToolBox programs. Special file formats used in bioinformatics, for example GFF and BED files, are automatically recognized by their file extension, and appropriate metadata is added.

Files opened using these subroutines are stored in a specific complex data structure described below. This structure allows for data access and also records metadata about each column (dataset) and the file in general. This metadata helps preserve a "history" of the dataset: where it came from, how it was collected, and how it was processed.

Additional subroutines are also present for general processing and output of this data structure.

The data file format is described below, followed by a description of the data structure.

RECOGNIZED FILE FORMATS

Bio::ToolBox will recognize a number of standard bioinformatic file formats, almost all of which are recognized by their extension. Recognition is NOT guaranteed if an alternate file extension is used.

These formats include

BED

These include file extensions .bed, .bedgraph, and .bdg. Bed files must have 3-12 columns. BedGraph files must have 4 columns.

GFF

These include file extensions .gff, .gff3, and .gtf. The specific format may also be recognized by the gff-version pragma. These files must have 9 columns.

UCSC tables

These include file extensions .refFlat, .genePred, and .ucsc. In some cases, a simple .txt can also be recognized if the file matches the expected file structure. Different formats are typically recognized by the number of columns, and can include simple refFlat, gene prediction, extended gene prediction, and known Gene tables. The Bin column may or may not be present.

Peak files

These include file extensions .narrowPeak and .broadPeak. These are special "BED6+4" file formats.

CDT

These include the file extension .cdt. These are cluster data files used with Cluster 3.0 and Treeview.

SGR

A rare file format of chromosome, position, and score, with the file extension .sgr.

TEXT

Almost any tab-delimited text file with a .txt extension can be loaded.

Compressed files

File extensions .gz and .bz2 are recognized as compressed files. Compressed files are usually read through an external decompression program. All of the above formats can be loaded as compressed files.
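As an illustration of extension-based recognition, the following is a minimal sketch that maps the extensions listed above to a format family. The helper name guess_format and the lookup table are hypothetical and are not taken from the module, and case-insensitive matching is an assumption. The real recognition also considers pragmas, column counts, and file content.

```perl
use strict;
use warnings;

# Hypothetical lookup table built from the extensions listed above.
# This is an illustration, not the module's internal table.
my %FORMAT_FOR = (
    bed        => 'BED',  bedgraph   => 'BED',  bdg  => 'BED',
    gff        => 'GFF',  gff3       => 'GFF',  gtf  => 'GFF',
    refflat    => 'UCSC', genepred   => 'UCSC', ucsc => 'UCSC',
    narrowpeak => 'Peak', broadpeak  => 'Peak',
    cdt        => 'CDT',  sgr        => 'SGR',  txt  => 'TEXT',
);

# guess_format is a hypothetical helper name.
# Returns (format, compression), where compression is 'gz', 'bz2', or ''.
sub guess_format {
    my ($filename) = @_;

    # strip a trailing compression extension first, remembering which one
    my $compression = '';
    if ( $filename =~ s/\.(gz|bz2)$//i ) {
        $compression = lc $1;
    }

    # look up the remaining extension (assumption: case-insensitive)
    my ($ext) = $filename =~ /\.([^.\/]+)$/;
    my $format = defined $ext ? $FORMAT_FOR{ lc $ext } : undef;
    return ( $format // 'unrecognized', $compression );
}

for my $file (qw(peaks.narrowPeak genes.gtf.gz scores.bdg.bz2 notes.doc)) {
    my ( $format, $compression ) = guess_format($file);
    printf "%-20s %-12s %s\n", $file, $format, $compression || 'uncompressed';
}
```

A file with an unlisted extension, such as notes.doc here, falls through as unrecognized, which mirrors the warning above about alternate file extensions.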

DEFAULT BIO::TOOLBOX DATA TEXT FILE FORMAT

When not writing to a defined format, e.g. BED or GFF, a Bio::ToolBox Data structure is written as a simple tab-delimited text file, with the first line being the column header names. Such files are easily parsed by other programs.

If additional metadata is included in the Data object, then it is written as comment lines, each prefixed by a "# ", before the table. Metadata can describe the data within the table with regard to its type, source, methodology, history, and processing. The metadata is designed to be read by both human and computer. Opening files without this metadata will result in basic default metadata assigned to each column.
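The layout described here, comment lines first and then a header line and data rows, can be sketched as follows. The helper name format_table is hypothetical, and separating a metadata key from its value with a single space is an assumption made for illustration. The sample names and values are invented.

```perl
use strict;
use warnings;

# format_table is a hypothetical helper name, not part of the module.
# $metadata is an array ref of [key, value] pairs,
# $header is an array ref of column names,
# $rows is an array ref of array refs, one per table row.
sub format_table {
    my ( $metadata, $header, $rows ) = @_;
    my @lines;

    # metadata lines come first, each prefixed by "# "
    # (assumption: key and value are separated by one space)
    push @lines, "# $_->[0] $_->[1]" for @$metadata;

    # then the column header line, then one tab-delimited line per row
    push @lines, join( "\t", @$header );
    push @lines, join( "\t", @$_ ) for @$rows;

    return join( "\n", @lines ) . "\n";
}

print format_table(
    [ [ Program => 'my_script.pl' ], [ Feature => 'gene' ] ],
    [qw(Name Score)],
    [ [ 'YAL001C', 1.5 ], [ 'YAL002W', 0.25 ] ],
);
```

Passing an empty metadata list produces only the header and rows, which corresponds to a plain table without metadata.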

Some common metadata lines that are specifically recognized are listed below.

Feature

The Feature describes the types of features represented on each row in the data table. These can include gene, transcript, genome, etc.

Database

The name of the database used in generation of the feature table. This is often also the database used in collecting the data, unless the dataset metadata specifies otherwise.

Program

The name of the program generating the data table and file. It usually includes the whole path of the executable.

Column

The remaining header lines contain column-specific metadata. Each column has a separate header line, which begins with the word 'Column', followed by an underscore and the column number (0-based). Following this is a series of 'key=value' pairs separated by ';'. Spaces are generally not allowed. The characters '=' and ';' are not allowed within keys or values, since they would interfere with parsing. The metadata describes how and where the data was collected. Additionally, any modifications performed on the data are also recorded here.

A list of common column metadata keys is shown.

name

The name of the column. This should be identical to the table header.

database

Included if different from the main database indicated above.

window

The size of the window for genome datasets.

step

The step size of the window for genome datasets.

dataset

The name of the dataset(s) from which data is collected. Comma delimited.

start

The starting point for the feature in collecting values.

stop

The stopping point of the feature in collecting values.

extend

The extension of the region in collecting values.

strand

The strandedness of the data collected. Values include 'sense', 'antisense', or 'none'.

method

The method of collecting values.

log2

A boolean indicating whether the values are in log2 space.
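To make the column metadata line format concrete, here is a minimal parsing sketch. The helper name parse_column_line is hypothetical, the example line is invented, and the assumption that the 'Column_N' label is separated from the key=value pairs by whitespace is not confirmed by the text above. The module's own parser may behave differently.

```perl
use strict;
use warnings;

# parse_column_line is a hypothetical helper name, not part of the module.
# Returns the 0-based column index and a hash reference of key => value,
# or an empty list if the line is not a column metadata line.
sub parse_column_line {
    my ($line) = @_;
    $line =~ s/[\r\n]+$//;    # remove line endings

    # assumption: "# Column_N", whitespace, then the key=value pairs
    return unless $line =~ /^#\s*Column_(\d+)\s+(.+)$/;
    my ( $index, $pairs ) = ( $1, $2 );

    my %meta;
    for my $pair ( split /;/, $pairs ) {
        my ( $key, $value ) = split /=/, $pair, 2;
        next unless defined $key && length $key && defined $value;
        $meta{$key} = $value;
    }
    return ( $index, \%meta );
}

# an invented example line using keys from the list above
my $example =
    "# Column_4 name=H3K4me3_mean;dataset=H3K4me3.bw;method=mean;strand=none;log2=0\n";
my ( $index, $meta ) = parse_column_line($example);
print "column $index\n";
print "  $_ = $meta->{$_}\n" for sort keys %$meta;
```

Because pairs are split on ';' and then on the first '=', a stray ';' or '=' inside a value would corrupt the result, which is why those characters are disallowed.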

USER METHODS REFERENCE

These methods are generally available to Bio::ToolBox::Data objects and can be used by the user.

load_file

This will load a file into a new, empty Data table. This function is called automatically when a filename is provided to the new() function. The existence of the file is first checked (appending common missing extensions as necessary), metadata and column headers are processed and/or generated from default settings, the content is loaded into the table, and the structure is verified. Error messages may be printed if the structure or format is inconsistent or doesn't match the expected format, e.g. a file with a .bed extension doesn't match the UCSC specification. Pass the name of the file.

taste_file

Tastes, or checks, a file for a certain flavor, or known gene file formats. Useful for determining if the file represents a known gene table format that lacks a defined file extension, e.g. UCSC formats. This can be based on the file extension, metadata headers, and/or file contents from the first 10 lines. Returns two strings: the first is a generic flavor, and the second is a more specific format, if applicable. Generic flavor values will be one of `gff`, `bed`, `ucsc`, or `undefined`. These correlate to specific Parser adapters. Specific formats could be any number of possibilities, for example `undefined`, `gtf`, `gff3`, `narrowPeak`, `genePred`, etc.

sample_gff_type_list

Checks the different types of features available in a GFF formatted file. It will temporarily open the file, read the first 1000 lines or so, and compile a list of the values in the 3rd column of the GFF file. It will return a comma-delimited string of these values upon success, suitable for regular expression checking. Pass the name of the GFF file to check.
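The sampling approach described here can be sketched in plain Perl as follows. The helper name sample_types is hypothetical, and it reads from an already opened filehandle rather than a file name so that the sketch can run on in-memory data. The GFF lines are invented sample data.

```perl
use strict;
use warnings;

# sample_types is a hypothetical helper name, not part of the module.
# Reads up to $limit lines from $fh and returns a comma-delimited string
# of the unique values found in the 3rd column, in order of appearance.
sub sample_types {
    my ( $fh, $limit ) = @_;
    $limit ||= 1000;
    my ( %seen, @types );
    my $count = 0;
    while ( my $line = <$fh> ) {
        last if ++$count > $limit;
        next if $line =~ /^(?:#|\s*$)/;    # skip pragmas, comments, blank lines
        chomp $line;
        my @fields = split /\t/, $line;
        next unless @fields >= 3 && length $fields[2];
        push @types, $fields[2] unless $seen{ $fields[2] }++;
    }
    return join ',', @types;
}

# invented sample data, read through an in-memory filehandle
my $gff = join '',
    "##gff-version 3\n",
    "chr1\tsrc\tgene\t100\t900\t.\t+\t.\tID=g1\n",
    "chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=t1;Parent=g1\n",
    "chr1\tsrc\texon\t100\t300\t.\t+\t.\tParent=t1\n",
    "chr1\tsrc\texon\t500\t900\t.\t+\t.\tParent=t1\n";
open my $fh, '<', \$gff or die "cannot open in-memory file: $!";
my $types = sample_types($fh);
close $fh;

print "$types\n";
print "file contains mRNA features\n" if $types =~ /\bmRNA\b/;
```

The final line shows the kind of regular expression check that a comma-delimited string makes convenient.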

add_file_metadata

Add or update the file metadata to a Bio::ToolBox::Data object. This will automatically parse the path, basename, and recognized file extension. Pass the file name.
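Splitting a file name into path, basename, and extension can be sketched with the core File::Basename module. The helper name split_filename and the shortened suffix list are hypothetical, and this is not necessarily how the module performs the parsing.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# A hypothetical, shortened list of recognized suffixes. Each pattern also
# accepts an optional trailing compression extension.
my @SUFFIXES = map { qr/\.\Q$_\E(?:\.gz|\.bz2)?/i }
    qw(bed bedgraph bdg gff gff3 gtf narrowPeak broadPeak txt);

# split_filename is a hypothetical helper name, not part of the module.
# Returns (path, basename, extension).
sub split_filename {
    my ($filename) = @_;
    my ( $basename, $path, $extension ) = fileparse( $filename, @SUFFIXES );
    return ( $path, $basename, $extension );
}

my ( $path, $basename, $extension ) =
    split_filename('results/peaks.narrowPeak.gz');
print "path:      $path\n";
print "basename:  $basename\n";
print "extension: $extension\n";
```

An extension that is not in the suffix list stays attached to the basename and the returned extension is empty, so only recognized extensions are split off.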

write_file
save

This method will write out a Bio::ToolBox::Data structure to file. Zero or more values may be passed to the method.

Pass no values, and the filename stored in the metadata will be used in writing the file, effectively overwriting the original file. If no filename is stored in the metadata, an error is generated.

Pass a single value representing the filename to write. The current working directory is assumed if no path is provided in the filename.

Pass an array of key => values for fine control of the write process. Keys include the following:

  filename => A scalar value containing the name of the file to 
              write. This value is required for new data files and 
              optional for overwriting existing files (the filename 
              stored in the metadata is used). Appropriate extensions 
              are added (e.g. .txt, .gz, etc.) as necessary.
  format   => A string to indicate the file format to be written.
              Acceptable values include 'text' and 'simple'.
              Text files are text in nature, include all metadata, and
              usually have '.txt' extensions. Simple files are
              tab-delimited text files without metadata, useful for
              exporting data. If the format is not specified, the
              extension of the passed filename will be used as a
              guide. The default behavior is to write standard text
              files.
  gz       => A value (2, 1, or 0) indicating whether the file 
              should be written through a gzip filter to compress. If 
              this value is undefined, then the file name is checked 
              for the presence of the '.gz' extension and the value 
              set appropriately. Default is false (no compression).
              Set to 1 to use ordinary gzip, or set to 2 to use block 
              gzip (bgzip) compression for tabix compatibility.
  simple   => A boolean value (1 or 0) indicating whether a simple 
              tab-delimited text data file should be written. This is 
              an old alias for setting 'format' to 'simple'.

The method will return the real name of the file written if the write was successful. The filename may be modified slightly as necessary, for example by appending or changing the file extension to match the specified file format.
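The interplay between the filename, format, gz, and simple options can be sketched as below. The helper name resolve_write_options is hypothetical and only mirrors the rules stated above; it does not use the file extension as a guide for the format, and it is not the module's implementation.

```perl
use strict;
use warnings;

# resolve_write_options is a hypothetical helper name, not part of the module.
# Takes key => value options and returns them with defaults filled in.
sub resolve_write_options {
    my %args     = @_;
    my $filename = $args{filename};
    die "no filename provided\n"
        unless defined $filename && length $filename;

    # 'simple' is an old alias for format => 'simple'; default is 'text'
    my $format = $args{format} // ( $args{simple} ? 'simple' : 'text' );

    # an undefined gz value falls back to checking for a .gz extension
    my $gz = $args{gz} // ( $filename =~ /\.gz$/i ? 1 : 0 );

    # add the .gz extension when compression is requested but it is missing
    $filename .= '.gz' if $gz and $filename !~ /\.gz$/i;

    return ( filename => $filename, format => $format, gz => $gz );
}

my %opts = resolve_write_options( filename => 'results.txt', gz => 2 );
print "$opts{filename} format=$opts{format} gz=$opts{gz}\n";
```

In this sketch the returned filename plays the role of the real file name that the method reports back, which may differ from the name that was passed.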

open_to_read_fh

This subroutine will open a file for reading. If the passed filename has a .gz extension, it will appropriately open the file through a gunzip filter.

Pass the subroutine the filename. It will return a scalar reference to the open filehandle. The filehandle is an IO::Handle object and may be manipulated as such.

Example

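        use Bio::ToolBox::Data::file;

        my $filename = 'my_data.txt.gz';
        # a .gz extension is opened through a gunzip filter
        my $fh = Bio::ToolBox::Data::file->open_to_read_fh($filename)
                or die "unable to open $filename for reading\n";
        while (my $line = $fh->getline) {
                # do something with each line
        }
        $fh->close;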
        
open_to_write_fh

This subroutine will open a file for writing. If the passed filename has a .gz extension, it will appropriately open the file through a gzip filter.

Pass the subroutine three values: the filename, a boolean value indicating whether the file should be compressed with gzip, and a boolean value indicating that the file should be appended. The gzip and append values are optional. The compression status may be determined automatically by the presence or absence of the passed filename extension; the default is no compression. The default is also to write a new file and not to append.

If gzip compression is requested, but the filename does not have a .gz extension, it will be automatically added. However, the change in file name is not passed back to the originating program; beware!

The subroutine will return a scalar reference to the open filehandle. The filehandle is an IO::Handle object and may be manipulated as such.

Example

        use Bio::ToolBox::Data::file;

        my $filename = 'my_data.txt.gz';
        my $gz = 1; # compress output file with gzip
        my $fh = Bio::ToolBox::Data::file->open_to_write_fh($filename, $gz)
                or die "unable to open $filename for writing\n";
        # write to new compressed file
        $fh->print("something interesting\n");
        $fh->close;

OTHER METHODS

These methods are used internally by Bio::ToolBox::Core and other objects, and are not recommended for use by general users.

parse_headers

This will determine the file format, parse any metadata lines that may be present, add metadata and inferred column names for known file formats, and determine the table column header names. This is automatically called by "load_file", and generally need not be called directly.

Pass a true boolean option if there were no headers in the file.

add_data_line

Parses a text line from the file into a Data table row. Pass the text line.

check_file

This subroutine confirms the existence of a passed filename. If the file is not immediately found, it will attempt to append common file extensions and check again. This allows the user to pass only the base file name and not worry about missing the extension. This may be useful in shell scripts. Pass the file name.
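The lookup strategy can be sketched as follows. The helper name find_file and the list of extensions to try are hypothetical, and the module's actual list and search order may differ. The sketch creates a temporary file so that it can run anywhere.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical list of extensions to try; the module's real list may differ.
my @TRY = qw(.txt .txt.gz .bed .bed.gz .gff .gff3 .gtf);

# find_file is a hypothetical helper name, not part of the module.
# Returns the first existing file name, or undef if none is found.
sub find_file {
    my ($filename) = @_;
    return $filename if -e $filename;
    for my $ext (@TRY) {
        my $candidate = $filename . $ext;
        return $candidate if -e $candidate;
    }
    return;
}

# create a small file in a temporary directory to search for
my $dir  = tempdir( CLEANUP => 1 );
my $real = "$dir/sample.bed";
open my $out, '>', $real or die "cannot write $real: $!";
print {$out} "chr1\t100\t200\n";
close $out;

# only the base name is passed; the extension is discovered
my $found = find_file("$dir/sample");
print defined $found ? "found $found\n" : "not found\n";
```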

add_column_metadata

Parse a column metadata line from a file into a Data structure.

add_gff_metadata

Add default column metadata for a GFF file. Specify which GFF version. A second boolean value can be passed to force the method.

add_bed_metadata

Add default column metadata for a BED file. Specify the number of BED columns. Pass a second boolean to force the method.

add_peak_metadata

Add default column metadata for a narrowPeak or broadPeak file. Specify the number of columns. Pass a second boolean to force the method.

add_ucsc_metadata

Add default column metadata for a UCSC refFlat or genePred file. Specify the number of columns to define the format. Pass a second boolean to force the method.

add_sgr_metadata

Add default column metadata for an SGR file. Pass a boolean to force the method.

add_standard_metadata

Add default column metadata for a generic file. Pass the text line containing the tab-delimited column headers.

standard_column_names

Returns an anonymous array of standard file format column header names. Pass a value representing the file format. Values include gff, bed12, bed6, bdg, narrowpeak, broadpeak, sgr, ucsc16, ucsc15, genepredext, ucsc12, knowngene, ucsc11, genepred, ucsc10, refflat.

SEE ALSO

Bio::ToolBox::Data

AUTHOR

 Timothy J. Parnell, PhD
 Howard Hughes Medical Institute
 Dept of Oncological Sciences
 Huntsman Cancer Institute
 University of Utah
 Salt Lake City, UT, 84112

This package is free software; you can redistribute it and/or modify it under the terms of the Artistic License 2.0.